Reduce and scan operations are core building blocks in the world of parallel computing.
This article takes a brief look at the Nabla implementation for reduce and scan on the GPU in Vulkan.
Then, I discuss a missing execution dependency expected for a subgroup shuffle operation, which was only a problem on Nvidia devices in some test cases.
<!-- truncate -->
In addition, Nabla also supports passing vectors into these subgroup operations, so you can perform reduces or scans on up to subgroup size * 4 (for `vec4`) elements per call.
Note that it expects the elements in the vectors to be consecutive and in the same order as the input array.
This is because we've found through benchmarking that instructing the GPU to do a vector load/store results in faster performance than any attempt at a coalesced load/store with striding.
We also found shuffles and vector arithmetic to be very expensive, so minimizing data exchange between invocations and pre-scanning up to 4 elements within an invocation was significantly faster.
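
To make that concrete, here's a minimal CUDA-flavoured sketch of the same idea (the actual Nabla implementation is HLSL and linked below; the function name and `float4` layout here are purely illustrative): each invocation pre-scans its 4 consecutive elements locally, and only the per-invocation totals are exchanged through subgroup shuffles.

```cpp
// Illustrative sketch, not Nabla's code: inclusive scan of 4 consecutive
// elements per invocation, then a warp-level scan of the per-invocation totals.
__device__ float4 inclusive_scan4(float4 v)
{
    // Pre-scan the 4 elements this invocation owns (consecutive in memory).
    v.y += v.x;
    v.z += v.y;
    v.w += v.z;

    // Inclusive warp scan of each invocation's total (v.w) via shuffles.
    const unsigned lane = threadIdx.x & 31u;
    float sum = v.w;
    for (unsigned delta = 1; delta < 32; delta <<= 1) {
        const float up = __shfl_up_sync(0xffffffffu, sum, delta);
        if (lane >= delta) sum += up;
    }

    // Turn the inclusive result into an exclusive offset and apply it locally.
    const float offset = sum - v.w;
    v.x += offset; v.y += offset; v.z += offset; v.w += offset;
    return v;
}
```
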
You can find all the implementations on the [Nabla repository](https://github.com/Devsh-Graphics-Programming/Nabla/blob/v0.6.2-alpha1/include/nbl/builtin/hlsl/subgroup2/arithmetic_portability_impl.hlsl)
In CUDA, this is exposed through `__syncwarp()`, and we can do something similar in Vulkan using subgroup control barriers.
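
For illustration, here's a small CUDA sketch (my own example, assuming one 32-thread warp per block) of the classic situation `__syncwarp()` exists for: after a divergent branch, the barrier provides the execution dependency between the lanes' writes and the subsequent reads, which is the same kind of dependency we want around a subgroup shuffle in Vulkan.

```cpp
// Illustrative only: warp-synchronous exchange through shared memory.
// Assumes blockDim.x == 32 so lane indices map 1:1 onto the buffer.
__global__ void warp_exchange(float* data)
{
    __shared__ float buf[32];
    const unsigned lane = threadIdx.x & 31u;
    float v = data[threadIdx.x];

    if (lane < 16) {
        v *= 2.0f;                   // divergent work on half the warp
    }

    buf[lane] = v;
    __syncwarp();                    // execution dependency: all writes are done
    const float partner = buf[lane ^ 1];
    data[threadIdx.x] = v + partner;
}
```
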
The IPC also enables starvation-free algorithms on CUDA, such as mutexes where a thread that attempts to acquire a mutex is guaranteed to eventually succeed. Consider the doubly linked list example from the Volta whitepaper:
```cpp
// From the Volta whitepaper: insert node b between a and its successor,
// holding the locks of both neighbours while the pointers are rewired.
__device__ void insert_after(Node* a, Node* b)
{
    Node* c;
    lock(a);
    lock(a->next);
    c = a->next;

    a->next = b;   // link a -> b
    b->prev = a;

    b->next = c;   // link b -> c
    c->prev = b;

    unlock(c);
    unlock(a);
}
```
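
The whitepaper leaves `lock()` and `unlock()` undefined; one plausible way to write them (my assumption, not taken from the whitepaper) is a per-node spinlock built on `atomicCAS`. The spin loop only avoids deadlocking lanes of the same warp because independent scheduling lets the current lock holder keep making progress.

```cpp
// Hypothetical definitions for the lock()/unlock() used above.
struct Node
{
    Node* prev;
    Node* next;
    int   mutex;   // 0 = unlocked, 1 = locked (field added for this sketch)
};

__device__ void lock(Node* n)
{
    while (atomicCAS(&n->mutex, 0, 1) != 0) {
        // spin until the previous holder releases the node
    }
    __threadfence();   // acquire: order protected accesses after taking the lock
}

__device__ void unlock(Node* n)
{
    __threadfence();   // release: make protected writes visible first
    atomicExch(&n->mutex, 0);
}
```
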
The diagram below shows how, with IPC, even if thread K holds the lock for node A, another thread J in the same subgroup (warp in the case of CUDA) can wait for the lock to become available without affecting K's progress.
<figure class="image">

<figcaption>Locks are acquired for nodes A and C, shown on the left, before the thread inserts node B, shown on the right. Taken from [NVIDIA TESLA V100 GPU ARCHITECTURE](https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf)</figcaption>
</figure>
In our case, however, it's entirely possible that each subgroup shuffle operation does not run in lockstep once branching is introduced, which is why subgroup execution barriers are our solution to the problem for now.
Unfortunately, even after hours of scouring the SPIR-V specification, I couldn't find any explicit mention confirming whether subgroup shuffle operations actually imply an execution dependency.