Commit 2fa6f6a

add another use case of ipc arch

1 parent d18d705

2 files changed (+34, -3 lines)

2 files changed

+34
-3
lines changed

blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/index.md

Lines changed: 34 additions & 3 deletions
@@ -14,7 +14,7 @@ Reduce and scan operations are core building blocks in the world of parallel com
 
 This article takes a brief look at the Nabla implementation for reduce and scan on the GPU in Vulkan.
 
-Then, I discuss a missing excution dependency expected for a subgroup shuffle operation, which was only a problem on Nvidia devices in some test cases.
+Then, I discuss a missing execution dependency expected for a subgroup shuffle operation, which was only a problem on Nvidia devices in some test cases.
 
 <!-- truncate -->

@@ -105,7 +105,9 @@ T inclusive_scan(T value)
 
 In addition, Nabla also supports passing vectors into these subgroup operations, so you can perform reduce or scans on up to subgroup size * 4 (for `vec4`) elements per call.
 Note that it expects the elements in the vectors to be consecutive and in the same order as the input array.
-This is because we've found through benchmarking that the instructing the GPU to do a vector load/store results in faster performance than any attempt at coalesced load/store.
+This is because we've found through benchmarking that instructing the GPU to do a vector load/store results in faster performance than any attempt at coalesced load/store with striding.
+
+We also found shuffles and vector arithmetic to be very expensive, so having the least amount of data exchange between invocations and pre-scanning up to 4 elements within an invocation was significantly faster.
 
 You can find all the implementations on the [Nabla repository](https://github.com/Devsh-Graphics-Programming/Nabla/blob/v0.6.2-alpha1/include/nbl/builtin/hlsl/subgroup2/arithmetic_portability_impl.hlsl)
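The pre-scan-then-combine structure those added lines describe can be sketched on the host. This is a plain C++ illustration of the data movement only, not Nabla's HLSL; the function name, the fixed 4-element layout, and the `lanes` parameter are assumptions made for the sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each "invocation" (lane) owns 4 consecutive elements, mirroring the
// vec4 path: pre-scan them sequentially, then perform a single scan over
// the per-lane totals (the only step that would need subgroup shuffles).
std::vector<int> inclusive_scan_vec4(const std::vector<int>& in, std::size_t lanes)
{
    const std::size_t k = 4; // elements per invocation (vec4)
    assert(in.size() == lanes * k);
    std::vector<int> out(in.size());
    std::vector<int> lane_total(lanes);

    // Step 1: sequential inclusive pre-scan inside each lane.
    for (std::size_t l = 0; l < lanes; ++l) {
        int acc = 0;
        for (std::size_t i = 0; i < k; ++i) {
            acc += in[l * k + i];
            out[l * k + i] = acc;
        }
        lane_total[l] = acc;
    }

    // Step 2: exclusive scan of the lane totals, added back as an offset.
    // On the GPU this is the cross-invocation part, one value per lane.
    int offset = 0;
    for (std::size_t l = 0; l < lanes; ++l) {
        for (std::size_t i = 0; i < k; ++i)
            out[l * k + i] += offset;
        offset += lane_total[l];
    }
    return out;
}
```

With consecutive elements per lane, only `lanes` values have to cross between invocations instead of `lanes * 4`, which is why minimizing shuffles pays off.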
@@ -261,7 +263,36 @@ The active invocations still have to execute the same instruction, but it can be
 </figure>
 
 In CUDA, this is exposed through `__syncwarp()`, and we can do similar in Vulkan using subgroup control barriers.
-It's entirely possible that each subgroup shuffle operation does not run in lockstep with the branching introduced, which would be why that is our solution to the problem for now.
+
+The IPC also enables starvation-free algorithms on CUDA, along with the use of mutexes, where a thread that attempts to acquire a mutex is guaranteed to eventually succeed. Consider the example in the Volta whitepaper of a doubly linked list:
+
+```cpp
+__device__ void insert_after(Node* a, Node* b)
+{
+    Node* c;
+    lock(a);
+    lock(a->next);
+    c = a->next;
+
+    a->next = b;
+    b->prev = a;
+
+    b->next = c;
+    c->prev = b;
+
+    unlock(c);
+    unlock(a);
+}
+```
+
+The diagram shows how, with IPC, even if thread K holds the lock for node A, another thread J in the same subgroup (a warp in the case of CUDA) can wait for the lock to become available without affecting K's progress.
+
+<figure class="image">
+![Doubly Linked List lock](linked_list_lock.png "Doubly Linked List lock")
+<figcaption>Locks are acquired for nodes A and C, shown on the left, before the thread inserts node B, shown on the right. Taken from [NVIDIA TESLA V100 GPU ARCHITECTURE](https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf)</figcaption>
+</figure>
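The `lock()`/`unlock()` helpers in the snippet above are left undefined in the whitepaper. A plausible way to fill them in is a per-node spinlock; the host-side C++ sketch below is an assumption for illustration (a device-side version would spin on `atomicCAS` instead of `std::atomic_flag`):

```cpp
#include <atomic>
#include <cassert>

// Per-node spinlock: a waiting thread only ever acquires the lock if the
// holder can keep running to unlock(), which is what IPC guarantees even
// when both threads are in the same subgroup/warp.
struct Node {
    Node* prev = nullptr;
    Node* next = nullptr;
    std::atomic_flag busy = ATOMIC_FLAG_INIT;
};

void lock(Node* n)
{
    while (n->busy.test_and_set(std::memory_order_acquire)) {
        // spin until the holder releases the node
    }
}

void unlock(Node* n) { n->busy.clear(std::memory_order_release); }

// Same shape as the device-side insert_after() above, with locks defined.
void insert_after(Node* a, Node* b)
{
    lock(a);
    lock(a->next);
    Node* c = a->next;

    a->next = b;
    b->prev = a;
    b->next = c;
    c->prev = b;

    unlock(c);
    unlock(a);
}
```

Without independent thread scheduling, a lane spinning in `lock()` on a mutex held by another lane of the same warp could prevent that lane from ever reaching `unlock()`, deadlocking the warp; IPC removes that hazard.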
+
+In our case, however, it's entirely possible that each subgroup shuffle operation does not run in lockstep with the branching introduced, which is why subgroup execution barriers are our solution to the problem for now.
 Unfortunately, I couldn't find any explicit mention in the SPIR-V specification that confirms whether subgroup shuffle operations actually imply an execution dependency, even after hours of scouring the spec.
linked_list_lock.png: 40.2 KB (binary image, not rendered)
