
Commit 8f6ade8

much more info, discussion on topic
1 parent e39b8f6 commit 8f6ade8

File tree

2 files changed: +79 −15 lines changed

blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/index.md

Lines changed: 68 additions & 13 deletions
@@ -3,16 +3,17 @@ title: 'Nvidia SPIR-V Compiler Bug or Do Subgroup Shuffle Operations Not Imply R
 slug: 'subgroup-shuffle-reconvergence-on-nvidia'
 description: "A look at the behavior behind Nabla's subgroup scan"
 date: '2025-06-19'
-authors: ['keptsecret']
+authors: ['keptsecret', 'devshgraphicsprogramming']
 tags: ['nabla', 'vulkan', 'article']
 last_update:
   date: '2025-06-19'
   author: keptsecret
 ---

-Reduce and scan operations are core building blocks in the world of parallel computing, and now Nabla has a new release with those operations made even faster for Vulkan at the subgroup and workgroup levels.
+Reduce and scan operations are core building blocks in the world of parallel computing, and now [Nabla has a new release](https://github.com/Devsh-Graphics-Programming/Nabla/tree/v0.6.2-alpha1) with those operations made even faster for Vulkan at the subgroup and workgroup levels.

-This article takes a brief look at the Nabla implementation for reduce and scan on the GPU in Vulkan, and then a discussion on expected reconvergence behavior after subgroup operations.
+This article takes a brief look at the Nabla implementation for reduce and scan on the GPU in Vulkan.
+Then, I discuss a missing reconvergence behavior, expected after subgroup shuffle operations, that was only observed on Nvidia devices.

 <!-- truncate -->

@@ -56,7 +57,8 @@ Inclusive: 4 10 12 15 22 23 23 28
 ## Nabla's subgroup scans

 We start with the most basic of building blocks: doing a reduction or a scan in the local subgroup of a Vulkan device.
-Pretty simple actually, since Vulkan already has subgroup arithmetic operations supported via SPIRV, and it's all available in Nabla.
+Pretty simple actually, since Vulkan already has subgroup arithmetic operations supported.
+Nabla exposes these via the [GLSL compatibility header](https://github.com/Devsh-Graphics-Programming/Nabla/blob/v0.6.2-alpha1/include/nbl/builtin/hlsl/glsl_compat/subgroup_arithmetic.hlsl) built on [HLSL SPIR-V inline intrinsics](https://github.com/Devsh-Graphics-Programming/Nabla/blob/v0.6.2-alpha1/include/nbl/builtin/hlsl/spirv_intrinsics/subgroup_arithmetic.hlsl).

 ```cpp
 nbl::hlsl::glsl::groupAdd(T value)
@@ -65,7 +67,7 @@ nbl::hlsl::glsl::groupExclusiveAdd(T value)
 etc...
 ```

-But wait, the SPIRV-provided operations all require your Vulkan physical device to have support the `GroupNonUniformArithmetic` capability.
+But wait, the SPIR-V-provided operations all require your Vulkan physical device to support the `GroupNonUniformArithmetic` capability.
 So, Nabla provides emulated versions for that too, and both versions are compiled into a single templated struct call.

 ```cpp
@@ -80,6 +82,8 @@ struct reduction;
 ```

 The implementation of the emulated subgroup scans makes use of subgroup shuffle operations to access partial sums from other invocations in the subgroup.
+This is based on the [Kogge–Stone adder (KSA)](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda), using $\log_2 n$ steps, where $n$ is the subgroup size with all lanes active.
+It should also be noted that in cases like this, where the SIMD/SIMT processor pays for all lanes regardless of whether they're active, the KSA design is faster than more theoretically work-efficient parallel scans such as Blelloch's (which we use at the workgroup granularity).

 ```cpp
 T inclusive_scan(T value)
@@ -99,8 +103,9 @@ T inclusive_scan(T value)

 In addition, Nabla also supports passing vectors into these subgroup operations, so you can perform reduces or scans on up to subgroup size * 4 (for `vec4`) elements per call.
 Note that it expects the elements in the vectors to be consecutive and in the same order as the input array.
+This is because we've found through benchmarking that instructing the GPU to do a vector load/store results in faster performance than any attempt at coalesced loads/stores.
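As an aside, the shuffle-based Kogge–Stone scan described above can be sketched on the CPU by modeling the subgroup as an array of lanes. This is a minimal illustrative simulation under stated assumptions, not Nabla's actual HLSL; the function name and the zero-fill standing in for out-of-range `shuffleUp` reads are inventions of the sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU simulation of the shuffle-based Kogge-Stone inclusive scan.
// Each element of `lanes` plays the role of one subgroup invocation's `value`.
std::vector<int> kogge_stone_inclusive_scan(std::vector<int> lanes)
{
    const std::size_t n = lanes.size(); // "subgroup size", assumed a power of two
    for (std::size_t step = 1; step < n; step <<= 1)
    {
        // shuffleUp(value, step): lane i reads lane (i - step)'s value;
        // out-of-range lanes contribute the additive identity (0)
        std::vector<int> shuffled(n);
        for (std::size_t i = 0; i < n; ++i)
            shuffled[i] = (i >= step) ? lanes[i - step] : 0;
        for (std::size_t i = 0; i < n; ++i)
            lanes[i] += shuffled[i];
    }
    return lanes; // log2(n) shuffle+add steps in total
}
```

With 8 lanes this takes 3 shuffle+add steps and reproduces the inclusive scan example shown earlier in the article (`4 10 12 15 22 23 23 28`).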
-You can find all the implementations on the [Nabla repository](https://github.com/Devsh-Graphics-Programming/Nabla/blob/master/include/nbl/builtin/hlsl/subgroup2/arithmetic_portability_impl.hlsl)
+You can find all the implementations in the [Nabla repository](https://github.com/Devsh-Graphics-Programming/Nabla/blob/v0.6.2-alpha1/include/nbl/builtin/hlsl/subgroup2/arithmetic_portability_impl.hlsl).

 ## An issue with subgroup sync and reconvergence
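Before digging into the synchronization issue, the vector path from the previous section can be pictured as a two-level scan: each invocation first scans its own 4 consecutive elements locally, then adds on the exclusive scan of the per-invocation totals. A CPU sketch follows; the structure and names are hypothetical, not Nabla's actual implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU sketch of scanning subgroup_size * 4 consecutive elements:
// each "invocation" owns 4 consecutive inputs (one vec4), scans them locally,
// then offsets them by the sum of all preceding invocations' totals.
std::vector<int> vec4_inclusive_scan(const std::vector<int>& data)
{
    const std::size_t lanes = data.size() / 4; // assumes size is a multiple of 4
    std::vector<int> out(data.size());
    std::vector<int> lane_sum(lanes, 0);
    for (std::size_t lane = 0; lane < lanes; ++lane)
    {
        int acc = 0; // local inclusive scan of this lane's vec4
        for (std::size_t c = 0; c < 4; ++c)
        {
            acc += data[lane * 4 + c];
            out[lane * 4 + c] = acc;
        }
        lane_sum[lane] = acc;
    }
    // exclusive scan of per-lane totals (done with subgroup ops on the GPU)
    int offset = 0;
    for (std::size_t lane = 0; lane < lanes; ++lane)
    {
        for (std::size_t c = 0; c < 4; ++c)
            out[lane * 4 + c] += offset;
        offset += lane_sum[lane];
    }
    return out;
}
```

This is also why the elements are expected to be consecutive and in input order: each invocation's vec4 must cover a contiguous slice of the array for the two-level decomposition to equal one flat scan.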
@@ -110,6 +115,7 @@ Nabla also has implementations for workgroup reduce and scans that make use of t
 ```cpp
 ... workgroup scan code ...
+debug_barrier()
 for (idx = 0; idx < VirtualWorkgroupSize / WorkgroupSize; idx++)
 {
     value = getValueFromDataAccessor(memoryIdx)
@@ -123,11 +129,13 @@ for (idx = 0; idx < VirtualWorkgroupSize / WorkgroupSize; idx++)
         setValueToSharedMemory(smemIdx)
     }
 }
-control_barrier()
+workgroup_execution_and_memory_barrier()

 ... workgroup scan code ...
 ```

+_I should note that `memoryIdx` is unique per invocation, and also that shared memory is written in this step only to be accessed in later steps._
+
 At first glance, it looks fine, and it does produce the expected results for the most part... except in some very specific cases.
 From some more testing and debugging to identify the cause, I found the conditions to be:
@@ -143,6 +151,7 @@ It was even more convincing when I moved the control barrier inside the loop and
 ```cpp
 ... workgroup scan code ...

+debug_barrier()
 for (idx = 0; idx < VirtualWorkgroupSize / WorkgroupSize; idx++)
 {
     value = getValueFromDataAccessor(memoryIdx)
@@ -155,20 +164,20 @@ for (idx = 0; idx < VirtualWorkgroupSize / WorkgroupSize; idx++)
     {
         setValueToSharedMemory(smemIdx)
     }
-    control_barrier()
+    workgroup_execution_and_memory_barrier()
 }

 ... workgroup scan code ...
 ```

 Ultimately, we came to the conclusion that each subgroup invocation was probably somehow falling out of sync as the loop went on.
-Particularly, the last invocation that spends some extra time writing to shared memory may have been lagging behind.
-It is a simple fix to the emulated subgroup reduce and scan. A subgroup barrier was enough.
+In particular, the effect we observed is a shuffle performed as if `value` were not in lockstep at the call site.
+We tested using a subgroup execution barrier and maximal reconvergence, and found that a memory barrier is enough.

 ```cpp
 T inclusive_scan(T value)
 {
-    control_barrier()
+    memory_barrier()

     rhs = shuffleUp(value, 1)
     value = value + (firstInvocation ? identity : rhs)
@@ -183,10 +192,56 @@ T inclusive_scan(T value)
 }
 ```

+However, this problem was only observed on Nvidia devices.
+
 As a side note, surprisingly, using the `SPV_KHR_maximal_reconvergence` extension doesn't resolve this issue.
+I feel I should point out that many presentations and code listings give the impression, based on the very simple examples they provide, that subgroup shuffle operations execute in lockstep.
+For instance, [the example in this presentation](https://vulkan.org/user/pages/09.events/vulkanised-2025/T08-Hugo-Devillers-SaarlandUniversity.pdf) correctly demonstrates invocations in a tangle reading from and storing to an SSBO, but may mislead readers into not considering Availability and Visibility in other scenarios that need them.
+Specifically, it does not have an intended read-after-write if invocations in a tangle execute in lockstep.
+(With that said, since subgroup operations are SSA and take arguments "by copy", this discussion of Memory Dependencies and availability-visibility is not relevant to our problem, but it is something to be aware of.)

-However, this problem was only observed on Nvidia devices.
-And as the title of this article states, it's unclear whether this is a bug in Nvidia's SPIRV compiler or subgroup shuffle operations just do not imply reconvergence in the Vulkan specification.
+### A minor detour into the performance of native vs. emulated on Nvidia devices
+
+I think this observation warrants a small discussion section of its own.
+The table below shows some numbers from our benchmark, measured with Nvidia's Nsight Graphics profiler, of a subgroup inclusive scan using native SPIR-V instructions and our emulated version.
+
+_Native_
+
+| Workgroup size | SM throughput (%) | CS warp occupancy (%) | # registers | Dispatch time (ms) |
+| :------------: | :---------------: | :-------------------: | :---------: | :----------------: |
+| 256 | 41.6 | 90.5 | 16 | 27 |
+| 512 | 41.4 | 89.7 | 16 | 27.15 |
+| 1024 | 40.5 | 59.7 | 16 | 27.74 |
+
+_Emulated_
+
+| Workgroup size | SM throughput (%) | CS warp occupancy (%) | # registers | Dispatch time (ms) |
+| :------------: | :---------------: | :-------------------: | :---------: | :----------------: |
+| 256 | 37.9 | 90.7 | 16 | 12.22 |
+| 512 | 37.7 | 90.3 | 16 | 12.3 |
+| 1024 | 37.1 | 60.5 | 16 | 12.47 |
+
+These numbers are baffling to say the least, particularly the fact that our emulated subgroup scans are twice as fast as the native solution.
+It should be noted that this is with the subgroup barrier in place, not that we saw any marked decrease in performance compared to earlier versions without it.
+
+A potential explanation may be that Nvidia has to account for any inactive invocations in a subgroup, having them behave as if they contribute the identity element $I$ to the scan.
+Our emulated scan instead requires callers to invoke the arithmetic in a subgroup-uniform fashion.
+If that is not the case, this seems like a cause for concern about Nvidia's SPIR-V compiler.
+
+### What could cause this behavior on Nvidia? The Independent Program Counter
+
+We think a potential culprit could be Nvidia's Independent Program Counter (IPC), introduced with the Volta architecture.
+
+Prior to Volta, all threads in a subgroup shared the same program counter, which handled scheduling of instructions across all those threads.
+This means all threads in the same subgroup execute the same instruction at any given time.
+Therefore, when the program flow branches across threads in the same subgroup, all execution paths generally have to be executed, masking off the threads that should not be active on each path.
+
+From Volta onward, each thread has its own program counter, which allows it to execute independently of the other threads in the same subgroup.
+This also opens up a new possibility on Nvidia devices: you can now synchronize threads within the same subgroup.
+In CUDA, this is exposed through `__syncwarp()`, and we can do something similar in Vulkan using subgroup control barriers.
+It's entirely possible that, with the branching introduced by the loop, the subgroup shuffle operations do not run in lockstep, which is why a subgroup barrier is our solution to the problem for now.
+
+In the end, it's unclear whether this is a bug in Nvidia's SPIR-V compiler or whether subgroup shuffle operations just do not imply reconvergence in the Vulkan specification.

 ----------------------------
 _This issue was observed happening inconsistently on Nvidia driver version 576.80, released 17th June 2025._
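The pre-Volta lockstep model described above can be pictured as both sides of a branch executing over all lanes under an active mask, with inactive lanes masked off. A toy C++ simulation of this (purely illustrative; the 4-lane width, the branch taken, and all names are hypothetical, and this is not how any driver is implemented):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy simulation of pre-Volta lockstep execution of `x = even(x) ? x*2 : x+10`:
// the whole subgroup walks both paths of the branch, with a per-lane active
// mask deciding which lanes actually commit results on each path.
std::array<int, 4> simulate_masked_branch(std::array<int, 4> x)
{
    std::array<bool, 4> mask{};
    for (std::size_t i = 0; i < 4; ++i)
        mask[i] = (x[i] % 2 == 0); // the branch condition, evaluated per lane

    for (std::size_t i = 0; i < 4; ++i) // "then" path: only taken lanes commit
        if (mask[i]) x[i] *= 2;
    for (std::size_t i = 0; i < 4; ++i) // "else" path: the remaining lanes commit
        if (!mask[i]) x[i] += 10;
    return x; // all lanes reconverge here under the shared program counter
}
```

With an independent program counter per thread, nothing forces the lanes back together at that final point, which is the crux of the reconvergence question this article raises.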

blog/authors.yml

Lines changed: 11 additions & 2 deletions

@@ -36,9 +36,18 @@ fletterio:

 keptsecret:
   name: Sorakrit Chonwattanagul
-  title: Junior Developer @ DevSH GP
+  title: Associate Developer @ DevSH GP
   url: https://github.com/keptsecret/
   image_url: https://avatars.githubusercontent.com/u/27181108?v=4
   page: true
   socials:
-    github: keptsecret
+    github: keptsecret
+
+devshgraphicsprogramming:
+  name: Mateusz Kielan
+  title: CTO of DevSH GP
+  url: https://github.com/devshgraphicsprogramming
+  image_url: https://avatars.githubusercontent.com/u/6894321?v=4
+  page: true
+  socials:
+    github: devshgraphicsprogramming
