Reduce and scan operations are core building blocks in the world of parallel computing, and now [Nabla has a new release](https://github.com/Devsh-Graphics-Programming/Nabla/tree/v0.6.2-alpha1) with those operations made even faster for Vulkan at the subgroup and workgroup levels.
14
14
15
15
This article takes a brief look at the Nabla implementation for reduce and scan on the GPU in Vulkan.
16
-
Then, I discuss a missing reconvergence behavior that was expected after subgroup shuffle operations that was only observed on Nvidia devices.
16
+
17
+
Then, I discuss a missing execution dependency that was expected for a subgroup shuffle operation, which was only a problem on Nvidia devices in some test cases.
17
18
18
19
<!-- truncate -->
19
20
@@ -91,6 +92,7 @@ T inclusive_scan(T value)
91
92
rhs = shuffleUp(value, 1)
92
93
value = value + (firstInvocation ? identity : rhs)
93
94
95
+
[unroll]
94
96
for (i = 1; i < SubgroupSizeLog2; i++)
95
97
{
96
98
nextLevelStep = 1 << i
@@ -110,6 +112,7 @@ You can find all the implementations on the [Nabla repository](https://github.co
110
112
## An issue with subgroup sync and reconvergence
111
113
112
114
Now, onto a pretty significant, but strangely obscure, problem that I ran into while unit testing this prior to release.
115
+
[See the unit tests.](https://github.com/Devsh-Graphics-Programming/Nabla-Examples-and-Tests/blob/master/23_Arithmetic2UnitTest/app_resources/testSubgroup.comp.hlsl)
113
116
Nabla also has implementations for workgroup reduce and scans that make use of the subgroup scans above, and one such section looks like this.
_I should note that `memoryIdx` is unique and per-invocation, and also that shared memory is only written to in this step to be accessed in later steps._
140
+
_I should note that this is the first level of scans for the workgroup scope. It is only one step of the algorithm and the data accesses are completely independent. Thus, `memoryIdx` is unique per invocation, and shared memory is only written to in this step, to be read in later steps._
138
141
139
142
At first glance, it looks fine, and it does produce the expected results for the most part... except in some very specific cases.
140
-
And from some more testing and debugging to try and identify the cause, I've found the conditions to be:
143
+
After some more testing and debugging to try and identify the cause, I've found the conditions to be:
141
144
142
145
* using an Nvidia GPU
143
146
* using emulated versions of subgroup operations
144
147
* a decent number of iterations in the loop (in this case at least 8).
145
148
146
149
I tested this on an Intel GPU, to be sure, and the workgroup scan ran correctly.
147
-
That was very baffling initially. And the results produced on an Nvidia device looked like a sync problem.
150
+
This was very baffling initially. And the results produced on an Nvidia device looked like a sync problem.
148
151
149
152
It was even more convincing when I moved the control barrier inside the loop and it immediately produced correct scan results.
Ultimately, we came to the conclusion that each subgroup invocation was probably somehow not in sync as each loop went on.
174
177
Particularly, the effect we're seeing is a shuffle done as if `value` is not in lockstep at the call site.
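
To make the failure mode concrete, here is a minimal CPU-side sketch of a shuffle-up inclusive scan. It is not Nabla code: `shuffle_up_scan` and its `staleLane` parameter are hypothetical names introduced purely for illustration, and the "lagging lane" is a guess at the kind of behavior that would produce such results, not a confirmed description of what the hardware does.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU stand-in for a subgroup: lanes[i] plays the role of `value` in invocation i.
// staleLane < 0 models perfect lockstep. Otherwise, any lane shuffling from
// staleLane observes that lane's value from one step earlier (assumed behavior).
std::vector<uint32_t> shuffle_up_scan(std::vector<uint32_t> lanes, int staleLane = -1)
{
    const std::size_t n = lanes.size();
    std::vector<uint32_t> previous = lanes; // register contents one step earlier
    for (std::size_t step = 1; step < n; step <<= 1)
    {
        std::vector<uint32_t> next = lanes;
        // Lanes below `step` would shuffle out of bounds and add the identity (0).
        for (std::size_t i = step; i < n; ++i)
        {
            const std::size_t src = i - step;
            const bool stale = static_cast<int>(src) == staleLane;
            const uint32_t rhs = stale ? previous[src] : lanes[src];
            next[i] = lanes[i] + rhs;
        }
        previous = lanes;
        lanes = next;
    }
    return lanes;
}
```

With eight lanes that all hold 1, the lockstep run yields 1 through 8, while letting lane 3 lag by one step corrupts the results of lanes 5 and 7, the two lanes that shuffle from it after it has been updated.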
175
-
We tested using a subgroup execution barrier and maximal reconvergence, and found out a memory barrier is enough.
178
+
We tested using a subgroup execution barrier and maximal reconvergence.
179
+
Strangely enough, just a memory barrier also fixed it, which it shouldn't have as subgroup shuffles are magical intrinsics that take arguments by copy and don't really deal with accessing any memory locations (SSA form).
176
180
177
181
```cpp
178
182
T inclusive_scan(T value)
@@ -182,9 +186,11 @@ T inclusive_scan(T value)
182
186
rhs = shuffleUp(value, 1)
183
187
value = value + (firstInvocation ? identity : rhs)
184
188
189
+
[unroll]
185
190
for (i = 1; i < SubgroupSizeLog2; i++)
186
191
{
187
192
nextLevelStep = 1 << i
193
+
memory_barrier()
188
194
rhs = shuffleUp(value, nextLevelStep)
189
195
value = value + (nextLevelStep out of bounds ? identity : rhs)
190
196
}
@@ -196,24 +202,29 @@ However, this problem was only observed on Nvidia devices.
196
202
197
203
As a side note, using the `SPV_KHR_maximal_reconvergence` extension surprisingly doesn't resolve this issue.
198
204
I feel I should point out that many presentations and code listings seem to give the impression that subgroup shuffle operations execute in lockstep, based on the very simple examples provided.
205
+
199
206
For instance, [the example in this presentation](https://vulkan.org/user/pages/09.events/vulkanised-2025/T08-Hugo-Devillers-SaarlandUniversity.pdf) correctly demonstrates where invocations in a tangle are reading and storing to SSBO, but may mislead readers into not considering the Availability and Visibility for other scenarios that need it.
200
-
Specifically, it does not have an intended read-after write if invocations in a tangle execute in lockstep.
207
+
208
+
Such simple examples are good enough to demonstrate the purpose of the extension, but fail to elaborate on specific details.
209
+
If it did have a read-after-write between subgroup invocations, subgroup scope memory dependencies would have been needed.
210
+
201
211
(With that said, since subgroup operations are SSA and take arguments "by copy", this discussion of Memory Dependencies and availability-visibility is not relevant to our problem, but just something to be aware of.)
202
212
203
213
### A minor detour onto the performance of native vs. emulated on Nvidia devices
204
214
215
+
Since all recent Nvidia GPUs support subgroup arithmetic SPIR-V capability, why were we using emulation with shuffles?
205
216
I think this observation warrants a small discussion section of its own.
206
217
The tables below show some numbers from our benchmark, measured through Nvidia's Nsight Graphics profiler, of a subgroup inclusive scan using native SPIR-V instructions and our emulated version.
207
218
208
-
_Native_
219
+
#### Native
209
220
210
221
| Workgroup size | SM throughput (%) | CS warp occupancy (%) | # registers | Dispatch time (ms) |
These numbers are baffling to say the least, particularly the fact that our emulated subgroup scans are twice as fast as the native solution.
225
-
It should be noted that this is with the subgroup barrier in place, not that we saw any marked decrease in performance compared to earlier versions without it.
236
+
It should be noted that this is with the subgroup barrier before every shuffle; we did not see any marked decrease in performance from adding it.
226
237
227
238
A potential explanation for this may be that Nvidia has to consider any inactive invocations in a subgroup, having them behave as if they contribute the identity element $I$ to the scan.
228
239
Our emulated scan instead requires that callers invoke the arithmetic in subgroup-uniform fashion.
229
-
If that is not the case, this seems like a cause for concern for Nvidia's SPIR-V compiler.
240
+
If that is not the case, this seems like a cause for concern for Nvidia's SPIR-V to SASS compiler.
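
The hypothesis about inactive invocations can be sketched on the CPU as follows. This is only an illustration of the extra bookkeeping a native scan might need, not a description of what Nvidia's driver actually does; `masked_inclusive_scan` and `activeMask` are hypothetical names, and the subgroup is assumed to have at most 32 lanes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Inclusive add-scan in which lanes whose bit in activeMask is clear
// contribute the additive identity (0) and receive no result.
std::vector<uint32_t> masked_inclusive_scan(const std::vector<uint32_t>& lanes, uint32_t activeMask)
{
    std::vector<uint32_t> out(lanes.size(), 0u);
    uint32_t running = 0u; // identity element for addition
    for (std::size_t i = 0; i < lanes.size(); ++i)
    {
        const bool active = ((activeMask >> i) & 1u) != 0u;
        if (!active)
            continue; // inactive lane: adds nothing, output stays untouched
        running += lanes[i];
        out[i] = running;
    }
    return out;
}
```

For inputs `{1, 2, 3, 4}` with lane 2 inactive (mask `0b1011`), the active lanes receive 1, 3 and 7. An emulated scan that requires subgroup-uniform calls can skip the per-lane mask check entirely.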
230
241
231
242
### What could cause this behavior on Nvidia? — The Independent Program Counter
232
243
@@ -236,25 +247,39 @@ Prior to Volta, all threads in a subgroup share the same program counter, which
236
247
This means all threads in the same subgroup execute the same instruction at any given time.
237
248
Therefore, when you have a branch in the program flow across threads in the same subgroup, all execution paths generally have to be executed and mask off threads that should not be active for that path.
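
As a rough mental model of that masking, the sketch below runs a divergent `if`/`else` over a simulated subgroup with one shared program counter. The function name and the branch bodies are made up for illustration; real hardware handles this with execution masks in the scheduler rather than with loops.

```cpp
#include <cstddef>
#include <vector>

// Models: if (x is odd) x *= 3; else x += 10;
// Each path taken by at least one lane is issued once for the whole subgroup,
// and a per-lane mask decides which lanes commit results on that pass.
// Returns the number of passes issued.
int run_divergent_branch(std::vector<int>& x)
{
    std::vector<bool> takesIf(x.size());
    bool anyIf = false, anyElse = false;
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        takesIf[i] = (x[i] & 1) != 0;
        anyIf = anyIf || takesIf[i];
        anyElse = anyElse || !takesIf[i];
    }

    int passes = 0;
    if (anyIf) // "if" path: lanes with a clear mask bit idle
    {
        ++passes;
        for (std::size_t i = 0; i < x.size(); ++i)
            if (takesIf[i]) x[i] *= 3;
    }
    if (anyElse) // "else" path: the complementary mask is active
    {
        ++passes;
        for (std::size_t i = 0; i < x.size(); ++i)
            if (!takesIf[i]) x[i] += 10;
    }
    return passes;
}
```

A subgroup holding `{1, 2, 3, 4}` needs two passes, whereas a uniform one such as `{2, 4}` needs only one.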
238
249
250
+
<figure class="image">
251
+

252
+
<figcaption>Thread scheduling under the SIMT warp execution model of Pascal and earlier NVIDIA GPUs. Taken from [NVIDIA TESLA V100 GPU ARCHITECTURE](https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf)</figcaption>
253
+
</figure>
254
+
239
255
From Volta onwards, each thread has its own program counter, which allows it to execute independently of other threads in the same subgroup.
240
256
This also provides a new possibility on Nvidia devices, where you can now synchronize threads in the same subgroup.
257
+
The active invocations still have to execute the same instruction, but it can be at different locations in the program (e.g. different iterations of a loop).
<figcaption>Independent thread scheduling in Volta architecture onwards interleaving execution from divergent branches, using an explicit sync to reconverge threads. Taken from [NVIDIA TESLA V100 GPU ARCHITECTURE](https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf)</figcaption>
262
+
</figure>
263
+
241
264
In CUDA, this is exposed through `__syncwarp()`, and we can do something similar in Vulkan using subgroup control barriers.
242
-
It's entirely possible that each subgroup shuffle operation does not run in lockstep, with the branching introduced in the loop, which would be why that is our solution to the problem for now.
265
+
It's entirely possible that each subgroup shuffle operation does not run in lockstep with the branching introduced, which would be why that is our solution to the problem for now.
266
+
267
+
Unfortunately, I couldn't find any explicit mention in the SPIR-V specification that confirmed whether subgroup shuffle operations actually imply an execution dependency, even after hours of scouring the spec.
243
268
244
-
In the end, it's unclear whether this is a bug in Nvidia's SPIR-V compiler or subgroup shuffle operations actually do not imply reconvergence in the Vulkan specification.
245
-
Unfortunately, I couldn't find anything explicit mention in the SPIR-V specification that confirmed this, even with hours of scouring the spec.
269
+
So then we either have...
246
270
247
-
## What does this implication mean for subgroup operations?
271
+
## This is a gray area of the Subgroup Shuffle Spec and allowed Undefined Behaviour
248
272
249
273
Consider what it means if subgroup convergence doesn't guarantee that active tangle invocations execute a subgroup operation in lockstep.
250
274
251
-
Subgroup ballot and ballot arithmetic are two where you don't have to consider lockstepness, because it is expected that the return value of ballot to be uniform in a tangle and it is known exactly what it should be.
275
+
Subgroup ballot and ballot arithmetic are two where you don't have to consider lockstepness, because the return value of ballot is expected to be uniform in a tangle and, as a corollary, it is known exactly what it should be.
276
+
252
277
Similarly, for subgroup broadcasts, the value being broadcast first needs to be computed, say by invocation K.
253
278
Even if other invocations don't run in lockstep, they can't read the value until invocation K broadcasts it if they want to read the same value (uniformity) and you know what value should be read (broadcasting invocation can check it got the same value back).
254
279
255
280
On the flip side, reductions will always produce a uniform return value for all invocations, even if you reduce a stale or out-of-lockstep input value.
256
281
257
-
Meanwhile, subgroup operations that don't return tangle-uniform values, such as shuffles and scans, would only produce the expected result only if performed on constants or variables written with an execution and memory dependency.
282
+
Meanwhile, subgroup operations that don't return tangle-uniform values, such as shuffles and scans, would produce the expected result only if performed on constants or variables written with an execution dependency.
258
283
These operations can give different results per invocation so there's no implied uniformity, which means there's no reason to expect any constraints on their apparent lockstepness being implied transitively through the properties of the return value.
259
284
260
285
The important consideration then is how a subgroup operation is implemented.
@@ -302,5 +327,10 @@ if (subgroupAny(needs_space)) {
302
327
303
328
With all that said, it needs to be noted that one can't expect every instruction to run in lockstep, as that would negate the advantages of Nvidia's IPC.
304
329
330
+
## Or a bug in Nvidia's SPIR-V to SASS compiler
331
+
332
+
And crucially, it's impossible to know (or, in the case of a signed NDA, to discuss) what's happening with the bug or the performance regression on Nvidia's side.
333
+
Unlike with AMD's RDNA ISAs, where we can verify that the compiler is doing what it should be doing using Radeon GPU Analyzer, the generated SASS is inaccessible and the compiler is not public.
334
+
305
335
----------------------------
306
336
_This issue was observed happening inconsistently on Nvidia driver version 576.80, released 17th June 2025._