Reduce and scan operations are core building blocks in the world of parallel computing, and now [Nabla has a new release](https://github.com/Devsh-Graphics-Programming/Nabla/tree/v0.6.2-alpha1) with those operations made even faster for Vulkan at the subgroup and workgroup levels.
This article takes a brief look at the Nabla implementation for reduce and scan on the GPU in Vulkan.
Then, I discuss a reconvergence behavior that we expected after subgroup shuffle operations but found to be missing, something we observed only on Nvidia devices.
We start with the most basic of building blocks: doing a reduction or a scan in the local subgroup of a Vulkan device.
Pretty simple actually, since Vulkan already supports subgroup arithmetic operations via SPIR-V.
Nabla exposes this via the [GLSL compatibility header](https://github.com/Devsh-Graphics-Programming/Nabla/blob/v0.6.2-alpha1/include/nbl/builtin/hlsl/glsl_compat/subgroup_arithmetic.hlsl) built on [HLSL SPIR-V inline intrinsics](https://github.com/Devsh-Graphics-Programming/Nabla/blob/v0.6.2-alpha1/include/nbl/builtin/hlsl/spirv_intrinsics/subgroup_arithmetic.hlsl).
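To give a feel for what such an inline intrinsic looks like, here is a minimal sketch of a DXC inline SPIR-V declaration for a subgroup integer add. The function name is made up and the actual Nabla headers wrap this in far more templating; the opcode and capability numbers are from the SPIR-V spec.

```cpp
// Sketch of an HLSL inline SPIR-V intrinsic, in the spirit of Nabla's
// spirv_intrinsics headers (illustrative name, abridged signature).
// OpGroupNonUniformIAdd = 349, GroupNonUniformArithmetic capability = 63.
[[vk::ext_capability(63)]]
[[vk::ext_instruction(349)]]
uint32_t groupNonUniformIAdd(uint32_t executionScope, [[vk::ext_literal]] uint32_t groupOperation, uint32_t value);

// Usage: execution scope 3 = Subgroup, group operation 1 = InclusiveScan.
// uint32_t scanned = groupNonUniformIAdd(3, 1, value);
```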
But wait, the SPIR-V-provided operations all require your Vulkan physical device to support the `GroupNonUniformArithmetic` capability.
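Checking for that support at runtime goes through the standard subgroup properties query from core Vulkan 1.1:

```cpp
// Query which classes of subgroup operations the device supports (Vulkan 1.1+).
VkPhysicalDeviceSubgroupProperties subgroupProps = {};
subgroupProps.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;

VkPhysicalDeviceProperties2 props2 = {};
props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
props2.pNext = &subgroupProps;
vkGetPhysicalDeviceProperties2(physicalDevice, &props2);

// Set when the device can expose the GroupNonUniformArithmetic capability.
const bool hasSubgroupArithmetic =
    (subgroupProps.supportedOperations & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT) != 0;
```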
So, Nabla provides emulated versions for that too, and both versions are accessible through a single templated struct.
```cpp
// Declaration abridged; see the linked Nabla headers for the full template
// parameters, which select between the native and emulated implementations.
template</* ... */>
struct reduction;
```
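In spirit, the dispatch between the two looks something like the sketch below. Every name here is an illustrative stand-in, not Nabla's actual declaration (which the links above point to):

```cpp
#include <cstdint>

// Illustrative stand-ins; in the real headers these are SPIR-V intrinsics
// and subgroup shuffles, and the subgroup size comes from a config struct.
constexpr uint32_t SubgroupSize = 32;
template<class BinOp, typename T> T native_subgroup_reduce(T value);
template<typename T> T subgroup_shuffle_xor(T value, uint32_t mask);

// Compile-time selection between the native and emulated implementations.
template<typename T, class BinOp, bool Native>
struct reduction;

template<typename T, class BinOp>
struct reduction<T, BinOp, true>   // native: forward to the SPIR-V instruction
{
    T operator()(T value) { return native_subgroup_reduce<BinOp>(value); }
};

template<typename T, class BinOp>
struct reduction<T, BinOp, false>  // emulated: shuffle-based butterfly reduction
{
    T operator()(T value)
    {
        BinOp op;
        for (uint32_t mask = 1u; mask < SubgroupSize; mask <<= 1u)
            value = op(value, subgroup_shuffle_xor(value, mask));
        return value;
    }
};
```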
The implementation of emulated subgroup scans makes use of subgroup shuffle operations to access partial sums from other invocations in the subgroup.
This is based on the [Kogge–Stone adder (KSA)](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda), using $\log_2 n$ steps where $n$ is the subgroup size with all lanes active.
It should also be noted that in cases like this, where the SIMD/SIMT processor pays for all lanes regardless of whether or not they're active, the KSA design is faster than theoretically more work-efficient parallel scans like Blelloch's (which we use at the workgroup granularity).
```cpp
T inclusive_scan(T value)
{
    rhs = shuffleUp(value, 1)
    value = value + (firstInvocation ? identity : rhs)
    ...
}
```
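To make those $\log_2 n$ steps concrete, here is a trace of the shuffle-based inclusive sum scan over a hypothetical 8-lane subgroup with an all-ones input; the shuffle offset doubles each step:

```cpp
// lane:                         0  1  2  3  4  5  6  7
// input value:                  1  1  1  1  1  1  1  1
// after step 1 (shuffleUp 1):   1  2  2  2  2  2  2  2
// after step 2 (shuffleUp 2):   1  2  3  4  4  4  4  4
// after step 3 (shuffleUp 4):   1  2  3  4  5  6  7  8   <- inclusive scan result
```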
In addition, Nabla also supports passing vectors into these subgroup operations, so you can perform reduces or scans on up to subgroup size * 4 (for `vec4`) elements per call.
Note that it expects the elements in the vectors to be consecutive and in the same order as the input array.
This is because we've found through benchmarking that instructing the GPU to do a vector load/store results in faster performance than any attempt at coalesced load/store.
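As an illustration in the same pseudocode style as the listings here (the helper names are assumed): invocation $i$ owns the four consecutive elements $4i$ through $4i+3$, moves them with single vector accesses, and scans them in one call.

```cpp
// Illustrative pseudocode: each invocation owns 4 consecutive elements.
vec4 v = vector_load(data, invocationIndex * 4)   // single vector load
v = subgroup_inclusive_scan(v)                    // scans SubgroupSize * 4 elements total
vector_store(data, invocationIndex * 4, v)        // single vector store
```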
You can find all the implementations on the [Nabla repository](https://github.com/Devsh-Graphics-Programming/Nabla/blob/v0.6.2-alpha1/include/nbl/builtin/hlsl/subgroup2/arithmetic_portability_impl.hlsl).
## An issue with subgroup sync and reconvergence
Nabla also has implementations for workgroup reduce and scans that make use of these subgroup operations.
```cpp
... workgroup scan code ...

debug_barrier()
for (idx = 0; idx < VirtualWorkgroupSize / WorkgroupSize; idx++)
    ...
```
_I should note that `memoryIdx` is unique and per-invocation, and also that shared memory is only written to in this step to be accessed in later steps._
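For context, here is a sketch of what each iteration of that loop does, reconstructed from the description above; the helper names are placeholders, not the actual Nabla code.

```cpp
// One virtual-workgroup step, illustrative only:
value = shared_mem[memoryIdx]           // read this invocation's partial value
value = subgroup_inclusive_scan(value)  // the (emulated) subgroup scan from earlier
shared_mem[memoryIdx] = value           // written here, read again in later steps
```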
At first glance, it looks fine, and it does produce the expected results for the most part... except in some very specific cases.
And from some more testing and debugging to try and identify the cause, I found that it only reproduces under a very specific set of conditions.
It was even more convincing when I moved the control barrier inside the loop:
```cpp
... workgroup scan code ...

for (idx = 0; idx < VirtualWorkgroupSize / WorkgroupSize; idx++)
    debug_barrier()
    ...
```
Ultimately, we came to the conclusion that the subgroup invocations were probably somehow not in sync as the loop iterations went on.
In particular, the effect we're seeing is a shuffle executed as if `value` were not in lockstep across invocations at the call site.
We tested using a subgroup execution barrier and maximal reconvergence, and found that a memory barrier is enough.
```cpp
T inclusive_scan(T value)
{
    memory_barrier()

    rhs = shuffleUp(value, 1)
    value = value + (firstInvocation ? identity : rhs)
    ...
}
```
However, this problem was only observed on Nvidia devices.
As a side note, surprisingly, using the `SPV_KHR_maximal_reconvergence` extension doesn't resolve this issue.
I feel I should point out that many presentations and code listings seem to give the impression that subgroup shuffle operations execute in lockstep, based on the very simple examples provided.
For instance, [the example in this presentation](https://vulkan.org/user/pages/09.events/vulkanised-2025/T08-Hugo-Devillers-SaarlandUniversity.pdf) correctly demonstrates invocations in a tangle reading from and storing to an SSBO, but may mislead readers into not considering Availability and Visibility in other scenarios that need them.
Specifically, the example has no intended read-after-write dependency when invocations in a tangle execute in lockstep.
(With that said, since subgroup operations are SSA and take arguments "by copy", this discussion of Memory Dependencies and availability-visibility is not relevant to our problem, but just something to be aware of.)
### A minor detour into the performance of native vs. emulated on Nvidia devices
I think this observation warrants a small discussion section of its own.
The tables below show some numbers from our benchmark, measured through Nvidia's Nsight Graphics profiler, of a subgroup inclusive scan using native SPIR-V instructions and our emulated version.
_Native_
| Workgroup size | SM throughput (%) | CS warp occupancy (%) | # registers | Dispatch time (ms) |
|---|---|---|---|---|
These numbers are baffling to say the least, particularly the fact that our emulated subgroup scans are twice as fast as the native solution.
It should be noted that this is with the subgroup barrier in place; not that we saw any marked decrease in performance compared to earlier versions without it.
A potential explanation for this may be that Nvidia has to consider any inactive invocations in a subgroup, having them behave as if they contribute the identity element $I$ to the scan.
Our emulated scan instead requires callers to invoke the arithmetic in a subgroup-uniform fashion.
If that is not the case, this seems like a cause for concern for Nvidia's SPIR-V compiler.
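To illustrate the difference with hypothetical values: given a subgroup where lanes 2 and 3 are inactive, a native inclusive sum scan must treat the holes as contributing $I = 0$.

```cpp
// lane:                      0  1  2  3  4
// active?                    y  y  n  n  y
// input value:               1  1  -  -  1
// native inclusive scan:     1  2  -  -  3   // hardware must skip inactive lanes
// emulated scan requirement: all lanes active, so no hole-skipping logic at all
```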
### What could cause this behavior on Nvidia? — The Independent Program Counter
We think a potential culprit for this could be Nvidia's Independent Program Counter (IPC), which was introduced with the Volta architecture.
Prior to Volta, all threads in a subgroup shared the same program counter, which handled the scheduling of instructions across all of those threads.

This meant all threads in the same subgroup executed the same instruction at any given time.

Therefore, when the program flow branches across threads in the same subgroup, all execution paths generally have to be executed, with threads masked off on the paths where they should not be active.
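The classic illustration of that masking, with a hypothetical per-lane branch:

```cpp
// Pre-Volta execution of a divergent branch, conceptually:
if (laneID < 16)
    A();   // executes with lanes 16-31 masked off
else
    B();   // executes with lanes 0-15 masked off
// the two paths run one after the other, then the subgroup reconverges
```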
From Volta onwards, each thread has its own program counter, which allows it to execute independently of other threads in the same subgroup.

This also opens a new possibility on Nvidia devices: you can now synchronize threads within the same subgroup.

In CUDA, this is exposed through `__syncwarp()`, and we can do something similar in Vulkan using subgroup control barriers.
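Both are one-liners; `subgroupBarrier()` below is the GLSL spelling from `GL_KHR_shader_subgroup`, which maps to an `OpControlBarrier` at subgroup scope.

```cpp
// CUDA: synchronize (and reconverge) the named lanes of the warp
__syncwarp(0xffffffff);

// Vulkan GLSL: execution barrier (plus memory barrier) at subgroup scope
subgroupBarrier();
```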
It's entirely possible that, with the branching introduced by the loop, the subgroup shuffle operations do not run in lockstep, which would be why a barrier is our solution to the problem for now.
In the end, it's unclear whether this is a bug in Nvidia's SPIR-V compiler, or whether subgroup shuffle operations simply do not imply reconvergence under the Vulkan specification.
----------------------------
_This issue was observed happening inconsistently on Nvidia driver version 576.80, released 17th June 2025._