Commit dc0a64d

some corrections
1 parent 8ef161d

1 file changed: +8 additions, −17 deletions

blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/index.md

@@ -66,7 +66,7 @@ etc...
 ```
 
 But wait, the SPIRV-provided operations all require your Vulkan physical device to support the `GroupNonUniformArithmetic` capability.
-So, Nabla provides emulated versions for that too, and it's all compiled into a single templated struct call.
+So, Nabla provides emulated versions for that too, and both versions are compiled into a single templated struct call.
 
 ```cpp
 template<class Params, class BinOp, uint32_t ItemsPerInvocation, bool native>
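For context while reading this hunk, the single templated struct it refers to can be pictured roughly as follows. This is a minimal sketch under assumed helper names (`spirv_groupNonUniformReduce`, `subgroupShuffleXor`, `Params::SubgroupSize` and the struct body itself); it is not Nabla's actual implementation:

```cpp
// Sketch only: picks between the native SPIR-V group operation and a
// shuffle-based emulation at compile time via the `native` template parameter.
template<class Params, class BinOp, uint32_t ItemsPerInvocation, bool native>
struct reduction
{
    using T = typename Params::type_t;

    T operator()(T value)
    {
        if constexpr (native)
        {
            // Native path: only legal when the device reports the
            // GroupNonUniformArithmetic capability.
            return spirv_groupNonUniformReduce<BinOp>(value); // hypothetical intrinsic
        }
        else
        {
            // Emulated path: XOR-butterfly reduction built from subgroup shuffles.
            BinOp op;
            for (uint32_t stride = Params::SubgroupSize >> 1; stride > 0; stride >>= 1)
                value = op(value, subgroupShuffleXor(value, stride)); // hypothetical shuffle
            return value;
        }
    }
};
```

The point of the `native` flag is that a single call site can compile to either path depending on the reported device capabilities.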
@@ -129,7 +129,7 @@ control_barrier()
 ```
 
 At first glance, it looks fine, and it does produce the expected results for the most part... except in some very specific cases.
-And from some more testing and debugging to try and identify the cause, I've found the conditions to be:
+After some more testing and debugging to identify the cause, I found the conditions to be:
 
 * using an Nvidia GPU
 * using emulated versions of subgroup operations
@@ -163,12 +163,12 @@ for (idx = 0; idx < VirtualWorkgroupSize / WorkgroupSize; idx++)
 
 Ultimately, we came to the conclusion that each subgroup invocation was probably somehow not in sync as each loop went on.
 In particular, the last invocation, which spends some extra time writing to shared memory, may have been lagging behind.
-It is a simple fix to the emulated subgroup reduce and scan. A memory barrier was enough.
+The fix to the emulated subgroup reduce and scan is simple: a subgroup barrier was enough.
 
 ```cpp
 T inclusive_scan(T value)
 {
-    memory_barrier()
+    control_barrier()
 
     rhs = shuffleUp(value, 1)
     value = value + (firstInvocation ? identity : rhs)
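To make the barrier placement concrete, the snippet above can be expanded into a full emulated scan along these lines. This is a sketch of the classic Hillis-Steele shuffle scan under assumed helper names (`subgroupControlBarrier`, `subgroupShuffleUp`, `subgroupInvocationID`, `identity<BinOp>`), not Nabla's real code:

```cpp
// Sketch: emulated subgroup-wide inclusive scan with the corrected barrier.
template<class T, class BinOp, uint32_t SubgroupSize>
T inclusive_scan(T value)
{
    // The fix: force every invocation in the subgroup to reconverge *before*
    // the shuffle chain, so no lane reads a stale value from a lagging neighbour.
    subgroupControlBarrier(); // hypothetical wrapper over OpControlBarrier at Subgroup scope

    BinOp op;
    const uint32_t lane = subgroupInvocationID(); // hypothetical lane-index query

    // Hillis-Steele scan: at each step, combine with the value `step` lanes below.
    for (uint32_t step = 1; step < SubgroupSize; step <<= 1)
    {
        const T rhs = subgroupShuffleUp(value, step); // hypothetical shuffle-up
        value = op(value, lane >= step ? rhs : identity<BinOp>());
    }
    return value;
}
```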
@@ -185,17 +185,8 @@ T inclusive_scan(T value)
 
 As a side note, using the `SPV_KHR_maximal_reconvergence` extension surprisingly doesn't resolve this issue.
 
-However, this was only a problem on Nvidia devices.
-And as the title of this article states, it's unclear whether this is a bug in Nvidia's SPIRV compiler or subgroup shuffle operations just do not imply reconvergence in the spec.
-
--------------------
-
-P.S. you may note in the source code that the memory barrier contains the workgroup memory mask, despite us only needing sync in the subgroup scope.
-
-```cpp
-spirv::memoryBarrier(spv::ScopeSubgroup, spv::MemorySemanticsWorkgroupMemoryMask | spv::MemorySemanticsAcquireMask);
-```
+However, this problem was only observed on Nvidia devices.
+And as the title of this article states, it's unclear whether this is a bug in Nvidia's SPIRV compiler or whether subgroup shuffle operations simply do not imply reconvergence in the Vulkan specification.
 
-This is because unfortunately, the subgroup memory mask doesn't seem to count as a storage class, at least according to the Vulkan SPIRV validator.
-Only the next step up in memory level is valid.
-I feel like there's possibly something missing here.
+----------------------------
+_This issue was observed to occur inconsistently on Nvidia driver version 576.80, released 17th June 2025._
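As a final illustration, the `control_barrier()` in the fixed scan plausibly lowers to an `OpControlBarrier` at subgroup execution scope. In the same `spirv::` intrinsic style as the removed P.S., that could look like the sketch below; the exact call and semantics flags are assumptions, not taken from this commit:

```cpp
// Sketch: a control barrier that makes the whole subgroup reconverge and
// orders prior shared-memory writes before later reads.
spirv::controlBarrier(
    spv::ScopeSubgroup,                        // execution scope: all subgroup invocations rendezvous here
    spv::ScopeSubgroup,                        // memory scope
    spv::MemorySemanticsWorkgroupMemoryMask |  // Workgroup storage class (shared memory)
    spv::MemorySemanticsAcquireReleaseMask);   // acquire-release ordering
```

The relevant distinction is that `OpControlBarrier` makes invocations wait for each other at the barrier, while `OpMemoryBarrier` only orders memory accesses, which matches the post's finding that a memory barrier alone was not enough.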
