Commit e5eb98e

committed
post pull resolve
2 parents 0411a71 + ec8a974 commit e5eb98e

20 files changed

+574
-159
lines changed

blog/2025/2025-01-24-fft-bloom-optimized-to-the-bone-in-nabla/index.md

Lines changed: 66 additions & 6 deletions
@@ -11,7 +11,17 @@ last_update:
1111
author: Fletterio
1212
---
1313

14-
<iframe width="560" height="315" src="https://www.youtube.com/embed/IvWbIPyqE0s?si=UYAO5G_5GIMXxY7Z" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
14+
<div style={{ position: "relative", width: "100%", aspectRatio: "16/9" }}>
15+
<iframe
16+
src="https://www.youtube.com/embed/IvWbIPyqE0s?si=UYAO5G_5GIMXxY7Z"
17+
title="YouTube video player"
18+
frameBorder="0"
19+
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
20+
referrerPolicy="strict-origin-when-cross-origin"
21+
allowFullScreen
22+
style={{ width: "100%", height: "100%", position: "absolute", top: 0, left: 0 }}
23+
/>
24+
</div>
1525

1626
Described as "the most important numerical algorithm of our lifetime", the FFT has applications in a plethora of domains.
1727

@@ -23,7 +33,17 @@ In this article I show how to run an FFT in Nabla, talk about different optimiza
2333

2434
First, one must know what a Fourier Transform is. It's a clever way of decomposing periodic signals into their frequency components, essentially nothing more than an orthonormal change of basis. This might be weird to think about, so here's a good intro to the topic by 3B1B:
2535

26-
<iframe width="560" height="315" src="https://www.youtube.com/embed/spUNpyF58BY?si=ZlJZDmq5fLnEkjnj" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
36+
<div style={{ position: "relative", width: "100%", aspectRatio: "16/9" }}>
37+
<iframe
38+
src="https://www.youtube.com/embed/spUNpyF58BY?si=ZlJZDmq5fLnEkjnj"
39+
title="YouTube video player"
40+
frameBorder="0"
41+
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
42+
referrerPolicy="strict-origin-when-cross-origin"
43+
allowFullScreen
44+
style={{ width: "100%", height: "100%", position: "absolute", top: 0, left: 0 }}
45+
/>
46+
</div>
2747

2848
Don't dwell too much on the continuous case because we're mostly interested in the [Discrete Fourier Transform](https://en.wikipedia.org/wiki/Discrete_Fourier_transform) (DFT for short). It's a centerpiece of Digital Signal Processing. As a quick summary, the DFT is nothing but a change of basis in some vector space. Given a signal defined over some domain (spatial or temporal, usually), the "natural" representation of it is its "canonical basis decomposition" - which means mapping each point in space or time to the signal's value at that point. Thanks to Fourier, we have another very useful representation for the same signal, which involves its "spectral decomposition" - periodic functions defined over certain domains can always be written as a linear combination of some special orthogonal (w.r.t. some metric) functions over the same domain.
2949

@@ -39,13 +59,33 @@ Now you might be asking, why would I care about computing the DFT really fast? W
3959

4060
The convolution of two signals $f$ and $g$, denoted by $f * g$, is a special type of product. My favourite way of reasoning about it (and one I have surprisingly very rarely come upon) is that it's just the superposition of many copies of $f$: for each point $x$ in your space, you take a copy of $f$ centered at $x$, $f(t-x)$ (as a function of a parameter $t$), and scale it by the value of $g$ at that point, $g(x)$, then sum all of these copies together. 3B1B again has a great introductory video, although he presents convolution in a more "standard" way, which is by sliding inverted copies of one signal over the other:
4161

42-
<iframe width="560" height="315" src="https://www.youtube.com/embed/KuXjwB4LzSA?si=8Ma-72OlJ_m-0r3_" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
62+
<div style={{ position: "relative", width: "100%", aspectRatio: "16/9" }}>
63+
<iframe
64+
src="https://www.youtube.com/embed/KuXjwB4LzSA?si=8Ma-72OlJ_m-0r3_"
65+
title="YouTube video player"
66+
frameBorder="0"
67+
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
68+
referrerPolicy="strict-origin-when-cross-origin"
69+
allowFullScreen
70+
style={{ width: "100%", height: "100%", position: "absolute", top: 0, left: 0 }}
71+
/>
72+
</div>
4373

4474
[The Convolution Theorem](https://en.wikipedia.org/wiki/Convolution_theorem#Periodic_convolution) states that we can perform a (circular) convolution as a Hadamard (element-wise) product in the spectral domain. This means that convolution goes from an $O(nm)$ operation ($n$ being the number of pixels of a signal and $m$ being the number of pixels of a filter) down to $O(n \log n)$ (assuming $n \ge m$): You do Forward FFT, then Hadamard product, then Inverse FFT, with the FFTs being $O(n \log n)$ and the product being $O(n)$. For small filters the FFT convolution ends up being slower, but for larger ones the speedup is massive.
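As a quick sanity check of the theorem, here is a minimal Python sketch (not Nabla code) that computes a circular convolution twice: once directly, as a sum of shifted copies of $f$ scaled by $g$, and once as a Hadamard product of spectra. The helper names and the sample data are made up for illustration, and a naive $O(n^2)$ DFT stands in for a real FFT to keep the sketch short.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) DFT; a stand-in for an FFT, used only for illustration."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        # The inverse transform carries the 1/n normalization
        out.append(acc / n if inverse else acc)
    return out

def circular_convolution_direct(f, g):
    """(f * g)(t) = sum over x of f(t - x) * g(x), with indices wrapping around."""
    n = len(f)
    return [sum(f[(t - x) % n] * g[x] for x in range(n)) for t in range(n)]

def circular_convolution_spectral(f, g):
    """Forward DFT of both signals, element-wise product, inverse DFT."""
    F, G = dft(f), dft(g)
    product = [a * b for a, b in zip(F, G)]
    return [v.real for v in dft(product, inverse=True)]

# Hypothetical sample data: a short signal and a small wrap-around blur kernel
signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]

direct = circular_convolution_direct(signal, kernel)
spectral = circular_convolution_spectral(signal, kernel)
print(max(abs(a - b) for a, b in zip(direct, spectral)))
```

The two results agree up to floating point error. Swapping the naive `dft` for an actual FFT is what brings the spectral path down to $O(n \log n)$.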
4575

4676
Our Lead Build System and Test Engineer, Arkadiusz, has a Vulkanised talk giving a recap of the Convolution Theorem and the usage of the FFT in Nabla:
4777

48-
<iframe width="560" height="315" src="https://www.youtube.com/embed/Ol_sHFVXvC0?si=qmAz8XLshpGIKFr0&amp;start=271" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
78+
<div style={{ position: "relative", width: "100%", aspectRatio: "16/9" }}>
79+
<iframe
80+
src="https://www.youtube.com/embed/Ol_sHFVXvC0?si=qmAz8XLshpGIKFr0&amp;start=271"
81+
title="YouTube video player"
82+
frameBorder="0"
83+
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
84+
referrerPolicy="strict-origin-when-cross-origin"
85+
allowFullScreen
86+
style={{ width: "100%", height: "100%", position: "absolute", top: 0, left: 0 }}
87+
/>
88+
</div>
4989

5090
## FFT Bloom
5191

@@ -158,7 +198,17 @@ $M \cdot \mathcal F(K')$ of the matrix $M$ and the spectrum of the kernel $\math
158198

159199
Once again, here's Arkadiusz talking about this:
160200

161-
<iframe width="560" height="315" src="https://www.youtube.com/embed/Ol_sHFVXvC0?si=1ke5LgxKgDwQ-iEL&amp;start=513" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
201+
<div style={{ position: "relative", width: "100%", aspectRatio: "16/9" }}>
202+
<iframe
203+
src="https://www.youtube.com/embed/Ol_sHFVXvC0?si=1ke5LgxKgDwQ-iEL&amp;start=513"
204+
title="YouTube video player"
205+
frameBorder="0"
206+
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
207+
referrerPolicy="strict-origin-when-cross-origin"
208+
allowFullScreen
209+
style={{ width: "100%", height: "100%", position: "absolute", top: 0, left: 0 }}
210+
/>
211+
</div>
162212

163213
This has two important implications: first, that after performing the FFT of a real signal, we only need to store half of the values, since the other half are redundant. The values we store for a sequence of length $N$, for even $N$, are those indexed $0$ through $\frac N 2$, where the latter is commonly known as the Nyquist frequency.
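The redundancy comes from the conjugate symmetry of the DFT of a real signal: for a length-$N$ sequence, bin $N-k$ is the complex conjugate of bin $k$. A minimal Python sketch of this follows, assuming even $N$; the helper names and sample values are hypothetical and a naive DFT again stands in for an FFT.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT; a stand-in for an FFT, used only for illustration."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def expand_half_spectrum(half, n):
    """Rebuild all n bins from bins 0..n/2 using X[n - k] = conj(X[k])."""
    full = list(half)
    for k in range(n // 2 + 1, n):
        full.append(half[n - k].conjugate())
    return full

# Hypothetical real-valued signal of even length
real_signal = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]
n = len(real_signal)

spectrum = dft(real_signal)
half = spectrum[: n // 2 + 1]          # bins 0 through N/2 (Nyquist) inclusive
rebuilt = expand_half_spectrum(half, n)
print(max(abs(a - b) for a, b in zip(spectrum, rebuilt)))
```

Only bins $0$ through $\frac N 2$ are kept, and the remaining bins are recovered by conjugation. Note that bins $0$ and $\frac N 2$ are themselves purely real for a real input.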
164214

@@ -314,7 +364,17 @@ This allows us to keep a single copy of the spectrum resident in GPU memory, wit
314364

315365
What we're doing here is essentially zooming out in the spatial domain by resampling the spectrum. Once again, Arkadiusz's video does give a bit of insight into this as well.
316366

317-
<iframe width="560" height="315" src="https://www.youtube.com/embed/Ol_sHFVXvC0?si=dVlEwrkL2zm7s5Mi&amp;start=572" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
367+
<div style={{ position: "relative", width: "100%", aspectRatio: "16/9" }}>
368+
<iframe
369+
src="https://www.youtube.com/embed/Ol_sHFVXvC0?si=dVlEwrkL2zm7s5Mi&amp;start=572"
370+
title="YouTube video player"
371+
frameBorder="0"
372+
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
373+
referrerPolicy="strict-origin-when-cross-origin"
374+
allowFullScreen
375+
style={{ width: "100%", height: "100%", position: "absolute", top: 0, left: 0 }}
376+
/>
377+
</div>
318378

319379
Since we assume (and in our Bloom example, require) the kernel to have PoT long sides (and square, but for this discussion it could also be rectangular), it turns out that `roundUpToPoT(imageDimensions+kernelDimensions)` is exactly an integer multiple of `kernelDimensions`
320380
(of course, it might be a different multiple per axis). Let's assume
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
---
2+
title: 'help! my triangle is only 300fps!!!'
3+
slug: 'optimised-triangle'
4+
date: '2025-09-24'
5+
authors: ['jaked', 'eduameli']
6+
tags: ['faq', 'article']
7+
---
8+
9+
Don't worry! Your triangle running at a mere 300 fps is perfectly normal. The purpose of this post is to try to convince you it is not
10+
a good use of your time to try to optimise hello-triangle.
11+
12+
- 300fps is still pretty fast! ~3.33ms
13+
14+
- FPS can be a misleading performance metric, as it changes non-linearly as you optimise your frame.
15+
A 10fps difference from 60 to 70fps is ~2.38ms while the difference from 300 to 310fps is ~0.107ms.
16+
To actually profile your application it is much better to use dedicated tools like [Nsight Graphics](https://docs.nvidia.com/nsight-graphics/UserGuide/) or [Tracy](https://github.com/wolfpld/tracy).
17+
18+
- Modern GPUs are very complex, and performance **does not scale linearly with scene complexity**. For example, if one triangle runs at 300fps, this doesn't mean five triangles will run at 60fps.
19+
GPUs are designed to have really good throughput at the cost of latency.
20+
21+
- When rendering one single triangle, most of your frametime may just be **overhead**: this could be your window manager, driver or API state validation, to name a few.
22+
23+
- **hello-triangle** is simply not a representative workload for _real applications_, which are way more complex with lots of factors affecting performance and a **compromise between speed and
24+
quality**. In order to properly judge the performance of your engine, you should at least use a test scene such as [Intel Sponza](https://www.intel.com/content/www/us/en/developer/topic-technology/graphics-research/samples.html) or
25+
[Bistro](https://developer.nvidia.com/orca/amazon-lumberyard-bistro).
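To see why FPS deltas are misleading, it helps to convert them to frame time. The following is a small Python sketch with made-up helper names that reproduces the numbers from the list above:

```python
def frame_time_ms(fps):
    """Time spent on one frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps

def frame_time_saved_ms(fps_before, fps_after):
    """How much frame time an FPS improvement actually buys."""
    return frame_time_ms(fps_before) - frame_time_ms(fps_after)

# The same +10 fps is worth very different amounts of frame time
slow_gain = frame_time_saved_ms(60, 70)
fast_gain = frame_time_saved_ms(300, 310)

print(f"one frame at 300 fps takes {frame_time_ms(300):.2f} ms")
print(f"60 -> 70 fps saves {slow_gain:.3f} ms per frame")
print(f"300 -> 310 fps saves {fast_gain:.3f} ms per frame")
```

Because frame time is the reciprocal of FPS, the same FPS gain shrinks in value the faster you already are, which is why profilers report milliseconds rather than frames per second.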
26+
27+
Good luck on your journey learning graphics!