Described as "the most important numerical algorithm of our lifetime", the FFT has applications in a plethora of domains.
In this article I show how to run an FFT in Nabla, talk about different optimizations…
First, one must know what a Fourier Transform is. It's a clever way of decomposing periodic signals into their frequency components, essentially nothing more than an orthonormal change of basis. This might be weird to think about, so here's a good intro to the topic by 3B1B:
<iframe width="560" height="315" src="https://www.youtube.com/embed/spUNpyF58BY?si=ZlJZDmq5fLnEkjnj" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
Don't dwell too much on the continuous case, because we're mostly interested in the [Discrete Fourier Transform](https://en.wikipedia.org/wiki/Discrete_Fourier_transform) (DFT for short). It's a centerpiece of Digital Signal Processing. As a quick summary, the DFT is nothing but a change of basis in some vector space. Given a signal defined over some domain (spatial or temporal, usually), the "natural" representation of it is its "canonical basis decomposition", which means mapping each point in space or time to the signal's value at that point. Thanks to Fourier, we have another very useful representation for the same signal, its "spectral decomposition": periodic functions defined over certain domains can always be written as a linear combination of some special orthogonal (w.r.t. some metric) functions over the same domain.
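To make the change-of-basis view concrete, here's a minimal sketch in plain Python (a naive $O(n^2)$ DFT, purely for illustration; the `dft`/`idft` names are mine, not Nabla's API):

```python
import cmath

def dft(signal):
    """Naive O(n^2) DFT: project the signal onto each Fourier basis vector.
    X[k] is the inner product of the signal with exp(-2*pi*i*k*n/N)."""
    N = len(signal)
    return [sum(signal[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(spectrum):
    """Inverse DFT: recombine the basis functions weighted by the coefficients."""
    N = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

# Round-tripping recovers the signal (up to floating-point error), because
# the Fourier basis is orthogonal and the 1/N factor normalizes it.
x = [1.0, 2.0, 0.0, -1.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(x, idft(dft(x))))
```

Because it is just a change of basis, no information is lost in either direction; the forward and inverse transforms differ only in the sign of the exponent and the normalization.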
Now you might be asking, why would I care about computing the DFT really fast?
The convolution of two signals $f$ and $g$, denoted by $f * g$, is a special type of product. My favourite way of reasoning about it (and one I have, surprisingly, rarely come across) is that it's just the superposition of many copies of $f$: for each point $x$ in your space, you take a copy of $f$ centered at $x$, $f(t-x)$ (as a function of a parameter $t$), and scale it by the value of $g$ at that point, $g(x)$, then sum all of these copies together. 3B1B again has a great introductory video, although he presents convolution in the more "standard" way, which is by sliding inverted copies of one signal over the other:
<iframe width="560" height="315" src="https://www.youtube.com/embed/KuXjwB4LzSA?si=8Ma-72OlJ_m-0r3_" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
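The "superposition of copies" view and the "flip and slide" view compute exactly the same thing, which is easy to verify numerically. A small Python sketch (helper names are just for illustration):

```python
def conv_superposition(f, g):
    """Convolution as superposition: for each x, add a copy of f
    shifted to start at x and scaled by g[x]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for x, weight in enumerate(g):
        for t, value in enumerate(f):
            out[x + t] += weight * value
    return out

def conv_sliding(f, g):
    """The textbook definition: (f*g)[n] = sum over x of f[n-x] * g[x]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for n in range(len(out)):
        for x in range(len(g)):
            if 0 <= n - x < len(f):
                out[n] += f[n - x] * g[x]
    return out

# Both views agree; they only reorder the same multiply-adds.
f = [1.0, 2.0, 3.0]
g = [0.5, 0.0, -1.0]
assert conv_superposition(f, g) == conv_sliding(f, g)
```

The superposition view is often the more intuitive one for graphics: each bright pixel "stamps" a scaled copy of the filter kernel onto the output.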
[The Convolution Theorem](https://en.wikipedia.org/wiki/Convolution_theorem#Periodic_convolution) states that we can perform a (circular) convolution as a Hadamard (element-wise) product in the spectral domain. This means that convolution goes from an $O(nm)$ operation ($n$ being the number of pixels of a signal and $m$ being the number of pixels of a filter) down to $O(n \log n)$ (assuming $n \ge m$): You do Forward FFT, then Hadamard product, then Inverse FFT, with the FFTs being $O(n \log n)$ and the product being $O(n)$. For small filters the FFT convolution ends up being slower, but for larger ones the speedup is massive.
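The theorem itself can be checked in a few lines. Note that the transform below is a naive $O(n^2)$ DFT purely for illustration; a real implementation would use an FFT to actually get the $O(n \log n)$ cost:

```python
import cmath

def dft(x, sign=-1):
    # Naive DFT; sign=-1 is the forward transform, sign=+1 the
    # inverse (without the 1/N factor, applied by the caller).
    N = len(x)
    return [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def circular_conv_direct(f, g):
    # Direct circular convolution: O(n*m) multiply-adds.
    N = len(f)
    return [sum(f[(n - x) % N] * g[x] for x in range(N)) for n in range(N)]

def circular_conv_spectral(f, g):
    # Convolution theorem: transform both signals, take the Hadamard
    # (element-wise) product, then inverse-transform.
    N = len(f)
    H = [a * b for a, b in zip(dft(f), dft(g))]
    return [v.real / N for v in dft(H, sign=+1)]

f = [1.0, 2.0, 3.0, 4.0]
g = [1.0, 0.0, 0.0, 1.0]  # a "filter", zero-padded to the signal's length
assert all(abs(a - b) < 1e-9
           for a, b in zip(circular_conv_direct(f, g), circular_conv_spectral(f, g)))
```

Note both inputs must be padded to the same length before transforming, and that what you get is *circular* convolution, which is why the Bloom example later pads by the kernel size before rounding up.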
Our Lead Build System and Test Engineer, Arkadiusz, has a Vulkanised talk giving a recap of the Convolution Theorem and the usage of the FFT in Nabla:
<iframe width="560" height="315" src="https://www.youtube.com/embed/Ol_sHFVXvC0?si=qmAz8XLshpGIKFr0&start=271" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
This has two important implications: first, that after performing the FFT of a real signal, we only need to store half of the values, since the other half are redundant. The values we store for a sequence of length $N$, for even $N$, are those indexed $0$ through $\frac N 2$, where the latter is commonly known as the Nyquist frequency.
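This conjugate symmetry of a real signal's spectrum is easy to demonstrate (plain-Python sketch with a naive DFT, for illustration only):

```python
import cmath

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

# Spectrum of a real signal of even length N = 8.
x = [0.0, 1.0, 4.0, 9.0, 16.0, 9.0, 4.0, 1.0]
X = dft(x)
N = len(x)

# Hermitian symmetry: X[N-k] is the complex conjugate of X[k], so the
# bins 0 through N/2 (the last being Nyquist) determine the whole spectrum.
for k in range(1, N // 2):
    assert abs(X[N - k] - X[k].conjugate()) < 1e-9

# The DC (k=0) and Nyquist (k=N/2) bins are purely real for real input.
assert abs(X[0].imag) < 1e-9 and abs(X[N // 2].imag) < 1e-9
```

That's $\frac N 2 + 1$ stored bins, and since DC and Nyquist are both real they can even be packed together, which is what makes real-input FFTs roughly half the work and half the memory of complex ones.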
This allows us to keep a single copy of the spectrum resident in GPU memory…
What we're doing here is essentially zooming out in the spatial domain by resampling the spectrum. Once again, Arkadiusz's video gives a bit of insight into this as well.
<iframe width="560" height="315" src="https://www.youtube.com/embed/Ol_sHFVXvC0?si=dVlEwrkL2zm7s5Mi&start=572" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
Since we assume (and in our Bloom example, require) the kernel to have PoT-long sides (and square ones, but for this discussion it could also be rectangular), it turns out that `roundUpToPoT(imageDimensions + kernelDimensions)` is exactly an integer multiple of `kernelDimensions` (of course, it might be a different multiple per axis). Let's assume
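A quick sanity check of that divisibility claim (Python sketch; `round_up_to_pot` mirrors the `roundUpToPoT` named in the text but is my own stand-in): a power of two is always an integer multiple of any smaller power of two, and the rounded-up padded size is necessarily at least as large as the PoT kernel side.

```python
def round_up_to_pot(n: int) -> int:
    """Smallest power of two >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()

# If the kernel side is a power of two, any power of two at least as large
# divides evenly by it, so the padded FFT size is a multiple of the kernel.
kernel = 256  # PoT kernel side, as the Bloom example requires
for image in (720, 1080, 1920, 4096):
    padded = round_up_to_pot(image + kernel)
    assert padded >= kernel and padded % kernel == 0
    print(f'{image} -> {padded} = {padded // kernel} x {kernel}')
```

The multiple per axis is just `padded // kernel`, which is what can differ between the horizontal and vertical passes of a non-square image.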
Don't worry! Your triangle running at a mere 300 fps is perfectly normal. The purpose of this post is to try to convince you it is not a good use of your time to try to optimise hello-triangle.

- 300fps is still pretty fast! ~3.33ms
- FPS can be a misleading performance metric, as it changes non-linearly as you optimise your frame. A 10fps difference from 60 to 70fps is ~2.38ms, while the difference from 300 to 310fps is ~0.107ms. To actually profile your application it is much better to use dedicated tools like [Nsight Graphics](https://docs.nvidia.com/nsight-graphics/UserGuide/) or [Tracy](https://github.com/wolfpld/tracy).
- Modern GPUs are very complex, and performance **does not scale linearly with scene complexity**: for example, if one triangle runs at 300fps, this doesn't mean five triangles will run at 60fps. GPUs are designed to have really good throughput at the cost of latency.
- When rendering a single triangle, most of your frametime may just be **overhead**: your window manager, driver, or API state validation, to name a few.
- **hello-triangle** is simply not a representative workload for _real applications_, which are way more complex, with lots of factors affecting performance and a **compromise between speed and quality**. In order to properly judge the performance of your engine, you should at least use a test scene such as [Intel Sponza](https://www.intel.com/content/www/us/en/developer/topic-technology/graphics-research/samples.html) or
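The fps-to-milliseconds arithmetic above is worth doing once yourself; a tiny Python sketch:

```python
def frame_time_ms(fps: float) -> float:
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / fps

# The same +10fps gain means very different real savings depending on
# the baseline, because frame time is the reciprocal of frame rate.
low = frame_time_ms(60) - frame_time_ms(70)     # ~2.38 ms saved
high = frame_time_ms(300) - frame_time_ms(310)  # ~0.107 ms saved
print(f'60 -> 70 fps saves {low:.3f} ms; 300 -> 310 fps saves {high:.3f} ms')
```

This is why profilers report milliseconds: frame time is linear in the work you remove, while fps is not.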