Klafyvel

Building a proper archiving method for my things, episode 1

Sat, 27 Jul 2024 00:00:00 +0000

That's a topic I've been wanting to dive into for a long time. As I get older, I have accumulated a fair number of files, often salvaged from device to device, that are somewhat dear to me. In a way, I want to archive my things properly for the same reasons this blog post by my friend cookie convinced me to start journaling: later on, I want to be able to remember how things were for me.

Where do I start from?

I've got data scattered among multiple devices:

My laptop 'ewilan',
My two external hard drives, 'shae' and, uh, one that does not have a name except 'TOSHIBA',
My server 'klafyvel.me', which has various backups and hosts my emails,
My phone (an Android),
Some old broken Android phone with pictures I'd like to salvage someday,
An iPad I won at a hackathon that I do not use a lot, but still.

Most of the things are on ewilan and the two hard drives, and those are what I'll focus on first.

Merging the two drives

Because of the limited available disk space on my laptop, I've been chaotically moving stuff to those drives. In principle, things should be duplicated on these two... but I've been doing it by hand, and it's a mess. I've been doing some cleanup on the TOSHIBA hard drive, because I use it less than shae, and removed as many things as I could (movies I'd already watched, useless old projects...).

Next, I needed to actually decide what to do with the remaining files: copying them if they were not already on shae, ignoring them otherwise. I am sure there are plenty of smart Linux commands to do that, but I'm dumb and lazy, so I wrote a Julia script to do it the way I want. Essentially:

it copies a file if the corresponding file does not exist on the target,
it does not copy a file whose path already exists on the target and whose content is identical,
it copies to a renamed path the files whose path already exists on the target but whose content differs,
it outputs a CSV file that tells me what it did (or plans to do if running in dry mode),
it can use the output of a dry run to actually move the files, which means I can have a look at what it's going to do before breaking everything.
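Here is a minimal sketch of those rules, not the actual script. The function names `decide` and `merge_file`, the `.dup` suffix used for renamed copies, and the byte-for-byte comparison are all assumptions made for the illustration.

```julia
# Decide what to do with one file, given its path relative to the source root.
# Returns (decision, destination). Hypothetical helper, not the real script.
function decide(src_root, dst_root, rel)
    src = joinpath(src_root, rel)
    dst = joinpath(dst_root, rel)
    if !isfile(dst)
        return (:copy, dst)                      # nothing at that path yet
    elseif filesize(src) == filesize(dst) && read(src) == read(dst)
        return (:keep, dst)                      # same path, same content
    else
        base, ext = splitext(dst)
        return (:copy, base * ".dup" * ext)      # same path, different content
    end
end

# With dry=true (the default) nothing is written: the decision is only reported.
function merge_file(src_root, dst_root, rel; dry=true)
    decision, target = decide(src_root, dst_root, rel)
    if !dry && decision == :copy
        mkpath(dirname(target))
        cp(joinpath(src_root, rel), target)
    end
    return (decision, target)
end
```

Collecting the returned tuples for every file of a dry run gives exactly the kind of table that can be reviewed before anything is copied.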

Thanks to ProgressMeter.jl I also have nice outputs.

✓ Indexing /run/media/klafyvel/TOSHIBA EXT/Vidéos files...    Time: 0:00:01
◑ Processing files...    Time: 0:01:25
  decision:  keep
  file:      TPS/interviews/son/MONO-021.wav
  reason:    Same file at same location

The process is quite long, so grab a book if you're gonna use the script. For my video folder, it took 42 minutes. The nice thing about having a CSV file as output is that I can compute some statistics, and who doesn't like statistics? Here you can see that, indeed, I had a lot of duplicated stuff in my video folder:

 Row │ decision  count
     │ String7   Int64
─────┼─────────────────
   1 │ keep       2646
   2 │ copy          2
   3 │ ignore        1

I am defending my PhD thesis on the 24/01!

Tue, 09 Jan 2024 00:00:00 +0000

I will be defending my PhD thesis on the 24ᵗʰ of January, at ENS Paris-Saclay in amphitheater 1Z14.

The title is "Optical Spectroscopy of Graphene Quantum Dots and Halide Perovskite Nanocrystals", and the defense will be in English. This work has been directed by Emmanuelle Deleporte, co-directed by Fabien Bretenaker, co-advised by Damien Garrot, and in collaboration with Jean-Sébastien Lauret.

To help me organize the event, can you fill in this framadate if you want to join the "pot" after the defense?

On this page...

Abstract
How can I come?
Zoom link.
I have a question!

Abstract

This work focuses on the optical spectroscopy of two classes of materials using fluorescence microscopy at room temperature.

First, halide perovskites, a class of semiconductors that have seen a surge in interest in the last ten years because of their outstanding optoelectronic properties, making them a promising platform for photovoltaic applications, but also for light emission in diodes, lasers, and quantum devices. These crystalline materials consist of corner-sharing octahedra with a metallic ion at the center, often lead, and halide ions at the corners: Cl, Br, or I. A cation completes the structure. It is either organic, for example, methylammonium (MA) or formamidinium, or inorganic, for example, cesium. In the context of light emission, halide perovskites are an excellent choice to address the problem of the green gap, that is, the lack of efficient emitters in the green region of the optical spectrum, because their band gap can be tuned through an informed choice of the halide during the synthesis. Moreover, because the synthesis is done at room temperature and involves soft chemistry steps, they are promising for industrial applications. The synthesis and characterization of CsPbBr₃ nanocrystals emitting in the optical spectrum's green region using a new reprecipitation-based method are reported. In particular, the nanocrystals' high calibration and good stability are highlighted.

The second part of this study is about graphene quantum dots. Those low-dimensional objects allow the opening of the band gap of graphene, making them fluorescent. These emitters are promising because their atomically-thin structure and tunability make them suitable for realizing nano-sensors. Building on the recently studied structure-properties relationship of rod-shaped graphene quantum dots, a thorough single-molecule study of highly fluorescent graphene quantum dots with 96 \(sp^2\) carbon atoms is reported. The excellent purity of the samples was highlighted. The study of the time dynamics of those single-photon emitters in a polystyrene matrix allowed estimating the characteristic times of the transient dynamic of the quantum dots.

Finally, the third part reports the study of the graphene quantum dots on a perovskite surface. The surface of perovskites is of particular interest for the realization of devices with these semiconductors, making it an interesting playground to use graphene quantum dots. To that end, the quantum dots were deposited on a millimetric MAPbBr₃ single-crystal surface.

As thin films deposited on the perovskite, the graphene quantum dots present photophysics compatible with the formation of excimers.
As the concentration of quantum dots on the surface is lowered, diffraction-limited spots are observed. The time-domain study of the photoluminescence reveals jumps between discrete states of the system.
The frequency-domain investigation of the intensity of photoluminescence of these diffraction-limited emitters is dominated by 1/f noise, which contrasts sharply with the stable, shot-noise-dominated dynamics of the single emitters when studied in a polystyrene matrix.

How can I come?

To come to ENS, see here;
To find the amphitheater, see the plan. The amphitheater is on floor no. 1 (above the "mezzanine") in the north building.

Zoom link.

A Zoom link is available! Send me a message if you want it.

I have a question!

Send me an email!

I am defending my thesis on the 24/01!

Tue, 09 Jan 2024 00:00:00 +0000

On 24/01/2024, I will defend my thesis at ENS Paris-Saclay in amphitheater 1Z14.

The presentation is titled "Optical Spectroscopy of Graphene Quantum Dots and Halide Perovskite Nanocrystals" and will be given in English. This thesis work was carried out under the direction of Emmanuelle Deleporte, co-directed by Fabien Bretenaker, co-advised by Damien Garrot, and in collaboration with Jean-Sébastien Lauret.

To help me organize the defense, can you fill in this framadate if you would like to come to the thesis reception (the "pot de thèse")?

English version is here.

On this page
Thesis abstract
How do I get there?
Following the presentation remotely.
I have never attended a thesis defense, how does it work?
I have other questions!

Thesis abstract

This work focuses on the optical spectroscopy of two classes of materials using fluorescence microscopy at room temperature.

First, halide perovskites, a class of semiconductors that have seen renewed interest over the last ten years because of their outstanding optoelectronic properties, which make them a promising platform for photovoltaic applications, but also for light emission in diodes, lasers, and quantum devices. These crystalline materials consist of corner-sharing octahedra. A metallic ion, often lead, sits at the center, and halide ions sit at the corners: Cl, Br, or I. A cation completes the structure. It is either organic, for example methylammonium (MA) or formamidinium, or inorganic, for example cesium. In the context of light emission, halide perovskites are an excellent choice to address the green gap problem, that is, the lack of efficient emitters in the green region of the optical spectrum, because their band gap can be tuned through an informed choice of the halide during the synthesis. Moreover, because the synthesis is done at room temperature and involves simple chemistry steps, they are promising for industrial applications. The synthesis and characterization of CsPbBr₃ nanocrystals emitting in the green region of the optical spectrum using a new precipitation-based method are reported. In particular, the high calibration and good stability of the nanocrystals are highlighted.

The second part of this study is about graphene quantum dots. These low-dimensional objects allow the opening of the band gap of graphene, which makes them fluorescent. These emitters are promising because their atomically thin structure and their tunability make them suitable for realizing nano-sensors. Building on the recently studied structure-properties relationship of rectangular graphene quantum dots, a thorough single-object study of these highly fluorescent quantum dots with 96 \(sp^2\) carbon atoms is reported. The excellent purity of the samples was highlighted. The study of the time dynamics of these single-photon emitters in a polystyrene matrix allowed estimating the characteristic times of the transient dynamics of the quantum dots.

Finally, the third part reports the study of graphene quantum dots on a perovskite surface. The surface of perovskites is of particular interest for the realization of devices with these semiconductors, which makes it an interesting playground for the use of graphene quantum dots. To that end, the quantum dots were deposited on the surface of millimetric MAPbBr₃ single crystals.

As thin films deposited on the perovskite, the graphene quantum dots present photophysics compatible with the formation of excimers.
As the concentration of quantum dots on the surface is lowered, diffraction-limited spots are observed. The time-domain study of the photoluminescence reveals jumps between discrete states of the system.
The frequency-domain study of the photoluminescence intensity of these diffraction-limited emitters is dominated by 1/f noise, which contrasts sharply with the stable, shot-noise-dominated dynamics of the single emitters when they are studied in a polystyrene matrix.

How do I get there?

To get to ENS, on the Saclay plateau, see here;
To find the amphitheater inside the school, you can use the plan. The amphitheater is on the first floor (above the mezzanine floor) in the north building.

Following the presentation remotely.

A Zoom link is available! Send me a message if you would like to follow the presentation over Zoom.

I have never attended a thesis defense, how does it work?

A thesis defense is a university examination that grants the degree of Doctor. In France, barring exceptions, the defense is public.

Question

How does the defense unfold?

The morning should go like this:

9:30: Start of the defense. I present my thesis work for about 45 minutes,
10:30: Start of the jury's questions, beginning with the two reviewers (the "rapporteurs"). This part can last about 1h30 to 2h,
12:00~12:30: The jury withdraws to deliberate, which usually takes about twenty minutes,
After the deliberation: the president of the jury announces the result of the examination, and there is a short speech to thank the people involved in the defense,
And to finish, there will be a reception open to everyone!

Question

Who is on the jury?

The voting members of the jury include two reviewers, one examiner, and a jury president. The president is elected among the voting members of the jury who are not reviewers. The jury is completed by the non-voting members: my advisors and JS Lauret, who is an invited member.

Question

When can people enter or leave the room?

The defense must start at 9:30, so it is better to arrive a little early. ;) Part of the audience may leave after the presentation and before the questions, and come back after the deliberation if they wish.

I have other questions!

Send me an email/message/text/carrier pigeon, I will answer as best I can!

A nice approximation of the norm of a 2D vector.

Sun, 30 Oct 2022 00:00:00 +0000

While wandering on the internet, I stumbled upon Paul Hsieh's blog-post, where he demonstrates a way to approximate the norm of a vector without any call to the sqrt function. Let's see if I can reproduce the steps to derive this.

Table of contents
Setting-up the scene.
Finding a lower bound to the norm.
Finding an upper bound to the norm.
Choosing the best approximation for the norm.
Conclusion

Setting-up the scene.

Calculating the norm of a vector \((x,y)\), or a complex number \(x+iy\) means calculating \(\sqrt{x^2+y^2}\). Without loss of generality, we can set \(\sqrt{x^2+y^2}=1\). If we draw this, we get the following.

The (x, y) pairs with a euclidean norm of 1.

Finding a lower bound to the norm.

Now, the issue with the norm is that the \(\sqrt{}\) operation is expensive to compute. That's why we would like another way to approximate the norm. A first idea is to look at the other norms available. Indeed, what we have called "norm" so far is actually the 2-norm, also named the Euclidean norm. Let's have a look at two other norms: the infinity norm and the Manhattan norm.

The infinity norm is:

\[ \lVert(x,y)\rVert_\infty = \max(|x|,|y|) \]

The Manhattan norm is:

\[ \lVert(x,y)\rVert_1 = |x|+|y| \]

The (x, y) pairs with a euclidean norm of 1, an infinity norm of 1 or a Manhattan norm of 1.

Now we see that the infinity norm is indeed a lower bound for the 2-norm, even if it's rough: its unit curve lies outside the circle. The Manhattan norm, however, is too high: its unit curve lies inside the circle. But that is not an issue, we can simply scale it down so that it is never higher than the 2-norm. The scaling factor is chosen such that the rescaled curve is tangent to the circle. For that, we need it to be equal to \(\cos\frac{\pi}{4}=\frac{1}{\sqrt{2}}\).

We now have a nice lower bound of the Euclidean norm!

We have a lower bound! By choosing, between the yellow and green curves, the one closest to the circle, we get an octagon that is very close to the circle. We can define this lower bound of the norm with a function \(f\) such that:

\[ f(x,y) = \max\left(\max(|x|,|y|), \frac{1}{\sqrt{2}}(|x|+|y|)\right) \]

Note that this is different from Paul's article. You do need to take the maximum value of the two norms to select the points that are closest to the center. Generally speaking, for two norms, if one's value is higher than the other, then the former will be drawn closer to the origin when plotting the \(\text{norm}(x,y)=1\) curve.

To trace this function, note that the isolines of the scaled Manhattan norm and of the infinity norm cross when \(|y|=1\) and \(|x| = \sqrt{2}-1\), or when \(|x|=1\) and \(|y| = \sqrt{2}-1\).

The lower bound of the norm outlined.

Finding an upper bound to the norm.

The first idea you can get from the lower bound we found is to scale it up so that the octagon corners touch the circle.

To do so, we need to find the 2-norm of one of the corners and multiply \(f\) by it.

Let's take the one at \(x=1\), \(y=\sqrt{2}-1\). We have:

\[ \begin{align} \sqrt{x^2+y^2} &=& \sqrt{1 + \left(\sqrt{2}-1\right)^2}\\ &=& \sqrt{1 + 2 - 2\sqrt{2} + 1}\\ &=& \sqrt{4 - 2\sqrt{2}} \end{align} \]

Thus, the upper-bound for the 2-norm with the octagon method is \(\sqrt{4 - 2\sqrt{2}}f(x,y)\):

\[ f(x,y) \leq \sqrt{x^2+y^2} \leq \sqrt{4 - 2\sqrt{2}}f(x,y) \]

The upper and lower bounds of the norm outlined.
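Both inequalities can be checked numerically by sampling the unit circle, where the 2-norm is exactly 1. This is only a sanity check under the definitions above; the names `f`, `A`, and the sampling density are choices made for the example.

```julia
# Octagonal lower bound of the 2-norm, as defined above
f(x, y) = max(max(abs(x), abs(y)), (abs(x) + abs(y)) / sqrt(2))

# Scaling factor that turns the lower bound into an upper bound
A = sqrt(4 - 2 * sqrt(2))

# On the unit circle the 2-norm is 1, so we expect f ≤ 1 ≤ A * f
θs = range(0, 2π; length=1001)
lower_ok = all(f(cos(θ), sin(θ)) <= 1 + 1e-12 for θ in θs)
upper_ok = all(A * f(cos(θ), sin(θ)) >= 1 - 1e-12 for θ in θs)
```

The small tolerance only absorbs floating-point rounding at the points where the bounds are tight.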

Choosing the best approximation for the norm.

Now, we could stick to Paul Hsieh's choice of taking the middle between the lower and the upper bounds, and it will probably be fine. But come on, let's see if it is the best choice. 😉

Formally, the problem is to find a number \(a\in[0,1]\) such that \(g\), defined as follows, is as close as possible to the 2-norm.

\[ \begin{align} g(x,y,a) &=& (1-a)f(x,y)+a\sqrt{4 - 2\sqrt{2}}f(x,y)\\ &=& \left((1-a) + a\sqrt{4 - 2\sqrt{2}}\right)f(x,y) \end{align} \]

Let's plot this function for various values of \(a\). To make things easier, I will "unroll" the circle, and plot the norms against \(\theta\), the angle between our vector and the \(x\) axis.

Various possible approximations for the norm.

As expected, we can continuously vary our approximation between the upper and lower bounds. Notice that these functions are periodic and even. We can thus focus on the first half period to minimize the error. The first half period ends when the vector reaches the first octagon vertex, starting from the \(x\) axis and circling anti-clockwise.

Zooming in the part of the unit circle that is interesting for calculating the error.

To minimize the error with our approximation, we want to minimize the square error. That is:

\[ \begin{align} e(a) &=& \int_0^{\arctan\left(\sqrt{2}-1\right)}(g(x,y,a)-1)^2\text{d}\theta \end{align} \]

Thankfully, the expression of \(f(x,y)\), and thus of \(g(x,y,a)\), simplifies a lot on the given interval. You can see on the schematic above that on this interval we have \(f(x,y)=\max(|x|,|y|)=|x|=x=\cos\theta\). We can thus rewrite \(e(a)\) as follows.

\[ \begin{align} e(a) &=& \int_0^{\arctan\left(\sqrt{2}-1\right)}(g(x,y,a)-1)^2\text{d}\theta\\ &=& \int_0^{\arctan\left(\sqrt{2}-1\right)}\left(\left(1-a + a\sqrt{4-2\sqrt{2}}\right)\cos\theta-1\right)^2\text{d}\theta\\ &=& \int_0^{\arctan\left(\sqrt{2}-1\right)}\left(h(a)\cos\theta-1\right)^2\text{d}\theta \end{align} \]

Where \(h(a)=\left(1-a + a\sqrt{4-2\sqrt{2}}\right)\) and \(\arctan\left(\sqrt{2}-1\right)=\frac{\pi}{8}\).

Square error against θ.

Sum square error against a.

As we can see from these plots, there is a minimal error, and though 0.5 is a reasonable choice for \(a\), we can do slightly better around 0.3.

We can explicitly calculate \(e(a)\). Let \(A=\sqrt{4-2\sqrt{2}}\), so that \(h(a)=1+a(A-1)\). We have

\[ \begin{align} e(a) &=& \int_0^{\pi/8}(h(a)\cos\theta-1)^2\text{d}\theta\\ &=&h^2(a)\int_0^{\pi/8}\cos^2\theta\text{d}\theta-2h(a)\int_0^{\pi/8}\cos\theta\text{d}\theta + \frac{\pi}{8}\\ &=& h^2(a)B-2h(a)\sin\frac{\pi}{8} + \frac{\pi}{8} \end{align}\]

Where \(B=\frac{\pi}{16}+\frac{1}{4\sqrt2}\). Thus, we look for the position of the minimum, that is where \(e'(a)=0\).

\[ \begin{align} 0 &=& 2(A-1)\left(B(1+a(A-1))-\sin\frac{\pi}{8}\right)\\ 0 &=& B(1+a(A-1)) - \frac{A}{2\sqrt2}\\ a &=& \left(\frac{A}{2B\sqrt2}-1\right)\times\frac{1}{A-1}\\ a &\approx& 0.311 \end{align} \]

Not that far from 0.3!
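The closed-form value can be double-checked numerically. The sketch below evaluates the expression of \(e(a)\) derived above on a grid and compares the grid minimizer with the closed form; the grid step of 0.001 and the variable names are arbitrary choices for the example.

```julia
A = sqrt(4 - 2 * sqrt(2))            # scaling of the upper bound
B = π / 16 + 1 / (4 * sqrt(2))       # integral of cos²θ over [0, π/8]

h(a) = 1 + a * (A - 1)
e(a) = h(a)^2 * B - 2 * h(a) * sin(π / 8) + π / 8    # squared-error integral

a_closed = (A / (2 * B * sqrt(2)) - 1) / (A - 1)      # closed-form minimizer
grid = 0:0.001:1
a_grid = grid[argmin(e.(grid))]                       # brute-force minimizer
```

Both values agree to within the grid step, which is reassuring given how easy it is to drop a factor in the derivative.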

The maximum deviation from the result is then \(\max_\theta{|h(a)\cos\theta-1|}\). Looking for that maximum is like looking for the maximum of \(\left(h(a)\cos\theta-1\right)^2\). Long story short, the maxima can only occur on the boundaries of the allowed domain for \(\theta\), that is \(\theta=0\) or \(\theta=\pi/8\), meaning

\[ \max_\theta{|h(a)\cos\theta-1|} = \max\left(h(a)-1, \left|h(a)\frac{\sqrt{2+\sqrt{2}}}{2}-1\right|\right) \]

With our choice for \(a\), we get \(h(a)\approx 1.026\), so the maximum deviation is about 0.0525, reached at \(\theta=\pi/8\). That is, we have at most a 5.3% deviation from the 2-norm!

Our best approximation for the euclidean norm, with the calculated maximum errors.

Conclusion

That was a fun Sunday project! Originally this was intended to be included in a longer blog-post that is yet to be finished, but I figured it was interesting enough to have its own post. The take-home message being, you can approximate the Euclidean norm of a vector with:

\[ \begin{align} \text{norm}(x,y) &=& \frac{\sqrt{2-\sqrt{2}}}{\frac{\pi}{8}+\frac{1}{2\sqrt{2}}}\max\left(\max(|x|,|y|), \frac{1}{\sqrt{2}}(|x|+|y|)\right)\\ &\approx& 1.026\max\left(\max(|x|,|y|), \frac{1}{\sqrt{2}}(|x|+|y|)\right) \end{align} \]

You'll get at most a 5.3% error. This is a bit different from what's proposed on Paul Hsieh's blog-post. Unless I made a mistake, there might be a typo on his blog!
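As a minimal sketch, the formula translates to a couple of lines of Julia. The names `approx_norm`, `H`, and `INV_SQRT2` are illustrative; the two constants are computed once, so no square root is needed when the function is called.

```julia
# Constants evaluated once, ahead of time
const INV_SQRT2 = 1 / sqrt(2)
const H = sqrt(2 - sqrt(2)) / (π / 8 + 1 / (2 * sqrt(2)))   # ≈ 1.026

# Approximate Euclidean norm of (x, y) using only abs, max, + and *
approx_norm(x, y) = H * max(max(abs(x), abs(y)), INV_SQRT2 * (abs(x) + abs(y)))

# Largest relative error over sampled directions of the unit circle
max_rel_error = maximum(abs(approx_norm(cos(θ), sin(θ)) - 1)
                        for θ in range(0, 2π; length=10_001))
```

Sampling the unit circle is enough to estimate the relative error because the approximation, like the norm itself, scales linearly with the length of the vector.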

If you are interested in playing with the code used to generate the figures in this article, have a look at the companion notebook!

As always, if you have any questions, or want to add something to this post, you can leave me a comment or ping me on Twitter or Mastodon.

How I over-engineered a Fast Fourier Transform for Arduino.

Sat, 15 Oct 2022 00:00:00 +0000

Everything began with me wanting to implement the Fast Fourier Transform (FFT) on my Arduino Uno for a side project. The first thing you do in such a case is ask your favorite search engine for existing solutions. If you google "arduino FFT", one of the first results will be related to this instructable: ApproxFFT: The Fastest FFT Function for Arduino. As you can imagine, this could only pique my interest: there was an existing solution to my problem, and the title suggested that it was the fastest available! And thus, on April 18ᵗʰ 2021,^[1] I started a journey that would lead me to write my own tutorial on implementing the FFT in Julia, learn AVR Assembly, and write a blog post about it, about a year and a half later.

There is a companion GitHub repository where you can retrieve all the codes presented in this article.

Information

This is the long version of the story. If you are only interested in nice plots showing the speed and the accuracy of my proposed solution, please head to the dedicated instructable : Faster than the Fastest FFT for Arduino !

[1]	Yes, I went through my Firefox history database to find this date.

Table of contents
Why reinvent the wheel?
1. Because I did not know how to implement the FFT.
2. Because I thought it was possible to do better.
  1. In-place or out-of-place algorithm?
3. Trigonometry can be blazingly fast. 🚀🚀🚀 🔥🔥
Interlude: some tooling for debugging.
1. Using arduino-cli to upload your code.
2. Don't bother with communication protocols over Serial.
Fast, accurate FFT, and other floating-point trickeries.
1. A first dummy implementation of the FFT.
2. Forbidden occult arts are fun. 😈
3. Approximate floating-point FFT.
How fixed-point arithmetic came to the rescue.
1. Fixed-point multiplication.
2. Controlled result growth.
3. Trigonometry is demanding.
4. Saturating additions. (a.k.a. "Trigonometry is demanding" returns.)
5. Calculating modules with a chainsaw.
6. 16 bits fixed-point FFT.
7. 8 bits fixed-point FFT.
8. Implementing fixed-point FFT for longer inputs
Benchmarking all these solutions.
Closing thoughts.

Why reinvent the wheel?

As I said in the introduction, I explicitly researched an implementation of the FFT because I did not want to implement my own. So what changed my mind ?

Because I did not know how to implement the FFT.

Let's start with the obvious: abhilash_patel's instructable is a Great instructable. It is part of a series of instructables on implementing the FFT on Arduino, and this is his fastest accurate implementation. The instructable does a great job at explaining the big ideas behind it, with not only appropriate, but also good-looking illustrations. That is why I decided to read his code, to be certain of my good understanding of it.

And that is the exact moment I entered an infinite spiral. Not because the code was bad, even though it could use some indenting, but because I did not understand how it achieves its purpose. To my own disappointment, I realized that maybe I did not know how to implement an FFT. Sure, I had my share of lectures on the Fourier Transform, and on the Fast Fourier Transform, but the lecturers only showed us how the FFT was an algorithm with a very nice complexity through its recursive definition. But what I was looking at did not even remotely look like what I expected to see.

So I did what seemed the most sensible thing to me at the time: I spent nights reading Wikipedia pages and obscure articles on 2000s-looking websites to understand how the FFT was actually implemented.

About one month later, on May 23ʳᵈ, I started writing a tutorial on zestedesavoir.com : "Jouons à implémenter une transformée de Fourier rapide !", a sloppy translation of which is also available on my blog. My goal here was to write down what I had learned throughout the month, and it helped me clarify the math behind the implementation. Today, I use it as a reference when I have doubts on the implementation.

With this newly acquired knowledge on FFT implementations, I was ready to have another look at @abhilash_patel's code.

Because I thought it was possible to do better.

As I said, I was now capable of understanding the code provided by @abhilash_patel. And there I found two low-hanging fruits:

The program was weirdly mixing in-place and out-of-place algorithms,
The trigonometry computation was inefficient.

Let me state more clearly what I mean here.

In-place or out-of-place algorithm?

The FFT can either be implemented in-place or out-of-place. Implementing out-of-place of course allows you to keep the input data unchanged by the computation. However, the in-place algorithm offers several key advantages, the first, obvious, one being that it only requires the amount of space needed to store the input array.

This might not be obvious, but it also works for real-valued signals. Indeed, one might think that if you have an array of, say, float representing such a signal, its FFT would require twice the amount of space since the Fourier transform is complex-valued. The trick here is to use a key property of the Fourier transform: for a real-valued signal, knowing the positive-frequency part of the transform is enough, because the negative-frequency part is its complex conjugate. You can see the full explanation in my blog post on implementing the FFT in Julia.

This would help me get an FFT implementation that can run on more than 256 data points on my Arduino Uno, which the original instructable implementation cannot.^[2]

[2] Even though the code used for the benchmark cannot. This is not due to a memory size issue, but to the variable types I used for my buffers (uint8_t). I think you can understand this would be easily fixed to run the FFT on bigger samples, and since I was especially interested in benchmarks in time, I allowed myself that.
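The property of real-valued signals used above can be observed with a naive discrete Fourier transform. This is only an illustration of the symmetry, written as a slow \(O(N^2)\) sum rather than an FFT; the function name `dft` and the test signal are made up for the example.

```julia
# Naive O(N²) DFT, only meant to illustrate the symmetry (not an FFT)
function dft(x)
    N = length(x)
    return [sum(x[n + 1] * cis(-2π * k * n / N) for n in 0:N-1) for k in 0:N-1]
end

x = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]   # arbitrary real-valued signal
X = dft(x)
N = length(x)

# For a real input, X[k] and X[N-k] are complex conjugates
# (the +1 accounts for Julia's 1-based indexing)
symmetric = all(isapprox(X[k + 1], conj(X[N - k + 1]); atol=1e-12) for k in 1:N-1)
```

Since the upper half of the spectrum carries no new information, the N real inputs can be replaced by N real numbers describing the lower half, which is what makes an in-place transform possible.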

Trigonometry can be blazingly fast. 🚀🚀🚀 🔥🔥

I believe this is where the biggest improvement in benchmark time originates from. Step 2 of the original instructable details how to use a kind of look-up table to compute the trigonometric functions very quickly. This is an efficient method if you have to implement a fast cosine or a fast sine function. However, using such a method for the FFT means forgetting a very interesting property of the algorithm: the angles for which trigonometry calculations are required do not appear at random at all. In fact, for each recursion step of the algorithm, they increase by a constant amount, and always start from the same angle: 0.

This arithmetic progression of the angle allows using a simple, yet efficient formula for calculating the next sine and cosine:

\[\begin{aligned}\cos(\theta + \delta) &= \cos\theta - [\alpha \cos\theta + \beta\sin\theta]\\\sin(\theta + \delta) &= \sin\theta - [\alpha\sin\theta - \beta\cos\theta]\end{aligned}\]

With \(\alpha = 2\sin^2\left(\frac{\delta}{2}\right),\;\beta=\sin\delta\).

I have included the derivation of these formulas in the relevant section of my tutorial.
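As a rough sketch of the recurrence in action, the function below generates \(\cos(k\delta)\) and \(\sin(k\delta)\) for successive values of \(k\). The name `twiddles` and the use of Float64 are choices made for this example; the trigonometric functions are only called twice, to compute \(\alpha\) and \(\beta\).

```julia
# Generate cos(kδ) and sin(kδ) for k = 0, ..., n-1 with the recurrence above
function twiddles(δ, n)
    α = 2 * sin(δ / 2)^2
    β = sin(δ)
    c, s = 1.0, 0.0                    # cos(0) and sin(0)
    cosines = Vector{Float64}(undef, n)
    sines = Vector{Float64}(undef, n)
    for k in 1:n
        cosines[k], sines[k] = c, s
        # Both right-hand sides use the old values of c and s
        c, s = c - (α * c + β * s), s - (α * s - β * c)
    end
    return cosines, sines
end
```

Each step costs four multiplications and four additions or subtractions, and the rounding error only grows slowly with the number of steps.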

As I said, this is most likely the biggest source of improvement in execution time, as trigonometry computation-time instantaneously becomes negligible using this trick.

Interlude: some tooling for debugging.

I am a big fan of the Julia programming language. It is my main programming tool at work, and I also use it for my hobbies. However, I believe the tips given in this section are easily transferable to other programming languages.

The main idea here is that when you start working with arrays of data, good old Serial.println is not usable anymore. Because you cannot simply evaluate the correctness of your results at a simple glance, you want to use higher level tools, such as statistical analysis or plotting libraries. And since you are also likely to want to upload your code to the Arduino often, it is convenient to be able to upload it programmatically.

This machinery allows testing all the different implementations in a reproducible way. All the examples given in this article are calculated on the following input signal.

Input signal used in the tests below.

Using `arduino-cli` to upload your code.

At the time I started this project, the new Arduino IDE wasn't available yet. If you have ever used the 1.x versions of the IDE, then you know why one would like to avoid the old IDE. Thankfully, there is a command-line utility that allows uploading code from your terminal: arduino-cli. If you take a look at the GitHub repository, you'll notice a Julia script whose purpose is to upload code to the Arduino and retrieve the results of computations and benchmarks. The upload part is simply a system call to arduino-cli.

function upload_code(directory)
    # `workdir` and `portname` are assumed to be globals defined elsewhere in the script
    build = joinpath(workdir, directory, "build")
    ino = joinpath(workdir, directory, directory * ".ino")
    # Compile for the Arduino Uno and upload (-u) through the given port
    build_command = `arduino-cli compile -b arduino:avr:uno -p $portname --build-path "$build" -u -v "$ino"`
    run(pipeline(build_command, stdout="log_arduino-cli.txt", stderr="log_arduino-cli.txt"))
end

Don't bother with communication protocols over Serial.

At first, I was tempted to use some fancy communication protocols for the serial link. This is not useful in our case, because you can simply reset the Arduino programmatically to ensure the synchronization of the computer and the development board, and then exchange raw binary data.

Resetting is done using the DTR pin of the port. In Julia, you can do it as follows using the LibSerialPort.jl library:

function reset_arduino()
    LibSerialPort.open(portname, baudrate) do sp
        @info "Resetting Arduino"
        # Reset the Arduino
        set_flow_control(sp, dtr=SP_DTR_ON)
        sleep(0.1)
        set_flow_control(sp, dtr=SP_DTR_OFF)
        sp_flush(sp, SP_BUF_INPUT)
        sp_flush(sp, SP_BUF_OUTPUT)
    end
end

Because your computer can now reset the Arduino at will, you can easily ensure the synchronization of your board. That means the benchmark script knows when to read data from the Arduino.

Then, the Arduino would send data to the computer like this:

Serial.write((byte*)data, sizeof(fixed_t)*N);

This way, the array data is sent directly through the serial link as a stream of raw bytes. We don't bother with any form of encoding.

On the computer side, you can easily read the incoming data:

data = zeros(retrieve_datatype, n_read)
read!(sp, data)

Where sp is an object created by LibSerialPort.jl when opening a port.
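To see why no encoding is needed, here is a small self-contained sketch where an in-memory buffer stands in for the serial port. It assumes `fixed_t` is a 16-bit signed integer and that both ends are little-endian (which is the case for the AVR and for x86 machines); the variable names are made up for the example.

```julia
# Values as they would sit in the Arduino's memory (assuming a 16-bit fixed_t)
sent = Int16[1, -2, 300, -32768]

# What travels on the wire: the same memory seen as raw bytes
wire = collect(reinterpret(UInt8, sent))

# An IOBuffer stands in for the serial port object `sp`
io = IOBuffer(wire)
received = zeros(Int16, length(sent))
read!(io, received)                  # fills the array directly from the raw bytes
```

`read!` simply copies the incoming bytes into the array's memory, so the element type passed to `zeros` has to match the type used on the Arduino side.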

You can then happily analyze your data, it's DataFrames.jl and Makie.jl time !

Fast, accurate FFT, and other floating-point trickeries.

My first approach was to reuse as much as I could of the code I wrote for my FFT tutorial in Julia. That's why I started working with floating-point arithmetic. This was also convenient because it kept away some issues, like overflowing numbers, that I had to address once I started working with fixed-point arithmetic.

A first dummy implementation of the FFT.

As I said, my first implementation was a simple, stupid translation of one of the codes presented in my Julia tutorial. I did not even bother with writing optimized trigonometry functions, I just wanted something that worked as a basis for other implementations. The code is fairly simple and can be viewed here.

As expected, this gives almost error-free results.

Modulus of approximate floating-point FFT on Arduino. Comparison with reference implementation.

Forbidden occult arts are fun. 😈

Now let's move on to more interesting stuff. The first obvious improvement you can make on the base implementation is fast trigonometry, and that's what yields the biggest improvement in terms of speed. Then, I decided to mess around with IEEE-754 to write my own approximate routines for float multiplication, halving and modulus calculation. The idea is always the same: treat the IEEE-754 representation of a floating-point number as its logarithm. This does give weird-looking implementations though. I have written several posts on Zeste-de-Savoir explaining how all these work. They are in French, but I trust you can make DeepL run!

"Approximer rapidement le carré d'un nombre flottant" explains how to square a number using its floating-point representation.
"IEEE 754: Quand votre code prend la float" explains how the IEEE-754 representation of a number resembles its logarithm.
"Multiplications avec Arduino: jetons-nous à la float" explains how the approximate multiplication of two floating-point numbers can be efficiently calculated.

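To give a flavour of the "representation as logarithm" idea without reading the French posts, here is a minimal C sketch of an approximate multiplication. It is not the routine from the repository: the name approx_mul is made up, and the sketch assumes 32-bit IEEE-754 floats, normal non-zero inputs, and no exponent overflow or underflow.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Approximate a * b by adding the bit patterns of |a| and |b|, which
   amounts to adding their "logarithms". Sketch only: assumes normal,
   non-zero floats and a result that stays within the float range. */
float approx_mul(float a, float b) {
    assert(a != 0.0f && b != 0.0f);
    uint32_t ia, ib;
    memcpy(&ia, &a, sizeof ia);  /* reinterpret the bits without aliasing issues */
    memcpy(&ib, &b, sizeof ib);

    uint32_t sign = (ia ^ ib) & 0x80000000u;  /* sign of the product */
    /* Add exponent and mantissa fields, then remove one exponent bias (127 << 23). */
    uint32_t mag = (ia & 0x7fffffffu) + (ib & 0x7fffffffu) - 0x3f800000u;

    uint32_t ir = sign | (mag & 0x7fffffffu);
    float r;
    memcpy(&r, &ir, sizeof r);
    return r;
}
```

Powers of two come out exact (2 × 4 gives 8), while 1.5 × 1.5 gives 2 instead of 2.25, which is about the worst relative error this trick produces (roughly 11%).
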
Approximate floating-point FFT.

Without further delay, here is a sneak preview of the result I got with the approximate floating-point FFT. For a full benchmark, you will have to wait for the end of this article! The code is available here.

Modulus of approximate floating-point FFT on Arduino. Comparison with reference implementation.

How fixed-point arithmetic came to the rescue.

Rather than endlessly optimizing the floating-point implementation, I decided to change my approach. The main motivation: floats are actually overkill for our purpose. Indeed, they can represent numbers with a good relative precision over enormous ranges. However, when calculating FFTs, the range the output variables may cover does vary, but not that much. And most importantly, it varies predictably. This means a fixed-point representation can be used. Also, because of their amazing properties, floats actually take a lot of space in the limited RAM available on a microcontroller. And finally, I want to be able to run FFTs on signals read from the Arduino's ADC. If my program can deal with int-like data types, then it'll spare me the trouble of converting from integers to floating-point numbers.
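As an illustration of how little conversion is needed, here is a sketch that turns a raw 10-bit reading, such as the 0..1023 values returned by analogRead on an Uno, into a 16-bit fixed-point sample. The function name adc_to_q15 and the choice of 512 as the zero level are assumptions made for the example, not code from the repository.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map a 10-bit ADC reading (0..1023) to a signed
   16-bit fixed-point sample in [-1, 1), taking 512 as the zero level. */
int16_t adc_to_q15(uint16_t raw) {
    assert(raw <= 1023);
    int16_t centered = (int16_t)raw - 512;  /* now in -512..511 */
    return (int16_t)(centered * 64);        /* scale 10 bits up to 16 bits */
}
```

Only an integer subtraction and a multiplication by a power of two are involved, so no floating-point code is pulled in.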

Fixed-point multiplication.

I first played with the idea of implementing a fixed-point FFT because I realized the AVR instruction set gives us the fmul instruction, dedicated to multiplying fixed-point numbers. This means we can use it to have a speed-efficient implementation of the multiplication, that should even beat the custom float one.

I wrote a blog-post on Zeste-de-Savoir (in French) on implementing the fixed-point multiplication. It is based on the proposed implementation in the AVR instruction set manual.

/* Signed fractional multiply of two 16-bit numbers with 32-bit result. */
fixed_t fixed_mul(fixed_t a, fixed_t b) {
  fixed_t result;
  asm (
      // We need a register that's always zero
      "clr r2" "\n\t"
      "fmuls %B[a],%B[b]" "\n\t" // Multiply the MSBs
      "movw %A[result],__tmp_reg__" "\n\t" // Save the result
      "fmul %A[a],%A[b]" "\n\t" // Multiply the LSBs
      "adc %A[result],r2" "\n\t" // Do not forget the carry
      "movw r18,__tmp_reg__" "\n\t" // The result of the LSBs multiplication is stored in temporary registers
      "fmulsu %B[a],%A[b]" "\n\t" // First crossed product
                                  // This will be reported onto the MSBs of the temporary registers and the LSBs
                                  // of the result registers. So the carry goes to the result's MSB.
      "sbc %B[result],r2" "\n\t"
      // Now we sum the cross product
      "add r19,__tmp_reg__" "\n\t"
      "adc %A[result],__zero_reg__" "\n\t"
      "adc %B[result],r2" "\n\t"
      "fmulsu %B[b],%A[a]" "\n\t" // Second cross product, same as first.
      "sbc %B[result],r2" "\n\t"
      "add r19,__tmp_reg__" "\n\t"
      "adc %A[result],__zero_reg__" "\n\t"
      "adc %B[result],r2" "\n\t"
      "clr __zero_reg__" "\n\t"
      :
      [result]"+r"(result):
      [a]"a"(a),[b]"a"(b):
      "r2","r18","r19"
  );
  return result;
}

Obviously, you can also create the same function for 8-bits fixed-point arithmetic.

fixed8_t fixed_mul_8_8(fixed8_t a, fixed8_t b) {
  fixed8_t result;
  asm (
    "fmuls %[a],%[b]" "\n\t"
    "mov %[result],__zero_reg__" "\n\t"
    "clr __zero_reg__" "\n\t"
    :
    [result]"+r"(result):
    [a]"a"(a),[b]"a"(b)
  );
  return result;
}

As you can see, this requires writing some assembly code because the fmul instruction is not directly accessible from C. However, even though it is fairly simple, this limits the implementation to AVR platforms. You might still get reasonably efficient code by implementing everything in pure C, which would also extend the implementation to other platforms.
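To give an idea of what the pure C route could look like, here is a minimal sketch of a 16-bit fractional multiplication. It assumes fixed_t is an int16_t with 15 fractional bits and relies on >> being an arithmetic shift for negative values (true for gcc and avr-gcc, but implementation-defined in C). The name fixed_mul_c is made up for this example, and no claim is made about its speed compared with the assembly version.

```c
#include <stdint.h>

typedef int16_t fixed_t;  /* assumption: 1 sign bit, 15 fractional bits */

/* Portable fractional multiply: widen, multiply, rescale, saturate. */
fixed_t fixed_mul_c(fixed_t a, fixed_t b) {
    int32_t product = (int32_t)a * (int32_t)b;  /* 30 fractional bits */
    int32_t scaled = product >> 15;             /* back to 15 fractional bits */
    if (scaled > INT16_MAX) {
        scaled = INT16_MAX;                     /* only happens for (-1) * (-1) */
    }
    return (fixed_t)scaled;
}
```

With this format 16384 stands for 0.5, so fixed_mul_c(16384, 16384) returns 8192, that is 0.25.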

Controlled result growth.

As I said before, the FFT grows predictably. First, we can see that the final Fourier transform is bounded. Recall that the FFT is an algorithm to compute the Discrete Fourier Transform (DFT), which is written:

\[\begin{aligned} X[k] &= \sum_{n=0}^{N-1}x[n]e^{-2i\pi nk/N} \end{aligned}\]

Where \(X\) is the discrete Fourier transform of the input signal \(x\) of size \(N\). From that we have:

\[\begin{aligned} |X[k]| &\leq \left|\sum_{n=0}^{N-1}x[n]e^{-2i\pi nk/N}\right|\\ &\leq \sum_{n=0}^{N-1}\left|x[n]e^{-2i\pi nk/N}\right| \\ &\leq \sum_{n=0}^{N-1}\left|x[n]\right|\\ &\leq N\times\max_n|x[n]| \end{aligned}\]

In our case, because we use the Q0f7 fixed point format, the input signal \(x\) is in the range \([-1,1]\). That means the components of the DFT are within range \([-N,N]\). Note that these bounds are attained for some signals, e.g. a constant input.

With that, we know how to scale the result of the FFT so that it can be stored. But what about the intermediate steps? How do we ensure that the intermediate values stay within range? You may recall this kind of "butterfly" diagram from the blog post explaining the FFT:

Butterfly diagram of an FFT on an 8-point input signal. Each column represents a step in the algorithm, and each row is a cell of the array. The various polygons identify cells that are part of the same subdivision of the array, and the arrows show how we combine them to go to the next step of the algorithm.

This diagram also shows you that each step of the algorithm actually performs some FFTs on input signals of smaller sizes. That means our bounding rule applies for intermediary signals, given that we plug the right size of input signal in the formula! Notice how at each step, corresponding sub-FFTs have a size of \(2^{i}\), where \(i\) is the number of the step, starting at 0. That basically means that if we scale down the signal between each step by dividing it by a factor of two, we will keep the signal bounded in \([-1,1]\) at each step!
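To spell out why halving is enough, here is the bound for one combination step, written as a sketch with notation that is not in the original post: \(E[k]\) and \(O[k]\) stand for the two sub-FFT outputs being combined, \(M\) for the size of the combined block, and both inputs are assumed to satisfy \(|E[k]|\leq 1\) and \(|O[k]|\leq 1\).

\[\begin{aligned} \left|\frac{E[k] + e^{-2i\pi k/M}O[k]}{2}\right| &\leq \frac{|E[k]| + \left|e^{-2i\pi k/M}\right||O[k]|}{2}\\ &\leq \frac{1 + 1}{2} = 1 \end{aligned}\]

The same bound holds with a minus sign in front of the second term, which covers the other output of the butterfly.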

Note that this does not mean we get the optimal scale for every input signal. For example, signals which are poorly periodic would have a lot of low-modulus Fourier coefficients, and would not fully take advantage of the scale offered by our representation. I did some tests scaling the array only when it was needed, and did not notice many changes in terms of execution times, so that's something you might want to explore if your project requires it.

Trigonometry is demanding.

If all you have is a hammer, everything looks like a nail.

~ Abraham Maslow

Once I had fixed-point arithmetic working, I started wanting to use it everywhere. But I quickly encountered an issue: trigonometry stopped working.

The reason is simple: 8-bit precision is not enough for trigonometry calculations when we approach small angles. The key point here is that the precision needed for fixed-point calculation of trigonometric functions depends on the size of the input array. Recall from section Trigonometry can be blazingly fast. 🚀🚀🚀 🔥🔥 that we need to precompute values for \(\alpha\) and \(\beta\), where

\[\alpha = 2\sin^2\left(\frac{\delta}{2}\right),\quad\beta=\sin\delta\]

And \(\delta\) is the angle increment by which we want to increase the angle of the complex number we are summing with in the FFT. This angle depends on \(N\), the total length of the input array, and is equal to \(\frac{2\pi}{N}\). That means we need to be able to represent at least \(2\sin^2\frac{\pi}{N}\) for trigonometry to work. For \(N=256\), this is approximately equal to \(0.000301\). Unfortunately, the smallest positive number one can represent using the Q0f7 fixed-point representation, that is with 7 bits in the fractional part, is \(2^{-7}=0.0078125\). That is why even for the 8-bit fixed-point FFT, trigonometry calculations are performed using 16-bit fixed-point arithmetic.
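As a reminder of where \(\alpha\) and \(\beta\) are used, here is a sketch of the angle-increment recurrence described in Numerical Recipes, which I assume is the one meant by the section recalled above. It is written with doubles for readability, whereas the Arduino code uses fixed-point numbers, and the function name rotate_step is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Advance (c, s) = (cos t, sin t) to (cos(t + d), sin(t + d)) using the
   precomputed alpha = 2 sin^2(d / 2) and beta = sin(d). No trigonometric
   function is called inside the loop that uses this step. */
void rotate_step(double *c, double *s, double alpha, double beta) {
    assert(c != NULL && s != NULL);
    double c0 = *c;
    double s0 = *s;
    *c = c0 - (alpha * c0 + beta * s0);
    *s = s0 - (alpha * s0 - beta * c0);
}
```

Each step subtracts a correction of the order of \(\alpha\) and \(\beta\) from the current values, which is why both constants must be representable in the chosen format.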

This limit on trigonometry also explains why the code presented here is not usable "as is" for very long arrays. Indeed, while 512-element arrays could be handled using 16-bit trigonometry, the theoretical limit for an Arduino Uno would be 1024-element arrays (because RAM is 2048 bytes, and we need some space for temporary variables), and that would require 32-bit trigonometry, which I did not implement.
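As a quick sanity check of these limits, the smallest value of \(\alpha\) can be compared with the resolution of the 16-bit format. This assumes that format keeps 15 fractional bits (by analogy with Q0f7), so its resolution is \(2^{-15}\approx3.05\times10^{-5}\); the figures are rounded.

\[\begin{aligned} N=256:&\quad 2\sin^2\frac{\pi}{256}\approx 3.0\times10^{-4} > 2^{-15}\\ N=512:&\quad 2\sin^2\frac{\pi}{512}\approx 7.5\times10^{-5} > 2^{-15}\\ N=1024:&\quad 2\sin^2\frac{\pi}{1024}\approx 1.9\times10^{-5} < 2^{-15} \end{aligned}\]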

Saturating additions. (a.k.a. "Trigonometry is demanding" returns.)

One other issue with trigonometry I did not see coming is its sensitivity to overflow. Since there is basically no protection against it, overflowing a fixed-point representation of a number flips the sign. In the case of trigonometry this is especially annoying, because that means we add a \(\pi\) phase error for even the slightest error when values are close to one. And trust me, it took me some time to understand where the error was coming from.

To mitigate this, I had to implement my own addition, that saturates to one instead of flipping the sign when overflow happens. The trick here is to use the status register (SREG) of the microcontroller to detect overflow. Again this requires doing the addition in assembly, as the check needs to happen right after the addition was performed, and there is no way to tell what the compiler might do between the addition and the actual check.

Checking overflow is done using the brvc instruction (Branch if Overflow Cleared), and the function for 16-bits saturating addition goes like this:

/* Fixed point addition with saturation to ±1. */
fixed_t fixed_add_saturate(fixed_t a, fixed_t b) {
  fixed_t result;
  asm (
      "movw %A[result], %A[a]" "\n\t"
      "add %A[result],%A[b]" "\n\t" 
      "adc %B[result],%B[b]" "\n\t" 
      "brvc fixed_add_saturate_goodbye" "\n\t"
      "subi %B[result], 0" "\n\t"
      "brmi fixed_add_saturate_plus_one" "\n\t"
      "fixed_add_saturate_minus_one:" "\n\t" 
      "ldi %B[result],0x80" "\n\t"
      "ldi %A[result],0x00" "\n\t"
      "jmp fixed_add_saturate_goodbye" "\n\t"
      "fixed_add_saturate_plus_one:" "\n\t"
      "ldi %B[result],0x7f" "\n\t"
      "ldi %A[result],0xff" "\n\t"
      "fixed_add_saturate_goodbye:" "\n\t"
      :
      [result]"+d"(result):
      [a]"r"(a),[b]"r"(b)
  );
  return result;
}

One might be tempted to use this routine for every single addition performed in the program. This is actually useless, since additions in the actual FFT algorithm will not overflow thanks to scaling, if they are done in a sensible order (check the code if you want to see how!).
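For comparison, the behaviour of the saturating addition can be sketched in portable C by widening to 32 bits before clamping. This is only an illustration of what the assembly routine computes, assuming fixed_t is an int16_t; the name fixed_add_saturate_c is made up, and the widening version is not claimed to be as fast as the status-register trick.

```c
#include <stdint.h>

typedef int16_t fixed_t;  /* assumption: 16-bit signed fixed-point value */

/* Add two 16-bit values and clamp the result instead of wrapping around. */
fixed_t fixed_add_saturate_c(fixed_t a, fixed_t b) {
    int32_t sum = (int32_t)a + (int32_t)b;  /* cannot overflow in 32 bits */
    if (sum > INT16_MAX) {
        return INT16_MAX;                   /* 0x7fff, just below +1 */
    }
    if (sum < INT16_MIN) {
        return INT16_MIN;                   /* 0x8000, exactly -1 */
    }
    return (fixed_t)sum;
}
```

The two clamp values are the same constants the assembly version loads with ldi.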

Calculating moduli with a chainsaw.

After a lot of wandering on the Internets, I ended up using Paul Hsieh's technique for computing approximate moduli of vectors. However, while writing this article I discovered some mistakes and things that could be improved in his article, so I ended up writing a dedicated article on this, showing how you can minimize the mean square error, and get at most a 5.3% error.

The main idea is that you can approximate the unit circle using a set of well-chosen octagons. That reminds me of what a rough cylinder carved using a chainsaw might look like, hence the name of this section.
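To illustrate the octagon idea, here is a sketch of the classic "alpha max plus beta min" estimate, whose level curves are octagons. The coefficients below are the commonly quoted ones that keep the error under about 4%; they are not the ones derived in my dedicated article, and the function name approx_modulus is made up for the example.

```c
/* Estimate sqrt(re^2 + im^2) as alpha * max(|re|, |im|) + beta * min(|re|, |im|).
   Coefficients: a commonly quoted pair, used here as an assumption. */
double approx_modulus(double re, double im) {
    double a = re < 0 ? -re : re;
    double b = im < 0 ? -im : im;
    double hi = a > b ? a : b;
    double lo = a > b ? b : a;
    return 0.96043387 * hi + 0.39782473 * lo;
}
```

No square root and no squaring are needed, only two multiplications, one addition and a few comparisons.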

One of the figures of the article on approximating the norm. Look at how this looks like something carved using a chainsaw!

16 bits fixed-point FFT.

Enough small talk, time for some action! You can find here the code for the 16-bit fixed-point FFT. The benchmark is available at the end of this article, but in the meantime here is the error comparison against the reference implementation.

Calculated modulus of the Fourier transform of the input signal using 16-bit fixed-point arithmetic for various input signal lengths. Comparison with reference implementation.

8 bits fixed-point FFT.

And now the fastest FFT on Arduino that I implemented, the 8-bit fixed-point FFT! As for previous implementations, you can find the code here. Below is a comparison of the calculated modulus of the FFT against a reference implementation.

Calculated modulus of the Fourier transform of the input signal using 8-bit fixed-point arithmetic for various input signal lengths. Comparison with reference implementation.

Implementing fixed-point FFT for longer inputs

The Arduino Uno has 2048 bytes of RAM. But because this implementation of the FFT needs an input array whose length is a power of two, and because you need some space for variables,^[3] the limit would be a 1024 bytes long FFT. But the code presented here would have to be modified a bit (not that much). From where I am standing I see two major issues:

As discussed previously, trigonometry would need 32-bits arithmetic. That means you would need to implement the multiplication and saturating addition for those numbers.
The buffers are single bytes right now, so you would need to upgrade them to 16-bits buffers.

Once those two issues, and the inevitable hundreds of other issues I did not think of are addressed, I don't see why one could not perform FFT on 1024 bytes-long input arrays.

[3]	Although I am sure a very determined person would be able to fit all the temporary variables in registers and calculate a 2048 bytes-long FFT. Do it, I'm rooting for you, you beautiful nerd!

Benchmarking all these solutions.

I won't go into the details of how I do the benchmarks here, it's basically just using the Arduino micros() function. I present here only two benchmarks: how much time is required to run the FFT, and how "bad" the result is, measured with the mean squared error. Now, this is not the perfect way to measure the error made by the algorithm, so I do encourage you to have a look at the different comparison plots above. You will also notice that ApproxFFT seems to perform poorly in terms of error for small-sized input arrays. This is because it does not compute the result for frequency 0, so the error is probably over-estimated. Overall, I think it is safe to say that ApproxFFT and Fixed16FFT introduce the same amount of error in the calculation. Notice how ExactFFT is literally billions of times more precise than the other FFT algorithms. For 8-bit algorithms, the quantization mean squared error is \({}^1/{}_3 LSB^2\approx2\times10^{-5}\), which means there are still sources of error introduced in the algorithm other than simple quantization. The same goes for ApproxFFT and Fixed16FFT, where the quantization error is approximately \(3\times10^{-10}\).
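For reference, the two quantization figures quoted above can be recovered as follows, assuming 7 and 15 fractional bits respectively (so that one LSB is \(2^{-7}\) or \(2^{-15}\)):

\[\begin{aligned} \frac{1}{3}\left(2^{-7}\right)^2 &= \frac{2^{-14}}{3}\approx 2.0\times10^{-5}\\ \frac{1}{3}\left(2^{-15}\right)^2 &= \frac{2^{-30}}{3}\approx 3.1\times10^{-10} \end{aligned}\]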

Mean-square error benchmark. The y-axis has a logarithmic scale, so you can see how much better `ExactFFT` performs!

Execution time is where my implementations truly shine. Indeed, you can see that for a 256-byte input array, Fixed8FFT only needs about 12 ms to compute the FFT, when it takes 52 ms for ApproxFFT to do the same. And if you need the same level of precision as what ApproxFFT offers, you can use Fixed16FFT, which only needs about 30 ms to perform the computation. It's worth noting that FloatFFT is not far behind, with only 67 ms needed to compute the 256-byte FFT. Of course ExactFFT takes much longer.

Execution time benchmark. `Fixed8FFT` is truly fast!

Closing thoughts.

It has been a fun journey! I had a lot of fun and "a-ha!" moments when debugging all these implementations. As I wrote before, there are ways to improve them, either by making Fixed8FFT able to handle longer input arrays, or by writing a custom-made addition for floating-point numbers to speed up FloatFFT. I don't know if I will do it in the near future, as this whole project was just intended to be a small side project, which ended up bigger than expected.

As always, feel free to contact me if you need any further detail on this. You can reach me on Mastodon, or on GitHub, or even through the comment section below! In the meantime, have fun with your projects. :)

Modeling a honeycomb grid in FreeCAD

Thu, 04 Aug 2022 00:00:00 +0000

Information

This was originally a Twitter thread, but it is easier to read here.

Someone asked me how to make a honeycomb grid in @FreeCADNews. Here's how I do it, and bonus it's parametric! ⬇️


A nicely animated plate with a honeycomb cut.

Let's start with a simple plate with four holes. I give a name to each dimension in the sketcher so that I can re-use them later.

Sketching the plate.

Extruding it.

Then I create a new body and start sketching on the XY plane. For this example I wanted to constrain the hexagon side, so a bit of trigonometry is needed to get the width of each hexagon. I also decided here that the separation between hexagons would be about 2mm.
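For the record, the trigonometry involved is short. As a sketch, with \(s\) the hexagon side and \(w\) its width measured between two opposite sides (notation introduced here, not names from the model):

\[ w = 2s\cos\left(\frac{\pi}{6}\right) = \sqrt{3}\,s \]

With a 2 mm separation between neighbours, the distance between the centres of two adjacent hexagons would then be \(w + 2\,\text{mm}\).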

Sketching the first hexagon of the pattern

Extruding it.

The two construction lines will serve as directions to which we repeat the hexagon. Notice how I also link the pad length of the new solid with the plate pad length. Then we head to the Create MultiTransform tool in Part Design, and start a first LinearPattern. We need it a bit longer than the width of the plate since we will duplicate the hexagons sideways. Any "big" number will do, but a bit of trigonometry gives me the exact length.

Using MultiTransform to expand the pattern to the right.

Then using another LinearPattern I can complete the line of hexagons. Since our pattern is symmetric I could also have used a symmetry tool. As before I use one of the construction lines for the direction of the pattern.

Expanding the pattern to the left.

Now I do the other direction! Using another LinearPattern, the second construction line, and a bit of trigonometry (again).

Expanding the pattern to the top.

The number of occurrences is given by Length / <>.hexagon_sep. FreeCAD will round that to the nearest integer; if you're not happy with that, you can mess around with ceil and floor. Then, once again I can complete the pattern.

Expanding the pattern to the bottom.

Let's create another body using the sketcher. It will represent the area where I want the honeycomb pattern to be present. I can re-use the dimensions I set for the base plate using their name.

Sketching the area where the honeycomb pattern will be cut.

Extruding it.

One body remaining! We want some of the hexagons to be full. So let's create a body representing these. It re-uses the dimensions of the first hexagon.

Sketching a hexagon looking exactly like the first one.

Extruding it.

Now I want to repeat the body a certain number of times to fill some of the hexagons. Once again MultiTransform is our friend.

Expanding the new hexagon pattern to the right...

... then to the left.

Notice that I used the dimension from the honeycomb pattern to match the correct positions of the hexagons. Also, everything being parametric, I can simply change the number of hexagons by setting the Occurrences parameter of LinearPattern004. At this stage, I have four bodies. I named them main_plate, hexagons, allowed_cut_zone and text_zone. Let's combine them cleverly using boolean operations!

`main_plate`

`hexagons`

`allowed_cut`

`text_zone`

First, let's remove the text zone from the allowed cut, using PartDesign's boolean operation.

Combining `allowed_cut` and `text_zone`.

Resulting geometry.

Then I can create the cut zone, which is the intersection between the allowed cut zone and the hexagons.

Combining the previous geometry with `hexagons`.

This is the final pattern we want to cut from the original plate.

Finally, I can do the cutting, by taking the difference between the base plate and the cut zone.

Combining the pattern with the original plate.

Resulting cut plate.

I just need to add some text using the Draft workbench... whoops, the text zone is a bit too big, good thing that our model is parametric, so we can easily change its size. 😬

What a messy boy I am.

And there you have it!

Our nice and clean result.

If you want to mess around with the model, it is available here.

Have fun!

Let's play at implementing a fast Fourier transform!

Sat, 12 Feb 2022 00:00:00 +0000

The Fourier transform is an essential tool in many fields, be it in Physics, Signal Processing, or Mathematics. The best-known method to calculate it numerically is probably the FFT, for Fast Fourier Transform. In this little tutorial, I propose to try to understand and implement this algorithm in an efficient way. I will use the Julia language, but it should be possible to follow using other languages such as Python or C. We will compare the results obtained with those given by the Julia port of the FFTW library.

This tutorial is intended for people who have already had the opportunity to encounter the Fourier transform, but who have not yet implemented it. It is largely based on the third edition of Numerical Recipes^[1], which I encourage you to consult: it is a gold mine.

Information

This content was originally published on zestedesavoir.com in French. This is a quick translation (using DeepL and a few manual modifications). If something seems off please tell me, as it is likely an error coming from the translation step. You can even open an issue on GitHub, or create a pull request to fix the issue!

Information

This page used to be generated dynamically, but the benchmarks would break every so often because of that. It is now generated statically. The current page was generated with the following julia setup:

julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  LD_PRELOAD = /usr/lib64/libstdc++.so.6
pkg> st
Status `/tmp/jl_AEhNcq/Project.toml`
  [6e4b80f9] BenchmarkTools v1.5.0
  [13f3f980] CairoMakie v0.12.4
  [7a1cc6ca] FFTW v1.8.0
  [65edfddc] SixelTerm v1.3.0

The full code for this article is available here.

[1]	William H. Press, Saul A. Teukolsky, William T. Vetterling, & Brian P. Flannery. (2007). Numerical Recipes 3rd Edition: The Art of Scientific Computing (3rd ed.). Cambridge University Press.

Table of contents
Some reminders on the discrete Fourier transform
1. The Fourier transform
2. From the Fourier transform to the discrete Fourier transform
3. Calculating the discrete Fourier transform
4. Why a fast Fourier transform algorithm?
Implementing the FFT
1. My first FFT
2. Analysis of the first implementation
3. Calculate the reverse permutation of the bits
4. My second FFT
5. The special case of a real signal
  1. Property 1: Compute the Fourier transform of two real functions at the same time
  2. Property 2: Compute the Fourier transform of a single function
6. Calculation in place
7. An FFT for the reals
8. Optimization of trigonometric functions

Some reminders on the discrete Fourier transform

The discrete Fourier transform is a transformation that follows from the Fourier transform and is, as its name indicates, adapted for discrete signals. In this first part I propose to discover how to build the discrete Fourier transform and then understand why the fast Fourier transform is useful.

The Fourier transform

This tutorial is not intended to present the Fourier transform. However, there are several definitions of the Fourier transform and even within a single domain, several are sometimes used. We will use the following: for a function \(f\), its Fourier transform \(\hat{f}\) is defined by:

\[ \hat{f}(\nu) = \int_{-\infty}^{+\infty}f(x)e^{-2i\pi\nu x}\text{d}x \]

From the Fourier transform to the discrete Fourier transform

As defined in the previous section, the Fourier transform of a signal is a continuous function of the variable \(\nu\). However, to represent any signal, we can only use a finite number of values. To do this we proceed in four steps:

We sample (or discretize) the signal to analyze. This means that instead of working on the function that associates the value of the signal with the variable \(x\), we will work on a discrete series of values of the signal. In the case of the FFT, we sample with a constant step. For example, if we look at a temporal signal like the value of a voltage read on a voltmeter, we could record the value at each tick of a watch.
We window the discretized signal. This means that we keep only a finite number of points of the signal.
We sample the Fourier transform of the signal to obtain the discrete Fourier transform.
We window the discrete Fourier transform for storage.

I suggest reasoning on a toy signal shaped like a Gaussian. This makes the reasoning a little simpler because the Fourier transform of a real Gaussian is also a real Gaussian^[2], which simplifies the graphical representations.

The signal which will be used as an example

More formally, we have:

\[ f(x) = e^{-x^2},\;\hat{f}(\nu)=\sqrt{\pi}e^{-(\pi\nu)^2} \]

Let's first look at the sampling. Mathematically, we can represent the process by the multiplication of the signal \(f\) by a Dirac comb of period \(T\), \(ш_T\). The Dirac comb is defined as follows:

\[ ш_T(x) = \sum_{k=-\infty}^{+\infty}\delta(x-kT) \]

With \(\delta\) the Dirac distribution. Here is the plot that we can obtain if we represent \(f\) and \(g=ш_T\times f\) together:

The signal and the sampled signal.

The Fourier transform of the new \(g\) function is written ^[3] :

\[ \begin{aligned} \hat{g}(\nu) &= \int_{-\infty}^{+\infty} \sum_{k=-\infty}^{+\infty} \delta(x-kT) f(x) e^{-2i\pi x \nu} \text{d}x \\ &= \sum_{k=-\infty}^{+\infty}\int_{-\infty}^{+\infty}\delta(x-kT) f(x) e^{-2i\pi x \nu}\text{d}x \\ &= \sum_{k=-\infty}^{+\infty}f(kT)e^{-2i\pi kT\nu} \end{aligned} \]

If we put \(f[k]=f(kT)\) the sampled signal and \(\nu_{\text{ech}} = \frac{1}{T}\) the sampling frequency, we have:

\[ \hat{g}(\nu) = \sum_{k=-\infty}^{+\infty}f[k]e^{-2i\pi k\frac{\nu}{\nu_{\text{ech}}}} \]

If we plot the Fourier transform of the starting signal \(\hat{f}\) and that of the sampled signal \(\hat{g}\), we obtain the following plot:

Fourier transform of the signal and its sampled signal

Information

We notice that the sampling of the signal has led to the periodization of its Fourier transform. This leads to an important property in signal processing: the Nyquist-Shannon criterion, and one of its consequences, spectrum aliasing. I let you consult the Wikipedia article about this if you are interested, but you can get a quick idea of what happens if you draw the previous plot with too large a sampling step: the bells of the sampled signal's transform overlap.

Fourier transform of the signal and its sampled signal, illustrating aliasing.

We can then look at the windowing process. There are several methods that each have their advantages, but we will focus here only on the rectangular window. The principle is simple: we only look at the values of \(f\) for \(x\) between \(-x_0\) and \(+x_0\). This means that we multiply the function \(f\) by a gate function \(\Pi_{x_0}\) which verifies:

\[ \Pi_{x_0}(x) = \begin{cases} 1 & \text{if}\; x\in[-x_0,x_0] \\ 0 & \text{otherwise} \end{cases} \]

Graphically, here is how we could represent \(h\) and \(f\) together.

Signal sampled and windowed

Concretely, this is equivalent to limiting the sum of the Dirac comb to a finite number of terms. We can then write the Fourier transform of \(h=\Pi_{x_0} \times ш_T \times f\):

\[ \hat{h}(\nu) = \sum_{k=-k_0}^{+k_0}f[k]e^{-2i\pi k\frac{\nu}{\nu_{\text{ech}}}} \]

Information

The choice of windowing is not at all trivial, and can lead to unexpected problems if ignored. Here again I advise you to consult the associated Wikipedia article if needed.

We can now proceed to the last step: sampling the Fourier transform. Indeed, we can only store a finite number of values on our computer and, as defined, the function \(\hat{h}\) is continuous. We already know that it is periodic, with period \(\nu_{\text{ech}}\), so we can store only the values between \(0\) and \(\nu_{\text{ech}}\). We still have to sample it, and in particular to find the adequate sampling step. It is clear that we want the sampling to be as "fine" as possible, in order not to miss any detail of the Fourier transform! For this we can take inspiration from what happened when we sampled \(f\): its Fourier transform became periodic, with period \(\nu_{\text{ech}}\). Now the inverse Fourier transform (the operation that recovers the signal from its Fourier transform) has similar properties to the Fourier transform. This means that if we sample \(\hat{h}\) with a sampling step \(\nu_s\), then its inverse Fourier transform becomes periodic with period \(1/\nu_s\). This gives an upper limit on the values that \(\nu_s\) can take! Indeed, if the inverse transform has a period smaller than the width of the window (\(1/\nu_s < 2x_0\)), then the reconstructed signal taken between \(-x_0\) and \(x_0\) will not correspond to the initial signal \(f\)!

So we choose \(\nu_s = \frac{1}{2x_0}\) to discretize \(\hat{h}\). We use the same process of multiplication by a Dirac comb to discretize. In this way we obtain the Fourier transform of a new function \(l\) :

\[ \begin{aligned} \hat{l}(\nu) = \sum_{n=-\infty}^{+\infty} \delta(\nu-n\nu_s) \sum_{k=-k_0}^{+k_0}f[k]e^{-2i\pi k\frac{n\nu_s}{\nu_{\text{ech}}}} \end{aligned} \]

This notation is a bit complicated, and it is more convenient to look at \(\hat{l}[n]=\hat{l}(n\nu_s)\):

\[ \begin{aligned} \hat{l}[n] = \hat{l}(n\nu_s) &= \sum_{k=-k_0}^{+k_0}f[k]e^{-2i\pi k\frac{n\nu_s}{\nu_{\text{ech}}}}\\ &= \sum_{k=0}^{N-1}f[k]e^{-2i\pi k\frac{n}{N}} \end{aligned} \]

To get the last line, I re-indexed \(f[k]\) to start at 0, noting \(N\) the number of samples. I then assumed that the window size corresponded to an integer number of samples, i.e. that \(2x_0 = N\times T\), which is rewritten as \(N\times \nu_s = \nu_{\text{ech}}\). This expression is the discrete Fourier transform of the signal.

Sampling the Fourier transform of the sampled signal to obtain the discrete Fourier transform

Information

We can see that the sampling frequency does not enter into this equation, and there are many applications where we simply forget that this frequency exists.
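The sampling frequency only comes back when you want to label the bins with physical frequencies. As a sketch, relying on \(N\times\nu_s=\nu_{\text{ech}}\) and on the periodicity of the transform, the frequency of bin \(n\) could be computed as below. The function name bin_frequency is made up, and mapping the upper half of the bins to negative frequencies is a common convention rather than something imposed by the formula.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: physical frequency of DFT bin n, for N samples
   taken at sampling frequency fs. Bins in the upper half of the array
   are mapped to negative frequencies, using the periodicity of the DFT. */
double bin_frequency(size_t n, size_t N, double fs) {
    assert(N > 0 && n < N);
    double step = fs / (double)N;  /* frequency step: fs / N */
    if (n < (N + 1) / 2) {
        return (double)n * step;
    }
    return ((double)n - (double)N) * step;
}
```

For 256 samples taken at 1000 Hz, bin 1 sits at about 3.9 Hz and bin 255 at about -3.9 Hz.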

Question

There is one last point to clarify: this discrete transform is defined for an infinite (discrete) number of values of \(n\). How do we store it on our computer?

This problem is solved quite simply by windowing the discrete Fourier transform. Since the transform has been periodized by the sampling of the starting signal, it is enough to store one period of the transform to store all the information contained in it. The choice which is generally made is to keep all the points between \(0\) and \(\nu_{\text{ech}}\). This allows using only positive \(n\), and one can easily reconstruct the plot of the transform if needed by swapping the first and the second half of the computed transform. In practice (for the implementation), the discrete Fourier transform is thus given by:

\[ \boxed{ \forall n=0...(N-1),\; \hat{f}[n] = \sum_{k=0}^{N-1}f[k]e^{-2i\pi k\frac{n}{N}} } \]

To conclude on our example function, we obtain the following plot:

Windowing of the discrete Fourier transform for storage
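
The "swap the two halves" step mentioned above can be sketched with Base's circshift. The function name swap_halves is a hypothetical one chosen for this illustration, and the sketch assumes an even length \(N\).

```julia
# Hypothetical helper: move bins N/2 … N-1 (the negative frequencies) to the front.
swap_halves(X) = circshift(X, length(X) ÷ 2)

X = [0, 1, 2, 3, 4, 5, 6, 7]   # stand-in for the bin indices n = 0 … 7
println(swap_halves(X))
```

Applying swap_halves twice gives back the original ordering when the length is even.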

Calculating the discrete Fourier transform

So we have at our disposal the expression of the discrete Fourier transform of a signal \(f\):

\[ \hat{f}[n] = \sum_{k=0}^{N-1}f[k]e^{-2i\pi k\frac{n}{N}} \]

This is the expression of a matrix product, which looks like this:

\[ \hat{f} = \mathbf{M} \cdot f \]

with

\[ \mathbf{M} = \begin{pmatrix} 1 & 1 & 1 & \dots & 1 \\ 1 & e^{-2i\pi 1 \times 1 / N} & e^{-2i\pi 1 \times 2 / N} & \dots & e^{-2i\pi 1\times (N-1)/N} \\ 1 & e^{-2i\pi 2 \times 1 / N} & e^{-2i\pi 2 \times 2 / N} & \ddots & \vdots\\ \vdots & \vdots & \ddots & \ddots & e^{-2i\pi (N-2)\times (N-1) / N}\\ 1 & e^{-2i\pi (N-1) \times 1/N} & \dots & e^{-2i\pi (N-1) \times (N-2) / N} & e^{-2i\pi (N-1)\times (N-1) / N} \end{pmatrix} \]

Those in the know will notice that this is a Vandermonde matrix on the roots of unity.

So this calculation can be implemented relatively easily!

function naive_dft(x)
  N = length(x)
  k = reshape(0:(N-1), 1, :)
  n = 0:(N-1)
  M = @. exp(-2im * π * k * n / N)
  M * x
end

Information

The @. macro on line 5 vectorizes the expression that follows it (exp(-2im * π * k * n / N)). The function exp and the division and multiplication operators are defined for scalars; the macro tells Julia to apply these scalar operations element-wise. Since k is a row and n a column, the result is the full \(N\times N\) matrix.
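
To see the broadcasting at work on something smaller, here is a toy sketch with a 1×3 row and a 3-element column (the variable names simply mirror those of naive_dft):

```julia
k = reshape(0:2, 1, :)   # 1×3 row
n = 0:2                  # 3-element column
A = @. k * n             # same as k .* n: a 3×3 table of products, A[i, j] = n[i] * k[j]
println(A)
```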

And to check that it does indeed give the right result, it is enough to compare it with a reference implementation:

using FFTW

a = rand(1024)
b = fft(a)
c = naive_dft(a)
b ≈ c

The last block evaluates to true, which confirms that we are not totally off the mark!

Information

I use the ≈ operator to compare rather than == to allow for small differences, especially because of rounding errors on floats.
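
A classic illustration of the difference, independent of the FFT:

```julia
a = 0.1 + 0.2
println(a == 0.3)   # exact comparison of the two floats
println(a ≈ 0.3)    # isapprox: comparison up to a small relative tolerance
```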

However, is this code efficient? We can check by measuring the memory footprint and execution speed.

using BenchmarkTools

@benchmark fft(a) setup=(a = rand(1024))

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.290 μs … 333.143 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     44.453 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   47.263 μs ±  12.066 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
                  ▁▇▆█▁▄
  ▂▂▁▁▂▂▁▁▂▂▁▂▁▁▃▆███████▆▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  18.3 μs         Histogram: frequency by time         98.7 μs <
 Memory estimate: 32.55 KiB, allocs estimate: 6.

@benchmark naive_dft(a) setup=(a = rand(1024))

BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  35.577 ms … 571.900 ms  ┊ GC (min … max):  0.00% … 88.11%
 Time  (median):     40.820 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   50.326 ms ±  53.354 ms  ┊ GC (mean ± σ):  11.18% ± 10.10%
    ▂█
  ▅▅██▅▆▆▅▄▄▁▄▁▁▁▄▁▁▄▁▅█▄▆▄▃▃▄▅▅▁▃▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
  35.6 ms         Histogram: frequency by time         80.2 ms <
 Memory estimate: 16.03 MiB, allocs estimate: 4.

Information

As you can see, the maximum execution time of the reference implementation is much higher than the mean and median execution times. This is due to Julia's just-in-time (JIT) compilation. If we were writing a real Julia library we could consider optimizing our code to compile quickly. In this tutorial we will simply ignore the maximum execution time, which only reflects the compilation time for the first execution of the code. I refer you to the BenchmarkTools.jl documentation for more information.

So our implementation is really slow (about 1,000 times slower) and has a very high memory footprint (about 500 times larger) compared to the reference implementation! To improve this, we will implement the fast Fourier transform.

Why a fast Fourier transform algorithm?

Before getting our hands dirty again, let's first ask the question: is it really necessary to try to improve this algorithm?

Before answering directly, let us look at some applications of the Fourier transform and the discrete Fourier transform.

The Fourier transform has first of all a lot of theoretical applications, whether it is to solve differential equations, in signal processing or in quantum physics. It also has practical applications in optics and in spectroscopy.

The discrete Fourier transform also has many applications, in signal analysis, for data compression, multiplication of polynomials or the computation of convolution products.

Our naive implementation of the discrete Fourier transform has time and memory complexity in \(\mathcal{O}(N^2)\), with \(N\) the size of the input sample; this is due to the storage of the matrix and the computation time of the matrix product. Concretely, if one wished to analyze a 3-second sound signal sampled at 44 kHz with data stored as single-precision floats (4 bytes), we would need approximately \(2\times(44000\times3)^2\times 4\approx140\;000\;000\;000\) bytes of memory (a complex number is stored on 2 floats). We can also estimate the time necessary to make this calculation. The median time for 1024 points was 38.367 ms. For our 3-second signal, it would take about \(38.367\times\left(\frac{44000\times3}{1024}\right)^2\approx 637\;537\) milliseconds, that is more than 10 minutes!

It is easy to see why reducing the complexity of the calculation matters. In particular the fast Fourier transform algorithm (used by the reference implementation) has complexity \(\mathcal{O}(N\log N)\). According to our benchmark, the algorithm processes a 1024-point input in 23.785 µs. It should therefore process the sound signal in about \(23.785\times\frac{44000\times3\times\log(44000\times3)}{1024\times\log1024}\approx 5\;215\) microseconds, that is to say about 120,000 times faster than our algorithm. The "fast" in fast Fourier transform is well deserved!
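
These back-of-the-envelope estimates are easy to reproduce. The sketch below simply redoes the arithmetic with the median times quoted in the text (38.367 ms and 23.785 µs); the variable names are mine.

```julia
N_ref = 1024          # size used in the benchmarks
N_sig = 44_000 * 3    # 3 seconds sampled at 44 kHz

# Memory for the N×N matrix of complex single-precision numbers (2 floats of 4 bytes each).
bytes = 2 * N_sig^2 * 4

# Extrapolated run times: O(N²) for the naive DFT, O(N log N) for the FFT.
t_naive_ms = 38.367 * (N_sig / N_ref)^2
t_fft_μs = 23.785 * (N_sig * log(N_sig)) / (N_ref * log(N_ref))

println(bytes)                          # bytes needed by the matrix
println(t_naive_ms / 60_000)            # minutes for the naive DFT
println(1000 * t_naive_ms / t_fft_μs)   # estimated speed-up factor
```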

[2]	Gaussians are said to be eigenfunctions of the Fourier transform.

[3]	It should be justified here that we can invert the sum and integral signs.

We saw how the discrete Fourier transform is constructed, and then we naively tried to implement it. While this implementation is relatively simple to write (especially with a language like Julia that facilitates matrix manipulations), we also saw its limitations in terms of execution time and memory footprint.

It's time to move on to the FFT itself!

Implementing the FFT

In this part we will implement the FFT by starting with a simple approach, and then making it more complex as we go along to try to calculate the Fourier transform of a real signal in the most efficient way possible. To compare the performances of our implementations, we will continue to compare with the FFTW implementation.

My first FFT

We have previously found the expression of the discrete Fourier transform :

\[ \hat{f}[n] = \sum_{k=0}^{N-1}f[k]e^{-2i\pi k\frac{n}{N}} \]

The trick at the heart of the FFT algorithm is to notice that if we try to cut this sum in two, separating the even and odd terms, we get (assuming \(N\) is even), for \(n < N/2\) :

\[ \begin{aligned} \hat{f}[n] &= \sum_{k=0}^{N-1}f[k]e^{-2i\pi k\frac{n}{N}}\\ &= \sum_{m=0}^{N/2-1}f[2m]e^{-2i\pi 2m\frac{n}{N}} + \sum_{m=0}^{N/2-1}f[2m+1]e^{-2i\pi (2m+1)\frac{n}{N}}\\ &= \sum_{m=0}^{N/2-1}f[2m]e^{-2i\pi m\frac{n}{N/2}} + e^{-2i\pi n/N}\sum_{m=0}^{N/2-1}f[2m+1]e^{-2i\pi m\frac{n}{N/2}}\\ &= \hat{f}^\text{even}[n] + e^{-2i\pi n/N}\hat{f}^\text{odd}[n] \end{aligned} \]

where \(\hat{f}^\text{even}\) and \(\hat{f}^\text{odd}\) are the Fourier transforms of the sequence of even terms of \(f\) and of the sequence of odd terms of \(f\). We can therefore compute the first half of the Fourier transform of \(f\) by computing the Fourier transforms of these two sequences of length \(N/2\) and recombining them. Similarly, if we compute \(\hat{f}[n+N/2]\) we have :

\[ \begin{aligned} \hat{f}[n+N/2] &= \sum_{m=0}^{N/2-1}f[2m]e^{-2i\pi m\frac{n+N/2}{N/2}} + e^{-2i\pi(n+N/2)/N}\sum_{m=0}^{N/2-1}f[2m+1]e^{-2i\pi m\frac{n+N/2}{N/2}}\\ &= \sum_{m=0}^{N/2-1}f[2m]e^{-2i\pi m\frac{n}{N/2}} - e^{-2i\pi n/N}\sum_{m=0}^{N/2-1}f[2m+1]e^{-2i\pi m\frac{n}{N/2}}\\ &= \hat{f}^\text{even}[n] - e^{-2i\pi n/N}\hat{f}^\text{odd}[n] \end{aligned} \]

This means that by computing two Fourier transforms of length \(N/2\), we obtain, for each \(n < N/2\), two elements of a Fourier transform of length \(N\). Assuming for simplicity that \(N\) is a power of two^[4], this naturally suggests a recursive implementation of the FFT. According to the master theorem, this algorithm has complexity \(\mathcal{O}(N\log_2 N)\), which is much better than the first naive algorithm we implemented, which has complexity \(\mathcal{O}(N^2)\).

function my_fft(x)
  # Stop condition: the FT of an array of size 1 is the array itself.
  if length(x) <= 1
    x
  else
    N = length(x)
    # Xᵒ contains the FT of the odd terms and Xᵉ that of the even terms.
    # The subtlety is that Julia's arrays start at 1 and not 0.
    Xᵒ = my_fft(x[2:2:end])
    Xᵉ = my_fft(x[1:2:end])
    factors = @. exp(-2im * π * (0:(N/2 - 1)) / N)
    [Xᵉ .+ factors .* Xᵒ; Xᵉ .- factors .* Xᵒ]
  end
end

We can check as before that the code gives the correct result, then compare its run-time performance with the reference implementation.
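
If FFTW is not at hand, the check can also be made against a direct evaluation of the sum. In this sketch, direct_dft is a hypothetical helper written for the occasion, and my_fft is repeated (without comments) only so that the snippet runs on its own.

```julia
# Direct O(N²) evaluation of the DFT sum, used here only as a reference.
direct_dft(x) = [sum(x[k+1] * exp(-2im * π * k * n / length(x)) for k in 0:(length(x)-1))
                 for n in 0:(length(x)-1)]

# Same recursive FFT as above.
function my_fft(x)
  if length(x) <= 1
    x
  else
    N = length(x)
    Xᵒ = my_fft(x[2:2:end])
    Xᵉ = my_fft(x[1:2:end])
    factors = @. exp(-2im * π * (0:(N/2 - 1)) / N)
    [Xᵉ .+ factors .* Xᵒ; Xᵉ .- factors .* Xᵒ]
  end
end

a = rand(16)
println(my_fft(a) ≈ direct_dft(a))
```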

@benchmark fft(a) setup=(a = rand(1024))

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  15.262 μs … 67.122 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.877 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.290 μs ±  3.125 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
     ▇█▂
  ▁▂▇███▆▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  15.3 μs         Histogram: frequency by time        27.9 μs <
 Memory estimate: 32.55 KiB, allocs estimate: 6.

@benchmark my_fft(a) setup=(a = rand(1024))

BenchmarkTools.Trial: 3399 samples with 1 evaluation.
 Range (min … max):  983.141 μs … 586.032 ms  ┊ GC (min … max):  0.00% … 99.48%
 Time  (median):       1.113 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):     1.464 ms ±  10.067 ms  ┊ GC (mean ± σ):  17.52% ±  9.04%
  ██▅▂▁▄▂▁▂▁                                                    ▁
  ████████████▇▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▅ █
  983 μs        Histogram: log(frequency) by time       7.58 ms <
 Memory estimate: 989.12 KiB, allocs estimate: 10230.

We can see that we have improved the execution time (by a factor of about 37) and the memory footprint of the algorithm (by a factor of about 17), but we are still far from the reference implementation.

Analysis of the first implementation

Let's go back to the previous code:

function my_fft(x)
  # Stop condition: the FT of an array of size 1 is the array itself.
  if length(x) <= 1
    x
  else
    N = length(x)
    # Xᵒ contains the FT of the odd terms and Xᵉ that of the even terms.
    # The subtlety is that Julia's arrays start at 1 and not 0.
    Xᵒ = my_fft(x[2:2:end])
    Xᵉ = my_fft(x[1:2:end])
    factors = @. exp(-2im * π * (0:(N/2 - 1)) / N)
    [Xᵉ .+ factors .* Xᵒ; Xᵉ .- factors .* Xᵒ]
  end
end

And let's try to keep track of the memory allocations. For simplicity, assume that we are working on an array of 4 elements, [f[0], f[1], f[2], f[3]]. The first call to my_fft keeps the initial array in memory, then launches the FFT on two sub-arrays of size 2: [f[0], f[2]] and [f[1], f[3]]. The recursive calls in turn keep in memory the arrays [f[0]] and [f[2]], then [f[1]] and [f[3]], before recombining them. At most, we have \(\log_2(N)\) arrays allocated, with sizes divided by two each time. Not only do these arrays take up memory, but we also waste time allocating them!

However, if we observe the definition of the recurrence we use, at each step \(i\) (i.e. for each array size, \(N/2^i\)), the sum of the intermediate array sizes is always \(N\). In other words, this gives the idea that we could save all these array allocations and use the same array all the time, provided that we make all the associations of arrays of the same size at the same step.

Schematically we can represent the FFT process for an array with 8 elements as follows:

Illustration of the FFT process. The colors indicate if an element is treated as an even array (red) or an odd array (green). The geometrical shapes identify the elements which are in the same subarray. The multiplicative coefficients applied to the odd elements are also represented. This somewhat complicated diagram is the key to what follows. Feel free to spend some time to understand it.

How to read this diagram? Each column corresponds to a depth of the recursion of our first FFT. The leftmost column corresponds to the deepest level: we have cut the input array enough to arrive at subarrays of size 1. These 8 subarrays are symbolized by 8 different geometrical shapes. We then go to the next level of recursion. Each pair of subarrays of size 1 must be combined to create a subarray of size 2, which will be stored in the same memory cells as the two subarrays of size 1. For example, we combine the subarray ▲ that contains \(f[0]\) and the subarray ◆ that contains \(f[4]\) using the formula demonstrated earlier to form the array \([f[0] + f[4], f[0] - f[4]]\), which I call ◆ in the following, and store the two values in positions 0 and 4. The colors of the arrows distinguish those bearing a coefficient (which correspond to the treatment we give to the subarray \(\hat{f}^{\text{odd}}\) in the formulas of the previous section). After having constructed the 4 subarrays of size 2, we can proceed to a new step of the recursion to compute two subarrays of size 4. Finally the last step of the recursion combines the two subarrays of size 4 to compute the array of size 8 which contains the Fourier transform.

Based on this scheme we can think of having a function whose main loop would calculate successively each column to arrive at the final result. In this way, all the calculations are performed on the same array and the number of allocations is minimized! There is however a problem: we see that the \(\hat{f}[k]\) do not seem to be ordered at the end of the process.

In reality, these \(\hat{f}[k]\) are ordered via a reverse bit permutation. This means that if we write the indices \(k\) in binary, then reverse this writing (the MSB becoming the LSB^[5]), we obtain the index at which \(\hat{f}[k]\) is found after the FFT algorithm. The permutation process is described by the following table in the case of a calculation on 8 elements.

\(k\)	Binary representation of \(k\)	Reverse binary representation	Index of \(\hat{f}[k]\)
0	000	000	0
1	001	100	4
2	010	010	2
3	011	110	6
4	100	001	1
5	101	101	5
6	110	011	3
7	111	111	7
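
The table can be regenerated with a few lines of code. The sketch below uses a generic, bit-by-bit helper named bitrev, which is an assumption of mine for illustration and not the faster mask-based method developed later.

```julia
# Hypothetical generic helper: reverse the p lowest bits of k, one bit at a time.
function bitrev(k, p)
  r = 0
  for _ in 1:p
    r = (r << 1) | (k & 1)   # push the lowest bit of k onto r
    k >>= 1
  end
  r
end

# One row per index: k, its binary form, the reversed binary form, the resulting index.
for k in 0:7
  println(k, '\t', string(k, base=2, pad=3), '\t', string(bitrev(k, 3), base=2, pad=3), '\t', bitrev(k, 3))
end
```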

If we know how to calculate the reverse permutation of the bits, we can simply reorder the array at the end of the process to obtain the right result. However, before jumping on the implementation, it is interesting to look at what happens if instead we reorder the input array via this permutation.

Diagram of the FFT with a permuted input. The colors and symbols are the same as in the first illustration

We can see that by proceeding in this way we get a simple ordering of the subarrays. Since the array has to be permuted in any case, we might as well do it before computing the FFT.

Calculate the reverse permutation of the bits

We must therefore begin by being able to calculate the permutation. It is possible to perform the permutation in place simply once we know which elements to exchange. Several methods exist to perform the permutation, and a search in Google Scholar will give you an overview of the wealth of approaches.

We can use a little trick here: since we are dealing only with arrays whose size is a power of 2, we can write the size \(N\) as \(N=2^p\). This means that the indices can be stored on \(p\) bits. We can then simply calculate the permuted index via binary operations. For example if \(p=10\) then the index \(797\) could be represented as: 1100011101.

We can separate the inversion process in several steps. First we exchange the 5 most significant bits and the 5 least significant bits. Then on each of the half-words we invert the two most significant bits and the two least significant bits (the central bits do not change). Finally on the two bits words that we have just exchanged, we exchange the most significant bit and the least significant bit.
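
Following these three steps on the index 797 gives the sketch below. The helper bits is a hypothetical name used only for display, and the masks are the ones that select the groups of bits described above.

```julia
bits(n) = string(n, base=2, pad=10)   # 10-bit binary representation, for display

num = 797
println(bits(num))                                                  # 1100011101
num = ((num & 0x3e0) >> 5) | ((num & 0x01f) << 5)                   # swap the two 5-bit halves
println(bits(num))                                                  # 1110111000
num = ((num & 0x318) >> 3) | (num & 0x084) | ((num & 0x063) << 3)   # swap 2-bit groups in each half
println(bits(num))                                                  # 0111100011
num = ((num & 0x252) >> 1) | (num & 0x084) | ((num & 0x129) << 1)   # swap the bits of each 2-bit group
println(bits(num))                                                  # 1011100011
```

The final value, 1011100011, is indeed 1100011101 read from right to left.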

An example of implementation would be the following:

bit_reverse(::Val{10}, num) = begin
  num = ((num&0x3e0)>>5)|((num&0x01f)<<5)
  num = ((num&0x318)>>3)|(num&0x084)|((num&0x063)<<3)
  ((num&0x252)>>1)|(num&0x084)|((num&0x129)<<1)
end

An equivalent algorithm can be applied for all values of \(p\), you just have to be careful not to change the central bits anymore when you have an odd number of bits in a half word. In the following there is an example for several word lengths.

bit_reverse(::Val{64}, num) = bit_reverse(Val(32), (num&0xffffffff00000000)>>32)|(bit_reverse(Val(32), num&0x00000000ffffffff)<<32)
bit_reverse(::Val{32}, num) = bit_reverse(Val(16), (num&0xffff0000)>>16)|(bit_reverse(Val(16), num&0x0000ffff)<<16)
bit_reverse(::Val{16}, num) = bit_reverse(Val(8), (num&0xff00)>>8)|(bit_reverse(Val(8), num&0x00ff)<<8)
bit_reverse(::Val{8}, num) = bit_reverse(Val(4), (num&0xf0)>>4)|(bit_reverse(Val(4), num&0x0f)<<4)
bit_reverse(::Val{4}, num) = bit_reverse(Val(2), (num&0xc)>>2)|(bit_reverse(Val(2), num&0x3)<<2)
bit_reverse(::Val{3}, num) = ((num&0x1)<<2)|((num&0x4)>>2)|(num&0x2)
bit_reverse(::Val{2}, num) = ((num&0x2)>>1 )|((num&0x1)<<1)
bit_reverse(::Val{1}, num) = num

Then we can do the permutation itself. The algorithm is relatively simple: just iterate over the array, calculate the inverted index of the current index and perform the inversion. The only subtlety is that the inversion must be performed only once per index of the array, so we discriminate by performing the inversion only if the current index is lower than the inverted index.

function reverse_bit_order!(X, order)
  N = length(X)
  for i in 0:(N-1)
    j = bit_reverse(order, i)
    if i<j # swap each pair only once
      X[i+1],X[j+1]=X[j+1],X[i+1]
    end
  end
end
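
As a quick sanity check on 8 elements, here is a self-contained sketch of the permutation described above (swap only when the current index is lower than the reversed one). It repeats the 3-bit method of bit_reverse so that it runs on its own, and returning X at the end is my own addition for convenience.

```julia
bit_reverse(::Val{3}, num) = ((num&0x1)<<2)|((num&0x4)>>2)|(num&0x2)

function reverse_bit_order!(X, order)
  N = length(X)
  for i in 0:(N-1)
    j = bit_reverse(order, i)
    if i < j   # swap each pair only once
      X[i+1], X[j+1] = X[j+1], X[i+1]
    end
  end
  X
end

X = collect(0:7)
reverse_bit_order!(X, Val(3))
println(X)
```

The result follows the last column of the table above, and applying the permutation a second time restores the original order.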


My second FFT
We are now sufficiently equipped to start a second implementation of the FFT. The first step will be to compute the reverse bit permutation. Then we will be able to compute the Fourier transform following the scheme shown previously. To do this we will store the size \(n_1\) of the sub-arrays and the number of cells \(n_2\) in the global array that separate two elements of the same index in the sub-arrays. The implementation can be done as follows:
function my_fft_2(x)
  N = length(x)
  order = Int(log2(N))
  @inbounds reverse_bit_order!(x, Val(order))
  n₁ = 0
  n₂ = 1
  for i=1:order # i is the index of the column we are in.
    n₁ = n₂ # n₁ = 2ⁱ⁻¹
    n₂ *= 2 # n₂ = 2ⁱ
    
    step_angle = -2π/n₂
    angle = 0
    for j=1:n₁ # j is the index in Xᵉ and Xᵒ
      factors = exp(im*angle) # factors = exp(-2im*π*(j-1)/n₂)
      angle += step_angle # angle = -2π*j/n₂
      
      # We combine the element j of each group of subarrays
      for k=j:n₂:N
        @inbounds x[k], x[k+n₁] = x[k] + factors * x[k+n₁], x[k] - factors * x[k+n₁]
      end
    end
  end
  x  
end
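
To see what the three nested loops do, one can list the pairs of cells combined in each column. The function butterfly_pairs below is a hypothetical helper that only mirrors the loop structure of my_fft_2 and reports 0-based indices, as in the diagram.

```julia
# For each column, list the 0-based index pairs (k, k+n₁) combined together.
function butterfly_pairs(N)
  stages = Vector{Vector{Tuple{Int,Int}}}()
  n₂ = 1
  while n₂ < N
    n₁ = n₂
    n₂ *= 2
    ps = Tuple{Int,Int}[]
    for j in 1:n₁, k in j:n₂:N
      push!(ps, (k - 1, k + n₁ - 1))
    end
    push!(stages, ps)
  end
  stages
end

for (i, ps) in enumerate(butterfly_pairs(8))
  println("column ", i, ": ", ps)
end
```

In the first column neighbours are combined, then cells 2 apart, then cells 4 apart: the distance n₁ doubles at each column.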


 Information


  There are two small subtleties due to Julia: array indices start at 1, and we use the @inbounds macro to speed up the code a bit by disabling bounds checks on array accesses. 

We can again measure the performance of this implementation. To keep the comparison fair, the fft! function should be used instead of fft, as it works in place.
@benchmark fft!(a) setup=(a = rand(1024) |> complex)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.501 μs … 156.742 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     20.942 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.874 μs ±   8.820 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
    █▇▁
  ▂▇████▅▄▄▄▃▂▅▄▂▃▂▂▂▂▂▂▂▂▂▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  16.5 μs         Histogram: frequency by time           52 μs <
 Memory estimate: 304 bytes, allocs estimate: 4.
@benchmark my_fft_2(a) setup=(a = rand(1024) .|> complex)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  46.957 μs … 152.061 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     50.283 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   55.079 μs ±  10.720 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
   ▅ █▃▁▆▂ ▄  ▃  ▂   ▁    ▂    ▁    ▁                          ▁
  ██▇██████████████▇▇█▇▇▇▇█▇▆▆██▅▅▅▆█▆▆▆▇▇██▇▇█▇▆▆▆▇▅▇▅▅▆▅▅▅▅▆ █
  47 μs         Histogram: log(frequency) by time      98.7 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.
We have significantly improved our execution time and memory footprint. There are zero bytes allocated (the few intermediate variables do not need to be stored on the heap), and the execution time is now of the same order of magnitude as that of the reference implementation (about 50 μs against 21 μs).
The special case of a real signal
So far we have reasoned about complex signals, which use two floats per sample for storage. However in many situations we work with real-valued signals. Now in the case of a real signal, we know that \(\hat{f}\) verifies \(\hat{f}(-\nu) = \overline{\hat{f}(\nu)}\). This means that half of the values we calculate are redundant. Even though the signal is real, its Fourier transform is in general complex. In order to save storage space, we can use this redundant half of the array to store the complex values. For this, two properties will help us.
Property 1: Compute the Fourier transform of two real functions at the same time
If we have two real signals \(f\) and \(g\), we can define the complex signal \(h=f+ig\). We then have:
\[
\hat{h}[k] = \sum_{n=0}^{N-1}(f[n]+ig[n])e^{-2i\pi kn/N}
\]
We can notice that 
\[
\begin{aligned}
\overline{\hat{h}[N-k]} &= \overline{\sum_{n=0}^{N-1}(f[n]+ig[n])e^{-2i\pi (N-k)n/N}}\\
&=\sum_{n=0}^{N-1}(f[n]-ig[n])e^{-2i\pi kn/N}
\end{aligned}
\]
Combining the two we have
\[
\begin{aligned}
\hat{f}[k] &= \frac{1}{2}(\hat{h}[k] + \overline{\hat{h}[N-k]})\\
\hat{g}[k] &= -\frac{i}{2}(\hat{h}[k] - \overline{\hat{h}[N-k]})\\
\end{aligned}
\]
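This property is easy to check numerically. In the sketch below, dft is a hypothetical helper that evaluates the sum directly, and the index \(N-k\) is taken modulo \(N\) (by periodicity, \(\hat{h}[N]=\hat{h}[0]\)).

```julia
# Direct evaluation of the DFT sum, used as a reference.
dft(x) = [sum(x[k+1] * exp(-2im * π * k * n / length(x)) for k in 0:(length(x)-1))
          for n in 0:(length(x)-1)]

N = 8
f, g = rand(N), rand(N)
H = dft(f .+ im .* g)   # a single complex transform for both signals

# conj(ĥ[N-k]) for k = 0 … N-1, with the index taken modulo N.
Hc = [conj(H[mod(N - k, N) + 1]) for k in 0:(N-1)]

F = (H .+ Hc) ./ 2
G = -im .* (H .- Hc) ./ 2

println(F ≈ dft(f))
println(G ≈ dft(g))
```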
Property 2 : Compute the Fourier transform of a single function
The idea is to use the previous property by using the signal of the even and the odd elements. In other words for \(k=0...N/2-1\) we have \(h[k]=f[2k]+if[2k+1]\).
Then we have:
\[
\begin{aligned}
\hat{f}^{\text{even}}[k] &= \sum_{n=0}^{N/2-1}f[2n]e^{-2i\pi kn/(N/2)}\\
\hat{f}^{\text{odd}}[k] &= \sum_{n=0}^{N/2-1}f[2n+1]e^{-2i\pi kn/(N/2)}
\end{aligned}
\]
We can recombine the two partial transforms. For \(k=0...N/2-1\) :
\[
\begin{aligned}
\hat{f}[k] &= \hat{f}^{\text{even}}[k] + e^{-2i\pi k/N}\hat{f}^{\text{odd}}[k]\\
\hat{f}[k+N/2] &= \hat{f}^{\text{even}}[k] - e^{-2i\pi k/N}\hat{f}^{\text{odd}}[k]
\end{aligned}
\]
Using the first property, we then have:
\[
\begin{aligned}
\hat{f}[k] &= \frac{1}{2}(\hat{h}[k] + \overline{\hat{h}[N/2-k]}) - \frac{i}{2}(\hat{h}[k] - \overline{\hat{h}[N/2-k]})e^{-2i\pi k/N} \\
\hat{f}[k+N/2] &= \frac{1}{2}(\hat{h}[k] + \overline{\hat{h}[N/2-k]}) + \frac{i}{2}(\hat{h}[k] - \overline{\hat{h}[N/2-k]})e^{-2i\pi k/N}
\end{aligned}
\]
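Again, a small numerical check helps to trust the formula. This is only a sketch: dft is a hypothetical direct implementation, and the index \(N/2-k\) is taken modulo \(N/2\) so that the case \(k=0\) uses \(\hat{h}[0]\).

```julia
# Direct evaluation of the DFT sum, used as a reference.
dft(x) = [sum(x[k+1] * exp(-2im * π * k * n / length(x)) for k in 0:(length(x)-1))
          for n in 0:(length(x)-1)]

N = 16
M = N ÷ 2
f = rand(N)
h = f[1:2:end] .+ im .* f[2:2:end]   # h[k] = f[2k] + i f[2k+1], with 0-based k
H = dft(h)                            # complex transform of length N/2

F = zeros(ComplexF64, N)
for k in 0:(M-1)
  Hk = H[k+1]
  Hs = conj(H[mod(M - k, M) + 1])     # conj(ĥ[N/2-k])
  w = exp(-2im * π * k / N)
  F[k+1]   = (Hk + Hs) / 2 - im * (Hk - Hs) / 2 * w
  F[k+M+1] = (Hk + Hs) / 2 + im * (Hk - Hs) / 2 * w
end

println(F ≈ dft(f))
```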
Calculation in place
The array \(h\), which is presented previously, is complex-valued. However the input signal is real-valued and twice as long. The trick is to use two cells of the initial array to store a complex element of \(h\). It is useful to do the calculations with complex numbers before starting to write code. For the core of the FFT, if we note \(x_i\) the array at step \(i\) of the main loop, we have:
\[
\begin{aligned}
\text{Re}(x_{i+1}[k]) &= \text{Re}(x_{i}[k]) + \text{Re}(e^{-2i\pi j/n_2})\text{Re}(x_i[k+n_1]) - \text{Im}(e^{-2i\pi j/n_2})\text{Im}(x_i[k+n_1])\\
\text{Im}(x_{i+1}[k]) &= \text{Im}(x_{i}[k]) + \text{Im}(e^{-2i\pi j/n_2})\text{Re}(x_i[k+n_1]) + \text{Re}(e^{-2i\pi j/n_2})\text{Im}(x_i[k+n_1])\\
\text{Re}(x_{i+1}[k+n_1]) &= \text{Re}(x_{i}[k]) - \text{Re}(e^{-2i\pi j/n_2})\text{Re}(x_i[k+n_1]) + \text{Im}(e^{-2i\pi j/n_2})\text{Im}(x_i[k+n_1])\\
\text{Im}(x_{i+1}[k+n_1]) &= \text{Im}(x_{i}[k]) - \text{Im}(e^{-2i\pi j/n_2})\text{Re}(x_i[k+n_1]) - \text{Re}(e^{-2i\pi j/n_2})\text{Im}(x_i[k+n_1])
\end{aligned}
\]
With the organization we choose, we can replace \(\text{Re}(x[k])\) with \(x[2k]\) and \(\text{Im}(x[k])\) with \(x[2k+1]\). We also note that we can replace \(\text{Re}(x[k+n_1])\) with \(x[2(k+n_1)]\) or even better \(x[2k+n_2]\).
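This interleaved layout (real part, then imaginary part) is also the one Julia uses for arrays of complex numbers, which can be seen with Base's reinterpret. The sketch is only an illustration of the storage convention; with Julia's 1-based indices, complex element \(k\) occupies cells \(2k-1\) and \(2k\).

```julia
x = [1.0, 2.0, 3.0, 4.0]         # packed as Re, Im, Re, Im
z = reinterpret(ComplexF64, x)   # view of the same memory, no copy
println(z[1], " ", z[2])

x[3] = 10.0                      # writing through x is visible through z
println(z[2])
```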
The last step is the recombination of \(h\) to find the final result (here \(h\) denotes the array after the in-place transform, i.e. \(\hat{h}\) in property 2). The formula of property 2 becomes, after an unpleasant but uncomplicated calculation:
\[
\begin{aligned}
\text{Re}(\hat{x}[k]) &= 1/2 \times \Big(\text{Re}(h[k]) + \text{Re}(h[N/2-k]) + \text{Im}(h[k])\text{Re}(e^{-2i\pi k/N}) + \text{Re}(h[k])\text{Im}(e^{-2i\pi k/N})...\\
&... + \text{Im}(h[N/2-k])\text{Re}(e^{-2i\pi k/N}) - \text{Re}(h[N/2-k])\text{Im}(e^{-2i\pi k/N})\Big)\\
\text{Im}(\hat{x}[k]) &= 1/2 \times \Big(\text{Im}(h[k]) - \text{Im}(h[N/2-k]) - \text{Re}(h[k])\text{Re}(e^{-2i\pi k/N}) + \text{Im}(h[k])\text{Im}(e^{-2i\pi k/N})...\\
&... + \text{Re}(h[N/2-k])\text{Re}(e^{-2i\pi k/N}) + \text{Im}(h[N/2-k])\text{Im}(e^{-2i\pi k/N})\Big)
\end{aligned}
\]
There is a particular case where this formula does not work: when \(k=0\), the index \(N/2-k\) falls outside the array \(h\), which contains only \(N/2\) elements. However we can use the periodicity of the discrete Fourier transform to see that \(h[N/2]=h[0]\). The case \(k=0\) then simplifies enormously:
\[
\begin{aligned}
\text{Re}(\hat{x}[0]) &= \text{Re}(h[0]) + \text{Im}(h[0])\\
\text{Im}(\hat{x}[0]) &= 0
\end{aligned}
\]
To perform the calculation in place, it is useful to be able to calculate \(\hat{x}[N/2-k]\) at the same time that we calculate \(\hat{x}[k]\). Reusing the previous results and the fact that \(e^{-2i\pi(N/2-k)/N}=-e^{2i\pi k/N}\), we find:
\[
\begin{aligned}
\text{Re}(\hat{x}[N/2-k]) &= 1/2 \times \Big(\text{Re}(h[N/2-k]) + \text{Re}(h[k]) - \text{Im}(h[N/2-k])\text{Re}(e^{-2i\pi k/N})...\\
&... + \text{Re}(h[N/2-k])\text{Im}(e^{-2i\pi k/N}) - \text{Im}(h[k])\text{Re}(e^{-2i\pi k/N}) - \text{Re}(h[k])\text{Im}(e^{-2i\pi k/N})\Big)\\
\text{Im}(\hat{x}[N/2-k]) &= 1/2 \times \Big(\text{Im}(h[N/2-k]) - \text{Im}(h[k]) + \text{Re}(h[N/2-k])\text{Re}(e^{-2i\pi k/N})...\\
&... + \text{Im}(h[N/2-k])\text{Im}(e^{-2i\pi k/N}) - \text{Re}(h[k])\text{Re}(e^{-2i\pi k/N}) + \text{Im}(h[k])\text{Im}(e^{-2i\pi k/N})\Big)
\end{aligned}
\]
After this little unpleasant moment, we are ready to implement a new version of the FFT!
An FFT for the reals
Since the actual computation of the FFT is done on an array that is half the size of the input array, we need a function to compute the inverted index on 9 bits to be able to continue testing on 1024 points.
bit_reverse(::Val{9}, num) = begin
  num = ((num&0x1e0)>>5)|(num&0x010)|((num&0x00f)<<5)
  num = ((num&0x18c)>>2)|(num&0x010)|((num&0x063)<<2)
  ((num&0x14a)>>1)|(num&0x010)|((num&0x0a5)<<1)
end


 


To complete the other methods of bit_reverse we can use the following implementations:
bit_reverse(::Val{31}, num) = begin
  bit_reverse(Val(15), (num&0x7fff0000)>>16)| (num&0x00008000) |(bit_reverse(Val(15), num&0x00007fff)<<16)
end
bit_reverse(::Val{15}, num) = bit_reverse(Val(7), (num&0x7f00)>>8)| (num&0x0080)|(bit_reverse(Val(7),num&0x007f)<<8)
bit_reverse(::Val{7}, num) = bit_reverse(Val(3), (num&0x70)>>4 )| (num&0x08) |(bit_reverse(Val(3), num&0x07)<<4)


To take into account the specificities of the representation of the complexes we use, we implement a new version of reverse_bit_order.
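function reverse_bit_order_double!(x, order)
  N = length(x)
  for i in 0:(N÷2-1)
    j = bit_reverse(order, i)
    if i<j # swap the (real, imaginary) pairs of complex indices i and j, only once
      x[2*i+1],x[2*j+1]=x[2*j+1],x[2*i+1]
      x[2*i+2],x[2*j+2]=x[2*j+2],x[2*i+2]
    end
  end
end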

This leads to the new FFT implementation.
function my_fft_3(x)
  N = length(x) ÷ 2
  order = Int(log2(N))
  @inbounds reverse_bit_order_double!(x, Val(order))
  
  n₁ = 0
  n₂ = 1
  for i=1:order # i is the index of the column we are in.
    n₁ = n₂ # n₁ = 2ⁱ-¹
    n₂ *= 2 # n₂ = 2ⁱ
    
    step_angle = -2π/n₂
    angle = 0
    for j=1:n₁ # j is the index in Xᵉ and Xᵒ
      re_factor = cos(angle)
      im_factor = sin(angle)
      angle += step_angle # a = -2π*j/n₂
      
      # We combine element j from each group of subarrays
      @inbounds for k=j:n₂:N
        re_xₑ = x[2*k-1]
        im_xₑ = x[2*k]
        re_xₒ = x[2*(k+n₁)-1]
        im_xₒ = x[2*(k+n₁)]
        x[2*k-1] = re_xₑ + re_factor*re_xₒ - im_factor*im_xₒ
        x[2*k] = im_xₑ + im_factor*re_xₒ + re_factor*im_xₒ
        x[2*(k+n₁)-1] = re_xₑ - re_factor*re_xₒ + im_factor*im_xₒ
        x[2*(k+n₁)] = im_xₑ - im_factor*re_xₒ - re_factor*im_xₒ      
      end
    end
  end
  # We build the final version of the FT.
  # N is half the size of x.
  # Special case n=0
  x[1] = x[1] + x[2]
  x[2] = 0  
  
  step_angle = -π/N
  angle = step_angle
  @inbounds for n=1:(N÷2)
    re_factor = cos(angle)
    im_factor = sin(angle)
    re_h = x[2*n+1]
    im_h = x[2*n+2]
    re_h_sym = x[2*(N-n)+1]
    im_h_sym = x[2*(N-n)+2]
    x[2*n+1] = 1/2*(re_h + re_h_sym + im_h*re_factor + re_h*im_factor + im_h_sym*re_factor - re_h_sym*im_factor)
    x[2*n+2] = 1/2*(im_h - im_h_sym - re_h*re_factor + im_h*im_factor + re_h_sym*re_factor + im_h_sym*im_factor)
    x[2*(N-n)+1] = 1/2*(re_h_sym + re_h - im_h_sym*re_factor + re_h_sym*im_factor - im_h*re_factor - re_h*im_factor)
    x[2*(N-n)+2] = 1/2*(im_h_sym - im_h + re_h_sym*re_factor + im_h_sym*im_factor - re_h*re_factor + im_h*im_factor)
    angle += step_angle
  end
  x
end
We can now check the performance of the new implementation:
@benchmark fft!(x) setup=(x = rand(1024) .|> complex)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.628 μs … 455.827 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.664 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.799 μs ±   5.338 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
      ▃██▅▁ ▂▂
  ▁▁▂▃█████▆██▆▃▃▃▂▂▂▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▁▁▁▁▂▂▂▂▁▁▁▂▂▂▁▁▁▁▁▁▁▁▁ ▂
  16.6 μs         Histogram: frequency by time         29.3 μs <
 Memory estimate: 304 bytes, allocs estimate: 4.
@benchmark my_fft_3(x) setup=(x = rand(1024))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  29.024 μs … 129.484 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     33.435 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.648 μs ±   5.990 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
   ▅ █  ▁      ▅
  ▁█▁█▄▁█▄▂▇█▂▁█▄▂▁▄▂▁▁▃▁▁▁▁▃▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  29 μs           Histogram: frequency by time         59.8 μs <
 Memory estimate: 0 bytes, allocs estimate: 0.
This is a very good result!
Optimization of trigonometric functions
If we analyze the execution of my_fft_3 using Julia's profiler, we can see that most of the time is spent computing trigonometric functions and creating the StepRange objects used in for loops. The second problem can be easily circumvented by using while loops. For the first one, in Numerical Recipes we can read (section 5.4 "Recurrence Relations and Clenshaw's Recurrence Formula", page 219 of the third edition):

If your program's running time is dominated by evaluating trigonometric functions, you are probably doing something wrong.  Trig functions whose arguments form a linear sequence \(\theta = \theta_0 + n\delta, n=0,1,2...\) ,  are efficiently calculated by the recurrence 

\[\begin{aligned}\cos(\theta + \delta) &= \cos\theta - [\alpha \cos\theta + \beta\sin\theta]\\\sin(\theta + \delta) &= \sin\theta - [\alpha\sin\theta - \beta\cos\theta]\end{aligned}\]

Where \(\alpha\) and \(\beta\) are the precomputed coefficients \(\alpha = 2\sin^2\left(\frac{\delta}{2}\right),\;\beta=\sin\delta\)


 


  This can be shown using the classical trigonometric identities:
\[
\begin{aligned}
\cos(\theta+\delta) =& \cos\theta\cos\delta - \sin\theta\sin\delta\\
=& \cos\theta\left[2\cos^2\frac{\delta}{2} - 1\right] - \sin\theta\sin\delta\\
=& \cos\theta\left[2(1-\sin^2\frac{\delta}{2}) - 1\right] - \sin\theta\sin\delta\\
=& \cos\theta - [\underbrace{2\sin^2\frac{\delta}{2}}_{=\alpha}\cos\theta + \underbrace{\sin\delta}_{=\beta}\sin\theta]
\end{aligned}
\]
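The second formula follows directly: since \(\sin x = \cos(x-\frac{\pi}{2})\), we apply the first formula to the angle \(\theta - \frac{\pi}{2}\), using \(\cos(\theta-\frac{\pi}{2}) = \sin\theta\) and \(\sin(\theta-\frac{\pi}{2}) = -\cos\theta\).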
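To see the recurrence at work outside of the FFT, here is a minimal sketch. The function name `trig_by_recurrence` and the chosen step are mine, for illustration only. It walks the angles \(n\delta\) with the recurrence and compares the result with direct calls to `cos` and `sin`.

```julia
# Computes cos(k*δ) and sin(k*δ) for k = 0, ..., n-1 using the recurrence,
# calling sin only twice in total (to precompute α and β).
function trig_by_recurrence(δ, n)
  α = 2 * sin(δ / 2)^2
  β = sin(δ)
  c, s = 1.0, 0.0 # cos(0) and sin(0)
  cosines = Vector{Float64}(undef, n)
  sines = Vector{Float64}(undef, n)
  for k in 1:n
    cosines[k] = c
    sines[k] = s
    # Tuple assignment: both right-hand sides use the old values of c and s.
    c, s = c - (α * c + β * s), s - (α * s - β * c)
  end
  cosines, sines
end

δ = -2π / 1024
cosines, sines = trig_by_recurrence(δ, 512)
angles = δ .* (0:511)

# Largest deviation from the directly computed values.
err_cos = maximum(abs.(cosines .- cos.(angles)))
err_sin = maximum(abs.(sines .- sin.(angles)))
```

Both errors should stay at the level of accumulated rounding error, which is what lets us drop the trigonometric calls from the inner loops.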

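This form of the recurrence is also interesting in terms of numerical stability: unlike the plain angle-addition formulas, \(\alpha\) and \(\beta\) do not lose significance when \(\delta\) is small. We can directly implement a final version of our FFT using these relations.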
function my_fft_4(x)
  N = length(x) ÷ 2
  order = Int(log2(N))
  @inbounds reverse_bit_order_double!(x, Val(order))
  
  n₁ = 0
  n₂ = 1
  
  i = 1
  while i<=order # i is the index of the column we are in.
    n₁ = n₂ # n₁ = 2ⁱ-¹
    n₂ *= 2 # n₂ = 2ⁱ
    
    step_angle = -2π/n₂
    α = 2sin(step_angle/2)^2
    β = sin(step_angle)
    cj = 1
    sj = 0
    j = 1
    while j<=n₁ # j is the index in Xᵉ and Xᵒ
      # We combine the element j from each group of subarrays
      k = j
      @inbounds while k<=N
        re_xₑ = x[2*k-1]
        im_xₑ = x[2*k]
        re_xₒ = x[2*(k+n₁)-1]
        im_xₒ = x[2*(k+n₁)]
        x[2*k-1] = re_xₑ + cj*re_xₒ - sj*im_xₒ
        x[2*k] = im_xₑ + sj*re_xₒ + cj*im_xₒ
        x[2*(k+n₁)-1] = re_xₑ - cj*re_xₒ + sj*im_xₒ
        x[2*(k+n₁)] = im_xₑ - sj*re_xₒ - cj*im_xₒ       
        
        k += n₂
      end
      # We compute the next cosine and sine.
      cj, sj = cj - (α*cj + β*sj), sj - (α*sj-β*cj)
      j+=1
    end
    i += 1
  end
  # We build the final version of the Fourier transform
  # N is half the size of x
  # Special case n=0
  x[1] = x[1] + x[2]
  x[2] = 0  
  
  step_angle = -π/N
  α = 2sin(step_angle/2)^2
  β = sin(step_angle)
  cj = 1
  sj = 0
  j = 1
  @inbounds while j<=(N÷2)
    # We calculate the cosine and sine before the main calculation here to compensate for the first
    # step of the loop that was skipped.
    cj, sj = cj - (α*cj + β*sj), sj - (α*sj-β*cj)
    
    re_h = x[2*j+1]
    im_h = x[2*j+2]
    re_h_sym = x[2*(N-j)+1]
    im_h_sym = x[2*(N-j)+2]
    x[2*j+1] = 1/2*(re_h + re_h_sym + im_h*cj + re_h*sj + im_h_sym*cj - re_h_sym*sj)
    x[2*j+2] = 1/2*(im_h - im_h_sym - re_h*cj + im_h*sj + re_h_sym*cj + im_h_sym*sj)
    x[2*(N-j)+1] = 1/2*(re_h_sym + re_h - im_h_sym*cj + re_h_sym*sj - im_h*cj - re_h*sj)
    x[2*(N-j)+2] = 1/2*(im_h_sym - im_h + re_h_sym*cj + im_h_sym*sj - re_h*cj + im_h*sj)
    
    j += 1
  end
  x
end
We can check that we still get the right result:
a = rand(1024)
b = fft(a)
c = my_fft_4(a)
real.(b[1:end÷2]) ≈ c[1:2:end] && imag.(b[1:end÷2]) ≈ c[2:2:end]
true
In terms of performance, we finally managed to outperform the reference implementation!
@benchmark fft!(x) setup=(x = rand(1024) .|> complex)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.991 μs … 660.685 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.090 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.367 μs ±   7.360 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▂▁▂▇█▃
  ▃▆████████▆▅▅▄▄▄▄▃▃▄▄▃▃▃▃▂▃▃▃▃▂▂▃▃▂▂▂▃▃▃▃▃▃▂▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▂ ▃
  17 μs           Histogram: frequency by time         31.9 μs <

 Memory estimate: 304 bytes, allocs estimate: 4.
@benchmark my_fft_4(x) setup=(x = rand(1024))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  12.025 μs … 77.621 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.654 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.781 μs ±  2.731 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ █▂▃▅ ▅▃ ▅▁ ▂▃  ▄  ▁▃▁ ▁▄▂ ▂ ▄▂                            ▂
  █▇████▅██▁██▃██▃▆██▆███▇███▄████▄▄▁▃▄▄▅▄▁▄▄▅▃▁▄▄▃▄▄▃▄▄▄▄▁▄▅ █
  12 μs        Histogram: log(frequency) by time      25.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

    
[4] In practice we can always reduce to this case by padding with zeros.
    


    
[5] MSB and LSB stand for Most Significant Bit and Least Significant Bit. In a number represented on \(n\) bits, the MSB is the bit that carries the information on the highest power of 2 (\(2^{n-1}\)) while the LSB carries the information on the lowest power of 2 (\(2^0\)). Concretely the MSB is the leftmost bit of the binary representation of a number, while the LSB is the rightmost.
    



If we compare the different implementations proposed in this tutorial as well as the two reference implementations, and then plot the median values of execution time, memory footprint and number of allocations, we obtain the following plot:

  
Benchmark of the different solutions: median values.
I added the function FFTW.rfft, which is supposed to be optimized for real inputs. We can see that in practice, unless you work on very large arrays, it does not bring much of a performance gain.
We can see that the last versions of the algorithm are very good in terms of number of allocations and memory footprint. In terms of execution time, the reference implementation ends up being faster on very large arrays.
How can we explain these differences, especially between our latest implementation and the implementation in FFTW? A few possible explanations:

FFTW solves a much larger problem. Our implementation is "naive" in the sense that, for example, it only works on input arrays whose size is a power of two, and even then only for the sizes for which we took the trouble to implement a method of the bit_reverse function. The bit-reversal permutation is a bit more complicated to solve in the general case. Moreover, FFTW performs well on many types of architectures, offers discrete Fourier transforms in multiple dimensions, etc. If you are interested in the subject, I recommend this article[6], which presents the internal workings of FFTW.

The representation of the complex numbers plays in our favor. Our implementation never has to convert between representations. This shows in particular in the test code, where we have to extract the real and imaginary parts of the reference transform ourselves:


real.(b[1:end÷2]) ≈ c[1:2:end] && imag.(b[1:end÷2]) ≈ c[2:2:end]
true

Our algorithm was not designed with numerical stability in mind. This is an aspect that could still be improved. Also, we did not test it on anything other than noise. However, the following block presents some tests that suggest that it "behaves well" for some test functions.



 



function test_signal(s)
  # my_fft_4 works in place, so the reference transform is computed first.
  b = fft(s)
  c = my_fft_4(s)
  real.(b[1:end÷2]) ≈ c[1:2:end] && imag.(b[1:end÷2]) ≈ c[2:2:end]
end
t = range(-10, 10; length=1024)
y = @. exp(-t^2)
noise = rand(1024)
test_signal(y .+ noise)
true
t = range(-10, 10; length=1024)
y = @. sin(t)
noise = rand(1024)
test_signal(y .+ noise)
true


These simplifications and special cases allow our implementation to gain a lot in speed. This makes the implementation of FFTW all the more remarkable, as it still performs very well!
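To make the point about the representation of complex numbers concrete: our functions store their result as interleaved real and imaginary parts in a plain real array. Here is a minimal sketch of that layout, using Base's `reinterpret` (which the tutorial's code does not use) to view such an array as complex numbers without copying. The values are arbitrary.

```julia
# Interleaved layout: re₀, im₀, re₁, im₁
c = [1.0, 2.0, 3.0, 4.0]

# Same memory, seen as two complex numbers (no copy is made).
z = reinterpret(ComplexF64, c)

# The strided views used in the tests select the same parts.
same_real = real.(z) == c[1:2:end]
same_imag = imag.(z) == c[2:2:end]
```

A caller that wants complex numbers out of our implementation would pay for this conversion somewhere. Our benchmarks simply never do.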

    
[6] Frigo, M. & Johnson, S. G. (2005). The Design and Implementation of FFTW3. Proceedings of the IEEE, 93, 216-231. doi:10.1109/JPROC.2004.840301.
    

At the end of this tutorial I hope to have helped you to understand the mechanisms that make the FFT computation work, and to have shown how to implement it efficiently, modulo some simplifications. Personally, writing this tutorial has allowed me to realize the great qualities of FFTW, the reference implementation, that I use every day in my work!
This should allow you to understand that for some use cases, it can be interesting to implement and optimize your own FFT. An application that has been little discussed in this tutorial is the calculation of convolution products. An efficient method when convolving signals of comparable length is to do so by multiplying the two Fourier transforms and then taking the inverse Fourier transform. In this case, since the multiplication is done term by term, it is not necessary that the Fourier transform is ordered. One could therefore imagine a special implementation that would skip the reverse bit permutation part.
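As a small illustration of the convolution idea, here is a sketch that checks it on two short signals. To stay self-contained it uses a naive \(O(N^2)\) transform instead of the FFTs from this tutorial. The helpers `naive_dft`, `naive_idft` and `circular_convolution` are hypothetical names introduced only for this example. Note that multiplying transforms gives a circular convolution, so a linear convolution requires padding both signals with zeros first.

```julia
# Naive O(N²) DFT and inverse DFT, standing in for a real FFT.
function naive_dft(x)
  N = length(x)
  [sum(x[n+1] * cis(-2π * k * n / N) for n in 0:N-1) for k in 0:N-1]
end
function naive_idft(X)
  N = length(X)
  [sum(X[k+1] * cis(2π * k * n / N) for k in 0:N-1) for n in 0:N-1] ./ N
end

# Direct circular convolution, used as the reference.
function circular_convolution(a, b)
  N = length(a)
  [sum(a[m+1] * b[mod(n - m, N)+1] for m in 0:N-1) for n in 0:N-1]
end

a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, -1.0, 0.0, 2.0]

# Term-by-term product of the two transforms, then inverse transform.
via_dft = real.(naive_idft(naive_dft(a) .* naive_dft(b)))
direct = circular_convolution(a, b)
```

Here `via_dft ≈ direct` should hold up to rounding error.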
Another improvement that could be made concerns the calculation of the inverse Fourier transform. It is a very similar calculation (only the multiplicative coefficients change), and can be a good exercise to experiment with the codes given in this tutorial.
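One way to start experimenting, without touching the coefficients at all, is the conjugation identity \(x = \frac{1}{N}\overline{\mathrm{DFT}(\overline{X})}\). This is a different route from the one suggested above. The sketch below uses a naive \(O(N^2)\) transform as a stand-in for any forward FFT, and the names `naive_dft` and `inverse_via_forward` are hypothetical.

```julia
# Naive O(N²) forward DFT, standing in for any forward FFT.
function naive_dft(x)
  N = length(x)
  [sum(x[n+1] * cis(-2π * k * n / N) for n in 0:N-1) for k in 0:N-1]
end

# Inverse transform built only from a forward one:
# conjugate, transform, conjugate again, then scale by 1/N.
inverse_via_forward(X, forward) = conj.(forward(conj.(X))) ./ length(X)

x = [1.0, -2.0, 0.5, 3.0]
X = naive_dft(x)
x_back = inverse_via_forward(X, naive_dft)
```

The round trip should recover the original signal up to rounding error, with a negligible imaginary part.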
Finally, I want to thank @Gawaboumga, @Næ, @zeqL and @luxera for their feedback on the beta of this tutorial, and @Gabbro for the validation on zestedesavoir.com!

\(k\)	Binary representation of \(k\)	Reverse binary representation	Index of \(\hat{f}[k]\)
0	000	000	0
1	001	100	4
2	010	010	2
3	011	110	6
4	100	001	1
5	101	101	5
6	110	011	3
7	111	111	7

