Tobias van der Werff

I Hacked My Dehumidifier to Control it Over WiFi

2025-03-25T00:00:00+00:00

See discussion on Reddit.

—

Many modern devices are connected to the internet – thermostats, light bulbs, vacuum cleaners, you name it. It’s certainly convenient to control your house’s thermostat using your phone, or to have a robot vacuum your house. It’s also easy to take this too far, but it’s still a fun gimmick to control your devices from an app on your smartphone.

Something I’ve been wondering lately is: how hard would it be to make your own smart devices? Commercial smart devices typically harvest lots of data about you, which they subsequently transmit over the internet, which is less than ideal for privacy reasons. If you make your own smart device instead, you have full control over what the device can and can’t do. This means you wouldn’t have to install spyware on your phone or give out all your personal information just to turn on your CO2 monitor.

This CO2 monitor with thousands of positive reviews demanded that I create an account, download their app and allow precise location information before it would report the amount of CO2 in my room. – Andrej Karpathy (I love calculator)

Making my own smart device seemed like a fun challenge. I’ve been teaching myself electronics in the past year, so as a learning project, I wanted to turn one of my “dumb” household appliances into one that could be monitored and controlled remotely.

First, I needed to pick a device to work on. I recently bought a dehumidifier, which is a device that extracts moisture from the air to control humidity levels. (I live in a house with a bad humidity problem.) This particular device does not have any native wireless capabilities, so it seemed like a good candidate for my project.

This poor dehumidifier will be my lab rat for this project.

The basic idea would be to turn the device on and off remotely, using some kind of wireless protocol. So, let’s say, I want to turn the dehumidifier on, but I’m too lazy to get off the couch. All I would need to do to control it is to send a command using my phone or laptop. This way, I don’t have to walk five meters, which means I can preserve precious energy for other tasks.

The first order of business was to open the dehumidifier up to see what we’re dealing with. On the top of the device are a numeric display, four buttons, and some small LEDs showing various status information, like the speed of the fan, or whether the water tank needs to be emptied.

The top of the dehumidifier.

After removing about ten screws and unlocking some plastic clips, the inside of the device looks like this. The image shows a fan at the top, and on the bottom an enclosure with an LED screen (for displaying the humidity level), next to four spring-controlled buttons for interacting with the device.

Just for some background, here’s the basic TL;DR for how a dehumidifier works. First, it uses a fan to draw in humid air. This air is passed over a cooling surface, which condenses the moisture into liquid water. The condensed water collects in a tank or is drained away. After the moisture is removed, the now-dry air is reheated and released back into the room by the fan. Obviously, this mechanism is actuated by a bunch of electronic components in the device, which we can access by opening it up.

Coming back to the disassembling, we can see the main circuitry of the device when we remove its enclosure:

Much of the circuitry in the top of the image is dedicated to stepping down the 220-240 V AC mains input to a lower-voltage DC output. The mains power comes in from the left of the image and is converted to the 5V DC supply required for the fan and other things like the button controls.

The main thing I was interested in, though, was the set of spring-based buttons next to the LED display. These four springs correspond to the four buttons the user can press on the top of the device. They are normally covered with a plastic surface, and actuated by pressing down on the plastic at the location of the button.

In order not to over-complicate things, I wanted to focus on controlling the power button specifically. My idea was as follows: If I could somehow simulate the electrical signal that the spring connected to the power button generates, I could use this to remotely trigger the on/off switch.

To find out the mechanism by which the power button operates, I first had to figure out what the spring was connected to on the other side of the PCB. Disconnecting the PCB and turning it over reveals the main microcontroller as well as the wiring around it:

This board is relatively straightforward. You have a microcontroller in the middle, which acts as the brains of the device, sending and receiving signals to other components. Then, you mostly have a bunch of resistors and capacitors around the microcontroller, with traces leading to the LEDs and buttons on the other side of the board. It’s hard to figure out exactly what the microcontroller is doing (it’s quite literally a black box) so instead, we can look at the wiring to and from the chip to figure out what is going on.¹

We can start by looking at the points where the springs are connected, shown in the image below. The connection on the right leads to the spring-based power button we’re interested in.

If you look closely at the image, there is a connection (also called a trace) along the top of the board, coming from the spring connection, through a resistor, and into the microcontroller. What this most likely means is that the power signal for the device gets sent on this trace. Therefore, we can start by manually injecting a signal here to figure out how we can trigger the power signal ourselves.

I was definitely confused by the spring-based buttons, and it took me a while to figure out how they worked. At first I thought that pressing down on the spring would somehow make contact with an underlying conductive path, which would pull a specific signal high or low (depending on what the microcontroller responds to). But one of the most confusing things was that I could trigger the switch simply by making contact with the pads, even without using any electrical signals. For example, by touching the pad with the leads of my multimeter when it was turned off, or with my finger. It was clear that somehow, the switch was extremely sensitive to any kind of contact or electrical signal, no matter how small.

After some research, I figured out the spring is actually a capacitive touch sensor. They’re a bit confusing to understand, but the basic mechanism is that when your finger makes contact with the button, it forms a tiny capacitor with the spring, which is picked up by the circuit and interpreted as a button press. These kinds of sensors are commonly used for appliances that require some kind of “touch” buttons as a replacement for mechanical push switches.

Doing an online search quickly revealed similar looking sensors, like this product from AliExpress:

Unfortunately, it’s apparently not straightforward to control these capacitive touch sensors if you want to modify them. Luckily, I found this great YouTube video that shows a circuit design which is capable of controlling them nonetheless. This is what the circuit diagram looks like:

The basic idea is that the two 1N4148 diodes act as switches. When the mechanical switch is closed, the two diodes are forward biased and pass a tiny bit of current to the capacitive sensor, which triggers it. When the switch is open, the point between the two diodes has very high impedance, and no current can pass. The video explains the circuit in more detail.

I put together a breadboard prototype and lo and behold, I was now able to successfully trigger the switch!

Testing the circuit…

I added a 2N3904 transistor to the circuit to act as an electronic switch rather than a mechanical one, with an added resistor at the base pin.² I then put all the components on a perfboard and soldered them together. The result looks like this:

Now I just needed a way to send a signal to the dehumidifier wirelessly. After some research, I settled on the ESP32 microcontroller. This device is somewhat similar to an Arduino, but with integrated Bluetooth and WiFi. You can write custom programs in, for example, C or Python, and load them onto the microcontroller using a USB connection. I was actually amazed to find out how cheap these little devices are – you can pick them up for not much more than $5 on AliExpress.

The program for the ESP32 is quite simple. It spawns a basic HTTP server which allows it to receive incoming requests over the WiFi network. Whenever it receives a request, it sends a signal from one of its GPIO pins (in this case GPIO 5) to the base of the transistor to trigger it, which sends the signal to the capacitive sensor through the custom circuit I just showed you.
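To make the request-to-GPIO flow concrete, here is a minimal sketch of the logic in C++. It is not the actual firmware from this project: the "/toggle" route, the 200 ms hold time, and all function names are assumptions made up for illustration. Only GPIO 5 comes from the description above. The pin and delay operations are passed in as callbacks so the logic can be checked on a regular computer; on an ESP32 with the Arduino core they would presumably wrap digitalWrite and delay.

```cpp
#include <cassert>
#include <functional>
#include <string>

// GPIO 5 comes from the post; every other name and value is hypothetical.
constexpr int kButtonPin = 5;
constexpr int kHoldMs = 200;  // assumed press duration, not from the post

using PinWriter = std::function<void(int pin, bool level)>;
using Delay = std::function<void(int ms)>;

// Returns true when the request line (e.g. "GET /toggle HTTP/1.1")
// targets the assumed "/toggle" route.
bool is_toggle_request(const std::string& request_line) {
    const std::size_t first = request_line.find(' ');
    if (first == std::string::npos) return false;
    const std::size_t second = request_line.find(' ', first + 1);
    if (second == std::string::npos) return false;
    return request_line.substr(first + 1, second - first - 1) == "/toggle";
}

// Emulates one button press: drive the transistor base high, hold, release.
void press_button(const PinWriter& write_pin, const Delay& delay_ms) {
    assert(write_pin && delay_ms);
    write_pin(kButtonPin, true);
    delay_ms(kHoldMs);
    write_pin(kButtonPin, false);
}

// Handles one request; returns true if the button was pressed.
bool handle_request(const std::string& request_line,
                    const PinWriter& write_pin, const Delay& delay_ms) {
    if (!is_toggle_request(request_line)) return false;
    press_button(write_pin, delay_ms);
    return true;
}
```

Keeping the hardware access behind two small callbacks is a design choice for the sketch only: it lets the routing and pulse logic be tested without a board attached.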

The final circuit diagram looks like this:

There’s actually a mistake in this circuit, which I only found out about after I had soldered it in place. Luckily for me, it still worked.³

After some more testing to verify that things worked correctly, it was time to solder everything together. The perfboard needs to connect to the ESP32, and they both need to receive 5V power from the dehumidifier. Luckily for me, there was an unused port on the dehumidifier PCB which gave me a 5V power source.⁴

Not the prettiest soldering, but hey, it works!

I also had to find a place to mount the new components. There was a neat little empty space under the PCB that could nicely fit all my components.

The components placed into the dehumidifier. The ESP32 is on the left, and the custom circuit on the right.

After putting back the other PCB, the new components now snugly fit into the box:

I created a simple web app (which is just a big green button) for controlling the device. It works surprisingly well! Here’s a demo of the final result:

Conclusion

This was a fun project to work on. If I want to expand the functionality, I could add more sensors to the ESP32, e.g. for temperature and humidity, and turn the web interface into a proper dashboard. I have a bag of cheap sensors from AliExpress lying around – maybe I’ll try connecting those at some point.

It can be pretty intimidating to modify a device like this because there are lots of unknown unknowns (who knows what you could break?), especially for someone like me with limited experience in electronics. Needless to say, I was pleasantly surprised by how well the whole thing works! It’s great to see all the individual pieces come together like that.

¹ Another option may have been to listen in on the UART communication sent on the Rx and Tx pins to figure out what the microcontroller is doing.
² I actually had to try out a bunch of different resistor values to find one that sufficiently saturated the transistor. I initially tried to calculate this resistor value using theory, but eventually came to the conclusion it was much more effective to simply measure the voltage drop across the transistor using a multimeter in order to find the best resistor value.
³ I had contact with Leo, the author of the YouTube video, and apparently I used his circuit in a way that he didn’t intend at all. This is because his circuit was designed to connect capacitively to the circuit, not with a direct connection, which was what I was using. But somehow, it still works! I asked Claude 3.7 to analyze this circuit and it says transient voltage changes may be the reason why this circuit still works. According to Leo, a better solution would be to put a small capacitor (20 pF or so) in series with the wire to the sensor, to block DC to the dehumidifier PCB. Anyway, you should probably take my circuit design with a grain of salt, because I barely know what I’m doing.
⁴ Fun fact: this port was labeled “WFI” (i.e. WiFi?). Perhaps the original design of the dehumidifier actually included WiFi control!

The Age of AI-Powered Curiosity

2024-12-01T00:00:00+00:00

When I was a kid, I used to ask my mother all kinds of questions about the world, asking things like Why is the sky blue? or Why does the sun disappear at night? Although my mother is a patient woman, she would usually get tired of my never-ending stream of why questions pretty quickly. Usually, this meant that she would shut down the conversation with a simple “because that’s just how it is”.

Obviously my mom is not to blame for not knowing every single random fact I was interested in at that age. Since I could not find most of the answers I was looking for, I remember fantasizing about having a magical electronic device – a handheld device, like a Game Boy – that could answer all my questions about the world. I would simply type in a question and get the answer back from the device. No subject would be off limits. Such an oracle of knowledge seemed to me like true magic, and I never seriously considered that such a device could actually exist.

But the magical technology I dreamt of as a kid actually exists in the world we live in today.

Enter LLMs

What I envisioned then exists today in the form of large language models – extremely large neural networks trained on massive amounts of internet data. This training has given these models vast knowledge of the world. To the degree that information can be found on the internet, LLMs have memorized it. The addition of RLHF (Reinforcement Learning from Human Feedback) has made interaction with these models as accessible as having a conversation with another person. The result is nothing short of spectacular, and I believe the world has yet to realize the true potential of this technology.

One of my favorite use cases for LLMs in day-to-day life is to use them as a tool for exploration and curiosity-based research. This can be for the purpose of solving a concrete problem, or just being curious about how things work and wanting to dive deeper. For instance, I recently found myself wondering about several questions:

“Why does money become worth less over time?”
“What makes kerosene suitable for airplane fuel?”
“Why don’t we have electric airplanes?”

If you were to do an internet search, especially for very specific questions, you would typically need some time to find a good answer. However, these days, I simply ask an LLM:

Claude 3.5 Sonnet answering my random questions about the world. Source: Perplexity.

The best part: I can get answers to questions like this in a matter of seconds. I’m only limited by my typing speed, or I can even ask my questions out loud using the speech-to-text feature that most LLM providers offer. Better still, you can ask as many detailed follow-up questions as you want. Learning becomes personalized, since we can all ask the questions we personally find most interesting. Above all, this makes learning so much more fun. With an LLM assistant, you can have an incredibly short feedback loop that allows you to dive into any aspect of the answer you find most interesting.

This short feedback loop is in stark contrast to how information used to be accessed for most of human history. If we think about the evolution of the speed of information retrieval, I imagine it would have gone something like this:

Era	Method of Information Retrieval	Estimated Time
Pre-Printing Press (before 1450)	Find an expert human to answer your question	Weeks-years
Print Era (1450-1990)	Search through books	Hours-days
Internet Age (Post-2000s)	Google search and browse websites	Minutes
AI Age (2020-present)	Ask an LLM	Seconds

What this means is that access to information used to be highly dependent on your geographical location, whereas in current times, all you need is an internet connection and a few seconds of your time! Furthermore, in the age of AI, the quality of the information retrieval will also be higher, since answers are synthesized directly based on questions.¹

Finally – and this might sound a bit silly – I think another great aspect of LLMs is that they don’t judge you. I imagine most people can relate to the feeling of not asking certain basic questions because we don’t want to appear dumb or ignorant. With an AI assistant, this is not a problem – it can be as patient and as compassionate as you want it to be, and it won’t judge you in any way. This makes it easier for me to ask questions I’m curious about, no matter how basic they might be.

It’s still early days

There tends to be lots of talk about the progress of AI in years to come. However, there might not be enough appreciation of what AI can already do for us today. I imagine a similar thing happened back when the internet was invented. Wikipedia founder Jimmy Wales noted that the technical capability for an online encyclopedia existed well before its creation. The breakthrough wasn’t technological – it was conceptual, requiring a shift in thinking about collaborative content creation. Similarly, even if AI were to stay at exactly the same level as it is today, we will still find lots of applications for it that will make all of our lives better. In other words, we are still in the early stages of figuring out how to effectively use this weird new technology.

Hallucinations are another issue that tends to be brought up as a major obstacle to making LLMs useful. However, I think this is less relevant for exploratory questions like the ones I mentioned earlier. If you ask an LLM about well-established topics or factual matters, it tends to do fine 99% of the time. Moreover, hallucination tends to be more of an issue for smaller models, whereas the frontier models (like ChatGPT 4, or Claude 3.5) suffer much less from this problem. Additionally, if you use a service like Perplexity for web search (which I use every day), you can simply check the cited sources for verification.

Conclusion

Working with LLMs is like working with alien technology – there’s no manual for it. So it’s always important to keep a critical mindset when evaluating an LLM’s output. But with some training and awareness of how LLMs function, I believe there’s hardly a limit to how useful they can be.

LLMs have helped me regain some of the sense of wonder I used to have as a kid when I pestered my mother with random questions about the universe. Learning doesn’t have to be boring, and LLMs are making this abundantly clear. We are truly at the beginning of a new age of curiosity, and I’m looking forward to all the incredible applications that AI will make possible.

Tyler Cowen: How to read a book using o1

¹ The way I sometimes try to explain this is that it’s like having a superhuman friend who has memorized the internet, whom you can directly ask any question.

CUDA C++: Using __CUDA_ARCH__ the Right Way

2024-11-01T00:00:00+00:00

TL;DR: Read this section of the CUDA C++ Programming Guide before using the __CUDA_ARCH__ macro so that you’re aware of cases where it’s problematic.

—

In the last year or so, I’ve become quite interested in low-level GPU programming for performance optimization of neural networks. Diving into CUDA programming has been a fun (and challenging!) way to explore the magic behind what makes neural nets run fast on GPUs.¹ I tend to do a lot of Python programming, so stepping down into the depths of low-level GPU programming has been, well… illuminating. If there’s one thing I’ve noticed, it’s that learning to write high-performance CUDA code is like learning to program in hard mode. In my experience, one does not simply “learn” CUDA – one merely becomes less incompetent over time. The devil tends to be in the details when you write CUDA code, and the details are far too easy to miss if you’re not careful.

In this post, I want to highlight a specific detail of CUDA C/C++ that can easily lead to mistakes: the __CUDA_ARCH__ macro. It’s caused me a fair bit of confusion in the last few days, so this post acts as an overview of my findings on how to use it while avoiding some of its pitfalls.

What is the __CUDA_ARCH__ macro?

The __CUDA_ARCH__ macro is used to write code that behaves differently depending on your GPU architecture. NVIDIA tends to release a new GPU architecture every two years or so — which may provide functionality that previous generations did not — so it’s quite useful to have the ability to write architecture-specific code.

Inside device code (CUDA code), the __CUDA_ARCH__ macro expands to a number derived from the compute capability of the GPU that you’re compiling for. For example, on an NVIDIA A100, which has compute capability 8.0 (sm80), __CUDA_ARCH__ will expand to 800. For a GeForce RTX 4090, which has compute capability 8.9 (sm89), __CUDA_ARCH__ will expand to 890. We can use the value of __CUDA_ARCH__ to conditionally include code that may only work for specific GPU architectures.
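The two examples above follow a simple pattern: the major version is multiplied by 100 and the minor version by 10. A small host-side helper makes that mapping explicit. The function name cuda_arch_value is made up for this sketch and is not part of CUDA.

```cpp
#include <cassert>

// Value that __CUDA_ARCH__ takes when compiling device code for
// compute capability major.minor (hypothetical helper, not a CUDA API).
constexpr int cuda_arch_value(int major, int minor) {
    return major * 100 + minor * 10;
}

// The two GPUs mentioned in the text.
static_assert(cuda_arch_value(8, 0) == 800, "NVIDIA A100, sm80");
static_assert(cuda_arch_value(8, 9) == 890, "GeForce RTX 4090, sm89");
```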

As a real-world example, here’s a snippet from Flash Attention that sets certain traits depending on whether the GPU architecture is at least Ampere (sm80):

template<...>
struct Flash_kernel_traits {

#if defined(__CUDA_ARCH__) &&  __CUDA_ARCH__ >= 800
    using Element = elem_type;
    static constexpr bool Has_cp_async = true;
#else
    using Element = cutlass::half_t;
    static constexpr bool Has_cp_async = false;
#endif

};

Note the use of Has_cp_async: The cp_async operation requires sm80 or higher, so the Has_cp_async boolean is a way to set the correct instructions at compile time by using preprocessor directives.
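To illustrate how a compile-time boolean like this can steer code selection, here is a small sketch in plain C++. It is not taken from Flash Attention: the two traits structs, CopyPath, and copy_path_name are hypothetical names, and the returned strings merely stand in for the real copy instructions.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-ins for two configurations of a traits struct.
struct TraitsWithCpAsync    { static constexpr bool Has_cp_async = true;  };
struct TraitsWithoutCpAsync { static constexpr bool Has_cp_async = false; };

// One specialization per value of the flag; the choice is made at compile time.
template <bool HasCpAsync> struct CopyPath;

template <> struct CopyPath<true> {
    static const char* name() { return "async copy (cp.async)"; }
};

template <> struct CopyPath<false> {
    static const char* name() { return "synchronous copy"; }
};

// Picks the copy path that matches the traits, with no runtime branch.
template <typename Traits>
const char* copy_path_name() {
    return CopyPath<Traits::Has_cp_async>::name();
}
```

Because the flag is a template argument, the unused path is never instantiated, which is what makes this pattern suitable for instructions that only exist on newer architectures.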

The problem: undefined behavior

Now, what I want to highlight are the cases where using __CUDA_ARCH__ is problematic and needs to be avoided. In particular, there are various situations where using it leads to undefined behavior. Luckily, the CUDA C++ programming guide has a section that lists the four situations where using __CUDA_ARCH__ leads to undefined behavior². These are³:

Setting type signatures for __global__ functions and/or __device__ and __constant__ variables based on __CUDA_ARCH__.
Instantiating function templates for __global__ functions based on __CUDA_ARCH__.
In separate compilation, using __CUDA_ARCH__ to conditionally define a function or variable with external linkage.
In separate compilation, using __CUDA_ARCH__ in headers such that different objects could contain different behavior.

For example, the following code block violates rule no. 1:

#if !defined(__CUDA_ARCH__)
typedef int mytype;
#else
typedef double mytype;
#endif

__global__ void foo(mytype in, // problem: foo's type depends on __CUDA_ARCH__
                    mytype *ptr)
{
  *ptr = in;
}
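For contrast, here is a sketch of a variant that stays within rule no. 1: the type in the signature is fixed, and any architecture-dependent logic moves inside the function body. The __global__ shim at the top is an assumption added only so the sketch also builds with a plain C++ compiler; under nvcc it is skipped and foo is a regular kernel.

```cpp
#include <cassert>

#ifndef __CUDACC__
#define __global__  // shim (assumption): lets this sketch build without nvcc
#endif

typedef int mytype;  // fixed: identical in the host pass and every device pass

__global__ void foo(mytype in, mytype *ptr)  // signature no longer depends on __CUDA_ARCH__
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
  // Device pass for sm80 and newer: architecture-specific code could go here.
  *ptr = in;
#else
  // Every other pass: portable fallback with the same observable result.
  *ptr = in;
#endif
}
```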

What exactly happens when you write code that violates any of these four rules? The docs provide the answer:

The compiler does not guarantee that a diagnostic will be generated for the unsupported uses of __CUDA_ARCH__ described above.

In other words, your code can now exhibit arbitrary behavior, including crashing, producing incorrect results, or even behaving as expected by coincidence. What’s worse, the compiler won’t even tell you that something is wrong.

Someone from the GPU Mode Discord channel said it best:

This is probably the scariest kind of thing you can have in C/C++: UB NDR (Undefined Behaviour, No Diagnostic Required)

Scary stuff. If you take away nothing else from the current post, let it be this: read the aforementioned section of the CUDA C++ programming guide before you find yourself using __CUDA_ARCH__ in any kind of non-trivial way.

Case study: torchao FP6 kernel

To give you an impression of how things can go wrong, let’s look at a case study. I have been contributing to torchao in recent months, a library for PyTorch native quantization and sparsity for training and inference. Specifically, I have been focusing on torchao’s integration with FP6, a high-performance CUDA kernel for 6-bit quantization. Most of the details aren’t important – I just want to highlight a small part of the code. Here’s a simplified version of what the structure of the FP6 code looked like at some point:

#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800

template<int EXPONENT, int MANTISSA>
__global__ void fpx_kernel(const uint4* weights, const half* scales,
                           const half* in_feats, half* out_feats,
                           const size_t M, const size_t N, const size_t K)
{
    // CUDA code here ...
}


torch::Tensor fp6_forward_cuda(
    torch::Tensor   _in_feats,
    torch::Tensor   _weights,
    torch::Tensor   _scales)
{
    // Setup code here ...
    
    // Call the CUDA kernel
    fpx_kernel<3, 2><<<grid_dim, block_dim, 0, stream>>>(
        weights, scales, in_feats, out_feats, M, N, K);
    
    return out_feats;
}

#endif

The idea here is that the FP6 kernel only supports sm80 and higher, so we use __CUDA_ARCH__ to only include the FP6 code when __CUDA_ARCH__ >= 800 or when __CUDA_ARCH__ is undefined (which is the case for host code). This avoids compilation problems when using GPUs with compute capability older than sm80.

The code above worked fine on sm80 GPUs and higher. However, when we called the above kernel on an sm75 GPU, we noticed that it behaved inconsistently. Sometimes, the code would error out saying that the kernel wasn’t defined (which was the expected outcome). Other times, however, it would actually run without errors, while producing garbage output!⁴

If we refer to the four rules I mentioned earlier from the CUDA C++ Programming Guide, it appears that we are violating rule no. 2. To quote the docs:

If a __global__ function template is instantiated and launched from the host, then the function template must be instantiated with the same template arguments irrespective of whether __CUDA_ARCH__ is defined and regardless of the value of __CUDA_ARCH__.

Note that during compilation, the nvcc compiler makes a distinction between device code (code that runs on the GPU) and host code (code that runs on the CPU). Host code is forwarded to a standard C++ compiler (like gcc or cl), similar to any regular C++ program. The device code, on the other hand, is compiled by nvcc, the NVIDIA CUDA compiler. Afterwards, the device code is embedded into the host object file as a “fat binary”.

In the case of the FP6 code shown above, it would appear that the following is happening when we compile for an sm75 GPU:

For the compilation of host code, the #if directive will evaluate to true (since !defined(__CUDA_ARCH__) is true for host code), which means the fpx_kernel<3, 2> function template is instantiated.
For the compilation of device code, the #if directive will evaluate to false (since __CUDA_ARCH__ < 800), which means the fpx_kernel<3, 2> function template is not instantiated.

This means there is a problematic divergence between the host and device code: The host code includes a __global__ function symbol that the device code does not. What happens as a result is undefined behavior, i.e. anything can happen. This explained the inconsistent behavior of the kernel, which went away after addressing this issue.⁵
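The divergence can be modelled in a few lines of plain C++. This is only a model of the preprocessor condition, not CUDA code: guard_keeps_kernel and host_device_diverge are hypothetical helpers, and the host pass is represented by arch_defined == false.

```cpp
#include <cassert>

// Models the guard `#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800`.
// arch_defined == false stands for the host pass, where the macro is undefined.
constexpr bool guard_keeps_kernel(bool arch_defined, int arch_value) {
    return !arch_defined || arch_value >= 800;
}

// True when the host pass and a device pass disagree on whether the kernel exists.
constexpr bool host_device_diverge(int device_arch_value) {
    return guard_keeps_kernel(false, 0) !=
           guard_keeps_kernel(true, device_arch_value);
}

static_assert(!host_device_diverge(800), "sm80: both passes see the kernel");
static_assert(host_device_diverge(750), "sm75: only the host pass sees the kernel");
```

For any architecture value below 800 the two passes disagree, which is the situation the quoted rule forbids.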

The Lindy Effect: Explaining the Longevity of Legacy Software

2024-08-01T00:00:00+00:00

“Wanted: people who want to learn a programming language from 1959,” headlined a recent Dutch newspaper article. The language in question is COBOL, a language so old that most of the programmers who know it are either dead or retired. Nonetheless, banks, government institutions, and other large organizations still rely on this ancient language for some of their most critical infrastructure.

How is it possible that a language like COBOL is still used today? What has prevented these organizations from upgrading their software, like, several decades ago? I recently discovered a useful mental model for thinking about this: The Lindy effect.

The Lindy Effect

The Lindy effect states that the longer an idea or technology has survived, the longer it is likely to stay alive. Nassim Taleb highlights this concept in his book Antifragile, where he explains that for non-perishable things like ideas, music, and technology, age is a sign of robustness, which in turn is a positive indicator of future longevity.

“If a book has been in print for forty years, I can expect it to be in print for another forty years. But, and that is the main difference, if it survives another decade, then it will be expected to be in print another fifty years. This, simply, as a rule, tells you why things that have been around for a long time are not “aging” like persons, but “aging” in reverse. Every year that passes without extinction doubles the additional life expectancy. This is an indicator of some robustness. The robustness of an item is proportional to its life!“ — Nassim Taleb
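The arithmetic in the quote (forty years in print gives forty more, fifty gives fifty more) can be reproduced with a simple model. The sketch below assumes lifetimes follow a Pareto distribution with shape parameter alpha, which is an assumption for illustration and not something stated in the quote. Under that model the expected remaining lifetime at a given age is age / (alpha - 1), so alpha = 2 gives exactly "remaining life equals current age". The function name is made up.

```cpp
#include <cassert>

// Expected additional lifetime given survival to `age_years`, under an
// ASSUMED Pareto lifetime distribution with shape alpha > 1:
//   E[T - age | T > age] = age / (alpha - 1)
// (valid for ages at or above the distribution's minimum lifetime).
double expected_remaining_years(double age_years, double alpha = 2.0) {
    assert(age_years > 0.0 && alpha > 1.0);
    return age_years / (alpha - 1.0);
}
```

With the default alpha of 2, expected_remaining_years(40.0) is 40 and expected_remaining_years(50.0) is 50, matching the book example. A larger alpha models things that are less "Lindy", since the expected remaining lifetime then grows more slowly with age.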

The typical framing of the Lindy effect is that those ideas that endure over time likely do so because they retain their usefulness, and therefore have long-term value.

The Lindy Effect Applied to Software

The Lindy effect provides an interesting framework for understanding the longevity of software. In particular, I’d like to discuss the rise and decline of programming languages as a manifestation of a similar effect. Why do some programming languages survive, and others don’t? For example, C and Pascal are both programming languages that originated in the 1970s. However, C remains widely used today, while Pascal has largely fallen out of favor.¹

The typical Lindy explanation for this discrepancy relates to a difference in quality. The fact that C has survived this long and is still widely used indicates its durability and lasting value. By contrast, Pascal has faced more issues and has largely been replaced by more modern languages.

But quality is not the only reason why programming languages survive. Once a language gains sufficient traction, its existence can become self-reinforcing. Some software gets ingrained deeply enough into the infrastructure of a society to practically guarantee its future existence. The obsolescence of these critical pieces of software will only occur if there is a conscious and deliberate effort to replace them, which becomes less likely over time.

It helps to understand that software derives much of its greatest strength from its ability to be layered on top of other software. Over time, software becomes increasingly powerful, and increasingly complex. This is generally a positive thing. For instance, a video game programmer does not need to write their own physics simulation—they can use Unreal Engine. Similarly, a web developer can use a framework like React or Django to speed up their work.

However, there is a clear downside to this increasing complexity that relates to critical points of failure. Modern software is like a tower of stacked dependencies, and the oldest software tends to be at the bottom. As we create more software, the impact of the software at the bottom grows over time, as does the impact of changing it. This is the kind of software that becomes a critical point of failure for much of our digital infrastructure, where the smallest bugs or security flaws can have a gigantic ripple effect. Recent examples of this include the Log4j bug from 2021, as well as the CrowdStrike bug from this year that caused global outages.

xkcd

Over time, the effect of changing this kind of software becomes so large that making any serious modifications can be practically impossible. The massive complexity, scale, and interdependencies involved lead to a growing inertia—a resistance to change the existing software. A financial institution like a bank will think twice before touching the software that handles all of their financial transactions (written in COBOL)—the consequences of a mistake are simply too great.²

So if we think back to the example of COBOL, and why it is still being used to this day, we can understand why: It is too deeply ingrained into our existing infrastructure. Lots of dependencies and complexity have formed over time, which means that as a programmer, you just might crash the global financial system if you forget to add a semicolon somewhere.³

Bad Design Ossifies Software

Poor software design makes the problem of dependencies worse. There is a reason why software engineering best practices advocate for things like modularity and separation of concerns: it minimizes dependencies across a codebase. Problems arise when software modules make lots of assumptions about other modules they interact with, creating rigidity and an inability to change anything without lots of unforeseen consequences.

This usually follows a common pattern. Someone creates a temporary fix (hack), followed by lots of changes built on top of the original fix, making the original fix increasingly hard to remove. Organizations that take software health seriously eventually service this kind of technical debt and resolve the problematic dependencies. But organizations that don’t do this accumulate lots of problematic dependencies over time. Eventually, their codebase is such a jumbled mess of spaghetti code that no one dares change it (after all, who knows what will happen?).

Even well-designed software can be difficult to replace. Translating a codebase from one programming language to another leaves plenty of room for new and unforeseen bugs. This is the kind of thing that banks (rightfully) fear when they consider replacing their financial transactions software.

Old Software Ain’t So Bad

Old software isn’t inherently a bad thing and can provide lots of value. For instance, the older a piece of software is, the more likely it is that mistakes and bugs have been removed from it. Furthermore, new technology does not necessarily perform better than old technology—in fact, the opposite often holds true in software.

For example, Python is built on top of C. Although C is, by modern standards, a horrible language that no one should use, it is at least fast. As long as there are C programmers around who can maintain the old infrastructure, there isn’t much of a problem. Similarly, Fortran is still used in numerical analysis software because it is highly efficient for those kinds of workloads. Same thing with COBOL: It is highly optimized for financial transaction systems.

Instead, old software becomes problematic when it becomes increasingly hard to maintain over time. In the case of a language like COBOL, there are fewer and fewer engineers who can understand and modify the existing software. This reinforces the Lindy effect—no engineers are available to replace the old software because the engineers that are available need to maintain it.

Conclusion

I believe that the rise and decline of programming languages is a natural phenomenon. Priorities change, paradigms shift, and this is reflected in the languages we use as software engineers. However, the languages that stick around are worth studying. Sometimes, their durability is merely a sign of technological inertia and complicated dependencies. Other times, a more positive version of the Lindy effect may be at play—their durability might hint at design principles that are worth preserving.

C is still ranked the 3rd most popular programming language as of writing this. ↩
On a related note, this is also why it’s so hard to upgrade the electric power grid, or why we are still using the QWERTY keyboard layout despite the existence of more efficient layouts. ↩
Needless to say, you wouldn’t push straight to production for any kind of serious infrastructure software. ↩

Why You Should Take a Personality Assessment

2024-07-10T00:00:00+00:00

To become acquainted with oneself is a terrible shock. - Carl Jung

Human beings are confusing. Granted, this shouldn’t come as a great shock to most people. Personally, it only takes me a few minutes of watching TikTok to be convinced that the human mind is an inexplicable mystery. What’s especially striking, though, is how often we are as confused about ourselves as we are about others. When we get confused about others’ behaviors, we can at least say, “Well, I can’t read their thoughts”. But this excuse doesn’t work for our own thoughts and actions. After all, we literally can read our own minds.

I have realized in recent years that most people could benefit from learning more about themselves. Although human psychology is elusive, we can strive to understand it through standardized personality testing. In my opinion, everyone should do a personality test at least once in their life.

What is a personality test? It is an attempt to describe human psychology in terms of measurable variables, such as extroversion or neuroticism. Usually, these tests are conducted through self-report questionnaires. Originally, they were used for personnel selection in the armed forces, back in the 1920s. Since then, a wide variety of personality scales have been developed.

Some of the more common personality frameworks include the Big Five and the Myers-Briggs Type Indicator. It’s worth trying out a few different alternatives to find out which works best for you – personally, I’ve gained a lot by learning about the Big Five.¹ There are various online tests available. For example, a detailed, but paid ($10) Big Five test is the Understand Myself assessment. There are also free alternatives, such as here and here. Some examples of Big Five alternatives include PrinciplesYou and 16Personalities.

Questionable life choices

Personally, I could have benefited from more self-knowledge back when I was in high school. Back then, I had somehow come to the conclusion that what I wanted to do with my life was to become an accountant. Becoming an accountant sounded like a good idea to me because 1) you made a lot of money, and 2) it was high status and sounded important. In hindsight, this was a terrible decision. The environment was an incredibly bad fit for me and I could not relate to my classmates whatsoever. It took me just three months to drop out of my studies and find myself back where I started.

Looking back on this experience, I realize that a poor grasp of my own personality was to blame for my misguided career choice. I had a clear plan for what my career would look like, but I had a poor grasp of how I would fit into that plan.

As Mike Tyson once said, “everybody has a plan, until they get punched in the face”. In my case, fresh out of high school, life immediately handed me an uppercut to the jaw and put me back in my place. I eventually ended up in the right place, studying computer science and artificial intelligence, and being much happier in that environment.

This experience made me realize that earlier in life, it is especially valuable to do a good amount of exploration, trying out different things to discover what you like and don’t like. Before committing to important life choices, you ideally want to test your plan with no strings attached, so you can experiment quickly and adjust your plans if things don’t work out as expected. In my case, it probably would have been sufficient to spend a summer internship at an accountancy firm to realize that my personality was a bad fit for the average personality type there.

While it’s not uncommon for teenagers to be confused about who they are or want to be, many adults struggle with their identity as well. So, how can personality tests help?

First of all, knowing your personality type can help you to navigate career choices more effectively. For example, if you’re strongly introverted, you’re probably better off not becoming a salesperson. If you’re highly disagreeable, a career as a nurse is probably not the most logical choice. If you’re high in neuroticism, you might want to avoid high-stress jobs like day trading. While there are exceptions, your personality traits should already give you a pretty good idea of what not to do with your life.

Secondly, learning about personality types is a great way to collaborate more effectively with other people. It’s easier to understand people who are similar to you, but much harder to understand those who are radically different. Having a mental framework of personality is useful because at some point, we all collaborate with people who do not share our personality traits. If we don’t make an effort to understand their behavior, frustration and miscommunication are bound to occur.

Here’s an example. I recently found out about the difference between “linear” and “lateral” thinkers. A linear thinker tries to solve problems step-by-step in a structured way, while a lateral thinker tends to take a big-picture view and connects ideas in more intuitive ways. To a linear thinker, a lateral thinker can come across as chaotic and unstructured. Conversely, a linear thinker can come across as rigid and dogmatic to a lateral thinker. At the same time, these two ways of thinking are complementary, because they each have blind spots. Linear thinkers can compensate for the blind spots of lateral thinkers, and vice versa.

Finally, relationships are greatly helped by awareness of personality types. Finding a suitable partner tends to be a question of finding someone whose personality type is at least somewhat compatible with yours. The more your personality type diverges from that of your partner, the harder it will be to understand each other, which will inevitably lead to friction. This is not to say that one should strive to find a partner who is exactly like them, but having some overlap in terms of personality traits tends to be better than having no overlap at all.

Above all, knowledge of one’s personality should serve as a compass for making life decisions, both at the micro and macro level.

Someone who has used personality testing with great effect is Ray Dalio – one of the world’s most successful investors. He uses personality testing extensively as a tool to understand both himself and the people around him. In his company, Bridgewater, they use an idea called baseball cards to describe the personality types of employees. These baseball cards contain various personality scores obtained from personality tests, providing perspective on someone’s strengths and weaknesses, as well as their communication style. According to Dalio, striving to understand the psychology of each employee has been a crucial part of the success of the company.

Where to go from here

If there’s one thing I’ve learned, it’s that getting to know yourself is not easy – it takes time and effort to do well. In the meantime, we can use personality testing as a useful tool to bootstrap the process.

I encourage everyone to take a personality assessment at least once. It shouldn’t take more than an hour or two, and you just might gain lifelong benefits from it. That seems like a pretty good investment to me.

Aside from the Big Five test itself, I’ve also learned a lot from Jordan Peterson’s lectures on this topic. ↩

Knowledge Worth Learning

2024-06-17T00:00:00+00:00

What makes someone motivated to learn new things? In my opinion, trying to answer this question is crucial to success in a knowledge economy. Most people have a learning style which is unique to them. For example, I consider myself a highly pragmatic person. I like to remind myself of this from time to time because it helps me to ground my thinking and better understand my motivations and drives. I see learning new things as something that is highly practical, akin to an investment in myself that leads to a positive return in the future.

For example, if I’m working on a programming problem, I enjoy learning about an algorithm or design pattern that helps me solve that problem better. Other times, I enjoy listening to podcasts about science-backed techniques for improving personal health (I’m looking at you, Andrew Huberman). I therefore get the greatest enjoyment from knowledge that is useful and actionable.

Although I like to think this pragmatic approach to learning has served me well, it also has its limitations. In particular, it biases my interest towards knowledge that seems useful at first glance. This can be sub-optimal, because the usefulness of knowledge is not always immediately obvious. The question then becomes: how do you identify knowledge that is useful?

What makes knowledge useful?

The idea of useful knowledge reminds me of an amusing scene from the Sherlock Holmes novel A Study in Scarlet. In this scene, Dr. Watson is surprised to learn that Holmes does not know that the earth goes around the sun, as Holmes finds such knowledge irrelevant:

His ignorance was as remarkable as his knowledge. Of contemporary literature, philosophy and politics he appeared to know next to nothing. Upon my quoting Thomas Carlyle, he inquired in the naïvest way who he might be and what he had done. My surprise reached a climax, however, when I found incidentally that he was ignorant of the Copernican Theory and of the composition of the Solar System. That any civilized human being in this nineteenth century should not be aware that the earth travelled round the sun appeared to me to be such an extraordinary fact that I could hardly realize it. […] He said that he would acquire no knowledge which did not bear upon his object. Therefore all the knowledge which he possessed was such as would be useful to him.

I think Holmes’s approach to knowledge raises an intriguing question: How can one determine upfront whether some piece of information will turn out to be useful? In the case of trivia about the Solar System, I think you can be pretty sure that this will not be directly useful for daily life (unless, of course, it’s your job to know). However, in many other cases, the distinction is much more subtle.

Here’s an example. When I was studying computer science at university, I had to implement an algorithm for sorting an array of numbers. The thing to understand is that you would never implement a sorting algorithm yourself for any real-world use case. Whatever your choice of programming language is, you can be pretty sure that it already includes an efficient implementation of a sorting algorithm. In other words, writing a sorting algorithm is mostly a theoretical exercise. At the time, it therefore made zero sense to me why I would have to learn and write an implementation of a sorting algorithm. Why not just use the existing implementations, which are clearly better, and not worry about it? I simply could not imagine the practical use of such an exercise.

Later on, I realized the value: Even though you might never write another sorting algorithm again in your life, there is a good chance you will learn useful things in the process. Unfortunately, the value of such knowledge is often not immediately obvious. If I had focused only on the immediately useful, I would have missed out on learning knowledge that eventually proved useful for future projects.

To illustrate this point, here are some useful insights I gained from list sorting:

Binary search, a valuable technique for efficiently searching a sorted array of values (it only works because the array is sorted), and something I have used in various software projects.
Being able to recognize sorting problems helps to solve some problems more effectively. Sorting is pretty common in many real-world applications, and recognizing a sorting problem can help you choose the right solution for specific situations. For example, some sorting algorithms, such as insertion sort, are a better choice when a list is nearly sorted.
Learning about sorting introduces you to related topics such as big-O notation, recursion, and algorithm memory usage. These concepts are ubiquitous in computer science and will likely be useful in other contexts.
Job interviews: Knowledge of list sorting and time/space complexity is something that is sometimes tested in job interviews for software engineering roles.
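
The first two points can be made concrete with a short Python sketch. The function names and sample lists are illustrative choices, not taken from any particular project: `insertion_sort` counts how many element shifts it performs, and `binary_search` leans on the standard library's `bisect_left`.

```python
from bisect import bisect_left


def insertion_sort(values):
    """Return a sorted copy of `values` and the number of element shifts."""
    result = list(values)
    shifts = 0
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        # Move larger elements one slot to the right to make room.
        while j >= 0 and result[j] > current:
            result[j + 1] = result[j]
            shifts += 1
            j -= 1
        result[j + 1] = current
    return result, shifts


def binary_search(sorted_values, target):
    """Return the index of `target` in a sorted list, or -1 if absent."""
    i = bisect_left(sorted_values, target)
    if i < len(sorted_values) and sorted_values[i] == target:
        return i
    return -1


nearly_sorted = [1, 2, 4, 3, 5, 6, 8, 7]
reversed_values = [8, 7, 6, 5, 4, 3, 2, 1]

sorted_a, shifts_nearly_sorted = insertion_sort(nearly_sorted)
sorted_b, shifts_reversed = insertion_sort(reversed_values)
index_of_five = binary_search(sorted_a, 5)
```

On the nearly sorted list, insertion sort needs only 2 shifts, versus 28 for the reversed list of the same length. That gap is the reason it is a reasonable choice when the input is already almost in order.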

To summarize, learning about sorting algorithms may not be immediately useful, but it has a high ROI in the medium to long term. Moreover, this type of knowledge lays a foundation that makes future learning easier and more efficient. This is what I want to discuss next.

The value of foundational knowledge

As you can see, my goal of acquiring useful knowledge is not as straightforward as I might have hoped. Another complication is that unlike university, the real world is messy, and the most useful skills for succeeding in it arguably share little overlap with the skills needed to succeed in university. So, how does one determine which knowledge is worth learning?

Let me first state the obvious by saying that this is highly dependent on your goals. My focus here is on knowledge workers such as software engineers, who often benefit from increasing their knowledge and skills over time. Such a person might also switch domains every now and then, forcing him or her to learn new things regularly.

As a general principle, I would argue there is great value in learning things that are not too problem-specific.

This is the difference between shallow and deep (or foundational) knowledge. Shallow knowledge will help you only in very specific situations, whereas foundational knowledge is less problem-specific but carries over to many other situations, giving you a broader conceptual framework. Elon Musk has referred to this distinction in terms of a knowledge tree:

It is important to view knowledge as sort of a semantic tree – make sure you understand the fundamental principles, i.e. the trunk and big branches, before you get into the leaves/details or there is nothing for them to hang on to.

An example of the distinction between relatively shallow and deeper knowledge is understanding the specific syntax of a programming language versus learning the underlying principles of how computers operate. The former gives you a basic capacity to do useful work, whereas the latter provides knowledge that is independent of any programming language and can be applied regardless of the language you use. For example, let’s say you’re writing a computer program that adds numbers to an array. If you understand what happens in the computer’s memory when you append an element to an array, you can write a better program by recognizing that a linked list or hash map is more efficient than an array in certain situations. This will make you a better programmer, no matter what language you use.
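
As a rough illustration of what happens in memory when you append to an array, the sketch below watches a Python list grow. It assumes CPython, where `sys.getsizeof` reflects the list's currently allocated capacity. The helper name `count_reallocations` is made up for this example.

```python
import sys


def count_reallocations(num_appends):
    """Count how often a list's allocated size changes while appending."""
    items = []
    last_size = sys.getsizeof(items)
    reallocations = 0
    for i in range(num_appends):
        items.append(i)
        size = sys.getsizeof(items)
        if size != last_size:
            # The list ran out of spare capacity and was given a larger buffer.
            reallocations += 1
            last_size = size
    return reallocations


num_appends = 10_000
reallocations = count_reallocations(num_appends)
print(f"{num_appends} appends triggered {reallocations} reallocations")
```

The list does not get a new buffer on every append. It over-allocates, so most appends are cheap and only an occasional one pays for a resize. Inserting at the front of the same list has no such shortcut, because every existing element has to move. That is the kind of situation where a different data structure can be the better choice.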

Seeking out useful knowledge can be a trap, because it can focus your attention on the shallow knowledge that is immediately useful in the short term, while diverting your attention from the foundational knowledge you may need to learn in order to reach a deeper understanding of a topic. In other words, your knowledge tree might be missing the trunk and branches on which the leaves hang. Solid foundational knowledge is useful because it allows you to apply first principles thinking, enabling you to analyze underlying patterns better, which can lead to new insights. ¹

The problem is that this is hard. First, it can be difficult to identify what foundational knowledge you’re missing. By definition, foundational knowledge is far removed from specific problem instances, meaning you need to have enough insight to realize which foundational principles are worth studying. Second, it takes more effort to learn foundational knowledge than to learn shallow knowledge. The difficulty lies in deciding what to learn and how to allocate your time effectively. A lack of understanding of underlying principles and foundational algorithms may lead you to re-invent the wheel and waste a lot of time doing so. Moreover, since the emphasis in modern society is often on short-term thinking, taking the time to learn foundational knowledge can be incredibly valuable.

I think this is worth repeating: Most people don’t do this. The reason is straightforward: It is hard and takes a lot of time and effort. Instead, it is often tempting to look for hacks: health hacks, productivity hacks, etc., where the focus is on quick and maximally useful solutions. The problem with this approach is that it forgoes the opportunity to learn more fundamental principles.

This way of thinking can be applied to many areas of life. Take nutrition as another example. Instead of following the latest diet fad, you are probably better off studying what exactly makes a diet “healthy” in the first place. If you start from the foundational principles that constitute a healthy diet–such as a certain amount of protein, vitamins, minerals, calories, etc.–you have a great deal of flexibility in designing a diet that works best for your personal circumstances. This is tremendously empowering, but also takes a good deal of effort.

On the other hand, partly because it’s so demanding, learning foundational knowledge is not always necessary. Sometimes, a quick hack is all you need. For example, taking a vitamin D supplement in winter is a quick and easy way to address the fact that many people are vitamin D deficient in winter due to a lack of sunlight exposure. You don’t need to know much more about it than that to reap the benefits.

My impression is that the need for foundational knowledge increases as you try to solve harder and more novel problems, whether in your professional or personal life. This is because if a problem has been solved many times before, there is a good chance someone has documented the solution and shared it with others. If not, it usually pays off to go a few levels deeper than the specifics of a problem might seem to require, to gain a new perspective on the problem. Examples of this include: doing novel research, building a startup or product that addresses a difficult problem, or designing a nutrition plan that works well for you.

Frankly, I still find it quite difficult to determine what information is worth learning. Paradoxically, to know how useful a piece of information is, you might have to learn it first. Additionally, what is useful to one person may not be to someone else, so copying the same trajectory as others is also not an ideal strategy.

I find inspiration in the people I look up to when I consider their approach to learning. For example, my impression is that the most impressive software engineers tend to share a learning mindset. They strive to understand systems deeply rather than superficially. A learning mindset might just be the most important meta-strategy: Make it a habit to expand your knowledge and skills, striving for deep, non-superficial understanding. As long as you keep learning, you’ll learn useful knowledge along the way.

Dan Luu on what to learn
Peter Drucker - Managing Oneself

Success in the knowledge economy comes to those who know themselves—their strengths, their values, and how they best perform

Andrej Karpathy on how to become an expert:

How to become expert at thing:
1 iteratively take on concrete projects and accomplish them depth wise, learning “on demand” (ie don’t learn bottom up breadth wise)
2 teach/summarize everything you learn in your own words
3 only compare yourself to younger you, never to others

Andrej Karpathy on the shortification of learning:

There are a lot of videos on YouTube/TikTok etc. that give the appearance of education, but if you look closely they are really just entertainment. This is very convenient for everyone involved : the people watching enjoy thinking they are learning (but actually they are just having fun). The people creating this content also enjoy it because fun has a much larger audience, fame and revenue. But as far as learning goes, this is a trap. This content is an epsilon away from watching the Bachelorette. It’s like snacking on those “Garden Veggie Straws”, which feel like you’re eating healthy vegetables until you look at the ingredients.

Learning is not supposed to be fun. It doesn’t have to be actively not fun either, but the primary feeling should be that of effort. It should look a lot less like that “10 minute full body” workout from your local digital media creator and a lot more like a serious session at the gym. You want the mental equivalent of sweating. It’s not that the quickie doesn’t do anything, it’s just that it is wildly suboptimal if you actually care to learn.
…
So for those who actually want to learn. Unless you are trying to learn something narrow and specific, close those tabs with quick blog posts. Close those tabs of “Learn XYZ in 10 minutes”. Consider the opportunity cost of snacking and seek the meal - the textbooks, docs, papers, manuals, longform. Allocate a 4 hour window. Don’t just read, take notes, re-read, re-phrase, process, manipulate, learn.
Gian Segato: Learning takes effort, otherwise it’s just entertainment
Ben Kuhn on exploring new things

One of my favorite examples of this is Andrej Karpathy making the connection between LLMs and operating systems – two things which, at first glance, have little to do with each other. ↩

CNN vs. Vision Transformer: A Practitioner’s Guide to Selecting the Right Model

2024-05-15T00:00:00+00:00

Vision Transformers (ViTs) have become a popular model architecture in computer vision research, excelling in a variety of tasks and surpassing Convolutional Neural Networks (CNNs) in most benchmarks. As practitioners, we often face the dilemma of choosing the right architecture for our projects. This blog post aims to provide guidelines for making an informed decision on when to use CNNs versus ViTs, backed by empirical evidence and practical considerations.

For those in a hurry, here is a decision tree to quickly choose between CNNs and ViTs for real-world projects. Alternatively, you can jump ahead to the conclusion for a list of practical recommendations.

A decision tree for deciding between CNN and ViT for real-world projects

This guide is primarily intended for practitioners and researchers in the field of computer vision who are interested in understanding the trade-offs between Convolutional Neural Nets and Vision Transformers for various applications. For the most part, I am assuming a background in machine learning concepts and familiarity with terminology related to deep learning architectures and training procedures. Also note that even though I have tried to synthesize some of the most relevant research here, I may have missed important insights that I should have incorporated. Please don’t hesitate to contact me if you spot any mistakes or things I missed!

Architectural Differences

Let’s start with a brief comparison of the two architectures. I will explain only the essential information, as there are plenty of resources available to learn more about Vision Transformers (the original paper is a good start). Since the Vision Transformer architecture is largely identical to the original Transformer encoder architecture by Vaswani et al. (2017), I will use the terms Transformer and Vision Transformer interchangeably.

The Vision Transformer architecture. It is largely identical to the original Transformer architecture by Vaswani et al. (2017). (Image source: Dosovitskiy et al. 2020)

Transformers are flexible architectures with minimal inductive priors, meaning they make few assumptions about input data. In contrast, CNNs assume that nearby pixels are related (locality) and that different parts of an image are processed similarly (weight sharing). These assumptions, inherent to the convolution operator, help CNNs learn effectively with limited training data.
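
To get a feel for how much weight sharing and locality constrain a CNN, the following back-of-the-envelope sketch compares the parameter count of one 3x3 convolution with that of a fully connected layer producing an output of the same spatial size. The layer sizes (a 224x224 RGB input and 64 output channels) and the helper names are assumptions chosen for illustration.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of a conv layer: one small kernel reused everywhere."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def dense_params(num_inputs, num_outputs):
    """Weights plus biases of a fully connected layer: one weight per input/output pair."""
    return num_inputs * num_outputs + num_outputs


height = width = 224
conv = conv2d_params(kernel_size=3, in_channels=3, out_channels=64)
dense = dense_params(height * width * 3, height * width * 64)

print(f"conv layer:  {conv:,} parameters")
print(f"dense layer: {dense:,} parameters")
```

The convolution needs 1,792 parameters, while the unconstrained layer needs on the order of 10^11. The prior built into the convolution is what lets it get away with so few parameters, and so little data.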

Transformers, on the other hand, have very few inductive biases. This means they have to learn more from the training data, thereby necessitating larger training datasets. They can outperform CNNs when trained on sufficient data, but struggle to learn meaningful representations with small datasets, underperforming other architectures (more on this later).

While CNNs start from the assumption that nearby pixels are related, the Vision Transformer makes no such assumption: self-attention lets every image patch attend to every other patch, regardless of how far apart they are. This can lead to a better understanding of global relationships in an image, which a CNN might not capture because of its locality bias. Therefore, beyond a certain data threshold, inductive biases become a liability rather than an asset. Transformers are highly scalable because they are minimally constrained by assumptions baked into the architecture.
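
The global nature of self-attention can be sketched in a few lines of plain Python. This is a deliberately stripped-down version: a single head with no learned query, key, or value projections, applied to four made-up two-dimensional "patch" vectors. A real ViT layer adds those projections, multiple heads, and much larger dimensions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention using the tokens as queries, keys, and values."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights, outputs = [], []
    for query in tokens:
        # Each token is scored against every token, near or far.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * value[d] for w, value in zip(row, tokens)) for d in range(dim)]
        )
    return outputs, weights


patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, weights = self_attention(patches)
```

Each row of `weights` has one entry per patch and sums to 1, so every output is a mixture of all patches. Nothing in the computation favours neighbouring patches; any preference for locality has to be learned from data.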

Neural network architectures can be seen as existing on a spectrum of inductive biases, from weak to strong. ViTs occupy the lower end of the spectrum, while CNNs occupy the higher end. Depending on how well the inductive priors can be learned from the training data, one might choose an architecture with fewer or more inductive biases. For example, there are hybrid architectures which combine CNNs and ViTs into a single architecture. Such an architecture would sit in the middle of the inductive biases spectrum, with enough priors to avoid requiring a huge amount of training data, while still preserving some of the learning flexibility of the Transformer architecture.

Finally, it is worth mentioning that Transformers have had significant success due to self-supervised learning. This is a paradigm in which the model learns to extract meaningful representations from unlabeled data by solving pretext tasks such as predicting missing patches or identifying transformed images. Since Transformers are so data-hungry, self-supervised learning is an excellent way to scale up training datasets, as no labels are required. It leads to general-purpose representations that can be fine-tuned for specific downstream tasks with less labeled data. The most notable success stories are from NLP (e.g., BERT, GPT), but it is becoming increasingly common in computer vision as well. ViTs are the most common choice for self-supervised pretraining in computer vision (see, e.g., DINOv2, MAE), but CNNs can also be used.
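
As a small sketch of the "predict the missing patches" pretext task, the snippet below picks a random subset of patch indices to hide. The helper name is hypothetical, and the 75% mask ratio and 14x14 patch grid are assumptions for illustration. A real pipeline would go on to encode the visible patches and train the model to reconstruct the hidden ones.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into (visible, masked) index lists at random."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# A 224x224 image cut into 16x16 patches gives a 14x14 grid of 196 patches.
num_patches = (224 // 16) ** 2
visible, masked = random_patch_mask(num_patches, mask_ratio=0.75, seed=0)
```

No labels are involved anywhere: the training signal comes entirely from the image itself, which is what makes this approach easy to scale to large unlabeled datasets.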

In summary:

Vision Transformers are highly scalable but require large datasets to learn effectively. They are most effective when scaled up to large sizes (or very large sizes). Self-supervised learning can enable such large-scale training, although supervised pretraining is also still quite common.
CNNs have strong inductive biases (locality, weight sharing), allowing them to perform well with limited data. They are less scalable than ViTs, but outperform ViTs in smaller pretraining data regimes.

Details of Vision Transformer model variants. ViT-Huge is not included here, which uses 632M parameters. (Image source: Steiner et al. 2021)

What does the research say?

Let’s now look at empirical studies comparing CNNs and ViTs, to get an idea of their relative strengths and weaknesses.

I will note here that some head-to-head comparisons between CNNs and ViTs can be misleading, due to inconsistent comparison standards. Differences in model size, training dataset, augmentation strategies, and training length are all confounders that make a direct comparison difficult. For example, comparing a ResNet-50 to a ViT-B model (done here) is not a fair comparison, because the ViT-B model has roughly 3.5x as many parameters as the ResNet-50 model.
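
The size gap can be checked with a quick parameter count. The sketch below assumes the standard ViT layout (a patch embedding, a class token, learned position embeddings, pre-norm Transformer blocks, and a 1000-class linear head) with the ViT-B/16 and ViT-H/14 hyperparameters from Dosovitskiy et al. (2020). The ResNet-50 figure is hard-coded from the commonly reported count for the ImageNet-1k classifier rather than derived here.

```python
def vit_param_count(image_size, patch_size, hidden_dim, mlp_dim, depth,
                    num_classes=1000, channels=3):
    """Count the parameters of a plain ViT classifier (weights and biases)."""
    num_patches = (image_size // patch_size) ** 2
    patch_embed = patch_size * patch_size * channels * hidden_dim + hidden_dim
    cls_token = hidden_dim
    pos_embed = (num_patches + 1) * hidden_dim  # +1 for the class token
    layer_norm = 2 * hidden_dim  # scale and shift
    attention = 4 * (hidden_dim * hidden_dim + hidden_dim)  # q, k, v, output
    mlp = (hidden_dim * mlp_dim + mlp_dim) + (mlp_dim * hidden_dim + hidden_dim)
    block = 2 * layer_norm + attention + mlp
    head = hidden_dim * num_classes + num_classes
    return patch_embed + cls_token + pos_embed + depth * block + layer_norm + head


RESNET50_PARAMS = 25_557_032  # assumed: commonly reported ResNet-50 count

vit_b = vit_param_count(224, 16, hidden_dim=768, mlp_dim=3072, depth=12)
vit_h = vit_param_count(224, 14, hidden_dim=1280, mlp_dim=5120, depth=32)
ratio = vit_b / RESNET50_PARAMS

print(f"ViT-B/16: {vit_b / 1e6:.1f}M, ViT-H/14: {vit_h / 1e6:.1f}M")
print(f"ViT-B/16 vs ResNet-50: {ratio:.1f}x")
```

Under these assumptions ViT-B/16 comes out at about 86.6M parameters, roughly 3.4 times the size of ResNet-50, and ViT-H/14 at about 632M.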

Data Requirements

Pretraining on ImageNet-1k has been standard practice for quite some time, typically followed by finetuning on a downstream task. Recent studies have shown the benefits of (supervised) pretraining on larger datasets than ImageNet-1k, such as ImageNet-21k and JFT-300M, even with noisier labels (Kolesnikov et al. 2019, Dosovitskiy et al. 2020, Steiner et al. 2021). This is true for both CNNs and ViTs, although CNNs such as ResNet require certain modifications, such as replacing batch normalization, in order to scale effectively (Kolesnikov et al. 2019). Thus, pretraining on more data is preferred for both ViT and CNN, even if the data is somewhat noisy.

Dosovitskiy et al. (2020) compare the results of pretraining on progressively larger datasets: ImageNet-1k, ImageNet-21k and JFT-300M (1.3M, 14M, and 300M images, respectively), finding that ViTs underperform BiT ResNets (Kolesnikov et al. 2019) when trained on ImageNet-1k, but outperform BiT ResNets when trained on ImageNet-21k or JFT-300M. This can also be seen in the figure below (from the same paper), which shows ImageNet classification performance as a function of pretraining dataset size. ResNets perform better with smaller pretraining datasets but plateau sooner than ViTs, which perform better with larger pretraining datasets.

Performance on ImageNet as a function of pretraining size. (Image source: Dosovitskiy et al. 2020)

These results indicate that ViTs are more suitable for large-scale pretraining than ResNets, as their performance scales better on larger pretraining datasets. Although it is not entirely clear at which data scale ViTs start to surpass CNNs, I would put this threshold at about 10 million images for supervised pretraining (mostly based on the ImageNet-21k versus ImageNet-1k results).

Later studies have shown that a ViT can also be trained effectively on ImageNet-1k, using an improved training recipe with stronger data augmentation (Touvron et al. 2020, Steiner et al. 2021). Because ViTs rely more heavily on large datasets, they benefit more from strong data augmentation during training than CNNs do.

Transferability

Let’s now explore the transferability of CNNs and ViTs, i.e., how well their representations transfer to new domains. Transferability is a crucial factor for real-world applications, where compute and training data are often limited.

As discussed, ViTs require a large amount of pretraining data to show benefits compared to CNNs. One might conclude that without a massive training dataset, CNNs are the better option. However, in real-world projects, transfer learning — initializing a model from a pretrained checkpoint — is preferred over training from scratch. Even though some studies show limited benefits of transfer learning in rare situations (He et al. 2018), starting from a pretrained model almost never hurts. It usually provides faster convergence, better performance, and higher sample efficiency (see e.g. Kolesnikov et al. 2019, Steiner et al. 2021, Kornblith et al. 2018, Raghu et al. 2019). This is especially relevant since most popular models have pretrained checkpoints available, which should be used as initial weights for a model when starting any new computer vision project. For instance, Steiner et al. 2021 show that across a wide range of datasets, even if the downstream data of interest appears to be only weakly related to the data used for pretraining, transfer learning remains the best available option for training ViTs. Starting from a pretrained model should be the preferred choice 95% of the time, especially when working with small or mid-sized datasets. Training from scratch is rarely justified, requiring (1) a large domain gap between the pretraining and target task, and (2) a large amount of domain-specific data for (pre-)training.

This means that the decision between CNN and ViT should not necessarily be dictated by the size of the training dataset you have access to, but rather, the availability of a model checkpoint pretrained on a large dataset.

Let’s examine how well ViT and CNN representations transfer. The literature shows that pretraining on more data yields more generic ViT models that transfer better to various datasets (Steiner et al. 2021, Zhou et al. 2021). Furthermore, off-the-shelf ViT features transfer well to new domains, outperforming ResNets (Naseer et al. 2021). This makes a ViT pretrained on large amounts of data a good choice for transfer learning.

The plot below shows the transfer learning performance of ViT on the VTAB specialized benchmark, a suite of datasets consisting of medical and aerial images for classification, where each task contains only 800 training images. The plot shows that pretraining on more data is beneficial, as ImageNet-21k pretraining outperforms ImageNet-1k pretraining.

Pretraining on more data yields more transferable ViT models on average, as measured on the VTAB specialized suite, consisting of medical and aerial images. (Image source: Steiner et al. 2021)

Pretraining performance correlates quite well with downstream finetuning performance for both CNNs and ViTs, with occasional exceptions. For example, weight decay and other regularizers can correlate negatively with pretraining performance but positively with downstream performance (Zhai et al. 2021, Kornblith et al. 2018). However, as a general heuristic for maximizing performance, starting from the model with the highest pretraining performance is most likely the best choice for transfer learning. Because this correlation is strong but not perfect, it is worth trying several pretrained models to find the best performer on the downstream task, provided time and compute allow.

Based on these findings, we can conclude that ViTs exhibit strong transfer learning performance, as is the case for CNNs. Compared to ResNets, ViTs appear to transfer better on average. In general, when pretraining data is limited, CNNs are most effective. As pretraining data is scaled up, there is reason to believe ViTs exhibit better transfer learning performance. However, there is also reason to believe CNNs can transfer equally well when using the same training tricks as ViTs.

Robustness

Having discussed transferability, let’s now examine the robustness of CNNs and ViTs. Robustness to image corruptions, such as those shown in the figure below, is particularly relevant when data drift is a concern in deployment settings.

ViTs show robustness to various image corruptions (Image source: Naseer et al. 2021)

There is evidence that ViTs are more robust to occlusions, adversarial and natural perturbations, and distribution shifts (Naseer et al. 2021, Paul et al. 2021, Xie et al. 2021, Zhou et al. 2022, Shao et al. 2021, Zhang et al. 2021), with the self-attention architecture design playing an important role in achieving stronger generalization (Bai et al. 2021). On the other hand, the more recent ConvNeXt CNN architecture might achieve similar robustness to ViTs (Pinto et al. 2022).

One explanation for the robustness of ViTs to corruptions such as occlusion is a higher bias towards shapes rather than local textures and backgrounds, compared to CNNs (Naseer et al. 2021, Zhang et al. 2021). This observation aligns well with the idea that self-attention more effectively captures global relationships, while convolutions are biased towards local interactions. While the receptive field of a CNN expands gradually with increasing depth, self-attention can flexibly adjust its receptive field as needed in any part of the network. For example, see the figure below, which shows that a ViT integrates information across the entire image even in the lowest layers.

Average pixel distance across which attention is computed at various layers of a ViT network. Image width is 224 pixels. (Image source: Dosovitskiy et al. 2020)
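To illustrate how slowly a CNN's receptive field grows, it can be computed with the standard recurrence: each layer adds (kernel size - 1) times the product of all previous strides. The sketch below is my own illustration, and the layer configurations are hypothetical rather than taken from any of the cited papers.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride) tuples, applied in order.
    """
    field, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


# Hypothetical stack: each extra 3x3 stride-1 conv adds only 2 pixels.
for depth in (1, 2, 4, 8):
    print(depth, receptive_field([(3, 1)] * depth))
```

A self-attention layer, by contrast, can attend across all tokens from the first layer onwards, so its attainable receptive field is the full 224-pixel image regardless of depth.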

Model Efficiency

Note: I realize this section could be supported with a bit more data. I might expand on it in the future.

Having examined robustness, let’s now consider the efficiency of CNNs and ViTs. Model efficiency is an important consideration, especially in applications where computational resources are limited. When it comes to model efficiency, several factors must be considered, such as FLOPs, power consumption, and memory consumption. Importantly, a distinction can be made between efficiency during model training and efficiency during inference (at deployment time).

When it comes to specialized architectures emphasizing model efficiency, CNNs are arguably more mature. For example, CNN architectures like MobileNet, SqueezeNet, and EfficientNet are designed to be lightweight and efficient, making them suitable for embedded or real-time applications. Additionally, there are various techniques to reduce model size and improve inference efficiency without significant performance loss, such as pruning, quantization, and knowledge distillation. These techniques can be applied to both CNNs and ViTs. See also this paper for an interesting comparison of lightweight backbones (Table 3).

Dosovitskiy et al. (2020) show that ViTs are more compute efficient to pretrain than ResNets: for the same computational budget, ViTs reach higher performance (see figure below), using approximately 2-4× less compute to attain the same performance. This can be beneficial when pretraining on large datasets, as the model could converge faster.

ViT is more compute efficient than ResNet. (Image source: Dosovitskiy et al. 2020)

Note also that due to the popularity of the Transformer architecture in fields like NLP, optimization of Transformers is a rapidly evolving subject. For an impression of inference optimization, see this post by Lilian Weng.

Limitations

Most studies cited here evaluate CNNs and ViTs on image classification, most commonly on ImageNet. This is not representative of other tasks such as dense prediction.
Many comparison studies compare ViTs to ResNet variants, such as BiT. This is perhaps not an entirely fair comparison, since there are more recent CNN architectures such as RegNet and ConvNeXt.

Summary and Recommendations

The decision tree at the top of this page should provide a useful guideline for choosing a model based on various considerations. Based on the data discussed above, here is a summary of my recommendations for choosing between CNNs and ViTs.

Transfer learning from a pretrained model should be the preferred choice 95% of the time. This holds for both CNNs and ViTs and is especially true when working with small or mid-sized datasets.
Pick a pretrained model checkpoint with the highest upstream performance. CNNs and ViTs both transfer well, which means that the decision between the two architectures should be made by picking the model that performs best during pretraining.
Pick a model checkpoint trained on more upstream data. This holds for both CNNs and ViTs. For example, pick a model trained on ImageNet-21k instead of ImageNet-1k, or a model trained on a large unlabeled dataset in a self-supervised way.
Pick the largest model that fits your hardware and latency limitations. This holds for both CNNs and ViTs. Larger models outperform smaller models when trained on sufficient data, and transfer performance correlates highly with pretraining performance. An exception would be when your target task is simple enough not to require a large model.
Prefer CNN if development time is an important factor. CNNs are a more mature architecture than ViTs, which can make them easier to work with due to existing frameworks and training recipes that are tried and tested.
Prefer CNN for embedded and real-time applications. This is because there is a more mature ecosystem of tools available for CNNs.
Prefer CNN on tasks where pretrained checkpoints are not available, or when checkpoints pretrained on datasets larger than ImageNet-1k are not available. CNNs are the best choice when large scale pretraining is not an option.
Prefer ViT if robustness to image corruptions and/or data drift is a concern. ViTs have been shown to be relatively robust to such perturbations, possibly because ViTs are biased towards shapes, whereas CNNs are biased towards local textures and backgrounds.
When using a CNN, consider using a modern CNN architecture such as ConvNeXt.
Consider trying out modified ViT architectures:
- Hybrid CNN/ViT architecture
- ViT with vision priors, to reduce the risk of overfitting. See, e.g., Swin and its derivatives

If I were to summarize all this in a few sentences, I would say that CNNs are a practical and high-performing choice for many real-world applications. For the most part, ViTs are worth trying if you want to try and squeeze out maximal performance. Also, Vision Transformers are worth keeping an eye on due to developments in self-supervised learning and multi-modal training, where the flexibility of the Transformer architecture is quite useful.

Finally, I have omitted a more in-depth discussion on applying ViT for dense prediction tasks such as object detection and instance segmentation. This comes with several caveats, especially when it comes to pretraining. I might write a part 2 of this post at some point where I discuss these things in more depth.

References

[1] Vaswani et al. “Attention Is All You Need”, 2017.
[2] Dosovitskiy et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, 2020.
[3] Steiner et al. “How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers”, 2021.
[4] Kolesnikov et al. “Big Transfer (BiT): General Visual Representation Learning”, 2019.
[5] Touvron et al. “Training data-efficient image transformers & distillation through attention”, 2020.
[6] He et al. “Rethinking ImageNet Pre-training”, 2018.
[7] Kornblith et al. “Do Better ImageNet Models Transfer Better?”, 2018.
[8] Raghu et al. “Transfusion: Understanding Transfer Learning for Medical Imaging”, 2019.
[9] Zhou et al. “ConvNets vs. Transformers: Whose Visual Representations are More Transferable?”, 2021.
[10] Naseer et al. “Intriguing Properties of Vision Transformers”, 2021.
[11] Zhai et al. “Scaling Vision Transformers”, 2021.
[12] Paul et al. “Vision Transformers are Robust Learners”, 2021.
[13] Xie et al. “SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers”, 2021.
[14] Zhou et al. “Understanding The Robustness in Vision Transformers”, 2022.
[15] Shao et al. “On the Adversarial Robustness of Vision Transformers”, 2021.
[16] Zhang et al. “Delving Deep into the Generalization of Vision Transformers under Distribution Shifts”, 2021.
[17] Bai et al. “Are Transformers More Robust Than CNNs?”, 2021.
[18] Liu et al. “A ConvNet for the 2020s”, 2022.
[19] Pinto et al. “An Impartial Take to the CNN vs Transformer Robustness Contest”, 2022.

Extracting High-Quality Keyframes from Videos Using FFmpeg

2024-05-07T00:00:00+00:00

I recently found out that when extracting individual frames from a video, not all frames are of the same quality. This is because temporal compression is employed to store videos in a more efficient way, without needing to store each individual frame in its full size. It exploits the similarities between neighboring frames to reduce the amount of data needed to store the video. This can lead to artifacts such as blurriness when frames are extracted as individual images.

FFmpeg can extract only the high-quality frames from a video by extracting keyframes. Keyframes are stored as complete images within the video stream and should therefore have higher quality than intermediate frames. There are two ways of extracting keyframes with FFmpeg:

1. Using the -skip_frame nokey option:
```
ffmpeg -skip_frame nokey -i input.mp4 -vsync vfr -frame_pts true out_%03d.png
```
- skip_frame nokey skips all non-keyframes
- vsync vfr discards unused frames to avoid duplicates
- frame_pts true names each output image after the frame's presentation timestamp, which typically corresponds to the frame's position in the video
2. Using the -vf "select=eq(pict_type,I)" filter:
```
ffmpeg -i input.mp4 -vf "select=eq(pict_type\,I)" -vsync vfr out_%03d.png
```
- select='eq(pict_type,I)' selects only frames where the picture type is I (keyframe)

I prefer option 1, because the resulting frame names contain the exact position of the frame within the video (because of the -frame_pts true flag).
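If you want to run this from a script, a thin Python wrapper around the first command could look like the sketch below. The function names and the default output pattern are placeholders of my own. The flags are the ones shown above, and ffmpeg is assumed to be available on your PATH.

```python
import subprocess


def keyframe_command(video_path, output_pattern="out_%03d.png"):
    """Build the ffmpeg argument list for extracting keyframes only."""
    return [
        "ffmpeg",
        "-skip_frame", "nokey",  # decode keyframes only (must precede -i)
        "-i", video_path,
        "-vsync", "vfr",         # drop the skipped frames instead of duplicating
        "-frame_pts", "true",    # number output files by frame timestamp
        output_pattern,
    ]


def extract_keyframes(video_path, output_pattern="out_%03d.png"):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(keyframe_command(video_path, output_pattern), check=True)


if __name__ == "__main__":
    print(" ".join(keyframe_command("input.mp4")))
```

Passing the arguments as a list avoids shell quoting issues, which is mostly a matter of taste for a command this simple.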

Using Albumentations in Detectron2

2024-04-13T00:00:00+00:00

I have recently been using Detectron2 to train deep learning models for object detection and instance segmentation. This works great, but I found that there is one area in which Detectron2 is lacking: data augmentation. Although it has a decent API for data transforms, it has a limited selection of useful data augmentations. Normally, a dedicated data augmentation library such as Albumentations would be my first choice for this. Unfortunately, unlike similar libraries such as mmdetection, Detectron2 does not have any built-in integration with Albumentations.

So instead, I started looking for custom ways to integrate Albumentations into Detectron2. Since Detectron2 already has its own data transforms API, I expected integration with Albumentations to be relatively straightforward. However, it turned out to be a little more complicated than I expected. Therefore, I’m sharing my findings here in case they are useful to anyone else.

The goal: Albumentations wrapper in Detectron2

Because Detectron2 has some useful data transformations, the goal here is to use the Detectron2 transforms API as well as the Albumentations API in the same pipeline. It should be possible to integrate this into a standard Detectron2 pipeline. Here’s what it should look like:

import detectron2.data.transforms as T
import albumentations as A

augmentations = [
    T.Albumentations(A.HorizontalFlip(p=0.5)),
    T.Albumentations(A.RandomBrightnessContrast(p=0.5)),
    T.FixedSizeCrop(crop_size=(256, 256)),
]

Let’s say you’re using the standard DatasetMapper class from Detectron2. The augmentations could then be passed to it like this:

from detectron2.data import DatasetMapper

mapper = DatasetMapper(cfg, augmentations=augmentations)

Or, here’s an example of a custom training dataloader using the DefaultTrainer class:

import detectron2.data.transforms as T
from detectron2.engine import DefaultTrainer
from detectron2.data import build_detection_train_loader, DatasetMapper
import albumentations as A

def get_augmentations():
    return [
        T.Albumentations(A.HorizontalFlip(p=0.5)),
        T.Albumentations(A.RandomBrightnessContrast(p=0.5)),
        T.FixedSizeCrop(crop_size=(256, 256)),
    ]

class Trainer(DefaultTrainer):
    @classmethod
    def build_train_loader(cls, cfg):
        mapper = DatasetMapper(
            cfg, is_train=True, augmentations=get_augmentations(),
        )
        return build_detection_train_loader(cfg, mapper=mapper)

In the next sections, I will go through an explanation of how this can be achieved. If you just want the code, skip to the last section, while making sure to read the limitations section.

How does Detectron2 use data transforms?

Let’s start by looking at the detectron2.data.transforms API. Detectron2 uses two primary classes for data augmentation: Augmentation and Transform (defined here and here). The Augmentation class acts as a wrapper around the Transform class. The primary difference between the two classes is that calling Transform is expected to be deterministic, whereas Augmentation can be non-deterministic.

The reason for this design choice (or at least one of the reasons) appears to be that in a standard Detectron2 pipeline, data transformations are applied not in one single place, but in several. This can be seen in the DatasetMapper class, which takes metadata for a single image as input and maps it into a format used by the model. Here are the relevant code snippets:

# From detectron2/data/dataset_mapper.py
def __call__(self, dataset_dict):
    # Note that `dataset_dict` holds the metadata of a single image

    # ...

    # Line 163
    aug_input = T.AugInput(image, sem_seg=sem_seg_gt)
    transforms = self.augmentations(aug_input)
    image, sem_seg_gt = aug_input.image, aug_input.sem_seg

    # ...

    # Line 188
    if "annotations" in dataset_dict:
        self._transform_annotations(dataset_dict, transforms, image_shape)

At a high level, this code shows that data transformations are applied in two places:

Line 163-165: image-level annotations (image + semantic segmentation mask)
Line 188-189: instance annotations (bounding boxes + instance segmentations)

In both these places, the same transforms variable is used to transform the data. Now consider the following: The data transforms need to be applied in the same way (i.e. deterministically) in order for the image and its annotations to maintain consistency. For example, when using a random rotation transform, if the image was rotated by 15 degrees, but the bounding boxes were rotated by 25 degrees, then clearly this would lead to inconsistency between the image and the bounding boxes. Instead, the image and bounding boxes need to be rotated by the same amount to maintain alignment. Therefore, transforms needs to lead to the same result each time it is applied.

Determinism also comes with drawbacks. In many cases, it is not ideal to use only deterministic transforms in a deep learning pipeline. Data augmentation is a powerful tool because it can transform data in random ways, to increase the diversity of the dataset. For example, for a single image, you can randomly vary the brightness level, to make your model better at handling various brightness ranges. This is particularly useful for smaller datasets, but can just as easily help build state-of-the-art models.

Detectron2 addresses the need for random augmentations by defining the Augmentation class. Basically, this class constructs a Transform class instance, in a non-deterministic way. Specifically, by calling Augmentation.get_transform(), a new Transform instance is created with randomly sampled parameters. If we look back at the DatasetMapper.__call__ definition shown above, where augmentations are applied in two separate places, the following procedure is applied under the hood: First, call Augmentation.get_transform() once at the beginning of the function to randomly sample a new Transform instance. Then, apply this transform deterministically to all input data.
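This division of labor can be illustrated with a stripped-down, dependency-free sketch. The classes below are hypothetical stand-ins that only mimic the roles of Augmentation and Transform; they are not Detectron2's actual classes.

```python
import random


class ShiftTransform:
    """Deterministic: always shifts by the same (dx, dy)."""

    def __init__(self, dx, dy):
        self.dx, self.dy = dx, dy

    def apply_coords(self, points):
        return [(x + self.dx, y + self.dy) for x, y in points]

    def apply_box(self, box):
        x0, y0, x1, y1 = box
        return (x0 + self.dx, y0 + self.dy, x1 + self.dx, y1 + self.dy)


class RandomShift:
    """Non-deterministic: samples new parameters on every get_transform() call."""

    def __init__(self, max_shift):
        self.max_shift = max_shift

    def get_transform(self):
        dx = random.randint(-self.max_shift, self.max_shift)
        dy = random.randint(-self.max_shift, self.max_shift)
        return ShiftTransform(dx, dy)


# Sample once, then reuse the same transform for every piece of data.
transform = RandomShift(max_shift=10).get_transform()
corners = transform.apply_coords([(0, 0), (20, 30)])
box = transform.apply_box((0, 0, 20, 30))

# The shifted box still matches the shifted corner points.
assert box == (*corners[0], *corners[1])
```

Calling get_transform() a second time would sample a different shift, which is exactly why it must only be called once per image.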

Albumentations: Slightly different

On the other hand, the default Albumentations API applies transforms in a fully non-deterministic way, where the transformations are randomly applied each time they are called. Unlike Detectron2, there is no intermediate abstraction which is used for applying transforms in a deterministic way. For example, every time you call A.RandomCrop, you will get a different crop. This works because the transformations are all applied in a single call. Consider the following standard Albumentations workflow:

import albumentations as A

transform = A.Compose([
    A.RandomBrightnessContrast(),
    A.RandomCrop(256, 256),
], bbox_params=A.BboxParams(format='coco'))

image = ...   # image here
bboxes = ...  # bounding boxes here

transformed = transform(image=image, bboxes=bboxes)

Even though calling transform here is non-deterministic (you get a different result every time you call it), this still works correctly because all data is passed to the transform at the same time. This means all data (in this case, the image and bounding boxes) will be transformed in a way that maintains alignment between them. Note that this means it is expected of the user to call transform only once for a single image and its annotations. Applying the transform separately to the image and bounding boxes would lead to inconsistencies, since the random parameters would be different each time.

Putting the pieces together

The non-determinism of the Albumentations API makes the integration with Detectron2 non-trivial, because Detectron2 expects augmentations that can be applied several times in a deterministic way. However, when inspecting the Albumentations code more closely, it can be seen that there is actually a way to apply transformations deterministically. Concretely, the Albumentations BasicTransform class defines a get_params() method (here), which randomly samples the parameters of the transform it is called on. This means that after get_params() is called, the returned parameters can be passed to the transform, which then executes deterministically. For example, here is the implementation of get_params() for A.RandomCrop (source), which samples the h_start and w_start parameters used to define the position of the random crop:

    def get_params(self) -> Dict[str, float]:
        return {"h_start": random.random(), "w_start": random.random()}
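
To see why fixed parameters make the transform repeatable, here is a small standalone sketch. It assumes h_start and w_start are fractions in [0, 1) that are scaled to a top-left offset. The helper names are mine, and the exact mapping used inside Albumentations may differ between versions.

```python
import random


def sample_params():
    """Random step: mirrors the idea of get_params() (illustrative only)."""
    return {"h_start": random.random(), "w_start": random.random()}


def crop_window(height, width, crop_h, crop_w, h_start, w_start):
    """Deterministic step: map the sampled fractions to a pixel window."""
    y0 = int((height - crop_h + 1) * h_start)
    x0 = int((width - crop_w + 1) * w_start)
    return x0, y0, x0 + crop_w, y0 + crop_h


params = sample_params()  # sampled once per image
window_for_image = crop_window(480, 640, 256, 256, **params)
window_for_boxes = crop_window(480, 640, 256, 256, **params)

# Reusing the same params yields the same window for image and annotations.
assert window_for_image == window_for_boxes
```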

Putting all of these pieces together, we can come up with the following Albumentations wrapper in Detectron2 (thanks to KUASWoodyLIN for his code contribution):

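import detectron2.data.transforms as T
from detectron2.data.transforms import NoOpTransform
# AlbumentationsTransform is the companion class discussed below
# (it is not part of upstream Detectron2).

class Albumentations(T.Augmentation):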
    """
    Wrap an augmentation from the albumentations library:
    https://github.com/albumentations-team/albumentations/.
    Image, Bounding Box and Segmentation are supported.
    """

    def __init__(self, aug):
        """
        Args:
            aug (albumentations.BasicTransform): albumentations transform
        """
        self._aug = aug

    def get_transform(self, image):
        do = self._rand_range() < self._aug.p
        if do:
            params = self.prepare_params(image)
            h, w = image.shape[:2]
            return AlbumentationsTransform(self._aug, params, h, w)
        else:
            return NoOpTransform()

    def prepare_params(self, image):
        params = self._aug.get_params()
        if self._aug.targets_as_params:
            targets_as_params = {"image": image}
            # Get additional parameters dependent on the input image
            params_dependent_on_targets = self._aug.get_params_dependent_on_targets(
                targets_as_params
            )
            params.update(params_dependent_on_targets)
        params = self._aug.update_params(params, **{"image": image})
        return params

The key idea is that get_transform() returns a deterministic transform, using randomly sampled parameters obtained by calling prepare_params(). The AlbumentationsTransform class takes the Albumentations transform along with the parameters and applies the transform deterministically. The full source code for AlbumentationsTransform can be found here.

By using this wrapper class, we can now define Albumentations transforms as part of our regular Detectron2 pipeline, as shown in the beginning of this post.

Limitations

There are some limitations to this implementation to be aware of:

It does not work for keypoints yet. Keypoint augmentations could be added relatively easily by implementing the AlbumentationsTransform.apply_coords method. The main reason I have not implemented it yet is that I did not have any keypoint data to test on.
It does not work for composite transforms, e.g. A.Compose or A.OneOf.

Conclusion

If you want to use this implementation: I forked Detectron2 and added my changes there, which can be found here. If you just want a pip install command, use the following:

pip install 'git+https://github.com/tobiasvanderwerff/detectron2.git'

Alternatively, you could also copy the Albumentations and AlbumentationsTransform classes as defined here and here to your own project.

I’ve made a pull request for the changes to be integrated into Detectron2, but given the recent lack of activity by maintainers, it is unclear if or when it might get merged.

In summary, by leveraging the get_params() functionality in Albumentations to enable deterministic transforms, we can create a custom wrapper that allows using Albumentations within the Detectron2 augmentations API. This enables leveraging Albumentations’ wide array of augmentations to improve model performance.
