Daniel Lemire's blog

Prefix sums at tens of gigabytes per second with ARM NEON

Suppose that you have a record of your sales per day. You might want to get a running record where, for each day, you are told how many sales you have made
since the start of the year.

day	sales per day	running sales
1	10$	10 $
2	15$	25 $
3	5$	30 $

Such an operation is called a prefix sum or a scan.

Implementing it in C is not difficult. It is a simple loop.

  for (size_t i = 1; i < length; i++) {
    data[i] += data[i - 1];
  }

How fast can this function be? We can derive a speed limit rather simply: to compute the current value, you must have computed the previous one, and so forth.

data[0] -> data[1] -> data[2] -> ...

At best, you require one CPU cycle per entry in your table. Thus, on a 4 GHz processor, you might process 4 billion integer values per second. It is an upper bound but you might be able to reach close to it in practice on many modern systems. Of course, there are other instructions involved such as loads, stores and branching, but our processors can execute many instructions per cycle and they can predict branches effectively. So you should be able to process billions of integers per second on most processors today.

Not bad! But can we do better?

We can use SIMD instructions. SIMD instructions are special instructions that process several values at once. All 64-bit ARM processors support NEON instructions. NEON instructions can process four integers at once, if they are packed in one SIMD register.

But how do you do the prefix sum on a 4-value register? You can do it with two shifts and two additions. In theory, it scales as log(N) where N is the number
elements in a vector register.

input   = [A B   C     D]
shift1  = [0 A   B     C]
sum1    = [A A+B B+C   C+D]
shift2  = [0 0   A     B+A]
result  = [A A+B A+B+C A+B+C+D]

You can then extract the last value (A+B+C+D) and broadcast it to all positions so that you can add it to the next value.

Is this faster than the scalar approach? We have 4 instructions in sequence, plus at least one instruction if you want to use the total sum in the next block of four values.

Thus the SIMD approach might be worse. It is disappointing.

A solution might be the scale up over many more integer values.

Consider ARM NEON which has interleaved load and store instructions. If you can load 16 values at once, and get all of the first values together, all of the second values together, and so forth.

original data : ABCD EFGH IJKL MNOP
loaded data   : AEIM BFJN CGKO DHLP

Then I can do a prefix sum over the four blocks in parallel. It takes three instructions. At the end of the three instructions, we have one register which contains the local sums:

A+B+C+D E+F+G+H I+J+K+L M+N+O+P

And then we can apply our prefix sum recipe on this register (4 instructions). You might end up with something like 8 sequential instructions per block of 16 values.

It is theoretically twice as fast as the scalar approach.

In C with instrinsics, you might code it as follows.

void neon_prefixsum_fast(uint32_t *data, size_t length) {
  uint32x4_t zero = {0, 0, 0, 0};
  uint32x4_t prev = {0, 0, 0, 0};

  for (size_t i = 0; i < length / 16; i++) {
    uint32x4x4_t vals = vld4q_u32(data + 16 * i);

    // Prefix sum inside each transposed ("vertical") lane
    vals.val[1] = vaddq_u32(vals.val[1], vals.val[0]);
    vals.val[2] = vaddq_u32(vals.val[2], vals.val[1]);
    vals.val[3] = vaddq_u32(vals.val[3], vals.val[2]);

    // Now vals.val[3] contains the four local prefix sums:
    //   vals.val[3] = [s0=A+B+C+D, s1=E+F+G+H, 
    //                  s2=I+J+K+L, s3=M+N+O+P]

    // Compute prefix sum across the four local sums 
    uint32x4_t off = vextq_u32(zero, vals.val[3], 3);
    uint32x4_t ps = vaddq_u32(vals.val[3], off);       
    off = vextq_u32(zero, ps, 2);                      
    ps = vaddq_u32(ps, off);

    // Now ps contains cumulative sums across the four groups
    // Add the incoming carry from the previous 16-element block
    ps = vaddq_u32(ps, prev);

    // Prepare carry for next block: broadcast the last lane of ps
    prev = vdupq_laneq_u32(ps, 3);

    // The add vector to apply to the original lanes is the 
    // prefix up to previous group
    uint32x4_t add = vextq_u32(prev, ps, 3);  

    // Apply carry/offset to each of the four transposed lanes
    vals.val[0] = vaddq_u32(vals.val[0], add);
    vals.val[1] = vaddq_u32(vals.val[1], add);
    vals.val[2] = vaddq_u32(vals.val[2], add);
    vals.val[3] = vaddq_u32(vals.val[3], add);

    // Store back the four lanes (interleaved)
    vst4q_u32(data + 16 * i, vals);
  }

  scalar_prefixsum_leftover(data, length, 16);
}

Let us try it out on an Apple M4 processor (4.5 GHz).

method	billions of values/s
scalar	3.9
naive SIMD	3.6
fast SIMD	8.9

So the SIMD approach is about 2.3 times faster than the scalar approach. Not bad.

My source code is available on GitHub.

Appendix. Instrinsics

Intrinsic	What it does
`vld4q_u32`	Loads 16 consecutive 32-bit unsigned integers from memory and deinterleaves them into 4 separate `uint32x4_t` vectors (lane 0 = elements 0,4,8,12,…; lane 1 = 1,5,9,13,… etc.).
`vaddq_u32`	Adds corresponding 32-bit unsigned integer lanes from two vectors (`a[i] + b[i]` for each of 4 lanes).
`vextq_u32`	Extracts (concatenates a and b, then takes 4 lanes starting from lane `n` of the 8-lane concatenation). Used to implement shifts/rotates by inserting zeros (when `a` is zero vector).
`vdupq_laneq_u32`	Broadcasts (duplicates) the value from the specified lane (0–3) of the input vector to all 4 lanes of the result.
`vdupq_n_u32` (implied usage)	Sets all 4 lanes of the result to the same scalar value (commonly used for zero or broadcast).

Text formats are everywhere. Why?

The Internet relies on text formats. Thus, we spend a lot of time producing and consuming data encoded in text.

Your web pages are HTML. The code running in them is JavaScript, sent as text (JavaScript source), not as already-parsed code. Your emails, including their attachments, are sent as text (your binary files are sent as text).

It does not stop there. The Python code that runs your server is stored as text. It queries data by sending text queries. It often gets back the answer as text that must then be decoded.

JSON is the universal data interchange format online today. We share maps as JSON (GeoJSON).

Not everything is text, of course. There is no common video or image format that is shared as text. Transmissions over the Internet are routinely compressed to binary formats. There are popular binary formats that compete with JSON.
But why is text dominant?

It is not because, back in the 1970s, programmers did not know about binary formats.

In fact, we did not start with text formats. Initially, we worked with raw binary data. Those of us old enough will remember programming in assembly using raw byte values.

Why text won?

1.Text is efficient.

In the XML era, when everything had to be XML, there were countless proposals for binary formats. People were sometimes surprised to find that the binary approach was not much faster in practice. Remember that many text formats date back to an era when computers were much slower. Had text been a performance bottleneck, it would not have spread. Of course, there are cases where text makes things slower. You then have a choice: optimize your code further or transition to another format. Often, both are viable.

It is easy to make wrong assumptions about binary formats, such as that you can consume them without any parsing or validation. If you pick up data from the Internet, you must assume that it could have been sent by an adversary or someone who does not follow your conventions.

2.Text is easy to work with.

If you receive text from a remote source, you can often transform it, index it, search it, quote it, version it… with little effort and without in-depth knowledge of the format. Text is often self-documenting.

In an open world, when you will never speak with the person producing the data, text often makes everything easier and smoother.

If there is an issue to report and the data is in text, you can usually copy-paste the relevant section into a message. Things are much harder with a binary format.

You can use newline characters in URLs

We locate web content using special addresses called URLs. We are all familiar with addresses like https://google.com. Sometimes, URLs can get long and they can become difficult to read. Thus, we might be tempted to format them
like so in HTML using newline and tab characters, like so:

<a href="https://lemire.me/blog/2026/02/21/
        how-fast-do-browsers-correct-utf-16-strings/">my blog post</a>

It will work.

Let us refer to the WHATWG URL specification that browsers follow. It makes two statements in sequence.

If input contains any ASCII tab or newline, invalid-URL-unit validation error.
Remove all ASCII tab or newline from input.

Notice how it reports an error if there is a tab or newline character, but continues anyway? The specification says that A validation error does not mean that the parser terminates and it encourages systems to report errors somewhere. Effectively, the error is ignored although it might be logged. Thus our HTML is fine in practice.

The following is also fine:

<a href="https://go
ogle.c
om" class="button">Visit Google</a>

You can also use tabs. But you cannot arbitrarily insert any other whitespace.

Yet there are cases when you can use any ASCII whitespace character: data URLs. Data URLs (also called data URIs) embed small files—like images, text, or other content—directly inside a URL string, instead of linking to an external resource. Data URLs are a special kind of URL and they follow different rules.

A typical data URL might look like data:image/png;base64,iVBORw0KGgoAAAANSUhEUg... where the string iVBORw0KGgoAAAANSUhEUg... is the binary data of the image that has been encoded with base64. Base64 is a text format that can represent any binary content: we use 64 ASCII characters so that each character encodes 6 bits. Your binary email attachments are base64 encoded.

On the web, when decoding a base64 string, you ignore all ASCII whitespaces (including the space character itself). Thus you can embed a PNG image in HTML as follows.

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAA
                                 QAAAAECAIAAAAmkwkpAAAAEUl
                                 EQVR4nGP8z4AATEhsPBwAM9EB
                                 BzDn4UwAAAAASUVORK5CYII=" />

This HTML code is valid and will insert a tiny image in your page.

But there is more. A data URL can also be used to insert an SVG image. SVG (Scalable Vector Graphics) is an XML-based vector image format that describes 2D graphics using mathematical paths, shapes, and text instead of pixels.
The following should draw a very simple sunset:

<img src='data:image/svg+xml,
<svg width="200" height="200" 
     xmlns="http://www.w3.org/2000/svg">
  <rect width="100%" height="100%" fill="blue" /> 
  <!-- the sky -->
  <circle cx="100" cy="110" r="50" fill="yellow" />  
  <!-- the sun -->
  <rect x="0" y="120" width="200" height="80" fill="brown" />  
  <!-- the ground -->
</svg>' />

Observe how I was able to format the SVG code so that it is readable.

Further reading: Nizipli, Y., & Lemire, D. (2024). Parsing millions of URLs per second. Software: Practice and Experience, 54(5), 744-758.

How fast do browsers correct UTF-16 strings?

JavaScript represents strings using Unicode, like most programming languages today. Each character in a JavaScript string is stored using one or two 16-bit words. The following JavaScript code might surprise some programmers because a single character becomes two 16-bit words.

> t="🧰"
'🧰'
> t.length
2
> t[0]
'\ud83e'
> t[1]
'\uddf0'

The convention is that \uddf0 is the 16-bit value 0xDDF0 also written U+DDF0.

The UTF-16 standard is relatively simple. There are three types of values. high surrogates (the range U+D800 to U+DBFF), low surrogates (U+DC00 to U+DFFF), and all other code units (U+0000–U+D7FF together with U+E000–U+FFFF). A high surrogate must always be followed by a low surrogate, and a low surrogate must always be preceded by a high surrogate.

What happens if you break the rules and have a high surrogate followed by a high surrogate? Then you have an invalid string. We can correct the strings by patching them: we replace the bad values by the replacement character (\ufffd). The replacement character sometimes appears as a question mark.

To correct a broken string in JavaScript, you can call the toWellFormed method.

> t = '\uddf0\uddf0'
'\uddf0\uddf0'
> t.toWellFormed()
'��'

How fast is it?

I wrote a small benchmark that you can test online to measure its speed. I use broken strings of various sizes up to a few kilobytes. I run the benchmarks on my Apple M4 processor using different browsers.

Browser	Speed
Safari 18.6	1 GiB/s
Firefox 147	3 GiB/s
Chrome 145	15 GiB/s

Quite a range of performance! The speed of other chromium-based browsers (Brave and Edge) is much the same as Chrome.

I also tested with JavaScript runtimes.

Engine	Speed
Node.js v25.5.0	16 GiB/s
Bun 1.3.9	8.4 GiB/s

Usually Bun is faster than Node, but in this instance, Node is twice as far as Bun.

Thus, we can correct strings in JavaScript at over ten gigabytes per second if you use Chromium-based browsers.

How bad can Python stop-the-world pauses get?

When programming, we need to allocate memory, and then deallocate it. If you program in C, you get used to malloc/free functions. Sadly, this leaves you vulnerable to memory leaks: unrecovered memory. Most popular programming languages today use automated memory management: Java, JavaScript, Python, C#, Go, Swift and so forth.

There are essentially two types of automated memory managements. The simplest method is reference counting. You track how many references there are to each object. When an object has no more references, then we can free the memory associated with it. Swift and Python use reference counting. The downside of reference counting are circular references. You may have your main program reference object A, then you add object B which references object A, and you make it so that object A also reference object B. Thus object B has one reference while object A has two references. If your main program drops its reference to object A, the both objects A and B still have a reference count of one. Yet they should be freed. To solve this problem, you could just visit all of your objects to detect which are unreachable, including A and B. However, it takes time to do so. Thus, the other popular approach of automated memory management: generational garbage collection. You use the fact that most memory gets released soon after allocation. Thus you track young objects and visit them from time to time. Then, more rarely, you do a full scan. The downside of generational garbage collection is that typical implementations stop the world to scan the memory. In many instances, your entire program is stopped. There are many variations on the implementation, with decades of research.

The common Python implementation has both types: reference counting and generational garbage collection. The generational garbage collection component can trigger pauses. A lot of servers are written in Python. It means that your service might just become unavailable for a time. We often call them ‘stop the world’ pauses. How long can this pause get?

To test this out, I wrote a Python function to create a classical linked list:

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
    def add_next(self, node):
        self.next = node

def create_linked_list(limit):
    """ create a linked list of length 'limit' """
    head = Node(0)
    current = head
    for i in range(1, limit):
        new_node = Node(i)
        current.add_next(new_node)
        current = new_node
    return head

And then I create one large linked list and then, in a tight loop, we create small linked lists that are immediately discarded.

x = create_linked_list(50_000_000)
for i in range(1000000):
    create_linked_list(1000)

A key characteristic of my code is the 50 million linked list. It does not get released until the end of the program, but the garbage collector may still examine it.

And I record the maximum delay between two iterations in the loop (using time.time()).

How bad can it get? The answer depends on the Python version. And it is not consistent from run-to-run. So I ran it once and picked whatever result I got. I express the delay in milliseconds.

python version	system	max delay
3.14	macOS (Apple M4)	320 ms
3.12	Linux (Intel Ice Lake)	2,200 ms

Almost all of this delay (say 320 ms) is due to the garbage collection. Creating a linked list with 1000 elements takes less than a millisecond.

How long is 320 ms? It is a third of a second, so it is long enough for human beings to notice it. For reference, a video game drawing the screen 60 times per second has less than 17 ms to draw the screen. The 2,200 ms delay could look like a server crash from the point of view of a user, and might definitely trigger a time-out (failed request).

I ported the Python program to Go. It is the same algorithm, but a direct comparison is likely unfair. Still, it gives us a reference.

go version	system	max delay
1.25	macOS (Apple M4)	50 ms
1.25	Linux (Intel Ice Lake)	33 ms

Thus Go has pauses that are several times shorter than Python, and there is no catastrophic 2-second pause.

Should these pauses be a concern? Most Python programs do not create so many objects in memory at the same time. Thus you are not likely to see these long pauses if you have a simple web app or a script. Python gives you a few options, such as gc.set_threshold and gc.freeze which could help you tune the behaviour.

My source code is available.

Video

AI: Igniting the Spark to End Stagnation

Much of the West has been economically stagnant. Countries like Canada have failed to improve their productivity and standard of living as of late. In Canada, there has been no progress in Canadian living standards as measured by per-person GDP over the past five years. It is hard to overstate how anomalous this is: the USSR collapsed in part because it could only sustain a growth rate of about 1%, far below what the West was capable of. Canada is more stagnant than the USSR.

Late in 2022, some of us got access to a technical breakthrough: AI. In three years, it has become part of our lives. Nearly all students use AI to do research or write essays.

Dallas Fed economists projected the most credible effect that AI might have on our economies: AI should help reverse the post-2008 slowdown and deliver higher living standards in line with historical technological progress.

It will imply a profound, rapid but gradual transformation of our economy. There will still be teachers, accountants, and even translators in the future… but their work will change as it has changed in the past. Accountants do far less arithmetic today; that part of their work has been replaced by software. Even more of their work is about to be replaced by software, thus improving their productivity further. We will still have teachers, but all our kids, including the poorest ones, will have dedicated always-on tutors: this will not be just available in Canada or the USA, but everywhere. It is up to us to decide who is allowed to build this technology.

AI empowers the individual. An entrepreneur with a small team can get faster access to quality advice, copywriting, and so forth. Artists with an imagination can create more with fewer constraints.

I don’t have to prove these facts: they are fast becoming obvious to the whole world.

New jobs are created. Students of mine work as AI specialists. One of them helps build software providing AI assistance to pharmacists. One of my sons is an AI engineer. These are great jobs.

The conventional explanation for Canada’s stagnation is essentially that we have already harvested all the innovation we are ever going to get. The low-hanging fruit has been picked. Further progress has become inherently difficult because we are already so advanced; there is simply not much room left to improve. In this view, there is no need to rethink our institutions. Yet a sufficiently large breakthrough compels us to reconsider where we stand and what is still possible. It forces us to use our imagination again. It helps renew the culture.

We often hear claims that artificial intelligence will consume vast amounts of energy and water in the coming years. It is true that data centers, which host AI workloads along with many other computing tasks, rely on water for cooling.

But let’s look at the actual water numbers. In 2023, U.S. data centers directly consumed roughly 17.4 billion gallons of water—a figure that could potentially double or quadruple by 2028 as demand grows. By comparison, American golf courses use more than 500 billion gallons every year for irrigation, often in arid regions where this usage is widely criticized as wasteful. Even if data-center water demand were to grow exponentially, it would take decades to reach the scale of golf-course irrigation.

On the energy side, data centers are indeed taking a larger share of electricity demand. According to the International Energy Agency’s latest analysis, they consumed approximately 415 TWh in 2024—about 1.5% of global electricity consumption. This is projected to more than double to around 945 TWh by 2030 (just under 3% of global electricity). However, even this rapid growth accounts for less than 10% (roughly 8%) of the total expected increase in worldwide electricity demand through 2030. Data centers are therefore not the main driver of the much larger rise in overall energy use.

If we let engineers in Australia, Canada, or Argentina free to innovate, we will surely see fantastic developments.

You might also have heard about the possibility that ChatGPT might decide to kill us all. Nobody can predict the future, but you are surely more likely to be killed by cancer than by a rogue AI. And AI might help you with your cancer.

We always have a choice. Nations can try to regulate AI out of existence. We can set up new government bodies to prevent the application of AI. This will surely dampen the productivity gains and marginalize some nations economically.

The European Union showed it could be done. By some reports, Europeans make more money by fining American software companies than by building their own innovation enterprises. Countries like Canada have economies dominated by finance, mining and oil (with a side of Shopify).

If you are already well off, stopping innovation sounds good. It’s not if you are trying to get a start.

AI is likely to help young people who need it so much. They, more than any other group, will find it easier to occupy the new jobs, start the new businesses.

If you are a politician and you want to lose the vote of young people: make it difficult to use AI. It will crater your credibility.

It is time to renew our prosperity. It is time to create new exciting jobs.

References:

Wynne, M. A., & Derr, L. (2025, June 24). Advances in AI will boost productivity, living standards over time. Federal Reserve Bank of Dallas.

Fraser Institute. (2025, December 16). Canada’s recent economic growth performance has been awful.

DemandSage. (2026, January 9). 75 AI in education statistics 2026 (Global trends & facts).

MIT Technology Review. (2026, January 21). Rethinking AI’s future in an augmented workplace.

Davis, J. H. (2025). Coming into view: How AI and other megatrends will shape your investments. Wiley.

Choi, J. H., & Xie, C. (2025, June 26). AI is reshaping accounting jobs by doing the boring stuff. Stanford Graduate School of Business.

International Energy Agency. (n.d.). Energy demand from AI.

University of Colorado Anschutz Medical Campus. (2025, May 19). Real talk about AI and advancing cancer treatments.

International Energy Agency. (2025). Global energy review 2025.

The cost of a function call

When programming, we chain functions together. Function A calls function B. And so forth.

You do not have to program this way, you could write an entire program using a single function. It would be a fun exercise to write a non-trivial program using a single function… as long as you delegate the code writing to AI because human beings quickly struggle with long functions.

A key compiler optimization is ‘inlining’: the compiler takes your function definition and it tries to substitute it at the call location. It is conceptually quite simple. Consider the following example where the function add3 calls the function add.

int add(int x, int y) {
    return x + y;
}

int add3(int x, int y, int z) {
    return add(add(x, y), z);
}

You can manually inline the call as follows.

int add3(int x, int y, int z) {
    return x + y + z;
}

A function call is reasonably cheap performance-wise, but not free. If the function takes non-trivial parameters, you might need to save and restore them on the stack, so you get extra loads and stores. You need to jump into the function, and then jump out at the end. And depending on the function call convention on your system, and the type of instructions you are using, there are extra instructions at the beginning and at the end.

If a function is sufficiently simple, such as my add function, it should always be inlined when performance is critical. Let us examine a concrete example. Let me sum the integers in an array.

for (int x : numbers) {
  sum = add(sum, x);
}

I am using my MacBook (M4 processor with LLVM).

function	ns/int
regular	0.7
inline	0.03

Wow. The inline version is over 20 times faster.

Let us try to see what is happening. The call site of the ‘add’ function is just a straight loop with a call to the function.

ldr    w1, [x19], #0x4
bl     0x100021740    ; add(int, int)
cmp    x19, x20
b.ne   0x100001368    ; <+28>

The function itself is as cheap as it can be: just two instructions.

add    w0, w1, w0
ret

So, we spend 6 instructions for each addition. It takes about 3 cycles per addition.

What about the inline function?

ldp    q4, q5, [x12, #-0x20]
ldp    q6, q7, [x12], #0x40
add.4s v0, v4, v0
add.4s v1, v5, v1
add.4s v2, v6, v2
add.4s v3, v7, v3
subs   x13, x13, #0x10
b.ne   0x1000013fc    ; <+104>

It is entirely different. The compiler has converted the addition to advanced (SIMD) instructions processing blocks of 16 integers using 8 instructions. So we are down to half an instruction per integer (from 6 instructions). So we use 12 times fewer instructions. On top of having fewer instructions, the processor is able to retire more instructions per cycle, for a massive performance boost.

What if we prevented the compiler from using these fancy instructions while still inlining? We still get a significant performance boost (about 10x faster).

function	ns/int
regular	0.7
inline	0.03
inline (no SIMD)	0.07

Ok. But the add function is a bit extreme. We know it should always be inlined. What about something less trivial like a function that counts the number of spaces in a string.

size_t count_spaces(std::string_view sv) {
    size_t count = 0;
    for (char c : sv) {
        if (c == ' ') ++count;
    }
    return count;
}

If the string is reasonably long, then the overhead of the function call should be negligible.
Let us pass a string of 1000 characters.

function	ns/string
regular	111
inline	115

The inline version is not only not faster, but it is even slightly slower. I am not sure why.

What if I use short strings (say between 0 and 6 characters)? Then the inline function is measurably faster.

function	ns/string
regular	1.6
inline	1.0

Takeaways:

Short and simple functions should be inlined when possible if performance is a concern. The benefits can be impressive.
For functions that can be fast or slow, the decision as to whether to inline or not depends on the input. For string processing functions, the size of the string may determine whether inlining is necessary for best performance.

Note: My source code is available.

Converting data to hexadecimal outputs quickly

Given any string of bytes, you can convert it to an hexadecimal string by mapping the least significant and the most significant 4 bits of byte to characters in 01...9A...F. There are more efficient techniques like base64, that map 3 bytes to 4 characters. However, hexadecimal outputs are easier to understand and often sufficiently concise.

A simple function to do the conversion using a short lookup table is as follows:

static const char hex[] = "0123456789abcdef";
for (size_t i = 0, k = 0; k < dlen; i += 1, k += 2) {
    uint8_t val = src[i];
    dst[k + 0] = hex[val >> 4];
    dst[k + 1] = hex[val & 15];
}

This code snippet implements a straightforward byte-to-hexadecimal string conversion loop in C++. It iterates over an input byte array (src), processing one byte at a time using index i, while simultaneously building the output string in dst with index k that advances twice as fast (by 2) since each input byte produces two hexadecimal characters. For each byte, it extracts the value as an unsigned 8-bit integer (val), then isolates the high 4 bits (via right shift by 4) and low 4 bits (via bitwise AND with 15) to index into a static lookup table (hex) containing the characters ‘0’ through ‘9’ and ‘a’ through ‘f’. The loop continues until k reaches the expected output length (dlen), which should be twice the input length, ensuring all bytes are converted without bounds errors.

This lookup table approach is used in the popular Node.js JavaScript runtime. Skovoroda recently proposed to replace this lookup table approach with an arithmetic version.

char nibble(uint8_t x) { return x + '0' + ((x > 9) * 39); }
for (size_t i = 0, k = 0; k < dlen; i += 1, k += 2) {
    uint8_t val = src[i];
    dst[k + 0] = nibble(val >> 4);
    dst[k + 1] = nibble(val & 15);
}

Surprisingly maybe, this approach is much faster and uses far fewer instructions. At first glance, this result might be puzzling. A table lookup is cheap, the new nibble function seemingly does more work.

The trick that Skovoroda relies upon is that compilers are smart: they will ‘autovectorize’ such number crunching functions (if you are lucky). That is, instead of using regular instructions that process byte values, the will SIMD instructions that process 16 bytes at once or more.

Of course, instead of relying on the compiler, you can manually invoke SIMD instructions through SIMD instrinsic functions. Let us assume that you have an ARM processors (e.g., on Apple Silicon). Then you can process blocks of 32 bytes as follows.

size_t maxv = (slen - (slen%32));
for (; i < maxv; i += 32) {
    uint8x16_t val1 = vld1q_u8((uint8_t*)src + i);
    uint8x16_t val2 = vld1q_u8((uint8_t*)src + i + 16);
    uint8x16_t high1 = vshrq_n_u8(val1, 4);
    uint8x16_t low1 = vandq_u8(val1, vdupq_n_u8(15));
    uint8x16_t high2 = vshrq_n_u8(val2, 4);
    uint8x16_t low2 = vandq_u8(val2, vdupq_n_u8(15));
    uint8x16_t high_chars1 = vqtbl1q_u8(table, high1);
    uint8x16_t low_chars1 = vqtbl1q_u8(table, low1);
    uint8x16_t high_chars2 = vqtbl1q_u8(table, high2);
    uint8x16_t low_chars2 = vqtbl1q_u8(table, low2);
    uint8x16x2_t zipped1 = {high_chars1, low_chars1};
    uint8x16x2_t zipped2 = {high_chars2, low_chars2};
    vst2q_u8((uint8_t*)dst + i*2, zipped1);
    vst2q_u8((uint8_t*)dst + i*2 + 32, zipped2);
}

This SIMD code leverages ARM NEON intrinsics to accelerate hexadecimal encoding by processing 32 input bytes simultaneously. It begins by loading two 16-byte vectors (val1 and val2) from the source array using vld1q_u8. For each vector, it extracts the high nibbles (via right shift by 4 with vshrq_n_u8) and low nibbles (via bitwise AND with 15 using vandq_u8 and vdupq_n_u8). The nibbles are then used as indices into a pre-loaded hex table via vqtbl1q_u8 to fetch the corresponding ASCII characters. The high and low character vectors are interleaved using vzipq_u8, producing two output vectors per input pair. Finally, the results are stored back to the destination array with vst1q_u8, ensuring efficient memory operations.

You could do similar work on other systems like x64. The same code with AVX-512 for recent Intel and AMD processors would probably be insanely efficient.

Benchmarking these implementations on a dataset of 10,000 random bytes reveals significant performance differences. The basic lookup table version achieves around 3 GB/s, while the arithmetic version, benefiting from compiler autovectorization, reaches 23 GB/s. The manual SIMD NEON versions push performance further: I reach 42 GB/s in my tests.

method	speed	instructions per byte
table	3.1 GB/s	9
Skovoroda	23 GB/s	0.75
intrinsics	42 GB/s	0.69

One lesson is that intuition can be a poor guide when trying to assess performance.

My source code is available.

Converting floats to strings quickly

When serializing data to JSON, CSV or when logging, we convert numbers to strings. Floating-point numbers are stored in binary, but we need them as decimal strings. The first formally published algorithm is Steele and White’s Dragon schemes (specifically Dragin2) in 1990. Since then, faster methods have emerged: Grisu3, Ryū, Schubfach, Grisu-Exact, and Dragonbox. In C++17, we have a standard function called std::to_chars for this purpose. A common objective is to generate the shortest strings while still being able to uniquely identify the original number.

We recently published Converting Binary Floating-Point Numbers to Shortest Decimal Strings. We examine the full conversion, from the floating-point number to the string. In practice, the conversion implies two steps: we take the number and compute the significant and the power of 10 (step 1) and then we generate the string (step 2). E.g., for the number pi, you might need to compute 31415927 and -7 (step 1) before generating the string 3.1415927. The string generation requires placing the dot at the right location and switching to the exponential notation when needed. The generation of the string is relatively cheap and was probably a negligible cost for older schemes, but as the software got faster, it is now a more important component (using 20% to 35% of the time).

The results vary quite a bit depending on the numbers being converted. But we find that the two implementations tend to do best: Dragonbox by Jeon and Schubfach by Giulietti. The Ryū implementation by Adams is close behind or just as fast. All of these techniques are about 10 times faster than the original Dragon 4 from 1990. A tenfold performance gain in performance over three decades is equivalent to a gain of about 8% per year, entirely due to better implementations and algorithms.

Efficient algorithms use between 200 and 350 instructions for each string generated. We find that the standard function std::to_chars under Linux uses slightly more instructions than needed (up to nearly 2 times too many). So there is room to improve common implementations. Using the popular C++ library fmt is slightly less efficient.

A fun fact is that we found that that none of the available functions generate the shortest possible string. The std::to_chars C++ function renders the number 0.00011 as 0.00011 (7 characters), while the shorter scientific form 1.1e-4 would do. But, by convention, when switching to the scientific notation, it is required to pad the exponent to two digits (so 1.1e-04). Beyond this technicality, we found that no implementation always generate the shortest string.

All our code, datasets, and raw results are open-source. The benchmarking suite is at https://github.com/fastfloat/float_serialization_benchmark, test data at https://github.com/fastfloat/float-data.

Reference: Converting Binary Floating-Point Numbers to Shortest
Decimal Strings: An Experimental Review, Software: Practice and Experience (to appear)

Optimizing Python scripts with AI

One of the first steps we take when we want to optimize software is to look
at profiling data. Software profilers are tools that try to identify where
your software spends its time. Though the exact approach can vary, a typical profiler samples your software (steps it at regular intervals) and collects statistics. If your software is routinely stopped in a given function, this function is likely using a lot of time. In turn, it might be where you should put your optimization efforts.

Matteo Collina recently shared with me his work on feeding profiler data for software optimization purposes in JavaScript. Essentially, Matteo takes the profiling data, and prepares it in a way that an AI can comprehend. The insight is simple but intriguing: tell an AI how it can capture profiling data and then let it optimize your code, possibly by repeatedly profiling the code. The idea is not original since AI tools will, on their own, figure out that they can get profiling data.

How well does it work? I had to try it.

Case 1. Code amalgamation script

For the simdutf software library, we use an amalgamation script: it collects all of the C++ files on disk, does some shallow parsing and glues them together according to some rules.

I first ask the AI to optimize the script without access to profiling data. What it did immediately was to add a file cache. The script repeatedly loads the same files from disk (the script is a bit complex). This saved about 20% of the running time.

Specifically, the AI replaced this naive code…

def read_file(file):
    with open(file, 'r') as f:
        for line in f:
            yield line.rstrip()

by this version with caching…

def read_file(file):
    if file in file_cache:
        for line in file_cache[file]:
            yield line
    else:
        lines = []
        with open(file, 'r') as f:
            for line in f:
                line = line.rstrip()
                lines.append(line)
                yield line
        file_cache[file] = lines

Could the AI do better with profiling data? I instructed it to run the Python profiler: python -m cProfile -s cumtime myprogram.py. It found two additional optimizations:

1. It precompiled the regular expressions (re.compile). It replaced

  if re.match('.*generic/.*.h', file):
    # ...

if generic_pattern.match(file):
    # ...

where elsewhere in the code, we have…

generic_pattern = re.compile(r'.*generic/.*\.h')

2. Instead of repeatedly calling re.sub to do a regular expression substitution, it filtered the strings by checking for the presence of a keyword in the string first.

if 'SIMDUTF_IMPLEMENTATION' in line: # This IF is the optimization
  print(uses_simdutf_implementation.sub(context.current_implementation+"\\1", line), file=fid)
else:
  print(line, file=fid) # Fast path

These two optimizations could probably have been arrived at by looking at the code directly, and I cannot be certain that they were driven by the profiling data. But I can tell that they do appear in the profile data.

Unfortunately, the low-hanging fruit, caching the file access, represented the bulk of the gain. The AI was not able to further optimize the code. So the profiling data did not help much.

Case 2: Check Link Script

When I design online courses, I often use a lot of links. These links break over time. So I have a simple Python script that goes through all the links, and verifies them.

I first ask my AI to optimize the code. It did the same regex trick, compiling the regular expression. It created a thread pool and made the script asynchronous.

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    url_results = {url: executor.submit(check_url, url) for url in urls_to_check}
    for url, future in url_results.items():
        url_cache[url] = future.result()

This parallelization more than doubled the speed of the script.

It cached the URL checks in an interesting way, using functools:

from functools import lru_cache

@lru_cache(maxsize=None)
def check(link):
    # ...

I did not know about this nice trick. This proved useless in my context because I rarely have several times the same link.

I then started again, and told it to use the profiler. It did much the same thing, except for the optimization of the regular expression.

As far as I can tell all optimizations were in vain, except for the multithreading. And it could do this part without the profiling data.

Conclusion so far

The Python scripts I tried were not heavily optimized, as their performance was not critical. They are relatively simple.

For the amalgamation, I got a 20% performance gain for ‘free’ thanks to the file caching. The link checker is going to be faster now that it is multithreaded. Both optimizations are valid and useful, and will make my life marginally better.

In neither case I was able to discern benefits due to the profiler data. I was initially hoping to get the AI busy optimizing the code in a loop, continuously running the profiler, but it did not happen in these simple cases. The AI optimized code segments that contributed little to the running time as per the profiler data.

To be fair, profiling data is often of limited use. The real problems are often architectural and not related to narrow bottlenecks. Even when there are identifiable bottlenecks, a simple profiling run can fail to make them clearly identifiable. Further, profilers become more useful as the code base grows, while my test cases are tiny.

Overall, I expect that the main reason for my relative failure is that I did not have the right use cases. I think that collecting profiling data and asking an AI to have a look might be a reasonable first step at this point.

A new way to call C from Java: how fast is it?

Irrespective of your programming language of choice, calling C functions is often a necessity. For the longest time, the only standard way to call C was the Java Native Interface (JNI). But it was so painful that few dared to do it. I have heard it said that it was deliberately painful so that people would be enticed to use pure Java as much as possible.

Since Java 22, there is a new approach called the Foreign Function & Memory API in java.lang.foreign. Let me go through step by step.

You need a Linker and a SymbolLookup instance from which you will build a MethodHandle that will capture the native function you want to call.

The linker is easy:

Linker linker = Linker.nativeLinker();

To load the SymbolLookup instance for your library (called mylibrary), you may do so as follows:

System.loadLibrary("mylibrary");
SymbolLookup lookup = SymbolLookup.loaderLookup();

The native library file should be on your java.library.path path, or somewhere on the default library paths. (You can pass it to your java executable as -Djava.library.path=something).

Alternatively, you can use SymbolLookup.libraryLookup or other means of loading
the library, but System.loadLibrary should work well enough.

You have the lookup, you can grab the address of a function like so:

lookup.find("myfunction")

This returns an Optional<MemorySegment>. You can grab the MemorySegment like so:

MemorySegment mem = lookup.find("myfunction").orElseThrow()

Once you have your MemorySegment, you can pass it to your linker to get a MethodHandle which is close to a callable function:

 MethodHandle myfunc = linker.downcallHandle(
     mem,
     functiondescr
 );

The functiondescr must describe the returned value and the function parameters that your function takes. If you pass a pointer and get back a long value, you might proceed as follows:

 MethodHandle myfunc = linker.downcallHandle(
     mem,
     FunctionDescriptor.of(
        ValueLayout.JAVA_LONG,
        ValueLayout.ADDRESS
    )
 );

That is, the first parameter is the returned value.

For function returning nothing, you use FunctionDescriptor.ofVoid.

The MethodHandle can be called almost like a normal Java function:
myfunc.invokeExact(parameters). It always returns an Object which means that if it should return a long, it will return a Long. So a cast might be necessary.

It is a bit painful, but thankfully, there is a tool called jextract that can automate this task. It generates Java bindings from native library headers.

You can allocate C data structures from Java that you can pass to your native code by using an Arena. Let us say that you want to create an instance like

MemoryLayout mystruct = MemoryLayout.structLayout(
        ValueLayout.JAVA_LONG.withName("age"),
        ValueLayout.JAVA_INT.withName("friends"));

You could do it in this manner:

MemorySegment myseg = arena.allocate(mystruct);

You can then pass myseg as a pointer to a data structure in C.

You often get an array with a try clause like so:

try (Arena arena = Arena.ofConfined()) {
       //
}

There are many types of arenas: confined, global, automatic, shared. The confined arenas are accessible from a single thread. A shared or global arena is accessible from several threads. The global and automatic arenas are managed by the Java garbage collector whereas the confined and shared arenas are managed explicitly, with a specific lifetime.

So, it is fairly complicated but manageable. Is it fast? To find out, I call from Java a C library I wrote with support for binary fuse filters. They are a fast alternative to Bloom filters.

You don’t need to know what any of this means, however. Keep in mind that I wrote a Java library called jfusebin which calls a C library. Then I also have a pure Java implementation and I can compare the speed.

I should first point out that even if calling the C function did not include any overhead, it might still be slower because the Java compiler is unlikely to inline a native function. However, if you have a pure Java function, and it is relatively small, it can get inlined and you get all sorts of nice optimizations like constant folding and so forth.

Thus I can overestimate the cost of the overhead. But that’s ok. I just want a ballpark measure.

In my benchmark, I check for the presence of a key in a set. I have one million keys in the filter. I can ask whether a key is not present in the filter.

I find that the library calling C can issue 44 million calls per second using the 8-bit binary fuse filter. I reach about 400 million calls per second using the pure Java implementation.

method	time per query in nanoseconds
Java-to-C	22.7 ns
Pure Java	2.5 ns

Thus I measure an overhead of about 20 ns per C function calls from Java using a macBook (M4 processor).

We can do slightly better by marking the functions that are expected to be short running as critical. You achieve this result by passing an option to the linker.downcallHandle call.

binary_fuse8_contain = linker.downcallHandle(
    lookup.find("xfuse_binary_fuse8_contain").orElseThrow(),
    binary_fuse8_contain_desc,
    Linker.Option.critical(false)
);

You save about 15% of the running time in my case.

method	time per query in nanoseconds
Java-to-C	22.7 ns
Java-to-C (critical)	19.5 ns
Pure Java	2.5 ns

Obviously, in my case, because the Java library is so fast, the 20 ns becomes too much. But it is otherwise a reasonable overhead.

I did not compare with the old approach (JNI), but other folks did and they find that the new foreign function approach can be measurably faster (e.g., 50% faster). In particular, it has been reported that calling a Java function from C is now relatively fast: I have not tested this functionality myself.

One of the cool feature of the new interface is that you can pass directly data from the Java heap to your C function with relative ease.

Suppose you have the following C function:

int sum_array(int* data, int count) {
    int sum = 0;
    for(int i = 0; i < count; i++) {
        sum += data[i];
    }
    return sum;
}

And you want the following Java array to be passed to C without a copy:

int[] javaArray = {10, 20, 30, 40, 50};

It is as simple as the following code.

System.loadLibrary("sum");
Linker linker = Linker.nativeLinker();
SymbolLookup lookup = SymbolLookup.loaderLookup();
MemorySegment sumAddress = lookup.find("sum_array").orElseThrow();

// C Signature: int sum_array(int* data, int count)
MethodHandle sumArray = linker.downcallHandle(
    sumAddress,
    FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_INT),
    Linker.Option.critical(true)
);

int[] javaArray = {10, 20, 30, 40, 50};

try (Arena arena = Arena.ofConfined()) {
    MemorySegment heapSegment = MemorySegment.ofArray(javaArray);
    int result = (int) sumArray.invoke(heapSegment, javaArray.length);
    System.out.println("The sum from C is: " + result);
}

I created a complete example in a few minutes. One trick is to make sure that java finds the native library. If it is not at a standard library path, you can specify the location with -Djava.library.path like so:

java -Djava.library.path=target -cp target/classes IntArrayExample

Further reading.When Does Java’s Foreign Function & Memory API Actually Make Sense? by A N M Bazlur Rahman.

How stagnant is CPU technology?

Sometimes, people tell me that there is no more progress in CPU performance.

Consider these three processors which had comparable prices at release time.

The AMD Ryzen 7 9800X3D (Zen 5, with up to 5.3 GHz boost) was released in 2024.
The AMD Ryzen 7 7800X3D (Zen 4, with up to 5.1 GHz boost) was released in 2023.
The AMD Ryzen 7 5800X3D (Zen 3, with 3.4 GHz base) was released in 2022.

Let us consider their results on on the PartialTweets open benchmark (JSON parsing). It is a single core benchmark.

2024 processor	12.7 GB/s
2023 processor	9 GB/s
2022 processor	5.2 GB/s

In two years, on this benchmark, AMD more than doubled the performance for the same cost.

So what is happening is that processor performance is indeed going up, sometimes dramatically so, but not all of our software can benefit from the improvements. It is up to us to track the trends and adopt our software accordingly.

What I Got Wrong About “Hard Work” in My 20s

When I was younger, in my 20s, I assumed that everyone was working “hard,” meaning a solid 35 hours of work a week. Especially, say, university professors and professional engineers. I’d feel terribly guilty when I would be messing around, playing video games on a workday.

Today I realize that most people become very adept at avoiding actual work. And the people you think are working really hard are often just very good at focusing on what is externally visible. They show up to the right meetings but unashamedly avoid the hard work.

It ends up being visible to the people “who know.” Why? Because working hard is how you acquire actual expertise. And lack of actual expertise ends up being visible… but only to those who have the relevant expertise.

And the effect compounds. The difference between someone who has honed their skills for 20 years and someone who has merely showed up to the right meetings becomes enormous. And so, we end up with huge competency gaps between people who are in their 30s, 40s, 50s. It becomes night and day.

A bit of glass and freedom is all you need

Galileo Galilei was the OpenAI of his time. He helped establish modern science by emphasizing experimentation as the primary means to uncover natural truths. To this end, he built his own telescopes. He revealed to the world the moons of Jupiter, thereby changing forever how we viewed the cosmos.

How was Galileo able to design better telescopes than others ? If you have ever been to Venice, you may know that it was famous for its glassmakers. There is a small island nearby (Murano) where there are still glassmakers. Further, Venice had some of the best merchants of Europe, so they could export their glass worldwide.

This is an important lesson as to what drives innovation. It is not a linear process. Living near people making fancy glasses could be the key you need.

A common misconception portrays Galileo as persecuted solely for advocating heliocentrism, the idea that Earth orbits the Sun. In reality, he spent most of his career challenging established doctrines and thrived under Church patronage. Galileo overturned the widespread belief that heavier objects fall faster, and this achievement, if nothing else, brought him greater fame. When he gathered strong evidence for heliocentrism, he initially faced only cautions rather than outright condemnation.

Pope Urban VIII had personally permitted Galileo to discuss heliocentrism as a hypothesis and even requested that his own arguments on the matter be included. However, Galileo placed these papal views in the mouth of Simplicio, a character portrayed as intellectually inadequate in defending the traditional geocentric position. This was widely interpreted as a mockery.

Galileo was sentenced to house arrest, during which he continued productive work. The ban on his Copernican writings applied mainly within Catholic territories, allowing their dissemination elsewhere in Europe.

Thus another important element that made Galileo possible was the relative freedom he enjoyed.

You want to innovate ? Don’t live in the world of ideas solely. Don’t be shy about mixing with commercial interest. And make sure to have a bit of freedom.

Technology is culture

We are experiencing one of the most significant technological breakthroughs of the last few decades. Call it what you will: AI, generative AI, large language models…

But where does it come from? Academics will tell you that it stems from decades of mathematical efforts on campus. But think about it: if this were the best model to explain what happened, where would the current breakthroughs have occurred? They would have happened on campus first, then propagated to industry. That’s the linear model of innovation—a rather indefensible one.

Technology is culture. Technological progress does not follow a path from the blackboard of a middle-aged MIT professor to your desk, via a corporation.

So what is the cultural background? Of course, there is hacker culture and the way hackers won a culture war in the 1980s by becoming cool enough to have a seat at the table.

But closer to us… I believe there are two main roots. The first is gaming. Gamers wanted photorealistic, high-performance games. They built powerful machines capable of solving linear algebra problems at very high speeds.

Powerful computing alone, however, does you no good if you want to build an AI. That’s where web culture came in. Everything was networked, published, republished. Web nerds helped build the greatest library the world had ever seen.
These two cultures came together to generate the current revolution.

If you like my model, I submit that it has a few interesting consequences. The most immediate one is that if you want to understand how and where technological progress happens, you have to look at cultural drivers—not at what professors at MIT are publishing.

The culture war that we won

Culture wars are real. They occur when a dominant culture faces a serious challenge. But unless you pay close attention, you might miss them entirely. As a kid, I was a “nerd.” I read a lot and spent hours on my computer. I devoured science and technology magazines. I taught myself programming. “Great!” you might think. Not at all. This was not valued where and when I grew up. Computers were seen as toys. A kid who spent a lot of time on a computer was viewed as obsessed with a rather dull plaything. We had a computer club, but it was essentially a gathering of “social rejects.” No one looked up to us. Working with computers carried no prestige. Dungeons & Dragons was outright “dangerous”—you had to hide any interest in such games. The 1983 movie WarGames stands out precisely because the computer-obsessed kid gets the girl and saves the world. Bill Gates was becoming famous around that time, but this marked only the beginning of a decade-long culture war in which hacker culture gradually rose to dominance.

Today, most people can speak the language of hackers. It did not have to turn out this way, and it did not unfold identically everywhere. The status of hacker culture is high in the United States, but it remains lower in many other places even now. Even so, in many organizations today, even in the United States, the « computer people » are stored in the basement. We do not let them out too often. They are not « people persons ». So the culture war was won by the hackers, the victory is undeniable. But as with all wars, the result is more nuanced that one might think. Many would like nothing more than to send back the computer people at the bottom of the prestige ladder.

Salaries are a good indicator for prestige. In the USA, in Australia and in Switzerland, « computer people » have high salaries and relatively high status. In the UK as a whole? Not so much. I bet you do better as a « financial analyst » over there.

What is worth watching is the effect that « AI » will have on the status battles. In some sense, building software that can do financial, political and legal analysis is the latest weapon in the arsenal of the computer people. Many despair about what AI might do to software developers: I recommend looking at it in the context of the hacker culture war.

By how much does your memory allocator overallocate?

How much virtual memory does the following C++ expression allocate on the heap?

new char[4096]

The answer is at least 4 kibibytes but surely more.

Firstly, each heap memory allocation requires some memory to keep track of what has been allocated. You are likely using 8 bytes or so of overhead that your program cannot access.

Secondly, the memory allocator may allocate a bit more than the 4096 bytes you requested. On a Linux machine, I found that it would allocate 4104 bytes, so 8 extra bytes that are usable by your program. You can check this value by calling malloc_usable_size under Linux.

Thus, overall, you may end up with an extra 16 bytes allocated when you requested 4096 bytes. It is an overhead of about 0.4%. You are basically wasting a byte for every 256 bytes that you allocate.

But that is not the worst possible case. On macOS, let us consider the following line of code.

new char[3585]

The system reports an allocation of 4096 bytes: a 14% overhead. What is happening is that macOS rounds up the memory allocation to the nearest 512 byte boundary for moderately small allocations. If you try allocating even larger memory blocks, it starts rounding up even more.

Freedom from incompetence

Many people say that they crave more freedom.
But what do we mean by “freedom”?
Being free from constraints? Is that what we mean? Would you feel “freer” if you could walk outside in your underwear?
It is almost surely not what you mean by “freedom.”
I submit to you that it is almost always the case that if you are frustrated at work by your lack of freedom, the actual problem is competence.
Imagine two scenarios.

Scenario A: You work for a highly directive boss. You are constantly accountable for what you do. But everyone around you is highly competent. You need to wear a jacket and a tie, but you work in the best team in the world.
Scenario B: You work in a context where you hardly know who your boss is. You come to work in your underwear. However, everyone is incompetent. You work with the least competent team in the world.

Assuming that the salary is the same, which job do you prefer?
I cannot answer for you, but most people I know prefer Scenario A.

Don’t be so eager to rewrite your code

I used to always want to rewrite my code. Maybe even use another programming language. « If only I could rewrite my code, it would be so much better now. »

If you maintain software projects, you see it all the time. Someone new comes along and they want to start rewriting everything. They always have subjective arguments: it is going to be more maintainable or safer or just more elegant.

If your code is battle tested… then the correct instinct is to be conservative and keep your current code. Sometimes you need to rewrite your code : you made a mistake or must change your architecture. But most times, the old code is fine and investing time in updating your current code is better than starting anew.

The great intellectual Robin Hanson argues that software ages. One of his arguments is that software engineers say that it does. That’s what engineers feel but whether it is true is another matter.

« Before Borland’s new spreadsheet for Windows shipped, Philippe Kahn, the colorful founder of Borland, was quoted a lot in the press bragging about how Quattro Pro would be much better than Microsoft Excel, because it was written from scratch. All new source code! As if source code rusted. The idea that new code is better than old is patently absurd. Old code has been used. It has been tested. Lots of bugs have been found, and they’ve been fixed. There’s nothing wrong with it. It doesn’t acquire bugs just by sitting around on your hard drive. Au contraire, baby! Is software supposed to be like an old Dodge Dart, that rusts just sitting in the garage? Is software like a teddy bear that’s kind of gross if it’s not made out of all new material? » (Joel Spolsky)

Parsing IP addresses quickly (portably, without SIMD magic)

Most programmers are familiar with IP addresses. They take the form
of four numbers between 0 and 255 separated by dots: 192.168.0.1.
In some sense, it is a convoluted way to represent a 32-bit integer.
The modern version of an IP address is IPv6 which is usually surrounded
by square brackets. It is less common in my experience.

Using fancy techniques, you can parse IP addresses with as little as 50 instructions. It is a bit complicated and not necessarily portable.

What if you want high speed without too much work or a specialized library? You can try to roll your own. But since I am civilized programmer, I just asked my favorite AI to write it for me.

// Parse an IPv4 address starting at 'p'.
// p : start pointer, pend: end of the string
std::expected<uint32_t, parse_error> parse_manual(const char *p, const char *pend) {
uint32_t ip = 0;
    int octets = 0;
    while (p < pend && octets < 4) {
        uint32_t val = 0;
        const char *start = p;
        while (p < pend && *p >= '0' && *p <= '9') {
            val = val * 10 + (*p - '0');
            if (val > 255) {
                return std::unexpected(invalid_format);
            }
            p++;
        }
        if (p == start || (p - start > 1 && *start == '0')) {
            return std::unexpected(invalid_format);
        }
        ip = (ip << 8) | val;
        octets++;

        if (octets < 4) {
            if (p == pend || *p != '.') {
                return std::unexpected(invalid_format);
            }
            p++; // Skip dot
        }
    }
    if (octets == 4 && p == pend) {
        return ip;
    } else {
        return std::unexpected(invalid_format);
    }
}

It was immediately clear to me that this function was not as fast as it could be. I then asked the AI to improve the result by using the fact that each number is made of between one and three digits. I got the following reasonable function.

std::expected<uint32_t, parse_error> parse_manual_unrolled(const char *p, const char *pend) {
    uint32_t ip = 0;
    int octets = 0;
    while (p < pend && octets < 4) {
        uint32_t val = 0;
        if (p < pend && *p >= '0' && *p <= '9') {
            val = (*p++ - '0');
            if (p < pend && *p >= '0' && *p <= '9') {
                if (val == 0) { 
                  return std::unexpected(invalid_format);
                }
                val = val * 10 + (*p++ - '0');
                if (p < pend && *p >= '0' && *p <= '9') {
                    val = val * 10 + (*p++ - '0');
                    if (val > 255) { 
                      return std::unexpected(invalid_format);
                    }
                }
            }
        } else {
            return std::unexpected(parse_error::invalid_format);
        }
        ip = (ip << 8) | val;
        octets++;
        if (octets < 4) {
            if (p == pend || *p != '.') {
              return std::unexpected(invalid_format);
            }
            p++; // Skip the dot
        }
    }
    if (octets == 4 && p == pend) {
        return ip;
    } else {
        return std::unexpected(invalid_format);
    }
}

Nice work AI!

In C++, we have standard functions to parse numbers (std::from_chars) which can significantly simplify the code.

std::expected<uint32_t, parse_error> parse_ip(const char *p, const char *pend) {
  const char *current = p;
  uint32_t ip = 0;
  for (int i = 0; i < 4; ++i) {
    uint8_t value;
    auto r = std::from_chars(current, pend, value);
    if (r.ec != std::errc()) {
      return std::unexpected(invalid_format);
    }
    current = r.ptr;
    ip = (ip << 8) | value;
    if (i < 3) {
      if (current == pend || *current++ != '.') {
        return std::unexpected(invalid_format);
      }
    }
  }
  return ip;
}

You can also use the fast_float library as a substitute for std::from_chars. The latest version of fast_float has faster 8-bit integer parsing thanks to Shikhar Soni (with a fix by Pavel Novikov).

I wrote a benchmark for this problem. Let us first consider the results using an Apple M4 processors (4.5 GHz) with LLVM 17.

function	instructions/ip	ns/ip
manual	185	6.2
manual (unrolled)	114	3.3
from_chars	381	14
fast_float	181	7.2

Let us try with GCC 12 and an Intel Ice Lake processor (3.2 GHz) using GCC 12.

function	instructions/ip	ns/ip
manual	219	30
manual (unrolled)	154	24
from_chars	220	29
fast_float	211	18

And finally, let us try with a Chinese Longsoon 3A6000 processor (2.5 GHz) using LLVM 21.

function	instructions/ip	ns/ip
manual	187	29
manual (unrolled)	109	21
from_chars	191	39
fast_float	193	27

The optimization work on the fast_float library paid off. The difference is especially striking on the x64 processor.

What is also interesting in my little experiment is that I was able to get the AI to produce faster code with relatively little effort on my part. I did have to ‘guide’ the AI. Does that mean that I can retire? Not yet. But I am happy that I can more quickly get good reference baselines, which allows me to better focus my work where it matters.

Reference: The fast_float C++ library is a fast number parsing library part of GCC and major web browsers.