Reinventing The Wheel

Extracting regions from ESA WorldCover

2026-01-25T00:00:00+01:00

ESA WorldCover is a categorical dataset with 11 classes of land cover (such as water, trees, built-up etc.) at a 10-meter resolution covering most of the world. In this short post I show how to extract a specific region given its geographic coordinates.

Acquiring the data

The dataset is delivered in tiles covering 3x3 degrees each, bundled into macrotiles of 60x60 degrees. The macrotiles can be freely downloaded on the project’s website.

Assuming you have downloaded and extracted the correct macrotile, you can use the rasterio package to load the tile corresponding to the given geographic coordinates:

import rasterio

def load(lat_lon):
    # dataset is provided in the form of tiles 3x3 degrees in size
    # compute tile coordinates
    tile_coords = (int((lat_lon[0] // 3) * 3), int((lat_lon[1] // 3) * 3))
    tile_extent = (tile_coords, (tile_coords[0] + 3, tile_coords[1] + 3))

    # build file name
    lat_str = (f"N{abs(tile_coords[0]):02d}" if tile_coords[0] >= 0 else
               f"S{abs(tile_coords[0]):02d}")
    # west longitudes may arrive as negative values or as 180..360; handle both
    lon_str = (f"E{tile_coords[1]:03d}" if 0 <= tile_coords[1] < 180 else
               f"W{(360 - tile_coords[1]) % 360:03d}")

    filename = f"ESA_WorldCover_10m_2021_V200_{lat_str}{lon_str}_Map.tif"

    with rasterio.open(filename) as dataset:
        data = dataset.read(1)  # read first band

    return data, tile_extent

Extracting the region of interest

Now, suppose we have a region of interest specified as a pair of corners, (min-latitude, min-longitude) and (max-latitude, max-longitude). We now need to “cut” this area out of the full tile.

The key to this is a function called discretize_extent. It takes 3 parameters:

extent of the tile we have loaded
shape of the tile (X & Y resolution)
the requested region

…and returns the raster (integer) coordinates corresponding to the region within the tile. Note that by snapping to the integer coordinates, the geographic coordinates will slightly change as well and must be recomputed. Given the high resolution of the dataset, this inaccuracy will usually be negligible, but we like to do things properly, so I also recompute the updated extent.

We follow the convention given by ISO 6709:

Latitude comes before longitude
North latitude is positive
East longitude is positive
Coordinates are represented as decimal degrees

On the other hand, rows in the dataset are ordered north-to-south, which means that at some point, we need to flip the Y coordinate.

import math

def discretize_extent(tile_extent: tuple[tuple[float, float], tuple[float, float]],
                      tile_shape: tuple[int, int],
                      requested_extent: tuple[tuple[float, float], tuple[float, float]]):
    # axis 0 is latitude north->south (!)
    # axis 1 is longitude west->east

    (tile_min_lat, tile_min_lon), (tile_max_lat, tile_max_lon) = tile_extent
    (req_min_lat, req_min_lon), (req_max_lat, req_max_lon) = requested_extent
    relative_extent = ((req_min_lat - tile_min_lat,
                        req_min_lon - tile_min_lon),
                       (req_max_lat - tile_min_lat,
                        req_max_lon - tile_min_lon))

    print(f"{relative_extent=}")

    # project user extents into raster space
    px_per_deg_lat = tile_shape[0] / (tile_max_lat - tile_min_lat)
    px_per_deg_lon = tile_shape[1] / (tile_max_lon - tile_min_lon)

    print(f"{px_per_deg_lat=} {px_per_deg_lon=}")

    # snap "outwards" to integer coordinates
    min_y = int(math.floor(relative_extent[0][0] * px_per_deg_lat))
    min_x = int(math.floor(relative_extent[0][1] * px_per_deg_lon))
    max_y = int(math.ceil(relative_extent[1][0] * px_per_deg_lat))
    max_x = int(math.ceil(relative_extent[1][1] * px_per_deg_lon))

    # sanity check :)
    print(f"{min_y=} {min_x=} {max_y=} {max_x=}")
    assert min_y >= 0 and min_y <= tile_shape[0]
    assert min_x >= 0 and min_x <= tile_shape[1]
    assert max_y >= 0 and max_y <= tile_shape[0]
    assert max_x >= 0 and max_x <= tile_shape[1]

    # reconstruct geo coordinates
    discretized_extent = ((tile_min_lat + min_y / px_per_deg_lat,
                           tile_min_lon + min_x / px_per_deg_lon),
                          (tile_min_lat + max_y / px_per_deg_lat,
                           tile_min_lon + max_x / px_per_deg_lon))
    print(f"{discretized_extent=}")

    # flip Y
    min_y_corr = tile_shape[0] - max_y
    max_y_corr = tile_shape[0] - min_y
    print(f"{min_y_corr=} {max_y_corr=}")
    assert min_y_corr >= 0 and min_y_corr <= tile_shape[0]
    assert max_y_corr >= 0 and max_y_corr <= tile_shape[0]

    return ((min_y_corr, max_y_corr), (min_x, max_x)), discretized_extent

In practice you might also want to down-sample the extract for performance or other reasons. Since the data is categorical, there is no point in trying to interpolate the sample values; nearest-neighbor sampling must be used.
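A minimal sketch of such an extraction with nearest-neighbor down-sampling might look like this. The function name extract_region and its step parameter are hypothetical (they are not part of the code above); the pixel ranges use the ((min_y, max_y), (min_x, max_x)) layout returned by discretize_extent. Keeping every step-th sample only ever picks existing values, so class codes are never blended.

```python
def extract_region(data, pixel_ranges, step=1):
    """Cut a window out of a 2-D raster, keeping every `step`-th sample.

    `data` is any row-major 2-D sequence (nested lists here);
    `pixel_ranges` is ((min_y, max_y), (min_x, max_x)), end-exclusive.
    Hypothetical helper, for illustration only.
    """
    (min_y, max_y), (min_x, max_x) = pixel_ranges
    # Strided slicing selects existing samples only, so no meaningless
    # "in-between" class codes are created by averaging.
    return [list(row[min_x:max_x:step]) for row in data[min_y:max_y:step]]


# toy 4x6 "tile" of land-cover class codes
tile = [
    [10, 10, 20, 20, 30, 30],
    [10, 10, 20, 20, 30, 30],
    [40, 40, 50, 50, 60, 60],
    [40, 40, 50, 50, 60, 60],
]
window = extract_region(tile, ((0, 4), (2, 6)), step=2)
print(window)
```

With the NumPy array returned by rasterio, the same selection can be written directly as data[min_y:max_y:step, min_x:max_x:step].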

STAK: My weird programming language for DOS games

2025-10-29T00:00:00+01:00

The inspiration for this project came from two main sources.

The first is Zork, a text adventure released back in 1977. The original version ran on the PDP-10 mainframe and was coded in MDL, a Lisp derivative. When the time came for a microcomputer version, however, MDL was just too big to fit. The game was reimplemented in a custom language called ZIL – Zork Implementation Language. Incredibly, the source code has been preserved and is now available publicly. It looks like this:

You can clearly see the Lispy origins, but this language compiles to a bytecode for a quite rudimentary stack-based virtual machine (keep in mind it had to run on 8-bit machines like TRS-80 or the Apple II).

The second impulse came from reading about the Amiga game Another World. It is likewise built on top of a custom virtual machine and, as far as I understood, its graphics are based completely on vector drawing commands. This game was programmed directly in the bytecode “assembly language” – apparently without even variable names!

start   jsr       init
        setvec    60        flip10
        seti      v255      4
        seti      v246      1
        seti      v227      0
        break
        play      106       20    0    2
        play      106       20    0    3
        jsr       initiz
        jsr       setmaz
        setvec    28        nag1
        setvec    21        nag2
        seti      v99       2
        si        v4        =     48   suit

(this snippet has been edited for brevity, but it should give a pretty accurate impression)

Especially the idea of using strictly vector graphics stuck with me. In theory, it means a single game build could work on various devices with different screen resolutions – this was usually not the case with classic bitmapped graphics.

In an interview with Eric Chahi, the game’s author, he mentions that during development, he would make changes and instantly see them in the running game. This would be tricky to do when you use a language like C, which compiles to optimized native code, and an off-the-shelf compiler: even though GCC and Clang are fully open-source, it would take considerable effort to make all the necessary changes.

One solution is to leave existing languages and compilers behind and re-invent the world from scratch. As the target, I chose a 286-era DOS PC with VGA graphics (the 320x200x8bpp Mode 13h in particular). I consider the 286 to be the last truly 16-bit CPU in the x86 series; the subsequent 386, while fully backwards compatible, is natively 32-bit.

Creating a language

Having implemented language parsing many times before, I knew I wanted to avoid inventing a new grammar this time around. Fortunately, there is an existing grammar that has been successfully used for programming languages for decades – you guessed it, I am talking about S-expressions, aka LISP syntax.

My language is quite minimalist, with expressive power on par with early BASICs. It takes many shortcuts, often at the cost of run-time efficiency. It knows only one data type: signed 16-bit integer. There are two variable scopes: global and function. There are no structures, no arrays and no text strings. On the other hand, there are a few convenience features that made the cut because they were easy to implement; one example is the possibility for functions to return multiple values, somewhat offsetting the absence of a tuple/vector type.

Since the language does not include text output (after all, we want our games to be visual!), a Hello World example consists of filling the screen with a solid color and pausing for a few seconds before exiting:

(define (main)
  (fill-rect COLOR:WHITE 0 0 W H)
  (pause-frames 100))

The second example demonstrates user-defined functions, variables and looping:

(define (clear-screen color)
  (fill-rect color 0 0 W H))

(define (main)
  (define foo 10)
  (define bar (+ foo 20))
  (set! foo 40)

  (dotimes (color COLOR:COUNT)
    (clear-screen color)
    (pause-frames 10)))

Those UPPERCASE symbols are built-in constants, by the way. I don’t want to dwell on the language too much, as it is not particularly innovative and I imagine the (parenthesis-heavy syntax will ((be off-putting) to most)). Several example programs can be found in the repository.

Designing the virtual machine

At startup, a compiled program is loaded from a file. The interpreter steps through the bytecode. Input and output are performed through a library of built-in functions.

The VM is a textbook stack machine. It comes with a library of built-in functions, ranging from basic arithmetic and logic (+, or), utility functions (random) to input/output primitives (fill-rect, key-held?). Each of these gets a dedicated opcode, which improves bytecode density. In its disassembled form, a snippet of bytecode might look like this:

02         getlocal 2
40 01      pushconst 320
01         getlocal 1
          *
09 00      pushconst 9
          /
C8 00      pushconst 200
05         getlocal 5
          -
40 01      pushconst 320
09 00      pushconst 9
          /
05         getlocal 5
B1         fill-rect
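To make the evaluation order concrete, here is a rough Python sketch of how a stack machine would step through the listing above. It is not the real VM: the tuple-based program encoding, the argument order of fill-rect (color, x, y, w, h), the C-style truncating division and the sample local values are all assumptions made for illustration.

```python
def wrap16(value):
    """Wrap an integer to the signed 16-bit range, the language's only type."""
    return ((value + 0x8000) & 0xFFFF) - 0x8000


def trunc_div(a, b):
    """Integer division truncating toward zero (assumed, as in C)."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


BINARY_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": trunc_div,
}


def run(program, local_vars):
    """Interpret a list of (mnemonic, operand...) tuples.

    Returns the final stack and the list of recorded built-in calls.
    """
    stack, calls = [], []
    for op, *operands in program:
        if op == "pushconst":
            stack.append(wrap16(operands[0]))
        elif op == "getlocal":
            stack.append(local_vars[operands[0]])
        elif op in BINARY_OPS:
            right = stack.pop()  # operands come off in reverse order
            left = stack.pop()
            stack.append(wrap16(BINARY_OPS[op](left, right)))
        elif op == "fill-rect":
            # assumed argument order: color, x, y, w, h (last pushed = h)
            h, w, y, x, color = (stack.pop() for _ in range(5))
            calls.append(("fill-rect", color, x, y, w, h))
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return stack, calls


# the disassembled snippet, re-encoded as tuples
program = [
    ("getlocal", 2), ("pushconst", 320), ("getlocal", 1), ("*",),
    ("pushconst", 9), ("/",), ("pushconst", 200), ("getlocal", 5), ("-",),
    ("pushconst", 320), ("pushconst", 9), ("/",), ("getlocal", 5),
    ("fill-rect",),
]
# made-up local values: slot 1 = column index, slot 2 = color, slot 5 = height
stack, calls = run(program, {1: 3, 2: 15, 5: 40})
print(calls)
```

Every operator pops its inputs and pushes one result, so by the time fill-rect executes, exactly its five arguments are left on the stack.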

The VM is implemented in C, with some assembly parts where it really matters (graphics routines). By the way, in old compilers – I am using Open Watcom – inline assembly is actually fun to use! It’s nothing like modern GCC, where it feels like you are filling out a tax form whenever you just want to insert a few instructions. For example, this is how you just casually call a BIOS interrupt:

void video_init(void) {
    _asm {
        mov ax, 13h
        int 10h
    }
}

The bytecode is produced by a compiler written in Hy, a Lisp dialect embedded in Python. This means that the toolchain must run on a reasonably modern machine – certainly not a 286. That’s okay, and not much different from how a dedicated game console would be programmed: you have a beefy host machine with your favorite editor/IDE, source control and the build environment, and only binary code/data is transferred to the target for execution.

The juicy parts

Live reload

Something that gets annoying when cross-compiling for another machine is the constant swapping of memory media, whether it is a floppy, a flash cart, let alone tape, every time you make a change to your program (unless you’re lucky enough to have a hardware debugger). Right from the beginning, I dreamed of some kind of debug console, which would continuously watch the source file and apply the changes at every edit. The utopia would be to make live changes on the level of individual statements without disturbing the program state; for now, we will settle for automatic recompile and restart of the whole program.

Step one is to somehow connect the host and the target. A straightforward way is to use a USB-to-RS232 adapter cable. Then, on the target, the STAK VM can be started in “listen mode”. On the host you run an interactive interpreter. This interpreter lets you enter individual statements (REPL – read, eval, print, loop) or load an entire file, optionally continuing to watch for changes. Whenever new code is entered, or the file changes, the compiler and linker are invoked in the background, aware of the previous program state, and they produce a set of binary “patches” which are sent to the listening VM for execution.

I will not go into details of the debug protocol and state machine (it’s straightforward and you can see the implementation in this file), but I would like to give a shout-out to HDLC framing. In the protocol it was necessary to somehow delimit the various frames that are sent through the serial port, in a way that allows recovery in case of desynchronization, such as if the REPL process gets terminated and restarted. HDLC reserves a few byte values for control signals and provides an elegant way of escaping those if they appear in the actual data. It is not the most efficient scheme in terms of overhead, but is easy to implement and does not require any look-ahead.
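The escaping idea can be sketched in a few lines of Python. This is a minimal sketch assuming the byte values used by HDLC-style asynchronous framing (flag 0x7E, escape 0x7D, escaped bytes XORed with 0x20); the actual STAK debug protocol may reserve different values, and the function names are made up for illustration.

```python
FLAG = 0x7E        # frame delimiter (assumed value)
ESCAPE = 0x7D      # escape marker (assumed value)
ESCAPE_XOR = 0x20  # escaped bytes are sent XORed with this


def encode_frame(payload):
    """Wrap a payload in flags, escaping any reserved byte inside it."""
    out = bytearray([FLAG])
    for byte in payload:
        if byte in (FLAG, ESCAPE):
            out += bytes([ESCAPE, byte ^ ESCAPE_XOR])
        else:
            out.append(byte)
    out.append(FLAG)
    return bytes(out)


def decode_stream(stream):
    """Split a byte stream into payloads, one byte at a time (no look-ahead)."""
    frames = []
    current = bytearray()
    synced = False   # becomes True once the first flag has been seen
    escaped = False
    for byte in stream:
        if byte == FLAG:
            if synced and current:
                frames.append(bytes(current))
            current.clear()
            synced, escaped = True, False
        elif not synced:
            continue  # garbage before the first flag is dropped
        elif escaped:
            current.append(byte ^ ESCAPE_XOR)
            escaped = False
        elif byte == ESCAPE:
            escaped = True
        else:
            current.append(byte)
    return frames


wire = b"\x01\x02" + encode_frame(b"\x7e\x7d\x41") + encode_frame(b"hi")
print(decode_stream(wire))
```

Because a flag byte can never appear inside an encoded payload, a receiver that joins mid-stream only has to wait for the next flag to get back in sync.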

Faking floats

There are no floating-point values in the language, in part because I didn’t want to give up the dramatic simplicity of a single data type and in part due to the limitations of the target platform (having an FPU was not very common in the 286 era). This, however, is at odds with the stated goal of supporting game prototyping/development! Floats are very convenient for any kind of math that models our world – something that most games do to some extent.

The old-school work-around is to use fixed-point values; for example, you could say that the upper 8 bits of a 16-bit value shall represent the integer part and the lower 8 bits are dedicated to the fractional part (in increments of 2^-8, or approximately 0.004). It’s far from perfect: first, you need to allocate your bits very carefully; the trade-off is between the range of the integer part and the precision of the fractional part. The second issue is that fixed-point numbers are clunky. Many operations – such as anything that involves multiplication – require inserting shift operations to adjust the decimal point of the result. Real floats effectively provide a layer of abstraction, letting you chain many operations without giving much thought to the representation (even if that comes with many caveats).

Comparison of two signed 16-bit fixed-point formats. As with signed integers, the highest bit determines the sign.

Still, I was not willing to concede on linguistic complexity by adding a separate data type, but I was willing to expand the built-in function library. After some experimentation, I chose a signed 10.6 format. It has just enough range to represent screen coordinates (which range from 0 to 319 on the X-axis), and just enough resolution to use sin/cos to compute them without visible loss of precision. I added functions like mul@, sin@ and cos@. The @ denotes that they operate on the fixed-point format; it is all a matter of convention, since the compiler, ignorant of data types, cannot verify that the program is using the appropriate function.
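As a rough illustration of the arithmetic behind a function like mul@, here is a Python sketch of the signed 10.6 format. The helper names are hypothetical, and the real built-ins are implemented inside the VM rather than like this.

```python
FRAC_BITS = 6          # signed 10.6: 10 integer bits (incl. sign), 6 fractional
ONE = 1 << FRAC_BITS   # 1.0 is stored as the integer 64


def to_fixed(value):
    """Convert a real number to its 10.6 representation."""
    return round(value * ONE)


def from_fixed(raw):
    """Convert a 10.6 value back to a float, for inspection only."""
    return raw / ONE


def mul_fixed(a, b):
    """Multiply two 10.6 values.

    The raw product carries 12 fractional bits, so it is shifted back
    down by 6. Note that the intermediate product does not fit in 16 bits.
    """
    return (a * b) >> FRAC_BITS


# x offset of a point on a circle of radius 100 at 30 degrees (cos ~ 0.866)
radius = to_fixed(100)      # 6400
cos_30 = to_fixed(0.866)    # 55, i.e. 0.859375 after quantization
x = mul_fixed(radius, cos_30)
print(x, from_fixed(x))
```

The result comes out as 85.9375 instead of 86.6, which is the kind of sub-pixel error that six fractional bits imply.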

Transformations

To provide a little more syntax sugar without complicating the compiler any further, a layer of transformations is applied before compiling each form. If you are familiar with Lisp, think of macros – except you don’t get to define your own. The dotimes loop shown above, for example, is first transformed to a while loop before being compiled down to bytecode:

(do
  (define color 0)
  (while (< color COLOR:COUNT)
    (clear-screen color)
    (pause-frames 10)
    (set! color (+ color 1))))

It is a rather simplistic substitution which comes with plenty of well-known issues. It does provide a tremendous bang for the buck, though.
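The substitution itself can be sketched as follows, with forms represented as nested Python lists. The real compiler is written in Hy and is not shown here, so the function name and data representation are assumptions.

```python
def expand_dotimes(form):
    """Recursively rewrite (dotimes (var count) body...) into a while loop."""
    if not isinstance(form, list):
        return form  # atoms (symbols, numbers) are left alone
    form = [expand_dotimes(sub) for sub in form]
    if form and form[0] == "dotimes":
        (var, count), *body = form[1:]
        return ["do",
                ["define", var, 0],
                ["while", ["<", var, count],
                 *body,
                 ["set!", var, ["+", var, 1]]]]
    return form


loop = ["dotimes", ["color", "COLOR:COUNT"],
        ["clear-screen", "color"],
        ["pause-frames", 10]]
expanded = expand_dotimes(loop)
print(expanded)
```

One of the well-known issues is visible right away: the count expression is pasted into the loop condition, so it gets re-evaluated on every iteration.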

Now to build something with it…

As the flagship demo program I decided to build a clone of the legendary GORILLA.BAS, an example program from Microsoft QBasic.


Gorillas running on a Pocket 8086 (NEC V30 CPU @ 10 MHz). The flickering of the crosshairs and banana is due to the absence of double buffering; the machine is simply too slow for that.

Despite already reducing the scope with respect to the original game, the language’s limitations made the process frustrating to the point that I almost gave up. For example, I needed to randomly generate building heights at the beginning of the round and store them for later. A reasonable solution would be to add arrays to the language, at least in some rudimentary form. Not for me. You want an array? Implement it in userspace! And so I did.

(define NUM-BLDGS 5)
(define building-h-0 0)
(define building-h-1 0)
(define building-h-2 0)
(define building-h-3 0)
(define building-h-4 0)

(define (set-building-h! n value)
  (cond (= n 0) (set! building-h-0 value)
        (= n 1) (set! building-h-1 value)
        (= n 2) (set! building-h-2 value)
        (= n 3) (set! building-h-3 value)
              1 (set! building-h-4 value)))

...

(dotimes (i NUM-BLDGS)
  (define MIN-H 20)
  (define MAX-H 120)
  (set-building-h! i (+ MIN-H (% (random) (- MAX-H MIN-H)))))

It is clear that this would not scale beyond a trivial number of tiny arrays, and ultimately beyond a rather simplistic game. I do think, though, that there is value in this kind of exercise where you impose some axiomatic constraints at the beginning and then carry them through until reaching the point of absurdity. Maybe it’s not productive, but it’s certainly an interesting challenge that forces one to think outside the box and thoroughly question all previous assumptions.

Conclusion

I think the project has fulfilled its purpose now. I am, of course, publishing everything as open-source. One day I would like to revisit the idea of finer-grained live program modification. I’m sure it has been done in some form before, but it seems worth reinventing.

If you want to try it out yourself, you can use DOSBox or a real DOS machine. It is also possible to build the VM natively (with SDL). Instructions can be found in the README. It is not complicated, but there are some prerequisites that need to be installed first. Polishing the user experience was just not high on the list of priorities. Sorry about that.

How I failed to make a game

2024-10-16T00:00:00+02:00

Today I am releasing my raycasting tech demo for the GameBoy Advance. Although it falls short of the goals I set out in the beginning – which included releasing a playable game – I think there are some lessons worth sharing with the world.

This project started essentially as a challenge: inspired by the impressive work of 3DSage, I wanted to see if I could build a raycasting “2.5D” engine that ran well enough to play a game.

I explained in a previous post why I find the GBA a really nice platform to develop for. I have always enjoyed game programming, but pretty much from the beginning of my C++ journey I have been stuck in the Game Engine Trap. Due to that realization, and based on experience from my other gamedev projects, it was really important to avoid getting carried away by tasks that are fun, but ultimately unimportant. To that end I established some firm rules:

Build a game, not a game engine
Don’t get stuck on assets. Make a quick placeholder when necessary
Don’t get distracted by languages. Use straightforward C++ and, if necessary, Python to generate code. Want fancy LISP-compiled-to-bytecode AI scripting? Put it in another project.
Stick to one design direction. Don’t start branching out in the middle of development.

And it worked – until I realized that tech development is the really interesting part for me, and while it is fun to think about the final result, with all the captivating environments and assets and mechanics, actually getting there eventually turns into a chore. Therefore, after not touching the project in several months, I declare time for a post-mortem!

What worked out well

My tech stack was fixed more-or-less from the start. The main language for run-time code was C++, technically C++20, but I use barely any new language features. I was determined to use CMake, partly because it is the native project format of my favorite C++ IDE (CLion), partly because I just like it. As the toolchain, I used devkitARM at first, but moved to the official ARM GNU Toolchain with Tonclib to access the GBA hardware. Tonclib is really great. I don’t think it gets enough praise, and I think that more libraries should follow the same kind of spartan philosophy.

For really hot code paths, namely sprite scaling code, I also generate what are basically unrolled loops. They’re still compiled as C++; GCC does a good job of optimizing them. I didn’t want to write any assembly code except as a last resort at the end of the project, since it would be time-consuming to maintain.
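The actual generator is not shown here, but the general idea of emitting unrolled code from Python can be sketched like this. Everything in the snippet is hypothetical: the function name, the generated signature and the nearest-texel mapping are assumptions chosen to illustrate the technique, not the project's real scaling code.

```python
def emit_column_scaler(src_h: int, dst_h: int) -> str:
    """Generate C++ source that stretches a src_h-texel column to dst_h pixels.

    Hypothetical example: each destination pixel becomes one straight-line
    assignment, with the source index computed at build time.
    """
    lines = [
        f"static void scale_col_{src_h}_to_{dst_h}"
        "(const unsigned char* src, unsigned char* dst, int stride) {"
    ]
    for y in range(dst_h):
        src_y = y * src_h // dst_h  # nearest source texel for this output row
        lines.append(f"    dst[{y} * stride] = src[{src_y}];")
    lines.append("}")
    return "\n".join(lines) + "\n"


code = emit_column_scaler(4, 2)
print(code)
```

The generated function contains no loop counter and no per-pixel division, which is the whole point: that work is done once, by the generator.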

Most games cannot exist without a considerable amount of assets. Having previously built somewhat sophisticated asset pipelines, this time I wanted something dead simple, knowing my game would be very small. I had recently finally grasped how input/output dependencies work for custom steps in CMake projects, so it seemed kind of obvious to use these for asset compilation, including the management of any dependencies between assets (such as between sprites and palettes). On GBA, the entirety of the cartridge ROM shows up as addressable memory, so instead of embedding custom binary formats and filesystem images, I just decided to generate a C++ header with the compiled data for each asset. Here’s an example snippet from the animation data for a spider enemy:

    ...
    // col 30
    {25, 7, ani_spider_dead_1_data_30},
    // col 31
    {25, 6, ani_spider_dead_1_data_31},
};

static const AnimFrame ani_spider_dead_frames[] = {
    { .spans = ani_spider_dead_0_spans },
    { .spans = ani_spider_dead_1_spans },
};

static const AnimImage ani_spider_anims[] = {
    { .frames = ani_spider_idle_frames, .num_frames = 2 },
    { .frames = ani_spider_walk_frames, .num_frames = 2 },
    { .frames = ani_spider_attk_frames, .num_frames = 2 },
    { .frames = ani_spider_dead_frames, .num_frames = 2 },
};

static const SpriteImage ani_spider = { .anims = ani_spider_anims, };

I cannot overstate how satisfied I am with this solution. Again, this might only work for games up to a certain size. Rumor has it, though, that there is leaked Pokémon R/S/E code out there which does the exact same thing.
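The header-generation step described above can be sketched roughly like this. It is a simplified assumption of what such an asset compiler does: the function name, the array naming and the line layout are made up, and the real compilers emit richer structures (spans, frames, animations) than a flat byte array.

```python
def emit_cpp_array(name: str, data: bytes, per_line: int = 12) -> str:
    """Render binary asset data as a C++ array definition (hypothetical helper)."""
    lines = [f"static const unsigned char {name}[{len(data)}] = {{"]
    for start in range(0, len(data), per_line):
        chunk = ", ".join(f"0x{b:02x}" for b in data[start:start + per_line])
        lines.append(f"    {chunk},")
    lines.append("};")
    return "\n".join(lines) + "\n"


header = emit_cpp_array("ani_demo_data_0", bytes([1, 2, 255]))
print(header)
```

Since the output is an ordinary header, the build system can track it like any other generated source file, and the data ends up in ROM without any run-time loading code.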

The custom asset compilers are all written in Python with only the most essential external dependencies (NumPy and Pillow). Python dependency management in this context is still an unsolved problem for me; in principle, the CMake script could set up a venv and install arbitrary dependencies at configuration time, but such a setup sounds rather fragile. So I just require those two packages to be already installed in whatever interpreter is used to configure the project.

To create and edit maps I opted for the venerable Tiled editor and it didn’t disappoint. About the only point of friction was a difference in coordinate systems between the editor and the game, so the conversion script has to include some transformation math, which took a few iterations to get right:

# correct for rotation
# Tiled anchors sprites in bottom left of the cell, but in-game sprite origin
# (incl. for rotation) is at the center
# therefore un-rotate the vector from corner to center
half_w = obj["width"] / 2
half_h = obj["height"] / 2
angle = math.radians(-obj["rotation"])
dx = half_w * math.cos(angle) - half_h * math.sin(angle)
dy = half_w * math.sin(angle) + half_h * math.cos(angle)
obj_x = obj["x"] + dx
obj_y = obj["y"] - dy

x, y = tiled_to_world_coords(obj_x, obj_y)

Benchmarking was an important part of the development process. I wrote about it at length in a previous article. Unless you are willing to spend a lot of time experimenting and guessing, I would now say that having strong benchmarking capability built-in is essential in such a project sensitive to “detail optimization” (I don’t know if there is an established term for this – what I mean is that changing e.g. the alignment of some function or structure can make a 1% performance difference, which quickly adds up. A related idea is “performance lottery”, which should really be called performance chaos theory, whereby a minor change to the codebase can have performance impacts elsewhere by changing the layout of the compiled code).

What didn’t work

Content. Brace for an obvious statement: it is one thing to imagine a grand, Daggerfall-scale game in your head, and a completely different thing to actually start building the pieces that make it up. I am not a graphic artist and I get quickly frustrated playing one. (If you haven’t figured by now, I am writing this article as a reminder/deterrent to future me.) I could commission the graphics and sounds, but that gets expensive for a game with no prospect of commercialization. Making maps was more interesting, but I often felt like I was missing the right textures for the environments I wanted to build.

Controls were an issue from the beginning. I should have done more research into how other FPS games solved this. What I settled on was “tank controls” on the D-pad, A/B to strafe and R to shoot.

The screen on the GBA is notoriously bad. I don’t know how we managed to stare at it for hours as kids. The gamma curve is crazy, the colors are kinda washed out, and the contrast is a joke. The right answer here is probably to simply use a GBA SP, but I don’t have any emotional connection to it.

In any case, it always feels great to see your code run on real hardware, but after already being somewhat burned out with assets, it really drove home the latter two points above, providing a final straw to kill the project.

Just show me the code

The release is as complete as was possible; unfortunately, I do not have redistribution rights for the third-party assets. I have blacked them out in the published files; hopefully, keeping just the silhouettes can be considered fair use.

And here it is: mcejp/GBA-raycaster

sys/iosupport.h: No such file or directory

2024-06-02T00:00:00+02:00

I ran into this issue in the process of moving away from devkitARM to a vanilla ARM GNU Toolchain.

It boils down to a library called libsysbase, which was proposed for merging into Newlib all the way back in 2006, but never made it in, so it remains a devkitARM addition.

In Newlib, the way to implement one’s own STDOUT handler for functions like printf is to implement a set of low-level, “syscall-like” functions; the most important of those is _write. On the other hand, libsysbase already implements these, providing a layer of abstraction on top, having the concept of devices for dealing with the file system.

If you only care about STDOUT, it’s quite easy to migrate from libsysbase to vanilla Newlib. Instead of declaring devoptab_t entries, implement the _write function directly.

Automated benchmarking in GameBoy Advance homebrew

2023-10-30T00:00:00+01:00

The GBA is a delightful platform to develop for. It is straightforward enough to understand thoroughly – a single 32-bit CPU, no OS, no built-in wireless features – but also sufficiently advanced to allow an ergonomic workflow based on modern languages and tools like C++20 and CMake (or Rust). Still, it is not particularly powerful in terms of raw computation and when writing rendering code, even a simple ray-caster, some way of benchmarking the performance is essentially a must.

It is, of course, possible to add FPS (or better, cycle) counters on the screen and check them after every code change, but we seek a more rigorous approach. It should be completely automatic, so that it is trivial to execute both locally and in a CI pipeline.

Here is the rough idea:

Implement benchmarking in the game, using hardware resources built into the GBA. In my project I am using the TONC library which provides rudimentary cycle counting using GBA Timers 2 & 3 in a cascade mode.
(optional) Record a sequence of inputs to replay during the benchmark, to move around the scene and get more statistically relevant results.
Execute a GBA emulator in headless mode (no GUI) and have it output the measured performance metrics.
Capture this output for further processing.

This leaves me with two problems to solve: finding a cycle-accurate emulator that can run headless, and somehow exfiltrating the measurements from the emulated console.

It turns out that the GBA’s SIO peripheral supports a UART mode, which in essence means very simple character output. Perfect, now just to find an emulator that can forward the UART to the host operating system.

mGBA

As of writing, mGBA is the only GBA emulator “confidently recommended” by the Emulation General Wiki. Not to imply that this is somehow the authoritative source on all things emulation, but I have gotten good advice from it in the past. mGBA is being actively developed and its technical blog provides some excellent reading. Does it fit the bill, though?

mGBA implements Lua scripting support that allows introspecting the emulated system rather deeply. Unfortunately, it does not implement UART mode. However, there is an alternative – albeit non-standard – way to get stuff out of the emulator. This mechanism consists of several I/O registers in the 0x04FFxxxx address space of the emulated system. Through these registers, the ROM has access to the emulator’s own logging facilities.

mGBA comes with example code for using these, assuming a libgba runtime. As mentioned earlier, my project is based on TONC; fortunately, porting the example code is trivial – in mgba.c, the #include of the libgba header just needs to be changed to the TONC equivalent.

Next, I was looking for a way to run mGBA headless and for a limited duration. At first it looked like some level of source code hackery would be necessary, but then I discovered one of the built-in test utilities, mgba-rom-test. It can execute a ROM, without any user interface, and while it cannot be told to quit after a fixed interval of time (like mgba-perf can), it can exit once the game calls an SWI specified by the user.

Calls a what? SWI (software interrupt) instructions are normally used to invoke functions built into the GBA BIOS; we can therefore either re-purpose a function that would never be used during the benchmark, or find an unallocated SWI number to claim for our purposes. In absence of convincing reasons for either option, I went with the former, appropriating the Stop call (swi 0x03).

Putting it together

A minimal, but complete example then looks like this:

#include <tonc.h>

#include "mgba/mgba.h"

int main(void) {
    int i;

    mgba_open();

    profile_start();

    // waste some time
    for (i = 1; i <= 10; i++) {
        Div(i, i);
    }

    uint duration = profile_stop();
    mgba_printf(MGBA_LOG_INFO, "BENCHMARK: %d cycles", duration);

    Stop();
}

The mgba-rom-test executable is not distributed with mGBA releases; in fact, it is not even compiled by default. We will need to build mGBA from source with some custom flags. Since we don’t care about GUI or fancy features, we can use additional options to minimize the build time and dependencies. The following configuration worked well for me:

$ cmake -DBUILD_QT=OFF \
        -DBUILD_ROM_TEST=ON \             <-- secret sauce
        -DBUILD_SDL=OFF \
        -DUSE_EDITLINE=OFF \
        -DUSE_ELF=OFF \
        -DUSE_EPOXY=OFF \
        -DUSE_FFMPEG=OFF \
        -DUSE_LIBZIP=OFF \
        -DUSE_MINIZIP=OFF \
        -DUSE_PNG=OFF \
        -DUSE_SQLITE3=OFF \
        -DUSE_ZLIB=OFF \
        -G Ninja ..

$ ninja mgba-rom-test

If everything goes well, we run it like this:

$ ./test/mgba-rom-test -S 0x03 --log-level 15 /path/to/ROM.gba
GBA Debug: BENCHMARK: 1751 cycles

--log-level is a bit field whose documentation leaves it rather mysterious, but it seems to correspond to enum mLogLevel in the code base. A value of 15, or 0x0F, then corresponds to all levels from FATAL down to INFO, but excluding DEBUG and lower. Without the flag, mGBA’s implementation of the GBA BIOS emits a message every time a BIOS function is used, which can be annoying.
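To capture this output for further processing, a small wrapper script is enough. The sketch below reuses the exact command line and message format shown above; the function names are hypothetical, and it merges stdout and stderr because it makes no assumption about which stream the emulator logs to.

```python
import re
import subprocess

# matches the message printed by the example ROM, e.g. "BENCHMARK: 1751 cycles"
BENCHMARK_RE = re.compile(r"BENCHMARK: (\d+) cycles")


def parse_benchmarks(output: str) -> list[int]:
    """Extract all cycle counts from the emulator's log output."""
    return [int(match.group(1)) for match in BENCHMARK_RE.finditer(output)]


def run_benchmark(emulator: str, rom: str) -> list[int]:
    """Run the ROM headless and return the measured cycle counts."""
    result = subprocess.run(
        [emulator, "-S", "0x03", "--log-level", "15", rom],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # capture both streams together
        text=True,
        check=False,               # the exit code is not interpreted here
    )
    return parse_benchmarks(result.stdout)


sample = "GBA Debug: BENCHMARK: 1751 cycles\n"
print(parse_benchmarks(sample))
```

From here the numbers can be written to a file or compared against a stored baseline in CI; run_benchmark("./test/mgba-rom-test", "/path/to/ROM.gba") would be the call matching the session above.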

Multiple scenarios

At the beginning I alluded to recording inputs for later playback. It doesn’t seem to be implemented as a native feature of mGBA at this time, so let’s come back to this topic in the future.

Let’s start with a presumably easier problem, which is how to select one out of multiple test scenarios to execute. There exists a trivial solution which is to use compile-time flags and build a number of different ROMs, one per scenario. That feels rather wasteful, and a potential nightmare to manage as the number of test cases goes up. Can we, instead, bake everything into a single ROM and make the choice at runtime?

Having previously solved the problem of extracting data from mGBA, let’s now look at the ways to inject data at the start. A cursory glance reveals a number of entry vectors:

the ROM file itself
the built-in command-line debugger (-d)
the built-in GDB (-g)
IPS patches (-p)
Lua scripts
save states (-t)
cheat codes (-c)

Some of these are out – for example, the scripting engine is not accessible in mgba-rom-test. Patching a fixed address inside the ROM is always an option, but let’s look for something more elegant. Save states are very powerful, but they use a proprietary binary format, which might even change between versions of the emulator.

What about cheat codes? They operate on the principle of hooking the game code and allowing pretty much arbitrary memory modifications. That sounds interesting, to say the least!

GameShark for Game Boy Advance

Of course, things cannot be too simple. The cheat devices for the GBA are a mess, with the most famous ones (Action Replay and GameShark Advance) encrypting their codes. While the encryption has been broken open years ago, it would be preferable to avoid such a complication altogether. There is Codebreaker, where encryption is optional. Finally, mGBA supports a “VBA” cheat file format (presumably pioneered by VisualBoyAdvance), which is the most straightforward of all: it’s just a list of address-value pairs.

In order to do their dirty work, the classic cheat devices work by hooking the game code and hijacking a suitable branch instruction, whose address is encoded as part of the cheat “master code”. With VBA cheats, this is not necessary; the memory modifications are applied at the end of each emulated frame. This has the downside that the game has to wait for one frame to pass before checking the memory location (~5 extra lines of code including clean-up). It still seems like a better trade-off than having to look for a Thumb branch instruction that is on the main code path and guaranteed to be stable across builds.

Having sorted out the mechanism, we still need a place to put our magic cookie. For a proof of concept, let’s just put it at the very end of EWRAM, which spans the 256 KiB from 0x0200 0000 to 0x0203 FFFF. The linker script should ideally be adjusted to make sure that the compiler will not interfere with our chosen special location.

The cheat file itself then boils down to a single line:

0203FFFE:beef
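If the scenario number is picked by a test driver, the cheat file can be generated on the fly. A minimal sketch in Python; the helper names are made up for illustration, and the format is simply the address:value line shown above.

```python
from pathlib import Path

SCENARIO_ADDR = 0x0203FFFE  # last halfword of EWRAM, as chosen above

def make_cheat_line(address, value):
    """Format one address:value pair (16-bit value) like the line above."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("value must fit into 16 bits")
    return f"{address:08X}:{value:04x}"

def write_cheat_file(path, scenario):
    """Write a one-line cheat file selecting the given scenario (hypothetical helper)."""
    Path(path).write_text(make_cheat_line(SCENARIO_ADDR, scenario) + "\n")

print(make_cheat_line(SCENARIO_ADDR, 0xBEEF))  # 0203FFFE:beef
```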

And reading the value from inside is straightforward, too:

// wait for 1 frame to pass
irq_init(NULL);
irq_enable(II_VBLANK);
VBlankIntrWait();

mgba_printf(MGBA_LOG_INFO, "Requested test case: %04Xh", *(u16*)0x0203FFFE);

Let’s see it in action:

$  ./test/mgba-rom-test -S 0x03 --cheats cheatfile.txt --log-level 15 /path/to/ROM.gba
GBA Debug: Requested test case: BEEFh
GBA Debug: BENCHMARK: 1751 cycles
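For automation, the emulator output has to be turned back into numbers. A rough Python sketch, assuming the log lines look exactly like the ones above; the function name is hypothetical.

```python
import re

BENCHMARK_RE = re.compile(r"BENCHMARK: (\d+) cycles")

def parse_benchmark(log_text):
    """Return the cycle count from the first BENCHMARK line, or None if absent."""
    match = BENCHMARK_RE.search(log_text)
    return int(match.group(1)) if match else None

sample = """GBA Debug: Requested test case: BEEFh
GBA Debug: BENCHMARK: 1751 cycles
"""
print(parse_benchmark(sample))  # 1751
```

Returning None instead of raising makes it easy to tell a ROM that crashed before reporting from one that reported a real number.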

GitLab CI

We can now take it one step further, and have the benchmark run automatically in a CI pipeline. It makes sense to separate the ROM build step from the benchmarking step, since the former needs the GBA toolchain to build, while the latter builds mGBA for the host platform (or whatever container image we run it in), before executing the test proper. The complete example is a bit too long to reproduce here in full, but you can check it out here.

In a previous post, I have shown how you could accumulate these results in a database and track long-term trends, generate fancy badges and so on. This time, it is left as an exercise to the reader :)

Generating core dumps for bare-metal AArch64 programs

2023-07-30T00:00:00+02:00

Introduction

Bare-metal 64-bit ARM programming is a strange niche: small, power-efficient microcontrollers usually implement the – considerably simpler – 32-bit version of the architecture. And on larger chips, one would typically run their application under a full-blown OS, namely Linux. Yet, there are cases where one needs the raw performance of an advanced 64-bit CPU, but a standard OS, despite all efforts to tune it, would bring in too much timing uncertainty for real-time process control.

Welcome to CERN, where standard approaches don’t quite cut it, and state of the art is yesterday’s news. The FGC4, a new digital controller in development by the Electrical Power Converters group, has exactly these kinds of requirements.

Debugging these kinds of systems can be… interesting. There is a whole spectrum of mechanisms that one might have at their disposal, depending on choices made in the system design – from the simplicity of the blinking LED, the versatility of a serial output, to the sophistication of a JTAG adapter. But once the device is out in the field, you cannot guarantee to be present when an error occurs. The best you can do is to log it in the maximum detail possible, and attempt to understand the issue later. Fortunately, there is a standard mechanism for that – the core dump. Unfortunately, it is not readily available in environments like the one described in this post.

What is a core dump?

A core dump is a snapshot of the state of a process, usually at the time it crashed. The idea is that you can take this snapshot, load it up in a debugger, and inspect the process as if the error had occurred just moments ago. The most precious piece of knowledge to recover is perhaps the stack trace; however, the ability to inspect the program’s variables can certainly be useful as well.

Unfortunately, while a core dump is the de-facto standard way of capturing process state, its format is actually dependent on the host operating system (as is the case for executables). But on bare metal, there is no OS. No OS, no core dump – right? Well, consider this: what if we could pretend to be running an OS, and synthesize the core dump file accordingly? Then we could use all the standard tools to analyze it and extract useful information.

Anatomy of a core dump

Interestingly, on Linux the same ELF format used for executables and libraries is used as a container for the core dump. They even use the same program headers to describe memory segments, although things begin to diverge beyond that. Before diving deeper into the specifics, let’s take a look at what such a dump contains:

A data structure giving basic information about the process, such as its PID or argument list
A snapshot of the process’ memory, including code, global variables, thread stacks and the heap
For each thread:
- A structure giving basic information
- A copy of the CPU state, including general-purpose and floating-point registers

For better illustration, let’s draw a comparison between a core dump and a standard executable. An ELF executable usually includes these sections:

.text (executable code)
.rodata (read-only data)
.data (initialized data)
.bss (zero-initialized data); since the contents of this section are known to be all zeroes, it is not necessary to physically include them in the file

When you use GDB to save a core dump, it will contain a copy of all of these, plus the program heap and stacks of all threads.

Comparison of the memory sections between executables and core dumps – both in ELF format. Note that each file contains additional sections not corresponding to program memory

This has been a simplification, and to correctly synthesize a core dump, we have to be a bit more precise: The core dump will indeed contain a snapshot of memory corresponding to the sections mentioned above, however, this snapshot is described by one or more PT_LOAD segments rather than sections; names and other attributes of sections are therefore lost. This is not a problem, because we can extract section information from the original executable file.
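To look at the segments of an existing core file without reaching for readelf, a few lines of Python suffice. This is only a sketch: it assumes a little-endian ELF64 file and relies on the fixed field offsets of Elf64_Ehdr and Elf64_Phdr; the function name is my own.

```python
import struct

PT_NAMES = {1: "PT_LOAD", 4: "PT_NOTE"}

def list_segments(data):
    """Return (e_type, [(segment type, file offset, file size), ...])."""
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    e_type, = struct.unpack_from("<H", data, 16)
    e_phoff, = struct.unpack_from("<Q", data, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)
    segments = []
    for i in range(e_phnum):
        base = e_phoff + i * e_phentsize
        p_type, = struct.unpack_from("<I", data, base)
        p_offset, = struct.unpack_from("<Q", data, base + 8)
        p_filesz, = struct.unpack_from("<Q", data, base + 32)
        segments.append((PT_NAMES.get(p_type, hex(p_type)), p_offset, p_filesz))
    return e_type, segments

# Usage (assuming a file named "core" exists):
#   with open("core", "rb") as f:
#       print(list_segments(f.read()))
```

For a core dump, the returned e_type should be 4 (ET_CORE), and the list should contain one PT_NOTE entry plus one or more PT_LOAD entries.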

Process & thread information

An ELF-encoded core dump also contains a PT_NOTE segment providing some general information about the process and its threads.

Structure of the PT_NOTE segment

Inspecting a core dump

To extract useful information from a core dump, the original ELF file of the program is also required. This is because the core dump does not contain information about symbols, let alone mappings from compiled code to source locations. With a core file in hand, we can execute

$ gdb program.elf core

This will start GDB, load the program ELF file and combine it with the information found in the core file. This article is not meant as a GDB tutorial, but in case you need a rundown, this one is quite nice. What matters to us here is the observation that information contained in the core dump is “overlaid” on top of the original executable.

Writing our own

Impedance mismatch

There are some discrepancies between the Linux model and a bare-metal application. For example, there is no concept of Process ID or a command line. We also assume a single thread of execution. That might not always be the case; one might, for example, want to capture the state of a multi-threaded FreeRTOS application.

Another difference is that of address spaces: Linux processes always execute in virtual memory, while a bare-metal program would use physical memory directly. In practice, it’s not really a problem. All that matters is that the addresses agree between the ELF of the program and the dumped image. As long as we do not move the program around when loading it, this will indeed be the case.

Collecting information

A core dump would usually be emitted in response to a program crash. Under an operating system with memory protection, it is well-defined what a program is allowed to do. On bare metal, the possibilities are much wider and there are footguns aplenty. Without going too much into detail, one symptom of a critical problem on AArch64 can be a Synchronous exception. If the goal is to produce a dump, it is important to save all general-purpose registers (GPRs) on exception entry instead of just caller-saved ones, as is often done before calling an exception handler written in C/C++. You can see an example here, but it will probably require adapting to your specific application.

Besides the GPRs, we also need to gather the contents of floating-point registers, and finally, we need to know which region(s) of memory are relevant to the program.

Writing the core file

With all the inputs on hand, we can move on to assembling the actual file. ELF is not the simplest format to write manually (without the use of any libraries), but in this case the structure will be simple enough. To begin, we need a copy of elf.h (careful about the license, though) to provide the structures and constants. A slight complication here lies in the fact that we have to precompute all the offsets and sizes. Let’s begin by visualizing the physical layout of the file we are going to write:

Arrangement of data structures comprising the core file

First comes the ELF file header. Not much surprising here. Note the file type of ET_CORE.

FILE* elf = fopen("core", "wb");

// ELF header
Elf64_Ehdr ehdr {};
ehdr.e_ident[0]   = ELFMAG0;
ehdr.e_ident[1]   = ELFMAG1;
ehdr.e_ident[2]   = ELFMAG2;
ehdr.e_ident[3]   = ELFMAG3;
ehdr.e_ident[4]   = ELFCLASS64;
ehdr.e_ident[5]   = ELFDATA2LSB;
ehdr.e_ident[6]   = EV_CURRENT;
ehdr.e_type       = ET_CORE;
ehdr.e_machine    = EM_AARCH64;
ehdr.e_version    = EV_CURRENT;
ehdr.e_phoff      = sizeof(ehdr);
ehdr.e_ehsize     = sizeof(ehdr);
ehdr.e_phentsize  = sizeof(Elf64_Phdr);
ehdr.e_phnum      = 2;
ehdr.e_shentsize  = sizeof(Elf64_Shdr);
fwrite(&ehdr, 1, sizeof(ehdr), elf);

Next, we will need to write two program headers: one for the memory snapshot (it can, in fact, come in multiple segments, but we’re keeping things simple) and one for the PT_NOTE segment described previously. Some complexity comes from the computation of the segment size and the need to align the snapshot to page size.

Elf64_Phdr phdr {};

// NOTE segment
phdr.p_type     = PT_NOTE;
phdr.p_offset   = sizeof(Elf64_Ehdr) + ehdr.e_phnum * sizeof(phdr);
phdr.p_filesz   = sizeof(Elf64_Nhdr) + 8 + sizeof(elf_prpsinfo) +
                  sizeof(Elf64_Nhdr) + 8 + sizeof(elf_prstatus) +
                  sizeof(Elf64_Nhdr) + 8 + sizeof(elf_fpregset_t);
fwrite(&phdr, 1, sizeof(phdr), elf);

// LOAD segment (memory image)
// First, compute alignment after previous segment
phdr.p_align    = 4096;
auto note_align = phdr.p_align - ((phdr.p_offset + phdr.p_filesz) % phdr.p_align);

if (note_align == phdr.p_align)
{
    note_align = 0;
}

phdr.p_type     = PT_LOAD;
phdr.p_flags    = PF_R | PF_X | PF_W;
phdr.p_offset  += phdr.p_filesz + note_align;
phdr.p_vaddr    = MEMORY_SNAPSHOT_ADDR;
phdr.p_paddr    = 0;
phdr.p_filesz   = MEMORY_SNAPSHOT_SIZE;
phdr.p_memsz    = MEMORY_SNAPSHOT_SIZE;
fwrite(&phdr, 1, sizeof(phdr), elf);
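The offset bookkeeping is easy to get wrong, so here is the same arithmetic as a worked example in Python. The header sizes are those of the standard ELF64 structures; the note size is a placeholder, not the real sum computed above.

```python
EHDR_SIZE = 64    # sizeof(Elf64_Ehdr)
PHDR_SIZE = 56    # sizeof(Elf64_Phdr)
PAGE_SIZE = 4096  # p_align of the LOAD segment

def padding_to(offset, alignment):
    """Number of filler bytes needed to reach the next multiple of alignment."""
    return -offset % alignment

# The NOTE segment starts right after the ELF header and the two program headers
note_offset = EHDR_SIZE + 2 * PHDR_SIZE
note_size = 1000  # placeholder value for illustration only
note_end = note_offset + note_size

load_offset = note_end + padding_to(note_end, PAGE_SIZE)
print(note_offset, load_offset)  # 176 4096
```

The modulo form returns 0 when the offset is already aligned, which is the special case handled by the explicit if in the C++ code.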

We don’t need to write any sections, so after the program headers we immediately proceed with the note segment. The alignment/padding convention here justifies writing a couple of helper functions first:

template<size_t alignment>
static auto make_padding_span(size_t length)
{
    static const std::byte zeros[alignment - 1] {};
    auto padding_needed = (length % alignment) == 0 ? 0 : (alignment - length % alignment);

    return std::span{zeros, padding_needed};
}

static bool write_note(FILE* f,
                       const char* name,
                       Elf64_Word type,
                       std::span<std::byte const> desc)
{
    auto terminated_name_len = strlen(name) + 1;
    auto nhdr = Elf64_Nhdr { .n_namesz = (Elf64_Word) terminated_name_len,
                             .n_descsz = (Elf64_Word) desc.size(),
                             .n_type = type };

    auto name_padding = make_padding_span<4>(terminated_name_len);

    if (fwrite(&nhdr, 1, sizeof(nhdr), f) != sizeof(nhdr)
            || fwrite(name, 1, terminated_name_len, f) != terminated_name_len
            || fwrite(name_padding.data(), 1, name_padding.size(), f) != name_padding.size()
            || fwrite(desc.data(), 1, desc.size(), f) != desc.size())
    {
        return false;
    }

    return true;
}
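As a cross-check of the note layout, here is the same serialization sketched in Python (little-endian, names mine). It also shows where the "+ 8" in the earlier size computation comes from: the name "CORE" plus its terminator is 5 bytes, padded to 8.

```python
import struct

def pad4(data):
    """Zero-pad a byte string to a multiple of 4 bytes."""
    return data + b"\0" * (-len(data) % 4)

def pack_note(name, note_type, desc):
    """Serialise one ELF note: header, padded name, padded descriptor."""
    name_bytes = name.encode("ascii") + b"\0"
    # n_namesz and n_descsz hold the unpadded lengths
    header = struct.pack("<III", len(name_bytes), len(desc), note_type)
    return header + pad4(name_bytes) + pad4(desc)

note = pack_note("CORE", 3, b"\x01\x02\x03\x04")  # 3 is NT_PRPSINFO in elf.h
print(len(note))  # 24 = 12 (header) + 8 (padded name) + 4 (descriptor)
```

Unlike the C++ helper, this sketch pads the descriptor as well; that only makes a difference when a descriptor's size is not a multiple of 4.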

Now for the main show:

//  Process information (we leave most fields set to zero)
elf_prpsinfo prpsinfo {};
strncpy(prpsinfo.pr_psargs, "bare-metal application", sizeof(prpsinfo.pr_psargs));
write_note(elf, "CORE", NT_PRPSINFO, std::as_bytes(std::span{&prpsinfo, 1}));

// Thread status and integer registers
elf_prstatus prstatus {};
prstatus.pr_pid = 1;
memcpy(&prstatus.pr_reg, &saved_gp_registers, sizeof(saved_gp_registers));
write_note(elf, "CORE", NT_PRSTATUS, std::as_bytes(std::span{&prstatus, 1}));

// FPU registers
write_note(elf, "CORE", NT_FPREGSET, std::as_bytes(std::span{&saved_fp_registers, 1}));

Finally, write the memory image, respecting the alignment calculated earlier:

if (note_align)
{
    char scratch[note_align];
    memset(scratch, 0, sizeof(scratch));
    fwrite(scratch, 1, sizeof(scratch), elf);
}

fwrite((void*) MEMORY_SNAPSHOT_ADDR, 1, MEMORY_SNAPSHOT_SIZE, elf);

And that’s it! You can find the complete code in this Gist.

As for testing, here is a tip: GDB can be used to extract information from a core dump non-interactively – useful for unit tests, et cetera:

$ gdb --batch -n -ex bt program.elf core

warning: core file may not match specified executable file.
[New LWP 1]
Core was generated by `bare-metal application'.
#0  0x0000000078020d8c in access_invalid_memory () at access_violation.cpp:8
#1  0x0000000078020db0 in main (argc=<optimized out>, argv=<optimized out>) at access_violation.cpp:23
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
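In a unit test, this could be wrapped along the following lines. It is a sketch only: the function names are hypothetical, gdb is assumed to be on the PATH, and the frame parser handles just the "#N  0x... in name (...)" form shown above.

```python
import subprocess

def gdb_backtrace(program, core, gdb="gdb"):
    """Run GDB in batch mode and return the backtrace text."""
    result = subprocess.run(
        [gdb, "--batch", "-n", "-ex", "bt", program, core],
        capture_output=True, text=True, check=True)
    return result.stdout

def frames(backtrace):
    """Extract function names from '#N  0x... in name (...)' lines."""
    names = []
    for line in backtrace.splitlines():
        if line.startswith("#") and " in " in line:
            names.append(line.split(" in ", 1)[1].split(" ", 1)[0])
    return names

# Usage in a test (assuming both files exist):
#   assert frames(gdb_backtrace("program.elf", "core"))[0] == "access_invalid_memory"
```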

Generating Makefile-like workflows with Python

2022-09-23T00:00:00+02:00

Every now and then I feel a need to programmatically generate and execute a Makefile: for example, to execute a graph-based procedural generation workflow, or to compile some game assets that are discovered dynamically by a tool – in general, all these problems can be represented as directed acyclic graphs of files and tasks.

At some point it gets quite tiring to write the same code over and over, so I went ahead and made it into a little library. It’s pretty straightforward. First, you create a graph.

import goeiedag

graph = goeiedag.CommandGraph()

Afterwards, you add tasks to it…

from pathlib import Path

# Get username
graph.add(["whoami", ">", "username.txt"],
          inputs=[],
          outputs=["username.txt"])

…and execute it.

goeiedag.build_all(graph, Path())

Note that if you run this code twice, it will not re-execute the whoami command, since the output already exists and is considered up-to-date.

For commands that have inputs (or dependencies), the build tool needs to know about these, so that it is able to determine when the output has become obsolete and needs to be rebuilt. Of course, one command can depend on outputs from other commands, as long as there are no circular dependencies.

# Extract OS name from /etc/os-release
graph.add(["grep", "^NAME=", "/etc/os-release", ">", "os-name.txt"],
          inputs=["/etc/os-release"],
          outputs=["os-name.txt"])
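For intuition, the up-to-date check can be approximated by comparing file modification times. The sketch below is a simplification written for this post, not the library's or the build executor's actual logic; real build tools track more than timestamps.

```python
import os

def is_stale(inputs, outputs):
    """True if any output is missing or older than the newest input."""
    if not outputs or not all(os.path.exists(o) for o in outputs):
        return True
    if not inputs:
        # No dependencies: an existing output never becomes obsolete
        return False
    newest_input = max(os.path.getmtime(i) for i in inputs)
    oldest_output = min(os.path.getmtime(o) for o in outputs)
    return newest_input > oldest_output

# Usage (paths as in the example above):
#   is_stale(["/etc/os-release"], ["os-name.txt"])
```

With an empty input list, as in the whoami task, the output is considered fresh as soon as it exists, which matches the behavior described earlier.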

To make usage more convenient and avoid repetition, the library provides some special symbols to let you refer to the declared inputs and outputs when building up the command.

from goeiedag import ALL_INPUTS, INPUT, OUTPUT

# Extract OS name from /etc/os-release
graph.add(["grep", "^NAME=", INPUT, ">", OUTPUT],
          inputs=["/etc/os-release"],
          outputs=["os-name.txt"])
# Get username
graph.add(["whoami", ">", OUTPUT],
          inputs=[],
          outputs=["username.txt"])
# Glue together to produce output
graph.add(["cat", ALL_INPUTS, ">", OUTPUT.result],
          inputs=["os-name.txt", "username.txt"],
          outputs=dict(result="result.txt"))

When goeiedag.build_all is called, the library will generate a Ninja build file and execute it. I didn’t feel a need to re-implement the logic of orchestrating and executing the DAG efficiently, so the aim was to take advantage of an existing build executor. Compared to Make, Ninja has more pleasant tooling, better performance when dealing with complex builds, and is also easier to generate code for (for one, it cleanly supports tasks that generate multiple outputs).
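To give an idea of why Ninja is easy to target, here is a toy generator that emits one rule and one build statement per task. It is an illustration only, not the code or the output of the library, and it ignores escaping of special characters such as "$", spaces and colons.

```python
def to_ninja(tasks):
    """Render (command, inputs, outputs) tuples as Ninja rules and build statements."""
    lines = []
    for index, (command, inputs, outputs) in enumerate(tasks):
        lines.append(f"rule task{index}")
        lines.append(f"  command = {command}")
        # "build <outputs>: <rule> <inputs>"; rstrip handles an empty input list
        lines.append(
            f"build {' '.join(outputs)}: task{index} {' '.join(inputs)}".rstrip())
        lines.append("")
    return "\n".join(lines)

print(to_ninja([
    ("whoami > username.txt", [], ["username.txt"]),
    ("cat os-name.txt username.txt > result.txt",
     ["os-name.txt", "username.txt"], ["result.txt"]),
]))
```

A task with several outputs simply lists them all before the colon, which is the multiple-output support mentioned above.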

There exist many similar libraries with somewhat different paradigms; for example, Dask and TaskGraph use plain Python functions, rather than shell commands, as a unit of execution. This is advantageous if your flow is Python-heavy, but it locks you out of taking advantage of high-quality build executors like Ninja. Snakemake does work with commands, but AFAIU doesn’t help you build the task graph programmatically – the input has to be provided in Snakemake’s text format.

The package is available on PyPI and GitHub. Let me know what you think!

Macro expansion in Hy-based custom languages

2022-08-18T00:00:00+02:00

This seems like a very obvious thing to do, but I could not find a simple example anywhere.

Suppose we want to define a very simple S-expression-based language, with no variables or functions, just literals and a single core form – print:

(print [1 2 3])

Outputting:

'[1 2 3]

However, to allow the user to abstract, we want to permit the usage of Hy macros – with the full Hy language available at expansion time:

(defmacro iota [max] (list (range 1 (+ 1 max))))

; pointless macro just to demonstrate repeated expansion
(defmacro identity [expr] expr)

(print (identity (iota 3)))

Output:

'[1 2 3]

Here is the Hy source of an interpreter for this language. This is as simple as I managed to get:

#!/usr/bin/env hy

(import hyrule)
(import os)
(import sys)
(import types)

(with [f (open "minimal.hy")] (do
    (setv module (types.ModuleType "minimal"))
    ; without the following, evaluation of defmacro triggers an error
    (setv (get sys.modules module.__name__) module)

    (setv compiler (hy.compiler.HyASTCompiler module module.__name__))

    (for [form (hy.read-many f)]
        ; expand macros -- this includes processing of "defmacro" itself
        (setv exp (hyrule.macrotools.macroexpand-all
                :form form
                :ast-compiler compiler
                ))
        ;(print "EVAL" (hy.repr exp))

        ; evaluate form
        (cond
            (= (get exp 0) (hy.models.Symbol "defmacro"))
                ; at evaluation time ignore defmacro
                None
            (= (get exp 0) (hy.models.Symbol "print"))
                ; this is our core form
                (print (hy.repr (get exp 1)))
            True
                (raise (Exception "invalid form"))
        )
    )
))

One could argue that this is a stupid thing to do, that I should have simply defined my print as a new macro, and used the standard hy.eval. Well, what if my set of core forms is not known ahead of time? Consider another language, one where each core form executes a shell command of the same name:

(defmacro get-filename [] "hello.txt")

(echo "Hello" "World" ">" (get-filename))
(cat (get-filename))
(uname "-a")

Well, with this structure, you can do that! The evaluation block just changes to this:

        ; evaluate form
        (if (!= (get exp 0) (hy.models.Symbol "defmacro"))
            (os.system (.join " " (list exp)))
            None
        )

This seems quite useful, doesn’t it – imagine the possibilities! I would be curious to see the Racket equivalent, since the juggling of modules and scopes seemed quite a bit more involved there. On the other hand, Racket has first-class custom language support, which might help quite a bit. And then there is ee-lib, which I have yet to explore.

ARMv8: Cache coherency between code running in different exception levels

2022-08-05T00:00:00+02:00

At its heart, the Xilinx UltraScale+ SOC has a multi-core Cortex-A53 CPU. This is not the fastest ARM out there, but it’s still plenty capable. One interesting feature is its built-in Snoop Control Unit (SCU). This enables transparent synchronization of L1 caches among the individual cores. There is one pitfall that you might fall into when running bare-metal code: the distinction between secure and non-secure memory access.

If one of your cores operates in a secure exception level (such as EL3) and another runs in a non-secure exception level (EL1), by default they see different memory spaces. The AXI bus has a signal called AxPROT that distinguishes secure and non-secure access. Once your data reaches a “stupid” memory such as DDR SDRAM, this distinction vanishes; however, the data cache does take it into account and effectively treats these two as separate address spaces. Thus, your precious shared memory buffer will not be coherent until flushed on one side and invalidated on the other.

Fortunately, there is a simple solution. The page tables that you set up to configure the MMU (a prerequisite for using the snooper) have a bit called NS (Non-secure). Setting this bit forces all accesses to be treated as Non-secure even when running in a secure EL. The converse (forcing secure access from EL1) is of course not possible, because it would completely break the security model.

NS is bit 5 of the Lower attributes, so if you’re using the Xilinx-provided template code (translation_table.S) and you don’t really care about this type of security, you might want to simply change the line which says

.set Memory,	0x405 | (3 << 8) | (0x0)		/* normal writeback write allocate inner shared read write */

to

.set Memory,	0x425 | (3 << 8) | (0x0)		/* normal writeback write allocate inner shared read write (forced non-secure) */
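The new constant is just the old one with bit 5 set, which is quick to verify (Python used as a calculator here):

```python
NS_BIT = 1 << 5            # NS is bit 5 of the lower attributes
original = 0x405           # value from the Xilinx template
patched = original | NS_BIT
print(hex(patched))        # 0x425
```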

Alternatively, this can be done at runtime.

Thanks to this thread on the ARM Support forums.

Tracking FPGA design build metrics

2022-06-20T00:00:00+02:00

…with low infrastructure footprint

For my latest FPGA toy project, I was looking for a way to have an overview of builds with performance metrics – f_max and resource usage, but also results of simulation and benchmarks.

GitLab CI has a convenient feature whereby one can specify a regular expression to extract test coverage from CI logs, and this value is then displayed on the job list page. Unfortunately it does not offer any flexibility beyond that.

Despite searching far and wide, I didn’t find any satisfactory solution (at least not for free), so I set out to build my own.

The rough idea was the following:

collect key metrics from CI jobs:
- timing results
- resource usage
- test results under simulation
- benchmarks under simulation
accumulate data across builds
present in a spreadsheet for easy viewing

Since I am already quite invested in GitLab CI, I wanted to maximally reuse the facilities it provides. However, to accumulate data across many builds, I still needed some kind of database. I opted for PostgreSQL, motivated by the existence of Supabase, which provides a 500 MB cloud-hosted database for free. It doesn’t really matter, a private MariaDB or some NoSQL solution would do the job as well. The only requirement is reachability from the CI runners.

The next question was one of the database schema. I decided to hardcode only a few very general columns which might need to be indexed later, for purposes of filtering and sorting:

commit hash & title
timestamps of commit + pipeline
CI pipeline URL
branch name

The rest of the build data is shoved into a JSON object to allow maximum flexibility and easy schema evolution.

To present the data, a final CI job generates a static, self-contained HTML page and deploys it via GitLab Pages. Simple!

Implementation

Let’s take a walk through the code now.

DB schema

We start by creating a table. (Recall that public is the default schema in a new PostgreSQL database.)

CREATE TABLE public.builds (
    id serial NOT NULL,
    pipeline_timestamp timestamp(0) NOT NULL,
    pipeline_url varchar NOT NULL,
    branch varchar NOT NULL,
    commit_sha1 varchar(40) NOT NULL,
    commit_timestamp timestamp(0) NOT NULL,
    "attributes" json NULL,
    CONSTRAINT builds_pk PRIMARY KEY (id)
);

Configuring the jobs to always save artifacts

Builds may succeed or fail, but in any case, we want the logs to be saved for later processing. This applies to all build and simulation jobs.

build_yosys:
  script: ...

  artifacts:
    paths:
      - build/nextpnr-report.json
      - ulx3s.bit
      - "*.log"
    when: always

Note that if an entry under paths refers to a non-existent file, GitLab Runner will complain a bit, but ultimately will go about its day, without triggering a job failure or skipping the remaining artifacts.

Collecting the results

The first custom job is tasked with collecting the results of all builds/tests, extracting key metrics, and storing them into the database.

For this reason, it needs to depend on all the previous jobs and their artifacts.

reports:
  stage: upload
  needs:
  - job: build_ulx3s
    artifacts: true
  - job: test_cocotb
    artifacts: true
  - job: test_verilator
    artifacts: true
  when: always

  image: python:3.10

  script:
    - pip install junitparser "psycopg>=3"
    - ./tools/ci/save_build_stats.py

The body of this script is, for the most part, unremarkable. We begin by preparing a dictionary and collecting the first pieces of metadata:

results = {}
results["commit_title"] = os.environ["CI_COMMIT_TITLE"]

The results extraction is mostly tool specific, so I will only reproduce one example here, which is that of parsing Cocotb results in JUnit format:

if os.path.exists("results.xml"):
    xml = JUnitXml.fromfile("results.xml")

    failures = []

    for suite in xml:
        for case in suite:
            # Failures are reported by a <failure> node under the test case,
            # while passing tests don't carry any result at all.
            # To be determined whether this is the JUnit convention,
            # or a cocotb idiosyncrasy.
            if any(r._tag == "failure" for r in case.result):
                failures.append(case.classname + ":" + case.name)

    if len(failures) > 0:
        results["sim"] = dict(result="fail", failed_testcases=failures)
    else:
        results["sim"] = dict(result="pass")
else:
    results["sim"] = dict(result=None)   # result unknown, maybe design failed to compile
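If you would rather avoid the extra dependency, roughly the same check can be written with the standard library alone. This is an alternative sketch, not the code used in the project; it assumes the report nests testcase elements under testsuite elements, and the sample XML is made up.

```python
import xml.etree.ElementTree as ET

def failed_testcases(xml_text):
    """Return 'classname:name' for every test case with a <failure> child."""
    root = ET.fromstring(xml_text)
    failures = []
    for case in root.iter("testcase"):
        if case.find("failure") is not None:
            failures.append(f"{case.get('classname')}:{case.get('name')}")
    return failures

sample = """<testsuites><testsuite name="all">
  <testcase classname="tb" name="test_ok"/>
  <testcase classname="tb" name="test_bad"><failure/></testcase>
</testsuite></testsuites>"""
print(failed_testcases(sample))  # ['tb:test_bad']
```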

Finally, we collect the indexable metadata and shove everything into the table:

with psycopg.connect(os.environ["POSTGRES_CONN_STRING"]) as conn:
    cursor = conn.cursor()
    cursor.execute('INSERT INTO builds(pipeline_timestamp, pipeline_url, branch, '
                   'commit_sha1, commit_timestamp, "attributes") '
                   'VALUES (%s, %s, %s, %s, %s, %s)', (
        os.environ["CI_PIPELINE_CREATED_AT"],
        os.environ["CI_PIPELINE_URL"],
        os.environ["CI_COMMIT_BRANCH"],
        os.environ["CI_COMMIT_SHA"],
        os.environ["CI_COMMIT_TIMESTAMP"],
        json.dumps(results))
    )

The environment variable POSTGRES_CONN_STRING must be defined in the project’s CI settings. Normally your database host will provide the connection string readily. It should follow this template: postgresql://username:password@hostspec/dbname. Don’t forget to include the password!
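A quick sanity check of the connection string can save a failed pipeline run. A small sketch using only the standard library; the function name and the example host are made up.

```python
from urllib.parse import urlsplit

def check_conn_string(conn_string):
    """Return a list of problems found in a postgresql:// URL."""
    parts = urlsplit(conn_string)
    problems = []
    if parts.scheme not in ("postgresql", "postgres"):
        problems.append("unexpected scheme")
    if not parts.username:
        problems.append("missing username")
    if not parts.password:
        problems.append("missing password")
    if not parts.hostname:
        problems.append("missing host")
    if not parts.path.strip("/"):
        problems.append("missing database name")
    return problems

print(check_conn_string("postgresql://user@db.example.com/metrics"))
# ['missing password']
```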

Presentation

A second script takes care of fetching all historical records and presenting them on a webpage.

To deploy to GitLab Pages, the job must be named literally pages, and it needs to upload a directory called public as an artifact. Otherwise, the configuration is straightforward: it just needs to wait for the reports job to finish, in order to always work with the most up-to-date data.

pages:
  stage: upload
  needs: [reports]

  image: python:3.10

  script:
    - mkdir public
    - cd public
    - pip install Jinja2 "psycopg>=3"
    - ../tools/ci/present_build_stats.py

  artifacts:
    paths:
    - public

The Python part starts by fetching all the DB records and passing them onto a Jinja template.

with psycopg.connect(os.environ["POSTGRES_CONN_STRING"]) as conn:
    cursor = conn.cursor()
    cursor.execute('SELECT id, pipeline_timestamp, pipeline_url, branch, commit_sha1, '
                   'commit_timestamp, "attributes" '
                   'FROM builds ORDER BY pipeline_timestamp DESC, id DESC')

    builds = [dict(id=row[0],
                   pipeline_timestamp=row[1],
                   pipeline_url=row[2],
                   branch=row[3],
                   commit_sha1=row[4],
                   commit_timestamp=row[5],
                   **row[6]   # unpack all other attributes into the dictionary
                   ) for row in cursor.fetchall()]

env = jinja2.Environment(loader=jinja2.FileSystemLoader(os.path.dirname(__file__)))
template = env.get_template("build_stats.html")

Path("builds.html").write_text(template.render(
    builds=builds,
    project_url=os.environ["CI_PROJECT_URL"]
    ))

Here you can see the result live.

For more flavor, we also generate some badges. The idea here is to generate a publicly accessible JSON file that can be fed into shields.io Endpoint mode.

fmax_str = "%.1f MHz" % reference_build["build"]["fmax"][reference_clk]["achieved"]
Path("fmax.json").write_text(json.dumps(dict(schemaVersion=1,
                                             label="Fmax",
                                             message=fmax_str,
                                             color="orange")))
# --> render via https://img.shields.io/endpoint?url=https://MY_GITLAB_PAGES.io/fmax.json

This produces a static URL that can be added to the project website and the image will always reflect the result of the most recent build:

Magic? No – science!

A note about security

The database connection string is supplied through an environment variable, which allows us to store it in the GitLab project with reasonable security – as long as an untrusted party cannot execute an echo command in our CI. Admittedly, this is a weakness of the chosen approach. To allow secure collection of results from pipelines triggered by third parties (which includes all merge requests, for example), it would probably require a separate, trusted pipeline or some kind of scraper task running somewhere in the cloud.

Scalability

You might have noticed that there is no pagination mechanism; in fact, the presentation job has O(n) complexity with respect to the number of historic builds. This will not scale once there are hundreds and thousands of builds, and a more efficient approach will be required.
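One possible direction would be to fetch the records page by page. The sketch below only builds the query and its parameters; the function name and the page size are arbitrary choices, and the generated site would still need a way to link the pages together.

```python
def page_query(page, page_size=50):
    """Build a paginated SELECT and its parameters (page numbers start at 0)."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size > 0")
    sql = ('SELECT id, pipeline_timestamp, pipeline_url, branch, commit_sha1, '
           'commit_timestamp, "attributes" '
           'FROM builds ORDER BY pipeline_timestamp DESC, id DESC '
           'LIMIT %s OFFSET %s')
    return sql, (page_size, page * page_size)

sql, params = page_query(2)
print(params)  # (50, 100)

# Usage with an open cursor: cursor.execute(sql, params)
```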

The collection and presentation jobs also add some time to the overall runtime of the pipeline. Unfortunately, it seems that most of this time is just overhead of spinning the runner up, thus quite difficult to get rid of.

Final thoughts

At this point, it might seem a logical next step to generalize the presented solution into something more flexible, an off-the-shelf tool that could suit other teams. For now, I have decided against that for reasons of complexity. In the world of build automation and complex FPGA designs, different projects have wildly different needs; and while a couple hard-coded scripts are easy to understand and maintain, a useful generic framework would need a lot of flexibility, thus incurring a high upfront cost in terms of complexity.

Therefore, my recommendation would be that you just copy the code, and adapt it to your specific needs.

UPDATE (2023-11-01): Originally, the post recommended bit.io as the cloud Postgres database. Unfortunately, that service shut down earlier in 2023. Supabase seems like a competent alternative.