Software Development Blog

Learning OCaml by Parsing JSON

2025-11-23T19:00:00+00:00

Two weeks ago, I started on a journey to learn how to design programming languages. After thorough research, I decided to learn and use OCaml for this.

For starters, it’s very important to me that the programming languages I create are supported by formal mathematics and type theory. The Coq proof assistant (which now prefers to be called Rocq) enables logicians to write their compiler alongside a proof of the language’s interesting properties. More on this in a future post. Rocq extracts to OCaml. It extracts to other languages, as well, like Haskell, but since the official installation instructions reference opam, I thought that I would have a better time learning and using OCaml than fighting with the Rocq compiler to extract to workable Haskell. I’ll probably never learn whether that hypothesis holds water.

The other books I’ve selected for my journey also make heavy use of OCaml. Besides this, I haven’t yet been able to claim an ML dialect as one of my competencies, so I thought this would be an opportunity to check that box.

I selected the book Real World OCaml to teach myself the language. The second edition is available for free online from Cambridge, but I like to touch paper, so I ordered a used paperback on Thriftbooks. I read Part I from start to finish, only skimming the chapters on objects and classes, and picked a smattering of content from Part II–the chapters on testing, command line arguments and parsing. I skipped Part III altogether.

Reflections on the module system

So far, I’ve been pleasantly impressed with the language. Features like named parameters and the module system set it apart from other mainstream languages. For a file foo.ml containing this definition:

type t = int

We can name this type in client modules with Foo.t, where Foo is the name that OCaml assigns to the module from its filename. If I prefer to nest modules, I might make Foo a submodule in bar.ml, where the name of t would be Bar.Foo.t:

module Foo = struct
  type t = int
end

The rules for forming the name of a definition are clear and predictable. This is something I miss when going back to C++, where namespaces are completely orthogonal to file structure. A C++ development team has to discuss and align on a set of rules, and then somehow enforce and maintain them, either with tools or code inspection¹.

Coming from non-ML languages, however, I find it a little strange that modules are the mechanism for constructing types. OCaml does not have type classes, interfaces, traits or templates. In place of these, it has only modules.

Monads in Haskell and OCaml

Let’s consider the definition of Monad in Haskell:

class Applicative m => Monad (m :: Type -> Type) where
  (>>=) :: m a -> (a -> m b) -> m b
  (>>) :: m a -> m b -> m b
  return :: a -> m a

This is clearly a type. We can export it from a module, and import it:

-- In Monad.hs
module Control.Monad (Monad) where

-- In MyApp.hs
import Control.Monad (Monad)

Here, the module system is clearly separate from the type system, and it’s an organizational structure–not a logical one. Let’s compare this to the definition of Monad in OCaml:

module type Monad = sig
  type 'a t
  val return : 'a -> 'a t
  val ( >>= ) : 'a t -> ('a -> 'b t) -> 'b t
end

There are some obvious and superficial syntactic differences. For example, the type parameters appear here on the left side of the type constructor, whereas in Haskell they appear on the right. This type, however, is inextricably connected to the filesystem. The language facility for scoping lexical names is the same as the one used to construct a higher order type abstraction.

Implementing Monads

We’re going to take this two steps further. To implement Monad for your type in Haskell, you declare it as an instance:

instance Monad Maybe where
  (Just x) >>= k = k x
  Nothing  >>= _ = Nothing

  (>>) = (*>)

This is copied directly from the source in Base. If the monad laws are upheld, return is definitionally equal to pure from the Applicative typeclass, and the right sequence operator is equivalent to the same operator in Applicative.

To implement Monad in OCaml, you could create a module that satisfies the Monad module signature:

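(* Without the "with type" constraint, 'a t would be abstract outside MOption. *)
module MOption : Monad with type 'a t = 'a option = struct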
  type 'a t = 'a option
  let return x = Some x
  let ( >>= ) m f = Option.bind m f
end

I should point out another syntactic difference: unlike Haskell’s data constructors, OCaml’s constructors are not first-class functions and must be applied to their argument, so the compiler rejects this definition: let return = Some.

In practice, however, no one writes OCaml this way. There is no abstraction for monads. The interface is enforced for common monads in Base by convention.

Programming With Monads

To program using monads in Haskell, we often don’t use the typeclass functions directly. Instead, we use do notation, which allows us to write code that reads like an imperative program:

maybeSum :: Maybe Int -> Maybe Int -> Maybe Int
maybeSum a b = do
  a' <- a
  b' <- b
  return $ a' + b'

GHC transforms this into an equivalent expression that uses the typeclass functions during compilation:

maybeSum a b =
  a >>= \a' ->
    b >>= \b' ->
      return $ a' + b'

OCaml has two solutions that provide similar ergonomics. The most mature and common tool is ppx_let (a Jane Street invention), but OCaml 4.08 introduced binding operators which might someday provide the same level of convenience for common monads. This example writes our maybeSum function using ppx_let:

open Base

let maybe_sum a b =
  let open Option.Let_syntax in
  let%bind a' = a in
  let%bind b' = b in
  Some (a' + b')

Both ppx_let and do-notation are syntax extensions–they are transformed into equivalent monadic code before compilation. Both functions are written directly against a concrete monad: Maybe in Haskell and Option in OCaml.

Monads are, of course, very powerful, and we can express all sorts of interesting logic on them. The functional reactive programming library Yampa in Haskell provides reactimate, a function for running an arrow function on a monad given a pair of “input sensing” and “actuation” monadic actions:

reactimate
  :: Monad m
  => m a                            -- Initializing action
  -> (Bool -> m (DTime, Maybe a))   -- Input sensing action
  -> (Bool -> b -> m Bool)          -- Actuation (output) action
  -> SF a b                         -- The arrow function
  -> m ()

In practice, this is often run on a side-effect-producing monadic action, like IO. However, it’s polymorphic in the monad m, so it could just as easily be run on a Maybe or a list.

As far as I’m aware, there’s no easy way to write this function in OCaml. If we were to create a Monad module, like above, we could implement this as an OCaml functor (a function on modules). However, that module does not exist in the standard library, so any solution would be inconsistent with the rest of the ecosystem. I’m interested to hear if I’m missing something.

Parsing JSON

I only completed one of the exercises in the book–the JSON parser exercise built using OCamllex and Menhir. This was the one I thought would be directly applicable to developing compilers.

Since the book is available online for free, I won’t reproduce the code for the parser here. Instead, I just want to point out a couple of interesting experiences I encountered while developing it.

Testing

I wanted to try ppx_expect, so I pulled it in and wrote an expect test for the parser:

let%expect_test _ =
  let lexbuf = Lexing.from_string {|{"obj":"\"foo\""}|} in
  let result = Option.get (Json_parser.parse_with_error lexbuf) in
  Json_parser.output_value stdout result;
  [%expect {| |}];

I followed the advice of the book and established a “test only” library, separate from my parser library. Supposedly, this prevents code bloat. It does require me to deviate from the Dune project template a little, so here’s my test/dune file:

(library
 (name test_json)
 (inline_tests)
 (preprocess (pps ppx_inline_test ppx_expect))
 (libraries json_parser))

This test is written to expect empty output, so we fully expect it to fail, and it does. But expect tests produce a diff in the output that, if accepted, would cause the test to pass:

[json_parser]$ opam exec -- dune runtest
File "test/test_json.ml", line 1, characters 0-0:
diff --git a/_build/default/test/test_json.ml b/_build/.sandbox/713d6c9a80082f32d86b6de371e3845a/default/test/test_json.ml.corrected
index 435e144..884c113 100644
--- a/_build/default/test/test_json.ml
+++ b/_build/.sandbox/713d6c9a80082f32d86b6de371e3845a/default/test/test_json.ml.corrected
@@ -3,4 +3,4 @@ let%expect_test _ =
   let lexbuf = Lexing.from_string {|{"obj":"\"foo\""}|} in
   let result = Option.get (Json_parser.parse_with_error lexbuf) in
   Json_parser.output_value stdout result;
-  [%expect {| |}];
+  [%expect {| {"obj":"\"foo\""} |}];

This is called Exploratory Programming, and I’m very excited to use it when writing my compiler.

Escaping Quotes in String Literals

The authors wrote a function for printing a Json.t back to a string, which is not rendered in the output. I copied it from the book sources on GitHub, but changed mine to output “minified” JSON, without whitespace between tokens. I figured this would be easier to write expect tests against.

When given a string literal that contains escaped quotes, the parser fails with a syntax error. This turns out to be because the lexer in the book is missing a rule for escaped double quotes in string literals. Here’s my updated definition of the read_string rule:

and read_string buf =
  parse
  | '"'       { STRING (Buffer.contents buf) }
  | '\\' '/'  { Buffer.add_char buf '/'; read_string buf lexbuf }
  | '\\' '\\' { Buffer.add_char buf '\\'; read_string buf lexbuf }
  | '\\' 'b'  { Buffer.add_char buf '\b'; read_string buf lexbuf }
  | '\\' 'f'  { Buffer.add_char buf '\012'; read_string buf lexbuf }
  | '\\' 'n'  { Buffer.add_char buf '\n'; read_string buf lexbuf }
  | '\\' 'r'  { Buffer.add_char buf '\r'; read_string buf lexbuf }
  | '\\' 't'  { Buffer.add_char buf '\t'; read_string buf lexbuf }
  | '\\' '"'  { Buffer.add_char buf '"'; read_string buf lexbuf }
  | [^ '"' '\\']+
    { Buffer.add_string buf (Lexing.lexeme lexbuf);
      read_string buf lexbuf
    }
  | _ { raise (SyntaxError ("Illegal string character: " ^ Lexing.lexeme lexbuf)) }
  | eof { raise (SyntaxError ("String is not terminated")) }

I also had to change the output_value function to escape double-quote characters when serializing string literals. Here’s that fragment. The rest is the same:

let rec output_value outc = function
  | `Assoc obj -> print_assoc outc obj
  | `List l -> print_list outc l
  | `String s -> print_string outc s
  | `Int i -> printf "%d" i
  | `Float x -> printf "%f" x
  | `Bool true -> Out_channel.output_string outc "true"
  | `Bool false -> Out_channel.output_string outc "false"
  | `Null -> Out_channel.output_string outc "null"

and print_string outc s =
  let escaped = CCString.replace ~sub:"\"" ~by:"\\\"" s in
  Out_channel.output_string outc ("\"" ^ escaped ^ "\"")

What’s Next?

I have so far enjoyed programming in OCaml, even though it seems that OCaml has less support for categorical abstractions than Haskell does. I’m now moving on to reading Types and Programming Languages, by Benjamin C. Pierce, which develops type checking algorithms in OCaml. My next post is likely to include the results of experimenting with these.

¹ Of course, C++ is mostly alone in these troubles–modern languages like Rust also have clear and predictable rules for paths of types and definitions, which are derived from the file structure.

Pinned Places in C++

2025-04-05T08:00:00+00:00

In C++, move semantics become very difficult to express and constrain accurately without dynamic memory allocation. This is an unfortunate feature of the language, especially when working in environments that don’t have a heap.

Storage Durations

As a brief reminder, all variables in C++ have a storage duration, and there are four storage duration classes: automatic, static, thread and dynamic. Memory with dynamic storage duration is allocated on the heap, of course. Local variables have automatic storage duration unless they are declared static, extern, or thread_local.
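As a quick illustration of the four classes, here is a minimal sketch. None of these names come from a real project; g_boot_count, t_error_code, next_id, make_sample and check_storage_durations are invented for the example.

```cpp
#include <cassert>
#include <memory>

// Static storage duration: exists for the entire run of the program.
int g_boot_count = 0;

// Thread storage duration: every thread gets its own copy.
thread_local int t_error_code = 0;

int next_id() {
    // Also static storage duration: initialized once, and it keeps its
    // value between calls even though it is declared inside a function.
    static int last_id = 0;

    // Automatic storage duration: destroyed when this call returns.
    int step = 1;

    last_id += step;
    return last_id;
}

std::unique_ptr<int> make_sample(int value) {
    // Dynamic storage duration: the int lives on the heap until the
    // returned unique_ptr is destroyed or reset.
    return std::make_unique<int>(value);
}

// Small usage example (hypothetical helper).
void check_storage_durations() {
    assert(next_id() == 1);
    assert(next_id() == 2);  // last_id survived the first call
    assert(*make_sample(7) == 7);
}
```

On a target without a heap, make_sample is the one piece of this sketch that is off the table, which is why the rest of this post sticks to the other three classes.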

The Hypothetical ADC Driver

Imagine, for example, that I have a class that represents a SPI bus, and that I’m writing a driver for a specific ADC device that’s connected to my SPI bus. So far, that may look something like this:

enum class SpiError { /* Enumeration Literals... */ };

class Spi {
public:
  Spi() = default;
  // Destructor _may_ actually do something interesting.
  ~Spi();

  std::expected<std::monostate, SpiError> write(
    std::span<const uint8_t> data,
    uint8_t chip_select);

  // Delete the copy constructor
  Spi(const Spi&) = delete;
  Spi& operator=(const Spi&) = delete;

  // Moving is okay, though.
  Spi(Spi&&) = default;
  Spi& operator=(Spi&&) = default;

private:
  // Instance data...
};

class Adc {
public:
  Adc(Spi* spi, uint8_t chip_select) : spi_{spi}, chip_select_{chip_select} {}
  uint32_t read_channel(uint8_t channel_index);

private:
  Spi* spi_;
  uint8_t chip_select_;
};

We don’t want Adc to own an instance of Spi by value, because we need to share the Spi instance between multiple devices. Spi needs to implement resource locking and ensure that concurrent access to the bus is impossible.

There’s no problem with these classes yet. Adc can keep its default copy and move constructors–after all, it doesn’t own any resources, so it’s valid to copy or move an instance.

…Until I do this:

class ApplicationState {
public:
  // Chip-select indices 0 and 1 are placeholder values.
  ApplicationState() : spi_{}, left_adc_{&spi_, 0}, right_adc_{&spi_, 1} {}

private:
  Spi spi_;
  Adc left_adc_;
  Adc right_adc_;
};

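The move semantics of ApplicationState are constrained, but not automatically. Its implicitly generated move constructor copies each Adc’s Spi* verbatim, so after a move, left_adc_ and right_adc_ still point at the Spi inside the moved-from object. When I create a Spi* member variable in the Adc class, I’m introducing a new class invariant on Adc: a borrowed lifetime. This isn’t Rust, but if it were, we would be forced to add a generic lifetime parameter on Adc, like this: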

struct Adc<'a> {
  spi: &'a Spi,
  chip_select: u8,
}

This would let the borrow checker ensure that we are free of temporal memory safety issues. C++ has nothing like this. Granted, this kind of thing becomes easier to spot if you’re used to reviewing code for it (or perhaps if you’re a Rust programmer). It becomes harder at scale–when there are many member variables, or when we aren’t already familiar with the invariants on Spi and Adc–for example, if we didn’t write them.
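To make the hazard concrete, here is a stripped-down sketch of the classes above. The bus logic is removed, there is only one ADC, and adc_borrows_from and demonstrate_stale_pointer are helpers invented for this example. It shows that after a move, the ADC in the new object still points at the Spi inside the old one.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Simplified stand-in for the Spi class above; all bus logic is omitted.
struct Spi {
    Spi() = default;
    Spi(const Spi&) = delete;
    Spi& operator=(const Spi&) = delete;
    Spi(Spi&&) = default;
    Spi& operator=(Spi&&) = default;
};

class Adc {
public:
    Adc(Spi* spi, std::uint8_t chip_select)
        : spi_{spi}, chip_select_{chip_select} {}
    const Spi* bus() const { return spi_; }
    std::uint8_t chip_select() const { return chip_select_; }

private:
    Spi* spi_;
    std::uint8_t chip_select_;
};

class ApplicationState {
public:
    // Chip-select 0 is an arbitrary value for the example.
    ApplicationState() : spi_{}, adc_{&spi_, 0} {}

    // Hypothetical helper: true if the ADC points at the Spi owned by `owner`.
    bool adc_borrows_from(const ApplicationState& owner) const {
        return adc_.bus() == &owner.spi_;
    }

private:
    Spi spi_;
    Adc adc_;
};

void demonstrate_stale_pointer() {
    ApplicationState first;
    assert(first.adc_borrows_from(first));

    // The implicit move constructor copies the raw pointer verbatim.
    ApplicationState second{std::move(first)};
    assert(second.adc_borrows_from(first));    // still points into `first`
    assert(!second.adc_borrows_from(second));  // not at its own Spi
}
```

Nothing here is undefined behavior yet, because first is still alive when the pointers are compared. Once first is destroyed, the ADC inside second is left holding a dangling pointer.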

Pinned Places

The issue is that the class invariant is placed on the instance data of Adc. We could make the class Spi immovable, but that’s not really correct. There’s nothing in the type Spi that makes it immovable. We may be able to coerce the C++ type system into helping us create a type that enforces this invariant. The concept of a pinned place may help us here.

We’ll start by introducing our type, Pin:

template<typename T>
concept ValueType = std::is_same_v<std::remove_reference_t<std::remove_pointer_t<T>>, T>;

template<ValueType T>
class Pin {
public:
    using reference_type = std::add_lvalue_reference_t<T>;
    using pointer_type = gsl::not_null<T*>;

    template<typename... Args>
    Pin(Args&&... args) : m_value{std::forward<Args>(args)...} {}
    ~Pin() = default;

    // Pinned objects intentionally have non-copyable/non-movable semantics.
    // Strictly speaking, copy semantics ought to be definable if T is
    // copyable, but default-ing them would restrict Pin to copyable types T.
    // Semantically, there is no reason why we should be able to copy a pinned
    // object, so we are safe to delete this.
    Pin(const Pin&) = delete;
    Pin& operator=(const Pin&) = delete;
    Pin(Pin&&) = delete;
    Pin& operator=(Pin&&) = delete;

    reference_type operator*() noexcept { return m_value; }
    pointer_type operator->() noexcept { return &m_value; }

private:
    T m_value;
};

A Pin object wraps an instance of a movable type T in an immovable container. For variables with automatic storage duration, this has the effect of “pinning” them on the stack. The noexcept specifiers on the operators are not necessary for this discussion, but they let code that dereferences a Pin keep its own noexcept guarantees, which is nice. Similarly, we use gsl::not_null for a little bit of extra safety. We never intend to return a null pointer, so it’s helpful to annotate that. Finally, you may notice that I created the concept ValueType to ensure that T is not a pointer or reference type. Allowing those would make the nested type declarations more complicated, and after all, “pinning” a reference type has no semantic value, so we disallow it.

Now, we need a type that will allow classes to require their callers to uphold the immovable invariant on owned instance data. We’ll call it PinPtr:

template<ValueType T>
class PinPtr {
public:
    using reference_type = std::add_lvalue_reference_t<T>;
    using pointer_type = gsl::not_null<T*>;

    // TODO: Constructors?

    reference_type operator*() const
        noexcept(noexcept(*std::declval<pointer_type>())) {
      return *m_value;
    }
    pointer_type operator->() const noexcept { return m_value; }

private:
    pointer_type m_value;
};

We’ll come back to the constructors in a moment. We can now rewrite Adc like this, and PinPtr acts just like any smart pointer type:

class Adc {
public:
  Adc(PinPtr<Spi> spi, uint8_t chip_select) : spi_{spi}, chip_select_{chip_select} {}
  uint32_t read_channel(uint8_t channel_index) {
    static constexpr std::array<uint8_t, 2> command = {0x08, 0x00};
    spi_->write(command, chip_select_);
    // Reading back the conversion result is omitted here.
    return 0;
  }

private:
  PinPtr<Spi> spi_;
  uint8_t chip_select_;
};

Adc is still movable, and so is Spi. But the idea is that now, when I write ApplicationState:

class ApplicationState {
public:
  // Chip-select indices 0 and 1 are placeholder values.
  ApplicationState() : spi_{}, left_adc_{spi_, 0}, right_adc_{spi_, 1} {}

private:
  Pin<Spi> spi_;
  Adc left_adc_;
  Adc right_adc_;
};

The move and copy constructors for ApplicationState are automatically deleted!
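That claim can be checked at compile time. The sketch below is an assumption-laden miniature, not the code above: Pin is cut down (no gsl::not_null and no concept, so it builds as C++17), Spi is an empty placeholder, and ApplicationState holds only the pinned member.

```cpp
#include <type_traits>
#include <utility>

struct Spi {};  // Placeholder; freely copyable and movable on its own.

// Cut-down Pin with the same deleted copy and move operations as above.
template <typename T>
class Pin {
public:
    template <typename... Args>
    Pin(Args&&... args) : m_value{std::forward<Args>(args)...} {}

    Pin(const Pin&) = delete;
    Pin& operator=(const Pin&) = delete;
    Pin(Pin&&) = delete;
    Pin& operator=(Pin&&) = delete;

    T& operator*() noexcept { return m_value; }
    T* operator->() noexcept { return &m_value; }

private:
    T m_value;
};

class ApplicationState {
public:
    ApplicationState() : spi_{} {}
    Spi& spi() noexcept { return *spi_; }

private:
    Pin<Spi> spi_;
};

// The wrapped type keeps its own semantics...
static_assert(std::is_move_constructible_v<Spi>, "Spi stays movable");
// ...but any class holding a Pin member loses copy and move implicitly.
static_assert(!std::is_copy_constructible_v<ApplicationState>, "no copy");
static_assert(!std::is_move_constructible_v<ApplicationState>, "no move");
static_assert(!std::is_move_assignable_v<ApplicationState>, "no move-assign");
```

The point of the design is that ApplicationState never mentions copying or moving at all; the restriction comes entirely from the type of its member.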

Constructing a `PinPtr`

The last thing is to ensure that it’s only possible to construct a PinPtr from a valid Pin. For this, we need to recall how value categories work. Remember that an rvalue is either a prvalue (a temporary value that has no address) or an xvalue (a value that is “expiring”). An lvalue is something that has an address, and is neither of these. How these are represented in the type system, however, may be surprising:

T& can only represent an lvalue.
T&& can only represent an rvalue.
const T& can represent either an lvalue or an rvalue.

The last point is critical! Objects of Pin are only valid as lvalues, and so it’s only valid to construct a PinPtr from a Pin&–this is the only way that we can ensure the pinned object will remain valid after the constructor runs. With this in mind, we can add our constructors:

template<ValueType T>
class PinPtr {
public:
  template<typename U>
  PinPtr(Pin<U>& pin) : m_value{pin.operator->()} {}
};

We template the constructor to allow polymorphic pointers: we can construct a PinPtr<T> from a Pin<U> if U is T or a class derived from T. Also recall that declaring this constructor suppresses the implicitly declared default constructor, which is critical to enforcing this invariant.
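Both points can be checked with type traits. This is a sketch under a few assumptions: Pin and PinPtr are cut down, raw pointers stand in for gsl::not_null, the concept is dropped so it builds as C++17, and the Bus and SpiBus hierarchy is invented purely for illustration.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

template <typename T>
class Pin {
public:
    template <typename... Args>
    Pin(Args&&... args) : m_value{std::forward<Args>(args)...} {}
    Pin(const Pin&) = delete;
    Pin& operator=(const Pin&) = delete;
    Pin(Pin&&) = delete;
    Pin& operator=(Pin&&) = delete;

    T* operator->() noexcept { return &m_value; }

private:
    T m_value;
};

template <typename T>
class PinPtr {
public:
    // Only a non-const lvalue Pin<U> binds here, and U* must convert to T*.
    template <typename U>
    PinPtr(Pin<U>& pin) : m_value{pin.operator->()} {}

    T* operator->() const noexcept { return m_value; }

private:
    T* m_value;
};

// Hypothetical hierarchy, used only for this example.
struct Bus {
    virtual ~Bus() = default;
    virtual int id() const { return 0; }
};
struct SpiBus : Bus {
    int id() const override { return 42; }
};

static_assert(std::is_constructible_v<PinPtr<Bus>, Pin<SpiBus>&>,
              "derived-to-base conversion is allowed");
static_assert(!std::is_constructible_v<PinPtr<Bus>, Pin<SpiBus>&&>,
              "a temporary Pin is rejected");
static_assert(!std::is_constructible_v<PinPtr<Bus>, Bus*>,
              "a raw pointer is rejected");
static_assert(!std::is_default_constructible_v<PinPtr<Bus>>,
              "there is no default constructor");

int id_through_base() {
    Pin<SpiBus> pinned;
    PinPtr<Bus> ptr{pinned};
    assert(ptr.operator->() == pinned.operator->());  // same object
    return ptr->id();                                 // virtual dispatch
}
```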

Prevailing wisdom would have us declare the single-arg constructor as explicit, but I think that’s not necessary here. There is one–and only one–way to construct a PinPtr. Requiring an explicit constructor call would not add clarity, and would only add visual noise.
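The reference-binding rules this section relies on can be double-checked with a small overload set. This is only a sketch; Widget, binds_to, accepts_const_ref and check_binding_rules are names invented for the example.

```cpp
#include <cassert>
#include <string>
#include <utility>

struct Widget {};

// Overloads that report which kind of reference the argument bound to.
std::string binds_to(Widget&) { return "lvalue reference"; }
std::string binds_to(Widget&&) { return "rvalue reference"; }

// A const lvalue reference accepts lvalues and rvalues alike.
bool accepts_const_ref(const Widget&) { return true; }

void check_binding_rules() {
    Widget w;

    // const T& binds to both value categories.
    assert(accepts_const_ref(w));
    assert(accepts_const_ref(Widget{}));

    // T& is chosen for lvalues, T&& for rvalues.
    assert(binds_to(w) == "lvalue reference");
    assert(binds_to(Widget{}) == "rvalue reference");      // prvalue
    assert(binds_to(std::move(w)) == "rvalue reference");  // xvalue
}
```

Because the PinPtr constructor takes Pin<U>& and offers no const Pin<U>& or Pin<U>&& overload, a temporary Pin has nothing to bind to.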

Does It Really Work, Though?

The answer seems to be yes!

class Resource {};

class Borrows {
public:
    Borrows(PinPtr<Resource> resource) : resource_{resource} {}
private:
    PinPtr<Resource> resource_;
};

int main() {
    Resource r;
    Pin<Resource> pinned;

    // These fail to compile:
    // Borrows borrower{Pin{Resource{}}};
    // Borrows borrower{&r};
    // Borrows borrower{Resource{}};

    // This is the only way to construct a Borrows:
    Borrows borrower{pinned};
}

Conclusion

It’s not a perfect bolt-on solution for temporal memory safety. In the previous example, PinPtr cannot ensure that the lifetime of pinned outlives the lifetime of borrower. Now that we’re protected from pointer invalidation by move/destruction, however, other temporal memory safety issues are theoretically harder to invoke accidentally, and may be easier to locate during code inspection.

Whether this adds value, though, or visual noise, is up to you. It may feel odd to represent a new semantic value category using a vocabulary type. If that’s the case, this likely isn’t for you! I think it has the potential to prevent a number of nasty UB issues, however, so I think I’m a fan of it! After trying to use it, I’ll give an update to see if that turned out to be the case.

Running LineageOS for the First Time

2025-02-17T20:00:00+00:00

Lately, I’ve been consuming a lot of literature that comes exclusively in digital form. Academic papers, PhD theses, blogs, journals, and creative commons digital books. I find it to be really uncomfortable to spend long hours reading material on my laptop, so I decided to buy my first tablet.

I poked around on eBay to see what I could find cheaply available. The iPads were enticing, but I decided I would look for one with LineageOS support. If I only spent a few dollars, I wouldn’t mind if I brick the thing. I have been a professional Android engineer in my very short career, but I’ve never engaged with the LineageOS ecosystem, so this felt like a good opportunity.

I selected the Samsung Galaxy Tab S2 9.7 (Wi-Fi), which is about 9 years old at the time of this writing. I spent $60, shipping included (I probably overpaid), and when it arrived a week later I dubiously factory reset the thing. It was pretty speedy, and the battery life is not terrible, considering the device’s age. Additionally, it was very comfortable to read with for extended periods of time, and the screen is roomy.

The LineageOS team stopped supporting this device after LineageOS 16 (which seems to have been around 2019?), but the instructions are still easily found on Google. I was able to enable the “OEM Unlock” switch in the Developer Settings without any trouble at all, which cannot be said for my Samsung Galaxy A11. The instructions are fairly simple at a high level: install Heimdall, download a TWRP recovery image (another product I had no previous familiarity with), and sideload the Lineage image using ADB. The linked version of Heimdall didn’t provide a release bundle for arm64 Linux (which I’m not surprised by), but it seemed to run perfectly under box64.

I had almost no trouble at all after that, until step 4: Build a LineageOS installation package. I was not prepared to build. Thankfully, again, the build instructions are also linked nearby. One thing that can definitely be said for LineageOS is that their instructions are very clear, almost without exception.

The build instructions seem stock-standard for Android, with a couple of cute oddities (the lunch command replaced by brunch, etc.). The standard build dependencies are mostly the same, but Lineage 16 was the last version to require Python 2, which is no longer available in the Debian testing repositories. It is still available in nixpkgs, although nix-env refuses to install until this fragment is added to ~/.config/nixpkgs/config.nix:

{
  permittedInsecurePackages = [
    "python-2.7.18.8"
  ];
}

From there, I was able to set up a Python 2 virtualenv:

nix-env -iA nixpkgs.python2
python2.7 -m ensurepip --user --default-pip
python2.7 -m pip install --user virtualenv
virtualenv --python=python2.7 .lineage_venv

Which I can activate in every shell with . ./.lineage_venv/bin/activate.

The build instructions then ask the user to run the extract-files.sh script within the build tree, which seems to extract proprietary blobs from the running device using ADB. The script worked great–however, my device is apparently missing some of the proprietary blobs that are necessary, because the build system bailed out immediately when I tried to run it. Luckily, I was able to find an old prebuilt image for my device on the unofficial LineageOS archive, and the LineageOS wiki contains instructions for extracting files from prebuilt images, so I was able to supplement my losses.

That got me a little further, but the build was now failing on the first compilation step, because the prebuilt clang that came from the repo manifest links to two libraries, libtinfo.so.5 and libncurses.so.5 which aren’t installed on my machine. Naturally, these aren’t available in the Debian testing repositories either, but the build instructions indicated I might be able to install them if I downloaded them from an older release’s repositories. These versions were still being updated as of buster, so I clicked around until I found the download link, and the manual install worked!

curl -LO http://ftp.us.debian.org/debian/pool/main/n/ncurses/libtinfo5_6.4-4_amd64.deb
curl -LO http://ftp.us.debian.org/debian/pool/main/n/ncurses/libncurses5_6.4-4_amd64.deb
dpkg -i ./libtinfo5_6.4-4_amd64.deb
dpkg -i ./libncurses5_6.4-4_amd64.deb

Now a little bit further, and onto a make error:

Makefile:791: *** multiple target patterns. Stop.

This one stumped me. I’ve never seen this error from GNU Make before, and it’s not an immediately googleable problem. A little bit of poking around, and I did find one person who reported an error like this when trying to build Ubuntu Touch for their Xperia Z5 Compact. Apparently, setting USE_HOST_LEX=yes in the environment fixed it. To my shock and horror, it worked for me as well. In the future, I might like to look into that a little further to see what was actually causing it and why that would fix it.

At this point, I got about 25 build steps into the process, when I got an obscure error about cannot exec .../prebuilts/clang: File not found. I’d seen this enough to know it was either a dynamic linker error or a shebang error, and the file turned out to be a Python script with a #!/usr/bin/python shebang at the top. These pesky developers apparently never planned for me to want to use a Python other than the system Python to build the image. Unfortunately, the fix for this was to symlink /usr/bin/python to the Python 2 installation in the virtual environment:

sudo ln -s /home/edtwardy/Git/lineageos/.lineage_venv/bin/python /usr/bin/python

Luckily, Python never returned to using that path for Python 3 installations after the Python 2 end-of-life, so this was a (relatively) non-intrusive change.

Now, after a long 45 minute wait, it looked like the build was going to succeed. At the last step, however, I got a Python exception trace that ended with:

AssertionError: compression of system.new.dat failed.

Nice. That was frustrating. The next day, I took a look at build/make/tools/releasetools/common.py, the path mentioned in the stack trace. It looks like they were trying to perform a brotli compression, which should have been obvious from my earlier experience extracting files from the prebuilt brotli-compressed ext2 system image.

I made a small code change to try to get more information:

diff --git a/tools/releasetools/common.py b/tools/releasetools/common.py
index f7ab11cd8..c9ba9fc45 100644
--- a/tools/releasetools/common.py
+++ b/tools/releasetools/common.py
@@ -1755,10 +1755,10 @@ class BlockDifference(object):
                     '--output={}.new.dat.br'.format(self.path),
                     '{}.new.dat'.format(self.path)]
       print("Compressing {}.new.dat with brotli".format(self.partition))
-      p = Run(brotli_cmd, stdout=subprocess.PIPE)
-      p.communicate()
+      p = Run(brotli_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+      _, err = p.communicate()
       assert p.returncode == 0,\
-          'compression of {}.new.dat failed'.format(self.partition)
+            'compression of {}.new.dat failed: {}'.format(self.partition, err.strip())
 
       new_data_name = '{}.new.dat.br'.format(self.partition)
       ZipWrite(output_zip,

This change prints the output of stderr from the child process in the stack trace, which gave me all I needed to know:

Compressing system.new.dat with brotli
  running:  brotli --quality=6 --output=/tmp/tmpz4Bykj/system.new.dat.br /tmp/tmpz4Bykj/system.new.dat
Traceback (most recent call last):
  File "build/make/tools/releasetools/ota_from_target_files", line 2051, in <module>
    main(sys.argv[1:])
  File "build/make/tools/releasetools/ota_from_target_files", line 2025, in main
    output_file=args[1])
  File "build/make/tools/releasetools/ota_from_target_files", line 858, in WriteFullOTAPackage
    system_diff.WriteScript(script, output_zip)
  File "/home/edtwardy/Git/lineageos/android/lineage/build/make/tools/releasetools/common.py", line 1606, in WriteScript
    self._WriteUpdate(script, output_zip)
  File "/home/edtwardy/Git/lineageos/android/lineage/build/make/tools/releasetools/common.py", line 1761, in _WriteUpdate
    'compression of {}.new.dat failed: {}'.format(self.partition, err.strip())
AssertionError: compression of system.new.dat failed: failed to write output [/tmp/tmpz4Bykj/system.new.dat.br]: No space left on device
ninja: build stopped: subcommand failed.
06:29:59 ninja failed with: exit status 1

No Android development effort is complete without failing builds caused by a disk space shortage! It’s unclear why the developers would choose to use /tmp for this, when there’s a long history of system administrators putting /tmp on a different (and smaller) device. I’m embarrassed to say that I am (was) one of those admins. Luckily, it was an easy fix to mount a tmpfs over top of /tmp:

sudo mount -t tmpfs -o size=16g tmpfs /tmp

Let me tell you, it was such an adrenaline rush to see the build finally succeed:


#### build completed successfully (05:03 (mm:ss)) ####

I know that LineageOS 16 is getting up there in age now, but I’ve been impressed with the look and feel of it. I get nostalgia back to my first-ever Android program as a professional developer, which was also based on Android 9. Unfortunately, the camera doesn’t work. But I can live with that.

Implementing the batch-sequential architecture style in Rust

2024-12-11T09:00:00+00:00

As promised, this post is a follow up to my previous post with the details of how we’ve implemented a batch-sequential architecture pattern for the redfish-codegen project.

Types that do work

The goal is to write code like this:

struct Hello;
impl Process<()> for Hello {
    type Output = String;
    fn process(self, input: ()) -> Self::Output {
        "Hello, world!".to_string()
    }
}

fn print_message(input: String) {
    println!("{}", &input);
}

Pipeline::builder()
    .stage(Hello)
    .stage(print_message)
    .execute();

Here, we can construct a pipeline, and then execute it. We don’t care whether a stage consists of a free function or a type impl. The pipeline ensures, statically, that the input type of a stage is compatible with the output type of the previous stage. To achieve this, we start with the trait Process:

pub trait Process<Input> {
    type Output;
    fn process(self, input: Input) -> Self::Output;
}

This trait represents a procedure that consumes self (so it can only run once) and maps an input value to an output. We can provide a blanket implementation for all FnOnce closures:

impl<F, In, Out> Process<In> for F
where
    F: FnOnce(In) -> Out,
{
    type Output = Out;
    fn process(self, input: In) -> Out {
        self(input)
    }
}

Composing a Pipeline

The next trait is Stage, and we can use it to combine stages to form our pipeline:

pub trait Stage: private::Sealed {
    type Result;
    fn stage(self, process: Q) -> Self::Result
    where
        Q: Process;
}

Note that I’ve sealed this trait. Other areas of the code should not be implementing this trait. We’ll implement this on a type Pipeline. Here’s the definition of Pipeline, with some of the cruft removed:

pub struct Pipeline<Proc, PreviousStage> {
    pub(super) process: Proc,
    pub(super) previous: PreviousStage,
}

impl<In, P, R> Stage<In, P::Output, R> for Pipeline<P, R>
where
    P: Process<In>,
{
    type Result<Q> = Pipeline<Q, Self>;
    fn stage<Q>(self, process: Q) -> Self::Result<Q>
    where
        Q: Process<P::Output>,
    {
        Pipeline {
            process,
            previous: self,
        }
    }
}

All this does is compose a new Pipeline containing the previous Pipeline. You might refer to this type as telescoping, because its real type (as known to the compiler) contains the type of its sub-pipeline…

…which of course, contains the type of its sub-pipeline (recursively).
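
The telescoping can be made visible with a small sketch. The `Pipeline` struct below is a simplified stand-in, and `A`, `B`, and `C` are hypothetical unit-struct stages. The return type of `build` spells out the nesting by hand, and `std::any::type_name` prints whatever the compiler sees (its exact format is not guaranteed).

```rust
use std::any::type_name;

// Simplified stand-in for the Pipeline struct; the fields are never read here.
#[allow(dead_code)]
struct Pipeline<Proc, PreviousStage> {
    process: Proc,
    previous: PreviousStage,
}

// Hypothetical unit-struct stages.
struct A;
struct B;
struct C;

// Each added stage wraps the type of everything that came before it.
fn build() -> Pipeline<C, Pipeline<B, Pipeline<A, ()>>> {
    let first = Pipeline { process: A, previous: () };
    let second = Pipeline { process: B, previous: first };
    Pipeline { process: C, previous: second }
}

fn type_name_of<T>(_: &T) -> &'static str {
    type_name::<T>()
}

fn main() {
    // Prints the full nested type as the compiler knows it.
    println!("{}", type_name_of(&build()));
}
```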

Executing a Pipeline

We know that we need a trait that can allow us to execute a stage, so let’s begin there:

trait RunStage {
    type Output;
    fn run_stage(self) -> Self::Output;
}

If Process is the category of types with a self-consuming function from an input to an output, RunStage is the category of types that can execute a Process and propagate its output value. I’ll leave this for now, and we’ll come back to it in a moment.

Executing a pipeline requires executing each sub-pipeline. Earlier, while showing the Stage trait, I left out one critical piece of information. What is the nature of the pipeline’s first stage? I’ve got a type called PipelineBuilder. The name of this type should conjure an accurate depiction of its actual qualities–it contains nothing interesting. Except, an implementation of Stage:

impl Stage<(), (), ()> for PipelineBuilder {
    type Result<Q> = Pipeline<Q, ()>;
    fn stage<Q>(self, process: Q) -> Self::Result<Q>
    where
        Q: Process<()>,
    {
        Pipeline {
            process,
            previous: (),
        }
    }
}

So with this, if there exists a function Pipeline::builder() which returns a PipelineBuilder, we can construct a Pipeline, given a process whose input is the unit type (), and whose previous stage is also the unit type. This is nice–it requires that the initial stage of a pipeline take no inputs.

What does this have to do with executing a pipeline, though? Since we know the “head” of the pipeline is always the unit type, we can use this as a kind of recursive “base case” and provide an impl for it:

impl RunStage for () {
    type Output = ();
    fn run_stage(self) -> Self::Output {
        ()
    }
}

You might have noticed that the output of this impl is also the unit type, which happens to also be the input type of the next stage. Isn’t that beautiful? Finally, we can provide an implementation for Pipeline.

impl<P, R> RunStage for Pipeline<P, R>
where
    P: Process<<R as RunStage>::Output>,
    R: RunStage,
{
    type Output = P::Output;
    fn run_stage(self) -> Self::Output {
        let Self { process, previous } = self;
        process.process(previous.run_stage())
    }
}

The one thing I’ve left out is the trait Execute, which can be implemented in terms of RunStage. It’s so simple that I don’t see any value in including it here. I’ll leave this and other details as an exercise to the reader, or you can go and look at the feature branch where I’ve implemented some of this stuff. Cheater!
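
For readers who want to see the pieces run end to end, here is one way they might fit together. This is a simplified sketch, not the code from the feature branch: the sealed Stage trait is replaced by plain inherent `stage` methods, the `Execute` trait is my guess at the simplest definition that works, and `word_count` is a made-up second stage.

```rust
trait Process<Input> {
    type Output;
    fn process(self, input: Input) -> Self::Output;
}

impl<F, In, Out> Process<In> for F
where
    F: FnOnce(In) -> Out,
{
    type Output = Out;
    fn process(self, input: In) -> Out {
        self(input)
    }
}

struct Pipeline<Proc, PreviousStage> {
    process: Proc,
    previous: PreviousStage,
}

struct PipelineBuilder;

impl Pipeline<(), ()> {
    fn builder() -> PipelineBuilder {
        PipelineBuilder
    }
}

impl PipelineBuilder {
    // The first stage must accept the unit type as its input.
    fn stage<Q: Process<()>>(self, process: Q) -> Pipeline<Q, ()> {
        Pipeline { process, previous: () }
    }
}

trait RunStage {
    type Output;
    fn run_stage(self) -> Self::Output;
}

// Base case: the head of every pipeline is the unit type.
impl RunStage for () {
    type Output = ();
    fn run_stage(self) -> Self::Output {}
}

impl<P, R> RunStage for Pipeline<P, R>
where
    P: Process<<R as RunStage>::Output>,
    R: RunStage,
{
    type Output = P::Output;
    fn run_stage(self) -> Self::Output {
        let Self { process, previous } = self;
        process.process(previous.run_stage())
    }
}

impl<P, R> Pipeline<P, R>
where
    P: Process<<R as RunStage>::Output>,
    R: RunStage,
{
    // Simplification: an inherent method instead of the sealed Stage trait.
    // The bound on Q rejects a stage whose input type does not match.
    fn stage<Q: Process<P::Output>>(self, process: Q) -> Pipeline<Q, Self> {
        Pipeline { process, previous: self }
    }
}

// Assumed shape of Execute: run the last stage and return its output.
trait Execute {
    type Output;
    fn execute(self) -> Self::Output;
}

impl<T: RunStage> Execute for T {
    type Output = <T as RunStage>::Output;
    fn execute(self) -> Self::Output {
        self.run_stage()
    }
}

struct Hello;
impl Process<()> for Hello {
    type Output = String;
    fn process(self, _input: ()) -> Self::Output {
        "Hello, world!".to_string()
    }
}

// Hypothetical second stage: consumes the String produced by Hello.
fn word_count(input: String) -> usize {
    input.split_whitespace().count()
}

fn main() {
    let words = Pipeline::builder().stage(Hello).stage(word_count).execute();
    println!("{words}");
}
```

Swapping the two stages fails to compile, since `word_count` is not a `Process<()>`. That is the static check described at the top of this post.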

With this, the redfish-codegen project should be in a good place to begin applying this pattern in future work.

Thanks for reading!

New Patterns for Redfish-Codegen

2024-12-11T08:00:00+00:00

I haven’t spent very much time working on the Redfish-Codegen project lately, unfortunately. There’s an open PR out there that’s festering, and I’m long overdue for a release. I’ve been fixating on more important (read: vain) issues that I haven’t found a path forward on, and I’m experiencing some executive dysfunction.

The first problem is the technical complexity of the code generator, relative to the size of the application. The constructor for the main class of the code generator application is 113 lines long, and while I am pleased that we managed to keep expert information localized in this area of the codebase for so long, it’s become quite unwieldy. Mustache templates for code generation have become cumbersome and bug-prone to maintain, and there’s a monstrous 8.81 MiB patch file under version control that performs a simple transformation on all of the input schemas. Naturally, this patch is applied using quilt(1), and it’s generated by a Python script, which is called from a shell script. There are a lot of skills that a contributor needs to be effective.

Contributors to this codebase know Rust–we can count on that, or else they wouldn’t have been considering this solution. Anything else is extra. So, I’d like to gradually rewrite the code generator in Rust.

The ultimate pattern for gradual replacement of legacy systems is the Strangler Fig pattern. In this pattern, rewrites can proceed as long as they don’t cross the seam between two components. Rewriting a component is done wholesale, but as long as components are isolated, we are saved from the trap of the rewrite spiraling out of control. In order to apply this pattern, though, we need seams. That’s where my next pattern comes in.

Since the beginning, the code generator has employed a kind of batch-processing technique, similar to the Batch-sequential processing architecture pattern. In this pattern, connectors pass data between stages, which transform the data from one form to the next. Ideally, stages are decoupled from the implementation of adjacent stages, coupled only to the representation of the data that they receive. Stages receive their input in its entirety, and they produce their entire output synchronous with the next stage. This is a common architecture style for compilers and code generators of all kinds. In the canonical implementation, stages are allowed (or expected) to terminate when they have completed their processing.
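
As a rough illustration of the style (not the code generator itself), here is a toy three-stage batch in Rust. The stage names `parse`, `transform`, and `emit` and the data they handle are invented for the example. Each stage takes its whole input, finishes, and hands its whole output to the next stage.

```rust
// Stage 1: consume every raw record before anything downstream runs.
// Records that do not parse as numbers are dropped.
fn parse(raw: Vec<&str>) -> Vec<u32> {
    raw.into_iter()
        .filter_map(|s| s.trim().parse::<u32>().ok())
        .collect()
}

// Stage 2: coupled only to the representation (Vec<u32>), not to `parse`.
fn transform(values: Vec<u32>) -> Vec<u32> {
    values.into_iter().map(|v| v * 2).collect()
}

// Stage 3: render the final artifact in one go.
fn emit(values: Vec<u32>) -> String {
    values
        .iter()
        .map(|v| v.to_string())
        .collect::<Vec<_>>()
        .join(",")
}

fn main() {
    let output = emit(transform(parse(vec!["1", " 2", "x", "3"])));
    println!("{output}");
}
```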

This isn’t formalized in the architecture, however, so code for one “stage” is mixed with code from another stage. When reading the code, one has to stare for a long time to figure out when this transformation might be applied during code generation.

This sucks.

Keeping all of the expert information localized to the main class was very useful towards discovering this, however, since it made the problem painfully obvious as soon as it appeared.

So, I implemented a few types in Rust that will allow us to begin reconstructing the pipeline in Rust, with first-class abstractions that represent our architecture pattern, which I’ll describe in my next post!

Fixing Git Clone Errors

2024-11-28T00:00:00+00:00

Just before the holiday, I was working on a Yocto-based distribution for a Raspberry Pi. I’ve been using the gadget to stream music to my stereo over Bluetooth. I’d just finished pulling a bunch of junk out of the image when, to my disappointment, the do_fetch task for linux-raspberrypi failed. So, I tried again, and it failed again. I had been able to fetch successfully not twenty minutes earlier. Inspecting the log, it looks like git index-pack generated an invalid index file for one of the received pack archives:

...
Receiving objects: 100% (74709/74709), 26.56 MiB | 7.52 MiB/s, done.
fatal: local object e0a447351623bfa2df5a7e7429e1479826bc9a7a is corrupt
fatal: fetch-pack: invalid index-pack output

I’m not fluent in git internals, so at the time, this meant nothing to me. My immediate suspicion was a network error. In the past, I’ve seen repeatable problems with git clone magically disappear after Europe goes to sleep, so I assumed this was another such fluke. It was already late by this time, so I went to bed.

As you might imagine, it did not resolve itself in the morning. I tried setting BB_SHALLOW_CLONE and BB_SHALLOW_CLONE_DEPTH in my kas file to see if I could work around the issue by trying to minimize data transfer. No such luck. Strangely, I had not seen this with any other repository in my distro.

I tried the clone manually–the same branch, from the same GitHub repository. Here, I was able to get through a shallow clone, but trying to deepen the clone with git fetch --unshallow produced the same errors as were in the BitBake log.

So, I scripted an interaction to incrementally deepen the clone, to see how far I could get:

$ while true; do git fetch --deepen=1; done

This worked for a little while, until I got to a region of the history that retrying wouldn’t seem to get through. It wasn’t a terribly large transaction, only about 50 MiB. It gets more interesting, though–the error message isn’t consistent. There are a few patterns that I could pull out, in addition to the one shown above:

Receiving objects: 100% (130810/130810), 48.85 MiB | 7.40 MiB/s, done.
fatal: SHA1 COLLISION FOUND WITH c8fdd0d03907f9d11d2080ec77d94add9f144916 !
fatal: fetch-pack: invalid index-pack output

Receiving objects: 100% (130810/130810), 48.85 MiB | 8.33 MiB/s, done.
error: inflate: data stream error (incorrect data check)

In a situation like this, it often helps me to view the system from a high level and work on ranking failure modes for each component. In this scenario, I’m cloning the repository on my AMD machine running Debian testing. This operation goes out to the network, and copies a bunch of data from a server to disk. So, these are the major components:

The Git remote (GitHub)
The network
My installation of Git
My server’s RAM
My SSD

Let’s move down the list. GitHub wasn’t reporting an outage, and since I hadn’t had any other network troubles, it seemed unlikely to be something outside of my box. A bad DIMM might fit the bill, but I would expect to see other kinds of system instability–processes crashing and unrecoverable kernel panics at runtime, etc.

Next is the installation of git. The reported version is 2.45.2, and that matches the version of the installed package from dpkg -l. When I looked to see if there was an upgrade available, apt took the liberty of reminding me about an issue I’ve been ignoring for a month:

  WARNING: Device /dev/sdb5 has size of 911755265 sectors which is smaller than corresponding PV size of 911757312 sectors. Was device resized?
  WARNING: One or more devices used as PVs in VG edtwardy-vg have changed sizes.

The partition /dev/sdb5 is the only physical volume in the LVM2 volume group that contains my home directory and root filesystem. This error is telling us that the LVM2 physical volume is configured for a size exactly 1023.5 KiB larger than the partition that actually contains it. I’m not exactly sure how that happened. Recently, I was setting up a btrfs filesystem on a neighboring partition. It’s likely that I made an arithmetic error when I was resizing everything.
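
The 1023.5 KiB figure follows directly from the two sector counts in the warning. Here is the arithmetic as a small Rust sketch, assuming the 512-byte sectors that LVM reports in; the function name is made up.

```rust
// Assumption: LVM reports sizes in 512-byte sectors.
const SECTOR_BYTES: u64 = 512;

// Size by which the physical volume overshoots its partition, in KiB.
fn size_mismatch_kib(pv_sectors: u64, device_sectors: u64) -> f64 {
    ((pv_sectors - device_sectors) * SECTOR_BYTES) as f64 / 1024.0
}

fn main() {
    // Sector counts copied from the warning above.
    let kib = size_mismatch_kib(911_757_312, 911_755_265);
    println!("{kib} KiB");
}
```

The difference is 2047 sectors, one sector short of a full MiB.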

I procrastinate fixing things like this because my partitioning solution is extremely complicated in its current state, and I never have a Debian Live CD around when I need it. After booting into a live image, I fixed the issue by freeing up 1 logical extent (about 4 MiB) from the volume containing my /var partition and reallocating a couple of extents to make free space at the end of the physical volume. This allowed me to reduce the size of the PV to the size of the partition.

Apt no longer reports the above error, and a test shows that I can clone the Linux kernel. Even better, it still works the second time. It bothers me that I’m not sure why this may have been the cause of the problem. I know that git makes some temporary files in /var/tmp; perhaps the invalid logical extent lived somewhere in that partition. I don’t exactly know what writing to that region would do, but I’m not surprised that it wouldn’t work. I suppose I’m more surprised that I didn’t see something about this in dmesg first.

December Update

I never saw the failing Git clone errors again, but I did start seeing other kinds of system instability. I saw SEGFAULTs in GCC, crashing in pseudo, and finally, ext4 corruption. This all prompted me to run memtest86+, and sure enough, I had about 2049 bad addresses. A new pair of DIMMs passed a memtest out of the box, and I haven’t seen the problems since! It’s entirely possible this was caused by the bad RAM. But the lvm2 size issue was another ticking time-bomb that needed action, so I can’t complain now that both of them are resolved.

From Problem to Solution

2024-11-11T00:00:00+00:00

I often struggle to organize and prioritize my ideas. Time is precious, and I’d prefer to spend it with loved ones, relaxing, and practicing self-care, but my time spent tinkering is also very important to me. Today, I did some semantic modeling to understand the motivation for the work I do in my free time, and I came up with this model:

In this model, a Feature is an increment of work; a thing that I spend time to accomplish. My takeaway was that there are three motivations for completing Features:

To enable a Use Case.
To mitigate a Risk.
To advance a Goal.

I think that completely captures the solution space. I won’t explain every relationship in the diagram, since the relationships speak for themselves. There are a couple of details not represented in this diagram that deserve a little explaining, however.

Solvable Problems

This model exists so that I can focus my creativity to develop solutions to problems. The starting point is to recognize when a need arises and to identify the next step. The fundamental assumption of this process is that the Problem statement is the genesis of the product–that there exists some hypothetical product or process which is capable of addressing the need underlying the problem statement. Obviously, some problems can’t be solved with products or processes. I’m sure you have no trouble thinking of one. Some of these problems can be decomposed further into problem statements that do uphold this assumption, however.

The Genesis of Product

In my last post, I wrote about the source of requirements. You might notice that I only listed two sources of requirements there, but there’s a third thing here that motivates my work–Goals, which are driven by Values. This is a different kind of value than the value-driven design I proposed in my last post. I hope you won’t find me naive if I say that values are generally not a motivating factor in industry. My employer defines a set of values, but they only inform how I accomplish my work. They do not dictate what I work on. Conversely, I as an individual can form a value statement around the accessibility and quality of open source software. That’s enough justification to make contributions to the Linux kernel. I’d be surprised if businesses were making decisions in the same way. That’s why I included it here.

Use Case Subtypes

I often work with a subtype of Use Cases with which any software developer ought to be familiar: User Stories. I chose to omit it from the diagram to minimize visual noise. Here, I consider a user story to be a subtype of a use case because it has more constraints than a use case often does: a user story contains the use case built into its statement, and includes the user’s motivation. The latter is usually omitted from a canonical Use Case description. The next logical step, after capturing a problem statement, may be to conceive a User Story that addresses the underlying need, especially if the solution to the problem involves a great amount of technical detail not suitable for a high-level use case, or if the motivation isn’t inherently clear from the solution.

Risk Subtypes

There are also two implicit subtypes of Risk not captured in the model.

A Program Risk is a risk that I will fail to implement my objective. For example, if I’m working with a new technology to implement a feature, there may be a chance that I misunderstand the technology and fail to implement the product in a useful manner. I may choose to mitigate this risk by doing some investigation or prototyping to burn down the risk.

I’ll refer to the other type of risk as System Risk. This is the risk that a user, while acting on my product to achieve their goal, will fail or experience harm. These are usually mitigated with tools like invariants, which enforce rules on the ways that a user can interact with a product based on its state, or with architectural features, which support a qualitative architectural characteristic, such as quality or reliability.

Both kinds of risk impede my Goals and Use Cases, which is why I didn’t draw the distinction in the model.

Conclusion

My plan is to refer back to this model from time to time, to calibrate my process for tinkering. The goal here was not to create a new process, but to model the process I’m already using, to be able to understand and reason about it more effectively in the future. I’ll provide updates in the context of my projects in future posts!

Value- and Risk-driven Design

2024-10-31T00:00:00+00:00

I’ve been favoring design methods lately that I would consider to be “Value-” and/or “Risk-” driven. “Risk-driven” design methods, as I refer to them, are well-documented. The book Just Enough Software Architecture: A Risk-driven Approach by George Fairbanks stresses a risk-driven method for software architecture. There are international standards for various industries that describe risk management processes proven to be successful in their target industry:

IEC 61508 (safety-critical industrial applications)
ISO 14971 (medical devices)
ISO 26262 (typically automotive applications)

Usually, this involves identifying hazards (situations that cause harm) and failure modes (events that can cause hazardous situations). An engineer then assigns a risk level based on the probability of the failure mode occurring and the severity of the harm. If the risk level is not acceptable, the engineer identifies mitigations that reduce the risk to an acceptable level. There are many tools to aid this process, including fault tree analysis, reliability testing and hypothesis testing.
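
To make the mechanics concrete, here is a toy risk matrix in Rust. The three-point scales, the multiplication, and the acceptance threshold are all invented for illustration. None of them come from the standards listed above, which define their own scales and acceptance criteria.

```rust
// Hypothetical three-point scales; real standards define their own.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Probability {
    Remote = 1,
    Occasional = 2,
    Frequent = 3,
}

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Severity {
    Minor = 1,
    Serious = 2,
    Catastrophic = 3,
}

#[derive(Debug, PartialEq)]
enum RiskLevel {
    Acceptable,
    NeedsMitigation,
}

// Assumed rule: score = probability x severity, and anything scoring
// 3 or more is unacceptable without a mitigation.
fn assess(probability: Probability, severity: Severity) -> RiskLevel {
    let score = (probability as u32) * (severity as u32);
    if score >= 3 {
        RiskLevel::NeedsMitigation
    } else {
        RiskLevel::Acceptable
    }
}

fn main() {
    println!("{:?}", assess(Probability::Frequent, Severity::Serious));
    println!("{:?}", assess(Probability::Remote, Severity::Minor));
}
```

A mitigation then shows up as a move to a lower probability or a lower severity, followed by a reassessment.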

Value-driven design methods are not well documented, but they are universally understood; probably, a number of things come immediately to mind. Here, the term refers to design methods that either increase the inherent “value” of a product or service, or decrease its cost. This can involve optimizing unit cost or reducing non-recurring expenses. Often, it means planning to deliver the highest value features within the stakeholder’s financial constraints.

How do we identify the highest value work?

The Source of Requirements

Let’s forget about business requirements for this conversation. I find there are two sources for the genesis of product requirements:

Suffering
Risk

In that order.

Really, use cases are the primary source of requirements. But what generates use cases? Ultimately, people make choices to lessen their suffering (or the suffering of others; I do believe in empathy). So suffering, I think, is the root cause of a use case.

We can capture and model a person’s choices through functionality scenarios, which describe the course taken by an actor with a goal as they navigate from problem to solution (Fairbanks, 2010). Generally, these don’t capture the pain point that motivated an actor, and they also don’t capture the system of interest–they simply trace an actor’s steps. Though, perhaps they should capture motivation. If they did, it may be easier for us to develop empathy for our customers and end users. Where I work, in the engineering services industry, empathy is what brings work through the door.

Functionality scenarios can be used to identify and quantify use cases and domain concepts. From there, we can begin to identify features and facets of the system of interest, and visions of the solution space may begin to dance in our heads. Use cases lend themselves to requirements, and requirements lend themselves to code. We know this process, and yet we frequently fail to apply it.

On the other hand, though, engineering generates risk. Our product may provide services that ease suffering, but it likely also introduces new ways of creating suffering. What happens if our actor uses the product wrong, or the product fails? Is the user better or worse off than they were to begin with? What about our business? If our product fails, our name may become tarnished and our employees may worry about feeding their families.

Risk analysis is the activity that removes the barriers to joy. It helps us to implement mitigations that protect our users, reduce technical risk, and finish faster with a better product. Risk mitigations reveal non-functional requirements. They also lend themselves naturally to architecture solutions. Risk mitigations tend to be intensional–as in, related to design intent, rather than solutions we can apply directly to our code and assemblies. Consider, for example, a heterogeneous redundant solution in a high-SIL application. We can read the code and see that two software units have a similar function, but the average reader might jump to the conclusion that one unit is dead code–a relic from a prototype. However, an engineer could create an architectural view that traces the redundancy directly to a risk mitigation.

In the medical and aerospace industries, outputs of risk analysis activities feed into product requirements and architecture. I consider cybersecurity-related activities to also fall into this category. Planning to apply cybersecurity process at the end of a project is planning to fail.

I imagine the forces of suffering and risk as being on either side of a see-saw. On Monday, we may discover a new use case. On Wednesday, we look at our product invariants, affordances, and our new architecture, and consider all the new ways we’ve just constructed to fail.

Designing the Design Process

How we tackle a problem is more important than the problem itself. Lately, I’ve had two questions ringing in my head:

Do I know everything I need to know to succeed? What can I do today to be more sure of my success tomorrow?

This forces me to think about technical and project management risk. But it also forces me to think about the problem statement, and the design process. What’s my customer’s greatest pain point? What’s the problem they’re trying to solve? How can I make sure I know the right answer to these questions? How will I make sure the patient is safe, even if my code fails?

This, I think, is the fundamental principle underpinning the design methods I’m referring to: not applying a rote development process, not indifferently applying a canned architecture style. Continuously evaluating the present to ensure I’m solving the right problem today.

Conclusion

I’m currently trying to apply this strategy on a program at work, where I’m serving the team as their software architect. I’m also trying to apply this to two projects at home: designing a backup system for my server, and building a financial tool.

In the past, I’ve failed at developing solutions for these because I would either fall victim to a form of analysis paralysis, and end up designing an ivory tower, or repeatedly prototype something that doesn’t solve the problem. Do I really need a tool with a hundred views that graphs data in real time? Do I really need that Redfish-enabled off-site RAID array? The answer would turn out to be no, of course. Now, I’m hoping that this fresh perspective will help me to apply just enough design to the right problem.

If you know about any books on this topic, reach out to me. This is an area where I want to read and learn more.

The Last Nine Months

2024-10-30T00:00:00+00:00

It’s been a while since I last posted. I’ve been reading. A lot. I’ve also been trying to catch up on some projects that I’ve been neglecting, and start an initiative around Rust at work. And keep up on my relationships. And practice hobbies that aren’t programming.

In the time since my last post, I’ve spun up an instance of miniflux on my server–an absolutely fantastic application, by the way. It’s a simple, self-hosted RSS aggregator. I deploy it in the same fashion as all of my other applications, including this blog. I’ve been reading the blogs of some really smart people, and I’ve learned that I shouldn’t feel like I need a lot of words to say what I want to say. The book Several Short Sentences About Writing by Verlyn Klinkenborg has also helped in that area.

So, I’m going to try to commit to shorter posts from now on. I’m writing this one while I cook dinner. Hopefully, this will help me to post more in the future.

(Nearly) Immutable Jenkins Deployments

2024-01-20T00:00:00+00:00

In Search of Immutable Deployments

This blog, and the rest of my applications, are served from a single computer running Debian in my closet. Among my deployments is an instance of Jenkins, which I use to continuously build and deploy my two static websites–this blog, and my quick-reference documentation. The setup was inspired by GitHub Pages, and I could do this same thing much more easily if I fully embraced the GitHub solutions, but that doesn’t align with my romance for self-hosting. Git is the only thing I’m too afraid to self-host for the moment (the right combination of drive failures and my life’s work is gone forever), so that significantly narrows the landscape of CI solutions that are available to me.

I originally introduced Jenkins in August of 2021 (commit), where I was running my Jenkins controller and a single agent in Podman containers. Very quickly, the setup became unwieldy–there was no way to track what had been changed, and it was difficult to remember what to do when I wanted to recreate my agent or add a job. Additionally, the limitations of the configuration began to conflict with the design goals of my other deployments:

Every application or site should be split into its own Debian package.
Installing a package should bring a site up, and removing the package should remove all site data and bring the site down gracefully.

This is difficult to do when adding a site means I have to log in to my Jenkins instance, click around on the GUI and chant some incantations in order to set up a job for a new application. Thus spawned the need for configuration as code, and the creation of an immutable deployment for Jenkins.

Immutability in the context of application deployments means that the application and all its associated data can be destroyed and programmatically recreated anew–at will. Obviously, configuration as code is a big part of immutable deployments. After that problem is solved, there is data management–minimizing the data that needs to be persisted between deployments, and maximizing the data that can be destroyed and recreated deterministically at will. Finally, there is configuration discovery–how can dependent applications configure the Jenkins instance programmatically, e.g. by adding jobs?

Jenkins is simultaneously very slow moving and very fast moving. Certain defects and critical feature requests go ignored for years, but plugins introduce breaking changes constantly. Additionally, there aren’t a lot of great resources on the management of Jenkins deployments. The world, it seems, is moving away from Jenkins, towards CI solutions that are highly integrated with source control solutions–GitLab CI, Bamboo, GitHub CI, etc. Most of this process was achieved by reading issues on GitHub and Jira created by people who had encountered the same problems as I, and crafting the ultimate solution from the breadcrumbs.

Configuration as Code

Thankfully, there is a plugin for this–and it works quite well. The Jenkins configuration as code plugin can even export the configuration of a running instance into YAML, to provide a starting point. I installed the plugin, and navigated to Manage Jenkins > Configuration as Code > View Configuration.

The exported configuration was quite long, and there were many options that I didn’t understand, but I decided that minimizing the configuration could wait until after I had a running setup.

To inject this configuration into my container, I decided to do something very similar to what I had recently implemented for my Nginx configuration–I would install the configuration file to somewhere in my root filesystem, then have a dpkg trigger that would produce a squashfs image at installation time, and finally a Quadlet volume configuration file that would mount the squashfs image into the container. This would provide the mechanism for other packages (applications) to install configuration fragments later that could be picked up by the dpkg trigger. Configuring Jenkins to use this configuration file is as easy as following the README for the plugin.

At first, the Jenkins instance was failing to read the mounted configuration file from the squashfs image. Thankfully, the container entrypoint remains running after this failure, so it’s easy to exec into the container to poke around. Obviously, it was a permissions issue with the mountpoint. Since the Jenkins instance runs as a non-root user in the container, I had to change the volume configuration to mount the volume as owned by the jenkins user:

[Volume]
# Other options...
Options=allow_other
User=1000
Group=1000

The allow_other option is surprisingly required. Without it, the image is mounted as owned by UID 1000, but non-root users cannot access the mountpoint itself. It took me a while to figure this out.

Bootstrapping Plugins

The official Jenkins instance comes with no plugins installed, so we have to create a derived container image that comes with all the plugins we need. The Jenkins configuration as code documentation points us to a page in the official Jenkins docs that describes how to do this using the Jenkins plugin CLI:

FROM jenkins/jenkins:lts-jdk17
COPY --chown=jenkins:jenkins plugins.txt /usr/share/jenkins/ref/plugins.txt
RUN jenkins-plugin-cli -f /usr/share/jenkins/ref/plugins.txt

That same page gives us a mechanism for getting the list of currently installed plugins with a cURL command:

JENKINS_HOST=username:password@myhost.com:port
curl -sSL "http://$JENKINS_HOST/pluginManager/api/xml?depth=1&xpath=/*/*/shortName|/*/*/version&wrapper=plugins" | perl -pe 's/.*?<shortName>([\w-]+).*?<version>([^<]+)()(<\/\w+>)+/\1 \2\n/g'|sed 's/ /:/'

Data management

Since we’re going for immutability, I want to persist as little as possible. In the old configuration, there was a Jenkins “config” volume that persisted everything under /var/jenkins_home. This ended up being pretty much everything–secrets, plugin binaries, and of course, the configuration itself.

The ideal scenario is that no volumes are required–the container creates all the data needed for the running instance of Jenkins, and all of that data is destroyed when the instance stops. When I tried removing this volume, however, the Jenkins agent failed to connect. After some poking around, it became clear that this is because the Jenkins JNLP secrets used to connect agents to controllers are not deterministically generated. If I were running my agents in a k8s cluster, I could configure the controller to dynamically spin up agents for job processing as necessary. However, single-node k8s is still not an option in 2024 without virtual machines, and I just don’t care to introduce a new heavy-handed virtualization mechanism on my poor Ryzen 3 CPU.

Planning for a future where k8s is an option, the current best option is to create agents through the GUI, and then persist their secrets into a volume. After some googling, I stumbled upon this GitHub issue, which recommends provisioning the contents of /var/jenkins_home/secrets before starting the container. This is a flat directory, containing only a few small files, so I’ll choose to persist this with a btrfs volume:

[Volume]
PodmanArgs=--driver=local
Type=btrfs
Options=subvol=@jenkins_controller-secrets
Device=/dev/disk/by-uuid/05599193-00bc-4a81-9550-54623b2ec8c4

This also demonstrates the new strategy I’ve taken for all of my container volumes recently, which is to persist the actual data in a btrfs subvolume, which I then create a Quadlet volume unit for. This creates a named volume that mounts the subvolume into the container at runtime, so I can safely run podman volume prune without worrying about a loss of data!

Configuration Discovery

The penultimate issue that needs solving is the discovery of configuration. How do dependent applications programmatically create jobs in the controller? We can use the Jenkins job-dsl plugin for this. The mechanism we’ll implement allows packages to install individual Groovy files to a known location, which will be picked up by our dpkg trigger and used to regenerate the configuration for the Jenkins controller. The configuration fragment we need to generate from the Groovy files will look something like this demo from the configuration-as-code project.

This requires some changes to our dpkg hook to generate a single configuration fragment file, called jobs.yaml, containing the job-dsl scripts from the individual Groovy files:

diff --git a/debian/twardyece-jenkins.postinst b/debian/twardyece-jenkins.postinst
index ce18fbb..0cb3675 100644
--- a/debian/twardyece-jenkins.postinst
+++ b/debian/twardyece-jenkins.postinst
@@ -10,8 +10,23 @@ update_config() {
     rm -rf $LOCAL_STATE_DIR/casc
     mkdir -p $LOCAL_STATE_DIR/casc
 
+    # Copy all YAML files directly to the config directory
+    cp $DATA_DIR/*.yaml $LOCAL_STATE_DIR/casc
+
+    # Emit all groovy files into a YAML fragment as job-dsl scripts
+    local IFS_SAVE=$IFS
+    IFS=$'\n'
+    printf '%s\n' "jobs:" > $LOCAL_STATE_DIR/casc/jobs.yaml
+    for f in $DATA_DIR/*.groovy; do
+           printf '  - script: >\n' >> $LOCAL_STATE_DIR/casc/jobs.yaml
+           for line in $(cat $f); do
+                   printf '      %s\n' "$line" >> $LOCAL_STATE_DIR/casc/jobs.yaml
+           done
+    done
+    IFS=$IFS_SAVE
+
     # Create a volume image from the configuration files
-    mksquashfs $DATA_DIR $LOCAL_STATE_DIR/$IMAGE -noappend
+    mksquashfs $LOCAL_STATE_DIR/casc $LOCAL_STATE_DIR/$IMAGE -noappend
 }
 
 case "$1" in

Now, we can create the job that builds this blog as its own Groovy file and install it to /usr/share/twardyece-jenkins to be automatically picked up at package installation time:

multibranchPipelineJob('Blog') {
  branchSources {
    git {
      id('blog-trunk')
      remote('https://github.com/AmateurECE/twardyece-blog.git')
      includes('trunk')
    }
  }
}
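
To make the hook's output concrete, here is a minimal standalone sketch of the same generation step, run against a throwaway directory rather than the package's real paths. The function name emit_jobs_yaml, its two arguments, and the demo paths are all hypothetical. Unlike the hook above, it reads each file with a while read loop instead of iterating over $(cat ...), a swapped-in technique that also keeps blank lines and skips the glob when no Groovy files are installed.

```shell
# Hypothetical helper: emit_jobs_yaml DATA_DIR OUT_FILE
# Wraps every *.groovy file in DATA_DIR as a job-dsl script entry.
emit_jobs_yaml() {
    data_dir=$1
    out=$2
    printf '%s\n' 'jobs:' > "$out"
    for f in "$data_dir"/*.groovy; do
        # With no matches the glob stays literal, so skip it
        [ -e "$f" ] || continue
        printf '  - script: >\n' >> "$out"
        # Indent each line so it lands inside the YAML block scalar
        while IFS= read -r line || [ -n "$line" ]; do
            printf '      %s\n' "$line" >> "$out"
        done < "$f"
    done
}

# Demo against a throwaway directory (paths are illustrative only)
demo_dir=$(mktemp -d)
printf '%s\n' "multibranchPipelineJob('Blog') {" '}' > "$demo_dir/blog.groovy"
emit_jobs_yaml "$demo_dir" "$demo_dir/jobs.yaml"
cat "$demo_dir/jobs.yaml"
```

The demo prints a jobs: key followed by a single script entry whose body is the Groovy source indented by six spaces, which is the shape the configuration-as-code fragment expects.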

Agent Dependency Management

In the old setup, all the dependencies necessary to build my Jenkins jobs were installed in the image that the agent was running. However, this creates another point of tight coupling between my jobs and my agents. At the company where I work, our agents launch Docker containers that contain our build environments when a new job is triggered. I tried several approaches to achieve something similar:

The Jenkins Kubernetes plugin can dynamically launch agents, but for reasons already stated, this wasn’t an option for me.
The docker-workflow plugin allows the Jenkinsfile to specify a container image in which the build should be run. However, this doesn’t work when using docker-in-docker with Podman.
The docker-plugin allows the Jenkins controller to spin up agents from container images dynamically as jobs are run, but it seems to struggle with docker-in-docker when the controller itself is running in a container. Additionally, it uses docker-java, which requires the Docker socket to be mounted and the controller container to be run with --privileged. Since this controller is exposed to the public internet, that’s not something I was willing to consider.

In the end, I decided that all Jenkins jobs would need to use Nix flakes to manage their dependencies. I created an agent image that had Nix installed for the jenkins user, and a wrapper that allows executing a bash script within a devShell derived from a Nix flake. The wrapper needed to include some annoying workarounds, because apparently Jenkins does not set the USER environment variable for a job (well, it sets the user environment variable, which is obviously not the same). From a Jenkinsfile, I can conveniently use this wrapper in the shebang:

pipeline {
  // ...
  stages {
    stage('Build') {
      steps {
        // ...
        sh '''#!/usr/bin/flake-run
        bundle install
        bundle exec jekyll build
        '''
      }
    }
  }
}
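
The wrapper itself isn't shown here, so purely as an illustration, a flake-run style script could be sketched as below. Everything in it is an assumption rather than the actual implementation: the function name flake_run, the fallback to id -un for USER, and the expectation that the workspace root contains a flake.nix with a default devShell and that the nix-command and flakes features are enabled for the jenkins user.

```shell
# Hypothetical sketch of a flake-run style wrapper, not the real script.
# As a shebang interpreter it receives the script path as its first argument.
flake_run() {
    script=$1
    # Jenkins leaves USER unset, so fall back to the current user name
    if [ -z "${USER:-}" ]; then
        USER=$(id -un)
    fi
    export USER
    # Run the script with bash inside the devShell of the flake in $PWD
    nix develop --command bash "$script"
}

# An installed wrapper would end with:  flake_run "$@"
```

Because the kernel passes the path of the script as the first argument to the shebang interpreter, bash ends up reading the temporary script that Jenkins writes for the sh step, and the shebang line itself is just a comment to it.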

Conclusion

That’s about as close as we can get to a fully immutable Jenkins deployment without moving to Kubernetes. Now, there are only two volumes that store actual data: one for the controller’s secrets, and one for the inbound agent’s secrets. This took me about a week of free time to accomplish, a few moments here and there. Hopefully this will be helpful for someone else who decides to traverse a similar path!
