William G. Underwood

Minimax-optimal In-context Regression with Transformers

2026-01-22T00:00:00+00:00

Excited to share a collaboration with three brilliant undergraduate/masters students, Michelle Ching, Ioana Popescu and Nico Smith!

In “Efficient and minimax-optimal in-context nonparametric regression with transformers”, we show that deep transformer networks require only order $\log n$ parameters to attain optimal rates in smooth regression problems. The paper is joint work with Michelle Ching, Ioana Popescu, Nico Smith, Tianyi Ma and Richard Samworth, and can be found at arXiv:2601.15014.

Abstract

We study in-context learning for nonparametric regression with $\alpha$-Hölder smooth regression functions, for some $\alpha>0$. We prove that, with $n$ in-context examples and $d$-dimensional regression covariates, a pretrained transformer with $\Theta(\log n)$ parameters and $\Omega\bigl(n^{2\alpha/(2\alpha+d)}\log^3 n\bigr)$ pretraining sequences can achieve the minimax-optimal rate of convergence $O\bigl(n^{-2\alpha/(2\alpha+d)}\bigr)$ in mean squared error. Our result requires substantially fewer transformer parameters and pretraining sequences than previous results in the literature. This is achieved by showing that transformers are able to approximate local polynomial estimators efficiently by implementing a kernel-weighted polynomial basis and then running gradient descent.

Early Career Researcher Grant

2025-10-06T00:00:00+00:00

Delighted to have been awarded an Early Career Researcher Grant by G-Research!

MRC Biostatistics Unit Talk

2025-10-02T00:00:00+00:00

I presented my recent work on “Upgrading survival models with CARE” at the MRC Biostatistics Unit in Cambridge.

The talk is available on YouTube.

Upgrading Survival Models with CARE

2025-07-01T00:00:00+00:00

I’m pleased to share my new preprint, “Upgrading survival models with CARE”.

It is authored with Henry Reeve, Oliver Feng, Samuel Lambert, Bhramar Mukherjee and Richard Samworth, and can be found at arXiv:2506.23870.

Abstract

Clinical risk prediction models are regularly updated as new data, often with additional covariates, become available. We propose CARE (Convex Aggregation of relative Risk Estimators) as a general approach for combining existing “external” estimators with a new data set in a time-to-event survival analysis setting. Our method initially employs the new data to fit a flexible family of reproducing kernel estimators via penalised partial likelihood maximisation. The final relative risk estimator is then constructed as a convex combination of the kernel and external estimators, with the convex combination coefficients and regularisation parameters selected using cross-validation. We establish high-probability bounds for the $L_2$-error of our proposed aggregated estimator, showing that it achieves a rate of convergence that is at least as good as both the optimal kernel estimator and the best external model. Empirical results from simulation studies align with the theoretical results, and we illustrate the improvements our methods provide for cardiovascular disease risk modelling. Our methodology is implemented in the Python package care-survival.

Sharp Anti-Concentration Inequalities for Extremum Statistics via Copulas

2025-02-13T00:00:00+00:00

My new preprint is titled “Sharp Anti-Concentration Inequalities for Extremum Statistics via Copulas”, with Matias Cattaneo and Ricardo Masini.

It can be found at arXiv:2502.07699.

Abstract

We derive sharp upper and lower bounds for the pointwise concentration function of the maximum statistic of $d$ identically distributed real-valued random variables. Our first main result places no restrictions either on the common marginal law of the samples or on the copula describing their joint distribution. We show that, in general, strictly sublinear dependence of the concentration function on the dimension $d$ is not possible. We then introduce a new class of copulas, namely those with a convex diagonal section, and demonstrate that restricting to this class yields a sharper upper bound on the concentration function. This allows us to establish several new dimension-independent and poly-logarithmic-in-$d$ anti-concentration inequalities for a variety of marginal distributions under mild dependence assumptions. Our theory improves upon the best known results in certain special cases. Applications to high-dimensional statistical inference are presented, including a specific example pertaining to Gaussian mixture approximations for factor models, for which our main results lead to superior distributional guarantees.

MondrianForests.jl

2023-11-04T00:00:00+00:00

My new Julia package for Mondrian random forest regression is available on GitHub.

This package provides:

Fitting (debiased) Mondrian random forests
Parameter selection with polynomial estimation or generalized cross-validation

Inference with Mondrian random forests

2023-10-17T00:00:00+00:00

I’m excited to share my new preprint, titled “Inference with Mondrian random forests” and coauthored with Matias Cattaneo and Jason Klusowski.

It can be found at arXiv:2310.09702.

Abstract

Random forests are popular methods for classification and regression, and many different variants have been proposed in recent years. One interesting example is the Mondrian random forest, in which the underlying trees are constructed according to a Mondrian process. In this paper we give a central limit theorem for the estimates made by a Mondrian random forest in the regression setting. When combined with a bias characterization and a consistent variance estimator, this allows one to perform asymptotically valid statistical inference, such as constructing confidence intervals, on the unknown regression function. We also provide a debiasing procedure for Mondrian random forests which allows them to achieve minimax-optimal estimation rates with $\beta$-Hölder regression functions, for all $\beta$ and in arbitrary dimension, assuming appropriate parameter tuning.

Bernstein’s Inequality

2023-02-22T00:00:00+00:00

Bernstein’s inequality is an important concentration inequality. In this post we motivate, state and prove a “maximal inequality” version which I think is clearer than the usual formulation.

Introduction

Concentration inequalities are central to probability theory, mathematical statistics and theoretical machine learning, providing a mathematical framework to the notion that “with enough samples you eventually get the right answer.” More precisely, they provide bounds on the typical deviations that a random variable makes from its expected value. Bernstein’s inequality allows us to control the size of a sum of independent zero-mean random variables where variance and almost sure bounds on the summands are available.

In this post we state and prove Bernstein’s inequality, and also demonstrate its approximate optimality by establishing two different lower bounds. The Julia code for the simulations is available on GitHub.

Motivation

In many applications we need to control the maximum of a collection of random variables. For example, we might be proving uniform convergence of a statistical estimator, establishing consistency for a binary classifier under empirical risk minimization, or controlling the regret of a reinforcement learning algorithm. Understanding how this maximum behaves as a function of the number of variables is essential (Figure 1). In this post we focus on maximal inequalities for sums of independent random variables.

Fig. 1: With $X_j \sim \cN(0,1)$ independent, $\max_{1 \leq j \leq d} |X_j|$ grows with $d$.

Setup

We propose the following setup which is widely applicable in practice.

$X_{i j}$ are real-valued random variables for $1 \leq i \leq n$ and $1 \leq j \leq d$.
$X_{1j}, \ldots, X_{n j}$ are independent and identically distributed (i.i.d.) for each $j$.
$\E[X_{1j}] = 0$ for each $j$.
$\max_{1 \leq j \leq d} \V[X_{1j}] \leq \sigma^2$.
$\max_{1 \leq j \leq d} |X_{1j}| \leq M$ almost surely (a.s.).
We will provide expectation bounds for the variable $\max_{1 \leq j \leq d} \left| \sum_{i=1}^n X_{i j} \right|$.

A brief discussion is in order. Firstly, the mean-zero property and the variance bound tell us that $\max_{1 \leq j \leq d} \E\big[\left| \sum_{i=1}^n X_{i j} \right|\big] \leq \sqrt{n\sigma^2}$. However in order to put the maximum inside the expectation, we need finer control on the tails of the summands, attained by imposing the almost sure bound. Note that we do not make any assumptions regarding the dependencies between different values of $j$.

Bernstein’s inequality

Now let’s state the main result, a maximal version of Bernstein’s inequality. The proof is provided later in the post, to avoid distracting from the discussion.

Theorem (Bernstein's maximal inequality)

For each $1 \leq j \leq d$, let $X_{1j}, \ldots, X_{n j}$ be i.i.d. real-valued random variables. Suppose $\E[X_{1j}] = 0$ for each $j$, $\max_{1 \leq j \leq d} \V[X_{1j}] \leq \sigma^2$ and $\max_{1 \leq j \leq d} |X_{1j}| \leq M$ a.s. Then $$ \E\left[ \max_{1 \leq j \leq d} \left| \sum_{i=1}^n X_{i j} \right| \right] \leq \sqrt{2 n \sigma^2 \log 2d} + \frac{M}{3} \log 2d. $$

Why this formulation?

If you have seen Bernstein’s inequality before, you may notice a few differences between this version and the usual formulation, including:

This result is stated for the maximum of $d$ random variables. This is to highlight the dependence of the resulting bound on $d$.
This version is stated as an expectation rather than a tail probability to avoid notational complexity, but can be easily strengthened to include the tail bound.
The terms on the right hand side are rather complicated and perhaps unfamiliar, but allow us to more directly parse the bound on the maximum. The more standard version of Bernstein’s inequality makes it difficult to read off this value.

Interpreting the bound

The resulting bound of $\sqrt{2 n \sigma^2 \log 2d} + \frac{M}{3} \log 2d$ consists of two terms which are worth discussing separately.

The first term is $\sqrt{2 n \sigma^2 \log 2d}$, which depends on $n$ and $\sigma^2$ but not $M$, and has a sub-Gaussian-type dependence on the dimension. This is the bound obtained if we assume that each $X_{1j}$ is $\sigma^2$-sub-Gaussian, and this term corresponds to the central limit theorem for $\frac{1}{\sqrt{n \sigma^2}}\sum_{i=1}^n X_{i j}$.
The second term is $\frac{M}{3} \log 2d$ and depends on $M$ but not $n$ or $\sigma^2$. This is a sub-exponential-type tail which captures rare event phenomena associated with bounded random variables.

It is worth remarking at this point that Bennett’s inequality provides a further refinement of Bernstein’s inequality, but the difference is minor in many applications.

Approximate optimality

In this section we provide two explicit examples which show why each of the two terms discussed above are necessary. It is somewhat remarkable that these examples are so easy to find, providing a straightforward demonstration of the near-optimality of Bernstein’s maximal inequality.

Example 1: central limit theorem

Let $X_{i j} = \pm \sigma$ with equal probability be i.i.d. for $1 \leq i \leq n$ and $1 \leq j \leq d$. Note $\E[X_{i j}] = 0$, $\V[X_{i j}] = \sigma^2$ and $|X_{i j}| = \sigma$ a.s., so Bernstein’s inequality gives

\[\begin{align*} \E\left[ \max_{1 \leq j \leq d} \left| \sum_{i=1}^n X_{i j} \right| \right] &\leq \sqrt{2 n \sigma^2 \log 2d} + \frac{\sigma}{3} \log 2d, \end{align*}\]

and hence for fixed $d \geq 1$

\[\begin{align*} \limsup_{n \to \infty} \E\left[ \max_{1 \leq j \leq d} \left| \frac{1}{\sqrt{n \sigma^2}} \sum_{i=1}^n X_{i j} \right| \right] &\leq \sqrt{2 \log 2d}. \end{align*}\]

However we also have by the central limit theorem that

\[\begin{align*} \frac{1}{\sqrt{n \sigma^2}} \sum_{i=1}^n (X_{i1}, \ldots, X_{id}) \rightsquigarrow (Z_1, \ldots, Z_d) \end{align*}\]

as $n \to \infty$, where $Z_j \sim \cN(0,1)$ are i.i.d. So by a Gaussian lower bound in the appendix,

\[\begin{align*} \lim_{n \to \infty} \, \E\left[ \max_{1 \leq j \leq d} \left| \frac{1}{\sqrt{n \sigma^2}} \sum_{i=1}^n X_{i j} \right| \right] = \E\left[ \max_{1 \leq j \leq d} |Z_j| \right] \geq \frac{1}{2} \sqrt{\log d} \end{align*}.\]

Thus the first term in Bernstein’s maximal inequality is unimprovable up to constants.

Fig. 2: Bernstein's upper bound of $\sqrt{2 n \sigma^2 \log 2d} + \frac{\sigma}{3} \log 2d$ and simulated
values of $\E\left[\max_{1 \leq j \leq d} \left| \sum_{i=1}^n X_{i j} \right| \right]$ for Example 1 with $n = 50$ and $\sigma^2 = 1$.

Example 2: Poisson weak convergence

Now let $X_{i j} = M\left(1 - \frac{1}{n}\right)$ with probability $1/n$ and $-\frac{M}{n}$ with probability $1 - 1/n$ be i.i.d. for $1 \leq i \leq n$ and $1 \leq j \leq d$. Note that $\E[X_{i j}] = 0$, $\V[X_{i j}] = \frac{n-1}{n^2} M^2$ and $|X_{i j}| \leq \frac{n-1}{n} M$ a.s., so Bernstein’s inequality gives

\[\E\left[ \max_{1 \leq j \leq d} \left| \sum_{i=1}^n X_{i j} \right| \right] \leq \sqrt{2 M^2 \log 2d} + \frac{M}{3} \log 2d,\]

and hence

\[\limsup_{d \to \infty} \limsup_{n \to \infty} \E\left[ \max_{1 \leq j \leq d} \left| \frac{1}{M \log d} \sum_{i=1}^n X_{i j} \right| \right] \leq \frac{1}{3}.\]

However also note the binomial distribution limit

\[\begin{align*} \P\left(\sum_{i=1}^n \left(\frac{X_{i j}}{M} + \frac{1}{n}\right) = k\right) &= \frac{n!}{k!(n-k)!} \left(\frac{1}{n}\right)^k \left(1 - \frac{1}{n}\right)^{n-k} \to \frac{1}{e k!} \end{align*}\]

as $n \to \infty$. Thus we have the Poisson weak convergence

\[\begin{align*} \frac{1}{M} \sum_{i=1}^n (X_{i1}, \ldots, X_{id}) + (1, \ldots, 1) \rightsquigarrow (Z_1, \ldots, Z_d) \end{align*}\]

as $n \to \infty$ where $Z_j \sim \Pois(1)$ are i.i.d. So by a Poisson lower bound in the appendix,

\[\begin{align*} \liminf_{d \to \infty} \lim_{n \to \infty} \E\left[ \max_{1 \leq j \leq d} \left| \frac{\log \log d}{M \log d} \sum_{i=1}^n X_{i j} \right| \right] = \liminf_{d \to \infty} \frac{\log \log d}{\log d} \left( \E\left[\max_{1 \leq j \leq d} Z_j \right] - 1 \right) \geq \frac{1}{6}. \end{align*}\]

Hence the second term in Bernstein’s inequality is tight up to a factor of $\log \log d$. This factor diverges so slowly that Bernstein’s inequality is practically optimal in many applications. For example, $\log \log d \geq 6$ already requires ${d > 10^{175}}$, far more than the number of particles in the universe!

Fig. 3: Bernstein's upper bound of $\sqrt{2 M^2 \log 2d} + \frac{M}{3} \log 2d$ and simulated
values of $\E\left[\max_{1 \leq j \leq d} \left| \sum_{i=1}^n X_{i j} \right| \right]$ for Example 2 with $n = 50$ and $M = 1$.

References

Four lectures on probabilistic methods for data science by Roman Vershynin
The University of Oxford’s course on Algorithmic Foundations of Learning, taught by Patrick Rebeschini in 2018
Princeton University’s course on Probability in High Dimension, taught by Ramon van Handel in 2021
A note on the distribution of the maximum of a set of Poisson random variables by K. M. Briggs, L. Song and T. Prellberg, 2009
A note on Poisson maxima by A.C. Kimber, 1983

Appendix: proofs

We begin by proving the main result of this post.

Proof of Bernstein's maximal inequality

We first bound the moment generating function of $X_{i j}$. Let $t > 0$ and note that by the mean-zero property and the variance and almost sure bounds, $$ \begin{align*} \E\left[ e^{t X_{i j}} \right] &= 1 + \sum_{k=2}^\infty \frac{t^k \E[X_{i j}^k]}{k!} \leq 1 + t^2 \sigma^2 \sum_{k=2}^\infty \frac{t^{k-2} M^{k-2}}{k!}. \end{align*} $$ Now since $k! \geq 2 \cdot 3^{k-2}$ for all $k \geq 2$ and $1 + x \leq e^x$ for all $x$, we have for $t < 3/M$ $$ \begin{align*} \E\left[ e^{t X_{i j}} \right] &\leq 1 + \frac{t^2 \sigma^2}{2} \sum_{k=2}^\infty \left( \frac{t M}{3} \right)^{k-2} = 1 + \frac{t^2 \sigma^2/2}{1 - t M/3} \leq \exp\left( \frac{t^2 \sigma^2/2}{1 - t M/3} \right). \end{align*} $$ Now we bound the expected maximum, using Jensen's inequality on the convex logarithm function, to see $$ \begin{align*} \E\left[ \max_{1 \leq j \leq d} \sum_{i=1}^n X_{i j} \right] &\leq \frac{1}{t} \log \E\left[ \exp \max_{1 \leq j \leq d} \sum_{i=1}^n t X_{i j} \right] \\ &\leq \frac{1}{t} \log \E\left[ \sum_{j=1}^d \exp \sum_{i=1}^n t X_{i j} \right] \\ &\leq \frac{1}{t} \log \big( d \, \E\left[ \exp t X_{i j} \right]^n \big) \\ &\leq \frac{1}{t} \log d + \frac{n \sigma^2 t / 2}{1 - Mt/3}. \end{align*} $$ Minimizing the bound using calculus by selecting $t = \frac{2}{2M/3 + \sqrt{2n \sigma^2 / \log d}}$ gives $$ \begin{align*} \E\left[ \max_{1 \leq j \leq d} \sum_{i=1}^n X_{i j} \right] &\leq \sqrt{2 n \sigma^2 \log d} + \frac{M}{3} \log d. \end{align*} $$ Finally we set $X_{i (d+j)} = -X_{i j}$ for $1 \leq j \leq d$ to see that $$ \begin{align*} \E\left[ \max_{1 \leq j \leq d} \left| \sum_{i=1}^n X_{i j} \right| \right] &\leq \sqrt{2 n \sigma^2 \log 2d} + \frac{M}{3} \log 2d. \end{align*} $$

Next we prove the Gaussian lower bound used in Example 1.

Lemma (Gaussian lower bound)

Let $Z_1, \ldots, Z_d$ be i.i.d. $\cN(0,1)$ random variables. Then $$ \E\left[ \max_{1 \leq j \leq d} |Z_j| \right] \geq \frac{1}{2} \sqrt{\log d}. $$

Proof

For any $t > 0$, we have by Markov's inequality $$ \E\left[ \max_{1 \leq j \leq d} |Z_j| \right] \geq t \, \P\left( \max_{1 \leq j \leq d} |Z_j| \geq t \right) = t \left(1 - \left(1 - \P\left(|Z_j| \geq t \right) \right)^d \right). $$ Now note that by the Gaussian density function and since $s^2 \leq 2(s-t)^2 + 2t^2$, $$ \begin{align*} \P\left(|Z_j| \geq t \right) &= \sqrt\frac{2}{\pi} \int_t^\infty e^{-s^2/2} \diff{s} \geq \sqrt\frac{2}{\pi} e^{-t^2} \int_t^\infty e^{-(s-t)^2} \diff{s} \geq e^{-t^2}. \end{align*} $$ Hence because $1-x \leq e^{-x}$, $$ \E\left[ \max_{1 \leq j \leq d} |Z_j| \right] \geq t \left(1 - \left(1 - e^{-t^2} \right)^d \right) \geq t \left(1 - \exp\left(-d e^{-t^2} \right) \right). $$ Finally set $t = \sqrt{\log d}$ to see $$ \E\left[ \max_{1 \leq j \leq d} |Z_j| \right] \geq \sqrt{\log d} \left(1 - 1/e \right) \geq \frac{1}{2}\sqrt{\log d}. $$

Finally we establish the Poisson lower bound used in Example 2.

Lemma (Poisson lower bound)

Let $Z_1, \ldots, Z_d$ be i.i.d. $\Pois(1)$ random variables. Then for $d \geq 16$, $$ \E\left[ \max_{1 \leq j \leq d} Z_j \right] \geq \frac{\log d}{6 \log \log d}. $$

Proof

As for the Gaussian lower bound, we have for any integer $t \geq 2$ $$ \E\left[ \max_{1 \leq j \leq d} Z_j \right] \geq t \left(1 - \left(1 - \P\left(Z_j \geq t \right) \right)^d \right). $$ Now note that $$ \begin{align*} \P\left(Z_j \geq t \right) &= \frac{1}{e} \sum_{k=t}^\infty \frac{1}{k!} \geq \frac{1}{e t!} \geq \frac{1}{e t^t}. \end{align*} $$ Hence noting that $\log \log \log d \geq 0$ and setting $\frac{e-1}{e}\frac{\log d}{\log \log d} \leq t \leq \frac{\log d}{\log \log d}$ gives $$ \begin{align*} \E\left[ \max_{1 \leq j \leq d} |Z_j| \right] &\geq t \left(1 - \left(1 - \frac{1}{e t^t} \right)^d \right) \\ &\geq t \left(1 - \left(1 - e^{-1} \exp(-t \log t) \right)^d \right) \\ &\geq \frac{e-1}{e} \frac{\log d}{\log \log d} \left(1 - \left(1 - e^{-1} \exp\left( -\frac{\log d}{\log \log d} \log \frac{\log d}{\log \log d} \right) \right)^d \right) \\ &\geq \frac{e-1}{e} \frac{\log d}{\log \log d} \left(1 - \left(1 - \frac{1}{e d} \right)^d \right) \\ &\geq \frac{e-1}{e} \frac{\log d}{\log \log d} \left(1 - e^{-1/e} \right) \\ &\geq \frac{\log d}{6 \log \log d}. \end{align*} $$

Advent of Code 2022

2023-01-11T00:00:00+00:00

In 2022 I tackled Advent of Code for the first time, using the Julia language. Here are some of my thoughts; my code is on GitHub.

First impressions

Advent of Code (AoC) consists of twenty-five problems increasing in difficulty from the 1st to the 25th of December. I enjoyed solving all the problems, which were varied enough to avoid repetition, but which also fell into a few distinct categories giving some familiarity. The first few problems were very easy, typically requiring just a few lines of code and no particular insights – the obvious solution “just worked.” The later problems presented various challenges, including knowledge of network search algorithms, computational complexity and memory considerations, and plenty of general-purpose problem solving.

The Julia language

Since this was my first attempt at AoC, I decided to use a language which I am already comfortable with but wanted to learn more about. Julia is a high-level yet high-performance language which is as easy to write as Python and almost as fast as C. Some of my favourite features which I tried to understand more fully during AoC include the following.

First-class arrays

Arrays in Julia are very easy to use, working seamlessly with the rest of its type system and requiring no additional packages (I’m looking at you, numpy). For example, dimension is well-defined and included in the type declaration:

Array{Int, 1} or Vector{Int} is a one-dimensional array of integers.
Array{Int, 2} or Matrix{Int} is a two-dimensional array of integers.
Array{Array{Int, 1}, 1} or Vector{Vector{Int}} is a one-dimensional array of one-dimensional arrays of integers. It is not equivalent to Matrix{Int}.

This is useful in many circumstances – for a rectangular array, use Matrix{Int}. For a ragged array, use Vector{Vector{Int}}. Julia has no confusion between two-dimensional arrays with only one column and one-dimensional arrays: the former has size (n, 1) while the latter has size (n). Unlike dimension, the size of an array is not included in the type, allowing pushing and popping.

Just-in-time compilation

Julia is compiled just-in-time (JIT), an approach offering a compromise between ahead-of-time compilation (where the program is first compiled to machine code and then executed, like C) and interpretation (where the program is run line-by-line, like Python). This allows the program to be recompiled while it is running, so that the most performance-critical parts are made as fast as possible.

Fast loops

A great example of the benefit of JIT compilation is Julia’s fast loop implementation, which requires no explicit vectorisation (unlike Python, where loops are painfully slow unless vectorised with list comprehensions or numpy). Simply write out the loops you need and the JIT compiler will handle the execution. That said, Julia does have vectorisation macros which can allow for neater code. For example, to add one to the array A::Vector{Int}, we write A .+ 1, using the dot operator to broadcast over the elements of A.

Type annotation

Julia’s type system is another place where it combines the best of low-level compiled languages and high-level interpreted languages. Type annotations are optional in Julia. If you don’t want to think about types, you can opt to leave variables and function arguments untyped, as in Python or R. However adding type annotations such as x::Float64 or add_one(x::Float64) = x + 1 can help the compiler optimise your code better and prevent you from making mistakes such as passing the wrong object into a function.

Multiple dispatch

Julia allows the creation of composite types (also known as structs or objects).

struct Elf
    height::Int
    weight::Int
end

However, unlike objected-oriented languages such as Python, Julia does not define methods for its objects. Instead, we write a regular function and annotate its arguments to be of certain types. For example, bmi(elf::Elf) = elf.weight / elf.height^2 rather than elf.bmi() = elf.weight / elf.height^2. Multiple dispatch means that the type of every argument must agree in order for the function to be called.

There is plenty more to like about Julia, including its parametric type system, its great standard libraries for multi-threading, linear algebra and statistics, its package manager and its testing functionality, but I won’t go into any more detail here.

General comments

I learned more than I expected to while completing Advent of Code. While the solutions are perhaps not very surprising, the actual implementation details are worth thinking about more carefully. In particular, I definitely improved my understanding of the following.

Data structures

Since I have never studied computer science formally, I would often be confused by the obsession with “data structures” – what’s wrong with arrays? AoC definitely helped me appreciate this more – in fact I think the hardest part of most of the problems was finding the right data structures to use. I became more familiar with dictionaries (hash maps) and tuples, and their respective strengths and weaknesses. Type unions were also useful, particularly for handling missing or non-existent values through the Nothing and Union{T, Nothing} types.

Network algorithms

While I had seen dynamic programming methods such as Dijkstra’s algorithm for shortest paths before, I had no experience in implementing them. This turned out to be relatively straightforward, keeping track of temporary variables in a new array.

More challenging for me were the network search algorithms, including breadth-first search (BFS) and depth-first search (DFS). I didn’t really understand which one to use in a given scenario and ended up doing quite a bit of trial and error with various heuristics to keep the run time reasonable. Julia’s push! and pushfirst! functions allow easy switching between stack and queue behaviour, along with the analogous functions pop! and popfirst!.

Avoiding unnecessary allocation and copying

Many problems in AoC (particularly part 2 of the problems) contain inputs so large that the obvious solutions are impossible due to memory allocation. One common fix is to represent the data without allocating much memory, for example by representing ranges of numbers by their start and end points, either manually or by using an AbstractRange object, rather than collecting all the elements into an array. Another common solution was to use various divisibility tricks, storing only the remainders of numbers modulo some divisor, rather than keeping the whole number. This helped avoid storing huge integers.

A similar challenge was to avoid copying large arrays. Julia uses exclamation marks to distinguish “in-place” functions such as sort!, which modify their arguments, from “returning” functions such as sort which simply return a new value, in this case the sorted array. This often allows one to reuse the same memory, modifying the entries of an array without copying the whole object.

Day-by-day

Some brief comments on each day’s problems are given below, along with the approximate execution time after precompilation. My solutions are available on GitHub.

Day 1: Calorie Counting 0.034 s

Iterating through the input lines allowed us to calculate the total calories for each elf, storing the totals in a Vector{Int}. Finding the most calorific elf (part 1) and the most calorific three elves (part 2) was easy with the maximum and sort functions.

Day 2: Rock Paper Scissors 0.080 s

For each part I kept a lookup table of type Dict{String, Int} for the score based on the input. For example for part 1 we have "A X" => 1 + 3 while for part 2 we have "A X" => 3 + 0. Summing the scores over an iterator of the input file gave the answer without allocating much memory.

Day 3: Rucksack Reorganization 0.434 s

This problem was about finding common characters in several strings. While I could have just done this for two or three strings as required in the question, I wrote a recursive function to handle an arbitrary length Vector{String} based on intersecting the first pair repeatedly until only one string remains.

Day 4: Camp Cleanup 0.173 s

For this question we were given a list of pairs of integer ranges and asked for how many pairs do the two ranges contain each other (part 1) or overlap (part 2). Despite the numbers in this question being quite small, it still seemed sensible to represent the ranges by their endpoints using a Tuple{Int, Int} rather than collecting all the values in between into a Vector{Int}. Given the endpoints it was easy to write functions to check if one contains the other or if they overlap.

Day 5: Supply Stacks 0.179 s

Did someone say stacks? Moving one crate at a time (part 1) was easily achieved by using push! and pop! multiple times on the relevant Vector{Char} and this extended to moving multiple crates (part 2) by pushing them first to a temporary “holding array” and then to their final destinations. The hardest part was reading the oddly-formatted input.

Day 6: Tuning Trouble 0.020 s

This was the first puzzle where I predicted what the second part might be and planned accordingly. Both parts involved finding the first position in a string where the preceding $n$ characters were all distinct, with $n=4$ in the first part and $n=14$ in the second. As such I made sure the complexity of my solution did not depend much on the size of $n$. Using a simple loop through the string checking at each point if the preceding $n$ characters are distinct was enough to solve this.

Day 7: No Space Left On Device 0.548 s

This was the first day I found difficult, and also the first for which I used custom composite types. The goal was to imitate a simple file system which could run cd and ls. After some unsuccessful attempts at recursive types (directories in directories etc.), I settled on the following simple data structure, with structs File and Directory each containing name, size and parent fields:

mutable struct Filesystem
    structure::Dict{String, Union{File, Directory}}
    cwd::String
end

Firstly I made a pass through the problem input to get the structure of the file system and the sizes of the files, setting each directory size to NaN, Julia’s “not a number” value. The main challenge was to then recursively propagate the file sizes up to the directories which contain them. I did this by looping over every directory in the file system and checking that

Its current size was NaN
All of its children had a size which was not NaN

If so then the size of this directory was the sum of the sizes of its children. I repeated this process $d$ times where $d$ is the depth of the file system to ensure propagation had terminated. I’m sure there are better ways to do this, but it worked for me.

Day 8: Treetop Tree House 0.266 s

For this problem I wrote a running maximum function to keep track of the largest tree in a particular direction, and then used some simple logic to establish whether there was a line of sight from a tree to the edge of the grid. I handled the four different directions by permuting the input first using Julia’s reverse and permutedims functions.

Getting the scenic score for each tree was easily implemented by moving in each direction until finding a higher tree.

Day 9: Rope Bridge 0.186 s

For this question I needed only two functions, one to move the head of the rope according to the instructions and another to adjust the tail to keep up with the head. To do this one needed only to check that the tail segment is two squares away (in $L^\infty$ norm) from the head and then adjust it with tail += sign(head - tail).

The extension to longer ropes was easy, making sure to adjust the rope from head to tail in order to propagate the motion correctly. Finally the unique function made counting the visited squares trivial.

Day 10: Cathode-Ray Tube 0.157 s

The following two simple functions carried out the bulk of the work for this question, with part 2 requiring some modular arithmetic to get the location of the cursor on the CRT.

function noop!(register::Vector{Int})
    push!(register, register[end])
    return nothing
end

function addx!(v::Int, register::Vector{Int})
    noop!(register)
    push!(register, register[end] + v)
    return nothing
end

Day 11: Monkey in the Middle 0.923 s

I found this problem quite difficult, and eventually settled on the following composite type for each monkey. In light of part 2, which required modular arithmetic to prevent numbers growing too large, each item was stored as a dictionary of divisor -> remainder pairs. The operation field contained each monkey’s operation as an anonymous function while test stored the divisor used for its divisibility test. The potential destination monkeys for the items were given in dest, and I kept track of the number of times the monkey inspected an item in inspections. Once the input had been parsed, performing the rounds of throwing was quite straightforward.

mutable struct Monkey
    id::Int
    items::Vector{Dict{Int, Int}}
    operation::Function
    test::Int
    dest::Dict{Bool, Int}
    inspections::Int
end

Day 12: Hill Climbing Algorithm 1.151 s

This was a classic shortest path problem, with the network edges determined by the heights of neighbouring squares. For reading the data, Julia’s 'a':'z' syntax was a neat way to get the alphabet. I solved the main problem using Dijkstra’s algorithm, keeping track of visited squares and their distances in auxiliary arrays. By running the algorithm to completion, not stopping after reaching the target square but rather finding the shortest path to all reachable squares, part 2 followed immediately without having to run the algorithm again.

struct Hill
    heights::Matrix{Int}
    distances::Matrix{Union{Int, Float64}}
    visited::Matrix{Bool}
    S::Tuple{Int, Int}
    E::Tuple{Int, Int}
end

Note that each distance was of type Union{Int, Float64} to allow for the value Inf::Float64.

Day 13: Distress Signal 0.271 s

This problem involved parsing some possibly deeply-nested vectors of vectors of… of integers and comparing them to each other with some specific rules. My data structure probably wasn’t the best for this day: I don’t think Julia supports arbitrarily nested arrays as types, so I represented them with a Vector{String}, that is, I only treated the outer-most array as a vector and just kept its elements as strings such as "[[2],5]". Comparing using the rules was then fairly straightforward, recursing and unwrapping the strings into arrays until the comparison was determined. The unwrapping was done by counting opening and closing brackets to understand the depth of each element.

Part 2 was easy once I could compare the objects, as I could pass the entire list into Julia’s flexible sort function with a custom comparison function, using quicksort without having to implement it myself.

Day 14: Regolith Reservoir 0.694 s

This problem was fun to solve as it was easy to visualise and check everything was working properly. The first challenge was to parse the layout of the cave from the input, which amounted to modifying entries in a matrix along a line given the start and end points of that line. I then represented the cave using

mutable struct Cave
    layout::Matrix{Char}
    current::Tuple{Int, Int}
    start::Tuple{Int, Int}
    terminated::Bool
end

where layout showed the cave structure as in the problem statement, current gave the current location of the falling unit of sand, start was the starting point of the sand and terminated determined whether to stop the simulation. Implementing the falling sand logic was not too hard, and for part 2 I simply added an extra path of solid rock to represent the floor, making it at least as wide as twice the height of the starting point.

Day 15: Beacon Exclusion Zone 0.264 s

This day was quite tricky. The obvious thing to do was to represent the positions of everything in a Matrix{Char}, as suggested by the example maps given in the problem. However when looking at the problem input it quickly became apparent that this wouldn’t work as the numbers involved (and hence the size of the matrix needed) were very large. Hence I looked for a solution which avoided allocating large amounts of memory, storing only the key properties of each sensor and using Julia’s inbuilt UnitRange{Int} type to represent intervals of numbers.

For part 1, I calculated the interval of points precluded by a given sensor at a given $y$ value, then wrote a function to simplify a collection of such intervals into disjoint intervals. Counting the number of points in disjoint intervals was then easy.

It took me a long time to figure out Part 2. My final solution relied on the logic that if there is only one point which is not in any of the sensed regions, then it must be near a “corner” of two sensed regions. Thus I first calculated the intersection points of the boundaries of the sensed regions for each pair of sensors. I then searched the neighbours of these points to find a point which was in none of the sensed regions.

Day 16: Proboscidea Volcanium 0.923 s

This was one of the hardest problems, involving a network search which was large enough to require some thought to get it to finish quickly. Firstly I did some pre-processing of the input, using Dijkstra’s algorithm to get the shortest path length between any two valves, thus yielding a complete network. I then dropped all the valves (nodes) with a zero flow rate as there was no point going to them except en route to another valve.

For part 1, I used a depth-first search (DFS), storing the state of each valve (open or closed), the amount of time used up, the total pressure relieved and the current position to describe each state. I found that using UInt8 and UInt16 where possible gave a noticeable speed-up over the more typical Int64.

struct State
    opens::Vector{Bool}
    time::UInt16
    pressure::UInt16
    position::UInt8
end

For part 2, I first reasoned that there was no need for both you and the elephant to ever visit the same valve (in the processed complete network). Therefore I used a heuristic to find all of the “good” paths within the time limit, meaning those with a sufficiently high total pressure release, again using DFS. I then searched among these paths to find the best pair which visit disjoint sets of valves. The cutoff value for the good paths was chosen by trial and error, but could be found using a binary search.

Day 17: Pyroclastic Flow 0.645 s

This problem was fun to solve, especially the insights required for part 2. Part 1 involved implementing Tetris but without rotating the blocks.

I represented the tower of blocks using a Vector{Vector{Char}} rather than a Matrix{Char} so I could easily push new rows onto the top to make the whole assembly taller. Replicating the behaviour was straightforward, checking which action needed to be applied at each time step.

Part 2 was initially very challenging, asking to predict the height of the tower after 1000000000000 (yes, a trillion) blocks had fallen. Obviously it was not possible to simulate this fully, but I realised that the pattern of jets and the block shapes were periodic. After checking that the top part of the tower of blocks was also periodic, the problem became much easier. The answer could be calculated by running the simulation for a few thousand iterations to observe the repeating section and then counting the number of repeats needed to reach 1000000000000 blocks.

Day 18: Boiling Boulders 1.176 s

This was also a fun problem. Part 1 was easy, calculating the total area of all the cubes and then subtracting non-exposed faces which made contact with another cube.

For part 2 I wrote a function to “complete” a set of cubes by filling in any internal cavities. This let me reuse my surface area function from part 1 to get the exterior surface area. The completion function worked by drawing a box around the lava droplet and marking cubes on the boundary of this box as “outside” the droplet. I then marked any non-droplet neighbours of an “outside” cube as also being “outside”, iterating with a DFS until termination. The completed droplet was then simply the union of the original droplet and the cubes which were not outside.

Day 19: Not Enough Minerals 1.062 s

I thought this was the hardest problem on the calendar, and definitely took me the longest to solve. This was another network search-type problem, so I again went for a DFS approach. However this instance was significantly larger than that from Day 16, and I had to use the following pruning strategies to avoid searching too large a space.

If you can buy a geode bot, you must do so.
If in the previous state you could have bought bot X but instead did nothing, then you should not buy bot X now.
Suppose in the current state you could spend the rest of the time only building geode bots. If this doesn’t get more geodes than the current best strategy, then drop the current state.
If no object requires more than X of resource Y, then don’t buy more than X of bot Y.

Thankfully this approach was good enough to solve both parts 1 and 2, but it took a lot of trial and error to work out which optimisations were worth using.

Day 20: Grove Positioning System 0.274 s

This problem was a welcome break, and was quite easy. The main challenge here was using the right indices and applying the correct modulo functions. I kept track of both the mixed list and the original index of each element in that list, making it easy to work out which element to move next. I was worried that part 2 might ask me to mix the list a huge number of times, but thankfully it didn’t and I was able to use the same code for both parts.

Day 21: Monkey Math 0.366 s

Part 1 of this problem was easy but after seeing part 2 I rewrote all my code, keeping track of expressions for each monkey rather than values. I then wrote a function to check for a given monkey whether both of its “child” monkeys (the monkey that will provide its inputs) had had their expressions parsed. If so, I could parse the current monkey by substituting the children’s expressions into its own expression. Repeating this for all the monkeys until termination propagated the expressions right up to the root monkey. For part 1, this could then be directly evaluated using Julia’s metaprogramming facilities: Meta.parse converts a String to an Expr, and this is evaluated to a number with eval.

For part 2 I retrieved the expressions of the two children of root. One of these evaluated to a number directly, and the other retained the variable humn. To find the value of humn which solves these equal to each other, I implemented a binary search. This was probably not guaranteed to work in all situations, but was fine for my instance.

Day 22: Monkey Map 0.863 s

This was a challenging but fun problem, requiring the most lines of code out of all of the problems. Part 1 was quite easy, especially since we had already solved a few of these “follow the rules on a grid”-type problems (day 9, day 14, day 17), with the only tricky task being to write down the rules for wrapping around to the other side of the map.

Part 2 was much harder but also quite interesting. The central data structure I used was to represent each face of the cube by the following.

mutable struct Face
    id::Int
    board::Matrix{Char}
    face_coords::Matrix{Int}
    corner_loc::Tuple{Int, Int}
end

Here, id identified each face uniquely, board provided the layout of open tiles and walls on the face, face_coords gave the coordinates in 3D space of the top-left, top-right and bottom-left corners of the face, and corner_loc recorded where the face was located on the original input net. To parse the faces, I set the first face to have coordinates [-1 1 -1 1; 1 1 -1 -1; 1 1 1 1] and then traced over the net, applying the appropriate matrix operation every time I went over an edge to keep track of the coordinates of each face. The main part of the problem was then to handle the logic for going over an edge on the resulting cube. I did this in stages, first identifying the coordinates of the edge to cross, then matching this to determine which face we ended up on. Again matching coordinates identified which side of this face we emerged on, and I then had to work out where on this edge we would appear, and which way we would then be facing. It took me a long time to realise that I had forgotten the last step: we had to check we wouldn’t emerge into a wall, as then we would not make the transition at all. My code ended up being pretty long, but I think it should work in generality as I never “hard-coded” my cube’s net.

Day 23: Unstable Diffusion 1.670 s

This problem was conceptually not too hard but seemed to need many lines of code. I initially tried using a dictionary, but it turned out that iterating through this was too slow due to the number of elves, and in fact just keeping an array of all the possible locations was faster, since the operations are all local. I therefore used

mutable struct Elf
    id::Int
    prop_loc::Union{Tuple{Int, Int}, Nothing}
end

and Elves = Matrix{Union{Elf, Nothing}}. Each round consisted of a few tasks: firstly I checked if any elves were near the edge of the array, extending the array in all directions by ten units if so. Then I found the proposed location for each elf according to the rules. Next I updated the locations, checking that no elves tried to go to the same location. Finally I reset all the proposed locations to nothing.

For part 2 I was concerned that I might need a huge number of rounds before termination, but it ended up not being too bad and I could just reuse the code, checking at each step if any elf still had neighbours.

Day 24: Blizzard Basin 1.711 s

One final network search problem, this time with a time-varying network. I stored the blizzard locations as Blizzard = Matrix{Vector{Int}}, recording the number of blizzards in each direction at every location.

For part 1 I first calculated all of the blizzard locations at each time step following the rules, up to some time limit chosen by trial and error. I then used DFS again for the main search problem, with

struct State
    loc::Tuple{Int, Int}
    time::Int
end

I used the following pruning strategy to reduce run-time. I noted that the blizzards were periodic with period given by the least common multiple of the dimensions of the valley. Thus if we were in the same location at the same time modulo this period, then the state had already been seen and could be discarded.

For part 2 I used the same approach, swapping the start and end points and resuming the blizzards after each trip to get the total round trip time.

Day 25: Full of Hot Air 0.105 s

This final problem turned out to be more tricky than I expected, though didn’t need many lines in the end. Manipulating numbers in a different base didn’t seem too bad, but the inclusion of negative coefficients made this very confusing. Writing a snafu_to_decimal function was straightforward, looking up the coefficients in a dictionary and calculating in base five. The decimal_to_snafu function was much harder, and my final solution was recursive, first finding a large enough power of five for the leading digit and calculating the coefficient, then calling the function again with the power reduced by one on the remainder to get the next digit. I finally trimmed any leading zeros.

Concluding remarks

Julia was overall a good language to use, offering a compromise between ease of use (for those familiar with Python, R and MATLAB, at least), and performance. One downside is that Julia is not a scripting language and has a poor startup time. This means that most of the time is spent compiling rather than running the program, and it is impossible to separate these two steps. I aimed for each solution to execute in around one second, and managed to get them all under two seconds, which I’m fairly happy with. Next time I might try to learn Rust.

Yurinskii’s Coupling for Martingales

2022-10-04T00:00:00+00:00

I’m pleased to share my new preprint, titled “Yurinskii’s Coupling for Martingales” and coauthored with Matias Cattaneo and Ricardo Masini.

It can be found at arXiv:2210.00362.

Abstract

Yurinskii’s coupling is a popular tool for finite-sample distributional approximation in mathematical statistics and applied probability, offering a Gaussian strong approximation for sums of random vectors under easily verified conditions with an explicit rate of approximation. Originally stated for sums of independent random vectors in $\ell^2$-norm, it has recently been extended to the $\ell^p$-norm, where $1 \leq p \leq \infty$, and to vector-valued martingales in $\ell^2$-norm under some rather strong conditions. We provide as our main result a generalization of all of the previous forms of Yurinskii’s coupling, giving a Gaussian strong approximation for martingales in $\ell^p$-norm under relatively weak conditions. We apply this result to some areas of statistical theory, including high-dimensional martingale central limit theorems and uniform strong approximations for martingale empirical processes. Finally we give a few illustrative examples in statistical methodology, applying our results to partitioning-based series estimators for nonparametric regression, distributional approximation of $\ell^p$-norms of high-dimensional martingales, and local polynomial regression estimators. We address issues of feasibility, demonstrating implementable statistical inference procedures in each section.