Convex ordering on Galton-Watson trees

This blog was recently revived, via a post about convex ordering and its relevance to the problem of sampling with and without replacement that forms part of the potpourri of related results all sometimes referred to as Hoeffding’s inequality.

The previous post had been lying almost-complete but dormant for over two years. I revisited it because of a short but open-ended question about trees posed in our research group meeting by Serte Donderwinkel, one of our high-flying doctoral students.

Simplified Question:

For a Galton-Watson tree, can one obtain upper bounds in probability on the height of the tree, uniformly across all offspring distributions with mean \mu?

Note that in this setting, it is helpful to have in mind the classical notation (Z_0,Z_1,Z_2,\ldots) for a Galton-Watson process, where typically Z_0=1, and Z_{n+1} is the sum of Z_n IID copies of the offspring distribution. Then we have

\mathbb{P}(\mathrm{height}(\mathcal{T}) < k) = \mathbb{P}(Z_k=0).

1) Subcritical case. When \mu<1, we certainly have \mathbb{P}(Z_k>0)\le \mathbb{E}[Z_k]=\mu^k.

Furthermore, if we’re studying all such offspring distributions, this is the best possible upper bound, by considering the offspring distribution given by Z_1=1 with probability \mu and zero otherwise.

2) Critical or supercritical case. When \mu\ge 1, it is possible that the height is infinite with probability one (for example, whenever the offspring distribution is supported on \{1,2,\ldots\}).

So neither case is especially interesting for now.
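The subcritical bound can be checked exactly on small examples. The sketch below uses the standard recursion \mathbb{P}(Z_{k+1}=0)=G(\mathbb{P}(Z_k=0)), where G is the offspring generating function. The helper names `pgf` and `prob_extinct_by`, and the second offspring distribution, are my own choices for illustration.

```python
from fractions import Fraction

def pgf(pmf, s):
    """Probability generating function G(s) = E[s^X] for a finite pmf {n: p}."""
    return sum(p * s**n for n, p in pmf.items())

def prob_extinct_by(pmf, k):
    """P(Z_k = 0), via q_{j+1} = G(q_j) started from q_0 = 0 (since Z_0 = 1)."""
    q = Fraction(0)
    for _ in range(k):
        q = pgf(pmf, q)
    return q

mu = Fraction(3, 4)
bernoulli = {0: 1 - mu, 1: mu}                # the extremal example: mean mu
spread = {0: Fraction(5, 8), 2: Fraction(3, 8)}  # another law with mean 3/4

for k in range(1, 6):
    survive_bernoulli = 1 - prob_extinct_by(bernoulli, k)
    survive_spread = 1 - prob_extinct_by(spread, k)
    # Bernoulli offspring attains mu^k; the other law sits below it.
    print(k, survive_bernoulli == mu**k, survive_spread <= mu**k)
```

Exact rational arithmetic is used so that the equality case for the Bernoulli offspring distribution is not blurred by floating-point error.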

Refined question:

What if instead we aren’t trying to obtain a bound uniformly across all offspring distributions with given mean \mu, but instead across a subset \mathcal{X} of these distributions? How do we determine which distribution in \mathcal{X} maximises the probability of reaching height k?

This is the question Serte was asking in our group meeting, in the setting where \mu=1+o(1) and the height k has a particular scaling. As far as I understand, the approach outlined in this post didn’t provide strong enough bounds in this particular context. Happily, Serte has recently tied up all the corners of this project concerning the supercritical Galton-Watson forest, and interested readers can find her preprint here on the arXiv.

Nonetheless the interpretation via convex ordering feels perfect for a blog post, rather than being lost forever.

Convex ordering for offspring distributions

The main observation is that given two offspring distributions X and Y, such that X\le_{cx} Y (which recall means that the means are the same but X is more concentrated) then a number of distributions associated to the Galton-Watson trees for X and Y also satisfy convex ordering relations.

As a warm-up, and because it was the original genesis, we first study heights. We will use the notation

(Z_0^X,Z_1^X,Z_2^X,\ldots), (Z_0^Y,Z_1^Y,Z_2^Y,\ldots),

to denote the two Galton-Watson processes. We shall compare \mathbb{P}(Z^X_k=0) and \mathbb{P}(Z^Y_k=0). If we write \delta_0(\cdot) for the function defined on the non-negative integers such that

\delta_0(0)=1,\quad \delta_0(n)=0,\,n\ge 1,

it holds that \delta_0(\cdot) is convex. In particular, if X\le_{cx}Y, then \mathbb{E}[\delta_0(X)]\le \mathbb{E}[\delta_0(Y)], which exactly says that

\mathbb{P}(Z^X_1 = 0)\le \mathbb{P}(Z^Y_1 = 0).

We can then prove that \mathbb{P}(Z^X_k=0)\le \mathbb{P}(Z^Y_k=0) by induction on k\ge 1. Note that \mathbb{P}(Z^X_k=0)^n is a convex function of n, regardless of the value of this probability, and so we have

\mathbb{P}(Z^X_{k+1}=0) = \mathbb{E}\left[ (\mathbb{P}(Z^X_k=0))^X\right] \le \mathbb{E}\left[(\mathbb{P}(Z^X_k=0))^Y\right].

By the induction hypothesis, this final quantity is at most

\mathbb{E}\left[(\mathbb{P}(Z^Y_k=0))^Y\right] = \mathbb{P}(Z^Y_{k+1}=0).

In conclusion, we have shown that \mathbb{P}(Z^X_k=0)\le \mathbb{P}(Z^Y_k=0) holds for all k, and thus

\mathrm{height}(\mathcal{T}^X) \ge_{st} \mathrm{height}(\mathcal{T}^Y).
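As a sanity check on the induction, here is a small exact computation. The pair X, Y is my own illustrative choice: X is uniform on \{0,1,2\} and Y is uniform on \{0,2\}, so both have mean 1 and Y is obtained from X by splitting the mass at 1 equally between 0 and 2, which gives X\le_{cx}Y. The function names are hypothetical.

```python
from fractions import Fraction

def pgf(pmf, s):
    """Generating function E[s^X] of a finite offspring pmf {n: p}."""
    return sum(p * s**n for n, p in pmf.items())

def prob_extinct_by(pmf, k):
    """P(Z_k = 0) for the Galton-Watson process with Z_0 = 1."""
    q = Fraction(0)
    for _ in range(k):
        q = pgf(pmf, q)
    return q

# Assumed example pair with X <=_cx Y (equal means, Y more spread out).
X = {0: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 3)}
Y = {0: Fraction(1, 2), 2: Fraction(1, 2)}

for k in range(1, 9):
    qx, qy = prob_extinct_by(X, k), prob_extinct_by(Y, k)
    # The more concentrated law is less likely to have died out by time k.
    print(k, float(qx), float(qy), qx <= qy)
```

This only tests one pair, of course, but it is a quick way to catch a sign error in the direction of the stochastic ordering.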

To return to the original context, suppose we have a large class of offspring distributions \mathcal{Y} and a subclass \mathcal{X}\subseteq \mathcal{Y} such that for all Y\in\mathcal{Y}, there exists X\in \mathcal{X} such that X\le_{cx} Y. Then one can obtain uniform bounds on the heights of Galton-Watson trees with offspring distributions drawn from \mathcal{Y} by checking those generated from distributions in \mathcal{X} (which is particularly convenient if, for example, \mathcal{X} is finite).

Convex ordering of generation sizes

The above argument solves the original problem, but brushes over the natural question: is it true that Z^X_k \le_{cx} Z^Y_k?

The answer is yes. Here’s a proof.

This follows from the following general statement:

Lemma 1: Suppose X\le_{cx} Y are non-negative valued RVs and the non-negative integer valued RVs M,N also satisfy M \le_{cx} N. Then

X_1+\ldots+X_M \le_{cx} Y_1+\ldots+Y_N,

where X_1,X_2,\ldots are IID copies of X and, independently, Y_1,Y_2,\ldots are IID copies of Y.

Lemma 2: Suppose W_1\le_{cx}Z_1 and W_2\le_{cx} Z_2, and the four random variables are independent. Then W_1+W_2\le_{cx}Z_1+Z_2.

Proof of Lemma 2: First, note that for any random variable Z, and convex function f

\mathbb{E}\left[f(Z+x)\right] is a convex function of x.

(Indeed, this holds since “f(z+x) is convex” holds for every z, and any definition of convex will pass to the expectation.)

Now we can attack the lemma directly. Applying this observation with W_1 in place of Z, and using W_2\le_{cx}Z_2, we may write

\mathbb{E}\left[ f(W_1+W_2)\right]=\mathbb{E}\left[\, \mathbb{E}[f(W_1+W_2) \mid W_2 ] \,\right] \le \mathbb{E}\left[\, \mathbb{E}[f(W_1+Z_2)\mid Z_2 ] \, \right].

But then for any z_2, we know f(\cdot+z_2) is convex, so \mathbb{E}[f(W_1+z_2)]\le \mathbb{E}[f(Z_1+z_2)], and it follows that

\mathbb{E}\left[ f(W_1+W_2)\right]\le \mathbb{E} \left[ f(Z_1+Z_2)\right],

which proves the lemma.

Corollary 3: When W_1,\ldots,W_m, Z_1,\ldots,Z_m are independent, and satisfy W_i \le_{cx} Z_i, then we have W_1+\ldots+W_m\le_{cx} Z_1+\ldots+Z_m.

Proof of Lemma 1: Note that

\mathbb{E}\left[ f(X_1+\ldots+X_M)\mid M=n\right] \le \mathbb{E}\left[ f(Y_1+\ldots+Y_N)\mid N=n\right],

follows from Corollary 3. So a useful question to consider is whether \mathbb{E}\left[f(Y_1+\ldots+Y_n)\right] (*) is a convex function of n?

Denote this quantity by F(n). To check convexity of a function defined on the integers, it suffices to verify that F(n+1)-F(n)\ge F(n)-F(n-1).

There is a canonical coupling between the RVs used to define all of F(n-1),F(n),F(n+1), but it will be convenient to adjust the coupling, and write:

F(n+1)-F(n)= \mathbb{E}\left[ f(Y_1+\ldots+Y_n + Y^*) - f(Y_1+\ldots+Y_n)\right],

F(n)-F(n-1)=\mathbb{E}\left[f(Y_1+\ldots+Y_{n-1}+Y^*) - f(Y_1+\ldots+Y_{n-1})\right],

where Y^* is a further independent copy of Y. But note that for any choice C\ge c and y\ge 0,

f(C+y) - f(C) - f(c+y) + f(c)\ge 0. (*)

(Essentially, this says that the ‘chord’ of f on the interval [c,C+y] lies above the chord on interval [C,c+y] or [c+y,C], which some people choose to call Karamata’s inequality, but I think is more helpful to think of as part of the visual definition of convexity.)
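For readers who prefer to see the chord inequality derived from the definition, here is one way to do it, assuming y\ge 0 (which is the case used in the proof, since y=Y^* is non-negative) and C+y>c (otherwise all four points coincide). The weight \lambda is notation introduced only for this derivation.

```latex
\[
\lambda := \frac{y}{C+y-c}\in[0,1], \qquad
C = \lambda c + (1-\lambda)(C+y), \qquad
c+y = (1-\lambda)c + \lambda (C+y).
\]
Convexity of $f$ applied to each convex combination gives
\[
f(C) \le \lambda f(c) + (1-\lambda) f(C+y), \qquad
f(c+y) \le (1-\lambda) f(c) + \lambda f(C+y),
\]
and adding the two inequalities yields
\[
f(C) + f(c+y) \le f(c) + f(C+y).
\]
```

Note that \lambda\in[0,1] uses both C\ge c and y\ge 0.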

In any case, setting y=Y^*, c=Y_1+\ldots+Y_{n-1}, C=Y_1+\ldots+Y_n and taking expectations, we obtain

\mathbb{E}\left[ f(Y_1+\ldots+Y_n+Y^*) - f(Y_1+\ldots+Y_n)\right.

\left.- f(Y_1+\ldots+Y_{n-1}+Y^*) + f(Y_1+\ldots+Y_{n-1})\right]\ge 0,

as required. So F(n) is convex. We may now finish off as

\mathbb{E}\left[ f(X_1+\ldots+X_M)\right] = \mathbb{E}\left[ \,\mathbb{E}[f(X_1+\ldots+X_M)\mid M]\,\right] \le \mathbb{E}\left[\, \mathbb{E}[f(Y_1+\ldots+Y_M)\mid M]\,\right] = \mathbb{E}[F(M)]\le \mathbb{E}[F(N)] = \mathbb{E}[f(Y_1+\ldots+Y_N)],

completing the proof of Lemma 1.
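Lemma 1 can also be tested numerically on generation sizes. The sketch below computes the exact law of Z_k for two offspring distributions and checks convex ordering via the standard criterion that, for equal means, X\le_{cx}Y holds if and only if \mathbb{E}[(X-t)^+]\le\mathbb{E}[(Y-t)^+] for all t (for integer-valued variables it is enough to check integer t). The example pair and all function names are my own assumptions for illustration.

```python
from collections import defaultdict
from fractions import Fraction

def convolve(p, q):
    """pmf of the sum of two independent variables with pmfs p and q."""
    out = defaultdict(Fraction)
    for a, pa in p.items():
        for b, pb in q.items():
            out[a + b] += pa * pb
    return dict(out)

def next_generation(z_pmf, offspring):
    """pmf of X_1 + ... + X_Z, where Z ~ z_pmf and the X_i are IID offspring."""
    out = defaultdict(Fraction)
    n_fold = {0: Fraction(1)}          # law of the empty sum
    for n in range(max(z_pmf) + 1):
        weight = z_pmf.get(n, Fraction(0))
        for total, prob in n_fold.items():
            out[total] += weight * prob
        n_fold = convolve(n_fold, offspring)
    return {v: p for v, p in out.items() if p}

def generation_pmf(offspring, k):
    """Law of Z_k started from Z_0 = 1."""
    z = {1: Fraction(1)}
    for _ in range(k):
        z = next_generation(z, offspring)
    return z

def mean(pmf):
    return sum(v * p for v, p in pmf.items())

def stop_loss(pmf, t):
    """E[(Z - t)^+]."""
    return sum(p * max(v - t, 0) for v, p in pmf.items())

def convex_leq(p, q):
    """Check p <=_cx q for pmfs on the non-negative integers."""
    top = max(max(p), max(q))
    return mean(p) == mean(q) and all(
        stop_loss(p, t) <= stop_loss(q, t) for t in range(top + 1))

# Assumed example: Y is a mean-preserving spread of X, both with mean 1.
X = {0: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 3)}
Y = {0: Fraction(1, 2), 2: Fraction(1, 2)}
for k in range(1, 4):
    print(k, convex_leq(generation_pmf(X, k), generation_pmf(Y, k)))
```

The support of Z_k grows like 2^k here, so this is only practical for the first few generations.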

Final comments

  • The analysis in this post is not sufficient to study the total population sizes of two Galton-Watson trees generated by X and Y. Note that in Lemma 2, it is important that the random variables are independent. Otherwise, we could, for example, take X and Y symmetric with mean zero and X\le_{cx}Y, with X non-degenerate; then -Y has the same distribution as Y, but clearly it does not hold that X_1+X_2 \le_{cx} Y + (-Y) = 0. So for total population size, since (Z^X_k,\,k\ge 1) are not independent, an alternative approach would be required.
  • A further characterisation of convex ordering is given by Strassen’s theorem [Str65], which is touched on in the previous post, and to which I may return in a future post on this topic. This may be a more promising avenue for establishing a convex ordering result on total population size.
  • Lemma 1 requires that X,Y are non-negative. Note that during the argument we set y=Y^*, c=Y_1+\ldots+Y_{n-1}, C=Y_1+\ldots+Y_n, and when we relax the non-negative support condition, it is no longer guaranteed that C\ge c, which is crucial for the step which follows.
  • In a recent article in ECP addressing Lemma 1 by a different method, Berard and Juillet [BJ20] provide a simple example showing that the non-negative assumption is genuinely necessary. Consider the random variable \tau\in \{0,2\} with equal probability so 1\le_{cx} \tau. But then, taking both X and Y to be the \pm 1 increment of simple random walk on \mathbb{Z}, with partial sums (S_n), we do not have S_1\le_{cx}S_{\tau}.
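The failure in the random walk example can be seen by testing against the convex function f(x)=|x|. The following sketch computes the two laws exactly. The variable names are mine, and the choice of |x| as the test function is just one convenient option.

```python
from fractions import Fraction

half = Fraction(1, 2)
step = {-1: half, 1: half}   # one increment of simple random walk

def convolve(p, q):
    """pmf of the sum of two independent variables with pmfs p and q."""
    out = {}
    for a, pa in p.items():
        for b, pb in q.items():
            out[a + b] = out.get(a + b, Fraction(0)) + pa * pb
    return out

s_1 = step
s_2 = convolve(step, step)

# tau is 0 or 2 with equal probability, independent of the walk.
s_tau = {0: half}                       # contribution from tau = 0, S_0 = 0
for value, prob in s_2.items():         # contribution from tau = 2
    s_tau[value] = s_tau.get(value, Fraction(0)) + half * prob

def expected_abs(pmf):
    return sum(p * abs(v) for v, p in pmf.items())

# Convex ordering S_1 <=_cx S_tau would force E|S_1| <= E|S_tau|.
print(expected_abs(s_1), expected_abs(s_tau))
```

Since S_\tau is zero with probability 3/4, its expected absolute value is smaller than that of S_1, which rules out S_1\le_{cx}S_\tau.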

References

[BJ20] – Berard, Juillet – A coupling proof of convex ordering for compound distributions, 2020

[Str65] – Strassen – The existence of probability measures with given marginals, 1965

EGMO 2016 Paper I

We’ve just had our annual selection and training camp for the UK IMO team in Cambridge, and I hope it was enjoyed by all. I allotted myself the ‘graveyard slot’ at 5pm on the final afternoon (incidentally, right in the middle of this, but what England fan could have seen that coming in advance?) and talked about random walks on graphs and the (discrete) heat equation. More on that soon perhaps.

The UK has a team competing in the 5th European Girls Mathematical Olympiad (hereafter EGMO 2016) right now in Busteni, Romania. The first paper was sat yesterday, and the second paper is being sat as I write this. Although we’ve already sent a team to Romania this year (where they did rather well indeed! I blame the fact that I wasn’t there.), this feels like the start of the olympiad ‘season’. It also coincides well with Oxford holidays, when, though thesis deadlines loom, I have a bit more free time for thinking about these problems. Anyway, last year I wrote a summary of my thoughts and motivations when trying the EGMO problems, and this seemed to go down well, so I’m doing the same this year. My aim is not to offer official solutions, or even outlines of good solutions, but rather to talk about ideas, and how and why I decided whether they did or didn’t work. I hope some of it is interesting.

You can find the paper in many languages on the EGMO 2016 website. I have several things to say about the geometry Q2, but I have neither enough time nor geometric diagram software this morning, so will only talk about questions 1 and 3. If you are reading this with the intention of trying the problems yourself at some point, you probably shouldn’t keep reading, in the nicest possible way.

Question 1

[Slightly paraphrased] Let n be an odd positive integer and x_1,\ldots,x_n\ge 0. Show that

\min_{i\in[n]} \left( x_i^2+x_{i+1}^2\right) \le \max_{j\in[n]} 2x_jx_{j+1},

where we define x_{n+1}=x_1 cyclically in the natural way.

Thought 1: this is a very nice statement. Obviously when i and j are equal, the inequality holds the other way round, and so it’s interesting and surprising that constructing a set of pairs of inequalities in the way suggested gives a situation where the ‘maximum minimum’ is at least the ‘minimum maximum’.

Thought 2: what happens if n is actually even? Well, you can kill the right-hand-side by taking at least every other term to be zero. And if n is even, you really can take every other term to be zero, while leaving the remaining terms positive. So then the RHS is zero and the LHS is positive.

The extension to this thought is that the statement is in danger of not holding if there’s a lot of alternating behaviour. Maybe we’ll use that later.
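A quick experiment makes the role of parity concrete. The sketch below evaluates both sides of the inequality for the alternating even-length sequence, and then tries random odd-length sequences of small non-negative integers. The function names, the seed, and the ranges are arbitrary choices of mine.

```python
import random

def lhs(xs):
    """min over i of x_i^2 + x_{i+1}^2, indices taken cyclically."""
    n = len(xs)
    return min(xs[i] ** 2 + xs[(i + 1) % n] ** 2 for i in range(n))

def rhs(xs):
    """max over j of 2 x_j x_{j+1}, indices taken cyclically."""
    n = len(xs)
    return max(2 * xs[j] * xs[(j + 1) % n] for j in range(n))

def holds_on_random_odd(trials, seed=0):
    """Test the inequality on random odd-length integer sequences."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.choice([3, 5, 7, 9])
        xs = [rng.randint(0, 10) for _ in range(n)]
        if lhs(xs) > rhs(xs):
            return False
    return True

alternating = [1, 0, 1, 0]   # even n: every other term zero
print(lhs(alternating), rhs(alternating))
print(holds_on_random_odd(2000))
```

Integers are used so that the comparison is exact. A random search is no substitute for the proof, but it is a cheap way to confirm that the odd-length condition is doing real work.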

Idea 1: We can write

2(x_i^2+x_{i+1}^2)=(x_i+x_{i+1})^2 + |x_i-x_{i+1}|^2, \quad 4x_ix_{i+1}=(x_i+x_{i+1})^2 - |x_i-x_{i+1}|^2,

which gives insight into ‘the problem multiplied by 2’. This was an ‘olympiad experience’ idea. These transformations between various expressions involving sums of squares turn out to be useful all the time. Cf BMO2 2016 question 4, and probably about a million other examples. As soon as you see these expressions, your antennae start twitching. Like when you notice a non-trivial parallelogram in a geometry problem, but I digress. I’m not sure why I stuck in the absolute value signs.

This was definitely a good idea, but I couldn’t find a way to make useful deductions from it especially easily. I tried converting the RHS expression for i (where LHS attains minimum) into the RHS expression for any j by adding on terms, but I couldn’t think of a good way to get any control over these terms, so I moved on.

Idea 2: An equality case is when they are all equal. I didn’t investigate very carefully at this point whether this might be the only equality case. I started thinking about what happens if you start with an ‘equal-ish’ sequence where the inequality holds, then fiddle with one of the values. If you adjust exactly one value, then both sides might stay constant. It seemed quite unlikely that both would vary, but I didn’t really follow this up. In any case, I didn’t feel like I had very good control over the behaviour of the two sides if I started from equality and built up to the general case by adjusting individual values. Or at least, I didn’t have a good idea for a natural ordering to do this adjustment so that I would have good control.

Idea 3: Now I thought about focusing on where the LHS attains this minimum. Somewhere, there are values (x,y) next to each other such that x^2+y^2 is minimal. Let’s say x\le y. Therefore we know that the element before x is at least y, and vice versa, ie we have

\ldots, \ge y, x, y, \ge x,\ldots.

and this wasn’t helpful, because I couldn’t take this deduction one step further on the right. However, once you have declared the minimum of the LHS, you are free to make all the other values of x_i smaller, so long as they don’t break this minimum. Why? Because the LHS stays the same, and the RHS gets smaller. So if you can prove the statement after doing this, then the statement was also true before doing this. So after thinking briefly, this means that you can say that for every i, either x_{i-1}^2+x_i^2 or x_i^2+x_{i+1}^2 attains this minimum.

Suddenly this feels great, because once we know at least one of the pairs corresponding to i attains the minimum, this is related to parity of n, which is in the statement. At this point, I was pretty confident I was done. Because you can’t partition odd [n] into pairs, there must be some i which achieves a minimum on both sides. So focus on that.

Let’s say the values are (x,y,x) with x\le y. Now when we try to extend in both directions, we actually can do this, because the values alternate with bounds in the right way. The key is to use the fact that the minimum x^2+y^2 must be attained at least every other pair. (*) So we get

\ldots, \le x,\ge y,x,y,x,\ge y,\le x,\ldots.

But it’s cyclic, so the ‘ends’ of this sequence join up. If n\equiv 1 modulo 4, we get \ge y,\ge y next to each other, which means the RHS of the statement is indeed at least the LHS. If n\equiv 3 modulo 4, then we get \le x,\le x next to each other, which contradicts minimality of x^2+y^2 unless x=y. Then we chase equality cases through the argument (*) and find that they must all be equal. So (after checking that the case x\ge y really is the same), we are done.

Thought 3: This really is the alternating thought 2 in action. I should have probably stayed with the idea a bit longer, but this plan of reducing values so that equality was achieved often came naturally out of the other ideas.

Thought 4: If I had to do this as an official solution, I imagine one can convert this into a proof by contradiction and it might be slightly easier, or at least easier to follow. If you go for contradiction, you are forcing local alternating behaviour, and should be able to derive a contradiction when your terms match up without having to start by adjusting them to achieve equality everywhere.

Question 3

Let m be a positive integer. Consider a 4m x 4m grid, where two cells are related to each other if they are different but share a row or a column. Some cells are coloured blue, such that every cell is related to at least two blue cells. Determine the minimum number of blue cells.

Thought 1: I spent the majority of my time on this problem working with the idea that the answer was 8m. Achieved by taking two in each row or column in pretty much any fashion, eg both diagonals. This made me uneasy because the construction didn’t take advantage of the fact that the grid size was divisible by 4. I also couldn’t prove it.

Thought 2: bipartite graphs are sometimes useful to describe grid problems. Edges correspond to cells and each vertex set to row labels or column labels.

Idea 1: As part of an attempt to find a proof, I was thinking about convexity, and why having exactly two in every row was best, so I wrote down the following:

Claim A: No point having three in a row.

Claim B: Suppose a row has only one in it + previous claim => contradiction.

In Cambridge, as usual I organised a fairly comprehensive discussion of how to write up solutions to olympiad problems. The leading-order piece of advice is to separate your argument into small pieces, which you might choose to describe as lemmas or claims, or just separate implicitly by spacing. This is useful if you have to do an uninteresting calculation in the middle of a proof and don’t want anyone to get distracted, but mostly it’s useful for the reader because it gives an outline of your argument.

My attempt at this problem illustrates an example of the benefit of doing this even in rough. If your claim is a precise statement, then that’s a prompt to go back and separately decide whether it is actually true or not. I couldn’t prove it, so started thinking about whether it was true.

Idea 2: Claim A is probably false. This was based on my previous intuition, and the fact that I couldn’t prove it or get any handle on why it might be true. I’d already tried the case m=1, but I decided I must have done it wrong so tried it again. I had got it wrong, because 6 is possible, and it wasn’t hard from here (now being quite familiar with the problem) to turn this into a construction for 6m in the general case.

Idea 3: This will be proved by some sort of double-counting argument. Sometimes these arguments turn on a convexity approach, but when the idea is that a few rows have three blue cells, and the rest have one, this now seemed unlikely.

Subthought: Does it make sense for a row to have more than three blue cells? No. Why not? Note that as soon as we have three in a row, all the cells in that row are fine, irrespective of the rest of the grid. If we do the problem the other way round, and have some blues, and want to fill out legally the largest possible board, why would we put six in one row, when we could add an extra row, have three in each (maintaining column structure) and be better off than we were before. A meta-subthought is that this will be impossible to turn into an argument, but we should try to use it to inform our setup.

Ages and ages ago, I’d noticed that you could permute the rows and columns without really affecting anything, so now seemed a good time to put all the rows with exactly one blue cell at the top (having previously established that rows with no blue cell were a disaster for achieving 6m), and all the columns with one blue cell at the left. I said there were r_1,c_1 such rows and columns. Then, I put all the columns which had a blue cell in common with the r_1 rows next to the c_1 columns I already had. Any such column has at least three blues in it, so I said there were c_3 of these, and similarly r_3 rows. The remaining columns and rows might as well be r_0,c_0 and hopefully won’t matter too much.

From here, I felt I had all the ingredients, and in fact I did, though some of the book-keeping got a bit fiddly. Knowing what you are aiming for and what you have means there’s only one way to proceed: first find expressions in terms of these which are upper bounds for the number of columns (or twice the number of columns = rows if you want to keep symmetry), and then lower bounds in terms of these for the number of blue cells. I found a noticeable case-distinction depending on whether r_1\le 3c_3 and c_1\le 3r_3. If both held or neither held, it was quite straightforward, and if exactly one held, it got messy, probably because I hadn’t set things up optimally. Overall, fiddling about with these expressions occupied slightly more time than actually working out the answer was 6m, so I don’t necessarily have a huge number of lessons to learn, except be more organised.

Afterthought 2: Thought 2 said to consider bipartite graphs. I thought about this later while cycling home, because one can’t (or at least, I can’t) manipulate linear inequalities in my head while negotiating Oxford traffic and potholes. I should have thought about it earlier. The equality case is key. If you add in the edges corresponding to blue cells, you get a series of copies of K_{1,3}, that is, one vertex with three neighbours. Thus you have three edges for every four vertices, and everything’s a tree. This is a massively useful observation for coming up with a very short proof. You just need to show that there can’t be components of size smaller than 4. Also, I bet this is how the problem-setter came up with it…
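The K_{1,3} picture translates directly into a construction, and the case m=1 is small enough to check by brute force. In the sketch below, each 4 x 4 diagonal block contains one row with three blue cells and one column with three blue cells, which corresponds to two copies of K_{1,3} in the bipartite graph. The particular layout and the function names are my own. This is one possible equality case, not necessarily the one intended by the problem-setter.

```python
from itertools import combinations

def is_valid(n, blue):
    """Is every cell of the n x n grid related to at least two blue cells?"""
    blue = set(blue)
    row = [0] * n
    col = [0] * n
    for r, c in blue:
        row[r] += 1
        col[c] += 1
    for r in range(n):
        for c in range(n):
            # A blue cell is counted once in its row and once in its column,
            # but is not related to itself.
            related = row[r] + col[c] - (2 if (r, c) in blue else 0)
            if related < 2:
                return False
    return True

def construction(m):
    """6m blue cells on the 4m x 4m grid, built from diagonal 4 x 4 blocks."""
    blue = []
    for b in range(m):
        o = 4 * b
        blue += [(o, o), (o, o + 1), (o, o + 2)]              # a row with three
        blue += [(o + 1, o + 3), (o + 2, o + 3), (o + 3, o + 3)]  # a column with three
    return blue

def min_blue(n):
    """Smallest number of blue cells that works, by exhaustive search."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    for size in range(n * n + 1):
        if any(is_valid(n, subset) for subset in combinations(cells, size)):
            return size

print(is_valid(4, construction(1)), is_valid(8, construction(2)))
print(min_blue(4))
```

The exhaustive search is only feasible for the 4 x 4 grid, where it looks at a few thousand subsets. For larger m one needs the counting argument on component sizes.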