Convex ordering on Galton-Watson trees

This blog was recently revived, via a post about convex ordering and its relevance to the problem of sampling with and without replacement that forms part of the potpourri of related results all sometimes referred to as Hoeffding’s inequality.

The previous post had been lying almost-complete but dormant for over two years. I revisited it because of a short but open-ended question about trees posed in our research group meeting by Serte Donderwinkel, one of our high-flying doctoral students.

Simplified Question:

For a Galton-Watson tree, can one obtain upper bounds in probability on the height of the tree, uniformly across all offspring distributions with mean \mu?

Note that in this setting, it is helpful to have in mind the classical notation (Z_0,Z_1,Z_2,\ldots) for a Galton-Watson process, where typically Z_0=1, and Z_{n+1} is the sum of Z_n IID copies of the offspring distribution. Then we have

\mathbb{P}(\mathrm{height}(\mathcal{T}) < k) = \mathbb{P}(Z_k=0).

1) Subcritical case. When \mu<1, we certainly have \mathbb{P}(Z_k>0)\le \mathbb{E}[Z_k]=\mu^k.

Furthermore, if we’re studying all such offspring distributions, this is the best possible upper bound, by considering the offspring distribution given by Z_1=1 with probability \mu and zero otherwise.

2) In the critical or supercritical case, \mu\ge 1 it is possible that the height is infinite with probability one.

So neither case is especially interesting for now.

Refined question:

What if instead we aren’t trying to obtain a bound uniformly across all offspring distributions with given mean \mu, but instead across a subset \mathcal{X} of these distributions? How do we determine which distribution in \mathcal{X} maximises the probability of reaching height k?

This is the question Serte was asking in our group meeting, in the setting where \mu=1+o(1) and the height k has a particular scaling. Also, as far as I understand, the approach outlined in this post didn’t provide strong enough bounds in this particular context. Happily, Serte has recently tied up all the corners of this project concerning the supercritical Galton-Watson forest, and interested readers can find her preprint here on the Arxiv.

Nonetheless the interpretation via convex ordering feels perfect for a blog post, rather than being lost forever.

Convex ordering for offspring distributions

The main observation is that given two offspring distributions X and Y, such that X\le_{cx} Y (which recall means that the means are the same but X is more concentrated) then a number of distributions associated to the Galton-Watson trees for X and Y also satisfy convex ordering relations.

As a warm-up, and because it was the original genesis, we first study heights. We will use the notation

(Z_0^X,Z_1^X,Z_2^X,\ldots), (Z_0^Y,Z_1^Y,Z_2^Y,\ldots),

to denote the two Galton-Watson processes. We shall compare \mathbb{P}(Z^X_k=0) and \mathbb{P}(Z^Y_k=0). If we write \delta_0(\cdot) for the function defined on the non-negative integers such that

\delta_0(0)=1,\quad \delta_0(n)=0,\,n\ge 1,

it holds that \delta_0(\cdot) is convex. In particular, if X\le_{cx}Y, then \mathbb{E}[\delta_0(X)]\le \mathbb{E}[\delta_0(Y)], which exactly says that

\mathbb{P}(Z^X_1 = 0)\le \mathbb{P}(Z^Y_1 = 0).

We can then prove that \mathbb{P}(Z^X_k=0)\le \mathbb{P}(Z^Y_k=0) by induction on k\ge 1. Note that \mathbb{P}(Z^X_k=0)^n is a convex function of n, regardless of the value of this probability, and so we have

\mathbb{P}(Z^X_{k+1}=0) = \mathbb{E}\left[ (\mathbb{P}(Z^X_k=0))^X\right] \le \mathbb{E}\left[(\mathbb{P}(Z^X_k=0))^Y\right].

By the induction hypothesis, this final quantity is at most

\mathbb{E}\left[(\mathbb{P}(Z^Y_k=0))^Y\right] = \mathbb{P}(Z^Y_{k+1}=0).

In conclusion, we have shown that \mathbb{P}(Z^X_k=0)\le \mathbb{P}(Z^Y_k=0) holds for all k, and thus

\mathrm{height}(\mathcal{T}^X) \ge_{st} \mathrm{height}(\mathcal{T}^Y).

To return to the original context, suppose we have a large class of offspring distributions \mathcal{Y} and a subclass \mathcal{X}\subseteq \mathcal{Y} such that for all Y\in\mathcal{Y}, there exists X\in \mathcal{X} such that X\le_{cx} Y. Then one can obtain uniform bounds on the heights of Galton-Watson trees with offspring distributions drawn from \mathcal{Y} by checking those generated from distributions in \mathcal{X} (which is particularly convenient if, for example, \mathcal{X} is finite).

Convex ordering of generation sizes

The above argument solves the original problem, but brushes over the natural question: is it true that Z^X_k \le_{cx} Z^Y_k?

The answer is yes. Here’s a proof.

This follows from the following general statement:

Lemma 1: Suppose X\le_{cx} Y are non-negative valued RVs and the non-negative integer valued RVs M,N also satisfy M \le_{cx} N. Then

X_1+\ldots+X_M \le_{cx} Y_1+\ldots Y_N,

where X_1,X_2,\ldots are IID copies of X and, independently, Y_1,Y_2,\ldots are IID copies of Y.

Lemma 2: Suppose W_1\le_{cx}Z_1 and W_2\le_{cx} Z_2, and the four random variables are independent. Then W_1+W_2\le_{cx}Z_1+Z_2.

Proof of Lemma 2: First, note that for any random variable X, and convex function f

\mathbb{E}\left[f(Z+x)\right] is a convex function of x.

(Indeed, this holds since “f(z+x) is convex” holds for every z, and any definition of convex will pass to the expectation.)

Now we can attack the lemma directly, we may write

\mathbb{E}\left[ f(W_1+W_2)\right]=\mathbb{E}\left[\, \mathbb{E}[f(W_1+W_2) \mid W_2 ] \,\right] \le \mathbb{E}\left[\, \mathbb{E}[f(W_1+Z_2)\mid Z_2 ] \, \right].

But then for any z_2, we know f(\cdot+z_2) is convex, so \mathbb{E}[f(W_1+z_2)]\le \mathbb{E}[f(Z_1+z_2)], and it follows that

\mathbb{E}\left[ f(W_1+W_2)\right]\le \mathbb{E} \left[ f(Z_1+Z_2)\right],

which proves the lemma.

Corollary 3: When W_1,\ldots,W_m, Z_1,\ldots,Z_m are independent, and satisfy W_i \le_{cx} Z_i, then we have W_1+\ldots+W_m\le_{cx} Z_1+\ldots+Z_m.

Proof of Lemma 1: Note that

\mathbb{E}\left[ f(X_1+\ldots+X_M)\mid M=n\right] \le \mathbb{E}\left[ f(Y_1+\ldots+Y_N)\mid N=n\right],

follows from Corollary 3. So a useful question to consider is whether \mathbb{E}\left[f(Y_1+\ldots+Y_n)\right] (*) is a convex function of n?

Denote this quantity by F(n). To check convexity of a function defined on the integers, it suffices to verify that F(n+1)-F(n)\ge F(n)-F(n-1).

There is a canonical coupling between the RVs used to define all of F(n-1),F(n),F(n+1), but it will be convenient to adjust the coupling, and write:

F(n+1)-F(n)= \mathbb{E}\left[ f(Y_1+\ldots+Y_n + Y^*) - f(Y_1+\ldots+Y_n)\right],

F(n)-F(n-1)=\mathbb{E}\left[f(Y_1+\ldots+Y_{n-1}+Y^*) - f(Y_1+\ldots+Y_{n-1})\right],

where Y^* is a further independent copy of Y. But note that for any choice C\ge c and y\in \mathbb{R},

f(C+y) - f(C) - f(c+y) + f(c)\ge 0. (*)

(Essentially, this says that the ‘chord’ of f on the interval [c,C+y] lies above the chord on interval [C,c+y] or [c+y,C], which some people choose to call Karamata’s inequality, but I think is more helpful to think of as part of the visual definition of convexity.)

In any case, setting y=Y^*, c=Y_1+\ldots+Y_{n-1}, C=Y_1+\ldots+Y_n and taking expectations, we obtain

\mathbb{E}\left[ f(Y_1+\ldots+Y_n+Y^*) - f(Y_1+\ldots+Y_n)\right.

\left.- f(Y_1+\ldots+Y_{n-1}+Y^*) + f(Y_1+\ldots+Y_{n-1})\right]\ge 0,

as required. So F(n) is convex. We may now finish off as

\mathbb{E}\left[ X_1+\ldots+X_M\right] = \mathbb{E}\left[ \,\mathbb{E}[X_1+\ldots+X_M\mid M]\,\right] \le \mathbb{E}\left[\, \mathbb{E}[Y_1+\ldots+Y_M\mid M]\,\right] = \mathbb{E}[f(M)]\le \mathbb{E}[f(N)] = \mathbb{E}[Y_1+\ldots+Y_N],

completing the proof of Lemma 1.

Final comments

  • The analysis in this post is not sufficient to study the total population sizes of two Galton-Watson trees generated by X and Y. Note that in Lemma 2, it is important that the random variables are independent. Otherwise, we could, for example, consider \mathbb{E}[X]=\mathbb{E}[Y]=0 with X\le_{cx}Y but clearly it should not hold that X_1+X_2 \le_{cx} Y + (-Y) = 0. So for total population size, since (Z^X_k,\,k\ge 1) are not independent, an alternative approach would be required.
  • A further characterisation of convex ordering is given by Strassen’s theorem [Str65], which is touched on in the previous post, and to which I may return to in a future post on this topic. This may be a more promising avenue for established a convex ordering result on total population size.
  • Lemma 1 requires that X,Y are non-negative. Note that during the argument we set y=Y^*, c=Y_1+\ldots+Y_{n-1}, C=Y_1+\ldots+Y_n, and when we relax the non-negative support condition, it is no longer guaranteed that C\ge c, which is crucial for the step which follows.
  • In a recent article in ECP addressing Lemma 1 by a different method, Berard and Juillet [BJ20] provide a simple example showing that the non-negative assumption is genuinely necessary. Consider the random variable \tau\in \{0,2\} with equal probability so 1\le_{cx} \tau. But then, taking both X and Y to be simple random walk on \mathbb{Z}, we do not have S_1\le_{cx}S_{\tau}.

References

[BJ20] – Berard, Juillet – A coupling proof of convex ordering for compound distributions, 2020

[Str65] – Strassen – The existence of probability measures with given marginals, 1965

Hall’s Marriage Theorem

Hall’s Marriage Theorem gives conditions on when the vertices of a bipartite graph can be split into pairs of vertices corresponding to disjoint edges such that every vertex in the smaller class is accounted for. Such a set of edges is called a matching. If the sizes of the vertex classes are equal, then the matching naturally induces a bijection between the classes, and such a matching is called a perfect matching.

The number of possible perfect matchings of K_{n,n} is n!, which is a lot to check. Since it’s useful to have bijections, it’s useful to have matchings, so we would like a simple way to check whether a bipartite graph has a matching. Hall’s Marriage Theorem gives a way to reduce the number of things to check to 2^n, which is still large. However, much more importantly, the condition for the existence of a matching has a form which is much easier to check in many applications. The statement is as follows:

Given bipartite graph G with vertex classes X and Y, there is a matching of X into Y precisely when for every subset A\subset X, |\Gamma(A)|\ge |A|, where \Gamma(A) is the set of vertices joined to some vertex in A, called the neighbourhood of A.

Taking A=X, it is clear that |Y|\ge |X| is a necessary condition for the result to hold, unsurprisingly. Perhaps the most elementary standard proof proceeds by induction on the size of X, taking the smallest A to give a contradiction, then using the induction hypothesis to lift smaller matchings up to the original graph. This lifting is based on the idea that a subset relation between sets induces a subset relation between their neighbourhoods.

In this post, I want to consider this theorem as a special case of the Max-Flow-Min-Cut Theorem, as this will support useful generalisations much more easily. The latter theorem is a bit complicated notationally to set up, and I don’t want it to turn into the main point of this post, so I will summarise. The Wikipedia article, and lots of sets of lecture notes are excellent sources of more detailed definitions.

The setting is a weakly-connected directed graph, with two identified vertices, the source, with zero indegree, and the sink, with zero outdegree. Every other vertex lies on a path (not necessarily unique) between the source and the sink. Each edge has a positive capacity, which should be thought of as the maximum volume allowed to flow down the edge. A flow is a way of assigning values to each edge so that they do not exceed the capacity, and there is volume conservation at each interior vertex. That is, the flow into the vertex is equal to flow out of the vertex. The value of the flow is the sum of the flows out of the source, which is necessarily equal to the sum of flows into the sink.

A cut is a partition of the vertices into two classes, with the source in one and the sink in the other. The value of a cut is then the sum of the capacities of any edges going from the class containing the source to the class containing the sink. In most examples, the classes will be increasing, in the sense that any path from source to sink changes class exactly once.

The Max-Flow-Min-Cut Theorem asserts that the maximum value of a flow through the system is equal to the minimum value of a cut. The proof is elementary, though it relies on defining a sensible algorithm to construct a minimal cut from a maximal flow that is not going to be interesting to explain without more precise notation available.

First we explain why Hall’s Marriage Theorem is a special case of this result. Suppose we are given the setup of HMT, with edges directed from X to Y with infinite capacity. We add edges of capacity 1 from some new vertex x_0 to each vertex of X, and from each vertex of Y to a new vertex y_0. The aim is to give necessary and sufficient conditions for the existence of a flow of value |X|. Note that one direction of HMT is genuinely trivial: if there is a matching, then the neighbourhood size condition must hold. We focus on the other direction. If the maximum flow is less than |X|, then there should be a cut of this size as well. We can parameterise a cut by the class of vertices containing the source, say S. Let A=SnX. If \Gamma(A)\not\subset S, then there will be an infinite capacity edge in the cut. So if we are looking for a minimal cut, this should not happen, hence \Gamma(A)\subset S if S is minimal. Similarly, there cannot be any edges from X\A to \Gamma(A). The value of the cut can then be given by

|\Gamma(A)|+|X|-|A|, which is at least |X| if the neighbourhood size assumption is given. Note we can use the same method with the original edges having capacity one, but we have to track slightly more quantities.

This topic came up because I’ve been thinking about fragmentation chains over this holiday. I have a specific example concerning forests of unrooted trees in mind, but won’t go into details right now. The idea is that we often have distributions governing random partitions of some kind, let’s say of [n]. Conditioning on having a given number of classes might give a family of distributions P_{n,k} for the partitions of [n] into k parts. We would be interested to know how easy it is to couple these distributions in a nice way. One way would be via a coalescence or fragmentation process. In the latter, we start with [n] itself, then at each step, split one of the parts into two according to some (random, Markovian) rule. We are interested in finding out whether such a fragmentation process exists for a given distribution.

It suffices to split the problem into single steps. Can we get from P_{n,k} to P_{n,k+1}?

The point I want to make is that this is just a version of Hall’s Marriage Theorem again, at least in terms of proof method. We can take X to be the set of partitions of [n] into k parts, and Y the set of partitions into (k+1) parts. Then we add a directed edge with infinite capacity between x\in X and y\in Y if y can be constructed from x by breaking a part into two. Finally, we connect a fresh vertex x_0 to each edge in X, only now we insist that the capacity is equal to P_{n,k}(x), and similarly an edge from y to y_0 with capacity equal to P_{n,k+1}(y). The existence of a fragmentation chain over this step is then equivalent to the existence of a flow of value 1 in the directed graph network.

Although in many cases this remains challenging to work with, which I will explore in a future post perhaps, this is nonetheless a useful idea to have in mind when it comes to deciding on whether such a construction is possible for specific examples.

The Rearrangement Inequality

A favourite result of many students doing olympiad inequality problems is the so-called Rearrangement Inequality. This is a mathematical formulation of the idea well-known to even the smallest of child that if you prefer cakes to carrots then if you are offered two of one and one of the other, you should take two of the one you prefer!

At a more formal level, it says that given two strings of non-negative numbers

a_1\le a_2,\le \ldots\le a_n, \quad b_1\le b_2\le \ldots\le b_n,

if you want to form a sum of products of pairs, like

a_1b_4+a_2b_1+a_3b_3+\ldots,

you get the largest result if you take

a_1b_1+a_2b_2+\ldots+a_nb_n.

Formally, for any permutation \sigma \in S_n,

a_1b_1+\ldots+a_nb_n\ge a_1b_{\sigma(1)}+\ldots+a_nb_{\sigma(n)}\ge a_1b_n+\ldots+a_nb_1.

That is, you multiply the largest terms in each sequence together.

The notation to describe to equality case is a bit annoying. Essentially, the sums are equal if and only if the summands exactly correspond. If the sequences are strictly increasing, then equality holds only if the permutation \sigma=\text{id}.

This result is nice because, although it is rarely explicitly useful, it goes in a different direction from the standard scheme of results strengthening AM-GM, Cauchy-Schwarz and so on, and is in some sense more intuitive than these more well-known inequalities, at least in the form presented in an olympiad context.

I was thinking about this partly because it’s a nice result in its own right, but also because it came up in a research problem to do with comparing the expected likelihood of different tree isomorphism classes arising in an inhomogeneous, but relatively well-behaved, random graph model. The probability of forming a given tree is a homogeneous multivariate polynomial in the ages of the vertices that would form the tree. It is then necessary to integrate over the joint distribution (which fortunately is a product in the limit) of the ages of the vertices. I was playing around with this by considering what seemed to be the extreme cases: the star and the path. I was working with the relatively simple case n=4, and it struck me that perhaps the polynomial for the star was always at least as large as that for the path. This would be convenient as it would avoid the need for a horrific-looking integral calculation. This turned out to be true. My first method was a heavy but uncontroversial convexity and stationary point argument, but I found a pair of vectors embedded in the desired inequality on which I could deploy rearrangement.

Anyway, I thought I should be able to come up with a nice proof, and I think this is one. I think this is particularly nice because it is a demonstration that one can do a proof by induction without explicitly inducting on the natural numbers.

We begin with a base case, which is the theorem for n=2, even though we will not be doing induction in the canonical way. We are required to prove that given

a_1\le a_2,\quad b_1\le b_2,

that

a_1b_1+a_2b_2\ge a_1b_2+a_2b_1,

since these are the only available permutations. Moving some terms around gives

(a_2-a_1)(b_2-b_1)\ge 0,

which is true by construction, and so the n=2 result follows.

We now move straight to the general n case. We focus on the left of the two inequalities in the statement of the result, since the other will follow by an identical method, applied in reverse. We consider the case where \sigma is a transposition. For example, we might consider 12435. When we write out the result we want:

a_1b_1+a_2b_2+a_3b_3+a_4b_4+a_5b_5\ge a_1b_1+a_2b_2+a_3b_4+a_4b_3+a_5b_5,

we realise that many of the terms cancel, and the content of the theorem reduces to the n=2 case we have already dealt with. Obviously, this holds equally well whenever \sigma is a transposition. Similarly, if \sigma is a product of two disjoint transpositions, which means that two disjoint pairs of elements are interchanged, we can apply the n=2 case twice, then add on the extra terms to get the result.

In fact, we can do much better than this, by using the fact that any permutation can be expressed as a product of transpositions. We need to be careful about the risk of asserting that every time we multiply the permutation \sigma by a transposition, the value of the associated sum-product expression gets smaller. While the idea is correct, this cannot be generally true. After all, applying the same transposition twice returns us to the identity permutation!

We can nonetheless say something useful. If we start with a permutation

\sigma(1),\sigma(2),\ldots,\sigma(n),

and we interchange the ith and jth elements, to get,

\tau=\sigma(1),\ldots,\sigma(i-1),\sigma(j),\sigma(i+1),\ldots,\sigma(j-1),\sigma(i),\sigma(j+1),\ldots,\sigma(n),

then the product sum corresponding to \tau is less than or equal to the product sum corresponding to \sigma if $\sigma(i)\leq sigma(j)$, under the implicit assumption that i<j. In other words, we can prove the rearrangement inequality for any permutation \sigma that can be obtained from the identity by repeatedly interchanging elements that are initially in increasing order. Essentially, we have defined a partial ordering on the set of permutations.

It suffices to check that all permutations have this property. In fact, this is relatively easy. We can move element n to its required position in \sigma by successively swapping with (n-1), (n-2), etc. If we set this up as an inductive argument, we can finish by applying the hypothesis to the remaining (n-1) elements, which are in the same order as the identity permutation on [n-1].

So we have proved the left-hand side of the Rearrangement Inequality. In fact, this partial ordering framework makes it clear how to prove the right-hand side. By an identical argument, we can get from any permutation to the reverse identity by a similar set of operations.

The Combinatorial Nullstellensatz

I’ve been taking a TCC course this term on Additive Combinatorics, delivered via video link from Bristol by Julia Wolf. At some point once the dust of this term has settled, I might write some things about the ideas of the course I’ve found most interesting, in particular the tools of discrete Fourier analysis to get a hold on some useful combinatorial properties of subsets of \mathbb{Z}/n\mathbb{Z} for example.

For this post, I want to talk instead about a topic that was merely mentioned in passing, the Combinatorial Nullstellensatz. The majority of this post is based on Alon’s original paper, which can be found here, and Chapter 9 of Tao and Vu’s book Additive Combinatorics. My aim is to motivate the theorem, give a proof, introduce one useful application from additive combinatorics, and solve Q6 from IMO 2007 as a direct corollary.

What does Nullstellensatz mean? Roughly speaking, it seems to mean ‘a theorem specifying the zeros’. We will be specifying the zeros of a polynomial. We are comfortable with how the zeros of a complex-valued polynomial of one variable behave. The number of zeros is given precisely by the degree of the polynomial (allowing appropriately for multiplicity). It is generally less clear how we might treat the zeros of a polynomial of many variables. The zero set is likely to be some surface, perhaps of dimension one less than the number of variables. In particular, it no longer really makes sense to talk about whether this set is finite or not. The Combinatorial Nullstellensatz gives us some control over the structure of this set of zeros.

The idea behind the generalisation is to view the Fundamental Theorem of Algebra as a statement not about existence of roots, but rather about (combinatorial) existence of non-roots. That is, given a polynomial P(x) of degree n, for any choice of (n+1) complex numbers, at least one of them is not a root of P. This may look like a very weak statement in this context, where we only expect finitely many roots anyway, but in a multivariate setting it is much more intuitively powerful.

Recall that the degree of a monomial is given by the sum of the exponents of the variables present. So the degree of 4x^2 y^3 z is 6. The degree of a polynomial is then given by the largest degree of a monomial in that polynomial. A polynomial P(x_1,\ldots,x_n) over a field F with degree d might have lots of monomial terms of degree d. Suppose one of these monomials is x_1^{d_1}\ldots x_n^{d_n}, where \sum d_i=d. Then one version of the Combinatorial Nullstellensatz asserts that whenever you take subsets of the base field S_i\subset F with |S_i|\ge d_i+1, then there is a point with x_i\in S_i such that P(x_1,\ldots,x_n)=0.

In other words, you can’t have a box (ie product of sets) of dimension d_1+1 \times d_2+1 \times\ldots\times d_n+1 on which the polynomial is zero.

Unsurprisingly, the proof proceeds by induction on the number of variables. Alon’s result proceeds via a more general theorem giving information about the possibility of writing multinomial polynomials as linear combinations of polynomials in one variable.

We would like to start this induction by fixing the x_n co-ordinate, then viewing P as a polynomial in x_1,\ldots,x_{n-1} only. One problem with this approach is that the largest degree monomials in P are not necessarily still the largest degree monomials in P with x_n fixed. So we need to apply a division algorithm argument.

I’m going to miss some steps so as to keep this of suitable blog post length. The key idea is to apply the division algorithm to P with respect to the simplest polynomial that is zero on all of S_n, which we define as:

g(x_n)=\prod_{s_n\in S_n}(x_n-s_n).

We can decompose as

P(x_1,\ldots,x_n)=q_n(x_1,\ldots,x_n)g(x_n)+\sum_{j=0}^{|S_n|-1}r_{n,j}(x_1,\ldots,x_{n-1})x_n^j.

So now we ask where the term x_1^{d_1}\ldots x_n^{d_n} is coming from, bearing in mind that d_n<|S_n|. The lower order terms in g cannot contribute to this, as  they cannot be of maximal degree. Also, the first term in q_n(\mathbf{x})g(x_n) cannot contribute as the exponent of x_n is too large. So the term in question must be coming from r_{n,d_n}(x_1,\ldots,x_{n-1})x_n^{d_n}. So now we can apply the induction hypothesis to the polynomial r_{n,d_n} to find $x_1\in S_1,\ldots, x_{n-1}\in S_{n-1}$ such that r_{n,d_n}(x_1,\ldots,x_{n-1} is non-zero. With these values, we can view the remainder as a polynomial in x_n of degree |S_n|>d_n, and so there is an x_n\in S_n such that

\sum_{j=0}^{|S_n|}r_{n,j}(x_1,\ldots,x_{n-1})x_n^j)\neq 0.

This concludes the proof by induction.

I want to discuss two relatively simple applications. The first is the Cauchy-Davenport Theorem, which one might view as the first non-trivial theorem in additive combinatorics, giving a bound on the size of a sumset.

Theorem (Cauchy-Davenport): Given A, B non-empty subsets of Z_p for p a prime, then

|A+B|\geq \min\{p,|A|+|B|-1\}.

( A+B:=\{c: c=a+b,a\in A,b\in B\} )

Note that the result isn’t especially surprising. Providing some sort of ordering to the elements of A and B might be a sensible way to proceed. Certainly if they were sets in \mathbb{Z}, this would give a proof immediately.

Proof: Only the case |A|+|B| <= p is interesting. Following Alon’s argument, suppose that |A+B| <= |A|+|B|-2, and let C=A+B. Set f(x,y)=\prod_{c\in C}(x+y-c), so f(a,b)=0 for all a\in A,b\in B.

Then the coefficient of x^{|A|-1}y^{|B|-1} in f is \binom{|A|+|B|-2}{|A|-1} as we have to choose which of the terms in the product supply an x and which supply a y. This is non-zero (in Z_p recall) since the upper integer is less than p. The Combinatorial Nullstellensatz then gives a contradiction.

My second example is from the IMO in Vietnam which I attended. I spent a lot of time thinking about this problem, but made no progress.

IMO 2007 Question 6: Let n be a positive integer. Consider

S=\{(x,y,z) | x,y,z\in \{0,1,\ldots,n\}, x+y+z>0\}

as a set of (n+1)^3-1 points in 3D space. Determine the smallest number of planes, the union of which contains S but does not include (0,0,0).

Answer: 3n. Consider the planes x+y+z = k for k varying between 1 and 3n. The aim is to prove that you cannot do it with fewer.

To prove this, suppose we can do with fewer planes, say k. We write the equation of a plane as

ax+by+cz-d=0.

Note that the d’s are non-zero as (0,0,0) must not be a solution. Then take the product of all these degree one polynomials together and subtract a multiple of

\prod_{i=1}^n (x-i)(y-i)(z-i),

with the multiple chosen so the resulting polynomial has a root at (0,0,0). (This constant must be non-zero to cancel the non-zero product of the d’s.) This resulting polynomial is degree 3n by construction, and x^ny^nz^n has a non-zero coefficient, but it is zero on the box [0,n]^3, which contradicts Combinatorial Nullstellensatz.