Independence and Association

Back when we did GCSE probability, we gave a definition of independent events as:

A and B are said to be independent if \mathbb{P}(A)\mathbb{P}(B)=\mathbb{P}(A\cap B).

We might also apply Bayes’ definition of conditional probability to say

\mathbb{P}(A|B)=\mathbb{P}(A)\quad\iff\quad A,B\text{ independent}\quad\iff\quad\mathbb{P}(B|A)=\mathbb{P}(B)

provided all the terms exist. (E.g. the definition of \mathbb{P}(B|A) is at the very least non-obvious if the probability of A is 0.) In my opinion, this is a more intuitive definition. For example, I think that when you toss two coins, the fact that the probability of the second coin being a tail is unaffected by whether the first is heads is more naturally ‘obvious’ than the fact that the joint probability of the two events is 1/4.

But, before getting too far into anything philosophical, it is worth thinking about an equivalent situation for non-independent events. We remark that by an argument identical to the one above:

\mathbb{P}(A|B)\geq\mathbb{P}(A)\quad\iff\quad \mathbb{P}(A\cap B)\geq\mathbb{P}(A)\mathbb{P}(B)\quad\iff\quad\mathbb{P}(B|A)\geq\mathbb{P}(B)

Informally, this says that if we know A occurs, it increases the likelihood of B occurring. If we were talking about two random variables, we might say that they were positively correlated. But of course, by considering RVs 1_A,1_B, the result above is precisely the statement that the indicator functions have positive correlation.
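As a quick sanity check of this last remark, here is a minimal sketch in Python. The sample space (one roll of a fair die) and the events A and B are illustrative choices of mine, not part of the argument above. It computes the covariance of the two indicators exactly and compares it with \mathbb{P}(A\cap B)-\mathbb{P}(A)\mathbb{P}(B).

```python
from fractions import Fraction

# Illustrative sample space: one roll of a fair die (an assumption for this example).
OMEGA = range(1, 7)
P = {w: Fraction(1, 6) for w in OMEGA}

def prob(event):
    """Probability of an event, given as a set of outcomes."""
    return sum(P[w] for w in event)

def indicator_covariance(a, b):
    """Cov(1_A, 1_B), computed directly as E[XY] - E[X]E[Y]."""
    e_xy = sum(P[w] * (w in a) * (w in b) for w in OMEGA)
    e_x = sum(P[w] * (w in a) for w in OMEGA)
    e_y = sum(P[w] * (w in b) for w in OMEGA)
    return e_xy - e_x * e_y

A = {4, 5, 6}  # "the roll is at least 4"
B = {2, 4, 6}  # "the roll is even"

cov = indicator_covariance(A, B)
gap = prob(A & B) - prob(A) * prob(B)
cond = prob(A & B) / prob(B)  # P(A | B)
```

Here cov and gap agree, and both are positive exactly when cond exceeds prob(A), matching the chain of equivalences above.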

Aim: To find a sufficient condition for positive correlation of random variables in a product measure.

Consider the following. Suppose A is an event which is positively correlated with the appearance of each edge. We might suspect that two such events A and B would be positively correlated. Instead, we consider a more concrete description. Recall that an event A is a subset of \Omega=\{0,1\}^E. Given w\in\Omega,e\in E, we write w^e\in\Omega for the configuration obtained by taking w and setting edge e to be open (note it may be open already). Now, we say event A is increasing, if

\forall w\in\Omega,\forall e\in E: w\in A\Rightarrow w^e\in A.

Note that this certainly implies the property previously mentioned, but the converse is not necessarily true.
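The definition of an increasing event translates directly into a brute-force check on a small edge set. The following is a sketch only: it assumes edges are labelled 0, 1, ..., n-1 and configurations are tuples of 0s and 1s, and the names open_edge and is_increasing are my own.

```python
from itertools import product

def open_edge(w, e):
    """Return w^e: the configuration w with edge index e set to open (1)."""
    return w[:e] + (1,) + w[e + 1:]

def is_increasing(event, n_edges):
    """Check that for every w in the event and every edge e, w^e is also in the event."""
    return all(open_edge(w, e) in event for w in event for e in range(n_edges))

N = 3  # illustrative number of edges
OMEGA = list(product((0, 1), repeat=N))

at_least_two_open = {w for w in OMEGA if sum(w) >= 2}
exactly_one_open = {w for w in OMEGA if sum(w) == 1}
```

With these definitions, is_increasing(at_least_two_open, N) holds, while is_increasing(exactly_one_open, N) fails, since opening a second edge leaves the event.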

Anyway, our revised aim will be to show that increasing events A and B are positively correlated for product measure.

For now, we approach the problem from the other direction, namely we attempt to find which measures on \{0,1\}^E have the property that A and B are positively correlated for all increasing A, B. Note that as before, we can think of this as \mathbb{E}1_A1_B\geq\mathbb{E}1_A\mathbb{E}1_B, and again here it is useful to rephrase our framework in terms of random variables. There is a natural (product) partial ordering of \Omega=\{0,1\}^E, and from this there is an easy notion of increasing random variables. Recall a random variable is defined as a measurable map \Omega\rightarrow\mathbb{R} so no further work is required.

X is increasing if w\geq w'\Rightarrow X(w)\geq X(w').

So we clarify our aim, which is to find a condition on the measure \mu such that \mu(XY)\geq \mu(X)\mu(Y) for all increasing X, Y. When this occurs, we say \mu is positively associated. Note that this is equivalent to \mu(A\cap B)\geq \mu(A)\mu(B) for all increasing events A, B. Why? We can build up X and Y from indicator functions of increasing events like \{X\geq x\} by the usual monotone class argument.

On the way, we need a partial ordering on the set of probability measures. Obviously, if \mu(A)\leq \nu(A) for all events A, then in fact \mu=\nu! So instead we say \mu\leq_{st}\nu if \mu(A)\leq \nu(A) for all increasing A. This is called the stochastic ordering, and there is a technical result of Strassen, proving the intuitively obvious claim that if \mu_1\leq \mu_2, then we can couple the measures in a natural way. Formally:

Theorem: \mu_1\leq\mu_2 \iff \exists a probability measure \nu on \Omega^2 such that the marginals are \mu_1,\mu_2 and

\nu(\{(w_1,w_2):w_1\leq w_2\})=1.
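In one special case the coupling can be written down by hand, which may make the theorem feel less abstract. If \mu_1 and \mu_2 are product measures with edge densities p_1\leq p_2, we can drive both configurations with the same uniform random variable on each edge. The sketch below illustrates only that special case, not Strassen's theorem in general; the function name and the parameter values are illustrative.

```python
import random

def coupled_sample(p1, p2, n_edges, rng):
    """Sample (w1, w2) with w1 ~ product(p1) and w2 ~ product(p2), sharing one uniform per edge."""
    us = [rng.random() for _ in range(n_edges)]
    w1 = tuple(int(u < p1) for u in us)
    w2 = tuple(int(u < p2) for u in us)
    return w1, w2

rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [coupled_sample(0.3, 0.6, 5, rng) for _ in range(1000)]

# Because p1 <= p2, every edge open in w1 is also open in w2.
ordered = all(a <= b for w1, w2 in samples for a, b in zip(w1, w2))
```

Each marginal is a product measure with the right density, and the pair is ordered with probability 1, which is exactly the shape of the measure \nu in the theorem.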

Our main result will be the FKG inequality which asserts that when \mu satisfies the following FKG lattice property

\mu(w_1\vee w_2)\mu(w_1\wedge w_2)\geq \mu(w_1)\mu(w_2),\quad\forall w_1,w_2\in\Omega

then \mu is positively associated. We will prove the case |E|<\infty.
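For a small edge set the lattice property can be verified exhaustively. The sketch below does so for product measure with three edges and density 1/3 (both illustrative choices); the helper names are my own. For product measure the condition in fact holds with equality, because w_1\vee w_2 and w_1\wedge w_2 together have exactly as many open edges as w_1 and w_2 do.

```python
from fractions import Fraction
from itertools import product

def product_measure(w, p):
    """Mass of configuration w when each edge is open independently with probability p."""
    k = sum(w)
    return p ** k * (1 - p) ** (len(w) - k)

def join(w1, w2):
    """Coordinatewise maximum, w1 v w2."""
    return tuple(max(a, b) for a, b in zip(w1, w2))

def meet(w1, w2):
    """Coordinatewise minimum, w1 ^ w2."""
    return tuple(min(a, b) for a, b in zip(w1, w2))

def satisfies_fkg_lattice(mu, omega):
    """Brute-force check of mu(w1 v w2) mu(w1 ^ w2) >= mu(w1) mu(w2) over all pairs."""
    return all(mu(join(a, b)) * mu(meet(a, b)) >= mu(a) * mu(b)
               for a in omega for b in omega)

N, p = 3, Fraction(1, 3)  # illustrative choices
OMEGA = list(product((0, 1), repeat=N))
mu = lambda w: product_measure(w, p)

holds = satisfies_fkg_lattice(mu, OMEGA)
with_equality = all(mu(join(a, b)) * mu(meet(a, b)) == mu(a) * mu(b)
                    for a in OMEGA for b in OMEGA)
```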

We proceed by showing that \mu_1\leq\mu_2, where \mu_1=\mu and \mu_2\propto Y\mu_1 (that is, Y\mu_1 rescaled to be a probability measure), for Y a positive increasing RV. [Note that we are now suppressing the ‘st’ subscript, as context makes the use clear.]
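It may help to spell out why this domination gives positive association. This is a sketch under two assumptions: that Y is strictly positive (which costs nothing, since adding a constant to Y changes both sides of the target inequality by the same amount), and that stochastic domination has been extended from increasing events to increasing random variables. Taking \mu_1=\mu and \mu_2(w)=Y(w)\mu(w)/\mu(Y), we have for every increasing X:

```latex
\begin{align*}
\mu_1 \leq_{st} \mu_2
  &\implies \mu_2(X) \geq \mu_1(X) \\
  &\iff \frac{\mu(XY)}{\mu(Y)} \geq \mu(X) \\
  &\iff \mu(XY) \geq \mu(X)\,\mu(Y).
\end{align*}
```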

To show this, we prove the more general Holley’s Theorem:

This states that if two positive probability measures satisfy a related lattice condition:

\mu_2(w_1\vee w_2)\mu_1(w_1\wedge w_2)\geq \mu_1(w_1)\mu_2(w_2)\quad\forall w_1,w_2\in\Omega

then we have the stochastic domination result: \mu_1\leq \mu_2.

Note that the lattice condition states, very informally, that adding edges results in a greater relative increase with respect to the measure \mu_2, which has a natural similarity to the definition of stochastic domination.

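We prove this, perhaps unexpectedly, by resorting to a Markov chain. We note that there is a Markov chain on \Omega with equilibrium distribution given by \mu_1. This is simple: the non-zero transition rates are those given by the addition or removal of a single edge. Write w_e for the configuration w with edge e set to be closed, by analogy with w^e. Assume that edges are added at unit rate, and that edges are removed with rate: G(w^e,w_e)=\frac{\mu_1(w_e)}{\mu_1(w^e)}. Then detailed balance holds, as \mu_1(w_e)\cdot 1=\mu_1(w^e)G(w^e,w_e), so \mu_1 is indeed the equilibrium distribution.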

Similarly, we can construct a Markov chain on state space \Omega^2, where non-zero transitions are given by the addition of an edge to both states in the pair, the removal of an edge from both states in the pair, and the removal of an edge from only the first state in the pair. Note that, as before, we may be ‘adding’ an edge which is already present. None of these transitions can break the ordering of the pair, so, assuming we start in the set \{(\pi,w):\pi\leq w\}, this choice means that we are restricting the state space to this set. We need the transition rate of the third type of transition to have the form: \frac{\mu_1(\pi_e)}{\mu_1(\pi^e)}-\frac{\mu_2(w_e)}{\mu_2(w^e)}. So the lattice condition precisely confirms that this is non-negative, and thus we have a well-constructed Markov chain. The marginals have equilibrium distributions \mu_1,\mu_2 by construction, and by the general theory of Markov chains, there is an equilibrium distribution, and this leaves us in precisely the right position to apply Strassen to conclude the result.
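To see exactly where the lattice condition enters, here is the non-negativity check written out. It assumes \pi\leq w, and uses w_e for the configuration w with edge e closed, by analogy with w^e. Apply the condition of Holley's Theorem with w_1=\pi^e and w_2=w_e, then divide through by \mu_1(\pi^e)\mu_2(w^e), which is positive because both measures are:

```latex
\begin{gather*}
\pi^e \vee w_e = w^e, \qquad \pi^e \wedge w_e = \pi_e
  \qquad (\text{since } \pi \leq w), \\
\mu_2(w^e)\,\mu_1(\pi_e) \geq \mu_1(\pi^e)\,\mu_2(w_e)
  \quad\iff\quad
\frac{\mu_1(\pi_e)}{\mu_1(\pi^e)} - \frac{\mu_2(w_e)}{\mu_2(w^e)} \geq 0.
\end{gather*}
```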

Summary of consequences: We have demonstrated that product measure is positively associated, as it certainly satisfies the FKG condition. Recall that this is what we had suspected intuitively for reasons given at the start of this account. Next time, I will talk about the most natural companion result, the BK inequality, and the stronger Reimer’s Inequality.
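The claim that product measure is positively associated can also be confirmed by brute force on a tiny example. The sketch below takes three edges and density 1/3 (illustrative choices), enumerates every increasing event, and checks \mu(A\cap B)\geq\mu(A)\mu(B) for all pairs. It is a numerical sanity check, not a substitute for the proof.

```python
from fractions import Fraction
from itertools import combinations, product

N, p = 3, Fraction(1, 3)  # illustrative choices
OMEGA = list(product((0, 1), repeat=N))

def mu(event):
    """Product-measure probability of an event (a collection of configurations)."""
    return sum(p ** sum(w) * (1 - p) ** (N - sum(w)) for w in event)

def is_increasing(event):
    """An event is increasing if opening any edge of a member gives another member."""
    return all(w[:e] + (1,) + w[e + 1:] in event for w in event for e in range(N))

# Enumerate all 2^8 = 256 subsets of OMEGA and keep the increasing ones.
all_events = [frozenset(c) for r in range(len(OMEGA) + 1)
              for c in combinations(OMEGA, r)]
increasing = [a for a in all_events if is_increasing(a)]

positively_associated = all(mu(a & b) >= mu(a) * mu(b)
                            for a in increasing for b in increasing)
```

Exact rational arithmetic via Fraction avoids any doubt about floating-point rounding in the cases where the inequality is an equality, such as when A or B is the whole of \Omega.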

References: Both the motivation and the material are derived from Prof. Grimmett’s Part III course, Percolation and Related Topics, which was one of the mathematical highlights of the year. This account of the subject is a paraphrase of his lecture notes, which were themselves based on his book Probability on Graphs. Mistakes, naturally, are mine. Background on the course, and an online source of the book can be found on the course website here.

The Sample Space for a Die

Also featuring: a non-Lebesgue measurable set and Dynkin’s Lemma.

At the National Maths Summer School last week, the senior students and I spent a while talking about probability spaces, and in particular, when it was reasonable to assign a probability to a potential event. We considered rolling a standard die, and the probabilities \mathbb{P}(D\in\{\}), the empty event, and \mathbb{P}(D=7). Though it is tempting to conclude that the latter must be zero, in the end we decided that it should not actually be defined at all.

Why? Well, if we accept \mathbb{P}(D=7)=0, then by extension we must accept \mathbb{P}(D=137) and \mathbb{P}(D=\$1.50) also both exist and are zero. What have we gained? In reality very little. But the cost is this: we might define an event to be any subset of the sample space. Before, our sample space was \Omega=\{1,2,3,4,5,6\}, and so there are exactly 64 events, including the possibly counter-intuitive empty event \{\}. This is finite, which is always nice. With the extra events, however, we must extend the sample space to \Omega=\{1,2,3,4,5,6,7,\ldots\}, where “…” means “the rest of the universe”. This is a fairly exotic mathematical object, and really has no place in any sensible discussion.
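The count of 64 is easy to confirm by listing every subset of the six-point sample space. In this sketch, which assumes a fair die and uses names of my own choosing, an event is a frozenset of faces, and the set \{7\} is simply not among the events, so asking for its probability is a question about something outside the model.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = (1, 2, 3, 4, 5, 6)

# Every subset of OMEGA, from the empty event up to OMEGA itself.
events = [frozenset(c) for r in range(len(OMEGA) + 1)
          for c in combinations(OMEGA, r)]

def prob(event):
    """Probability of an event for a fair die: each face has mass 1/6."""
    return Fraction(len(event), len(OMEGA))

empty_event_probability = prob(frozenset())
seven_is_an_event = frozenset({7}) in events
```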

This reminded me of one of my favourite results from Part II Probability and Measure. Of course, for uncountable sample spaces, we cannot necessarily assume all subsets of \Omega are measurable. Instead we build up a sigma-algebra of measurable sets, most importantly for Lebesgue measure on \mathbb{R}. An immediate question to ask is: are all subsets of \mathbb{R} Lebesgue-measurable?

And the answer is ‘no’. Why? The standard counterexample is as follows. Consider Lebesgue measure on the unit interval U=\mathbb{R}/\mathbb{Z}, with endpoints identified. Now consider the rationals in U, which actually form a subgroup \mathbb{Q}/\mathbb{Z}\leq U, with uncountably many cosets. Pick an element from each coset (*). Call this set A. Then, working modulo 1, U is the disjoint union U=\cup_{q\in\mathbb{Q}\cap U}(A+q). If A is Lebesgue-measurable, then so is A+q, and \mu(A+q)=\mu(A) (**). Combining these two results, using countable additivity:

\mu(U)=\begin{cases}0 & \text{if }\mu(A)=0,\\ \infty & \text{otherwise.}\end{cases}

This is a contradiction, since \mu(U)=1, and hence we conclude that A is not Lebesgue-measurable.

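Remark on (*): This relies on the Axiom of Choice. Some form of choice is genuinely needed: Solovay showed that, assuming the existence of an inaccessible cardinal, there is a model of set theory without full AC in which every set of reals is Lebesgue-measurable. However, the existence of non-Lebesgue measurable sets is not equivalent to AC, as it follows from strictly weaker principles.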

Remark on (**): I was suddenly unsure that this was obvious. I mean, this is such a weird set that it is in fact not measurable: why should its hypothetical measure be translation invariant? It is tempting to argue vaguely, by saying that the construction of Lebesgue measure is invariant under translation at all steps. As so often with elementary measure theory, recourse to Dynkin’s Lemma is more reliable.

Let D be the collection of measurable sets whose measure is invariant under translation, that is, measurable B with \mu(B+q)=\mu(B) for every q. By definition, D is invariant under translation (of its elements). D certainly contains all intervals in U, which form a pi-system generating \mathcal{B}([0,1]). We now check that D is a d-system. The presence of U and the empty set is clear. If we have B\subset A, both in D, then also A\backslash B\in D, since (A\backslash B)+q=(A+q)\backslash(B+q), and so \mu((A\backslash B)+q)=\mu(A+q)-\mu(B+q)=\mu(A)-\mu(B)=\mu(A\backslash B). Finally, given A_1\subset A_2\subset\ldots, all in D, the translates A_n+q increase to (\cup A_i)+q, so by continuity of measure \mu((\cup A_i)+q)=\lim\mu(A_n+q)=\lim\mu(A_n)=\mu(\cup A_i), and hence \cup A_i\in D.

Now we can deploy Dynkin’s Lemma. D contains the sigma-algebra generated by the intervals, that is, every Borel set. Every Lebesgue-measurable set differs from a Borel set only by a subset of a Borel null set, and translates of null sets are null, so D must be the sigma-algebra of all measurable sets, as we wanted.

NMSS 2012 – Strong Law of Large Numbers for a Coin Flip

The 2012 National Mathematics Summer School, held at Queens’ College, affiliated to the University of Birmingham, and run by the United Kingdom Mathematics Trust, is drawing to a close today. I gave a problem-based talk on Probability to two groups of 20 junior students (15/16 year olds selected based on strong performance in national competitions for their age groups), and a lecture to the six senior students (some of 2011’s strongest and most enthusiastic junior students) on the SLLN for the simplest non-trivial random variable imaginable: a coin flip.

In case any of the students, or indeed anyone else, is interested, a text of the problems, together with the worked solutions that took up the majority of the lecture, will be available here for a short while. Do email me if there are any questions!

Senior Probability Solutions. [Link removed. Email me if interested]