Lecture 8 – Bounds in the critical window

Posted on March 23, 2019 by dominicyeo

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

Preliminary – positive correlation, Harris inequality

I wrote about independence, association, and the FKG property a long time ago, while I was still an undergraduate taking a first course on Percolation in Cambridge. That post is here. In the lecture I discussed the special case of the FKG inequality applied in the setting of product measure setting, of which the Erdos-Renyi random graph is an example, and which is sometimes referred to as the Harris inequality.

Given two increasing events A and B, say for graphs on [n], then if $\mathbb{P}$ is product measure on the edge set, we have

$\mathbb{P}(A\cap B)\ge \mathbb{P}(A)\mathbb{P}(B).$

Intuitively, since both A and B are ‘positively-correlated’ with the not-rigorous notion of ‘having more edges’, then are genuinely positively-correlated with each other. We will use this later in the post, in the form $\mathbb{E}[X|A]\ge \mathbb{E}[X]$ , whenever X is an increasing RV and A is an increasing event.

The critical window

During the course, we’ve discussed separately the key qualitative features of the random graph $G(n,\frac{\lambda}{n})$ in the

subcritical regime when $\lambda<1$ , for which we showed that all the components are small, in the sense that $\frac{1}{n}|L_1| \stackrel{\mathbb{P}}\rightarrow 0$ , although the same argument would also give $|L_1|\le K\log n$ with high probability if we used stronger Chernoff bounds;
supercritical regime when $\lambda>1$ , for which there is a unique giant component , ie that $\frac{1}{n}|L_1|\stackrel{\mathbb{P}}\rightarrow \zeta_\lambda>0$ , the survival probability of a Galton-Watson branching process with Poisson( $\lambda$ ) offspring distribution. Arguing for example by a duality argument shows that with high probability all other components are small in the same sense as in the subcritical regime.

In between, of course we should study $G(n,\frac{1}{n})$ , for which it was known that $L_1\stackrel{d}\sim n^{2/3},\, L_2\stackrel{d}\sim n^{2/3},\ldots$ . (*) That is, the largest components are on the scale $n^{2/3}$ , and there are lots of such critical components.

In the early work on random graphs, the story ended roughly there. But in the 80s, these questions were revived, and considerable work by Bollobas and Luczak, among many others, started investigating the critical setting in more detail. In particular, between the subcritical and the supercritical regimes, the ratio $\frac{|L_2|}{|L_1|}$ between the sizes of the largest and second-largest components goes from ‘concentrated on 1’ to ‘concentrated on 0’. So it is reasonable to ask what finer scaling of the edge probability $p(n)$ around $\frac{1}{n}$ should be chosen to see this transition happen.

Critical window

In this lecture, we studied the critical window, describing sequences of probabilities of the form

$p(n)=\frac{1+\lambda n^{-1/3}}{n},$

where $\lambda\in(-\infty,+\infty)$ . (Obviously, this is a different use of $\lambda$ to previous lectures.)

It turns out that as we move $\lambda$ from $-\infty$ to $+\infty$ , this window gives exactly the right scaling to see the transition of $\frac{|L_2|}{|L_1|}$ described above. Work by Bollobas and Luczak and many co-authors and others in the 80s establish a large number of results in this window, but for the purposes of this course, this can be summarised as saying that the critical window has the same scaling behaviour as $p(n)=1/n$ , with a large number of components on the scale $\sim n^{2/3}$ (see (*) earlier), but different scaling limits.

Note: Earlier in the course, we have discussed local limits, in particular for $G(n,\lambda/n)$ , where the local limit is a Galton-Watson branching process tree with offspring distribution $\mathrm{Poisson}(\lambda)$ . Such local properties are not sufficient to distinguish between different probabilities within the critical window. Although there are lots of critical components, it remains the case that asymptotically almost all vertices are in ‘small components’.

The precise form of the scaling limit for

$\frac{1}{n^{2/3}} \left( |L_1|, |L_2|, |L_3|,\ldots \right)$

as $n\rightarrow\infty$ was shown by Aldous in 1997, by lifting a scaling limit result for the exploration process, which was discussed in this previous lecture and this one too. Since Brownian motion lies outside the assumed background for this course, we can’t discuss that, so this lecture establishes upper bounds on the correct scale of $|L_1|$ in the critical window. Continue reading →

The Envelope ‘Paradox’

Posted on September 17, 2016 by dominicyeo

At the recent IMO in Hong Kong, there were several moments where the deputy leaders had to hang around, and I spent some of these moments discussing the following problem with Stephen Mackereth, my counterpart from New Zealand. He’s a mathematically-trained philosopher, so has a similar level of skepticism to me, but for different reasons, regarding supposed paradoxes in probability. Because, as we will see shortly, I don’t think this is a paradox in even the slightest fashion, I think there’s probably too much written about this on the internet already. So I’m aware that contributing further to this oeuvre is hypocritical, but we did the thinking in HKUST’s apparently famous Einstein Cafe, so it makes sense to write down the thoughts.

[And then forget about it for eight weeks. Oops.]

The ‘Paradox’

Here’s the situation. A cryptic friend gives you an envelope containing some sum of money, and shows you a second envelope. They then inform you that one of the envelopes contains twice as much money as the other. It’s implicit in this that the choice of which is which is uniform. You have the option to switch envelopes. Should you?

The supposed paradox arises by considering the amount in your envelope, say X. In the absence of further information, it is equally likely that the other envelope contains X/2 as 2X. Therefore, the average value of the other envelope is

$\frac12 \left(\frac{X}{2}+2X \right)= \frac54 X > X.$

So you should switch, since on average you gain money. But this is paradoxical, since the assignment of larger and smaller sums was uniform, so switching envelope should make no difference.

Probabilistic setup

This is not supposed to be a problem on a first-year probability exercise sheet. It’s supposed to be conducive to light discussion. So saying “I won’t engage with this problem until you tell me what the probability space is” doesn’t go down terribly well. But it is important to work out what is random, and what isn’t.

There are two sources of randomness, or at least ignorance. Firstly, there is the pair of values contained in the envelopes. Secondly, there is the assignment of this pair of values to the two envelopes. The second is a source of randomness, and this problem is founded on the premise that this second stage is ‘symmetric enough’ to smooth over any complications in the first stage. If we think that probability isn’t broken (and that’s what I think), then the answer is probably that the second stage isn’t symmetric enough.

Or, that the first stage isn’t very well-defined. In what follows, I’m going to make the second stage very symmetric, at the expense of setting up the first stage in what seems to me a reasonable way using the conventional language of probability theory to record our ignorance about the values in play.

So what’s the first stage? We must have a set of possible pairs of values taken by the envelopes. Let’s call this A, so

$A\subset \mathbb{A}:=\{(x,2x)\,:\, x\in (0,\infty)\}.$

Maybe we know what A is, but maybe we don’t, in which we should take $A=\mathbb{A}$ , on the grounds that any pair is possible. Suppose that your friend has chosen the pair of values according to some distribution on $\mathbb{A}$ , which we’ll assume has a density f, which is known by you. Maybe this isn’t the actual density, but it serves perfectly well if you treat it as *your* opinion on the likelihood. Then this actually does reduce to a problem along the lines of first-year probability, whether or not you get to see the amount in your envelope.

Suppose first that you do get to see the amount, and that it is x. Then the conditional probabilities that the pair is (x/2,x) or (x,2x) are, respectively

$\frac{f(x/2,x)}{f(x/2,x)+f(x,2x)},\quad \frac{f(x,2x)}{f(x/2,x)+f(x,2x)}.$

So you can work out your expected gain by switching, and decide accordingly. If you don’t know the value in your envelope, you can still work out the probability that it is better (in expectation) to switch, but this isn’t really a hugely meaningful measure, unless it is zero or one.

It’s worth noting that if you can view inside your envelope, and you know A has a particular form, then the game becomes determined. For example, if

$A\subset \{(n,2n), n\text{ an odd integer}\},$

then life is very easy. If you open your envelope and see an odd integer, you should switch, and if you see an even integer you shouldn’t.

We’ll return at the end to discuss a case where it is always better to switch, and why this isn’t actually a paradox.

Improper prior and paradox of resampling when $\mathbb{E}=\infty$

For now though, let’s assume that we don’t know anything about the amounts of money in the envelopes. Earlier, we said that “in the absence of further information, it is equally likely that the other envelope contains X/2 as 2X”. In the language of a distribution on $\mathbb{A}$ , we are taking the uniform measure. Of course this not a distribution, in the same way that there isn’t a uniform distribution on the positive reals.

However, if this is your belief about the values in the pair of envelopes, what do you think is the mean value of the content of your envelope? Well, you think all values are equally likely. So, even though this isn’t a distribution, you pretty much think the value of your envelope has infinite expectation.

[This is where the philosophy comes in I guess. Is (expressing uniform ignorance about the content of the other envelope given knowledge of your own) the same as (expressing uniform ignorance of both envelopes at the beginning)? I think it is, even though it has a different consequence here, since the former can be turned into a proper distribution, whereas the latter cannot.]

Let’s briefly consider an alternative example. It’s fairly easy to conjure up distributions which are almost surely finite but which have infinite expectation. For example $\mathbb{P}(X=2^k)=2^{-k}$ for k=1,2,…, which is the content of the *St. Petersburg paradox*, another supposed paradox in probability, but one whose resolution is a bit more clear.

Anyway, let X and Y be independent copies of such a distribution. Now suppose your friend offers you an envelope containing amount X. You look at the value, and then you are offered the opportunity to switch to an envelope containing amount Y. Should you?

Well, if expectation is what you care about, then you definitely should. Because with probability one, you are staring at a finite value in your envelope, whereas the other unknown envelope promises infinite expectation, which is certainly larger than the value that you’re looking at.

Is this also a paradox? I definitely don’t think it is. The expectation of the content of your envelope is infinite, the expected gain is infinite with probability one, which is consistent with the expected content of the other envelope being infinite. [Note that you don’t want to be claiming that the expectation of X-Y is zero.]

An example density function

As an exercise that isn’t necessarily hugely interesting, let’s assume that f, the distribution of the smaller of the pair, is $\mathrm{Exp}(\lambda)$ . So the mean of this smaller number is $1/\lambda$ . Then, conditional on seeing x in my envelope, the expected value of the number in the other envelope is

$\frac{\frac{x}{2} e^{-\lambda x/2} + 2x e^{-\lambda x}}{e^{-\lambda x/2}+ e^{-\lambda x}}.$ (*)

Some straightforward manipulation shows that this quantity is at least x (implying it’s advantageous to switch) precisely when

$e^{-\lambda x/2}\ge \frac12.$

That is, when $x\le \frac{2\log 2}{\lambda}$ . The shape of this interval should fit our intuition, namely that the optimal strategy should be to switch if the value in your envelope is small enough.

The point of doing this calculation is to emphasise that it ceases to be an interesting problem, and certainly ceases to be a paradox of any kind, once we specify f concretely. It doesn’t matter whether this is some true distribution (ie the friend is genuinely sampling the values somehow at random), or rather a perceived likelihood (that happens to be normalisable).

What if you should always switch?

The statement of the paradox only really has any bite if the suggestion is that we should always switch. Earlier, we discussed potential objections to considering the uniform prior in this setting, but what about other possible distributions f which might lead to this conclusion?

As at (*), we can conclude that when $f(x)+f(x/2)>0$ , we should switch on seeing x precisely if

$f(x)\ge 2f\left(\frac{x}{2}\right).$

Therefore, partitioning the support of f into a collection of geometric sequences with exponent 2, it is clear that the mean of f is infinite if everything is integer-valued. If f is real-valued, there are some complications, but so long as everything is measurable, the same conclusion will hold.

So the you-should-switch-given-x strategy can only hold for all values of x if f has infinite mean. This pretty much wraps up my feelings. If the mean isn’t infinite, the statement of the paradox no longer holds, and if it is infinite, then the paradox dissolves into a statement about trying to order various expectations, all of which are infinite.

Conclusions

Mathematical summary: it’s Bayes. Things may be exchangeable initially, but not once you condition on the value of one of them! Well, not unless you have a very specific prior.

Philosophical summary: everything in my argument depends on the premise that one can always describe the protagonist’s prior opinion on the contents of the pair of envelopes with a (possibly degenerate) distribution. I feel this is reasonable. As soon as you write down $\frac12 \cdot\frac{x}{2} + \frac12 \cdot2x$ , you are doing a conditional expectation, and it’s got to be conditional with respect to something. Here it’s the uniform prior, or at least the uniform prior restricted to the set of values that are now possible given the revelation of your number.

Second mathematical summary: once you are working with the uniform prior, or any measure with infinite mean, there’s no reason why

$\mathbb{E}\left[X|Y\right]>Y,$

with probability one (in terms of Y) should be surprising, since the LHS is (almost-surely) infinite while the RHS is almost surely finite, despite having infinite mean itself.

Responding to the Monty Hall Problem

Posted on October 10, 2013 by dominicyeo

It’s been a while since I got round to writing anything, so I’m easing myself back in by avoiding anything too mathematically strenuous. Shortly after my last post, there was a nice clip on Horizon featuring Marcus du Sautoy explaining the famous Monty Hall Problem to the ever-incredulous Alan Davies. The BBC article here shows the clip, and then an excellent account by probabilist John Moriarty explaining the supposed paradox.

Recall the situation. A game show is hosted by Monty Hall. At some point, to decide the grand prize, or whatever, a contestant plays the following lottery. There are three doors, and three prizes, one behind each. One of the prizes is worthwhile, normally a car, while the other two are rubbish. For reasons I don’t understand, it is normally said that behind the other two doors are goats. Anyway, MH knows where the car is from the beginning. Then the contestant picks a door, but doesn’t open it yet. MH then dramatically opens a different door, revealing a goat. The contestant then has the option to stick with his original choice, or switch to the third door. What should he or she do?

So the natural error to make is that since one possibility has been eliminated, since the remaining two were a priori equally likely, they should still be equally likely, and hence each occur with probability 1/2. So it makes no difference whether you stay or switch. But, of course, as well as removing information (ruling out one door), we have also gained information. The fact that MH chose to reveal that door, rather than the potential third door is important.

In my opinion, the best way to think about this is to imagine you are the host. The nice thing about being in this position is that then you can imagine you know everything about the configuration in advance, so are less likely to fall foul of counter-intuitive conditional probabilities. Anyway, there’s loads of symmetry, so from my point of view as the host, I only care whether the contestant initially selects the right door or the wrong door. If he selects the wrong door, then I need to open the ONLY OTHER wrong door. In other words, I have no choice in what I do next. I open the other wrong door. Therefore in this case the contestant should switch. If the contestant in fact selects the right door initially, I then have a choice about which door to open. However, it doesn’t really matter which I pick, by symmetry, and in either case the contestant would do better to stay with his original choice. Note that we have defined the outcome entirely by the initial choice. And as the host, I knew where the car was at the start, so from my point of view, unless he had somehow cheated, the contestant always had a 1/3 chance of guessing correctly. So it is better to switch 2/3 of the time.

The BBC’s weekly collation feature, The Loop, printed a couple of responses to the article. This is one:

“I believe that the problem is a game of two halves. In the first instance you have a 1:3 chance of getting the prize. Once you’ve made your initial choice, even though the ‘box’ contents remain the same, one ‘false’ choice is removed and you are then offered the chance to choose again. There is a play on words here, perhaps, with the option to ‘switch’. Effectively you are being asked to make a choice of 1:2. Now, if this is always going to happen then the odds of you winning the prize in the first instance is 1:2, as it doesn’t really matter whether or not you choose correctly the first time – one ‘false’ is going to be removed and you’ll be presented with just two options. It’s mind trickery. The first choice is totally irrelevant because you will always end up having to pick one option from two in order to attain the prize. In my humble opinion.”

I highlight this not to inspire contempt, since it is an easy mistake to make, even after reading as nice an explanation as Moriarty’s. Rather it is an excellent example of how using too many words can convince you an argument is complete before you’ve really made any concrete statement at all. Note that the sentences beginning ‘Effectively…’, ‘Now, if this…’ and ‘The first choice…’ all say exactly the same thing, which is, to use inappropriately formal terminology, exactly what the writer is trying to prove. The padding of ‘mind trickery’ and ‘humble opinions’ is as close as a maths argument can ever really come to an ad hominem.

So I was trying to come to a clean conclusion about exactly why the Monty Hall problem does such a good job at generating confusion. I think the following aspects of the set-up are important:

1) In general, people often get confused about the meaning of ‘random’. In some sense, the sequence 416235 is more random than the sequence 123456. However, typically, we use ‘random’ to mean ‘not determined in advance’. The key is that the bit where MH reveals a goat is not a random action. Whatever you do initially, you know that this will happen. So the fact that it does actually happen should not add extra information.

2) This fact is compounded by the time-steps at which decisions have to be made. It feels like the arc of the narrative is directing everything towards this final choice, which feels very separate from the initial choice. You can imagine the dramatic music that undoubtedly plays as MH opens the appropriate door, and the camera pans back to the contestant making their decision. All of which is an excellent way to suggest that something complicated has happened, when in fact in terms of what we are interested in overall, nothing has happened.

A point I hadn’t thought about before was that it is important that MH knows what happens in advance. Let’s develop this point and consider what happens if MH in fact has no knowledge of where the car is, and chooses a door at random. We will still assume that he must choose a different door to the one that the contestant originally chose.

Then, 1/3 of the time the contestant picks a goat, and MH picks the car, so he gets a goat whether he stays or switches. 1/3 of the time, the contestant picks a goat, and MH picks the other goat, so he should switch. And 1/3 of the time, the contestant originally picks the car, and then by default MH picks a goat, so he should stay. So, overall, it is equally preferable to stay as to switch.

I guess the morals of this story are: “be careful with conditional probabilities; sometimes apparently new information doesn’t change anything; and that it’s probably not a good idea to get into a discussion about the Monty Hall problem with a skeptic unless you have a piece of paper to hand on which to draw a probability table.”

Lesson 4: The Monty Hall Problem (heiyinsin.wordpress.com)
The Monty Hall Problem- Counter-intuition (13laic1.wordpress.com)
The Monty Hall Problem (robinsoons.wordpress.com)
Goat or No Goat: applying the Monty Hall problem to Deal or No Deal (theunforgivingminuteuk.wordpress.com)
Monty Hall problem: The probability puzzle that makes your head melt (misbehavedwoman.wordpress.com)
The probability puzzle that makes your head melt (bbc.co.uk)
How Randomness Rules Our World (kennethmutuma.wordpress.com)
The Loop: Monty who? (bbc.co.uk)

PS. A final moral is that, based on the links WordPress suggested, at least 10% of the internet is devoted to explanations of the Monty Hall problem…

Bayesian Inference and the Jeffreys Prior

Posted on May 10, 2013 by dominicyeo

Last term I was tutoring for the second year statistics course in Oxford. This post is about the final quarter of the course, on the subject of Bayesian inference, and in particular on the Jeffreys prior.

There are loads and loads of articles sitting around on the web contributing the debate about the relative merits of Bayesian and frequentist methods. I do not want to continue that debate here, partly because I don’t have a strong opinion, but mainly because I don’t really understand that much about the underlying issues.

What I will say is that after a few months of working fairly intensively with various complicated stochastic processes, I am starting to feel fairly happy throwing about conditional probability rather freely. When discussing some of the more combinatorial models for example, quite often we have no desire to compute or approximate complication normalising constants, and so instead talk about ‘weights’. And a similar idea underlies Bayesian inference. As in frequentist methods we have an unknown parameter, and we observe some data. Furthermore, we know the probability that such data might have arisen under any value of the parameter. We want to make inference about the value of the parameter given the data, so it makes sense to multiply the probability that the data emerged as a result of some parameter value by some weighting on the set of parameter values.

In summary, we assign a prior distribution representing our initial beliefs about the parameter before we have seen any data, then we update this by weighting by the likelihood that the observed data might have arisen from a particular parameter. We often write this as:

$\pi(\theta| x)\propto f(x|\theta)\pi(\theta)$ ,

or say that posterior = likelihood x prior. Note that in many applications it won’t be necessary to work out what the normalising constant on the distribution ought to be.

That’s the setup for Bayesian methods. I think the general feeling about the relative usefulness of such an approach is that it all depends on the prior. Once we have the prior, everything is concrete and unambiguously determined. But how should we choose the prior?

There are two cases worth thinking about. The first is where we have a lot of information about the problem already. This might well be the case in some forms of scientific research, where future analysis aims to build on work already completed. It might also be the case that we have already performed some Bayesian calculations, so our current prior is in fact the posterior from a previous set of experiments. In any case, if we have such an ‘informative prior’, it makes sense to use it in some circumstances.

Alternatively, it might be the case that for some reason we care less about the actual prior than about the mathematical convenience of manipulating it. In particular, certain likelihood functions give rise to conjugate priors, where the form of the posterior is the same as the form of the prior. For example, a normal likelihood function admits a normal conjugate prior, and a binomial likelihood function gives a Beta conjugate prior.

In general though, it is entirely possible that neither of these situations will hold but we still want to try Bayesian analysis. The ideal situation would be if the choice of prior had no effect on the analysis, but if that were true, then we couldn’t really be doing any Bayesian analysis. The Jeffreys prior is one natural candidate because it removes a specific problem with choosing a prior to express ignorance.

It sounds reasonable to say that if we have total ignorance about the parameter, then we should take the prior to be uniform on the set of possible values taken by the parameter. There are two potential objections to this. The first is that if the parameter could take any real value, then the prior will not be a distribution as the uniform distribution on the reals is not normalisable. Such a prior is called improper. This isn’t a huge problem really though. For making inference we are only interested in the posterior distribution, and so if the posterior turns out to be normalisable we are probably fine.

The second problem is more serious. Even though we want to express ignorance of the parameter, is there a canonical choice for THE parameter? An example will make this objection more clear. Suppose we know nothing about the parameter T except that it lies in [0,1]. Then the uniform distribution on [0,1] seems like the natural candidate for the prior. But what if we considered T^100 to be the parameter instead? Again if we have total ignorance we should assign T^100 the uniform distribution on its support, which is again [0,1]. But if T^100 is uniform on [0,1], then T is massively concentrated near 1, and in particular cannot also be uniformly distributed on [0,1]. So as a minimum requirement for expressing ignorance, we want a way of generating a prior that doesn’t depend on the choice of parameterisation.

The Jeffreys prior has this property. Note that there may be separate problems with making such an assumption, but this prior solves this particular objection. We define it to be $\pi(\theta)\propto [I(\theta)]^{1/2}$ where I is the Fisher information, defined as

$I(\theta)=-\mathbb{E}_\theta\Big[\frac{\partial^2 l(X_1,\theta)}{\partial \theta^2}\Big],$

where the expectation is over the data X_1 for fixed $\theta$ , and l is the log-likelihood. Proving that this has the property that it is invariant under reparameterisation requires demonstrating that the Jeffreys prior corresponding to $g(\theta)$ is the same as applying a change of measure to the Jeffreys prior for $\theta$ . The proof is a nice exercise in the chain rule, and I don’t want to reproduce it here.

For a Binomial likelihood function, we find that the Jeffreys prior is Beta(1/2,1/2), which has density that looks roughly like a bucket suspended above [0,1]. It is certainly worth asking why the ‘natural’ choice for prior might put lots of mass at the edge of the domain for the parameter.

I don’t have a definitive answer, but I do have an intuitive idea which comes from the meaning of the Fisher information. As the second derivative of the log-likelihood, a large Fisher information means that with high probability we will see data for which the likelihood changes substantially if we vary the parameter. In particular, this means that the posterior probability of a parameter close to 0 will be eliminated more quickly by the data if the true parameter is different.

If the variance is small, as it is for parameter near 0, then the data generated by this parameter will have the greatest effect on the posterior, since the likelihood will be small almost everywhere except near the parameter. We see the opposite effect if the variance is large. So it makes sense to compensate for this by placing extra prior mass at parameter values where the data has the strongest effect. Note that in the previous example, the Jeffreys prior is in fact exactly inversely proportional to the standard deviation. For the above argument to make sense, we need it to be monotonic with respect to SD, and it just happens that in this case, being 1/SD is precisely the form required to be invariant under reparameterisation.

Anyway, I thought that was reasonably interesting, as indeed was the whole course. I feel reassured that I can justify having my work address as the Department of Statistics since I now know at least epsilon about statistics!

Bayesian and Frequentist Approaches: Ask the Right Question (win-vector.com)

Independence and Association

Posted on July 29, 2012 by dominicyeo

Back when we did GCSE probability, we gave a definition of independent events as:

A and B are said to be independent if $\mathbb{P}(A)\mathbb{P}(B)=\mathbb{P}(A\cap B)$ .

We might also apply Bayes’ definition of conditional probability to say

$\mathbb{P}(A|B)=\mathbb{P}(A)\quad\iff\quad A,B\text{ independent}\quad\iff\quad\mathbb{P}(B|A)=\mathbb{P}(B)$

provided all the terms exist. (Eg the definition of $\mathbb{P}(B|A)$ is at the very least non-obvious if the probability of A is 0.) In my opinion, this is a more naturally intuitive definition. For example, I think that when you toss two coins, the fact that the probability of the second coin being a tail is unaffected by whether the first is heads is more naturally ‘obvious’ than the fact that the joint probability of the two events is 1/4.

But, before getting too into anything philosophical, it is worth thinking about an equivalent situation for non-independent events. We remark that by an identical argument to above:

$\mathbb{P}(A|B)\geq\mathbb{P}(A)\quad\iff\quad \mathbb{P}(A\cap B)\geq\mathbb{P}(A)\mathbb{P}(B)\quad\iff\quad\mathbb{P}(B|A)\geq\mathbb{P}(B)$

Informally, this says that if we know A occurs, it increases the likelihood of B occuring. If we were talking about two random variables, we might say that they were positively correlated. But of course, by considering RVs $1_A,1_B$ , the result above is precisely the statement that the indicator functions have positive correlation.

Aim: To find a sufficient condition for positive correlation of random variables in a product measure.

Consider the following. Suppose A is an event which is positively correlated with the appearance of each edge. We might suspect that two such events A and B would be positively correlated. Instead, we consider a more concrete description. Recall that an event A is a subset of $\Omega=\{0,1\}^E$ . Given $w\in\Omega,e\in E$ , we say $w^e\in\Omega$ defined by taking w and setting edge e to be open (note it may be open already). Now, we say event A is increasing, if

$\forall w\in\Omega,\forall e\in E: w\in A\Rightarrow w^e\in A$ .

Note that this certainly implies the property previously mentioned, but the converse is not necessarily true.

Anyway, our revised aim will be to show that increasing events A and B are positively correlated for product measure.

For now, we approach the problem from the other direction, namely we attempt to find which measures on $\{0,1\}^E$ have the property that A and B are positively correlated for all increasing A, B. Note that as before, we can think of this as $\mathbb{E}1_A1_B\geq\mathbb{E}1_A\mathbb{E}1_B$ , and again here it is useful to rephrase our framework in terms of random variables. There is a natural (product) partial ordering of $\Omega=\{0,1\}^E$ , and from this there is an easy notion of increasing random variables. Recall a random variable is defined as a measurable map $\Omega\rightarrow\mathbb{R}$ so no further work is required.

X is increasing if $w\geq w'\Rightarrow X(w)\geq X(w')$ .

So we clarify our aim, which is to find a condition on the measure $\mu$ such that $\mu(XY)\geq \mu(X)\mu(Y)$ for all increasing X, Y. When this occurs, we say $\mu$ is positively associated. Note that this is equivalent to $\mu(A\cap B)\geq \mu(A)\mu(B)$ for all increasing events A, B. Why? We can build up X and Y from increasing indicator functions like $\{X\geq x\}$ in a usual monotone class argument.

On the way, we need a partial ordering on the set of probability measures. Obviously, if $\mu(A)\leq \nu(A)$ for all events A, then in fact $\mu=\nu$ ! So instead we say $\mu\leq_{st}\nu$ if $\mu(A)\leq \nu(A)$ for all increasing A. This is called the stochastic ordering, and there is a technical result of Strassen, proving the intuitively obvious claim that if $\mu_1\leq \mu_2$ , then we can couple the measures in a natural way. Formally:

Theorem: $\mu_1\leq\mu_2 \iff \exists$ a probability measure $\nu$ on $\Omega^2$ such that the marginals are $\mu_1,\mu_2$ and

$\nu(\{(w_1,w_2):w_1\leq w_2\})=1$ .

Our main result will be the FKG inequality which asserts that when $\mu$ satisfies the following FKG lattice property

$\mu(w_1\vee w_2)\mu(w_1\wedge w_2)\geq \mu(w_1)\mu(w_2),\quad\forall w_1,w_2\in\Omega$

then $\mu$ is positively associated. We will prove the case $|E|<\infty$ .

We proceed by showing that $\mu_1\leq\mu_2\propto Y\mu_1$ , rescaled, for Y an increasing RV. [Note that we are now suppressing the ‘st’ subscript, as context makes the use clear.]

To show this, we prove the more general Holley’s Theorem:

This states that if two positive probability measures satisfy a related lattice condition:

$\mu_2(w_1\vee w_2)\mu_1(w_1\wedge w_2)\geq \mu_1(w_1)\mu_2(w_2)\quad\forall w_1,w_2\in\Omega$

then we have the stochastic domination result: $\mu_1\leq \mu_2$ .

Note that the lattice condition states, very informally, that adding edges results in a greater relative increase with respect to the measure $\mu_2$ , which has a natural similarity to the definition of stochastic domination.

We prove this, perhaps unexpectedly, by resorting to a Markov chain. We note that there is a Markov chain on $\Omega$ with equilibrium distribution given by $\mu_1$ . This is simple: the non-zero transition rates are those given by the addition or removal of a single edge. Assume that edges are added at unit rate, and that edges are removed with rate: $G(w^e,w_e)=\frac{\mu_1(w_e)}{\mu_1(w^e)}$ .

Similarly, we can construct a Markov chain on state space $\Omega^2$ , where non-zero transitions are given by the addition of an edge to both states in the pair, the removal of an edge from both states in the pair, and the removal of an edge from only the first edge in the pair. Note that, as before, we may be ‘adding’ an edge which is already present. Assuming we start in this set, this choice means that we are restricting the sample space to $\{(\pi,w):\pi\leq w\}$ . We need the transition rate of the third type of transition to have the form: $\frac{\mu_1(\pi_e)}{\mu_1(\pi^e)}-\frac{\mu_2(w_e)}{\mu_2(w^e)}$ . So the lattice condition precisely confirms that this is non-negative, and thus we have a well-constructed Markov chain. The marginals have equilibrium distributions $\mu_1,\mu_2$ by construction, and by the general theory of Markov chains, there is an equilibrium distribution, and this leaves us in precisely the right position to apply Strassen to conclude the result.#

Summary of consequences: We have demonstrated that product measure is positively associated, as it certainly satisfies the FKG condition. Recall that this is what we had suspected intuitively for reasons given at the start of this account. Next time, I will talk about the most natural companion result, the BK inequality, and the stronger Reimer’s Inequality.

References: Both the motivation and the material is derived from Prof. Grimmett’s Part III course, Percolation and Related Topics, which was one of the mathematical highlights of the year. This account of the subject is a paraphrase of his lecture notes, which were themselves based on his book Probability on Graphs. Mistakes, naturally, are mine. Background on the course, and an online source of the book can be found on the course website here.

Eventually Almost Everywhere

A blog about probability and olympiads by Dominic Yeo

Tag Archives: conditional probability