Large Deviations 4 – Sanov’s Theorem

Although we could have defined things for a more general topological space, most of our thoughts about Cramer’s theorem, and the Gartner-Ellis theorem which generalises it, are based on means of real-valued random variables. For Cramer’s theorem, we genuinely are interested only in means of i.i.d. random variables. In Gartner-Ellis, one might say that we are able to relax the condition on independence and perhaps identical distribution too, in a controlled way. But this is somewhat underselling the theorem: using G-E, we can deal with a much broader category of measures than just means of collections of variables. The key is that convergence of the log moment generating function is exactly enough to give a LDP with some rate, and we have a general method for finding the rate function.

So, Gartner-Ellis provides a fairly substantial generalisation to Cramer’s theorem, but is still similar in flavour. But what about if we look for additional properties of a collection of i.i.d. random variables (X_n). After all, the mean is not the only interesting property. One thing we could look at is the actual values taken by the X_ns. If the underlying distribution is continuous, this is not going to give much more information than what we started with. With probability, \{X_1,\ldots,X_n\} is a set of size n, with distribution given by the product of the underlying measure. However, if the random variables take values in a discrete set, or better still a finite set, then (X_1,\ldots,X_n) gives a so-called empirical distribution.

As n grows towards infinity, we expect this empirical distribution to approximate the real underlying distribution fairly well. This isn’t necessarily quite as easy as it sounds. By the strong law of large numbers applied to indicator functions 1(X_i\leq t), the empirical cdf at t converges almost surely to the true cdf at t. To guarantee that this convergence is uniform in t is tricky in general (for reference, see the Glivenko-Cantelli theorem), but is clear for random variables defined on finite sets, and it seems reasonable that an extension to discrete sets should be possible.

So such empirical distributions might well admit an LDP. Note that in the case of Bernoulli random variables, the empirical distribution is in fact exactly equivalent to the empirical mean, so Cramer’s theorem applies. But, in fact we have a general LDP for empirical distributions. I claim that the main point of interest here is the nature of the rate function – I will discuss why the existence of an LDP is not too surprising at the end.

The rate function is going to be interesting whatever form it ends up taking. After all, it is effectively going to some sort of metric on measures, as it records how far a possible empirical measure is from the true distribution. Apart from total variation distance, we don’t currently have many standard examples for metrics on a space of measures. Anyway, the rate function is the main content of Sanov’s theorem. This has various forms, depending on how fiddly you are prepared for the proof to be.

Define L_n:=\sum_{i=1}^n \delta_{X_i}\in\mathcal{M}_1(E) to be the empirical measure generated by X_1,\ldots,X_n. Then L_n satisfies an LDP on \mathcal{M}_1(E) with rate n and rate function given by H(\cdot,\mu), where \mu is the underlying distribution.

The function H is the relative entropy, defined by:

H(\nu|\mu):=\int_E \log\frac{\nu(x)}{\mu(x)}d\nu(v),

whenever \nu<<\mu, and \infty otherwise. We can see why this absolute continuity condition is required from the statement of the LDP. If the underlying distribution \mu has measure zero on some set A, then the observed values will not be in A with probability 1, and so the empirical measure will be zero on A also.

Note that an alternative form is:

H(\nu|\mu)=\int_E \frac{\nu(x)}{\mu(x)}\log\frac{\nu(x)}{\mu(x)}d\mu(v)=\mathbb{E}_\nu\frac{\nu(x)}{\mu(x)}\log\frac{\nu(x)}{\mu(x)}.

Perhaps it is more clear why this expectation is something we would want to minimise.

In particular, if we want to know the most likely asymptotic empirical distribution inducing a large deviation empirical mean (as in Cramer), then we find the distribution with suitable mean, and smallest entropy relative to the true underlying distribution.

A remark on the proof. If the underlying set of values is finite, then a proof of this result is essentially combinatorial. The empirical distribution is some multinomial distribution, and we can obtain exact forms for everything and then proceed with asymptotic approximations.

I said earlier that I would comment on why the LDP is not too surprising even in general, once we know Gartner-Ellis. Instead of letting X_i take values in whatever space we were considering previously, say the reals, consider instead the point mass function \delta_{X_i} which is effectively exactly the same random variable, only now defined on the space of probability measures. The empirical measure is then exactly:

\frac{1}{n}\sum_{i=1}^n \delta_{X_i}.

If the support K of the (X_i)s is finite, then in fact this space of measures is a convex subspace of \mathbb{R}^K, and so the multi-dimensional version of Cramer’s theorem applies. In general, we can work in the possibly infinite-dimensional space [0,1]^K, and our relevant subset is compact, as a closed subset of a compact space (by Tychonoff). So the LDP in this case follows from our previous work.

Mixing Times 1 – Reversing Markov Chains

A small group of us have started meeting to discuss Levin, Peres and Wilmer’s book on Markov Chains and Mixing Times. (The plan is to cover a couple of chapters every week, then discuss points of interest and some of the exercises – if anyone is reading this and fancies joining, let me know!) Anyway, this post is motivated by something we discussed in our first session.

Here are two interesting facts about Markov Chains. 1) The Markov property can be defined in terms of products of transition probabilities giving the probability of a particular initial sequence. However, a more elegant and general formulation is to say that, conditional on the present, the past and the future are independent. 2) All transition matrices have at least one equilibrium distribution. In fact, irreducible Markov Chains have precisely one equilibrium distribution. Then, if we start with any distribution, the distribution of the chain at time t converges to the equilibrium distribution.

But hang on. This might be a fairly serious problem. On the one hand we have given a definition of the Markov property that is symmetric in time, in the sense that it remains true whether we are working forwards or backwards. While, on the other hand, the convergence to equilibrium is very much not time-symmetric: we move from disorder to order as time advances. What has gone wrong here?

We examine each of the properties in turn, then consider how to make them fit together in a non-contradictory way.

Markov Property

As many of the students in the Applied Probability course learned the hard way, there are many ways to define the Markov property depending on context, and some are much easier to work with than others. For a Markov chain, you can find a way to say that the transition probability \mathbb{P}(X_{n+1}=x_{n+1}\,|\,X_n=x_n,\ldots,X_0=x_0) is independent of x_0,\ldots,x_{n-1}. Alternatively, you can use this to give an inductive specification for the probability of the first n values of X being some sequence.

It requires a moment’s checking to see that the earlier definition of past/future independence is consistent with this. Let’s first check that we haven’t messed up a definition somewhere, and that the time-reversal of a general Markov chain does have the Markov property, even as defined in the context of a Markov chain.

For clarity, consider X_0,X_1,\ldots, X_N a Markov chain on some finite state space, with N some fixed finite end time. We aren’t losing anything by reversing over a finite time interval – after all, we need to know how to do it over a finite time interval before it could possibly make sense to do it over (-\infty,\infty). We examine (Y_n)_{n=0}^N defined by Y_n:= X_{N-n}.

\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1},\ldots,X_N=x_N)=\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1})

is the statement of the Markov property for (Y_n). We rearrange the left hand side to obtain:

=\frac{\mathbb{P}(X_n=x_n,X_{n+1}=x_{n+1},\ldots,X_N=x_N)}{\mathbb{P}(X_{n+1}=x_{n+1},\ldots,X_N=x_N)}

=\frac{\mathbb{P}(X_N=x_N|X_n=x_n,\ldots,X_{N-1}=x_{N-1})\mathbb{P}(X_n=x_n,\ldots,X_{N-1}=x_{N-1})}{\mathbb{P}(X_N=x_N|X_{n+1}=x_{n+1},\ldots,X_{N-1}=x_{N-1})\mathbb{P}(X_{n+1}=x_{n+1},\ldots,X_{N-1}=x_{N-1})}.

Now, by the standard Markov property on the original chain (X_n), the first probability in each of the numerator and denominator are equal. This leaves us with exactly the same form of expression as before, but with one fewer term in the probability. So we can iterate until we end up with

\frac{\mathbb{P}(X_n=x_n,X_{n+1}=x_{n+1})}{\mathbb{P}(X_{n+1}=x_{n+1})}=\mathbb{P}(X_n=x_n|X_{n+1}=x_{n+1}),

as required.

So there’s nothing wrong with the definition. The reversed chain Y genuinely does have this property, regardless of the initial distribution of X.

In particular, if our original Markov chain starts at a particular state with probability 1, and we run it up to time N, then saying that the time-reversal is a Markov chain too is making a claim that we have a non-trivial chain that converges from some general distribution at time 0 to a distribution concentrated at a single point by time N. This seems to contradict everything we know about these chains.

Convergence to Equilibrium – Markov Property vs Markov Chains

It took us a while to come up with a reasonable explanation for this apparent discrepancy. In the end, we come to the conclusion that Markov chains are a strict subset of stochastic processes with the Markov property.

The key thing to notice is that a Markov chain has even more regularity than the definition above implies. The usual description via a transition matrix says that the probability of moving to state y at time t+1 given that you are at state x at time t is some function of x and y. The Markov property says that this probability is independent of the behaviour up until time t. But we also have that the probability is independent of t. The transition matrix P has no dependence on time t – for example in a random walk we do not have to specify the time to know what happens next. This is the property that fails for the non-stationary time-reversal.

In the most extreme example, we say X_0=x_0 with probability 1. So in the time reversal, \mathbb{P}(Y_N=x_0|Y_{N-1}=y_{N-1})=1 for all y_{N-1}. But it will obviously not be the case in general that \mathbb{P}(Y_n=x_0|Y_{n-1}=y_{n-1})=1 for all y_{n-1}, as this would mean the chain Y would be absorbed after one step at state x_0, which is obviously not how the reversal of X should behave.

Perhaps the best way to reconcile this difference is to consider this example where you definitely start from x_0. Then, a Markov chain in general can be thought of as a measure on paths, that is \Omega^N, with non-trivial but regular correlations between adjacent components. (In the case of stationarity, all the marginals are equal to the stationary distribution – a good example of i.d. but not independent RVs.) This is indexed by the transition matrix and the initial distribution. If the initial distribution is a single point mass, then this can be viewed as a restriction to a smaller set of possible paths, with measures rescaled appropriately.

What have we learned?

Well, mainly to be careful about assuming extra structure with the Markov property. Markov Chains are nice because there is a transition matrix which is constant in time. Other processes, such as Brownian motion are space-homogeneous, where the transitions, or increments in this context, are independent of time and space. However, neither of these properties are true for a general process with the Markov property. Indeed, we have seen in a post from a long time ago that there are Markov processes which do not have the Strong Markov Property, which seems unthinkable if we limit our attention to chain-like processes.

Most importantly, we have clarified the essential point that reversing a Markov Chain only makes sense in equilibrium. It is perfectly possibly to define the reversal of a chain not started at a stationary distribution, but lots of unwelcome information from the forward chain ends up in the reversed chain. In particular, the theory of Markov Chains is not broken, which is good.

Large Deviations 3 – Gartner-Ellis Theorem: Where do the all terms come from?

We want to drop the i.i.d. assumption from Cramer’s theorem, to get a criterion for a general LDP as defined in the previous post to hold.

Preliminaries

For general random variables (Z_n) on \mathbb{R}^d with laws (\mu_n), we will continue to have an upper bound like in Cramer’s theorem, provided the moment generating functions of Z_n converge as required. For analogy with Cramer, take Z_n=\frac{S_n}{n}. The Gartner-Ellis theorem gives conditions for the existence of a suitable lower bound and, in particular, when this is the same as the upper bound.

We define the logarithmic moment generating function

\Lambda_n(\lambda):=\log\mathbb{E}e^{\langle \lambda,Z_n\rangle},

and assume that the limit

\Lambda(\lambda)=\lim_{n\rightarrow\infty}\frac{1}{n}\Lambda_n(n\lambda)\in[-\infty,\infty],

exists for all \lambda\in\mathbb{R}^d. We also assume that 0\in\text{int}(\mathcal{D}_\Lambda), where \mathcal{D}_\Lambda:=\{\lambda\in\mathbb{R}^d:\Lambda(\lambda)<\infty\}. We also define the Fenchel-Legendre transform as before:

\Lambda^*(x)=\sup_{\lambda\in\mathbb{R}^d}\left[\langle x,\lambda\rangle - \Lambda(\lambda)\right],\quad x\in\mathbb{R}^d.

We say y\in\mathbb{R}^d is an exposed point of \Lambda^* if for some \lambda,

\langle \lambda,y\rangle - \Lambda^*(y)>\langle\lambda,x\rangle - \Lambda^*(x),\quad \forall x\in\mathbb{R}^d.

Such a \lambda is then called an exposing hyperplane. One way of thinking about this definition is that \Lambda^*(x) is convex, but is strictly convex in any direction at an exposed point. Alternatively, at an exposed point y, there is a vector \lambda such that \Lambda^*\circ \pi_\lambda has a global minimum or maximum at y, where \pi_\lambda is the projection into \langle \lambda\rangle. Roughly speaking, this vector is what we will to take the Cramer transform for the lower bound at x. Recall that the Cramer transform is an exponential reweighting of the probability density, which makes a previously unlikely event into a normal one. We may now state the theorem.

Gartner-Ellis Theorem

With the assumptions above:

  1. \limsup_{n\rightarrow\infty}\frac{1}{n}\log \mu_n(F)\leq -\inf_{x\in F}\Lambda^*(x), \forall F\subset\mathbb{R}^d closed.
  2. \liminf_{n\rightarrow\infty}\frac{1}{n}\log \mu_n(G)\geq -\inf_{x\in G\cap E}\Lambda^*(x), \forall G\subset\mathbb{R}^d open, where E is the set of exposed points of \Lambda^* whose exposing hyperplane is in \text{int}(\mathcal{D}_\Lambda).
  3. If \Lambda is also lower semi-continuous, and is differentiable on \text{int}(\mathcal{D}_\Lambda) (which is non-empty by the previous assumption), and is steep, that is, for any \lambda\in\partial\mathcal{D}_\Lambda, \lim_{\nu\rightarrow\lambda}|\nabla \Lambda(\nu)|=\infty, then we may replace G\cap E by G in the second statement. Then (\mu_n) satisfies the LDP on \mathbb{R}^d with rate n and rate function \Lambda^*.

Where do all the terms come from?

As ever, because everything is on an exponential scale, the infimum in the statements affirms the intuitive notion that in the limit, “an unlikely event will happen in the most likely of the possible (unlikely) ways”. The reason why the first statement does not hold for open sets in general is that the infimum may not be attained for open sets. For the proof, we need an exposing hyperplane at x so we can find an exponential tilt (or Cramer transform) that makes x the standard outcome. Crucially, in order to apply probabilistic ideas to the resulting distribution, everything must be normalisable. So we need an exposing hyperplane so as to isolate the point x on an exponential scale in the transform. And the exposing hyperplane must be in \mathcal{D}_\Lambda if we are to have a chance of getting any useful information out of the transform. By convexity, this is equivalent to the exposing hyperplane being in \text{int}(\mathcal{D}_\Lambda).

Large Deviations 2 – LDPs, Rate Functions and Lower Semi-Continuity

Remarks from Cramer’s Theorem

So in the previous post we discussed Cramer’s theorem on large deviations for means of i.i.d. random variables. It’s worth stepping back and thinking more abstractly about what we showed. Each S_n has some law, which we think of as a measure on \mathbb{R}, though this could equally well be some other space, depending on where the random variables are supported. The law of large numbers asserts that as n\rightarrow\infty, these measures are increasingly concentrated at a single point in \mathbb{R}, which in this case is \mathbb{E}X_1. Cramer’s theorem then asserts that the measure of certain sets not containing this point of concentration decays exponentially in n, and quantifies the exponent, a so-called rate function, via a Legendre transform of the log moment generating function of the underlying distribution.

One key point is that we considered only certain sets [a,\infty),\,a>\mathbb{E}X_1, though we could equally well have considered (-\infty,a],\,a<\mathbb{E}X_1. What would happen if we wanted to consider an interval, say [a,b],\,\mathbb{E}X_1<a<b? Well, \mu_n([a,b])=\mu_n([a,\infty))-\mu_n((b,\infty)), and we might as well assume that \mu_n is sufficiently continuous, at least in the limit, that we can replace the open interval bound with a closed one. Then Cramer’s theorem asserts, written in a more informal style, that \mu_n([a,\infty))\sim e^{-nI(a)} and  similarly for [b,\infty). So provided I(a)<I(b), we have

\mu_n([a,b])\sim e^{-nI(a)}-e^{-nI(b)}\sim e^{-nI(a)}.

To in order to accord with our intuition, we would like I(x) to be increasing for x>\mathbb{E}X_1, and decreasing for x<\mathbb{E}X_1. Also, we want I(\mathbb{E}X_1)=0, to account for the fact that \mu_n([\mathbb{E}X_1,\infty))=O(1). For each consider a sequence of coin tosses. The probability that the observed proportion of heads is in [\frac12,1] should be roughly 1/2 for all n.

Note that in the previous displayed equation for \mu_n([a,b]) the right hand side has no dependence on b. Informally, this means that any event which is at least as unlikely as the event of a deviation to a, will in the limit happen in the most likely of the unlikely ways, which will in this case be a deviation to a, because of relative domination of exponential functions. So if, rather than just half-lines and intervals, we wanted to consider more general sets, we might conjecture a result of the form:

\mu_n(\Gamma)\sim e^{-n\inf_{z\in\Gamma}(z)},

with the approximation defined formally as in the statement of Cramer’s theorem. What can go wrong?

Large Deviations Principles

Well, if the set \Gamma=\{\gamma\} a single point, and the underlying distribution is continuous, then we would expect \mu_n(\{\gamma\})=0 for all n. Similarly, we would expect \mu_n((\mathbb{E}X_1,\infty))\sim O(1), but there is no a priori reason why I(z) should be continuous at \mathbb{E}X_1. (In fact, this is false.), so taking \Gamma=(\mathbb{E}X_1,\infty) again gives a contradiction.

So we need something a bit more precise. Noting that the problem here is that measure (in this case, measure of likeliness on an exponential scale) can leak into open sets through the boundary in the limit, and also the rate function requires some sort of neighbourhood to make sense for continuous RVs, so boundaries of closed sets may give an overestimate. This is reminiscent of weak convergence, and motivated by this, the appropriate general definition for a Large Deviation Principle is:

A sequence of measure (\mu_n) on some space E satisfies an LDP with rate function I and speed n if \forall \Gamma\in \mathcal{B}(E):

-\inf_{x\in\Gamma^\circ}I(x)\leq \liminf \frac{1}{n}\log\mu_n(\Gamma)\leq \limsup\frac{1}{n}\log\mu_n(\Gamma)\leq -\inf_{x\in \bar{\Gamma}}I(x).

Although this might look very technical, you might as well think of it as nothing more than the previous conjecture for general sets, with the two problems that we mentioned now taken care of.

So, we need to define a rate function. I: E\rightarrow[0,\infty] is a rate function, if it not identically infinite. We also demand that it is lower semi-continuous, and has closed level sets \Psi_I^\alpha:=\{x\in E: I(x)\leq\alpha\}. These definitions are in fact equivalent. I will say what lower semi-continuity is in a moment. Some authors also demand that the level sets be compact. Others call this a good rate function, or similar. The advantage of this is that infima on closed sets are attained.

It is possible to specify a different rate. The rate gives the speed of convergence. \frac 1 n can be replaced with any function converging to 0, including continuously.

Lower Semi-Continuity

A function f is lower semi-continuous if

f(x)\leq \liminf f(x_n),\text{ for all sequences }x_n\rightarrow x.

One way of thinking about this definition is to say that the function cannot jump upwards as it reaches a boundary, it can only jump downwards (or not jump at all). The article on Wikipedia for semi-continuity has this picture explaining how a lower semi-continuous function must behave at discontinuities. Note that the value of f at the discontinuity could be the blue dot, or anything less than the blue dot. It is reasonable clear why this definition is equivalent to having closed level sets.

So the question to ask is: why should rate functions be lower semi-continuous? Rather than proceeding directly, we argue by uniqueness. Given a function on \mathbb{R} with discontinuities, we can turn it into a cadlag function, or a caglad function by fiddling with the values taken at points of discontinuity. We can do a similar thing to turn any function into a lower semi-continuous function. Given f, we define

f_*(x):=\liminf_{x_n\rightarrow x}f(x_n)=\sup\{\inf_G f: x\ni G, G \text{ open}\}.

The notes I borrowed this idea from described this as the maximal lower semi-continuous regularisation, which I think is quite a good explanation despite the long words.

Anyway, the claim is that if I(x) satisfies a LDP then so does $I_*(x)$. This needs to be checked, but it explains why we demand that the rate function be lower semi-continuous. We really want the rate function to be unique, and this is a good way to prevent an obvious cause of non-uniqueness. It needs to be checked that it is actually unique once we have this assumption, but that is relatively straightforward.

So, to check that the lower semi-continuous regularisation of I satisfies the LDP if I does, we observe that the upper bound is trivial, since I^*\leq I everywhere. Then, for every open set G, note that for x\in G, I_*(x)=\liminf_{x_n\rightarrow x}I(x), so we might as well consider sequences within G, and so I_*(x)\geq \inf \inf_G I. So, since I_*(x)\leq I(x), it follows that

\inf_G I_*=\inf_G I,

and thus we get the upper bound for the LDP.

References

The motivation for this particular post was my own, but the set of notes here, as cited in the previous post were very useful. Also the Wikipedia page on semi-continuity, and Frank den Hollander’s book ‘Large Deviations’.

Large Deviations 1 – Motivation and Cramer’s Theorem

I’ve been doing a lot of thinking about Large Deviations recently, in particular how to apply the theory to random graphs and related models. I’ve just writing an article about some of the more interesting aspects, so thought it was probably worth turning it into a few posts.

Motivation

Given X_1,X_2,\ldots i.i.d. real-valued random variables with finite expectation, and S_n:=X_1+\ldots+X_n, the Weak Law of Large Numbers asserts that the empirical mean \frac{S_n}{n} converges in distribution to \mathbb{E}X_1. So \mathbb{P}(S_n\geq n(\mathbb{E}X_1+\epsilon))\rightarrow 0. In fact, if \mathbb{E}X_1^2<\infty, we have the Central Limit Theorem, and a consequence is that \mathbb{P}(S_n\geq n\mathbb{E}X_1+n^\alpha)\rightarrow 0 whenever \alpha>\frac12.

In a concrete example, if we toss a coin some suitably large number of times, the probability that the proportion of heads will be substantially greater or smaller than \frac12 tends to zero. So the probability that at least \frac34 of the results are heads tends to zero. But how fast? Consider first four tosses, then eight. A quick addition of the relevant terms in the binomial distribution gives:

\mathbb{P}\left(\text{At least }\tfrac34\text{ out of four tosses are heads}\right)=\frac{1}{16}+\frac{4}{16}=\frac{5}{16},

\mathbb{P}\left(\text{At least }\tfrac34\text{ out of twelve tosses are heads}\right)=\frac{1}{2^{12}}+\frac{12}{2^{12}}+\frac{66}{2^{12}}+\frac{220}{2^{12}}=\frac{299}{2^{12}}.

There are two observations to be made. The first is that the second is substantially smaller than the first – the decay appears to be relatively fast. The second observation is that \frac{220}{2^{12}} is substantially larger than the rest of the sum. So by far the most likely way for at least \tfrac34 out of twelve tosses to be heads is if exactly \tfrac34 are heads. Cramer’s theorem applies to a general i.i.d. sequence of RVs, provided the tail is not too heavy. It show that the probability of any such large deviation event decays exponentially with n, and identifies the exponent.

Theorem (Cramer): Let (X_i) be i.i.d. real-valued random variables which satisfy \mathbb{E}e^{tX_1}<\infty for every t\in\mathbb{R}. Then for any a>\mathbb{E}X_1,

\lim_{n\rightarrow \infty}\frac{1}{n}\log\mathbb{P}(S_n\geq an)=-I(a),

\text{where}\quad I(z):=\sup_{t\in\mathbb{R}}\left[zt-\log\mathbb{E}e^{tX_1}\right].

Remarks

  • So, informally, \mathbb{P}(S_n\geq an)\sim e^{-nI(a)}.
  • I(z) is called the Fenchel-Legendre transform (or convex conjugate) of \log\mathbb{E}e^{tX_1}.
  • Considering t=0 confirms that I(z)\in[0,\infty].
  • In their extremely useful book, Dembo and Zeitouni present this theorem in greater generality, allowing X_i to be supported on \mathbb{R}^d, considering a more general set of large deviation events, and relaxing the requirement for finite mean, and thus also the finite moment generating function condition. All of this will still be a special case of the Gartner-Ellis theorem, which will be examined in a subsequent post, so we make do with this form of Cramer’s result for now.

The proof of Cramer’s theorem splits into an upper bound and a lower bound. The former is relatively straightforward, applying Markov’s inequality to e^{tS_n}, then optimising over the choice of t. This idea is referred to by various sources as the exponential Chebyshev inequality or a Chernoff bound. The lower bound is more challenging. We reweight the distribution function F(x) of X_1 by a factor e^{tx}, then choose t so that the large deviation event is in fact now within the treatment of the CLT, from which suitable bounds are obtained.

To avoid overcomplicating this initial presentation, some details have been omitted. It is not clear, for example, whether I(x) should be finite whenever x is in the support of X_1. (It certainly must be infinite outside – consider the probability that 150% or -40% of coin tosses come up heads!) In order to call this a Large Deviation Principle, we also want some extra regularity on I(x), not least to ensure it is unique. This will be discussed in the next posts.

Weak Convergence and the Portmanteau Lemma

Much of the theory of Large Deviations splits into separate treatment of open and closed sets in the rescaled domains. Typically we seek upper bounds for the rate function on closed sets, and lower bounds for the rate function on open sets. When things are going well, these turn out to be same, and so we can get on with some applications and pretty much forget about the topology underlying the construction. Many sources made a comment along the lines of “this is natural, by analogy with weak convergence”.

Weak convergence is a topic I learned about in Part III Advanced Probability. I fear it may have been one of those things that leaked out of my brain shortly after the end of the exam season… Anyway, this feels like a good time to write down what it is all about a bit more clearly. (I’ve slightly cheated, and chosen definitions and bits of the portmanteau lemma which look maximally similar to the Large Deviation material, which I’m planning on writing a few posts about over the next week.)

The motivation is that we want to extend the notion of convergence in distribution of random variables to general measures. There are several ways to define convergence in distribution, so accordingly there are several ways to generalise it. Much of what follows will be showing that these are equivalent.

We work in a metric space (X,d) and have a sequence (\mu_n) and \mu of (Borel) probability measures. We say that (\mu_n) converges weakly to \mu, or \mu_n\Rightarrow\mu if:

\mu_n(f)\rightarrow\mu(f), \quad\forall f\in\mathcal{C}_b(X).

So the test functions required for result are the class of bounded, continuous functions on X. We shall see presently that it suffices to check a smaller class, eg bounded Lipschitz functions. Indeed the key result, which is often called the portmanteau lemma, gives a set of alternative conditions for weak convergence. We will prove the equivalence cyclically.

Portmanteau Lemma

The following are equivalent.

a) \mu_n\Rightarrow \mu.

b) \mu_n(f)\rightarrow\mu(f) for all bounded Lipschitz functions f.

c) \limsup_n \mu_n(F)\leq \mu(F) for all closed sets F. Note that we demanded that all the measures be Borel, so there is no danger of \mu(F) not being defined.

d) \liminf_n \mu_n(F)\geq \mu(G) for all open sets G.

e) \lim_n \mu_n(A)=\mu(A) whenever \mu(\partial A)=0. Such an A is called a continuity set.

Remarks

a) All of these statements are well-defined if X is a general topological space. I can’t think of any particular examples where we want to use measures on a non-metrizable space (eg C[0,1] with topology induced by pointwise convergence), but there seem to be a few references (such as the one cited here) implying that the results continue to hold in this case provided X is locally compact Hausdorff. This seems like an interesting thing to think about, but perhaps not right now.

b1) This doesn’t strike me as hugely surprising. I want to say that any bounded continuous function can be uniformly approximated almost everywhere by bounded Lipschitz functions. Even if that isn’t true, I am still not surprised.

b2) In fact this condition could be replaced by several alternatives. In the proof that follows, we only use one type of function, so any subset of \mathcal{C}_b(X) that contains the ones we use will be sufficient to determine weak convergence.

c) and d) Why should the sign be this way round? The canonical example to have in mind is some sequence of point masses \delta_{x_n} where x_n\rightarrow x in some non-trivial way. Then there is some open set eg X\{x} such that \mu_n(X\backslash x)=1 but \mu(X\backslash x)=0. Informally, we might say that in the limit, some positive mass could ‘leak out’ into the boundary of an open set.

e) is then not surprising, as the condition of being a continuity set precisely prohibits the above situation from happening.

Proof

a) to b) is genuinely trivial. For b) to c), find some set F’ containing F such that \mu(F')-\mu(F)=\epsilon. Then find a Lipschitz function f which is 0 outside F’ and 1 on F. We obtain

\limsup_n \mu_n(F)\leq \limsup \mu_n(f)=\mu(f)\leq \mu(F').

But \epsilon was arbitrary, so the result follows as it tends to zero. c) and d) are equivalent after taking F^c=G. If we assume c) and d) and apply them to A^\circ, \bar{A}, then e) follows.

e) to a) is a little trickier. Given a bounded continuous function f, assume WLOG that it has domain [0,1]. At most countably many events \{f=a\} have positive mass under each of \mu, (\mu_n). So given M>0, we can choose a sequence

-1=a_0<a_1<\ldots<a_M=2, such that |a_{k+1}-a_k|<\frac{1}{M},

and \mu(f=a_k)=\mu_n(f=a_k)=0 for all k,n. Now it is clear what to do. \{f\in[a_k,a_{k+1}]\} is a continuity set, so we can apply e), then patch everything together. There are slightly too many Ms and \epsilons to do this sensibly in WordPress, so I will leave it at that.

I will conclude by writing down a combination of c) and d) that will look very familiar soon.

\mu(A^\circ)\leq \liminf_n \mu_n(A)\leq \limsup_n\mu_n(A)\leq \mu(\bar{B}).

References

Apart from the Part III Advanced Probability course, this article was prompted by various books on Large Deviations, including those by Frank den Hollander and Ellis / Dupuis. I’ve developed the proof above from the hints given in the appendix of these very comprehensible notes by Rassoul-Agha and Seppalainen.