Exchangeability and De Finetti’s Theorem

Exchangeability generalises the notion of a sequence of random variables being iid. Essentially, the motivation is that in frequentist statistics data is assumed to be generated by a series of iid RVs with distribution parameterised by some unknown p. The theory for sequences of iid RVs is rich, with laws of large numbers, and limit theorems. However, from a Bayesian perspective, the parameter p has some prior distribution, so the random variables which give the data are no longer independent. That is, each random variable has non-trivial dependence on p, so in general will have non-trivial dependence on each other.

We say a sequence of random variables X=(X_1,X_2,\ldots) is exchangeable if the law of X is invariant under finite permutation of the indices of the sequence. Formally, if for any \sigma\in S_n, (X_1,\ldots,X_n)\stackrel{d}{=}(X_{\sigma(1)},\ldots,X_{\sigma_n)}. Note that permutations with non-trivial action on an infinite subset of N are not considered in this definition, as the law of the entire sequence of RVs is generated by the laws of finite subsets of the sequence. For example, take Y,Y_1,Y_2,\ldots iid, and set X_n=Y+Y_n. Provided Y has some non-trivial distribution, the sequence X is not iid, but it is exchangeable. Note that, conditional on Y, the sequence X is iid. This is the exact situation as in the Bayesian inference framework, where the RVs are iid conditional on some underlying random parameter. De Finetti’s Theorem gives that this in fact holds for any exchangeable sequence.

Theorem (De Finetti): X=(X_1,X_2,\ldots) an exchangeable sequence of random variables. Then there exists a random probability measure \mu (that is, a RV taking values in the space of probability measures) such that conditional on \mu,\; X_1,X_2\ldots \stackrel{iid}{\sim}\mu. Continue reading

Senior Mentoring – Solving equations in integers

At school, we learn how to solve equations like 4x+3=-5. Sometimes the answer is an integer (that is, a whole number), sometimes it isn’t. If we change that -5 to a -6, the solution to the equation above won’t be an integer any more. What is more, if we change the 5 to \sqrt{5}, the solution won’t even be rational (that is, a normal fraction that looks like \frac{p}{q}). But, the key point is that in the introduction to the chapter on linear equations, or whatever they call it, in a textbook, there will be an explanation of how to solve this generic class of problems, which will work whether the numbers in the equation are integers, fractions, or even complex numbers!

Later, we move on to quadratic equations. Spotting factorisations is a good way to get started, then you could learn the quadratic formula to use in those cases where the factors are harder to spot. Alternatively, there’s the method of ‘completing the square’, which takes more steps, but means you can do the arithmetic in clear stages. Then, if you think about why the quadratic formula works, you realise that you are just completing the square for all quadratics (using a, b, c in place of any numerical coefficients). In other words, though there are a few different ways of approaching the problem, the sensible methods are essentially the same. And then you might get problems about logarithms or matrices for which you use your new knowledge about these objects to turn the question into a quadratic equation, say. At all times, there’s an implicit assumption that any question can be turned into an equation which you ‘know how to solve’.

So when you first come across a problem which asks you to find integer solutions to an equation, as have appeared a few times on this year’s Senior Mentoring scheme, it is hard to know where to start. You could try to solve it as if it wasn’t about integers, then select the integer solutions at the end. But this could be difficult, or might not even make sense. You might have a condition that x is odd: what does this mean if x isn’t even an integer? You also often start making sensible-looking substitutions: x is odd, so write x=2n+1 for some n which is also an integer, then work with n. But this can sometimes cause you to end with a huge number of variables which you can’t really relate to, and horribly complicated expressions.

As with so much of maths, experience is very useful. If you’ve seen solutions to twenty moderately tricky integer equations, you’ll have more ideas about how to think about starting another problem. So here are a few tricks that might come in handy.

Smallest solution: Are you stuck trying to prove there aren’t any solutions? Do all your substitutions just give more complicated versions of the original equation? What about considering the smallest solution? You might have to be careful about what you mean by ‘small’, but suppose you consider the smallest solution x, and end up finding a smaller solution x'<x. What could this mean? Well it can only mean that there isn’t a smallest solution! In other words, there are no solutions. You should think about why the fact that we are working with integers means that if there is any solution there must be a smallest one. Then see if you can find all solutions in positive (>0) integers to x^2-2y^2=0. Continue reading

Remarkable fact about Brownian Motion #1: It exists.

A Brownian Motion in one dimension is a stochastic process B_t adapted to (\mathcal{F}_t), defined on some probability space, which is almost surely continuous, and has the properties that B_0=0 a.s. and for every 0\leq s\leq t, B_t-B_s\sim N(0,t-s) and is independent of \mathcal{F}_s. So it has independent increments.

It can be shown that Brownian motion is also surely differentiable nowhere, and is invariant under suitable time-space rescaling. It is therefore not obvious that a probability space with rich enough structure exists to construct such a process.

Theorem (Wiener): There exists a Brownian motion on some probability space.

Proof: We will construct Brownian motion on D, the dyadic rationals in [0,1], then develop the machinery which will enable us to conclude that taking a limit onto the reals retains all the properties we need. The Strong Markov property allows us to extend this to the real line by taking countably many copies, and d independent copies will give BM on \mathbb{R}^d.

We make the following observation. Given independent X_1,X_2\sim N(0,1), it is a simple check that conditional on (X_1|X_1+X_2=a)\sim N(\frac{a}{2},\frac{1}{2}). From this, it is clear how our construction on D will proceed. Take a family of independent N(0,1) RVs, one for each dyadic rational. Rescale these appropriately to construct the BM on D_{n+1} from the values on D_n. Can check the covariance of the new finer increments, and it is clear that these are independent, provided the original increments were independent.

We need continuity. The precise result needed will be dealt with at the end, but it is easy to check that B_d has dyadic increments bounded as \mathbb{E}|B_t-B_s|^p=|t-s|^{p/2}\mathbb{E}|N|^p. The Kolmogorov criterion thus gives that D\ni t\mapsto B_t(\omega) is Holder continuous for any exponent in (0,1/2). Then define: B_t=\lim_{D\ni s\downarrow t} B_s, t\in[0,1], which will also have the a.s. Holder property.

Need to check increments property. Given some increments t_0<t_1<\ldots<t_k, approximate from above by dyadic t_i^n\rightarrow t_i. Then (B_{t_0}^n,\ldots, B_{t_k}^n)\stackrel{\text{a.s.}}{\rightarrow} (B_{t_0},\ldots, B_{t_k}), so the joint distributions converge also. Using Levy then gives both the Gaussian property and the independence in one go.

Theorem (Kolmogorov’s Criterion): If there exist p,\epsilon>0 such that: \mathbb{E}|X_t-X_s|^p\leq C|t-s|^{1+\epsilon}\; \forall s,t\in D then for every \alpha\in (0,\frac{\epsilon}{p}), X is \alpha-Holder continuous almost surely.

Proof: By Markov and BC, can deduce from the condition the existence of a RV M such that almost surely: \sup_{n}\max_k 2^{-n\alpha}|X_{k2^{-n}}-X_{(k+1)2^{-n}}|\leq M<\infty. Now given dyadic s,t, there is a unique dyadic rational with smallest denominator, say 2^{r}, between them. Can then express the difference t-s as a sum of dyadic reciprocals with denominators greater than $2^{(r+1)}$, where each denominator occurs at most twice. So can write: |X_t-X_s|\leq 2\sum_{n\geq r+1}M2^{-n\alpha}=2M\cdot\frac{2^{-(r+1)\alpha}}{1-2^{-alpha}}. Then M<\infty a.s. gives the Holder criterion.

Motivating Ito’s Formula

Ito’s formula, which characterises the stochastic differential, has been mentioned by various textbooks and courses, but now for the first time (after James Norris’s first lecture for the Stochastic Calculus course) I think I finally have a reasonable idea of what’s going on. The reasons I was initially confused help to explain what the motivation is:

  • What processes can we consider? Well, initially continuous time, time-homogeneous Markov processes in \mathbb{R}^d with continuous paths. It could be space-homogeneous as well if desired. By the theory of decomposition of Levy processes (ie what we are considering), the continuous paths property gives that such a process must be a Brownian motion with drift. This has the property that X_{t+dt}-X_t\sim N(b(X_t)dt,a(X_t)dt) where a(X_t) is the diffusivity, that is, the intensity of the Brownian component, and b(X_t) is the drift.
  • What is the stochastic differential? Well, for a process as above, we define: dX_t:= X_{t+dt}-X_t-N(b(X_t)dt,a(X_t)dt). This is non-deterministic: that’s reasonable since X is a stochastic process. And, a normal differential is meaningful only when you integrate, so similarly the stochastic differential is only meaningful when you take an expectation.
  • Write N_t for the Brownian noise. Then \mathbb{E}[d(f(X_t))|\mathcal{F}_t]=\mathbb{E}[f(X_{t+dt})-f(X_t)|\mathcal{F}_t], so by Taylor: =\mathbb{E}[f'(X_t)(b(X_t)dt+N_t)+\frac12 f''(X_t)N_t^2 +o(dt)], remembering that N_t=O(\sqrt{dt}).
  • This is generally written as \mathbb{E}[d(f(X_t))|\mathcal{F}_t]=Lf(X_t)dt where Lf(x)=b(x)f'(x)+\frac12 a(x)f''(x). Now note that \mathbb{E}[dX_t|\mathcal{F}_t]=b(X_t)dt and \mathbb{E}[dX_tdX_t|\mathcal{F}_t]=a(X_t)dt, so it is reasonable that we might ‘cancel the expectations’ to get: d(f(X_t))=f'(X_t)dX_t+\frac12 f''(X_t)dX_tdX_t.
  • Use a suitable tensor product or dX_tdX_t^T when d>1.
  • This is (a version of) Ito’s Lemma.

Convergence of Random Variables

The relationship between the different modes of convergence of random variables is one of the more important topics in any introduction to probability theory. For some reason, many of the textbooks leave the proofs as exercises, so it seems worthwhile to present a sketched but comprehensive summary.

Almost sure convergence: X_n\rightarrow X\;\mathbb{P}-a.s. if \mathbb{P}(X_n\rightarrow X)=1.

Convergence in Probability: X_n\rightarrow X in \mathbb{P}-probability if \mathbb{P}(|X_n-X|>\epsilon)\rightarrow 0 for any \epsilon>0.

Convergence in Distribution: X_n\stackrel{d}{\rightarrow} X if \mathbb{E}f(X_n)\rightarrow \mathbb{E}f(X) for any bounded, continuous function f. Note that this definition is valid for RVs defined on any metric space. When they are real-valued, this is equivalent to the condition that F_{X_n}(x)\rightarrow F_X(x) for every point x\in \mathbb{R} where F_X is continuous. It is further equivalent (by Levy’s Convergence Theorem) to its own special case, convergence of characteristic functions: \phi_{X_n}(u)\rightarrow \phi_X(U) for all u\in\mathbb{R}.

Note: In contrast to the other conditions for convergence, convergence in distribution (also known as weak convergence) doesn’t require the RVs to be defined on the same probability space. This thought can be useful when constructing counterexamples.

L^p-convergence: X_n\rightarrow X in L^p if ||X_n-X||_p\rightarrow 0; that is, \mathbb{E}|X_n-X|^p\rightarrow 0.

Uniform Integrability: Informally, a set of RVs is UI if the integrals over small sets tend to zero uniformly. Formally: (X_n) is UI if \sup_{n,A\in\mathcal{F}}\{\mathbb{E}[|X_n|1(A)]|\mathbb{P}(A)\leq \delta\}\rightarrow 0 as \delta\rightarrow 0.

Note: In particular, a single RV, and a collection of independent RVs are UI. If X~U[0,1] and X_n=n1(X\leq \frac{1}{n}), then the collection is not UI.

Continue reading

Martingales

Definition

X_n is a stochastic process, integrable and adapted to filtration (\Omega,\mathcal{F},(\mathcal{F}_n),\mathbb{P}). Then X is a martingale if \mathbb{E}[X_n|\mathcal{F}_m]=X_m almost surely whenever n\geq m.

It is natural to think about a martingale as defined in the context of a process evolving in time. Then this definition is very reasonable:

  • Integrable: the entire construction is about taking expectations. So these need to exist.
  • Adapted: X_n can’t be affected by what happens after time n. It must be defined by what has happened up to time n.
  • Expectation condition: if you look at the process at a given time m, the best estimate for what X will be in the future is in fact what it is now. As you might expect, it is sufficient that \mathbb{E}[X_{n+1}|\mathcal{F}_n]=X_n almost surely for each n.

Motivation

There are many situations where the expected change in a variable over a time period is zero, whatever the value of the variable is at the start. For example, gambling. For illustration, assume we are speculating on the outcomes of tossing a coin repeatedly. You might have a complicated strategy, for example ‘double your stake when you lose’ (the so-called martingale strategy), or anything else. But ultimately, you can’t see into the future. So before every coin toss, you have to decide your stake, based on what’s happened up until now, and you will win or lose with equal probability, so your expected gain is 0, regardless of how you make your stake choice. Thus under any strategy determined without looking into the future (called previsible), the process recording your winnings is a martingale. Continue reading