Non-separable Skorohod Representations

In the previous post, I discussed the statement and proof of the Skorohod representation theorem. This concerns the conditions under which it is possible to couple distributions which converge in law, to obtain a family of random variables on a possibly very large probability space, which converge almost surely. The condition for the theorem to hold is that the base space, or at least the support of the limiting distribution, should be a separable metric space. Skorohod’s original proof concerned the case where all the distributions were supported on a complete, separable metric space (Polish space), but this extension is not particularly involved, and was proven not long after the original result.

It is natural to ask exactly what goes wrong in non-separable or non-metrizable spaces. Recall a space is separable if it contains a countable dense subset. Obviously, finite or countable sets are by definition separable with any metric. Considering the points with rational coordinates shows that \mathbb{R}^d is separable for each d, and the Stone-Weierstrass theorem shows that the continuous functions on a closed bounded interval are also separable with the uniform topology, as they can be approximated uniformly well by polynomials with rational coefficients. One heuristic is that a separable space does not have ‘too many’ open sets.

There are references (for example, see [2]) to examples of Skorohod non-representation in non-metrizable topological spaces, which are ‘big’ enough to allow convergence in distribution with respect to a particular class of test functions, but where the distributions are not uniformly tight, so cannot converge almost surely. However, I don’t really understand this well at all, and have struggled to chase the references, some of which are unavailable, and some in French.

Instead, I want to talk about an example given in [1] of a family of distributions on a non-separable space, which cannot be coupled to converge almost surely. The space is (0,1) equipped with the discrete metric, which says that d(x,y)=1 whenever x\ne y. Note that it is very hard to have even deterministic convergence in this space, since the only way to be close to an element of the space is to be equal to that element. We will construct random variables and it will be unsurprising that they cannot possibly converge almost surely in any coupling, but the exact nature of the construction will lead to convergence in distribution.

Based on what we proved last time, the support of the limiting distribution will be non-separable. It turns out that the existence of such a distribution is equiconsistent in the sense of formal logic with the existence of an extension of Lebesgue measure to the whole power set of (0,1). This is not allowed under the Axiom of Choice (AC), but is consistent under the slightly weaker Axiom of Dependent Choice (DC). This weaker condition says, translated into language more familiar to me, that every directed graph with arbitrary (and in particular, potentially uncountable) vertex set, and with all out-degrees at least 1, contains an infinite directed path. This seems obvious when viewed through the typically countable context of graph theory. But the natural construction is to start somewhere and ‘just keep going’ wherever possible, which involves making a choice from the out-neighbourhood at lots of vertices. Thus it is clear why this is weaker than AC. Anyway, in the sequel, we assume that this extension of Lebesgue measure exists.

Example (from [1]): We take (X_n)_{n\ge 1} to be an IID sequence of non-negative RVs defined on the probability space ((0,1),\mathcal{B}(0,1),\mathrm{Leb}), with expectation under Lebesgue measure equal to 1. It is not obvious how to do this, with the restriction on the probability space. One example might be to write \omega\in(0,1) as \overline{\omega_1\omega_2\ldots}, the binary expansion, and then set X_n=2\omega_n. We will later require that X_n is not identically 1, which certainly holds in this example just given.
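As a sanity check of the binary-digit example, here is a minimal Python sketch. The names binary_digit, X and expectation are my own, and the dyadic grid is only a stand-in for Lebesgue measure on (0,1): it reproduces expectations exactly for functions of the first K binary digits, and nothing more.

```python
from fractions import Fraction

def binary_digit(omega, n):
    """Return the n-th binary digit (n >= 1) of omega in (0, 1)."""
    return int(omega * 2**n) % 2

def X(n, omega):
    """The random variable X_n = 2 * omega_n from the example."""
    return 2 * binary_digit(omega, n)

# Midpoints of the dyadic intervals of length 2^-K. Averaging over them
# matches the Lebesgue expectation of any function of the first K digits.
K = 8
grid = [Fraction(2 * j + 1, 2 ** (K + 1)) for j in range(2 ** K)]

def expectation(f):
    """Exact average of an integer-valued f over the grid."""
    return Fraction(sum(f(w) for w in grid), len(grid))

# E[X_n] for n = 1, ..., K: each should equal 1.
means = [expectation(lambda w, n=n: X(n, w)) for n in range(1, K + 1)]

# A spot check of independence: E[X_1 X_2] should equal E[X_1] E[X_2] = 1.
cross = expectation(lambda w: X(1, w) * X(2, w))
```

Each X_n takes the values 0 and 2 with probability 1/2, so it has mean 1 and is not identically 1, as required.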

Let \mu be the extension of Lebesgue measure to the power set \mathcal{P}=\mathcal{P}(0,1). Now define the measures:

\mu_n(B)=\mathbb{E}_\mu(X_n \mathbf{1}_B),\quad \forall B\in\mathcal{P}.

To clarify, we are defining a family of measures which also are defined for all elements of the power set. We have defined them in a way that is by definition a coupling. This will make it possible to show convergence in distribution, but they will not converge almost surely in this coupling, or, in fact, under any coupling. Now consider a restricted class of sets, namely B\in \sigma(X_1,\ldots,X_k), the class of sets distinguishable by the outcomes of the first k RVs.

[Caution: the interpretation of this increasing filtration is a bit different to the standard setting with for example Markov processes, as the sets under consideration are actually subsets of the probability space on which everything is defined. In particular, there is no notion that a ‘fixed deterministic set’ lies in all the layers of the filtration.]

Anyway, when n>k, by independence, we have

\mu_n(B)=\mathbb{E}_\mu(X_n\mathbf{1}_B)=\mathbb{E}_\mu(X_n)\mathbb{E}_\mu(\mathbf{1}_B)=\mu(B).

So whenever B\in\mathcal{F}:=\bigcup_k \sigma(X_1,\ldots,X_k), \lim_n \mu_n(B)=\mu(B). By MCT, we can extend this convergence to any bounded \mathcal F-measurable function.
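The identity \mu_n(B)=\mu(B) for n>k can be seen numerically in the binary-digit example. The sketch below is mine rather than anything from [1]: the set B, the helper names and the grid size K are all assumptions chosen for illustration, and the grid again only stands in for Lebesgue measure on the first K digits.

```python
from fractions import Fraction

K = 8  # number of binary digits tracked by the grid (an assumption)
grid = [Fraction(2 * j + 1, 2 ** (K + 1)) for j in range(2 ** K)]

def X(n, omega):
    """X_n = 2 * (n-th binary digit of omega)."""
    return 2 * (int(omega * 2**n) % 2)

def mu(indicator):
    """Grid stand-in for the Lebesgue measure of the set with this indicator."""
    return Fraction(sum(indicator(w) for w in grid), len(grid))

def mu_n(n, indicator):
    """mu_n(B) = E[X_n 1_B], computed exactly on the grid."""
    return Fraction(sum(X(n, w) * indicator(w) for w in grid), len(grid))

def in_B(w):
    """B = {first digit is 1, second digit is 0}, a set in sigma(X_1, X_2)."""
    return int(X(1, w) == 2 and X(2, w) == 0)

values = {n: mu_n(n, in_B) for n in range(1, K + 1)}
```

Here \mu(B)=1/4, while \mu_1(B)=1/2 and \mu_2(B)=0, since X_1 and X_2 are not independent of B. For every n\ge 3 the value is exactly 1/4, as the independence argument predicts.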

This is the clever bit. We want to show that \mu_n(B)\rightarrow\mu(B) for all B\in\mathcal P, but we only have it so far for B\in\mathcal F. But since \mathcal{F}\subset \mathcal P, which is the base field of the probability space under the (non-AC) assumption, we can take conditional expectations. In particular for any B\in\mathcal P, \mathbb{E}_\mu[\mathbf{1}_B | \mathcal{F}] is a bounded, \mathcal F-measurable function. Hence, by definition of \mu_n and the extended MCT result:

\mu_n(B)=\mathbb{E}_\mu[X_n\mathbb{E}_\mu[\mathbf{1}_B|\mathcal F]]=\mathbb{E}_{\mu_n}[\mathbb{E}_\mu[\mathbf{1}_B|\mathcal F]] \rightarrow \mathbb{E}_\mu [\mathbb{E}_\mu[\mathbf{1}_B |\mathcal{F}]].

Now, since by definition \mathbf{1}_B is \mathcal{P}-measurable, applying the tower law gives that this is equal to \mu(B). So we have

\mu_n(B)\rightarrow \mu(B),\quad \forall B\in\mathcal{P}. (*)

This gives weak convergence \mu_n\Rightarrow \mu. At first glance it might look like we have proved a much stronger condition than we need. But recall that in any set equipped with the discrete topology, any set is both open and closed, and so to use the portmanteau lemma, (*) really is required.

Now we have to check that we can’t have almost sure convergence in any coupling of these measures. Suppose that we have a probability space with random variables Y,(Y_n) satisfying \mathcal L(Y)=\mu, \mathcal L(Y_n)=\mu_n. But citing the example I gave of X_n satisfying the conditions, the only values taken by Y_n are 0 and 2, and irrespective of the coupling,

\mathbb{P}(Y_n=2\text{ infinitely often})>0.

So it is impossible that Y_n can converge almost surely to any random variable supported on [0,1].

References

[1] Berti, Pratelli, Rigo – Skorohod Representation and Disintegrability

[2] Jakubowski – The almost sure Skorokhod representation for subsequences in non-metric spaces.

Skorohod Representation Theorem

Continuing the theme of revising theory in the convergence of random processes that I shouldn’t have forgotten so rapidly, today we consider the Skorohod Representation Theorem. Recall from the standard discussion of the different modes of convergence of random variables that almost sure convergence is among the strongest since it implies convergence in probability and thus convergence in distribution. (But not convergence in L_1. For example, take U uniform on [0,1], and X_n=n\mathbf{1}_{\{U<\frac{1}{n}\}}.)
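The counterexample in the parenthesis is easy to check by hand, and the following Python sketch just records that check. The function names and the particular outcome u=3/10 are my own choices for illustration.

```python
from fractions import Fraction

def X(n, u):
    """X_n = n * 1{U < 1/n}, evaluated at the outcome U = u."""
    return n if u < Fraction(1, n) else 0

def expectation(n):
    """E[X_n] = n * P(U < 1/n) for U uniform on [0, 1]."""
    return n * Fraction(1, n)

u = Fraction(3, 10)                      # one fixed outcome of U
path = [X(n, u) for n in range(1, 11)]   # X_1(u), ..., X_10(u)
```

For this u the path is 1, 2, 3, 0, 0, ... and the same eventual vanishing happens for every u>0, so X_n\rightarrow 0 almost surely. But \mathbb{E}[X_n]=1 for every n, so there is no convergence to 0 in L_1.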

Almost sure convergence is therefore in some sense the most useful form of convergence to have. However, it comes with a strong prerequisite, that the random variables be defined on the same probability space, which is not required for convergence in distribution. Indeed, one can set up weak versions of convergence in distribution which do not even require the convergents to be random variables. The Skorohod representation theorem gives a partial converse to this result. It states some conditions under which random variables which converge in distribution can be coupled on some larger probability space to obtain almost sure convergence.

Skorohod’s original proof dealt with convergence of distributions defined on complete, separable metric spaces (Polish spaces). The version discussed here is from Chapter 5 of Billingsley [1], and assumes the limiting distribution has separable support. More recent authors have considered stronger convergence conditions (convergence in total variation or Wasserstein distance, for example) with weaker topological requirements, and convergence of random variables defined in non-metrizable spaces.

Theorem (Skorohod representation theorem): Suppose that distributions P_n\Rightarrow P, where P is a distribution with separable support. Then we can define a probability space (\Omega,\mathcal{F},\mathbb{P}) and random variables X,(X_n)_{n\ge 1} on this space such that the laws of X,X_n are P,P_n respectively and X_n(\omega)\rightarrow X(\omega) for all \omega\in\Omega.

NB. We are proving ‘sure convergence’ rather than merely almost sure convergence! It is not surprising that this is possible, since changing the value of all the X_ns on a set with measure zero doesn’t affect the conditions for convergence in distribution.
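On the real line there is a well-known shortcut to a coupling of this kind, which is not the argument from [1] given below: evaluate every quantile function at the same uniform random variable. Here is a minimal sketch, assuming a hypothetical sequence of normal laws P_n=N(1/n,(1+1/n)^2) converging weakly to N(0,1). The function name quantile_coupling is my own; NormalDist and inv_cdf are from the Python standard library.

```python
from statistics import NormalDist

def quantile_coupling(dists, u):
    """Evaluate every X_n = F_n^{-1}(u) at the same uniform outcome u."""
    return [d.inv_cdf(u) for d in dists]

# Hypothetical example: P_n = N(1/n, (1 + 1/n)^2), with weak limit N(0, 1).
limit = NormalDist(mu=0.0, sigma=1.0)
sequence = [NormalDist(mu=1 / n, sigma=1 + 1 / n) for n in (1, 10, 100, 1000)]

u = 0.8                     # one outcome of the shared uniform variable
x = limit.inv_cdf(u)        # X(u)
errors = [abs(x_n - x) for x_n in quantile_coupling(sequence, u)]
```

The errors |X_n(u)-X(u)| shrink along the sequence for this u, and the same happens for every u in (0,1). The quantile trick relies on the ordering of the real line, which is why a general metric space needs the partition argument instead.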

Applications: Before going through the Billingsley proof, we consider one simple application of this result. Let S be a separable metric space containing the support of X, and g a continuous function S\rightarrow S'. Then

X_n\stackrel{a.s.}{\rightarrow}X\quad\Rightarrow\quad g(X_n)\stackrel{a.s.}{\rightarrow}g(X).

So, by applying the Skorohod representation theorem once, and the result that almost sure convergence implies convergence in distribution, we have shown that

X_n\stackrel{d}{\rightarrow}X\quad\Rightarrow\quad g(X_n)\stackrel{d}{\rightarrow}g(X),

subject to these conditions on the space supporting X. And we have avoided the need to be careful about exactly which class of functions determine convergence in distribution, as would be required for a direct argument.

Proof (from [1]): Unsurprisingly, the idea is to construct realisations of the (X_n) from a realisation of X. We take X, and a partition of the support of X into small measurable sets, chosen so that the probability of lying in a particular set is almost the same for X_n as for X, for large n. Then, the X_n are constructed so that for large n, with limitingly high probability X_n lies in the same small set as X.

Constructing the partition is the first step. For each x\in S:=\mathrm{supp}(X), there must be some radius \frac{\epsilon}{4}<r_x<\frac{\epsilon}{2} such that P(\partial B(x,r_x))=0. Such a radius exists because the boundaries \partial B(x,r) for different r are disjoint, so at most countably many of them can have positive probability. This is where we use separability. Since every point in the space is within \frac{\epsilon}{4} of some element of a countable sequence of elements of the space, we can take a countable subset of these open balls B(x,r_x) which cover the space. Furthermore, we can take a finite subset of the balls which cover all of the space apart from a set of measure at most \epsilon. We want the sets to be disjoint, and we can achieve this by removing the intersections inductively in the obvious way. We end up with a collection B_0,B_1,\ldots,B_k, where B_0 is the leftover space, such that

  • P(B_0)<\epsilon
  • P(\partial B_i)=0,\quad i=0,1,\ldots,k
  • \mathrm{diam}(B_i)<\epsilon,\quad i=1,\ldots,k.
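The step of ‘removing the intersections inductively’ can be sketched as follows. This is a toy version on a finite set, with overlapping integer intervals standing in for the balls; the function name disjointify and the example data are mine.

```python
def disjointify(cover, space):
    """Turn a finite cover A_1, ..., A_k into disjoint B_1, ..., B_k plus B_0.

    B_i is A_i with A_1, ..., A_{i-1} removed, and B_0 is whatever part of
    the space the cover misses. Returns the list [B_0, B_1, ..., B_k].
    """
    seen = set()
    parts = []
    for a in cover:
        parts.append(set(a) - seen)   # keep only the points not yet claimed
        seen |= set(a)
    return [set(space) - seen] + parts

# Toy stand-in: the "balls" are overlapping integer intervals in {0, ..., 9}.
space = range(10)
cover = [range(0, 4), range(2, 6), range(5, 8)]
B = disjointify(cover, space)
```

In the proof itself, each B_i is built from finitely many balls by differences and complements, so its boundary lies inside the union of the boundaries of those balls. This is why P(\partial B_i)=0 survives the procedure. Shrinking a set cannot increase its diameter, so the diameter bound survives as well.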

Now suppose for each m, we take such a partition B^m_0,B^m_1,\ldots,B^m_{k_m}, for which \epsilon_m=\frac{1}{2^m}. Unsurprisingly, this scaling of \epsilon is chosen so as to use Borel-Cantelli at the end. Then, from convergence in distribution, there exists an integer N_m such that for n\ge N_m, we have

P_n(B^m_i)\ge (1-\epsilon_m)P(B^m_i),\quad i=0,1,\ldots,k_m. (*)

Now, for N_m\le n <N_{m+1}, for each B^m_i with non-zero probability under P, take Y_{n,i} to be independent random variables with law P_n(\cdot | B^m_i), that is, P_n conditioned on the set. Also take \xi\sim U[0,1] independent of everything so far. We can now make concrete the heuristic for constructing X_n from X. We define:

X_n=\sum_{i=0}^{k_m}\mathbf{1}_{\{\xi\le 1-\epsilon_m, X\in B^m_i\}} Y_{n,i} + \mathbf{1}_{\{\xi>1-\epsilon_m\}}Z_n.

We haven’t defined Z_n yet. But, from (*), there is a unique distribution such that taking Z_n to be independent of everything so far, with this distribution, we have \mathcal{L}(X_n)=P_n. Note that by iteratively defining random variables which are independent of everything previously defined, our resulting probability space \Omega will be a large product space.
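For the record, the law of Z_n can be solved for directly, and as far as I recall this matches the expression in [1]. The symbol \nu_n is my own notation for this law, and the formula assumes every B^m_i has positive P-measure, so that the conditioning makes sense:

```latex
\nu_n(A) = \frac{1}{\epsilon_m}\sum_{i=0}^{k_m} P_n(A \mid B^m_i)
           \left[P_n(B^m_i) - (1-\epsilon_m)P(B^m_i)\right].
```

Each bracket is non-negative by (*), and the brackets sum to \epsilon_m, so \nu_n is a probability measure. Conditioning on whether \xi\le 1-\epsilon_m and using independence then gives

```latex
\mathbb{P}(X_n \in A)
  = (1-\epsilon_m)\sum_{i=0}^{k_m} P(B^m_i)\,P_n(A \mid B^m_i) + \epsilon_m\,\nu_n(A)
  = \sum_{i=0}^{k_m} P_n(A \mid B^m_i)\,P_n(B^m_i)
  = P_n(A).
```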

Note that \xi controls whether the X_n follow the law we have good control over, and we also want to avoid the set B^m_0. So define E_m:=\{X\not \in B^m_0, \xi\le 1-\epsilon_m\}. Then, since P(X\in B^m_0)<\epsilon_m and P(\xi>1-\epsilon_m)=\epsilon_m, we have P(E_m^c)<2\epsilon_m=2^{-(m-1)}, and so by Borel-Cantelli, with probability 1, E_m holds for all m larger than some threshold. Let us call this \liminf_m E_m=: E, and on this event E, we have by definition X_n \rightarrow X. So we have almost sure convergence. But we can easily convert this to sure convergence by removing all \omega\in\Omega for which \xi(\omega)=1 and setting X_n\equiv X on E^c, as this does not affect the distributions.

Omissions: 

  • Obviously, I have omitted the exact construction of the distribution of Z_n. This can be reverse-engineered very easily, but requires more notation than is ideal for this medium.
  • It is necessary to remove any sets B^m_i with zero measure under P for the conditioning to make sense. These can be added to B^m_0 without changing any of the required conditions.
  • We haven’t dealt with any X_n for n<N_1.
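To make the first omission a little more concrete, here is a toy check of the construction on a finite space, done in exact arithmetic. Everything specific is an assumption of mine for illustration: the space has two points, each cell B_i is a single point with positive P-measure (so Y_{n,i} is constant), and the numbers P, P_n and eps are made up.

```python
from fractions import Fraction as F

def law_of_X_n(P, P_n, eps):
    """Exact law of X_n in a toy version of the construction.

    With probability 1 - eps (the event xi <= 1 - eps) we set X_n = X,
    since each cell is a single point. Otherwise X_n = Z_n, where Z_n is
    independent of X and has law nu_n, chosen so that X_n has law P_n.
    """
    # Condition (*): P_n(i) >= (1 - eps) P(i), so nu_n is non-negative.
    assert all(P_n[i] >= (1 - eps) * P[i] for i in P)
    nu_n = {i: (P_n[i] - (1 - eps) * P[i]) / eps for i in P}
    law = {i: (1 - eps) * P[i] + eps * nu_n[i] for i in P}
    # P(X_n = X): certain when xi <= 1 - eps, a coincidence otherwise.
    agree = (1 - eps) + eps * sum(nu_n[i] * P[i] for i in P)
    return nu_n, law, agree

P = {0: F(1, 2), 1: F(1, 2)}        # limiting law (made up)
P_n = {0: F(9, 20), 1: F(11, 20)}   # approximating law (made up)
nu_n, law, agree = law_of_X_n(P, P_n, eps=F(1, 4))
```

In this example \nu_n puts mass 3/10 and 7/10 on the two points, the resulting law of X_n is exactly P_n, and X_n agrees with X with probability 7/8, which is at least 1-\epsilon.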

The natural question to ask is what happens if we remove the restriction that the space be separable. There are indeed counterexamples to the existence of a Skorohod representation. The clearest example I’ve found so far is supported on (0,1) with a metric inducing the discrete topology. If time allows, I will explain this construction in a post shortly.

References

[1] Billingsley – Convergence of Probability Measures, 2nd edition (1999)