Random transpositions

We study a procedure for generating a random sequence of permutations of [N]. We start with the identity permutation, and then in each step, we choose two elements uniformly at random, and swap them. We obtain a sequence of permutations, where each term is obtained from the previous one by multiplying by a uniformly-chosen transposition.

Some more formality and some technical remarks:

  • This is a Markov chain, and as often with Markov chains, it would be better it was aperiodic. As described, the cycle will alternate between odd and even permutations. So we allow the two elements chosen to be the same. This laziness slows down the chain by a factor N-1/N, but removes periodicity. We will work over timescales where this adjustment makes no practical difference.
  • Let \tau_1,\tau_2,\ldots be the sequence of transpositions. We could define the sequence of permutations by \pi_m= \tau_m\cdot\tau_{m-1}\cdot \ldots\cdot \tau_1. I find it slightly more helpful to think of swapping the elements in places i and j, rather the elements i and j themselves, and so I’ll use this language, for which \pi_m = \tau_1\cdot \tau_2\cdot\ldots \cdot \tau_m is the appropriate description. Of course, transpositions and the identity are self-inverse permutations, so it makes no difference to anything we might discuss.
  • You can view this as lazy random walk on the Cayley graph of S_N generated by the set of transpositions. That is, the vertices of the graph are elements of S_N, and two are connected by an edge if one can be obtained from the other by multiplying by a transposition. Note this relation is symmetric. Hence random transposition random walk.
  • Almost everything under discussion would work in continuous time too.

At a very general level, this sort of model is interesting because sometimes the only practical way to introduce ‘global randomness’ is repeatedly to apply ‘local randomness’. This is not the case for permutations – it is not hard to sample uniformly from S_N. But it is a tractable model in which to study relevant questions about the generating randomness on a complicated set through iterated local operations.

Since it is a Markov chain with a straightforward invariant distribution, we can ask about the mixing time. That is, the correct scaling for the number of moves before the random permutation is close in distribution (say in the sense of total variation distance) to the equilibrium distribution. See this series of posts for an odd collection of background material on the topic. Diaconis and Shahshahani [DS81] give an analytic argument for mixing around \frac{N\log N}{2} transpositions. Indeed they include a constant because it is a sharp cutoff, where the total variation distance drops from approximately 1 to approximately 0 in O(N) steps.

Comparison with Erdos-Renyi random graph process

In the previous result, one might observe that m=\frac{N\log N}{2} is also the threshold number of edges to guarantee connectivity of the Erdos-Renyi random graph G(N,m) with high probability. [ER59] Indeed, there is also a sharp transition around this threshold in this setting too.

We explore this link further. We can construct a sequence of random graphs simultaneously with the random transposition random walk. When we multiply by transposition (i j), we add edge ij in the graph. Laziness of RTRW and the possibility of multiple edges mean this definition isn’t literally the same as the conventional definition of a discrete-time Erdos-Renyi random graph process, but again this is not a problem for any of the effects we seek to study.

The similarity between the constructions is clear. But what about the differences? For the RTRW, we need to track more information than the random graph. That is, we need to know what order the transpositions were added, rather than merely which edges were added. However, the trade-off is that a permutation is a simpler object than a graph in the following sense. A permutation can be a described as a union of disjoint cycles. In an exchangeable setting, all the information about a random permutation is encoded in the lengths of the these cycles. Whereas in a graph, geometry is important. It’s an elegant property of the Erdos-Renyi process that we can forget about the geometry and treat it as a process on component sizes (indeed, a multiplicative coalescent process), but there are other questions we might need to ask for which we do have to study the graph structure itself.

Within this analogy, unfortunately the word cycle means different things in the two different settings. In a permutation, a cycle is a directed orbit, while in a graph it has the usual definition. I’m going to write graph-cycle whenever relevant to avoid confusion.

A first observation is that, under this equivalence, the cycles of the permutation form a finer partition than the components of the graph. This is obvious. If we split the vertices into sets A and B, and there are no edges between them, then nothing in set A will ever get moved out of set A by a transposition. (Note that the slickness of this analogy is the advantage of viewing a transposition as swapping the elements in places i and j.)

However, we might then ask under what circumstances is a cycle of the permutation the same as a component of the graph (rather than a strict subset of it). A first answer is the following:

Lemma: [Den59] The permutation formed by a product of transpositions corresponding in any order to a tree in the graph has a single cycle.

We can treat this as a standalone problem and argue in the following predictable fashion. (Indeed, I was tempted to set this as a problem during selection for the UK team for IMO 2017 – it’s perfectly suitable in this context I think.) The first transposition corresponds to some edge say ab, and removing this edge divides the vertices into components A \ni a, B\ni b. Since no further transposition swaps between places in A and places in B, the final permutation maps a into B and b into A, and otherwise preserves A and B.

This argument extends to later transpositions too. Now, suppose there are multiple cycles. Colour one of them. So during the process, the coloured labels move around. At some point, we must swap a coloured label with an uncoloured label. Consider this edge, between places a and b as before, and indeed the same conclusion holds. WLOG we move the coloured label from a to b. But then at the end of the process (ie in the permutation) there are more coloured labels in B than initially. But the number of coloured labels should be the same, because they just cycle around in the final permutation.

We can learn a bit more by trying thinking about the action on cycles (in the permutation) of adding a transposition. In the following pair of diagrams, the black arrows represent the original permutation (note it’s not helpful to think of the directed edges as having anything to do with transpositions now), the dashed line represents a new transposition, and the new arrows describe the new permutation which results from this product.

It’s clear from this that adding a transposition between places corresponding to different cycles causes the cycles to merge, while adding a transposition between places already in the same cycle causes the cycle to split into two cycles. Furthermore the sizes of the two cycles formed is related to the distance in the cycle between the places defining the transposition.

This allows us to prove the lemma by adding the edges of the tree one-at-a-time and using induction. The inductive claim is that cycles of the permutation exactly correspond to components of the partially-built tree. Assuming this claim guarantees that the next step is definitely a merge, not a split (otherwise the edge corresponding to the next step would have to form a cycle). If all N-1 steps are merges, then the number of cycles is reduced by one on each step, and so the final permutation must be a single cycle.

Uniform split-merge

This gives another framework for thinking about the RTRW itself, entirely in terms of cycle lengths as a partition of [N]. That is, given a partition, we choose a pair of parts in a size-biased way. If they are different, we merge them; and if it is the same part, with size k, we split them into two parts, with sizes chosen uniformly from { (1,k-1), (2,k-2), …  (k-1,1) }.

What’s nice about this is that it’s easy to generalise to real-valued partitions, eg of [0,1]. Given a partition of [0,1], we sample two IID U[0,1] random variables U_1,U_2. If these correspond to different parts, we replace these parts by a single part with size given by the sum. If these correspond to the same part, with size \alpha, we split this part into two parts with sizes |U_1-U_2| and \alpha - |U_1-U_2|. This is equivalent in a distributional sense to sampling another U[0,1] variable U and replacing \alpha with (\alpha U, \alpha(1-U)). We probably want our partition to live in \ell^1_\searrow, so we might have to reorder the parts afterwards too.

These uniform split-merge dynamics have a (unique) stationary distribution, the canonical Poisson-Dirichlet random partition, hereafter PD(0,1). This was first shown in [DMZZ04], and then in a framework more relevant to this post by Schramm [Sch08].

Conveniently, PD(0,1) is also the scaling limit of the cycle lengths in a uniform random permutation (scaled by N). The best way to see this is to start with the observation that the length of the cycle containing 1 in a permutation chosen uniformly from S_N has the uniform distribution on {1,…,N}. This matches up well with the uniform stick-breaking construction of PD(0,1), though other arguments are available too. Excellent background on Poisson-Dirichlet distributions and this construction and equivalence can be found in Chapter 3 of Pitman’s comprehensive St. Flour notes [CSP]. Also see this post, and the links within, with the caveat that my understanding of the topic was somewhat shaky then (as presently, for now).

However, Schramm says slightly more than this. As the Erdos-Renyi graph passes criticality, there is a well-defined (and whp unique) giant component including \Theta(N) vertices. It’s not clear that the corresponding permutation should have giant cycles. Indeed, whp the giant component has \Theta(N) surplus edges, so the process of cycle lengths will have undergone O(N) splits. Schramm shows that most of the labels within the giant component are contained in giant cycles in the permutation. Furthermore, the distribution of cycle lengths within the giant component, rescaled by the size of the giant component, converges in distribution to PD(0,1) at any supercritical time \frac{(1+\epsilon)N}{2}

This is definitely surprising, since we already know that the whole permutation doesn’t look close to uniform until time \frac{N\log N}{2}. Essentially, even though the size of the giant component is non-constant (ie it’s gaining vertices), the uniform split-merge process is happening to the cycles within it at rate N. So heuristically, at the level of the largest cycles, at any supercritical time we have a non-trivial partition, so at any slightly later time (eg \frac{(1+\epsilon/2)N}{2} and \frac{(1+\epsilon)N}{2} ), mixing will have comfortably occurred, and so the distribution is close to PD(0,1).

This is explained very clearly in the introduction of [Ber10], in which the approach is extended to a random walk on S_N driven by a uniform choice from any conjugacy class.

So this really does tell us how the global uniform randomness emerges. As the random graph process passes criticality, we have a positive mass of labels in a collection of giant cycles which are effectively a continuous-space uniform split-merge model near equilibrium (and thus with PD(0,1) marginals). The remaining cycles are small, corresponding to small trees which make up the remaining (subcritical by duality) components of the ER graph. These cycles slowly get absorbed into the giant cycles, but on a sufficiently slow timescale relevant to the split-merge dynamics that we do not need to think of a separate split-merge-with-immigration model. Total variation distance on permutations does feel the final few fixed points (corresponding to isolated vertices in the graph), hence the sharp cutoff corresponding to sharp transition in the number of isolated vertices.

References

[Ber10] – N. Berestycki – Emergence of giant cycles and slowdown transition in random transpositions and k-cycles. [arXiv version]

[CSP] – Pitman – Combinatorial stochastic processes. [pdf available]

[Den59] – Denes – the representation of a permutation as a product of a minimal number of transpositions, and its connection with the theory of graphs

[DS81] – Diaconis, Shahshahani – Generating a random permutation with random transpositions

[DMZZ04] – Diaconis, Mayer-Wolf, Zeitouni, Zerner – The Poisson-Dirichlet distribution is the unique invariant distribution for uniform split-merge transformations [link]

[ER59] – Erdos, Renyi – On random graphs I.

[Sch08] – Schramm – Compositions of random transpositions [book link]

The Rearrangement Inequality

A favourite result of many students doing olympiad inequality problems is the so-called Rearrangement Inequality. This is a mathematical formulation of the idea well-known to even the smallest of child that if you prefer cakes to carrots then if you are offered two of one and one of the other, you should take two of the one you prefer!

At a more formal level, it says that given two strings of non-negative numbers

a_1\le a_2,\le \ldots\le a_n, \quad b_1\le b_2\le \ldots\le b_n,

if you want to form a sum of products of pairs, like

a_1b_4+a_2b_1+a_3b_3+\ldots,

you get the largest result if you take

a_1b_1+a_2b_2+\ldots+a_nb_n.

Formally, for any permutation \sigma \in S_n,

a_1b_1+\ldots+a_nb_n\ge a_1b_{\sigma(1)}+\ldots+a_nb_{\sigma(n)}\ge a_1b_n+\ldots+a_nb_1.

That is, you multiply the largest terms in each sequence together.

The notation to describe to equality case is a bit annoying. Essentially, the sums are equal if and only if the summands exactly correspond. If the sequences are strictly increasing, then equality holds only if the permutation \sigma=\text{id}.

This result is nice because, although it is rarely explicitly useful, it goes in a different direction from the standard scheme of results strengthening AM-GM, Cauchy-Schwarz and so on, and is in some sense more intuitive than these more well-known inequalities, at least in the form presented in an olympiad context.

I was thinking about this partly because it’s a nice result in its own right, but also because it came up in a research problem to do with comparing the expected likelihood of different tree isomorphism classes arising in an inhomogeneous, but relatively well-behaved, random graph model. The probability of forming a given tree is a homogeneous multivariate polynomial in the ages of the vertices that would form the tree. It is then necessary to integrate over the joint distribution (which fortunately is a product in the limit) of the ages of the vertices. I was playing around with this by considering what seemed to be the extreme cases: the star and the path. I was working with the relatively simple case n=4, and it struck me that perhaps the polynomial for the star was always at least as large as that for the path. This would be convenient as it would avoid the need for a horrific-looking integral calculation. This turned out to be true. My first method was a heavy but uncontroversial convexity and stationary point argument, but I found a pair of vectors embedded in the desired inequality on which I could deploy rearrangement.

Anyway, I thought I should be able to come up with a nice proof, and I think this is one. I think this is particularly nice because it is a demonstration that one can do a proof by induction without explicitly inducting on the natural numbers.

We begin with a base case, which is the theorem for n=2, even though we will not be doing induction in the canonical way. We are required to prove that given

a_1\le a_2,\quad b_1\le b_2,

that

a_1b_1+a_2b_2\ge a_1b_2+a_2b_1,

since these are the only available permutations. Moving some terms around gives

(a_2-a_1)(b_2-b_1)\ge 0,

which is true by construction, and so the n=2 result follows.

We now move straight to the general n case. We focus on the left of the two inequalities in the statement of the result, since the other will follow by an identical method, applied in reverse. We consider the case where \sigma is a transposition. For example, we might consider 12435. When we write out the result we want:

a_1b_1+a_2b_2+a_3b_3+a_4b_4+a_5b_5\ge a_1b_1+a_2b_2+a_3b_4+a_4b_3+a_5b_5,

we realise that many of the terms cancel, and the content of the theorem reduces to the n=2 case we have already dealt with. Obviously, this holds equally well whenever \sigma is a transposition. Similarly, if \sigma is a product of two disjoint transpositions, which means that two disjoint pairs of elements are interchanged, we can apply the n=2 case twice, then add on the extra terms to get the result.

In fact, we can do much better than this, by using the fact that any permutation can be expressed as a product of transpositions. We need to be careful about the risk of asserting that every time we multiply the permutation \sigma by a transposition, the value of the associated sum-product expression gets smaller. While the idea is correct, this cannot be generally true. After all, applying the same transposition twice returns us to the identity permutation!

We can nonetheless say something useful. If we start with a permutation

\sigma(1),\sigma(2),\ldots,\sigma(n),

and we interchange the ith and jth elements, to get,

\tau=\sigma(1),\ldots,\sigma(i-1),\sigma(j),\sigma(i+1),\ldots,\sigma(j-1),\sigma(i),\sigma(j+1),\ldots,\sigma(n),

then the product sum corresponding to \tau is less than or equal to the product sum corresponding to \sigma if $\sigma(i)\leq sigma(j)$, under the implicit assumption that i<j. In other words, we can prove the rearrangement inequality for any permutation \sigma that can be obtained from the identity by repeatedly interchanging elements that are initially in increasing order. Essentially, we have defined a partial ordering on the set of permutations.

It suffices to check that all permutations have this property. In fact, this is relatively easy. We can move element n to its required position in \sigma by successively swapping with (n-1), (n-2), etc. If we set this up as an inductive argument, we can finish by applying the hypothesis to the remaining (n-1) elements, which are in the same order as the identity permutation on [n-1].

So we have proved the left-hand side of the Rearrangement Inequality. In fact, this partial ordering framework makes it clear how to prove the right-hand side. By an identical argument, we can get from any permutation to the reverse identity by a similar set of operations.

The Top-to-Random Shuffle III

This post concludes my non-exhaustive list of things I think are interesting about the top-to-random shuffle. In previous posts I have talked about the construction and correct sense of convergence to randomness, and that this algorithm does genuinely achieve uniform randomness at some hitting time which is easy to specify. Promising that posts will be short hasn’t worked in the past so I won’t do that again now, but the idea of this post is brief:

When we specified the dynamics of the top-to-random shuffle, we insisted that the top card card could be placed anywhere in the deck with equal probability including back on top. This appears to be doing nothing except slowing down the shuffling process. Why is this important for convergence to randomness?

Fortunately the answer is short: if we do not let the top card be inserted back onto the top, allowing the configuration to stay the same, then we can divide up the set of orderings into two classes, and the pack will alternate between them.

Why is this a problem? Suppose the classes are called X and Y, and X is the class that contains the original ordering 1,2,…,n. Then after k shuffles, the ordering of the deck will be in X if k is even and in Y if k is odd. Remember our definition of ‘close to randomness’ will be the greatest difference in probability of an event between the actual distribution and the uniform distribution. As before, you can think of this by a betting analogy – what proportion profit can you make again someone who thinks it’s uniform by knowing the true distribution?

Well, it will turn out that the sets X and Y have the same size, so in the uniform distribution, the probability that an ordering is in X is 1/2. Whereas if the pack alternates, then so long as we know how many shuffles have occurred, this probability is either 0 or 1. In particular this is far from 1/2. We should remark that if we introduce the notion of sampling at a random time, or taking an average over all large times in some sense, such problems may disappear, but the result obtained may be less useful. See this post on Cesaro Mixing for details presented in a more rigorous style.

So it remains to see why this is true. First a definition. A transposition is when two elements in a permutation are exchanged. Eg 31452 -> 35412 by transposing 1 and 5. It makes sense intuitively that we can get from any permutation to any other permutation by making successive transpositions. Indeed, this is precisely what is happening in the top-to-random shuffle. To avoid continually having to write it out, we call the original permutation 1,2,…,n the identity permutation.

Then the idea is that X is the set of permutations we can obtain by starting with the identity and applying an even number of transpositions, while Y is the set obtained by applying an odd number of transpositions. For this to work, we will need to show that these sets are disjoint. That is, no permutation can be generated by both an odd number and an even number of transpositions. This is important, as a permutation can certainly be generated from transpositions in multiple ways. For example, if the elements are 1,2,3, we can obtain the permutation 2,1,3 by transposing 1 and 2, obviously. However, we could alternatively start by transposing 2 and 3 to get 1,3,2, then 1 and 3 to get 3,1,2, then 2 and 3 again to get 2,1,3. Note that both of these require an odd number of transpositions.

We will call a permutation even if it is generated by an even number of transpositions, and odd otherwise. We also say that its sign (alternatively signature, parity) is +1 or -1 respectively. To prove this is well-defined, we really want to find a different property that is easier to track.

A useful trick is to count how many pairs of elements are not in the correct order. Let’s do this for our previous example: 31452. There are 5 elements so 5 x 4 / 2 = 10 pairs of elements. We list them:

  • 1 and 2 are in the correct order.
  • 1 and 3 are not, as 3 comes before 1 in this permutation.
  • 1 and 4 are correct.
  • 1 and 5 are correct.
  • 2 and 3 are not.
  • 2 and 4 are not.
  • 2 and 5 are not.
  • 3 and 4 are correct.
  • 3 and 5 are correct.
  • 4 and 5 are not.

So 5 pairs are not in the correct order. Since 5 is odd, the claim is that this means 31452 is an odd permutation. To check this, and to confirm that the sign is well-defined, it suffices to check that the number of so-called inversions, or pairs in the wrong order, changes parity every time we apply a transposition.

This is clearly true if we transpose adjacent elements. Then the orderings of all pairs remain the same, apart from the pair we transposed, which changes. Then, if the elements are not adjacent, instead of transposing them directly, we can perform a succession of transpositions of adjacent elements. The easiest way to describe this is again by example. Suppose we want to transpose 3 and 5 in 31452.

31452 -> 13452 -> 14352 -> 14532 -> 15432 -> 51432.

Note that the middle transposition is actually transposing 3 and 5, and the others are symmetric about this middle operation. In particular, there is an odd number of transpositions in total. So we have proved the result for general transpositions, and thus we now know that the sign of a permutation is well-defined. Note also that there are an equal number of odd and even permutations of every n=>2. For every odd permutation, transposing 1 and 2 gives an even permutation, and vice versa, uniquely, giving a bijection.

What’s really going on is that we are able to multiply permutations, by doing one after the other. Unlike multiplying real numbers, the order in which we do this now matters. In this context, the set of permutations is an example of a general structure called a group. The idea of partitioning a group into subsets which are in some sense symmetric and where some other operation jumps between the subsets is a useful motivation point for a whole avenue of interesting theory. Not to be explored now unfortunately…

Mixing Times 4 – Avoiding Periodicity

A Markov chain is periodic if you can partition the state space such it is possible to be in a particular class only at certain, periodic times. Concretely, suppose we can find a decomposition into classes \Omega=V_1\cup\ldots\cup V_k such that conditional on X_t\in V_i, we have \mathbb{P}(X_{t+1}\in V_{i+1})=1, where the indices of the Vs are taken modulo k. Such a chain is called periodic with period k. In most cases, we would want to define the period to be the maximal such k.

Why is periodicity a problem? It prevents convergence to equilibrium. The distribution at time t has some fairly strong dependence on the initial distribution. For example, if the initial distribution is entirely supported on V_1 as defined above, then the distribution at time t will be entirely supported on V_i, where i\equiv t \mod k. In particular, this cannot converge to some equilibrium.

Aperiodicity thus becomes a necessary condition in any theorem on convergence to equilibrium. Note that by construction this is only relevant for chains in discrete time. In an first account of Markov chains, most of the examples will either have a small state space, for which the transition matrix will have to contain lots of zeros before it stands a chance of being periodic, or obviously aperiodic birth-death or queue type processes. But some of the combinatorially motivated chains we consider for interesting mixing properties are more likely to be periodic. In particular, for a random walk on a group say, the generator measure may well be supported only on a small subset of the whole group, which is completely natural (eg transpositions as a subset of the symmetric group). Then it becomes more plausible that periodicity might arise because of some underlying regularity or symmetry in the group structure.

My first claim is that periodicity is not a disaster for convergence properties of Markov chains. Firstly, by the definition above, P^k(x,y) for x,y\in V_1 is an irreducible (aperiodic if k is maximal) transition matrix on V_1, and so we have convergence to some equilibrium distribution on V_1 of (X_{kt+a}) or similar. An initial distribution mixed between classes gives a mix of such equilibria. Alternatively, we could think about large-time ergodic properties. By taking an average over all distributions up to some large t, the periodic problems get smoothed out. So, for mixing on a periodic chain, it might be possible to make headway with Cesaro mixing, which looks at the speed of convergence of the ergodic average distribution.

In most cases, though, we prefer to alter the chain directly to remove periodicity, or even any chance of periodicity. The preferred method in many contexts is to replace the transition matrix P with \frac12 (P+I). This says that at every time t, we toss an independent fair coin, and with probability 1/2 make the transition suggested by P, and with probability 1/2 we stay where we are. Note that if a chain is irreducible, and some P(x,x)>0, then it is definitely aperiodic, as x cannot be in more than one class as per the definition of periodicity.

If you want to know about the mixing time of the original chain, note that this so-called lazy chain moves at half the speed of the original, so to get exact asymptotics (eg in the case of cutoff, that is mixing speed faster than the scale of the mixing time) you must multiply by 2. Also note that all of the eigenvalues of \frac12 (P+I) are non-negative, and in fact, the eigenvalues are subject to a linear transform in the construction of the lazy transition matrix \lambda\mapsto \frac12(1+\lambda).

Note that choosing 1/2 as the parameter is unnecessary. Firstly, it would suffice to take some P(x,x)=\epsilon and rescale the rest of the row appropriately. Also, in some cases, a different constant gives a more natural interpretation of the underlying mechanism. For example, one model worth considering is the Random Transposition Random Walk on the symmetric group, where at time t we multiply (ie compose) with a transposition chosen uniformly at random. This model is interesting partly because the orbits of an element resemble, at least initially, the component size process of a Erdos-Renyi random graph, on the grounds that when the number of transpositions is small, they don’t interact too much, so can be viewed as independent edges. Anyway, some form of laziness is needed in RTRW, otherwise the chain will alternative between odd and even permutations. In this case, 1/2 is not the most natural choice. The most sensible way to sample random transpositions is to select the two elements of [n] to be transposed uniformly and independently at random. Thus each transposition is selected with probability \frac{2}{n^2}, while the identity, which corresponds to ‘lazily’ staying at the current state in the random walk, is selected with probability 1/n.

The lazy chain is also useful when the original chain has a lot of symmetry involved. In particular, if the original chain involves ‘switching’ say one coordinate. The best example is the random walk on the vertices of the n-hypercube, but there are others. Here, the most helpful way to visualise the configuration is to choose a coordinate uniformly at random and then flip its value (from 0 to 1 or 1 to 0). Now the lazy chain can be viewed similarly, but note that the dependence on the current value of the coordinate is suppressed. That is, having chosen the coordinate to be affected, we set it to be 0 with probability 1/2 and 1 with probability 1/2, irrespective of the prior value at that coordinate. Thus instead of viewing the action on coordinates to be ‘stay or switch’, we can view the action on the randomly chosen coordinate to be ‘randomly resample’, to use statistical terminology. This is ideal for coupling, because from the time coordinate j is first selected, the value at that coordinate is independent of the past, and in particular, the initial value or distribution. So we can couple arbitrary initial configurations or distributions, and we know that as soon as all coordinates have been selected (a time that can be described as a coupon collector problem), the chains are well coupled, that is, the values are the same.

Note that one way we definitely get periodicity is if the increment distribution for random walk on a group is supported entirely in a single coset of a normal subgroup. Why? Well if we take H\lhd G to be the normal subgroup, and gH to be the relevant coset, then P^t(g',\cdot) is supported entirely on g^tg'H, so is periodic with period equal to the order of gH in the quotient group G/H. Note that if the coset is the normal subgroup itself, then it might well include support on the identity, which immediately makes the chain aperiodic. However, there will be then be no transitions between cosets, so the chain is not irreducible on G.

The previous paragraph is the content of Remark 8.3 in the book we are reading. My final comment is that normality is precisely what is needed for this to hold. The key idea is that the set of subsets {gH, gHgH, gHgHgH, … } forms a partition of the group. This is certainly true if H is normal and gH generates G/H. If the latter statement is not true, then the set of subsets still forms a partition, but of some subset of G. The random walk is then neither irreducible nor aperiodic on the reduced state space. If H is not normal, then there are no such restrictions. For example, gHgH might be equal to the whole group G. Then the random walk is aperiodic, as this would imply we can move between any pair of states in two steps, and so by extension between any pair of states in three steps. (2,3)=1, hence the chain is aperiodic. As a concrete example, consider

\tau=\langle (1 2)\rangle \leq S_3,

the simplest example of a non-normal subgroup. Part of the problem is that cosets are different in the left-case and the right-case. Consider the left coset of \tau given by \sigma\tau=\{(1 2 3),(2 3)\}. These elements have order three and two respectively, and so by a similar argument to the general one above, this random walk is aperiodic.