The Yule Process

The second problem sheet for classes on the Applied Probability course this term features a long question about the Yule process. This is probably the simplest example of a birth process. It’s named for the British statistician George Udny Yule, though some sources prefer to call it the Yule-Furry process for the American physicist Wendell Furry who used it as a model of a radioactive reaction.

The model is straightforward. At any time there is some number of individuals in the population, and each individual gives birth to an offspring at constant rate \lambda, independently from the rest of the population. After a birth has happened, the parent and child evolve independently. In the notation of general birth processes, the birth rate when there are n individuals is \lambda_n=\lambda n.

Note that if we start with two or more individuals, the sizes of the two or more families of descendents evolve as a continuous-time Polya’s urn. The arrivals process speeds up with time, but the jump chain is exactly Polya’s urn. Unsurprisingly, the Yule process can be found embedded in preferential attachment models, and other processes which are based around Polya’s urn with extra information.

This is a discrete, random version of exponential growth. Since the geometric distribution is the discrete analogue of the exponential distribution, we probably shouldn’t be surprised to learn that this is indeed the distribution of the process at some fixed time t, when it is started from a single original ancestor. This is all we care about, since the numbers of descendents from each different original ancestors are independent. In general, the distribution of the population size at some fixed time will be negative binomial, that is, a sum of IID geometric distributions.

The standard method here is to proceed using generating functions. Conditioning on the first splitting time gives two independent copies of the original process over a shorter time-scale. One derives an ODE in time for the generating function evaluated at any particular value z. This can be solved uniquely for each z, and patching together gives the generating function of the distribution at any specific time t, which can be seen to coincide with the corresponding generating function of the geometric distribution with parameter e^{-\lambda t}.

So we were trying to decide whether there might be a more heuristic argument for this geometric distribution. The method we came up with is not immediate, but does justify the geometric distribution in a couple of steps. First, we say that the birth times are T_2,T_3,\ldots, so between times [T_n,T_{n+1}) there are n individuals, with T_1:=0 for concreteness. Then by construction of the birth process, T_{n+1}-T_n\stackrel{d}{=}\mathrm{Exp}(\lambda n).

We now look at these ‘inter-birth times’ backwards, starting from T_{n+1}. Note that \mathrm{Exp}(\lambda n) is the distribution of the time for the first of n IID \mathrm{Exp}(\lambda) clocks to ring. But then, looking backwards, the next inter-birth time is thus the distribution of the time for one of (n-1) IID \mathrm{Exp}(\lambda) clocks to ring. So by memorylessness of the exponential distribution (discussed at great length on the first problem sheet), we can actually take these (n-1) clocks to be exactly those of the original n clocks which did not ring first. Continuing this argument, we can show that the first (in the original time direction) inter-birth time corresponds to the time spent waiting for the final clock to ring. Rewriting this observation formally:

T_{n+1}\stackrel{d}{=}\max\{X_i : X_1,\ldots,X_n\stackrel{\text{iid}}{\sim}\mathrm{Exp}(\lambda)\}. (*)

To return to justifying the geometric form of the distribution, we need to clarify the easiest relationship between the population size at a fixed size and these birth times. As we are aiming for the geometric distribution, the probability of the event \{X_t>n\} will be most useful. Clearly this event is the same as \{T_{n+1}<t\}, and from the description involving maxima of IID exponentials, this is easy to compute as (1-e^{-\lambda t})^n, which is exactly what we want.

There are two interesting couplings hidden in these constructions. On closer inspection they turn out to be essentially the same from two different perspectives.

We have specified the distribution of T_n at (*). Look at this distribution on the right hand side. There is a very natural way to couple these distributions for all n, namely to take some infinite sequence X_1,X_2,\ldots of IID \mathrm{Exp}(\lambda) random variables, then use initial sequences of these to generate each of the T_ns as described in (*).

Does this coupling correspond to the use of these IID RVs in the birth process? Well, in fact it doesn’t. Examining the argument, we can see that X_1 gives a different inter-birth time for each value of t in the correspondence proposed. Even more concretely, in the birth process, almost surely T_{n+1}>T_n for each n. This is not true if we take the canonical coupling of (*). Here, if X_n<\max\{X_1,\ldots,X_{n-1}\}, which happens with high probability for large n, we have T_{n+1}=T_n in the process of running maxima.

Perhaps more interestingly, we might observe that this birth process gives a coupling of the geometric distributions. If we want to recover the standard parameterisation of the geometric distribution, we should reparameterise time. [And thus generate an essentially inevitable temptation to make some joke about now having a Yule Log process.]

Let’s consider what the standard coupling might be. For a binomial random variable, either on [n] or some more exotic set, as in percolation, we can couple across all values of the parameter by constructing a family independent uniform random variables, and returning a 1 if U_i>1-p and so on, where p is the parameter of a specific binomial realisation.

We can do exactly the same here. A geometric distribution can be justified as the first success in a sequence of Bernoulli trials, so again we can replace the relevant Bernoulli distribution with a uniform distribution. Take U_1,U_2,\ldots to be IID U[0,1] random variables. Then, we have:

X_t=\stackrel{d}{=}\bar X_t:= \max\{n: U_1,\ldots,U_{n-1}\ge e^{-\lambda t}\}.

The equality in distribution holds for any particular value of t by constructing. But it certainly doesn’t hold uniformly in t. Note that if we define \bar X_t as a process, then typically the jumps of this process will be greater than 1, which is forbidden in the Yule process.

So, we have seen that this Yule process, even though its distribution at a fixed time has a standard form, provides a coupling of such distributions that is perhaps slightly surprising.

The Inspection Paradox and related topics

In the final class for Applied Probability, we discussed the so-called Inspection Paradox for an arrivals process. We assume that buses, sat, arrive as a Poisson process with rate 1, and consider the size of the interval (between buses) containing some fixed time T. The ‘paradox’ is that the size of this interval is larger in expectation than the average time between buses, which of course is given by an exponential random variable.

As with many paradoxes, it isn’t really that surprising after all. Perhaps what is more surprising is that the difference between the expected observed interval and the expected actual interval time is quite small here. There are several points of interest:

1) The Markov property of the Poisson process is key. In particular, this says that the expectation (and indeed the distribution) of the waiting time for a given customer arriving at T is not dependent on T, even if T is a random variable (or rather, a class of random variables, the stopping times). So certainly the inspection paradox property will hold whenever the process has the Markov property, because the inspected interval contains the inspected waiting time, which is equal in distribution to any fixed interval.

2) Everything gets slightly complicated by the fact that the Poisson process is defined to begin at 0. In particular, it is not reversible. Under the infinitesimal (or even the independent Poisson increments) definition, we can view the Poisson process not as a random non-decreasing function of time, but rather as a random collection of points on the positive reals. With this setup, it is clearly no problem to define instead a random collection of points on all the reals. [If we consider this instead as a random collection of point masses, then this is one of the simplest examples of a random measure, but that’s not hugely relevant here.]

We don’t need to worry too much about what value the Poisson process takes at any given time if we are only looking at increments, but if it makes you more comfortable, you can still safely assume that it is 0 at time 0. Crucially, the construction IS now reversible. The number of points in the interval [s,t] has distribution parameterised by t-s, so we it doesn’t matter which direction we are moving in down the real line. In this case, A_t, the time since the previous arrival, and E_t, the waiting time until the next arrival, are both Exp(1) RVs, as the memorylessness property applies in each time direction.

For the original Poisson process, we actually have A_t stochastically dominated by an Exp(1) distribution, because it is conditioned to be less than or equal to t. So in this case, the expected interval time is some complicated function of t, lying strictly between 1 and 2. In our process extended to the whole real line, the expected interval time is exactly 2.

This idea of extending onto the whole real line explains why we mainly consider delayed renewal processes rather than just normal renewal processes. The condition that we start a holding time at 0 is often not general enough, particularly when the holding times are not exponential and so the process is not memoryless.

3) There is a general size-biasing principle in action here. Roughly speaking, we are more likely to arrive in a large interval than in a small interval. The scaling required is proportional to the length of the interval. Given a density function f(x) of X, we define the size-biased density function to be xf(x). We need to normalise to give a probability distribution, and dividing by the expectation EX is precisely what is needed. Failure to account for when an observation should have the underlying distribution or the size-biased distribution is a common cause of supposed paradoxes. A common example is ‘on average my friends have more friends than I do’. Obviously, various assumption on me and my friends, and how we are drawn from the set of people (and the distribution of number of friends) is required that might not necessarily be entirely accurate in all situations.

In the Poisson process example above, the holding times have density function e^{-x}, so the size-biased density function if xe^{-x}. This corresponds to a \Gamma(2,1) distribution, which may be written as the sum of two independent Exp(1) RVs as suggested above.

4) A further paradox mentioned on the sheet is the waiting time paradox. This says that the expected waiting time is longer if you arrive at a random time than if you just miss a bus. This is not too surprising: consider at least the stereotypical complaint about buses in this country arriving in threes, at least roughly. Then if you just miss a bus, there is a 2/3 chance that another will be turning up essentially immediately. On the sheet, we showed that the \Gamma(\alpha,1) distribution has this property also, provided \alpha<1.

We can probably do slightly better than this. The memoryless property of the exponential distribution says that:

\mathbb{P}(Z>t+s|Z>t)=\mathbb{P}(Z>s).

In general, for the sort of holding times we might see at a bus stop, we might expect it to be the case that if we have waited a long time already, then we are less likely relatively to have to wait a long time more, that is:

\mathbb{P}(Z>t+s|Z>t)\leq\mathbb{P}(Z>s),

and furthermore this will be strict if neither s nor t is 0. I see no reason not to make up a definition, and call this property supermemorylessness. However, for the subclass of Gamma distributions described above, we have the opposite property:

\mathbb{P}(Z>t+s|Z>t)\geq\mathbb{P}(Z>s).

Accordingly, let’s call this submemorylessness. If this is strict, then it says that we are more likely to have to wait a long time if we have already been waiting a long time. This seems contrary to most real-life distributions, but it certainly explains the paradox. If we arrive at a random time, then the appropriate holding time has been ‘waiting’ for a while, so is more likely to require a longer observed wait than if I had arrived as the previous bus departed.

In conclusion, before you think of something as a paradox, think about whether the random variables being compared are being generated in the same way, in particular whether one is size-biased, and whether there are effects due to non-memorylessness.

The Poisson Process – A Third Characteristion

There remains the matter of the distribution of the number of people to arrive in a fixed non-infinitissimal time interval. Consider the time interval [0,1], which we divide into n smaller intervals of equal width. As n grows large enough that we know the probability that two arrivals occur in the same interval tends to zero (as this is \leq no(\frac{1}{n})), we can consider this as a sequence of iid Bernoulli random variables as before. So

\mathbb{P}(N_1=k)=\binom{n}{k}(\frac{\lambda}{n})^k(1-\frac{\lambda}{n})^{n-k}
\approx \frac{n^k}{k!} \frac{\lambda^k}{n^k}(1-\frac{\lambda}{n})^n\approx \frac{\lambda^k}{k!}e^{-\lambda}.

We recognise this as belonging to a Poisson (hence the name of the process!) random variable. We can repeat this for a general time interval and obtain N_t\sim \text{Po}(\lambda t).

Note that we implicitly assumed that, in the infinitissimal case at least, behaviour in disjoint intervals was independent. We would hope that this would lift immediately to the large intervals, but it is not immediately obvious how to make this work. This property of independent increments is one of the key definitions of a Levy Process, of which the Poisson process is one of the two canonical examples (the other is Brownian Motion).

As before, if we can show that the implication goes both ways (and for this case it is not hard – letting t\rightarrow 0 clearly gives the infinitissimal construction), we can prove results about Poisson random variables with ease, for example \text{Po}(\lambda)+\text{Po}(\mu)=\text{Po}(\lambda+\mu).

This pretty much concludes the construction of the Poisson process. We have three characterisations:
1) X_n\sim\text{Exp}(\lambda) all iid.
2) The infinitissimal construction as before, with independence.
3) The number of arrivals in a time interval of width t \sim \text{Po}(\lambda t). (This is sometimes called a stationary increments property.) Furthermore, we have independent increments.

A formal derivation of the equivalence of these forms is important but technical, and so not really worth attempting here. See James Norris’s book for example for a fuller exposition.

The final remark is that the Poisson Process has the Markov property. Recall that this says that conditional on the present, future behaviour is independent of the past. Without getting into too much detail, we might like to prove this by using the independent increments property. But remember that for a continuous process, it is too much information to keep track of all the distributions at once. It is sufficient to track only the finite marginal distributions, provided the process is cadlag, which the Poisson process is, assuming we deal with the discontinuities in the right way. Alternatively, the exponential random variable is memoryless, a property that can be lifted, albeit with some technical difficulties, to show the Markov property.