Guess 2/3 of the average

The following problem was doing the rounds on some email lists in Oxford this week:

“You and a group of colleagues are all invited to pick an integer between 0 and 100. The average x is calculated, and the person whose guess is unique and comes closest to 2x/3 (among the unique guesses) is declared the winner.”

Although it wasn’t specified in the setting, one assumes that if two unique guesses were equally close to 2/3 of the average, then the two participants shared the prize.

In any case, it is interesting to ask the question, what is the optimal strategy?

I don’t know very much about game theory, so I’m going to say some things that I hope are true. I don’t claim that this is the best or even a good way of thinking about the situation. If we break the problem down into two parts, we have two of the more famous problems in game theory. To consider the first part, we remove the requirement that the winner’s guess be unique, and allow the participants to pick any real number between 0 and 100.

It seems to me there are three ways to do this. Let’s say we have K possible winners who guessed closed to 2x/3. We could give each of them the fixed prize, say £1. Or we could give each of them £1/K, or we could choose one at random and give them £1. I claim that the latter two methods are essentially the same, and so will ignore the third possibility.

Note that 2x/3 will definitely be at most 200/3. Suppose I pick a real value y that is greater than 200/3. What would happen if instead I picked 200/3? Well the value of x would decrease slightly. How much it decreases would depend on the number of people playing the game. It is not hard to check that we end up closer to 2x/3 if we pick 200/3 instead of y. Let’s call \alpha=\frac{200}{3}.

We might say that the strategy “pick \alpha” dominates the strategy “pick y”, because we always do at least as well with the first strategy as the second, regardless of what the other players do, and indeed in some cases do strictly better. Now comes the potentially philosophical bit. We have concluded that it would be irrational for us to pick any number greater than \alpha. We also know that the email was sent to mathematicians in Oxford, whom we might hope we could safely assume were equally rational. So now we are playing the same game as before, only with the bounds that we should pick a number between 0 and \alpha. We can then apply exactly the same argument to show that it would be irrational to pick any number greater than \frac{2\alpha}{3}. This equally applies to all the other agents, and so we continue, showing that for any k, it is irrational to pick any number greater than (\frac23)^k\alpha. In conclusion, the only rational strategy is to pick 0, and hence everyone will do this.

For the prize structure suggested, this is a Nash equilibrium, since no-one can improve their winnings by changing strategy (with everyone else’s kept the same). This principle is in general referred to as iterated elimination of dominated strategies. The assumption that, not only are players rational, but that all players know that all other players are rational, and all players know that all players know the others rational and so on, is called Common Knowledge of Rationality. This doesn’t seem hugely controversial in this artificial setting, but it is not hard to think of examples where this might not hold. For example, some players might be facing the same situation, but have different utility functions, so especially if there is some randomness in the dynamics, people will have different ‘rational’ opinions concerning strategies, and unless you know their utility functions, it might seem like the other players are not acting rationally.

This argument works equally well with integers. We first eliminate all possibilities greater than 67, and then keep dividing by 2/3, but we have to take the ceiling function to get the new upper bound. As a result, after repeated application of this idea, we conclude that the only rational strategies are to pick 0 or 1.

It can be seen that neither strategy dominates the other. If more than ¾ of the agents choose 1, then choosing 1 is the optimal strategy; if less than ¾ choose 1, then choosing 0 is better. If exactly ¾ choose 1 then everyone wins. There are, however, only two Nash equilibrium, given by everyone choosing 1, and everyone choosing 0. It is tempting to argue that we might expect roughly ½ the agents to choose 1, and so our better option is to choose 0. But we are assuming the other agents are rational, and there is no a priori reason why picking 0 or 1 with probability ½ each should be the rational thing to do.

However, choosing randomly is a possible strategy. In this setting, it becomes harder to talk about Nash equilibria. When there are just two agents, the set of even random strategies for each agent is relatively manageable, but when there is some arbitrary number of players, the set of possible strategies might be very complicated indeed.

The final stage of this integer problem is morally equivalent to playing a game where everyone has to pick 0 or 1, and you win if you are in the majority. For this problem, it is clear that the only Nash equilibria are for everyone to pick either 0 or 1. If the winners share the prize-money equally, then this game has the interesting property that every configuration is Pareto optimal. This means that no-one can improve their winnings without causing someone else’s situation to decrease. When it comes to deciding what to do, with no knowledge of other people’s strategies, we have made little progress. At least here the symmetry between 0 and 1 makes it clear how little progress we can make. The only thing we can say is that if the utility is concave (a generally reasonable assumption), then by conditioning on what everyone else does, an optimal strategy is to pick 0 with probability 1/2.

The problem with saying anything sensible in these sorts of problem with a zero-sum condition (as we effectively have if the winnings are shared) is the symmetry between all the players. Given any strategy, if everyone used that strategy, then everyone would win 1/N in expectation. This applies equally well if we add the condition that the winner has to pick something unique.

The only question we can perhaps say something about occurs if we say that no-one wins the original game if there are no unique integer answers. Then it is worth considering which strategy could everyone adopt so as to minimise the probability of this happening. For the reasons given above, we should indeed force everyone to take the same strategy, which we can interpret as a distribution on the integers. So which distribution on the integers between 1 and k maximises the probability of having at least one unique value?

We also need to know how many people are playing, so let’s say this is N. I don’t know how to solve the general problem, but I can say something if we fix either k or N and then let the other one go to infinity.

If we fix the number of people and let k go to infinity, then we can get arbitrarily close to probability one by using the uniform distribution. Indeed, any sequence of distributions where the probabilities converge uniformly to zero ( ie \max_{x\in[k]} \mathbb{P}(x) \rightarrow 0 ) has this property.

The case where k is fixed and the number of players goes to infinity is a bit trickier. Essentially, if we choose any fixed distribution, the event we seek becomes less and less likely as the number of agents grows. It is equivalent to demanding that if we roll a dice a million times, we only see precisely one six. If we replace a million by larger numbers, this probability decays exponentially.

If we want to maximise the probability of getting precise one agent choosing value 1, we observe that the number of agents choosing that value is binomial(N,p), where p is the probability of that value. If p=o(1/N) then asymptotically, with high probability the number of 1s is 0. If p=w(1/N) then asymptotically, with high probability, the number of 1s is \sim pN. Taking p=\frac{\alpha}{N}, the asymptotic distribution of the number of 1s is Poisson with parameter \alpha. We can solve to see which value of \alpha maximises the probability of getting a single 1. It turns out, unsurprisingly after considering the expectation of the corresponding pre-limit binomial distribution, that this maximum is achieved at \alpha=1.

So note now that if we take the probabilities of picking 1 and 2 both to be 1/N, we get two Poisson random variables asymptotically. For a similar argument to the construction of the Poisson process from independent infinitesimal increments, the covariance of these tends to 0. So I conjecture, and I reckon it is probably not too hard to come up with a formal proof that for large N, the optimal distribution looks like:

(\frac{1}{N},\ldots,\frac{1}{N},1-\frac{k-1}{N}),

or some permutation of that.

Enhanced by Zemanta

The Fisher Information and Cramer-Rao Bound

Given that I’d done this twice before, and was giving the same tutorial five times this week, I was surprised at the extent to which the definition of the Fisher Information caused me to confuse both myself and the students. I thought it would be worth summarising some of the main ways to get confused, and talking about one genuine, quantitative use of the Fisher Information.

Recall we are in a frequentist model, where there is an unknown parameter \theta controlling the distribution of some observable random variable X. Our aim is to make inference about \theta using the value of X we observe. We use a lower case x to indicate a value we actually observe (ie a variable as opposed to a random variable). For each value of \theta, there is a density function f_\theta(x) controlling X. We summarise this in the likelihood function L(x,\theta).

The important thing to remember is that there are two unknowns here. X is unknown because it is genuinely a random variable. Whereas \theta is unknown because that is how the situation has been established. \theta is fixed, but we are ignorant of its value. If we knew \theta we wouldn’t need to be doing any statistical inference to find out about it! A useful thing to keep in mind in everything that follows is: “Is this quantity a RV or not?” This is equivalent to “Is this a function of X or not?”, but the original form is perhaps clearer.

For some value of X=x, we define the maximum likelihood estimator to be

\hat\theta(X):=\text{argmax}_\theta L(X,\theta).

In words, given some data, \hat\theta is the parameter under which this data is most likely. Note that L(x,\theta) is a probability density function for fixed \theta, but NOT for fixed x. (This changes in a Bayesian framework.) For example, there might well be values of x for which L(x,\theta)=0\,\forall \theta\in\Theta.

Note also that we are only interested in the relative values of L(x,\theta_1), L(x,\theta_2). So it doesn’t matter if we rescale L by a constant factor (although this means the marginal in x is no longer a pdf). We typically consider the log-likelihood l(x,\theta)=\log L(x,\theta) instead, as this has a more manageable form when the underlying RV is an IID sequence. Anyway, since we are interested in the ratio of different values of L, we are interested in the difference between values of the log-likelihood l.

Now we come to various versions of the information. Roughly speaking, this is a measure of how good the estimator is. We define the observed information:

J(\theta):=-\frac{d^2 l(\theta)}{d\theta^2}.

This is an excellent example of the merits of considering the question I suggested earlier. Here, J is indeed a random variable. The abbreviated notation being used can lead one astray. Of course, l(\theta)=l(X,\theta), and so it must be random. The second question is: “where are we evaluating this second derivative?”

For this, we should be considering what our aim is. We know we are defining the MLE by maximising the likelihood function for fixed x. We have said that the difference between values of l gives a measure of relative likelihood. So if the likelihood function has a sharp peak at \hat\theta, then this gives us more confidence than if the peak is very shallow. (I am using ‘confidence’ in a non-technical. Confidence intervals are related, but I am not considering that now.) The absolute value second derivative is precisely a measure of this property.

Ok, but the information does not evaluate this second derivative at \hat\theta, it evaluates it at \theta. The key point is that it is still a good measure if it evaluates the second derivative at a point close to \hat\theta. And if \hat\theta is a good estimator, which it typically will be, especially when we have an IID sequence and the number of terms grows large, then \theta and \hat\theta will be close together, and so it remains a plausible measure.

This idea is particularly important when we come to consider the Fisher InformationThis is defined as

I(\theta):= \mathbb{E}J(\theta)=\mathbb{E} -\frac{d^2 l(\theta)}{d\theta^2}.

The cause for confusion is exactly what is mean by this expectation. It is not implausible that this is present, since we have already explained why J(\theta) is a random variable. But we need to decide what distribution we are to integrate with respect to. After all, we don’t actually know the distribution of X. If we did, we wouldn’t be doing statistical inference on it!

So the key thing to remember is that in I(\theta), the value \theta plays two roles. First, it gives the distribution of X with respect to which we integrate. Also, it tells us where to evaluate this second derivative. This makes sense overall. If the distribution we are considering is l(\cdot,\theta), then we expect \hat\theta to be close to the true value \theta, and so it makes sense to evaluate it there.

Now we deduce the Cramer-Rao bound, which says that for any unbiased estimator \hat\theta of \theta, we have

\text{Var}(\hat\theta)\ge \frac{1}{I(\theta)}.

First we explain that unbiased means that \mathbb{E}\hat\theta=\theta. This is a property that we would like any estimator to have, though often we have to settle for this property asymptotically. Again, we should be careful about the role of \theta. Here we mean that given some parameter \theta, \hat\theta is then a RV depending onto the actual data, and so has a variance, which happens to be bounded below by a function of the Fisher Information.

So let’s prove this. First we need a quick result about the score, which is defined as:

U(\theta)=\frac{dl(\theta)}{d\theta}.

Again, this is a random variable. We want to show that \mathbb{E}U(\theta)=0. This is not difficult. Writing f(x)=L(x,\theta), we have

\mathbb{E}U(\theta)=\int f(x)\frac{\partial}{\partial\theta}\log f(x)dx

= \int \frac{\partial}{\partial\theta} L(x,\theta)dx=\frac{d}{d\theta}\int f(x)dx=\frac{d}{d\theta}1=0,

as required. Next we consider the covariance of U and \hat\theta. Since we have established that \mathbb{E}U=0, this is simply \mathbb{E}[U\hat\theta].

\text{Cov}(U,\hat\theta)=\int \hat\theta(x)f(x) \frac{d \log f(x)}{d\theta}dx=\int \hat\theta(x)f(x)\cdot \frac{\frac{\partial f(x)}{\partial \theta}}{f(x)} dx

= \int \hat\theta(x)\frac{\partial f(x)}{\partial \theta}=\frac{\partial}{\partial \theta}\int \hat\theta(x)f(x)

= \frac{\partial}{\partial\theta} \mathbb{E}\hat\theta=\frac{d\theta}{d\theta}=1,

as we assumed at the beginning that \hat\theta was unbiased. Then, from Cauchy-Schwarz, we obtain

\text{Var}(U)\text{Var}(\hat\theta)\ge \text{Cov}(U,\hat\theta)=1.

So it suffices to prove that \text{Var}(U)=I(\theta). This is a very similar integral rearrangement to what has happened earlier, so I will leave it as an exercise (possibly an exercise in Googling).

Note a good example of this is question 4 on the sheet. At any rate, this is where we see the equality case. We are finding the MLE for \theta given an observation from \text{Bin}(n,\theta). Unsurprisingly, \hat\theta=\frac{X}{m}. We know from our knowledge of the binomial distribution that the variance of this is \frac{\theta(1-\theta)}{n}, and indeed it turns out that the Fisher Information is precisely the reciprocal of this.

The equality case must happen when the score is proportional to the observed value. I don’t have a particularly strong intuition for when and why this should happen.

In any case, I hope this was helpful and interesting in some way!

Enhanced by Zemanta

Large Deviations 6 – Random Graphs

As a final instalment in this sequence of posts on Large Deviations, I’m going to try and explain how one might be able to apply some of the theory to a problem about random graphs. I should explain in advance that much of what follows will be a heuristic argument only. In a way, I’m more interested in explaining what the technical challenges are than trying to solve them. Not least because at the moment I don’t know exactly how to solve most of them. At the very end I will present a rate function, and reference properly the authors who have proved this. Their methods are related but not identical to what I will present.

Problem

Recall the two standard definitions of random graphs. As in many previous posts, we are interested in the sparse case where the average degree of a vertex is o(1). Anyway, we start with n vertices, and in one description we add an edge between any pair of vertices independently and with fixed probability \frac{\lambda}{n}. In the second model, we choose uniformly at random from the set of graphs with n vertices and \frac{\lambda n}{2} edges. Note that if we take the first model and condition on the number of edges, we get the second model, since the probability of a given configuration appearing in G(n,p) is a function only of the number of edges present. Furthermore, the number of edges in G(n,p) is binomial with parameters \binom{n}{2} and p. For all purposes here it will make no difference to approximate the former by \frac{n^2}{2}.

Of particular interest in the study of sparse random graphs is the phase transition in the size of the largest component observed as \lambda passes 1. Below 1, the largest component has size on a scale of log n, and with high probability all components are trees. Above 1, there is a unique giant component containing \alpha_\lambda n vertices, and all other components are small. For \lambda\approx 1, where I don’t want to discuss what ‘approximately’ means right now, we have a critical window, for which there are infinitely many components with sizes on a scale of n^{2/3}.

A key observation is that this holds irrespective of which model we are using. In particular, this is consistent. By the central limit theorem, we have that:

|E(G(n,\frac{\lambda}{n}))|\sim \text{Bin}\left(\binom{n}{2},\frac{\lambda}{n}\right)\approx \frac{n\lambda}{2}\pm\alpha,

where \alpha is the error due to CLT-scale fluctuations. In particular, these fluctuations are on a scale smaller than n, so in the limit have no effect on which value of \lambda in the edge-specified model is appropriate.

However, it is still a random model, so we can condition on any event which happens with positive probability, so we might ask: what does a supercritical random graph look like if we condition it to have no giant component? Assume for now that we are considering G(n,\frac{\lambda}{n}),\lambda>1.

This deviation from standard behaviour might be achieved in at least two ways. Firstly, we might just have insufficient edges. If we have a large deviation towards too few edges, then this would correspond to a subcritical G(n,\frac{\mu n}{2}), so would have no giant components. However, it is also possible that the lack of a giant component is due to ‘clustering’. We might in fact have the correct number of edges, but they might have arranged themselves into a configuration that keeps the number of components small. For example, we might have a complete graph on Kn^{1/2} vertices plus a whole load of isolated vertices. This has the correct number of edges, but certainly no giant component (that is an O(n) component).

We might suspect that having too few edges would be the primary cause of having no giant component, but it would be interesting if clustering played a role. In a previous post, I talked about more realistic models of complex networks, for which clustering beyond the levels of Erdos-Renyi is one of the properties we seek. There I described a few models which might produce some of these properties. Obviously another model is to take Erdos-Renyi and condition it to have lots of clustering but that isn’t hugely helpful as it is not obvious what the resulting graphs will in general look like. It would certainly be interesting if conditioning on having no giant component were enough to get lots of clustering.

To do this, we need to find a rate function for the size of the giant component in a supercritical random graph. Then we will assume that evaluating this near 0 gives the LD probability of having ‘no giant component’. We will then compare this to the straightforward rate function for the number of edges; in particular, evaluated at criticality, so the probability that we have a subcritical number of edges in our supercritical random graph. If they are the same, then this says that the surfeit of edges dominates clustering effects. If the former is smaller, then clustering may play a non-trivial role. If the former is larger, then we will probably have made a mistake, as we expect on a LD scale that having too few edges will almost surely lead to a subcritical component.

Methods

The starting point is the exploration process for components of the random graph. Recall we start at some vertex v and explore the component containing v depth-first, tracking the number of vertices which have been seen but not yet explored. We can extend this to all components by defining:

S(0)=0, \quad S(t)=S(t-1)+(X(t)-1),

where X(t) is the number of children of the t’th vertex. For a single component, S(t) is precisely the number of seen but unexplored vertices. It is more complicated in general. Note that when we exhaust the first component S(t)=-1, and then when we exhaust the second component S(t)=-2 and so on. So in fact

S_t-\min_{0\leq s\leq t}S_s

is the number of seen but unexplored vertices, with \min_{0\leq s\leq t}S_s equal to (-1) times the number of components already explored up to time t.

Once we know the structure of the first t vertices, we expect the distribution of X(t) – 1 to be

\text{Bin}\Big(n-t-[S_t-\min_{0\leq s\leq t}S_s],\tfrac{\lambda}{n}\Big)-1.

We aren’t interested in all the edges of the random graph, only in some tree skeleton of each component. So we don’t need to consider the possibility of edges connecting our current location to anywhere we’ve previously visited (as such an edge would have been consider then – it’s a depth-first exploration), hence the -t. But we also don’t want to consider edges connecting our current location to anywhere we’ve seen, since that would be a surplus edge creating a cycle, hence the -S_s. It is binomial because by independence even after all this conditioning, the probability that there’s an edge from my current location to any other vertex apart from those discounted is equal to \frac{\lambda}{n} and independent.

For Mogulskii’s theorem in the previous post, we had an LDP for the rescaled paths of a random walk with independent stationary increments. In this situation we have a random walk where the increments do not have this property. They are not stationary because the pre-limit distribution depends on time. They are also not independent, because the distribution depends on behaviour up to time t, but only through the value of the walk at the present time.

Nonetheless, at least by following through the heuristic of having an instantaneous exponential cost for a LD event, then products of sums becoming integrals within the exponent, we would expect to have a similar result for this case. We can find the rate function \Lambda_\lambda^*(x)of latex \text{Po}(\lambda)-1$ and thus get a rate function for paths of the exploration process

I_\lambda(f)=\int_0^1 \Lambda_{(1-t-\bar{f}(t))\lambda}^*(f')dt,

where \bar{f}(t) is the height of f above its previous minimum.

Technicalities and Challenges

1) First we need to prove that it is actually possible to extend Mogulskii to this more general setting. Even though we are varying the distribution continuously, so we have some sort of ‘local almost convexity’, the proof is going to be fairly fiddly.

2) Having to consider excursions above the local minima is a massive hassle. We would ideally like to replace \bar{f} with f. This doesn’t seem unreasonable. After all, if we pick a giant component within o(n) steps, then everything considered before the giant component won’t show up in the O(n) rescaling, so we will have a series of macroscopic excursions above 0 with widths giving the actual sizes of the giant components. The problem is that even though with high probability we will pick a giant component after O(1) components, then probability that we do not do this decays only exponentially fast, so will show up as a term in the LD analysis. We would hope that this would not be important – after all later we are going to take an infimum, and since the order we choose the vertices to explore is random and in particular independent of the actual structure, it ought not to make a huge difference to any result.

3) A key lemma in the proof of Mogulskii in Dembo and Zeitouni was the result that it doesn’t matter from an LDP point of view whether we consider the linear (continuous) interpolation or the step-wise interpolation to get a process that actually lives in L_\infty([0,1]). In this generalised case, we will also need to check that approximating the Binomial distribution by its Poisson limit is valid on an exponential scale. Note that because errors in the approximation for small values of t affect the parameter of the distribution at larger times, this will be more complicated to check than for the IID case.

4) Once we have a rate function, if we actually want to know about the structure of the ‘typical’ graph displaying some LD property, we will need to find the infimum of the integrated rate function with some constraints. This is likely to be quite nasty unless we can directly use Euler-Lagrange or some other variational tool.

Answer

Papers by O’Connell and Puhalskii have found the rate function. Among other interesting things, we learn that:

I_{(1+\epsilon)}(0)\approx \frac{\epsilon^3}{6},

while the rate function for the number of edges:

-\lim\tfrac{1}{n}\log\mathbb{P}\Big(\text{Bin}(\tfrac{n^2}{2},\tfrac{1+\epsilon}{n})\leq\tfrac{n}{2}\Big)\approx \frac{\epsilon^2}{4}.

So in fact it looks as if there might be a significant contribution from clustering after all.