The Fisher Information and Cramer-Rao Bound

Given that I’d done this twice before, and was giving the same tutorial five times this week, I was surprised at the extent to which the definition of the Fisher Information caused me to confuse both myself and the students. I thought it would be worth summarising some of the main ways to get confused, and talking about one genuine, quantitative use of the Fisher Information.

Recall we are in a frequentist model, where there is an unknown parameter \theta controlling the distribution of some observable random variable X. Our aim is to make inference about \theta using the value of X we observe. We use a lower case x to indicate a value we actually observe (i.e. a variable as opposed to a random variable). For each value of \theta, there is a density function f_\theta(x) controlling X. We summarise this in the likelihood function L(x,\theta)=f_\theta(x).

The important thing to remember is that there are two unknowns here. X is unknown because it is genuinely a random variable. Whereas \theta is unknown because that is how the situation has been established. \theta is fixed, but we are ignorant of its value. If we knew \theta we wouldn’t need to be doing any statistical inference to find out about it! A useful thing to keep in mind in everything that follows is: “Is this quantity a RV or not?” This is equivalent to “Is this a function of X or not?”, but the original form is perhaps clearer.

For some value of X=x, we define the maximum likelihood estimator to be

\hat\theta(X):=\text{argmax}_\theta L(X,\theta).

In words, given some data, \hat\theta is the parameter under which this data is most likely. Note that L(x,\theta) is a probability density function for fixed \theta, but NOT for fixed x. (This changes in a Bayesian framework.) For example, there might well be values of x for which L(x,\theta)=0\,\forall \theta\in\Theta.
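To make this concrete, here is a minimal sketch in Python, assuming a single observation from \text{Bin}(n,\theta) (so L is a mass function rather than a density). The function names, the brute-force grid search and the values n=20, x=7 are illustrative assumptions, not part of the course. It approximates the MLE, then checks that L sums to 1 over x for fixed \theta but does not integrate to 1 over \theta for fixed x.

```python
import math

def likelihood(x, n, theta):
    """L(x, theta) for a single observation x from Bin(n, theta)."""
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

def grid_mle(x, n, grid_size=1001):
    """Approximate argmax over theta of L(x, theta) on an even grid in [0, 1]."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=lambda theta: likelihood(x, n, theta))

n, x = 20, 7  # illustrative values, not from the original post
theta_hat = grid_mle(x, n)

# Fixed theta: summing over every possible x gives 1, so L(., theta) is a pmf.
sum_over_x = sum(likelihood(k, n, 0.35) for k in range(n + 1))

# Fixed x: a Riemann sum over theta gives roughly 1/(n + 1), not 1.
step = 1 / 1000
integral_over_theta = sum(likelihood(x, n, i * step) for i in range(1001)) * step

print(theta_hat, sum_over_x, integral_over_theta)
```

The grid search is only there to mirror the argmax in the definition; for this model the maximiser is available in closed form as x/n.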

Note also that we are only interested in the relative values of L(x,\theta_1), L(x,\theta_2). So it doesn’t matter if we rescale L by a constant factor (although this means the marginal in x is no longer a pdf). We typically consider the log-likelihood l(x,\theta)=\log L(x,\theta) instead, as this has a more manageable form when the underlying RV is an IID sequence. Anyway, since we are interested in the ratio of different values of L, we are interested in the difference between values of the log-likelihood l.
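For instance, assuming X=(X_1,\ldots,X_n) is an IID sequence with common density f_\theta (the indexing is introduced here just for illustration), the product in the likelihood becomes a sum in the log-likelihood, and ratios become differences:

```latex
L(x,\theta)=\prod_{i=1}^n f_\theta(x_i)
\quad\Longrightarrow\quad
l(x,\theta)=\sum_{i=1}^n \log f_\theta(x_i),
\qquad
\log\frac{L(x,\theta_1)}{L(x,\theta_2)}=l(x,\theta_1)-l(x,\theta_2).
```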

Now we come to various versions of the information. Roughly speaking, this is a measure of how good the estimator is. We define the observed information:

J(\theta):=-\frac{d^2 l(\theta)}{d\theta^2}.

This is an excellent example of the merits of considering the question I suggested earlier. Here, J is indeed a random variable. The abbreviated notation being used can lead one astray. Of course, l(\theta)=l(X,\theta), and so it must be random. The second question is: “where are we evaluating this second derivative?”

For this, we should be considering what our aim is. We know we are defining the MLE by maximising the likelihood function for fixed x. We have said that the difference between values of l gives a measure of relative likelihood. So if the likelihood function has a sharp peak at \hat\theta, then this gives us more confidence than if the peak is very shallow. (I am using ‘confidence’ in a non-technical sense. Confidence intervals are related, but I am not considering that now.) The absolute value of the second derivative is precisely a measure of this property.

Ok, but the information does not evaluate this second derivative at \hat\theta, it evaluates it at \theta. The key point is that it is still a good measure if it evaluates the second derivative at a point close to \hat\theta. And if \hat\theta is a good estimator, which it typically will be, especially when we have an IID sequence and the number of terms grows large, then \theta and \hat\theta will be close together, and so it remains a plausible measure.
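As a rough numerical illustration of the ‘sharpness of the peak’ idea, the sketch below approximates J by a finite difference for a binomial observation. The model, the step size h and the data values are assumptions chosen for illustration. With ten times as much data and the same observed proportion, the curvature at the peak is about ten times larger, and evaluating slightly away from \hat\theta changes the value only a little.

```python
import math

def log_likelihood(x, n, theta):
    """l(x, theta) for x from Bin(n, theta), dropping the constant log C(n, x)."""
    return x * math.log(theta) + (n - x) * math.log(1 - theta)

def observed_information(x, n, theta, h=1e-4):
    """J(theta) = -d^2 l / d theta^2, approximated by a central finite difference."""
    second_difference = (log_likelihood(x, n, theta + h)
                         - 2 * log_likelihood(x, n, theta)
                         + log_likelihood(x, n, theta - h))
    return -second_difference / h**2

# Same observed proportion 0.35, evaluated at the MLE for each data set.
small_sample = observed_information(x=7, n=20, theta=0.35)
large_sample = observed_information(x=70, n=200, theta=0.35)

# Evaluating near, rather than at, the MLE gives a similar value.
nearby = observed_information(x=7, n=20, theta=0.37)

print(small_sample, large_sample, nearby)
```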

This idea is particularly important when we come to consider the Fisher Information. This is defined as

I(\theta):= \mathbb{E}J(\theta)=\mathbb{E}\left[-\frac{d^2 l(\theta)}{d\theta^2}\right].

The cause for confusion is exactly what is meant by this expectation. It is not implausible that this is present, since we have already explained why J(\theta) is a random variable. But we need to decide what distribution we are to integrate with respect to. After all, we don’t actually know the distribution of X. If we did, we wouldn’t be doing statistical inference on it!

So the key thing to remember is that in I(\theta), the value \theta plays two roles. First, it gives the distribution of X with respect to which we integrate. Also, it tells us where to evaluate this second derivative. This makes sense overall. If the distribution we are considering is L(\cdot,\theta), then we expect \hat\theta to be close to the true value \theta, and so it makes sense to evaluate it there.
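The two roles of \theta are easy to see in code. The following is a sketch for the binomial model only, with hypothetical function names: \theta appears once in the weights (the distribution of X) and once inside J (where the derivative is evaluated). The expectation is an exact finite sum, so no simulation is needed.

```python
import math

def pmf(x, n, theta):
    """P(X = x) for X ~ Bin(n, theta)."""
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

def observed_information(x, n, theta):
    """J(theta) = -d^2 l / d theta^2 = x/theta^2 + (n - x)/(1 - theta)^2."""
    return x / theta**2 + (n - x) / (1 - theta)**2

def fisher_information(n, theta):
    """I(theta) = E_theta[J(theta)]: theta sets the weights AND the evaluation point."""
    return sum(pmf(x, n, theta) * observed_information(x, n, theta)
               for x in range(n + 1))

n, theta = 20, 0.3  # illustrative values
print(fisher_information(n, theta), n / (theta * (1 - theta)))
```

The second printed value is the standard closed form n/(\theta(1-\theta)) for this model, included as a sanity check.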

Now we deduce the Cramer-Rao bound, which says that for any unbiased estimator \hat\theta of \theta, we have

\text{Var}(\hat\theta)\ge \frac{1}{I(\theta)}.

First we explain that unbiased means that \mathbb{E}\hat\theta=\theta. This is a property that we would like any estimator to have, though often we have to settle for this property asymptotically. Again, we should be careful about the role of \theta. Here we mean that given some parameter \theta, \hat\theta is then a RV depending on the actual data, and so has a variance, which happens to be bounded below by a function of the Fisher Information.

So let’s prove this. First we need a quick result about the score, which is defined as:

U(\theta)=\frac{dl(\theta)}{d\theta}.

Again, this is a random variable. We want to show that \mathbb{E}U(\theta)=0. This is not difficult. Writing f(x)=L(x,\theta), we have

\mathbb{E}U(\theta)=\int f(x)\frac{\partial}{\partial\theta}\log f(x)dx

= \int \frac{\partial}{\partial\theta} L(x,\theta)dx=\frac{d}{d\theta}\int f(x)dx=\frac{d}{d\theta}1=0,

as required. Next we consider the covariance of U and \hat\theta. Since we have established that \mathbb{E}U=0, this is simply \mathbb{E}[U\hat\theta].

\text{Cov}(U,\hat\theta)=\int \hat\theta(x)f(x) \frac{d \log f(x)}{d\theta}dx=\int \hat\theta(x)f(x)\cdot \frac{\frac{\partial f(x)}{\partial \theta}}{f(x)} dx

= \int \hat\theta(x)\frac{\partial f(x)}{\partial \theta}dx=\frac{\partial}{\partial \theta}\int \hat\theta(x)f(x)dx

= \frac{\partial}{\partial\theta} \mathbb{E}\hat\theta=\frac{d\theta}{d\theta}=1,

as we assumed at the beginning that \hat\theta was unbiased. Then, from Cauchy-Schwarz, we obtain

\text{Var}(U)\text{Var}(\hat\theta)\ge \text{Cov}(U,\hat\theta)^2=1.

So it suffices to prove that \text{Var}(U)=I(\theta). This is a very similar integral rearrangement to what has happened earlier, so I will leave it as an exercise (possibly an exercise in Googling).
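Without spoiling the exercise, the two facts about the score can at least be checked numerically. This is a sketch assuming a \text{Bin}(n,\theta) observation, with illustrative values of n and \theta and hypothetical helper names; expectations are exact finite sums over the possible values of X.

```python
import math

def pmf(x, n, theta):
    """P(X = x) for X ~ Bin(n, theta)."""
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

def score(x, n, theta):
    """U(theta) = dl/dtheta = x/theta - (n - x)/(1 - theta)."""
    return x / theta - (n - x) / (1 - theta)

def expectation(func, n, theta):
    """E[func(X)] for X ~ Bin(n, theta), as an exact finite sum."""
    return sum(pmf(x, n, theta) * func(x) for x in range(n + 1))

n, theta = 20, 0.3  # illustrative values
mean_score = expectation(lambda x: score(x, n, theta), n, theta)
var_score = expectation(lambda x: score(x, n, theta) ** 2, n, theta) - mean_score**2
fisher = n / (theta * (1 - theta))  # closed form of I(theta) for this model

print(mean_score, var_score, fisher)
```

The mean of the score should come out as zero up to rounding, and its variance should match the Fisher Information. Note that both expectations are taken under the same \theta at which the score is evaluated.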

A good example of this is question 4 on the sheet, which is where we see the equality case. We are finding the MLE for \theta given an observation X from \text{Bin}(n,\theta). Unsurprisingly, \hat\theta=\frac{X}{n}. We know from our knowledge of the binomial distribution that the variance of this is \frac{\theta(1-\theta)}{n}, and indeed it turns out that the Fisher Information is precisely the reciprocal of this.
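The equality can be checked directly. The sketch below assumes n=20 and \theta=0.3 purely for illustration, and the helper names are hypothetical. It computes the mean and variance of the estimator X/n and its covariance with the score as exact sums, then compares the variance with the reciprocal of the Fisher Information.

```python
import math

def pmf(x, n, theta):
    """P(X = x) for X ~ Bin(n, theta)."""
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

def expectation(func, n, theta):
    """E[func(X)] for X ~ Bin(n, theta), as an exact finite sum."""
    return sum(pmf(x, n, theta) * func(x) for x in range(n + 1))

n, theta = 20, 0.3  # illustrative values

def estimator(x):
    """The MLE theta_hat = x / n."""
    return x / n

def score(x):
    """U(theta) = x/theta - (n - x)/(1 - theta)."""
    return x / theta - (n - x) / (1 - theta)

mean_estimator = expectation(estimator, n, theta)
var_estimator = expectation(lambda x: estimator(x) ** 2, n, theta) - mean_estimator**2
# Since E[U] = 0, the covariance is just E[U * theta_hat].
cov_score_estimator = expectation(lambda x: score(x) * estimator(x), n, theta)
cramer_rao_bound = theta * (1 - theta) / n  # 1 / I(theta) for this model

print(mean_estimator, var_estimator, cov_score_estimator, cramer_rao_bound)
```

If the working above is right, the estimator is unbiased, the covariance is 1, and the variance sits exactly on the bound.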

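The equality case in Cauchy-Schwarz happens exactly when the score is an affine function of the estimator, that is, when U(\theta)=a(\theta)(\hat\theta-\theta) for some function a of \theta alone. I don’t have a particularly strong intuition for when and why this should happen.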
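For the binomial example at least, this can be seen by hand. A short worked version, dropping the constant \log\binom{n}{x} from the log-likelihood and using I(\theta)=\frac{n}{\theta(1-\theta)}:

```latex
l(X,\theta)=X\log\theta+(n-X)\log(1-\theta)+\text{const}
\quad\Longrightarrow\quad
U(\theta)=\frac{X}{\theta}-\frac{n-X}{1-\theta}
=\frac{X-n\theta}{\theta(1-\theta)}
=\frac{n}{\theta(1-\theta)}\Big(\frac{X}{n}-\theta\Big)
=I(\theta)\,(\hat\theta-\theta).
```

So here the score is a deterministic multiple of \hat\theta-\theta, which is exactly the situation in which Cauchy-Schwarz is tight.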

In any case, I hope this was helpful and interesting in some way!


Bayesian Inference and the Jeffreys Prior

Last term I was tutoring for the second year statistics course in Oxford. This post is about the final quarter of the course, on the subject of Bayesian inference, and in particular on the Jeffreys prior.

There are loads and loads of articles sitting around on the web contributing to the debate about the relative merits of Bayesian and frequentist methods. I do not want to continue that debate here, partly because I don’t have a strong opinion, but mainly because I don’t really understand that much about the underlying issues.

What I will say is that after a few months of working fairly intensively with various complicated stochastic processes, I am starting to feel fairly happy throwing about conditional probability rather freely. When discussing some of the more combinatorial models for example, quite often we have no desire to compute or approximate complicated normalising constants, and so instead talk about ‘weights’. And a similar idea underlies Bayesian inference. As in frequentist methods we have an unknown parameter, and we observe some data. Furthermore, we know the probability that such data might have arisen under any value of the parameter. We want to make inference about the value of the parameter given the data, so it makes sense to multiply the probability that the data emerged as a result of some parameter value by some weighting on the set of parameter values.

In summary, we assign a prior distribution representing our initial beliefs about the parameter before we have seen any data, then we update this by weighting by the likelihood that the observed data might have arisen from a particular parameter. We often write this as:

\pi(\theta| x)\propto f(x|\theta)\pi(\theta),

or say that posterior = likelihood x prior. Note that in many applications it won’t be necessary to work out what the normalising constant on the distribution ought to be.
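The ‘weights’ point of view translates directly into code. The following is a minimal sketch, assuming a binomial likelihood, a uniform prior and a midpoint grid on [0,1], all chosen for illustration. It works with unnormalised weights throughout and only normalises at the very end, to read off a posterior mean.

```python
def unnormalised_posterior(x, n, grid, prior):
    """Weights likelihood(x | theta) * prior(theta) on a grid, constants dropped."""
    return [theta**x * (1 - theta)**(n - x) * prior(theta) for theta in grid]

def uniform_prior(theta):
    """Flat prior on [0, 1]; an illustrative choice."""
    return 1.0

# Midpoints of 200 equal cells covering [0, 1].
grid = [(i + 0.5) / 200 for i in range(200)]
weights = unnormalised_posterior(x=7, n=20, grid=grid, prior=uniform_prior)

# Normalising is a single division at the end, and is only needed for summaries.
total = sum(weights)
posterior = [w / total for w in weights]
posterior_mean = sum(theta * p for theta, p in zip(grid, posterior))

print(posterior_mean)
```

Note that the binomial coefficient was never computed: it does not depend on \theta, so it disappears into the normalising constant.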

That’s the setup for Bayesian methods. I think the general feeling about the relative usefulness of such an approach is that it all depends on the prior. Once we have the prior, everything is concrete and unambiguously determined. But how should we choose the prior?

There are two cases worth thinking about. The first is where we have a lot of information about the problem already. This might well be the case in some forms of scientific research, where future analysis aims to build on work already completed. It might also be the case that we have already performed some Bayesian calculations, so our current prior is in fact the posterior from a previous set of experiments. In any case, if we have such an ‘informative prior’, it makes sense to use it in some circumstances.

Alternatively, it might be the case that for some reason we care less about the actual prior than about the mathematical convenience of manipulating it. In particular, certain likelihood functions give rise to conjugate priors, where the form of the posterior is the same as the form of the prior. For example, a normal likelihood function admits a normal conjugate prior, and a binomial likelihood function gives a Beta conjugate prior.
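For the Beta and binomial pair, the conjugate update is just arithmetic on the two Beta parameters. Here is a small sketch with a hypothetical function name and made-up data. It also illustrates the earlier remark that a prior may itself be the posterior from a previous experiment: updating in two stages gives the same answer as updating once on the pooled data.

```python
def beta_binomial_update(a, b, successes, trials):
    """Posterior Beta parameters after observing `successes` in `trials`
    binomial trials, starting from a Beta(a, b) prior."""
    return a + successes, b + (trials - successes)

prior = (1.0, 1.0)  # Beta(1, 1), i.e. uniform on [0, 1]; an illustrative choice

# Two experiments analysed one after the other ...
after_first = beta_binomial_update(*prior, successes=3, trials=10)
after_second = beta_binomial_update(*after_first, successes=4, trials=10)

# ... versus the pooled data analysed in one go.
all_at_once = beta_binomial_update(*prior, successes=7, trials=20)

print(after_second, all_at_once)
```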

In general though, it is entirely possible that neither of these situations will hold but we still want to try Bayesian analysis. The ideal situation would be if the choice of prior had no effect on the analysis, but if that were true, then we couldn’t really be doing any Bayesian analysis. The Jeffreys prior is one natural candidate because it removes a specific problem with choosing a prior to express ignorance.

It sounds reasonable to say that if we have total ignorance about the parameter, then we should take the prior to be uniform on the set of possible values taken by the parameter. There are two potential objections to this. The first is that if the parameter could take any real value, then the prior will not be a distribution as the uniform distribution on the reals is not normalisable. Such a prior is called improper. This isn’t a huge problem really though. For making inference we are only interested in the posterior distribution, and so if the posterior turns out to be normalisable we are probably fine.

The second problem is more serious. Even though we want to express ignorance of the parameter, is there a canonical choice for THE parameter? An example will make this objection more clear. Suppose we know nothing about the parameter T except that it lies in [0,1]. Then the uniform distribution on [0,1] seems like the natural candidate for the prior. But what if we considered T^100 to be the parameter instead? Again if we have total ignorance we should assign T^100 the uniform distribution on its support, which is again [0,1]. But if T^100 is uniform on [0,1], then T is massively concentrated near 1, and in particular cannot also be uniformly distributed on [0,1]. So as a minimum requirement for expressing ignorance, we want a way of generating a prior that doesn’t depend on the choice of parameterisation.
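To put numbers on ‘massively concentrated near 1’: if S=T^{100} is uniform on [0,1], then P(T\le t)=P(S\le t^{100})=t^{100}. The short sketch below evaluates this; the threshold 0.95 is an arbitrary illustrative choice.

```python
def cdf_of_T(t, power=100):
    """P(T <= t) when T**power is uniform on [0, 1], for t in [0, 1]."""
    return t**power

# Probability that T exceeds 0.95 under each of the two 'ignorant' priors.
prob_if_power_uniform = 1 - cdf_of_T(0.95)
prob_if_T_uniform = 1 - 0.95

# Median of T when T**100 is uniform: solve t**100 = 1/2.
median_T = 0.5 ** (1 / 100)

print(prob_if_power_uniform, prob_if_T_uniform, median_T)
```

Under the first prior, more than 99% of the mass for T lies above 0.95, compared with 5% if T itself is uniform.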

The Jeffreys prior has this property. Note that there may be separate problems with making such an assumption, but this prior solves this particular objection. We define it to be \pi(\theta)\propto [I(\theta)]^{1/2} where I is the Fisher information, defined as

I(\theta)=-\mathbb{E}_\theta\Big[\frac{\partial^2 l(X_1,\theta)}{\partial \theta^2}\Big],

where the expectation is over the data X_1 for fixed \theta, and l is the log-likelihood. Proving that this has the property that it is invariant under reparameterisation requires demonstrating that the Jeffreys prior corresponding to g(\theta) is the same as applying a change of measure to the Jeffreys prior for \theta. The proof is a nice exercise in the chain rule, and I don’t want to reproduce it here.
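In place of the proof, here is a numerical check of the claim in one case. It is a sketch under several assumptions: a single Bernoulli observation, the hypothetical reparameterisation \phi=g(\theta)=\theta^2, the score approximated by a finite difference, and the Fisher information computed as \mathbb{E}[U^2], which equals \text{Var}(U) since \mathbb{E}U=0. Invariance says the unnormalised prior \sqrt{I} for \theta should equal the one for \phi multiplied by |g'(\theta)|.

```python
import math

def fisher_information(log_pmf, param, h=1e-5):
    """I(param) = E[score^2] for one Bernoulli observation X in {0, 1},
    with the score approximated by a central finite difference."""
    total = 0.0
    for x in (0, 1):
        score = (log_pmf(x, param + h) - log_pmf(x, param - h)) / (2 * h)
        total += math.exp(log_pmf(x, param)) * score**2
    return total

def log_pmf_theta(x, theta):
    """Log-probability of x under Bernoulli(theta)."""
    return math.log(theta) if x == 1 else math.log(1 - theta)

def g(theta):
    """Hypothetical reparameterisation phi = theta^2."""
    return theta**2

def g_prime(theta):
    return 2 * theta

def log_pmf_phi(x, phi):
    """The same model, written in terms of phi = theta^2."""
    return log_pmf_theta(x, math.sqrt(phi))

theta = 0.3  # illustrative value
jeffreys_theta = math.sqrt(fisher_information(log_pmf_theta, theta))
jeffreys_phi_pulled_back = (math.sqrt(fisher_information(log_pmf_phi, g(theta)))
                            * abs(g_prime(theta)))

print(jeffreys_theta, jeffreys_phi_pulled_back)
```

The two numbers should agree up to finite-difference error, which is the change-of-measure statement at this one point.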

For a Binomial likelihood function, we find that the Jeffreys prior is Beta(1/2,1/2), which has density that looks roughly like a bucket suspended above [0,1]. It is certainly worth asking why the ‘natural’ choice for prior might put lots of mass at the edge of the domain for the parameter.
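A quick way to see both the Beta(1/2,1/2) form and the bucket shape is to compare \sqrt{I(\theta)} with the Beta(1/2,1/2) density at a few points. This is a sketch using the closed form I(\theta)=n/(\theta(1-\theta)) for the binomial; the function names and the evaluation points are illustrative.

```python
import math

def sqrt_fisher(theta, n=1):
    """Unnormalised Jeffreys prior sqrt(I(theta)) for Bin(n, theta)."""
    return math.sqrt(n / (theta * (1 - theta)))

def beta_half_half_density(theta):
    """Density of Beta(1/2, 1/2); its normalising constant B(1/2, 1/2) equals pi."""
    normaliser = math.gamma(0.5) ** 2 / math.gamma(1.0)
    return theta**-0.5 * (1 - theta)**-0.5 / normaliser

points = [0.01, 0.25, 0.5, 0.75, 0.99]
densities = [beta_half_half_density(t) for t in points]
# If the prior is Beta(1/2, 1/2), this ratio must not depend on theta.
ratios = [sqrt_fisher(t) / beta_half_half_density(t) for t in points]

print(densities)
print(ratios)
```

The ratio is the same at every point, and the density is lowest in the middle and climbs steeply towards both ends of [0,1].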

I don’t have a definitive answer, but I do have an intuitive idea which comes from the meaning of the Fisher information. As (minus) the expected second derivative of the log-likelihood, a large Fisher information means that with high probability we will see data for which the likelihood changes substantially if we vary the parameter. In particular, this means that the posterior probability of a parameter close to 0 will be eliminated more quickly by the data if the true parameter is different.

If the variance is small, as it is for parameter near 0, then the data generated by this parameter will have the greatest effect on the posterior, since the likelihood will be small almost everywhere except near the parameter. We see the opposite effect if the variance is large. So it makes sense to compensate for this by placing extra prior mass at parameter values where the data has the strongest effect. Note that in the previous example, the Jeffreys prior is in fact exactly inversely proportional to the standard deviation. For the above argument to make sense, we need it to be monotonic with respect to SD, and it just happens that in this case, being 1/SD is precisely the form required to be invariant under reparameterisation.
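Spelling out the 1/SD claim for the binomial case, and taking the standard deviation in question to be that of X/n (an assumption about which SD is meant):

```latex
\pi(\theta)\propto\sqrt{I(\theta)}
=\sqrt{\frac{n}{\theta(1-\theta)}}
=\frac{1}{\sqrt{\theta(1-\theta)/n}}
=\frac{1}{\text{SD}_\theta(X/n)}.
```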

Anyway, I thought that was reasonably interesting, as indeed was the whole course. I feel reassured that I can justify having my work address as the Department of Statistics since I now know at least epsilon about statistics!