Why do we need the Lebesgue integral?

I’m currently lecturing the course Fundamentals of Probability at KCL, where we cover some of the measure theory required to set up probability with a higher level of formality than students have seen in their introductory courses. By this point, we have covered:

  • Sigma-algebras, measures, and probability spaces, including the fact that not all subsets of \mathbb{R} are Borel-measurable.
  • Measurable functions, which in probability spaces are random variables.

More recently, we’ve covered the construction of the Lebesgue integral for measurable functions f:E\to\mathbb{R} for some general measure space (E,\mathcal{E},\mu).

In this post, I’ll summarise briefly this construction, and some key differences in context and usage between this and the more familiar Riemann integral. But I’ll also aim to answer the question: “Why do we need to set up the Lebesgue integral like this? Why can’t we do it directly, or more simply?

Someone asked essentially this question after the lecture, and I think it deserves an answer. To address this, I’ll look a handful of plausible alternative directions for defining integration, and seeing where they go wrong.

Construction of the Lebesgue integral

But first, we have to recap what the construction is. It goes in three steps:

  • Construction for simple functions which only take finitely-many values, and all these values are non-negative.
  • Construction for measurable, non-negative functions, by considering a monotone limit of simple functions.
  • Construction for (general) measurable functions, by splitting into positive and negative parts.
  • It’s worth emphasising that for all of these steps, the integral is defined directly over the whole set E. This is an immediate contrast with the (improper) Riemann integral for functions f:\mathbb{R}\to\mathbb{R}, where the integral \int_{-\infty}^\infty has to be defined as a limit of integrals over finite ranges.

Adding a few details, a simple function has the form f=\sum_{i=1}^k a_i \mathbf{1}_{A_i}, where a_i\ge 0 are non-negative coefficients, and A_i\in \mathcal{E} are measurable sets. Note that the sum is finite.

Then the integral of f is defined to be \int_E f \,d\mu=\sum_{i=1}^k a_i \mu(A_i). This matches our intuition if we are thinking of functions f:\mathbb{R}\to\mathbb{R} with an idea of integrals as ‘area under a graph’, but it is just a definition.

Question: we will see shortly why it’s only relevant to consider non-negative coefficients. But initially, we might ask why it needs to be a finite sum? In general, when sums are finite, then they don’t need to be defined as a limit. Furthermore, they definitely exist! If we allowed infinite sums in the definition of simple functions, we would then need to exclude pathologies like f=\sum_{n\ge 1} \mathbf{1}_{[-n,n]} which is infinite everywhere [1]. This will be a recurring theme of this post: an optimal definition only takes a limit when it’s certain that the limit exists.

Anyway, the arguments are sometimes a bit notation-heavy, but in the setting of simple functions, we can prove directly that the integral is a linear operator (and so satisfies \int_E (\alpha f+\beta g)d\mu=\alpha \int_E f\,d\mu + \beta \int_E g\,d\mu) and various other unsurprising-but-important results.

Following this, we define the integral for general measurable functions f: E\to\mathbb{R} taking only non-negative values via a limit of simple functions. Specifically

\int_E f\,d\mu = \sup\Big\{\int_E g\,d\mu,\,0\le g\le f\text{ a.e., }g\text{ simple}\Big\}. (*)

While it would be tempting to define \int_E f\,d\mu as the limit of the integrals of a specific sequence of simple functions f_n approximating f (for example, defining f_n by rounding f down to the nearest multiple of 1/2^n), the more abstract definition (*) turns out to be more useful in proofs [2].

Using definition (*), we derive the Monotone Convergence Theorem. Informally, this says that monotone limits of non-negative measurable functions respect integrals. More formally, when f_n\uparrow f a.e., then \int_E f_n\,d\mu\uparrow \int_E f\,d\mu \in[0,\infty]. As a bonus, we immediately recover that the limit of the integrals of the ’rounding-down approximations’ in the previous paragraph is the limit of the original function.

Finally, we extend to general measurable functions f:E\to\mathbb{R}, potentially taking both positive and negative values. We identify the positive and negative parts f^+,f^- of f, and define the integral \int_E f\,d\mu=\int_E f^+\,d\mu - \int_E f^-\,d\mu, provided that at least one of the RHS integrals is finite. The point is that it’s well-defined to write \infty - 4 = \infty or \pi - \infty=-\infty, but it’s not well-defined to study \infty-\infty.

It might seem annoying that a consequence of this is that functions such as f(x)=\mathbf{1}_{x>0}\frac{\sin x}{x} are then not integrable over \mathbb{R} (even though they have a finite improper Riemann integral), but this is a direct consequence of defining the Lebesgue integral directly onto the whole set E. Any thoughts about ‘positive and negative contributions cancelling out’ are implicitly or explicitly thinking about \mathbb{R} as a limit of compact ranges. [3]

Ideas for alternative constructions

Given the non-integrability of certain functions as above, one might consider whether it’s possible to tweak the definitions to avoid this. Here is a list of reasonable alternatives to the standard order of construction described above. In each case, I’ll draw attention to something that might go wrong (focusing on the \mathbb{R}\to\mathbb{R} case, but the issues mostly generalise).

Truncation of range: We’ve touched on this earlier, but let’s briefly revisit what happens if we define integrals \int_{\mathbb{R}} as a limit of intervals over finite range. Leaving aside the question of whether this is possible for more general spaces (see [3]), we already have issues with this approach over \mathbb{R}. Note that once we restrict to finite range and continuous functions (or finitely-many discontinuities), it doesn’t matter if we are using Riemann or Lebesgue framework. But when studying a function like f(x)\mathbf{1}_{\mathbb{R}\setminus\{0\}}(x)\frac{1}{x}, we have truncated results like

\int_{-n}^n f(x)\mathrm{d}x = 0,\quad \int_{-n}^{2n}f(x)\mathrm{d}x = \log 2,

and so taking the limit \mathbb{R}=\bigcup_{n} [-n,n] would ‘define’ the integral of f over \mathbb{R} to be zero, while taking the limit \mathbb{R}=\bigcup_{n} [-n,2n] would give a different answer. So this certainly wouldn’t work as a general definition.

In practice, for improper integrals \mathbb{R}, one generally studies \int_{-\infty}^\infty = \lim\limits_{n\to\infty} \int_0^n + \lim\limits_{n\to\infty} \int_{-n}^\infty, with the same restriction on taking \infty-\infty as we discussed earlier in this post. However, this split into the two half-lines is very much a feature of the reals. Even just in \mathbb{R}^2, there is no easy analogue to this split.

Approximating all functions from below: What about if we try to approximate all measurable functions from below by simple functions, after relaxing the condition that the coefficients of a simple function have to be non-negative? Well, if a function f takes arbitrarily negative values (eg all values in (-\infty,0)) then it just can’t be bounded below by a function which only takes finitely many values.

And if we allow simple functions to have negative coefficients and support on infinitely many sets A_i, this is even worse than the situation discussed in footnote [1] already. At this point we would already have simple functions for which the integral isn’t defined, such as taking f(x)=1 for x\ge 0 and f(x)=-1 for x<0. It’s not going to work very well to study the supremum of a set, some of whose values are not defined!

Jointly approximating from below/above: This is the suggestion that prompted me to write this post. Suppose we relax the non-negativity, but not the finite sum condition for a simple function. Then approximate f:\mathbb{R}\to\mathbb{R} by ‘simple’ functions f_n so that |f_n|\le |f|, and \mathrm{sign}(f_n)=\mathrm{sign}(f) a.e. That is, f_n bounds f from below when f is positive, and from above when f is negative.

The issue here is that it’s not clear how to turn this into a definition. We can’t write

\int_E f\,d\mu = \sup\Big\{\int_E g\, d\mu,\, g\text{ `simple'},\mathrm{sign}(g)=\mathrm{sign}(f),\,|g|\le |f|\Big\},

because the sup would be attained by considering ‘simple’ functions g which are zero whenever f is negative. And it’s not well-defined to replace the supremum with a limit – it’s not a sequence, after all.

If you try and split it as a supremum over the range where f\ge 0 and an infimum over the range where f<0, then you are actually back to the original definition of the general Lebesgue integral!

And going for a very concrete sequence doesn’t help either, unfortunately! If we define f_n by rounding f down/up (depending on whether f is positive/negative) to the nearest multiple of 1/2^n, we could define \int_E f\,d\mu=\lim\limits_{n\to\infty} \int_E f_n\,d\mu, notwithstanding some of the issues we mentioned earlier about committing to an specific approximating sequence.

However, since the (f_n)_{n\ge 1} are not monotone in n, there is no guarantee that this limit exists. And, indeed, we can construct examples of f where the limit does not exist. Consider, for example, the intervals A_k=[2^{k-1},2^k), and the function f=\sum\limits_{k\ge 1}(-\frac{1}{2})^{k-1} \mathbf{1}_{A_k}. So then the approximation f_n=\sum\limits_{k=1}^n (-\frac{1}{2})^{k-1} \mathbf{1}_{A_k}, and thus \int_{\mathbb{R}} f_n\,dx = \sum\limits_{k=1}^n (-1)^{k-1}, which is 1 when n is odd, and 0 when n is even.

With the true definition of the Lebesgue integral \int_{\mathbb{R}}f\,dx is undefined, which might seem equally disappointing, but in reality is more clear than ending up in a situation where you get different limits depending on whether you round down to the nearest multiple of 1/2^{2n} versus rounding down to the nearest multiple of 1/2^{2n+1}.

Footnotes

[1] – It’s not impossible to exclude such pathologies (for example by demanding that the A_i are disjoint (*) ) but adding constraints to the definition of simple function is far from ideal if we need to work with them for proofs. For example, under (*), it becomes less obvious that the sum of two simple functions is simple.

[2] – if we define f_n to be f rounded down to the nearest multiple of 1/2^n, and similarly g_n for g, then we immediately run into problems when trying to prove that \int_E (f+g)d\mu=\int_E f\,d\mu + \int_E g\,d\mu, as the rounding down operation doesn’t commute with addition.

[3] – and there are measure spaces which are not sigma-finite, for which E cannot be expressed as a (countable) union of finite-measure sets. But we still want a theory of integration over such spaces.

Borel-Cantelli Lemmas

I’m currently lecturing my first course at King’s, which builds measure theory from the ground up to the construction of the Lebesgue integral, along with some more probabilistic topics. In this second week, we have been discussing various matters related to the construction of sigma-algebras. It took me slightly by surprise that this ends up being the natural moment during an undergraduate’s progression through probability to introduce the Borel-Cantelli lemmas. It’s not hard to imagine a pedagogical schedule in which they appear earlier, but equally, one benefits from seeing them with a bit more mathematical experience, even if they are only very loosely related to the construction of the Lebesgue integral!

Anyway, I noticed that I have not written about these results on the blog before. There are a couple of neat applications that lie slightly beyond the remit (and time allowance) of the lecture course, so they are presented at the end.

Context – events holding eventually and infinitely often

Recall that (in the notation of probability spaces) a sigma-algebra \mathcal F provides a set of events A\subseteq \Omega for which it is meaningful and consistent to define a probability measure. This is extremely uncontroversial in the setting where the set of outcomes \Omega is finite, but thornier in the setting where \Omega is uncountable, for example [0,1] or \mathbb{R}. The key axiom for sigma-algebras involves countable unions. If (A_n)_{n\ge 1} is a sequence of events in \mathcal F, we must also have \bigcup_{n\ge 1}A_n\in\mathcal F.

Since the axioms also demand that \mathcal F is closed under taking complements, it only takes a few steps to show that countable intersections are also valid, ie \bigcap_{n\ge 1}A_n\in\mathcal F also. It is uncontroversial to note that

\bigcap_{n\ge 1}A_n \subseteq \bigcup_{n\ge 1}A_n.

The main initial motivation for considering the events \{A_n\text{ infinitely often}\},\{A_n\text{ eventually}\} are that they lie between the intersection and the union in terms of containment.

  • ‘Infinitely often’ is relatively uncontroversial, just meaning ‘for infinitely many values of n’.
  • ‘Eventually’ is less clear on an initial reading. Perhaps ‘eventually always’ would be better? [And closer to the title of this blog!] The meaning is ‘always, except for finitely many n’, which implies that there is some threshold N such that it holds for all n>N.
  • In set notation, infinitely often (or i.o.) means that \bigcup_{k\ge n}A_k must hold for every n, and so the event \{A_n\text{ i.o.}\}=\bigcap_n\bigcup_{k\ge n}A_k.
  • Similarly, eventually means that there must be an n for which \bigcap_{k\ge n}A_k holds, so the event \{A_n\text{ eventually}\}=\bigcup_n\bigcap_{k\ge n}A_k.

We end up with the richer containment sequence

\bigcap_{n\ge 1}A_n\subseteq \{A_n\text{ eventually}\}\subseteq \{A_n\text{ i.o.}\}\subseteq \bigcup_{n\ge 1}A_n.

These individual containment relations may not be obvious a priori in this form. Probably the most clear is the central containment, that if an A_n holds for all except finitely many n, then it holds for infinitely many n. The right-most says that if it holds infinitely often, then it holds at least once. For the left-most containment, note that \bigcap_{n\ge 1}A_n is literally the same as the n=1 case of \bigcap_{k\ge n}A_k, so taking a union over other values of n only increases the set.

It’s also worth thinking about the complements. Note that \left(\bigcap_{n\ge 1}A_n\right)^c = \bigcup_{n\ge 1} A_n^c and \left(\bigcup_{n\ge 1}A_n\right)^c=\bigcap_{n\ge 1}A_n^c. Versions of these apply to eventually/i.o. also. In particular

\{A_n\text{ i.o.}\}^c = \{A_n\text{ for only finitely many }n\} = \{A_n^c\text{ eventually}\}.

and \{A_n\text{ eventually}\}^c = \{A_n^c\text{ for infinitely many }n\}.

So in any calculation or application, there is flexibility to study (A_n^c) is this is more convenient. As we will see, it makes sense to focus on whichever of (A_n),(A_n^c) has vanishing probabilities as n\to\infty.

Comment: It is standard to write \limsup_{n}A_n= \bigcap_{n\ge 1}\bigcup_{k\ge n}A_k, and \liminf_n A_n=\bigcup_{n\ge 1}\bigcap_{k\ge n}A_k. I am happy to admit I get these confused with non-vanishing probability, and prefer to stick to eventually/i.o. or the full union/intersection statements for safety!

Borel-Cantelli Lemmas

The Borel-Cantelli lemmas establish some conditions under which the event \{A_n\text{ i.o.}\} almost surely holds, or almost surely doesn’t hold. We’ll state them now:

BC1: If \sum_{n\ge 1}\mathbb{P}(A_n)<\infty, then \mathbb{P}(A_n\text{ i.o.})=0.

BC2: If \sum_{n\ge 1}\mathbb{P}(A_n)=\infty and the events (A_n) are independent, then \mathbb{P}(A_n\text{ i.o.})=1.

Proof of BC1: Since \{A_n\text{ i.o.}\}\subseteq \bigcup_{k\ge n}A_k for every $n$, we have

\mathbb{P}(A_n\text{ i.o.}) \le \sum_{k\ge n}\mathbb{P}(A_k)\;\stackrel{n\to\infty}\longrightarrow\; 0,

since \sum \mathbb{P}(A_n)<\infty.

Comment: we could also introduce the random variable N which counts how many events A_n occur, ie N=\sum_{n\ge 1}\mathbf{1}_{A_n}. Then \{A_n\text{ i.o.}\}=\{N=\infty\}, and the given condition is equivalent to \mathbb{E}[N]<\infty, which certainly implies \mathbb{P}(N=\infty)=0. Indeed, it’s reasonable to think of BC1 as a first-moment result, with BC2 as (sort of) a second-moment result, and this is reflected in the comparative complexity of the two proofs.

Proof of BC2: As discussed, it is equivalent to show that \mathbb{P}(A_n^c\text{ eventually})=0. To do this, we study the intersections of the complements, whose probabilities are tractable using independence:

\mathbb{P}\left( \bigcup_{k\ge n}A_k^c\right) = \prod_{k\ge n}\mathbb{P}(A_k^c)=\prod_{k\ge n}(1-\mathbb{P}(A_k))

\le \prod_{k\ge n}\exp(-\mathbb{P}(A_k))=\exp(-\sum_{k\ge n}\mathbb{P}(A_k)).

We have assumed that \sum_{k\ge n}\mathbb{P}(A_k)=\infty for each n, and so we obtain

\mathbb{P}\left(\bigcup_{k\ge n}A_k^c\right)=0,\quad\forall n\ge 1.

But since (\bigcap_{k\ge n}A_k^c)_{n\ge 1} is an increasing sequence of events, so taking a union over n is particularly natural. Precisely, by continuity of measure, we have

\mathbb{P}\left(\bigcup_{n\ge 1}\bigcap_{k\ge n}A_k^c\right) = \lim_{n\to\infty}\mathbb{P}\left( \bigcap_{k\ge n}A_k^c\right)=0.

Comment: the independence condition is certainly necessary, as otherwise one can have situations where A_1\supseteq A_2\supseteq\cdots. For example, with U\sim \mathrm{Unif}[0,1], consider A_n:=\{U\le \frac1n\}, for which \mathbb{P}(A_n)=\frac1n, whose sum diverges, while clearly \mathbb{P}(A_n\text{ i.o.})=\mathbb{P}(U=0)=0.

Classic examples: the famous ‘infinite monkey theorem’, relatively well-known outside the realm of theoretical probability, asserts that a monkey typing randomly at a keyboard for long enough will eventually type the complete works of William Shakespeare. Other standard examples involve independent (A_n) with probabilities decaying at various speeds. The latter do have some overlap with the slightly less standard examples presented below.

Example – generating random permutations

Consider a sequence of random permutations (\sigma_n)_{n\ge 1} where \sigma_n is a permutation of [n]. We construct these iteratively as follows: \sigma_1 is the identity (ie the only permutation of one element!); then for each n\ge 2, define \sigma_n from \sigma_{n-1} by inserting element n at a uniformly-chosen position in \sigma_{n-1}.

We ask the question whether this gives a well-defined limiting permutation \sigma_\infty on \mathbb{N}. A necessary condition for this is that the first element \sigma_n(1) is eventually constant. But in fact the events \{\sigma_n(1)=n\} are independent, and each occurs with probability 1/n, and so

\mathbb{P}(\text{first element changes i.o.})\ge \mathbb{P}(\sigma_n(1)=n\text{ for infinitely many }n)=1,

by BC2.

On reflection, each \sigma_n has the uniform distribution on \Sigma_n, permutations of [n]. So our result is consistent with the idea that one cannot construct a uniform random permutation on an infinite set.

Now, alternatively, suppose the new element is added at place \max(n-X_n,1) where (X_2,X_3,\ldots) are IID \mathrm{Geometric}(q) random variables. (The truncation is just to prevent pathologies like trying to add the new entry in ‘place -6’.) Then

\mathbb{P}(\sigma_n(1)=n)=(1-q)^{n-2},

and since \sum_n (1-q)^{n-2}<\infty, BC1 shows that the first element (\sigma_1(1),\sigma_2(1),\sigma_3(1),\ldots) changes only finitely often. A similar argument applies for the second element, and the third element, and so on. Consequently, almost surely we have a well-defined pointwise limit permutation \sigma_\infty. This is known as the Mallows Permutation on the integers, and is one of the more popular models for a random permutation on an infinite set, with lots of interesting properties. I supervised an undergraduate research project on this model over Summer 2022, and will write more when time allows.

Example – well-separated random integers

The final example is a toy version of a construction I’ve been working on recently regarding subsets of the vertices of a graph which ‘trap’ random walk on that graph.

Suppose we are trying to colour some positive integers randomly, so that the coloured integers are infinite, but also asymptotically sparse. We also do not want pairs of coloured integers to be close together. Let’s say that two integers m,n are close if |m-n|\le \sqrt{\max(m,n)}.

Now, suppose we colour each integer n green independently with probability \frac1n. Then, whenever we have two integers m,n which are close, and coloured green, we re-colour them both red.

We would like to prove that infinitely many integers are coloured green. Let C_n:={n\text{ coloured with some colour}}. Using BC2, we know that almost surely, infinitely many are coloured with some colour, and it seems implausible that many would be close. However, the events G_n={n\text{ coloured green}} are not independent, so one cannot apply BC2 directly to these. However, we can study instead R_n={n\text{ coloured red}}. We can bound these probabilities as

\mathbb{P}(R_n)\le\frac{1}{n}\frac{2\sqrt{n}}{n-\sqrt{n}} = \frac{2+o(1)}{n^{3/2}},

and apply BC1 to show \mathbb{P}(R_n\text{ i.o.})=0. Then, since \{G_n\text{ i.o.}\}\supseteq \{C_n\text{ i.o.}\}\setminus{R_n\text{ i.o.}}, we obtain \mathbb{P}(G_n\text{ i.o.})=1, as required.

Lecture 9 – Inhomogeneous random graphs

I am aiming to write a short post about each lecture in my ongoing course on Random Graphs. Details and logistics for the course can be found here.

As we enter the final stages of the semester, I want to discuss some extensions to the standard Erdos-Renyi random graph which has been the focus of most of the course so far. In doing so, we can revisit material that we have already covered, and discover how easily one can extend this directly to more exotic settings.

The focus of this lecture was the model of inhomogeneous random graphs (IRGs) introduced by Soderberg [Sod02] and first studied rigorously by Bollobas, Janson and Riordan [BJR07]. Soderberg and this blog post address the case where vertices have a type drawn from a finite set. BJR address the setting with more general typespaces, in particular a continuum of types. This generalisation is essential if one wants to use IRGs to model effects more sophisticated than those of the classical Erdos-Renyi model G(n,c/n), but most of the methodology is present in the finite-type setting, and avoids the operator theory language which is perhaps intimidating for a first-time reader.

Inhomogeneous random graphs

Throughout, k\ge 2 is fixed. A graph with k types is a graph G=(V,E) together with a type function V\to \{1,\ldots,k\}. We will refer to a k\times k symmetric matrix with non-negative entries as a kernel.

Given n\in\mathbb{N} and a vector p=(p_1,\ldots,p_k)\in\mathbb{N}_0^k satisfying \sum p_i=n, and \kappa a kernel, we define the inhomogeneous random graph G^n(p,\kappa) with k types as:

  • the vertex set is [n],
  • types are assigned uniformly at random to the vertices such that exactly p_i vertices have type i.
  • Conditional on these types, each edge v\leftrightarrow w (for v\ne w\in [n]) is present, independently, with probability

1 - \exp\left(-\frac{\kappa_{\mathrm{type}(v),\mathrm{type}(w)} }{n} \right).

Notes on the definition:

  • Alternatively, we could assign the types so that vertices \{1,\ldots,p_1\} have type 1, \{p_1+1,\ldots,p_1+p_2\} have type 2, etc etc. This makes no difference except in terms of the notation we have to use if we want to use exchangeability arguments later.
  • An alternative model considers some distribution \pi on [k], and assigns the types of the vertices of [n] in an IID fashion according to \pi. Essentially all the same results hold for these two models. (For example, this model with ‘random types’ can be studied by quenching the number of each type!) Often one works with whichever model seems easier for a given proof.
  • Note that the edge probability given is \approx \frac{\kappa_{\mathrm{type}(v),\mathrm{type}(w)}}{n}. The exponential form has a more natural interpretation if we ever need to turn the IRGs into a process. Additionally, it avoids the requirement to treat small values of n (for which, a priori, k/n might be greater than 1) separately.

In the above example, one can see that, roughly speaking, red vertices are more likely to be connected to each other than blue vertices. However, for both colours, they are more likely to be connected to a given vertex of the same colour than a vertex of the opposite colour. This might, for example, correspond to the kernel \begin{pmatrix}3&1\\1&2\end{pmatrix}.

The definition given above corresponds to a sparse setting, where the typical vertex degrees are \Theta(1). Obviously, one can set up an inhomogeneous random graph in a dense regime by an identical argument.

From an applications point of view, it’s not hard to imagine that an IRG of some flavour might be a good model for many phenomena observed in reality, especially when a mean-field assumption is somewhat appropriate. The friendships of boys and girls in primary school seems a particularly resonant example, though doubtless there are many others.

One particular application is to recover the types of the vertices from the topology of the graph. That is, if you see the above picture without the colours, can you work out which vertices are red, and which are blue? (Assuming you know the kernel.) This is clearly impossible to do with anything like certainty in the sparse setting – how does one decide about isolated vertices, for example? The probabilities that a red vertex is isolated and that a blue vertex is isolated differ by a constant factor in the n\rightarrow\infty limit. But in the dense setting, one can achieve this with high confidence. When studying such statistical questions, these IRGs are often referred to as stochastic block models, and the recent survey of Abbe [Abbe] gives a very rich history of this type of problem in this setting.

Poisson multitype branching processes

As in the case of the classical random graph G(n,c/n), we learn a lot about the IRG by studying its local structure. Let’s assume from now on that we are given a sequence of IRGs G^n(p^n,\kappa) for which \frac{p^n}{n}\rightarrow \pi, where \pi=(\pi_1,\ldots,\pi_k)\in[0,1]^k satisfies ||\pi||_1=1.

Now, let v^n be a uniformly-chosen vertex in [n]. Clearly \mathrm{type}(v^n)\stackrel{d}\rightarrow \pi, with the immediate mild notation abuse of viewing \pi as a probability distribution on [k].

Then, conditional on \mathrm{type}(v^n)=i:

  • when j\ne i, the number of type j neighbours of v^n is distributed as \mathrm{Bin}\left(p_j,1-\exp\left(-\frac{\kappa_{i,j}}{n}\right)\right).
  • the number of type i neighbours of v^n is distributed as \mathrm{Bin}\left( p_i-1,1-\exp\left(-\frac{\kappa_{i,i}}{n}\right)\right).

Note that p_j\left[1-\exp\left(-\frac{\kappa_{i,j}}{n}\right)\right]\approx \frac{p_j\cdot \kappa_{i,j}}{n} \approx \kappa_{i,j}\pi_j, and similarly in the case j=i, so in both cases, the number of neighbours of type j is distributed approximately as \mathrm{Poisson}(\kappa_{i,j}\pi_j).

This motivates the following definition of a branching process tree, whose vertices have k types. Continue reading

Linear Algebra II: Eigenvectors and Diagonalisability

This post continues the discussion of the Oxford first-year course Linear Algebra II. We’ve moved on from determinants, and are now considering eigenvalues and eigenvectors of matrices and linear maps.

A good question to ask is: what’s the point of knowing about eigenvectors? I can think of a quick answer and a longer answer. The quick answer is that whenever we have a mapping of any kind, it is natural to ask about its fixed points. And since we are thinking about vector spaces and linear maps, if we can’t find any fixed points, we might nonetheless be able to find the best thing, some vectors whose direction is fixed by the map. In general, knowing about fixed points of a mapping might tell us other more qualitative properties, including the behaviour seen when you apply the map iteratively a large number of times. (Indeed a recent post discusses this exact problem for positive matrices in a context relevant to a chapter of my thesis…)

A more specific answer concerns bases. Recall that a linear map is defined independently of any basis: it’s just a map from the vector space to itself. We can express the linear map via a matrix with respect to some basis, but how to choose the basis? We could always choose the canonical basis in \mathbb{R}^n, since it’s easy to do vector and matrix calculations when most of the entries of all the vectors are zero. We also have a good visual idea (at least in up to three dimensions) of what a matrix might mean with respect to that basis. If we needed to divide the three-dimensional world around us into small volumes, we’d tend to describe it with small cubes rather than small arbitrary parallelopipeds.

But once we know something about the linear map, we might want to choose a basis of vectors on which the behaviour of the map is particularly easy to describe. And eigenvectors fulfil precisely this role. If we are able to choose a basis of eigenvectors, describing the map’s action, either abstractly, or via a (diagonal) matrix, is very straightforward. If we are given a matrix to begin with, we know how to do a change of basis, and changing to the basis of eigenvectors is precisely what’s going when we write A=P^{-1}DP, where D is a diagonal matrix. We construct P by taking its columns to be these eigenvectors. In particular, for a given vector x, y=Px is the vector giving the coefficients of x in the basis of eigenvectors.

So the case where we have a basis of eigenvectors is particularly useful, and in this case, we say the matrix or the map is diagonalisable. Remember how we find eigenvalues. If there exists a non-zero vector x satisfying Ax=\lambda x, then x is in the kernel of A-\lambda I. As we discussed last time, introducing the determinant gives a much more manageable way to verify which values of \lambda result in A-\lambda I having a non-trivial kernel. In particular, if non-zero x is in the kernel, we have \mathrm{det}(A-\lambda I)=0, and this leads to a polynomial of degree n (the dimensional of the vector space / size of the matrix) for \lambda, called the characteristic polynomial \chi_A(z), which has the eigenvalues as its roots.

If we agree to work over the complex field, then this is good, because it means we always have eigenvalues, and so it becomes sensible to talk about exactly how many eigenvalues and eigenvectors we have. Observe that if we restrict to real vector spaces, this might not be the case. In the plane, the rotation by \pi/2 for example has no fixed vectors.

Multiplicities of eigenvalues

We call the algebraic multiplicity \alpha(\lambda) of an eigenvalue \lambda to be the exponent of the factor (z-\lambda) in the factorisation of the characteristic polynomial. To define the geometric multiplicity, observe that all the eigenvectors with eigenvalue \lambda form a subspace, and so it is meaningful to talk about the dimension of this subspace (‘eigenspace’), which is the geometric multiplicity \gamma(\lambda). There are two facts that one needs to remember. The slightly less obvious one is that \gamma(\lambda)\le \alpha(\lambda) for all \lambda. One can see this by, for example, working in a basis that extends a basis of the \lambda-eigenspace. Observe at this stage that the sum of the algebraic multiplicities has to be n by definition, while the sum of geometric multiplicities is at most n. And this makes sense, because the space spanned by all the eigenvectors is a subspace, and so has dimension at most n.

The more obvious, but more frequently forgotten result is that

\alpha(\lambda)\ge 1 \quad \iff \quad \gamma(\lambda)\ge 1,

which is simply a consequence of the property discussed a few paragraphs previously concerning the kernel of A-\lambda I.

In particular, we might make the heuristic observation that ‘most’ polynomials of degree n have n distinct roots. This is certainly true for quadratics: there is only one value that the discriminant can take such that we see a repeated root. Alternatively, imagine shifting the quadratic up and down (in a complex way if necessary); again there is only one moment at which it might have a repeated root. This observation can be generalised easily to higher degree polynomials in a number of ways.

So if we lift this observation across to matrices, we see that most matrices have n distinct eigenvalues, and thus have n linearly independent eigenvectors which form a basis, hence the matrix is diagonalisable. I think it’s really worth reflecting on this, since much of a first exploration into linear algebra ends up treating exactly the case where the matrix is not diagonalisable.

The principal example of a non-diagonalisable matrix is \begin{pmatrix}2&1\\0&2\end{pmatrix}, where the 2s can be replaced by any value, and the 1 can be replaced by an non-zero value. There’s plenty to learn about to what extent versions of this matrix of higher size represent all non-diagonalisable matrices, but such an exposition of Jordan normal form comes next year for the students taking this course.

It probably is worth saying now though, that this example gives a good sanity check for whether a method is actually using diagonalisability correctly. For example, it is easily seen that elementary row operations to not preserve diagonalisability by starting from \begin{pmatrix}2&0\\0&2\end{pmatrix} and ending up at our counter-example. One could also argue from this that the set of non-diagonalisable matrices are dense within the set of matrices with a repeated eigenvalue. That is, having a repeated eigenvalue but full eigenspace is doubly-infinitely-unlikely.

Cayley-Hamilton theorem

Anyway, among other results, we also saw the Cayley-Hamilton theorem, which states that a matrix A satisfies its own characteristic equation. That is \chi_A(A)=0, where the zero on the right-hand side is the zero matrix. It’s tempting to substitute A into the expression \mathrm{det}(A-\lambda I), but of course this is not valid. Indeed imagine a typical eigenvalue determinant matrix with terms like (7-\lambda) on the diagonal; it doesn’t make sense to substitute a matrix for \lambda as one of the entries of the overall matrix!

Fortunately, we can argue convincingly in the case where A is a diagonalisable matrix. Remember that \chi_A(A) is a matrix. Now looki at the action of \chi_A(A) on any eigenvector v, corresponding to eigenvalue \lambda. Applying some power of A to v gives v multiplied by the same power of \lambda, and so we end up with

\chi_A(A)v = \chi_A(\lambda)v = 0.

This only worked when v was an eigenvector, but fortunately there is a basis of eigenvectors if A is diagonalisable, and so \chi_A(A)v=0 for all v, hence \chi_A(A)=0.

But \chi_A(A) is just a matrix-valued function of A. If you think about it, \chi_A is a monic polynomial, all of whose non-leading coefficients are multinomials of degree at most n-1 in the entries of A. Furthermore, these multinomials have (non-negative) integer coefficients. Therefore the entries of \chi_A(A) are multinomials of degree at most 2n-1 in the entries of A, and again have (non-negative) integer coefficients.

Even without the integrality of the coefficients, this says that, under any reasonable definition of continuity of matrices (which could be induced from any topology on \mathbb{R}^{n\times n}) the function \chi_A(A) should be continuous as a function of A. But we’ve shown \chi_A(A)=0 for all diagonalisable A, and also argued that most complex-valued matrices are diagonalisable. Turning this into a formal statement about denseness means that we’ve shown the Cayley-Hamilton theorem for non-diagonalisable matrices also. It feels that because the coefficients are non-negative integers, we might also have shown the result for other fields too, but I have minimal knowledge or recollection at the moment of the things one has to check for this sort of result.

It’s worth ending with the brief comment that Cayley-Hamilton is useful, among other reasons because it enables us to write the inverse of A as a polynomial of degree at most n-1 in terms of A. In many settings this is a lot easier to work with in terms of calculations than an argument with minors.

Linear Algebra II: Determinants 2

In the previous post, we introduced determinants of matrices (and by extension linear maps) via its multilinearity properties, and as the change-of-volume factor. We also discussed how to calculate them, via row operations, or Laplace expansion, or directly via a sum of products of entries over permutations.

The question of why this is ever a useful quantity to consider remains, and this post tries to answer it. We’ll start by seeing one example where this is a very natural quantity to consider, and then the main abstract setting, where the determinant is zero, and consider a particularly nice example of this.

Jacobeans as a determinant

We consider integration by substitution. Firstly, in one variable: when it comes to Riemann integration of a function g(x) with respect to x, we view dx as the width of a small column which approximates the function near x. Now, if we reparameterise, that is if we write x=f(y) for some well-behaved (in particular differentiable) function f, then the width of the column is dx= dy.(dx/dy)=f'(y) dy. This may be negative, if y is decreasing while x is increasing, but for now let’s not worry about this overly, for example by assuming the function g is non-negative. Thus if we want to integrate with y as the variable, we multiply the integrand by this factor |f'(y)|.

What about in higher dimensions? We have exactly the same situation, only instead of two-dimensional columns, we have (n+1)-dimensional columns. We then multiply the n-dimensional volume of the base by the height, again given by g(\mathbf{x}). If we have a similar transformation of the base variable \mathbf{x}=f(\mathbf{y}), we differentiate to get

\mathrm{d}x_i = \sum_{j=1}^n\frac{\mathrm{d}f_i}{\mathrm{d}y_j} \mathrm{d}y_j.

In other words

\mathrm{d}\mathbf{x}= J \mathrm{d}\mathbf{y},

where J is the Jacobean matrix of partial deriatives. In particular, we know how to relate the volume [0,\mathrm{d}x_1]\times\ldots\times [0,\mathrm{d}x_n] to the volume [0,\mathrm{d}y_1]\times \ldots\times [0,\mathrm{d}y_n]. It’s simply the determinant of the Jacobean J. So if we want to integrate with respect to \mathbf{y}, it only remains to pre-multiply the integrand by |\mathrm{det}J| and proceed otherwise as in the one-dimensional case.

Det A = 0

A first linear algebra course might well motivate the introducing matrices as a notational shortcut for solving families of linear equations, Ax=b. The main idea is that generally we can solve this equation uniquely. Almost all of the theory developed in such a first linear algebra course deals with the case when this fails to hold. In particular, there are many ways to characterise this case, and we list some of them now:

  • Ax=b has no solutions for some b;
  • A is not invertible;
  • A has non-trivial kernel, that is, with dimension at least one;
  • A does not have full rank, that is, the image has dimension less than n;
  • The columns (or indeed the rows) are linearly dependent;
  • The matrix can be row-reduced to a matrix with a row of zeroes.

It is useful that these are equivalent, as in abstract problems one can choose whichever interpretation from this list is most relevant. However, all of these are quite hard to check. Exhibiting a non-trivial kernel element is hard – one either has to do manual row-reduction, or the equivalent in the context of linear equations. But we can add the characterisation

  • det A = 0;

to the list. And this is genuinely much easier to check for specific examples, either abstract or numerical.

Let’s quickly convince ourselves of a couple of these equivalences. Determinant is invariant under row-reductions, and by multilinearity it is certainly the case that det A = 0 if A has a row of zeroes. We also said that A is the change-of-volume factor. Note that A is a map from the domain to its image, so if A has less than full rank, then any set in the image has zero volume.

The Vandermonde matrix

This is a good example of this theory in practice. Consider the Vandermonde matrix where each row is a geometric progression:

V=\begin{pmatrix}1&\alpha_1&\ldots&\alpha_1^{n-1}\\1&\alpha_2&\ldots&\alpha_2^{n-1}\\ \vdots&\vdots&\ddots&\vdots\\ 1&\alpha_n&\ldots&\alpha_n^{n-1}\end{pmatrix}.

Now suppose we attempt to solve

V\begin{pmatrix}a_0\\a_1\\ \vdots\\ a_{n-1}\end{pmatrix}=\begin{pmatrix}b_1\\b_2\\ \vdots \\ b_n\end{pmatrix}.

There’s a natural interpretation to this, that’s especially clear with this suggestive notation. Each row corresponds to a polynomial, where the coefficients are given by the (a_0,a_1,\ldots,a_{n-1}), and the argument is given by \alpha_i.

So if we try to solve for (a_0,a_1,\ldots,a_{n-1}), given (\alpha_1,\ldots,\alpha_n) and (b_1,\ldots,b_n), we are asking whether we can find a polynomial P with degree at most n-1 such that P(\alpha_i)=b_i for i=1,\ldots,n. Lagrange interpolation gives an argument where we just directly write down the relevant polynomial, but we can also deploy our linear algebraic arguments too.

The equivalence of all these statements means that to verify existence and uniqueness of such a polynomial, we only need to check that the Vandermonde matrix has non-zero determinant. And in fact there are a variety of methods to show that

\mathrm{det}V=\prod_{1\le i< j}\le n(\alpha_j-\alpha_i).

For the polynomial question to be meaningful, we would certainly demand that the (\alpha_i) are distinct, and so this determinant is non-zero, and we’ve shown that n points determine a degree (n-1) polynomial uniquely.

If we multiply on the left instead, suppose that we are considering a discrete probability distribution X that takes n known values (\alpha_1,\ldots,\alpha_n) with unknown probabilities (p_1,\ldots,p_n). Then we have

(p_1,\ldots,p_n) V = (1,\mathbb{E}X, \mathbb{E}[X^2],\ldots, \mathbb{E}[X^{n-1}]).

So, again by inverting the Vandermonde matrix (which is know is possible since its determinant is non-zero…) we can recover the distribution from the first (n-1) moments of the distribution.

A similar argument applies to show that the Discrete Fourier Transform is invertible, and in this case (where the \alpha_is are roots of unity), the expression for the Vandermonde determinant is particularly tractable.

Linear Algebra II: Determinants 1

This term, I’m giving tutorials on a course that’s new to me, the apparently notorious ‘Linear Algebra II’ for first year undergraduates. I can appreciate how it might have ended up with this reputation, but as always, every challenge is also an opportunity, So I’m going to (try to) write a short series of blog posts about what we’ve discussed in the tutorials.

The first problem sheet-and-a-half concerned determinants of matrices. There are three things worth addressing here:

  1. What are abstract definitions, and which is most useful in each setting?
  2. How to actually calculate them?
  3. What’s the overall point?

The answers are obviously not completely unrelated, but we’ll probably defer the third question to a second post.

The determinant is a map from the set of matrices \mathcal{M}_n to the base field (hereafter assumed to be \mathbb{R},\mathbb{C}). The Oxford course defines it through its properties:

  • Multilinear in the columns of the matrix.
  • Equal to zero if two columns are equal.
  • Equal to one if the matrix is the identity.

One then checks that there is a unique such map, and so from now on it’s reasonable to call it the determinant of the matrix. It will follow from pretty much any consequence that we can replace ‘columns’ with ‘rows’ throughout and get the same map.

Other definitions

We have a closed form expression for the determinant given via permutations of n

\mathrm{det}(A)=\sum_{\sigma\in \Sigma_n} \mathrm{sign}(\sigma) a_{1\sigma(1)}\ldots a_{n\sigma(n)}.

We’ll come back to a discussion of when this particular definition is useful. It can be derived by carefully transforming the identity matrix into A, using the operations which are mentioned in the original definition of the determinant, in particular, keeping track of the number of transpositions of columns.

It’s clear from any definition that the determinant is a polynomial of degree n in the entries of the matrix, but this definition will be useful if you want to make some more precise comment on the nature of this polynomial. For example, if entries of the matrix are polynomials in x of various degree (think of the eigenvalue equation for example) this allows you to control (or at least bound) the overall degree of the determinant as a polynomial in x.

The determinant is also the volume of the n-dimensional parallelopiped formed by the column vectors of the matrix. This is easy to check in two dimensions, for the matrix \begin{pmatrix}a&b\\c&d\end{pmatrix}:

20160225_172206To calculate the area of the central parallelogram, we have to subtract the area of two small rectangles and four small triangles from the outer rectangle, obtaining

(a+b)(c+d)-2bc - \frac12(ac+ac+bd+bd)=ad-bc,

as we expect. This calculation is harder to execute in higher dimensions, and certainly harder to visualise.

Maybe, though, we don’t have to, so long as we can reassure ourselves that this volume satisfies the implicit definition of the determinant map at the start. Multilinearity in the columns is not that hard to see. If we multiply the jth column by some constant, we are stretching the parallelopiped by the same constant factor in one direction, and so the volume grows appropriately. The additivity property can similarly be thought of as joining together two parallelopipeds at their common face (which is common since the other column vectors have to be constant in this construction). If two column vectors are equal, then clearly this volume actually has dimension at most n-1, and thus volume zero, so the final two conditions are genuinely easy to check.

The challenge here is that there is a direction involved. Determinants can be negative, but in our classical viewpoint, areas generally are not. In 2D, we can think of this as saying that the area is positive if the vector (b,d) lies anti-clockwise from (a,c) in the parallelogram, while is it negative otherwise. Again, this is harder to visualise in higher dimensions, but it is at least plausible that one could develop a similar decomposition. Ultimately, we are happy with the notion of directed lengths (ie vectors on the real line), and these are easy to add up without having to separate into cases, and the case holds for areas and higher-dimensional volumes.

Evaluating determinants

If we actually want to compute the determinant of a given matrix, the sum over permutations is intractable since it doesn’t have any natural splits into stages. The implicit definitions and this area consideration are clearly useless for all but the most special of examples.

The Laplace expansion is the usual algorithm to calculate the determinant of an n x n matrix. You pick a row (or a column), and evaluate the determinants of the (n-1) x (n-1) minor matrices given by deleting this row (or column), and each column (or row) in turn. This leaves us with n determinants of smaller matrices, which we pre-multiply by the entries in the original deleted row (or column), and add up in an alternating way (*). This is highly computationally intensive for large matrices, but for 3×3 and 4×4 can be done by hand with probability of an error bounded away from 1.

There is the flexibility to choose the reference row or column. Since the entries of these affect the sum through small products, it is highly convenient to choose a row or column with a lot of zeros. In particular, if there’s a row or column with exactly one non-zero entry, this is an ideal candidate.

The sum over permutations also works well when a lot of the entries are zero, because then a lot of the permutations give a summand which is zero. Upper-triangular matrices are a good example: only one permutation (the identity permutation) avoids all the zero elements underneath the diagonal.

One can also observe from the multilinearity property of the determinant map that there are lots of operations we can apply to the matrix which leave the determinant fixed. These are often called elementary row operations, though obviously we can apply them to the columns as well. To summarise, if we interchange two rows, the sign of the determinant is reversed. And if we add some multiple of one row to any other row, the determinant stays the same.

When matrices are not square, it’s quite important to be specific about exactly what form you can reduce a general matrix to via such row operations, but in this context, it’s not hugely important. Reduced echelon form (without the condition that leading coefficients will be one) is achievable, but this is a special case of an upper triangular matrix, for which the determinant is given by the product of the diagonal entries, ie is easy.

Whether this is substantially easier than Laplace expansion depends on the matrix itself and taste, both to do manually, and to code.

(*) I’m not a fan personally of this alternating definition. It seems to me much more natural to define the minor as

M^{i,j}=(a_{i+k,j_\ell})_{1\le k,\ell\le n-1},

with indices taken modulo n. Then you don’t have any \pm 1s in the Laplace expansion.

Using determinants in abstract problems

So the determinant gives directly the area of the image (under A) of the unit hypercube. By linearity (of A), it is easy to see that it also gives the scale factor of the area change (under A) of any hyper-cuboid, parallel to the conventional axes, anywhere in the space. Then, eg by approximating any sensible n-dimensional shape (*) as a union of such hyper-cuboids, we can show that in fact the area of any sensible shape increases by a factor (det A) under application of A.

This is a good thing to remember, because it is an excellent heuristic for seeing why the determinant of a linear map is basis-independent. It also gives a much easier proof of the key result

\mathrm{det}(AB)=\mathrm{det}(A) \mathrm{det}(B),

than that given by fixing B and viewing det(AB) as a map from matrices to the field, just like the original definition of determinant.

Some of the theory in the course is proved using elementary row operations. But these invite complicated notation, so are best used only in simple arguments, or when things are fairly explicit to begin with. Given an abstract problem about determinants of matrices, it is often tempting to induct on the size of the matrix in some way. I think it’s worth saying that even though the Laplace expansion is explicitly set up in this way, the notation involved is also likely to be annoying here, while permutations are easy to describe inductively: eg let \sigma(1)=k, then view the remainder of the permutation as a bijection \{2,3,\ldots,n\}\rightarrow [n]\backslash \{k\}.

Shortly, we’ll have a second post answering the final question: what’s the point of working with determinants? We’ve already seen half an answer, in that they describe the change-of-volume factor of a matrix (or linear map), but this can be substantially developed.

Lagrange multipliers Part Two

My own question on last week’s BMO2 notwithstanding, inequalities seem out of fashion at the moment among mainstream international olympiads. Such problems often involve minimising some function subject to a constraint, and word has, over the years, filtered down to students interested in such things, that there’s a general method for achieving this via Lagrange multipliers. The motivation for my talk in Hungary, summarised by the previous post and this one, is to dispute the claim made by some of the UK students that these are hard to justify rigorously. I dispute this because I don’t think it’s qualitatively much harder to justify Lagrange multipliers rigorously than an unconstrained optimisation problem, whereas I would claim instead that Lagrange multipliers are merely unlikely to work at a computational level on the majority of olympiad problems.

Unconstrained optimisation in two variables

Before we can possibly discuss constrained optimisation, we should discuss unconstrained optimisation. That is, finding minima of a function of several variables. We don’t lose too much by assuming that our function f(x,y) depends on two variables.

Recall that our method in the previous post for justifying the A-level approach to minima was to find a necessary condition to be a local minimum, and also a general reason why there should be a global minimum. That way, if there are finitely many points satisfying the condition, we just check all of them, and the one with the smallest value of f is the global minimum. We’ll discuss the existence of the global minimum later.

If we hold one coordinate fixed, the local variation of a function is equivalent to the one-dimensional case.

f(x+h,y)-f(x,y)= h\frac{\partial f}{\partial x}(x,y)+O(h^2).

In general we want to vary both variables, which is fine since

f(x+h,y+\ell)-f(x+h,y)=\ell \frac{\partial f}{\partial y}(x+h,y) + O(h^2).

But since we really want everything to be determined by the function at (x,y), we really want

\frac{\partial f}{\partial y}(x+h,y) \approx \frac{\partial f}{\partial y}(x,y),

and so we be mindful that we may have to assume that both partial derivatives are continuous everywhere they exist. Once we have this though, we can rewrite as

f(x+h,y+\ell) - f(x,y)= h\frac{\partial f}{\partial x}(x,y) + \ell \frac{\partial f}{\partial y}(x,y) + O(h^2)

= (\frac{\partial f}{\partial x}(x,y) ,\frac{\partial f}{\partial y}(x,y) )\cdot (h,\ell) + O(h\vee \ell^2).

In particular, if we define grad of f to be \nabla f(x,y)=(\frac{\partial f}{\partial x}(x,y), \frac{\partial f}{\partial y}(x,y) ) and apply a similar argument to that which we used in the original setting. If \nabla f(x,y)\ne 0, then we can choose some small (h,\ell), such that f(x+h,y+\ell)<f(x,y). Thus a necessary condition for (x,y) to be a local minimum for f is that \nabla f=0.

Lagrange multipliers

This is natural time to discuss where Lagrange multipliers emerge. The setting now is that we still want to minimise some function f(\mathbf{x}), but only across those values of \mathbf{x} which satisfy the constraint g(\mathbf{x})=0.

But our approach is exactly the same, namely we find a necessary condition to be a local minimum subject to the condition. As before, we have

f(\mathbf{x}+\mathbf{h}) - f(\mathbf{x}) = \mathbf{h}\cdot \nabla f + O(|\mathbf{h}|^2),

but we are only interested in those small vectors \mathbf{h} for which \mathbf{x}+\mathbf{h} actually satisfies the constraint, namely g(\mathbf{x}+\mathbf{h})=0. But then

0 = g(\mathbf{x}+\mathbf{h})- g(\mathbf{x})=\mathbf{h}\cdot \nabla g + O(|\mathbf{h}|^2).

From this, we conclude that the set of small relevant \mathbf{h} is described by \mathbf{h}\nabla g=O(|\mathbf{h}|^2). And now we really can revert to the original argument. If there’s a small \mathbf{h} such that \mathbf{h}\cdot \nabla g=0 but \mathbf{h}\cdot \nabla f \ne 0, then we can find some \mathbf{h'_+},\mathbf{h'_-}=\pm \mathbf{h}+O(|\mathbf{h}|^2) such that at least one of f(\mathbf{h'_+}),f(\mathbf{h'_-})<f(\mathbf{h}).

So a necessary condition to be a constrained local minimum is that every vector which is perpendicular to \nabla g must also be perpendicular to \nabla f. From which it follows that these two vectors must be parallel, that is \nabla f(\mathbf{x})=\lambda \nabla g(\mathbf{x}), where \lambda is the so-called Lagrange multiplier. Of course we must also have that g(\mathbf{x})=0, and so we have a complete characterisation for a necessary condition that the constrained optimisation has a local minimum at \mathbf{x}, assuming that all the derivatives of both f and g exist with suitable regularity near \mathbf{x}.

Technicalities

The point of setting up the one-variable case in the unusual way in the previous post was to allow me to say at this stage: “it’s exactly the same”. Well, we’ve already seen an extra differentiability condition we might require, but apart from that, the same approach holds. Multi-variate continuous functions also attain their bounds when the domain is finite and includes its boundary.

Checking the boundary might be more complicated in this setting. If the underlying domain is \{x,y,z\ge 0\}, then one will have to produce a separate argument for why the behaviour when at least one of the variables is zero fits what you are looking for. Especially in the constrained case, it’s possible that the surface corresponding to the constraint doesn’t actually have a boundary, for example if it is the surface of a sphere. Similarly, checking that the objective function gets large as the variables diverge to infinity can be annoying, as there are many ‘directions’ down which to diverge to infinity.

Motivating Cauchy-Schwarz

In practise, you probably want to have such methods in hand as a last resort on olympiad problems. It’s always possible that something will slip through the net, but typically problem-setters are going to trying to ensure that their problems are not amenable to mindless application of non-elementary methods. And even then, one runs the risk of accusations of non-rigour if you don’t state exact, precise results which justify everything which I’ve presented above.

One thing that can be useful, on the other hand, is to observe that the Lagrange multiplier condition looks a lot like the equality condition for Cauchy-Schwarz. So, even if you can’t solve the family of Lagrange multiplier ‘equations’, this does suggest that applying Cauchy-Schwarz to the vectors involved might give you some insight into the problem.

The following inequality from the IMO 2007 shortlist is a good example.

Suppose a_1,\ldots,a_{100}\ge 0 satisfy a_1^2+\ldots+a_{100}^2=1. Prove that

a_1^2a_2+a_2^2a_3+\ldots + a_{100}^2 a_1<\frac{12}{25}.

We shouldn’t be perturbed by the strictness. Maybe we’ll end up showing a true bound in terms of surds that is less neat to write down…

Anyway, applying Lagrange multipliers would require us to solve

a_{k-1}^2 + 2a_ka_{k+1}=2\lambda a_k,\quad k=1,\ldots,100,

with indices taken modulo 100. As so often with these cyclic but non-symmetric expressions, this looks quite hard to solve. However, it turns out that by applying Cauchy-Schwarz to the vectors (a_k),(a_{k-1}^2+2a_ka_{k+1}) gets us a long way into the problem by classical means. Working all the way through is probably best left as an exercise.

Lagrange multipliers Part One: A much simpler setting

I am currently in northern Hungary for our annual winter school for some of the strongest young school-aged mathematicians in the UK and Hungary. We’ve had a mixture of lectures, problem-solving sessions and the chance to enjoy a more authentic version of winter than is currently on offer in balmy Oxford.

One of my favourite aspects of this event is the chance it affords for the students and the staff to see a slightly different mathematical culture. It goes without saying that Hungary has a deep tradition in mathematics, and the roots start at school. The British students observe fairly rapidly that their counterparts have a much richer diet of geometry, and methods in combinatorics at school, which is certainly an excellent grounding for use in maths competitions. By contrast, our familiarity with calculus is substantially more developed – by the time students who study further maths leave school, they can differentiate almost anything.

But the prevailing attitude in olympiad circles is that calculus is unrigorous and hence illegal method. The more developed summary is that calculus methods are hard, or at least technical. This is true, and no-one wants to spoil a measured development of analysis from first principles, but since some of the British students asked, it seemed worth giving a short exposition of why calculus can be made rigorous. They are mainly interested in the multivariate case, and the underlying problem is that the approach suggested by the curriculum doesn’t generalise well at all to the multivariate setting. Because it’s much easier to imagine functions of one variable, we’ll develop the machinery of the ideas in this setting in this post first.

Finding minima – the A-level approach

Whether in an applied or an abstract setting, the main use of calculus at school is to find where functions attain their maximum or minimum. The method can be summarised quickly: differentiate, find where the derivative is zero, and check the second-derivative at that value to determine that the stationary point has the form we want.

Finding maxima and finding minima are a symmetric problem, so throughout, we talk about finding minima. It’s instructive to think of some functions where the approach outlined above fails.

20160131_124059In the top left, there clearly is a minimum, but the function is not differentiable at the relevant point. We can probably assert this without defining differentiability formally: there isn’t a well-defined local tangent at the minimum, so we can’t specify the gradient of the tangent. In the top right, there’s a jump, so depending on the value the function takes at the jump point, maybe there is a minimum. But in either case, the derivative doesn’t exist at the jump point, so our calculus approach will fail.

In the middle left, calculus will tell us that the stationary point in the middle is a ‘minimum’, but it isn’t the minimal value taken by the function. Indeed the function doesn’t have a minimum, because it seems to go off to -\infty in both directions. In the middle right, the asymptote provides a lower bound on the values taken by the function, but this bound is never actually achieved. Indeed, we wouldn’t make any progress by calculus, since there are no stationary points.

At the bottom, the functions are only defined on some interval. In both cases, the minimal value is attained at one of the endpoints of the interval, even though the second function has a point which calculus would identify as a minimum.

The underlying problem in any calculus argument is that the derivative, if it exists, only tells us about the local behaviour of the function. At best, it tells us that a point is a local minimum. This is at least a necessary condition to be a global minimum, which is what we actually care about. But this is a change of emphasis from the A-level approach, for which having zero derivative and appropriately-signed second-derivative is treated as a sufficient condition to be a global minimum.

Fortunately, the A-level approach is actually valid. It can be shown that if a function is differentiable everywhere, and it only has one stationary point, where the second-derivative exists and is positive, then this is in fact the global minimum. The first problem is that this is really quite challenging to show – since in general the derivative might not be continuous, although it might have many of the useful properties of a continuous function. Showing all of this really does require setting everything up carefully with proper definitions. The second problem is that this approach does not generalise well to multivariate settings.

Finding minima – an alternative recipe

What we do is narrow down the properties which the global minimum must satisfy. Here are some options:

0) There is no global minimum. For example, the functions pictured in the middle row satisfy this.

Otherwise, say the global minimum is attained at x. It doesn’t matter if it is attained at several points. At least one of the following options must apply to each such x.

1) f'(x)=0,

2) f'(x) is not defined,

3) x lies on the boundary of the domain where f is defined.

We’ll come back to why this is true. But with this decomposition, the key to identifying a global minimum via calculus is to eliminate options 0), 2) and 3). Hopefully we can eliminate 2) immediately. If we know we can differentiate our function everywhere, then 2) couldn’t possibly hold for any value of x. Sometimes we will be thinking about functions defined everywhere, in which case 3) won’t matter. Even if our function is defined on some interval, this only means we have to check two extra values, and this isn’t such hard work.

20160131_124716It’s worth emphasising why if x is a local minimum not on the boundary and f'(x) exists, then f'(x)=0. We show that if f'(x)\ne 0, then x can’t be a local minimum. Suppose f'(x)>0. Then both the formal definition of derivative, and the geometric interpretation in terms of the gradient of a tangent which locally approximates the function, give that, when h is small,

f(x-h) = f(x)-h f'(x) +o(h),

where this ‘little o’ notation indicates that for small enough h, the final term is much smaller than the second term. So for small enough h, f(x-h)<f(x), and so we don’t have a local minimum.

The key is eliminating option 0). Once we know that there definitely is a global minimum, we are in a good position to identify it using calculus and a bit of quick checking. But how would we eliminate option 0)?

Existence of global minima

This is the point where I’m in greatest danger of spoiling first-year undergraduate course content, so I’ll be careful.

As we saw in the middle row, when functions are defined on the whole real line, there’s the danger that they can diverge to \pm \infty, or approach some bounding value while never actually attaining it. So life gets much easier if you work with functions defined on a closed interval. We also saw what can go wrong if there are jumps, so we will assume the function is continuous, meaning that it has no jumps, or that as y gets close to x, f(y) gets close to f(x). If you think a function can be differentiated everywhere, then it is continuous, because we’ve seen that once a function has a jump (see caveat 2) then it certainly isn’t possible to define the derivative at the jump point.

It’s a true result that a continuous function defined on a closed interval is bounded and attains its bounds. Suppose such a function takes arbitrarily large values. The main idea is that if the function takes arbitrarily large values throughout the interval, then because the interval is finite it also takes arbitrarily large values near some point, which will make it hard to be continuous at that point. You can apply a similar argument to show that the function can’t approach a threshold without attaining it somewhere. So how do you prove that this point exists? Well, you probably need to set up some formal definitions of all the properties under discussion, and manipulate them carefully. Which is fine. If you’re still at school, then you can either enjoy thinking about this yourself, or wait until analysis courses at university.

My personal opinion is that this is almost as intuitive as the assertion that if a continuous function takes both positive and negative values, then it has a zero somewhere in between. I feel if you’re happy citing the latter, then you can also cite the behaviour of continuous functions on closed intervals.

Caveat 2) It’s not true to say that if a function doesn’t have jumps then it is continuous. There are other kinds of discontinuity, but in most contexts these are worse than having a jump, so it’s not disastrous in most circumstances to have this as your prime model of non-continuity.

Worked example

Question 1 of this year’s BMO2 was a geometric inequality. I’ve chosen to look at this partly because it’s the first question I’ve set to make it onto BMO, but mainly because it’s quite hard to find olympiad problems which come down to inequalities in a single variable.

Anyway, there are many ways to parameterise and reparameterise the problem, but one method reduces, after some sensible application of Pythagoras, to showing

f(x)=x+ \frac{1}{4x} + \frac{1}{4x+\frac{1}{x}+4}\ge \frac{9}{8}, (*)

for all positive x.

There are simpler ways to address this than calculus, especially if you establish or guess that the equality case is x=1/2. Adding one to both sides is probably a useful start.

But if you did want to use calculus, you should argue as follows. (*) is certainly true when x\ge \frac{9}{8} and also when $x\le \frac{2}{9}$. The function f(x) is continuous, and so on the interval [\frac{2}{9},\frac{9}{8}] it has a minimum somewhere. We can differentiate, and fortunately the derivative factorises (this might be a clue that there’s probably a better method…) as

(1-\frac{1}{4x^2}) \left[ 1 - \frac{4}{(4x+\frac{1}{x}+4)^2} \right].

If x is positive, the second bracket can’t be zero, so the only stationary point is found at x=1/2. We can easily check that f(\frac12)=\frac98, and we have already seen that f(\frac29),f(\frac98)>\frac98. We know f attains its minimum on [\frac29,\frac98], and so this minimal value must be $\frac98$, as we want.

Overall, the moral of this approach is that even if we know how to turn the handle both for the calculation, and for the justification, it probably would be easier to use a softer approach if possible.

Next stage

For the next stage, we assess how much of this carries across to the multivariate setting, including Lagrange multipliers to find minima of a function subject to a constraint.

Perturbation of Eigenvectors

In the previous post, I talked about eigenvalues, and some alternative characterisations which could be useful in some circumstances. Recently, I’ve been interested in controlling how eigenvalues and eigenvectors change as the matrix is varied. My particular example concerns positive matrices, which have a well-defined largest eigenvalue (or Perron root), and a unique (up to normalising in some way) principal eigenvector.

We might expect that perturbing a matrix slightly does not change the eigenvectors very much, since any original eigenvector is still almost an eigenvector, in the sense that its image under the action of the perturbed matrix is almost equal to a multiple of itself. But how to make this precise? And when does it go wrong?

Eigenvalues – The non-multiple case

Throughout, we assume we have a k x k matrix. We might want to allow the entries to be complex, but for now, real entries are perfectly interesting enough.

It makes sense to start with eigenvalues, since it’s easy to define these through the characteristic equation of the matrix. The coefficients of this polynomial are well-behaved (indeed polynomial) functions of the entries of the matrix. So we are really asking how the set of roots of a finite polynomial evolves as the (k+1) coefficients of the polynomial evolve. It is fairly clear that, under any sensible choice of topology on the space of k-(multi)-subsets of \mathbb{C}, the multiset of roots is continuous in the coefficients of the polynomial.

To say anything more precise, we have to introduce some notation.

Let \chi_{A}(z)=z^k+\gamma_{k-1}(A)z^{k-1}+\ldots+\gamma_1(A)z+\gamma_0(A) be the characteristic polynomial of A. Each \gamma_i is a polynomial of degree k-i in the entries of A. Let’s consider now a matrix-valued function A(t), and we assume that the entries of A(t) are all differentiable with respect to t. So each \gamma_i(A(t)) is also differentiable with respect to t.

At this point, let’s make the assumption that t lies in some interval [r,s] for which the eigenvalues of A(t) are distinct. Let \lambda(t) be some eigenvalue of A(t), chosen such that \lambda is a continuous function of t. For example, we might take \lambda(t)=\Lambda_1(t), the eigenvalue with largest absolute value (with some canonical tie-breaking mechanism). Then \chi_{A(t)}(\lambda(t))=0, and so differentiating with respect to \gamma_i:

0=\chi'_{A(t)}(\lambda(t)) \frac{\partial \lambda}{\partial \gamma_i} \Big|_{A(t)} + \lambda(t)^i.

Because we deliberately demanded that the eigenvalues were disjoint, we have \chi'_{A(t)}(\lambda(t))\ne 0, and so \frac{\partial \lambda(t)}{\partial \gamma_i}=-\lambda(t)^i / \chi'_{A(t)}(\lambda(t)). In particular, \lambda(t) is differentiable with respect to the coefficients of the characteristic polynomial, and thus with respect to t also.

Multiple Eigenvalues

It gets more complicated when the characteristic equation has multiple roots. Typically we will be interested in the evolution of the eigenvalue with some extremal property, probably the largest one. Let’s restrict to the real, symmetric case, where the set of eigenvalues is complete and real. Suppose we have t_0 such that A(t_0) has a repeated eigenvalue. Then, in a small enough region of t_0, we can define eigenvalues \lambda(t),\mu(t) continuously such that \lambda(t_0)=\mu(t_0) while \lambda(t)\ne \mu(t) for t=\ne t_0. Then, if the entries of A(t) are analytic functions of t, then so are \lambda(t),\mu(t).

But then \max(\lambda(t),\mu(t)) will in general not be analytic, as the maximum of two smooth functions is in general Lipschitz.

This effect is most obvious in the case of a diagonal matrix A(t)=\begin{pmatrix}t&0\\0&-t\end{pmatrix}, for which the largest eigenvalue is |t|.

Eigenvectors

When the matrix A is real and symmetric, we know it has real eigenvalues, and an orthogonal basis of eigenvectors. Then the Rayleigh quotient characterises the eigenvector as well as the eigenvalue. Recall that for any x\in\mathbb{R}^k with ||x||_2=1, we have

\lambda_1\ge x^T A x \ge \lambda_k,

with equality precisely at the respective eigenvectors. So if we perturb A slightly, keeping it real and symmetric, we can control the principal eigenvector quite well by this method.

If A is not diagonalisable, we can still say something about this principal eigenvector, via large powers of A, sometimes called the Van Mises iteration. This says that for large N, A^N v should have direction close to that of the eigenvector, for any test vector v. The rate of convergence depends on the ratio of the largest eigenvalue to the second largest eigenvalue, though if the matrix is not diagonalisable, it is not completely trivial to quantify this convergence. We have to be careful though, since A maps the subspace orthogonal to the eigenvector to itself, so the magnitude of the projection of v onto the eigenvector determines the speed of convergence. Indeed, if v is orthogonal to the eigenvector, it won’t converge towards the principal eigenvector at all. (But if there is a well-defined ‘second eigenvector’ then it will converge towards that.)

Continuity of Eigenvectors

The reason why I ended up reading about some of these topics was that I wanted to show that the Perron eigenvector of a positive matrix (that is, the unique eigenvector corresponding to the Perron root) was Lipschitz continuous as a function of the entries of the positive matrix. Since for such a matrix, the largest eigenvalue is simple, we are able to make some progress.

In general, the condition that v is an eigenvector of matrix A with eigenvalue \lambda is described by the relation:

(A-\lambda I)v=0,\quad ||u||_1=1, (*)

or whatever the most appropriate normalising condition appears. This describes an implicit relation between A and the eigenvalue-eigenvector pair (\lambda,v). So given a matrix A_0 with eigenvalue \lambda_0 corresponding to eigenvector v_0, in a neighbourhood of A_0 in \mathbb{R}^{k\times k} we can use the implicit function theorem to comment on the differentiability of (\lambda,v) with respect to A in this neighbourhood.

Precisely, we require the matrix of partial derivatives from (*)

\begin{pmatrix} A_0-\lambda_0 I&v_0 \\ \mathbf{1}&0\end{pmatrix}

to have non-zero determinant. But if \lambda_0 is not simple, then if we apply this matrix from the left to one of the other eigenvectors (with a zero appended) we can see that it has non-trivial kernel. With a bit more work, we can show the converse too, and conclude that (\lambda,v) are smooth with respect to A in some neighbourhood of A_0.

Finally, we observe that when the eigenvalues are not simple, we can’t even guarantee continuity of the eigenvectors. This is unsurprising really, since for a multiple eigenvalue, a) we might not know how many LI eigenvectors exists; and b) we might have complete freedom over the choice of eigenvectors. Think about the identity matrix! Indeed the eigenvectors of \begin{pmatrix}1+\epsilon &0\\ 0&1+\epsilon\end{pmatrix} are (1,0) and (0,1), while the eigenvectors of \begin{pmatrix}1&\epsilon\\ \epsilon&1\end{pmatrix} are (1,1), (1,-1). So no continuous choice of eigenvectors is possible here.

Characterisations of Eigenvalues

I’ve been working for much of the past few months on a version of the frozen percolation random graph process with types. The connectivity between types is controlled by a (finite) non-negative square matrix, and so I’ve been engaging with linear algebra theory to an extent I haven’t really experienced since the second or third year of undergraduate maths.

We are interested in whether the graphs in question are subcritical, critical or supercritical. As in the case of multitype branching processes, this is controlled by the principal eigenvalue of a related non-negative matrix. So I’ve been looking up lots of methods for controlling eigenvalues, and some have proved useful, and some have not, but I thought it would be worthwhile to present some of them here.

Bounds and characterisations of spectral radius

Throughout, I will be talking about finite, square matricies. Eigenvalues may be defined as roots of the characteristic polynomial, and so by the fundamental theorem of algebra, there is always at least one complex eigenvalue. There is always at least one eigenvector associated to any eigenvalue. However, the dimension of the eigenspace is not always the same as the multiplicity of the eigenvalue as a root of the characteristic polynomial. The latter is called algebraic multiplicity, while the former is geometric multiplicity.

For now though, this distinction will be unimportant. The spectral radius of a matrix A is defined as

\rho(A)=\max \{|\lambda|\, : \, \lambda \text{ and eigenvalue of }A\}.

We can bound the spectral radius in terms of the norm of the matrix. Remember that a matrix norm has to satisfy all the usual properties of a norm, as well as a submultiplicative property |||AB|||\le |||A|||\cdot |||B|||. This is good, as otherwise we would be free to replace any norm by an arbitrary multiple of itself, and so no useful bounds could ever emerge. Note that the submultiplicativity implies that |||I_n||\ge 1.

Now, let \lambda,x be some eigenvalue and associated (right-)eigenvector respectively of matrix A. Let X be the square matrix given by taking all the columns to be x. Now Ax=\lambda x implies AX=\lambda X, and so

|\lambda| \cdot|||X||| = |||\lambda X||| = |||AX||| \le |||A|||\cdot |||X|||,

and thus we conclude our most basic bound \lambda \le |||A|||.

When A is diagonalisable, life is particularly easy, but in general we can write A as a conjugate of its Jordan normal form. Then, by looking at each diagonal block of the Jordan normal form separately, we can show that

\lim_{k\rightarrow 0}A^k = 0\quad \iff \quad \rho(A)<1.

Then, applying this, with additional care, to the matrices A / (\rho(A)\pm \epsilon), we derive Gelfand’s Formula, that \rho(A) = \lim_{k\rightarrow \infty} ||A^k||^{1/k}. Again, this applies for any matrix norm.

Real symmetric matrices

When the matrix is real and symmetric, it is not too hard to show that all the eigenvalues are real, and furthermore that all the geometric multiplicities are equal to the algebraic multiplicities. That is, the matrix is diagonalisable, and there is an (orthogonal) basis of eigenvectors. Once we assume we are working with respect to this eigenbasis, it is easy to see how the Rayleigh quotient characterisation of the largest (and smallest) eigenvalue works. Let’s say the eigenvalues are \lambda_1\ge \lambda_2\ge\ldots \ge \lambda_n, then for any ||x||_2=1, we have \lambda_1\ge x^T A x\ge \lambda_n, and equality is attained when x is the respective eigenvector, normalised appropriately.

This is an especially useful characterisation of the largest eigenvalue, as for example we can see fairly easily that this means \lambda_1 is a convex function of the (real, symmetric) matrix.

We can generalise this Rayleigh quotient idea if we take k orthonormal vectors in R^k, arrange them in an nxk matrix P, so that P^T P = I_k. Now we consider the matrix P^TAP. [Note that if k=1, we are exactly considering x^TA x as before.] Then Poincare’s Separation Theorem say that the eigenvalues \mu_1\ge \mu_2\ge\ldots \mu \mu_k of P^TAP (which is also real, symmetric) are bounded by the original eigenvalues:

\lambda_{n-k+i} \ge \mu_i\ge \lambda_i.

Since the trace is preserved under conjugation, and the trace is the sum of eigenvalues, we can apply this result with P’s columns taken to be the any k canonical basis vectors of \mathbb{R}^k. Without loss of generality, we may assume the basis has been chosen so that the diagonal elements of A satisfy a_{11}\ge a_{22}\ge\ldots\ge a_{nn}, and so now we have that the sequence (a_{11},a_{22},\ldots,a_{nn}) is majorised by (\lambda_1,\lambda_2,\ldots,\lambda_n) and majorises (\lambda_n,\lambda_{n-1},\ldots,\lambda_1). The first of these relations can be used via the setup of Karamata’s inequality to conclude that for any convex function f, we have

\sum_{i=1}^n f(\lambda_i)\ge \sum_{i=1}^n f(a_{ii}).

Gershgorin Circles

In fact, we can relate the eigenvalues to the diagonal entries of the matrix in a more general setting. We are motivated by the thought that if the off-diagonal entries are all very small, then the set of eigenvalues should be approximately given by the set of diagonal entries.

For a square complex matrix, let \lambda,x be an eigenvalue, eigenvector pair. For any index i, we have

\lambda - a_{ii}= \frac{\sum_j a_{ij}x_j}{x_i} - a_{ii} = \frac{\sum_{j\ne i}a_{ij}x_j}{x_i}.

Now consider the i such that x_i=\max |x_j|, and take absolute values and apply the triangle inequality,

|\lambda - a_{ii}| \le \sum_{j\ne i} \left| \frac{a_{ij}x_j}{x_i} \right| \le \sum_{j\ne i}|a_{ij}|.

Let’s define R_i=\sum_{j\ne i}|a_{ij}| to be the sum of the non-diagonal entries of the ith row. Then the Gershgorin circle theorem says that every eigenvalue lies within at least one of the discs B(a_{ii},R_i), in the complex plane. So our motivation still makes sense. If the off-diagonal entries are small, this is a strong restriction, and if they are not typically smaller than the diagonal entries, then we perhaps do not learn very much. Obviously, we could apply the same argument to the columns too.

When the diagonal entries are distinct, and the off-diagonal entries are small, the Gershgorin discs are distinct, and we would expect each to contain exactly one eigenvalue, corresponding to the appropriate diagonal entry. In fact, we can say something stronger. In general, the union of the discs is a subset of the complex plane with some connected components. Then, if a component is the union of exactly r discs, then it contains exactly r of the eigenvalues.

To see this, consider multiplying all the off-diagonal entries by z\in[0,1] and observe what happens as z varies from 0 to 1. When z=0, the matrix is diagonal, and each eigenvalue is in the Gershgorin disc (which is a single complex number). As z varies continuously, the characteristic polynomial varies continuously, and also its roots, that is the set of eigenvalues. So since each of the r eigenvalues are initially within the union of the r original, large Gershgorin discs, they must remain within this union as z varies, since they cannot ‘jump’ to another component.

It’s hard to know how time will allow, but provisionally in the next post I will talk about how to control the evolution of eigenvectors as a function of the matrix, and in particular what can go wrong.

REFERENCES

For the middle section, I used the progression from Chapter 4 of Matrix Differential Calculus with Applications in Statistics and Econometrics (Magnus and Neudecker).