Linear Algebra II: Eigenvectors and Diagonalisability

This post continues the discussion of the Oxford first-year course Linear Algebra II. We’ve moved on from determinants, and are now considering eigenvalues and eigenvectors of matrices and linear maps.

A good question to ask is: what’s the point of knowing about eigenvectors? I can think of a quick answer and a longer answer. The quick answer is that whenever we have a mapping of any kind, it is natural to ask about its fixed points. And since we are thinking about vector spaces and linear maps, if we can’t find any fixed points, we might nonetheless be able to find the best thing, some vectors whose direction is fixed by the map. In general, knowing about fixed points of a mapping might tell us other more qualitative properties, including the behaviour seen when you apply the map iteratively a large number of times. (Indeed a recent post discusses this exact problem for positive matrices in a context relevant to a chapter of my thesis…)

A more specific answer concerns bases. Recall that a linear map is defined independently of any basis: it’s just a map from the vector space to itself. We can express the linear map via a matrix with respect to some basis, but how to choose the basis? We could always choose the canonical basis in \mathbb{R}^n, since it’s easy to do vector and matrix calculations when most of the entries of all the vectors are zero. We also have a good visual idea (at least in up to three dimensions) of what a matrix might mean with respect to that basis. If we needed to divide the three-dimensional world around us into small volumes, we’d tend to describe it with small cubes rather than small arbitrary parallelopipeds.

But once we know something about the linear map, we might want to choose a basis of vectors on which the behaviour of the map is particularly easy to describe. And eigenvectors fulfil precisely this role. If we are able to choose a basis of eigenvectors, describing the map’s action, either abstractly, or via a (diagonal) matrix, is very straightforward. If we are given a matrix to begin with, we know how to do a change of basis, and changing to the basis of eigenvectors is precisely what’s going when we write A=P^{-1}DP, where D is a diagonal matrix. We construct P by taking its columns to be these eigenvectors. In particular, for a given vector x, y=Px is the vector giving the coefficients of x in the basis of eigenvectors.

So the case where we have a basis of eigenvectors is particularly useful, and in this case, we say the matrix or the map is diagonalisable. Remember how we find eigenvalues. If there exists a non-zero vector x satisfying Ax=\lambda x, then x is in the kernel of A-\lambda I. As we discussed last time, introducing the determinant gives a much more manageable way to verify which values of \lambda result in A-\lambda I having a non-trivial kernel. In particular, if non-zero x is in the kernel, we have \mathrm{det}(A-\lambda I)=0, and this leads to a polynomial of degree n (the dimensional of the vector space / size of the matrix) for \lambda, called the characteristic polynomial \chi_A(z), which has the eigenvalues as its roots.

If we agree to work over the complex field, then this is good, because it means we always have eigenvalues, and so it becomes sensible to talk about exactly how many eigenvalues and eigenvectors we have. Observe that if we restrict to real vector spaces, this might not be the case. In the plane, the rotation by \pi/2 for example has no fixed vectors.

Multiplicities of eigenvalues

We call the algebraic multiplicity \alpha(\lambda) of an eigenvalue \lambda to be the exponent of the factor (z-\lambda) in the factorisation of the characteristic polynomial. To define the geometric multiplicity, observe that all the eigenvectors with eigenvalue \lambda form a subspace, and so it is meaningful to talk about the dimension of this subspace (‘eigenspace’), which is the geometric multiplicity \gamma(\lambda). There are two facts that one needs to remember. The slightly less obvious one is that \gamma(\lambda)\le \alpha(\lambda) for all \lambda. One can see this by, for example, working in a basis that extends a basis of the \lambda-eigenspace. Observe at this stage that the sum of the algebraic multiplicities has to be n by definition, while the sum of geometric multiplicities is at most n. And this makes sense, because the space spanned by all the eigenvectors is a subspace, and so has dimension at most n.

The more obvious, but more frequently forgotten result is that

\alpha(\lambda)\ge 1 \quad \iff \quad \gamma(\lambda)\ge 1,

which is simply a consequence of the property discussed a few paragraphs previously concerning the kernel of A-\lambda I.

In particular, we might make the heuristic observation that ‘most’ polynomials of degree n have n distinct roots. This is certainly true for quadratics: there is only one value that the discriminant can take such that we see a repeated root. Alternatively, imagine shifting the quadratic up and down (in a complex way if necessary); again there is only one moment at which it might have a repeated root. This observation can be generalised easily to higher degree polynomials in a number of ways.

So if we lift this observation across to matrices, we see that most matrices have n distinct eigenvalues, and thus have n linearly independent eigenvectors which form a basis, hence the matrix is diagonalisable. I think it’s really worth reflecting on this, since much of a first exploration into linear algebra ends up treating exactly the case where the matrix is not diagonalisable.

The principal example of a non-diagonalisable matrix is \begin{pmatrix}2&1\\0&2\end{pmatrix}, where the 2s can be replaced by any value, and the 1 can be replaced by an non-zero value. There’s plenty to learn about to what extent versions of this matrix of higher size represent all non-diagonalisable matrices, but such an exposition of Jordan normal form comes next year for the students taking this course.

It probably is worth saying now though, that this example gives a good sanity check for whether a method is actually using diagonalisability correctly. For example, it is easily seen that elementary row operations to not preserve diagonalisability by starting from \begin{pmatrix}2&0\\0&2\end{pmatrix} and ending up at our counter-example. One could also argue from this that the set of non-diagonalisable matrices are dense within the set of matrices with a repeated eigenvalue. That is, having a repeated eigenvalue but full eigenspace is doubly-infinitely-unlikely.

Cayley-Hamilton theorem

Anyway, among other results, we also saw the Cayley-Hamilton theorem, which states that a matrix A satisfies its own characteristic equation. That is \chi_A(A)=0, where the zero on the right-hand side is the zero matrix. It’s tempting to substitute A into the expression \mathrm{det}(A-\lambda I), but of course this is not valid. Indeed imagine a typical eigenvalue determinant matrix with terms like (7-\lambda) on the diagonal; it doesn’t make sense to substitute a matrix for \lambda as one of the entries of the overall matrix!

Fortunately, we can argue convincingly in the case where A is a diagonalisable matrix. Remember that \chi_A(A) is a matrix. Now looki at the action of \chi_A(A) on any eigenvector v, corresponding to eigenvalue \lambda. Applying some power of A to v gives v multiplied by the same power of \lambda, and so we end up with

\chi_A(A)v = \chi_A(\lambda)v = 0.

This only worked when v was an eigenvector, but fortunately there is a basis of eigenvectors if A is diagonalisable, and so \chi_A(A)v=0 for all v, hence \chi_A(A)=0.

But \chi_A(A) is just a matrix-valued function of A. If you think about it, \chi_A is a monic polynomial, all of whose non-leading coefficients are multinomials of degree at most n-1 in the entries of A. Furthermore, these multinomials have (non-negative) integer coefficients. Therefore the entries of \chi_A(A) are multinomials of degree at most 2n-1 in the entries of A, and again have (non-negative) integer coefficients.

Even without the integrality of the coefficients, this says that, under any reasonable definition of continuity of matrices (which could be induced from any topology on \mathbb{R}^{n\times n}) the function \chi_A(A) should be continuous as a function of A. But we’ve shown \chi_A(A)=0 for all diagonalisable A, and also argued that most complex-valued matrices are diagonalisable. Turning this into a formal statement about denseness means that we’ve shown the Cayley-Hamilton theorem for non-diagonalisable matrices also. It feels that because the coefficients are non-negative integers, we might also have shown the result for other fields too, but I have minimal knowledge or recollection at the moment of the things one has to check for this sort of result.

It’s worth ending with the brief comment that Cayley-Hamilton is useful, among other reasons because it enables us to write the inverse of A as a polynomial of degree at most n-1 in terms of A. In many settings this is a lot easier to work with in terms of calculations than an argument with minors.

Perturbation of Eigenvectors

In the previous post, I talked about eigenvalues, and some alternative characterisations which could be useful in some circumstances. Recently, I’ve been interested in controlling how eigenvalues and eigenvectors change as the matrix is varied. My particular example concerns positive matrices, which have a well-defined largest eigenvalue (or Perron root), and a unique (up to normalising in some way) principal eigenvector.

We might expect that perturbing a matrix slightly does not change the eigenvectors very much, since any original eigenvector is still almost an eigenvector, in the sense that its image under the action of the perturbed matrix is almost equal to a multiple of itself. But how to make this precise? And when does it go wrong?

Eigenvalues – The non-multiple case

Throughout, we assume we have a k x k matrix. We might want to allow the entries to be complex, but for now, real entries are perfectly interesting enough.

It makes sense to start with eigenvalues, since it’s easy to define these through the characteristic equation of the matrix. The coefficients of this polynomial are well-behaved (indeed polynomial) functions of the entries of the matrix. So we are really asking how the set of roots of a finite polynomial evolves as the (k+1) coefficients of the polynomial evolve. It is fairly clear that, under any sensible choice of topology on the space of k-(multi)-subsets of \mathbb{C}, the multiset of roots is continuous in the coefficients of the polynomial.

To say anything more precise, we have to introduce some notation.

Let \chi_{A}(z)=z^k+\gamma_{k-1}(A)z^{k-1}+\ldots+\gamma_1(A)z+\gamma_0(A) be the characteristic polynomial of A. Each \gamma_i is a polynomial of degree k-i in the entries of A. Let’s consider now a matrix-valued function A(t), and we assume that the entries of A(t) are all differentiable with respect to t. So each \gamma_i(A(t)) is also differentiable with respect to t.

At this point, let’s make the assumption that t lies in some interval [r,s] for which the eigenvalues of A(t) are distinct. Let \lambda(t) be some eigenvalue of A(t), chosen such that \lambda is a continuous function of t. For example, we might take \lambda(t)=\Lambda_1(t), the eigenvalue with largest absolute value (with some canonical tie-breaking mechanism). Then \chi_{A(t)}(\lambda(t))=0, and so differentiating with respect to \gamma_i:

0=\chi'_{A(t)}(\lambda(t)) \frac{\partial \lambda}{\partial \gamma_i} \Big|_{A(t)} + \lambda(t)^i.

Because we deliberately demanded that the eigenvalues were disjoint, we have \chi'_{A(t)}(\lambda(t))\ne 0, and so \frac{\partial \lambda(t)}{\partial \gamma_i}=-\lambda(t)^i / \chi'_{A(t)}(\lambda(t)). In particular, \lambda(t) is differentiable with respect to the coefficients of the characteristic polynomial, and thus with respect to t also.

Multiple Eigenvalues

It gets more complicated when the characteristic equation has multiple roots. Typically we will be interested in the evolution of the eigenvalue with some extremal property, probably the largest one. Let’s restrict to the real, symmetric case, where the set of eigenvalues is complete and real. Suppose we have t_0 such that A(t_0) has a repeated eigenvalue. Then, in a small enough region of t_0, we can define eigenvalues \lambda(t),\mu(t) continuously such that \lambda(t_0)=\mu(t_0) while \lambda(t)\ne \mu(t) for t=\ne t_0. Then, if the entries of A(t) are analytic functions of t, then so are \lambda(t),\mu(t).

But then \max(\lambda(t),\mu(t)) will in general not be analytic, as the maximum of two smooth functions is in general Lipschitz.

This effect is most obvious in the case of a diagonal matrix A(t)=\begin{pmatrix}t&0\\0&-t\end{pmatrix}, for which the largest eigenvalue is |t|.

Eigenvectors

When the matrix A is real and symmetric, we know it has real eigenvalues, and an orthogonal basis of eigenvectors. Then the Rayleigh quotient characterises the eigenvector as well as the eigenvalue. Recall that for any x\in\mathbb{R}^k with ||x||_2=1, we have

\lambda_1\ge x^T A x \ge \lambda_k,

with equality precisely at the respective eigenvectors. So if we perturb A slightly, keeping it real and symmetric, we can control the principal eigenvector quite well by this method.

If A is not diagonalisable, we can still say something about this principal eigenvector, via large powers of A, sometimes called the Van Mises iteration. This says that for large N, A^N v should have direction close to that of the eigenvector, for any test vector v. The rate of convergence depends on the ratio of the largest eigenvalue to the second largest eigenvalue, though if the matrix is not diagonalisable, it is not completely trivial to quantify this convergence. We have to be careful though, since A maps the subspace orthogonal to the eigenvector to itself, so the magnitude of the projection of v onto the eigenvector determines the speed of convergence. Indeed, if v is orthogonal to the eigenvector, it won’t converge towards the principal eigenvector at all. (But if there is a well-defined ‘second eigenvector’ then it will converge towards that.)

Continuity of Eigenvectors

The reason why I ended up reading about some of these topics was that I wanted to show that the Perron eigenvector of a positive matrix (that is, the unique eigenvector corresponding to the Perron root) was Lipschitz continuous as a function of the entries of the positive matrix. Since for such a matrix, the largest eigenvalue is simple, we are able to make some progress.

In general, the condition that v is an eigenvector of matrix A with eigenvalue \lambda is described by the relation:

(A-\lambda I)v=0,\quad ||u||_1=1, (*)

or whatever the most appropriate normalising condition appears. This describes an implicit relation between A and the eigenvalue-eigenvector pair (\lambda,v). So given a matrix A_0 with eigenvalue \lambda_0 corresponding to eigenvector v_0, in a neighbourhood of A_0 in \mathbb{R}^{k\times k} we can use the implicit function theorem to comment on the differentiability of (\lambda,v) with respect to A in this neighbourhood.

Precisely, we require the matrix of partial derivatives from (*)

\begin{pmatrix} A_0-\lambda_0 I&v_0 \\ \mathbf{1}&0\end{pmatrix}

to have non-zero determinant. But if \lambda_0 is not simple, then if we apply this matrix from the left to one of the other eigenvectors (with a zero appended) we can see that it has non-trivial kernel. With a bit more work, we can show the converse too, and conclude that (\lambda,v) are smooth with respect to A in some neighbourhood of A_0.

Finally, we observe that when the eigenvalues are not simple, we can’t even guarantee continuity of the eigenvectors. This is unsurprising really, since for a multiple eigenvalue, a) we might not know how many LI eigenvectors exists; and b) we might have complete freedom over the choice of eigenvectors. Think about the identity matrix! Indeed the eigenvectors of \begin{pmatrix}1+\epsilon &0\\ 0&1+\epsilon\end{pmatrix} are (1,0) and (0,1), while the eigenvectors of \begin{pmatrix}1&\epsilon\\ \epsilon&1\end{pmatrix} are (1,1), (1,-1). So no continuous choice of eigenvectors is possible here.