CS3220 Lecture Notes: Singular Value Decomposition and Applications
1 From QR to SVD
We’ve just been looking at an orthogonal matrix factorization of an m×n matrix
A that gives us an orthogonal factor on the left:
A = QR
In the full-rank case the row space of a tall matrix or the column space of
a wide matrix are uninteresting, because either the rows (of a tall matrix) or
the columns (of a wide matrix) span their whole space. So we only ever need
to know about either the rows or the columns, and we can pick one of these
two factorizations. If the matrix is not full rank (it is rank deficient), then both
the row space and the column space are interesting (meaning that they are less
than the full space ℝ^m or ℝ^n), but we will still be stuck choosing one or the
other.
The next factorization we’ll look at, the SVD, has orthogonal factors on both
sides, and it works fine in the presence of rank deficiency:
A = U ΣV T (or AV = U Σ)
The SVD writes A as a product of two orthogonal transformations with a di-
agonal matrix (a scaling operation) in between. It says that we can replace
any transformation by a rotation from “input” coordinates into convenient co-
ordinates, followed by a simple scaling operation, followed by a rotation into
“output” coordinates. Furthermore, the diagonal scaling Σ comes out with its
elements sorted in decreasing order.
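To make this concrete, here is a minimal NumPy sketch (the matrix A is just an arbitrary example) that computes the factorization and checks that A = UΣV^T, that U and V are orthogonal, and that the singular values come out sorted:

import numpy as np

# An arbitrary example matrix; any m x n matrix works here.
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# full_matrices=True gives the full factorization: U is m x m, Vt holds V^T (n x n),
# and s holds the singular values in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the m x n diagonal matrix Sigma and check A = U Sigma V^T.
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)

print(np.allclose(A, U @ Sigma @ Vt))      # True: A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(3)),
      np.allclose(Vt @ Vt.T, np.eye(2)))   # True True: orthogonal factors
print(np.all(np.diff(s) <= 0))             # True: sigma_1 >= sigma_2 >= ...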
In two dimensions, A maps the unit circle to an ellipse. The left singular vectors u1, u2 are along the major and
minor axes of that ellipse (being on the left they live in the “output” space).
The right singular vectors v1, v2 are the vectors that get mapped to the major
and minor axes (being on the right they live in the “input” space).
If we break the transformation down into these three stages we see a circle
being rotated to align the vs with the coordinate axes, then scaled along those
axes, then rotated to align the ellipse with the us.
Another way to say some of this is that Avi = σi ui, which you can see in
this 2D example:
\[
V^T v_1 = \begin{bmatrix} v_1^T \\ v_2^T \end{bmatrix} v_1
        = \begin{bmatrix} 1 \\ 0 \end{bmatrix};
\qquad
U\Sigma \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \sigma_1 u_1
\]
We can easily promote this idea to 3D, though it becomes harder to draw:
\[
A = \begin{bmatrix} u_1 & u_2 & u_3 \end{bmatrix}
    \begin{bmatrix} a & & \\ & b & \\ & & 0 \end{bmatrix}
    \begin{bmatrix} v_1^T \\ v_2^T \\ v_3^T \end{bmatrix}
\]
This means one of the singular values (the last one, since we sort them in
decreasing order) is zero: the unit sphere gets mapped to a flat ellipse lying in
a plane, and the last left singular vector u3 is the normal to that ellipse.
A rank-deficient matrix is also one that has a nontrivial null space: some
direction that gets mapped to zero. In this case, that vector is v3 , since
\[
V^T v_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}
\qquad\text{and}\qquad
\Sigma \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}
     = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}.
\]
More generally, suppose A has rank r, and partition U = [U1 U2] and V = [V1 V2]
so that U1 and V1 hold the first r columns. Then:
• U1 is a basis for ran(A), aka span(A), aka the column space of A (just like
in QR) (dim = r). After multiplication by Σ only the first r entries can
be nonzero, so only vectors in span(U1) can be produced by multiplication
with A.
• U2 is a basis for ran(A)⊥ , aka null(AT ) (dim = m − r). Since all the
entries of ΣV T x corresponding to those columns of U are zero, no vectors
in span(U2 ) can be generated by multiplication with A.
• V2 is a basis for null(A), aka ran(A)⊥ (dim = n − r). Since any vector
in span(V2 ) will end up multiplying the zero singular values, it will get
mapped to zero.
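Numerically these bases can be read off from the SVD by splitting U and V at the rank r. Here is a small sketch; the example matrix is deliberately rank-deficient (its third column is the sum of the first two):

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0],
              [7.0, 8.0, 15.0]])   # rank 2: column 3 = column 1 + column 2

U, s, Vt = np.linalg.svd(A)
tol = 1e-10 * s[0]                 # generous cutoff; the "true" third singular value is 0
r = int(np.sum(s > tol))           # numerical rank, here r = 2

U1, U2 = U[:, :r], U[:, r:]        # bases for ran(A) and ran(A)-perp = null(A^T)
V1, V2 = Vt[:r, :].T, Vt[r:, :].T  # bases for ran(A^T) and null(A)

print(r)                           # 2
print(np.allclose(A @ V2, 0))      # True: V2 spans the null space of A
print(np.allclose(U2.T @ A, 0))    # True: nothing A produces has a U2 component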
Throwing away U2 and V2 (and the corresponding zero block of Σ) leaves the reduced factorization A = U1 Σ1 V1^T. This is exactly analogous to the “skinny” QR factorization.
For a square, full-rank system Ax = b, the SVD gives the solution directly:
\[
x = V\Sigma^{-1}U^T b.
\]
For the least squares problem min_x ‖Ax − b‖, the orthogonality of U lets us write
\[
\|Ax - b\|^2 = \|U^T(Ax - b)\|^2
  = \left\| \begin{bmatrix} \Sigma_1 & 0 \end{bmatrix} V^T x - U_1^T b \right\|^2
    + \|U_2^T b\|^2.
\]
The first term can be made zero; the second is the residual. But because of rank
deficiency, making the first term zero is an underdetermined problem: the matrix
[Σ1 0]V^T is wider than tall. If we let y = V^T x and c = U1^T b, then split y into
[y1; y2], the system to be solved is
\[
\begin{bmatrix} \Sigma_1 & 0 \end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = c,
\qquad\text{i.e.,}\qquad
\Sigma_1 y_1 = c.
\]
Since y2 does not change the answer we’ll go for the minimum-norm solution
and set y2 = 0. Then x = V1 y1 .
So the solution process boils down to just:
1. Compute SVD.
2. c = U1^T b.
3. y1 = Σ1^{-1} c.
4. x = V1 y1.
Or, more briefly,
\[
x = V_1 \Sigma_1^{-1} U_1^T b. \tag{1}
\]
This is just like the full-rank square solution, but with U2 and V2 left out.
I can also explain this process in words:
1. Transform b into the U basis.
2. Throw out components we can’t produce anyway, since there’s no point
trying to match them.
3. Undo the scaling effect of A.
4. Fill in zeros for the components that A doesn’t care about, since it doesn’t
matter what they are.
5. Transform out of the V basis to obtain x.
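Here is a short NumPy sketch of this recipe (the matrix, right-hand side, and rank cutoff are arbitrary choices for illustration), checked against np.linalg.lstsq, which also returns the minimum-norm solution for rank-deficient problems:

import numpy as np

def svd_solve(A, b, cutoff=1e-12):
    """Minimum-norm least-squares solution x = V1 Sigma1^{-1} U1^T b."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # 1. compute the SVD
    r = int(np.sum(s > cutoff * s[0]))                #    (decide on a rank)
    c = U[:, :r].T @ b                                # 2. c = U1^T b
    y1 = c / s[:r]                                    # 3. y1 = Sigma1^{-1} c
    return Vt[:r, :].T @ y1                           # 4. x = V1 y1  (y2 = 0)

# Rank-deficient example: third column = first + second.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 0.0, 2.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])

x = svd_solve(A, b)
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x, x_ref))       # True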
3.3 Pseudoinverse
If A is square and full-rank it has an inverse. Its SVD has a square Σ that has
nonzero entries all the way down the diagonal. We can invert Σ easily by taking
the reciprocals of these diagonal entries:
\[
\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n), \qquad
\Sigma^{-1} = \operatorname{diag}(1/\sigma_1, 1/\sigma_2, \ldots, 1/\sigma_n),
\]
so that
\[
A^{-1} = (U\Sigma V^T)^{-1} = V\Sigma^{-1}U^T. \tag{2}
\]
So the SVD of A−1 is closely related to the SVD of A. We simply have swapped
the roles of U and V and inverted the scale factors in Σ. Earlier we interpreted
U and V as bases for the “output” and “input” spaces of A, so it makes sense
that they are exchanged in the inverse. A nice thing about this way of getting
the inverse is that it’s easy to see before we start whether and where things will
go badly: the only thing that can go wrong is for the singular values to be zero
or nearly zero. In our other procedures for the inverse we just plowed ahead and
waited to see if things would blow up. (Of course, don’t forget that computing
the SVD is many times more expensive than a procedure like LU or QR.)
If A is non-square and/or rank deficient, it no longer has an inverse, but we
can go ahead and generalize the SVD-based inverse, modeling it on the rank-
deficient least squares process we just saw. Looking back at (1), we can see that
we are treating the matrix
\[
A^+ = V_1 \Sigma_1^{-1} U_1^T \tag{3}
\]
like an inverse for A, in the sense that we solve the problem Ax ≈ b by writing
x = A+ b. It is also just like the formula in (2) for A−1 except that we have kept
only the first r rows and/or columns of each matrix.
The matrix A+ is known as the pseudoinverse of A. Multiplying a vector by
the pseudoinverse solves a least-squares problem and gives you the minimum-
norm solution. Note that writing down a pseudoinverse for a matrix always
involves deciding on a rank for the matrix first.
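For example, the truncated formula (3) can be coded up directly and compared against np.linalg.pinv, which also computes the pseudoinverse from the SVD (the example matrix below is arbitrary):

import numpy as np

def pseudoinverse(A, cutoff=1e-12):
    """A^+ = V1 Sigma1^{-1} U1^T, keeping only singular values above a cutoff."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > cutoff * s[0]))    # deciding on a rank is part of the definition
    return Vt[:r, :].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 5.0]])               # tall; this particular example is full rank

print(np.allclose(pseudoinverse(A), np.linalg.pinv(A)))   # True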
There’s another way in which the pseudoinverse is like an inverse: it is the
matrix that comes closest to acting like an inverse, in the sense that multiplying
A by A+ almost produces the identity:
\[
A^+ = \arg\min_{X \in \mathbb{R}^{n\times m}} \|AX - I_m\|_F
    = \arg\min_{X \in \mathbb{R}^{n\times m}} \|XA - I_n\|_F .
\]
To see this, note that the Frobenius norm is unchanged by orthogonal transformations,
and change variables to X′ = V^T X U (so that X = V X′ U^T). Then
\[
AX - I = (U\Sigma V^T)(V X' U^T) - I
\]
\[
U^T(AX - I)U = \Sigma X' - I,
\]
so ‖AX − I‖_F = ‖ΣX′ − I‖_F. The best we can do is to choose X′ = Σ+, the n × m
diagonal matrix with entries 1/σ1, . . . , 1/σr on the diagonal and zeros elsewhere,
so that
\[
\Sigma\Sigma^+ = \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix}.
\]
Now that we know the optimal X′ we can compute the corresponding X, which
is A+, and you can easily verify that A+ comes out to the expression in (3).
Here are some facts about A+ for special cases:
• A full rank, square: A+ = A^{-1} (as it has to be if it’s going to be the closest
thing to an inverse, when the inverse in fact exists!)
• A full rank, tall: A+ A = I_n but AA+ ≠ I_m. By the normal equations
we learned earlier, A+ = (A^T A)^{-1} A^T, since both matrices compute the
(unique) solution to a full-rank least squares problem.
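These facts are easy to check numerically; a quick sketch with an arbitrary tall, full-rank example:

import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])     # tall, full column rank

Ap = np.linalg.pinv(A)

print(np.allclose(Ap @ A, np.eye(2)))                    # True:  A^+ A  = I_n
print(np.allclose(A @ Ap, np.eye(3)))                    # False: A A^+ != I_m
print(np.allclose(Ap, np.linalg.solve(A.T @ A, A.T)))    # True:  A^+ = (A^T A)^{-1} A^T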
3.4 Sensitivity and conditioning
How do errors in the data propagate into the least squares solution? Recall the
square, full-rank case first: a perturbation ∆b in the right-hand side produces a
perturbation ∆x in the solution, with
\[
\Delta x = A^{-1}\Delta b, \qquad \|\Delta x\| \le \|A^{-1}\|\,\|\Delta b\|.
\]
Likewise,
\[
b = Ax, \qquad \|b\| \le \|A\|\,\|x\|, \qquad \|x\| \ge \|b\|/\|A\|,
\]
so dividing gives ‖∆x‖/‖x‖ ≤ ‖A‖‖A^{-1}‖ · ‖∆b‖/‖b‖ = cond(A) ‖∆b‖/‖b‖.
In the (full rank) least-squares case, x = A+ b and ∆x = A+ ∆b, but Ax ≠ b.
From the SVD of A+ it’s clear that ‖A+‖ = 1/σn, and we’ll define cond(A) =
σ1/σn = ‖A‖‖A+‖. We can then follow roughly the same argument to get
\[
\Delta x = A^{+}\Delta b, \qquad \|\Delta x\| \le \|A^{+}\|\,\|\Delta b\|.
\]
Likewise,
\[
\|Ax\| \le \|A\|\,\|x\|, \qquad \|x\| \ge \|Ax\|/\|A\|.
\]
So the condition number of the least squares problem is cond(A) times the ratio
‖b‖/‖Ax‖. This ratio is often written 1/cos θ, where θ is the angle between the
vector b and the range of A. If b ∈ ran(A) then cos θ = 1 and the conditioning
is the same as for the equality system. As b approaches being orthogonal to ran(A),
cos θ decreases toward zero and the condition number of the least squares problem
becomes arbitrarily large. This is not too surprising: in this case the output of
the problem (x) becomes small, but the propagated error depends only on ‖A+‖,
which doesn’t depend on b, so the relative error can be very large.
On the other hand, it takes a special b to have cos θ very near zero. The
sensitivity of the solution to errors in b is similar to the equality case for “non-
special” values of b.
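Here is a small numerical experiment illustrating this (the matrix and vectors are arbitrary examples): the same tiny perturbation ∆b produces a vastly larger relative change in x when b has only a tiny component in ran(A):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))        # random tall matrix, (almost surely) full rank
Ap = np.linalg.pinv(A)

def rel_change(b, db):
    """Relative change in the least-squares solution caused by perturbing b by db."""
    x = Ap @ b
    return np.linalg.norm(Ap @ db) / np.linalg.norm(x)

b_in = A @ np.array([1.0, 1.0, 1.0])    # b exactly in ran(A): cos(theta) = 1
q = b_in / np.linalg.norm(b_in)

b_out = rng.standard_normal(20)
b_out -= A @ (Ap @ b_out)               # remove the ran(A) component entirely ...
b_out += 1e-6 * q                       # ... then put back only a tiny in-range part

db = 1e-8 * rng.standard_normal(20)     # same perturbation in both experiments
print(rel_change(b_in, db))             # tiny relative change
print(rel_change(b_out, db))            # many orders of magnitude larger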
Let’s examine the effects of errors in A. In the equality case we saw that
these errors behave the same as errors in b. This time we change A to A + E
and ask how we have to change x to maintain equality:
(A + E)(x + ∆x) = b
Ax + Ex + A∆x + E∆x = b
A∆x ≈ −Ex
In this last step we have discarded E∆x on the basis that it is second-order
in terms of the errors and is therefore negligible for small errors relative to the
other terms, and we have canceled Ax with b. Carrying on and applying the
usual matrix-norm bounds:
\[
\Delta x \approx -A^{-1} E x, \qquad
\|\Delta x\| \le \|A^{-1}\|\,\|E\|\,\|x\|, \qquad
\frac{\|\Delta x\|}{\|x\|} \le \|A^{-1}\|\,\|E\| = \mathrm{cond}(A)\,\frac{\|E\|}{\|A\|}.
\]
So relative errors in A propagate to relative errors in x the same way as errors
in b do.
Now let’s look at this in the least-squares case. For this we will use the
normal equations, and we’ll need a fact about the matrix AT A:
\[
A^T A = V\Sigma^T U^T U\,\Sigma V^T = V(\Sigma^T\Sigma)V^T
\]
(where U, Σ, V is the SVD of A). So its singular values are the squares of A’s
singular values. This means ‖A^T A‖ = σ1² = ‖A‖², ‖(A^T A)^{-1}‖ = 1/σn² =
‖A+‖², and therefore cond(A^T A) = σ1²/σn² = cond(A)².
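You can see the squaring numerically (the random matrix below is an arbitrary example):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))

print(np.linalg.cond(A))          # cond(A) = sigma_1 / sigma_n
print(np.linalg.cond(A.T @ A))    # matches cond(A)**2 up to roundoff
print(np.linalg.cond(A) ** 2)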
Now, following the approach used for equality systems, we ask about the
effect of changing A to A + E, using the normal equations to get an equation
for its effect on ∆x:
\[
(A + E)^T(A + E)(x + \Delta x) = (A + E)^T b
\]
\[
A^T A\,x + A^T A\,\Delta x + A^T E\,x + E^T A\,x \approx A^T b + E^T b
\]
In the second line we have discarded four terms that are second or third order
in the errors (e.g. A^T E ∆x). Canceling A^T A x with A^T b we get
\[
A^T A\,\Delta x \approx -A^T E\,x - E^T A\,x + E^T b = -A^T E\,x + E^T r,
\]
where r = b − Ax is the residual, so
\[
\Delta x \approx -(A^T A)^{-1} A^T E\,x + (A^T A)^{-1} E^T r,
\qquad
\frac{\|\Delta x\|}{\|x\|} \lesssim
\left(\mathrm{cond}(A) + \mathrm{cond}(A)^2\,\frac{\|r\|}{\|Ax\|}\right)\frac{\|E\|}{\|A\|},
\]
so we see in this case that the condition number for error in x propagated from
error in E is the sum of two pieces: the familiar condition number of A, and a
second term proportional to the square of that condition number and the ratio
‖r‖/‖Ax‖. This ratio is known as tan θ since it is the tangent of the angle
between b and the range of A. This one is more troubling because it will be
large except for “special” values of b that are very near ran(A).
3.5 Numerical rank
Sometimes we have to deal with rank-deficient matrices that are corrupted by
noise. In fact we always have to do this, because when you take some rank-
deficient matrix and represent it in floating point numbers, you’ll usually end
up with a full-rank matrix! (This is just like representing a point on a plane
in 3D: it’s only if we are lucky that a point with floating-point coordinates happens
to lie exactly on the plane; usually it has to be approximated by one near the plane
instead.)
When our matrix comes from some kind of measurement that has uncertainty
associated with it, it will be quite far from rank-deficient even if the underlying
“true” matrix is rank-deficient.
SVD is a good tool for dealing with numerical rank issues because it answers
the question:
Given a rank-k matrix, how far is it from the nearest rank-(k − 1)
matrix?
That is, how rank-k is it? Or: how precisely do we have to know the matrix in
order to be sure that it’s not actually rank-(k − 1)?
SVD lets us get at these questions because, since rank and distances are
unaffected by the orthogonal transformations U and V , numerical rank questions
about A are reduced to questions about the diagonal matrix Σ.
If Σ has k nonzero entries on its diagonal, we can make it rank-(k − 1) by zeroing
the k-th diagonal entry. The distance between the two matrices is equal to the
entry we set to zero, namely σk, in both the 2-norm and the Frobenius norm. It’s not
surprising (though I won’t prove it) that this is the closest rank-(k − 1) matrix.
If we have a rank-k matrix and perturb it by adding a noise matrix that has
norm less than ε, it can’t change the zero singular values by more than ε. This
means noise has a nicely definable effect on our ability to detect rank: if the
singular values are more than ε we know they are for real (they did not just
come from the noise), and if they are less than ε we don’t know whether they
came from the “real” matrix or from the noise.
We can roll these ideas up in a generalization of rank known as numeri-
cal rank: whereas rank(A) is the (exact) rank of the matrix A, we can write
rank(A, ε) for the highest rank we’re sure A has if it could be corrupted by noise
of norm ε. That is, rank(A, ε) is the highest r for which there is no matrix of
rank less than r within a distance ε of A (equivalently, the number of singular
values of A that are larger than ε).
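In code this just means counting singular values above ε; the little function below is a sketch (the name numerical_rank and the choice of ε are ours to make):

import numpy as np

def numerical_rank(A, eps):
    """rank(A, eps): the number of singular values strictly greater than eps."""
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > eps))

# A rank-1 matrix plus a little noise is technically full rank ...
rng = np.random.default_rng(2)
u, v = rng.standard_normal(6), rng.standard_normal(4)
A = np.outer(u, v) + 1e-8 * rng.standard_normal((6, 4))

print(np.linalg.matrix_rank(A))       # 4 with a machine-precision tolerance
print(numerical_rank(A, eps=1e-6))    # 1 once we allow for noise of norm ~1e-8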
[Aside: recall the outer product version of matrix multiply, from way back
at the beginning:
\[
C = AB, \qquad c_{ij} = \sum_k a_{ik} b_{kj}, \qquad C = \sum_k a_{:k}\, b_{k:}.
\]
Applied to the SVD this gives
\[
A = U\Sigma V^T = \sum_i \sigma_i\, u_i v_i^T.
\]
This is a sum of a bunch of rank-1 matrices (outer product matrices are always
rank 1), and the norms of these matrices are steadily decreasing, since the norm
of the i-th term is σi.
By truncating this sum and including only the first r terms, we wind up
with a matrix of rank r that approximates A. In fact, it is the best rank-r
approximation, in both the 2-norm and the Frobenius norm.
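A quick check of this (the matrix and the truncation level r = 3 are arbitrary): the 2-norm error equals the first discarded singular value, and the Frobenius-norm error is the root-sum-square of all the discarded ones:

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 3
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # keep only the first r terms of the sum

print(np.linalg.norm(A - A_r, 2), s[r])                           # equal: sigma_{r+1}
print(np.linalg.norm(A - A_r, 'fro'), np.sqrt(np.sum(s[r:]**2)))  # equal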
Interpreting this low-rank approximation in terms of column vectors leads
to a powerful and widely used statistical tool known as principal component
analysis. If we have a lot of vectors that we think are somehow related—they
came from some process that only generates certain kinds of vectors—then one
way to try to identify the structure in the data is to try and find a subspace that
all the vectors are close to. If the subspace fits the data well, then that means
the data are low rank: they don’t have as many degrees of freedom as there
are measurements for each data point. Then, with a basis for that subspace in
hand, we can talk about our data in terms of combinations of those basis vectors,
which is more efficient, especially if we can make do with a fairly small number
of vectors. It can make the data easier to understand by reducing the dimension
to where we can manage to think about it, or it can make computations more
efficient by allowing us to work with the (short) vectors of coefficients in place
of the (long) original vectors.
To be a little more specific, let’s say we have a bunch of m-vectors called
a1 , . . . , an . Stack them, columnwise, into a matrix A, and then compute the
SVD of A.
A = U ΣV T
The first n columns of U are the principal components of the data vectors
a1, . . . , an. The entries of Σ tell us the relative importance of the principal
components. The rows of V, scaled by the corresponding singular values, give the
coefficients of the data vectors in the principal components basis. Once we have
this representation of the data, we
can look at the singular values to see how many of the principal components
seem to be important. If we use only the first k principal components, leaving
out the remaining n − k components, we’ve chosen a k-dimensional subspace,
and the error of approximating our data vectors by their projections into that
subspace is limited by the magnitude of the singular values belonging to the
components we left off. This subspace is optimal in the 2-norm sense, and in
the F-norm sense.
One adjustment to this process, which sacrifices a bit of optimality in ex-
change for more meaningful principal components, is to first subtract the mean
of the ai s from each ai . Of course, you need to keep the mean vector around and
add it back in whenever you are approximating vectors from the components.
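Putting the recipe together, here is a small sketch on synthetic data (the data-generation step is only there to manufacture vectors that really do lie near a 2-dimensional subspace): subtract the mean, compute the SVD, keep the first k components, and reconstruct:

import numpy as np

rng = np.random.default_rng(4)
# 200 data vectors in R^10 that vary (almost) only in 2 directions.
directions = rng.standard_normal((10, 2))
coeffs = rng.standard_normal((2, 200))
A = directions @ coeffs + 0.01 * rng.standard_normal((10, 200))  # columns are data points

mean = A.mean(axis=1, keepdims=True)
Ac = A - mean                                  # subtract the mean of the a_i's

U, s, Vt = np.linalg.svd(Ac, full_matrices=False)
print(s[:4])                                   # first two dominate: data are nearly rank 2

k = 2
components = U[:, :k]                          # principal components
coords = components.T @ Ac                     # k-dimensional coordinates of each data point
approx = mean + components @ coords            # reconstruct, adding the mean back in
print(np.linalg.norm(A - approx) / np.linalg.norm(A))   # small relative error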
Sources
• Our textbook: Heath, Scientific Computing: An introductory survey, 2e.
Section 3.6.
• Golub and Van Loan, Matrix Computations, Third edition. Johns Hopkins
Univ. Press, 1996. Chapter 5.