
CS3220 Lecture Notes: Singular Value Decomposition and Applications


Steve Marschner
Cornell University
5–7 April 2010

This document is a transcription of the notes I have used to give CS322/3220


lectures on the SVD over the years, so it includes a bit more material than I’ve
covered this year. It covers the SVD and what it is, and then applications to
finding a matrix’s fundamental subspaces, solving rank-deficient least squares
problems, deciding matrix rank in the presence of noise, and in principal com-
ponent analysis.

1 From QR to SVD
We’ve just been looking at an orthogonal matrix factorization of an m×n matrix
A that gives us an orthogonal factor on the left:

A = QR

By drawing in that dotted line we separate Q into a basis for the range of


A (the left-hand part, called Q1 ) and a basis for the rest of IRm (the right-hand
part, called Q2 ). This Q factor tells us about the column space of A.
When we wanted to know about the row space of A instead, we factored AT ,
resulting in something that might be called the “LQ factorization” (except that
it isn’t):
    A^T = Q R    or    A = R^T Q^T

In the full-rank case the row space of a tall matrix or the column space of
a wide matrix are uninteresting, because either the rows (of a tall matrix) or
the columns (of a wide matrix) span their whole space. So we only ever need
to know about either the rows or the columns, and we can pick one of these
two factorizations. If the matrix is not full rank (it is rank deficient), then both
the row space and the column space are interesting (meaning that they are less
than the full space IRm or IRn ), but we will still be stuck choosing one or the
other.
The next factorization we’ll look at, the SVD, has orthogonal factors on both
sides, and it works fine in the presence of rank deficiency:

    A = U Σ V^T    (or A V = U Σ)
The SVD writes A as a product of two orthogonal transformations with a di-
agonal matrix (a scaling operation) in between. It says that we can replace
any transformation by a rotation1 from “input” coordinates into convenient co-
ordinates, followed by a simple scaling operation, followed by a rotation into
“output” coordinates. Furthermore, the diagonal scaling Σ comes out with its
elements sorted in decreasing order.
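As a concrete illustration (this code sketch is not part of the original notes
and assumes NumPy is available), here is the factorization computed numerically;
np.linalg.svd returns the singular values already sorted in decreasing order.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 3))

    U, s, Vt = np.linalg.svd(A)          # full SVD: U is 5x5, s has 3 entries, Vt is 3x3
    Sigma = np.zeros((5, 3))
    Sigma[:3, :3] = np.diag(s)           # embed the singular values in a 5x3 Sigma

    print(s)                                        # sorted in decreasing order
    print(np.allclose(A, U @ Sigma @ Vt))           # True: A = U Sigma V^T
    print(np.allclose(U.T @ U, np.eye(5)),          # U and V are orthogonal
          np.allclose(Vt @ Vt.T, np.eye(3)))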

2 SVD definitions and interpretation


The pieces of the SVD have names following the “singular” theme. The columns
of U are the left singular vectors ui ; the entries on the diagonal of Σ are the
singular values; and the columns of V (which are the rows of V^T ) are (you
guessed it) the right singular vectors vi .
The SVD has a nice, simple geometric interpretation (see also Todd Will’s
SVD tutorial linked from the Readings page, which has a similar take). It’s
easiest to draw in 2D.

    Let U = [ u1  u2 ]  and  V^T = [ v1^T ; v2^T ].

If we take the unit circle and transform it by A, we get an ellipse (because A


is a linear transformation). The left singular vectors u1 , u2 are the major and
minor axes of that ellipse (being on the left they live in the “output” space).
The right singular vectors v1 , v2 are the vectors that get mapped to the major
and minor axes (being on the right they live in the “input” space).

(Footnote 1: Really, an orthogonal transformation, which is a rotation plus an
optional reflection. But rotation is the right picture to have in your mind.)
If we break the transformation down into these three stages we see a circle
being rotated to align the vs with the coordinate axes, then scaled along those
axes, then rotated to align the ellipse with the us:

Another way to say some of this is that A vi = σi ui , which you can see in
this 2D example:

    V^T v1 = [ v1^T ; v2^T ] v1 = (1, 0)^T    and    U Σ (1, 0)^T = σ1 u1
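A quick numerical check of A vi = σi ui (again an illustrative sketch, not from
the original notes), using the rows of V^T returned by NumPy as the right
singular vectors:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((2, 2))
    U, s, Vt = np.linalg.svd(A)

    for i in range(2):
        print(np.allclose(A @ Vt[i], s[i] * U[:, i]))   # True for i = 0 and 1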
We can easily promote this idea to 3D, though it becomes harder to draw:

Let’s look at what happens when we have a singular (aka. rank-deficient)


matrix, in this 3 × 3 setting. Recall that the rank of a matrix is the dimension
of the span of the columns (or rows—they are always the same): if the third
column is in the span of the first two, it’s a rank-2 matrix; if the second and
third columns are both in the span of the first one (that is, all three are parallel)
then it is a rank-1 matrix.
A rank-deficient matrix is one whose range is a subspace of IR3 , not all of
3
IR , so it maps the sphere to a flat ellipse (in the rank-2 case) rather than an
ellipsoid:

    A = [ u1  u2  u3 ] diag(a, b, 0) [ v1^T ; v2^T ; v3^T ]

This means one of the singular values (the last one, since we sort them in
decreasing order) is zero. The last left singular vector is the normal to that
ellipse.
A rank-deficient matrix is also one that has a nontrivial null space: some
direction that gets mapped to zero. In this case, that vector is v3 , since
     
    V^T v3 = (0, 0, 1)^T    and    Σ (0, 0, 1)^T = (0, 0, 0)^T .
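To see this numerically (an illustrative sketch; the example matrix is made up),
build a 3 x 3 matrix whose third column is a combination of the first two and
look at its singular values and at A v3:

    import numpy as np

    B = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    A = np.column_stack([B, B[:, 0] + B[:, 1]])    # third column in the span of the first two

    U, s, Vt = np.linalg.svd(A)
    print(s)                            # last singular value is ~1e-16, i.e. zero
    print(np.linalg.norm(A @ Vt[2]))    # ~0: v3 spans the null space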

More generally, the SVD of a rank-r matrix looks like this:

In this picture, r < n < m. So the matrix is rank-deficient (r < n) with an


r-dimensional range and an (n − r)-dimensional null space.

3 Applications of the SVD


3.1 Fundamental subspaces of a matrix
From our rectangles-and-squares picture of the SVD, we can read off the four
fundamental spaces:

• U1 is a basis for ran(A), aka span(A), aka the column space of A (just like
in QR) (dim = r). After multiplication by Σ only the first r entries can
be nonzero, so only vectors in span(U1 ) can be produced by multiplication
with A.
• U2 is a basis for ran(A)⊥ , aka null(AT ) (dim = m − r). Since all the
entries of ΣV T x corresponding to those columns of U are zero, no vectors
in span(U2 ) can be generated by multiplication with A.
• V2 is a basis for null(A), aka ran(AT )⊥ (dim = n − r). Since any vector
in span(V2 ) will end up multiplying the zero singular values, it will get
mapped to zero.

• V1 is a basis for the row space of A, aka ran(AT ) (dim = r).


Again, when the matrix is rank-deficient, all four spaces are nontrivial.
If we are just interested in a factorization of A from which we can reconstruct
A we need only the first r singular vectors and singular values:

This is exactly analogous to the “skinny” QR factorization.
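Here is a sketch of reading the four subspaces off a computed SVD and of the
truncated factorization (not from the original notes; the rank tolerance 1e-10
is an arbitrary assumption):

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 5))   # 6x5, rank 4

    U, s, Vt = np.linalg.svd(A)
    r = int(np.sum(s > 1e-10 * s[0]))        # numerical rank

    U1, U2 = U[:, :r], U[:, r:]              # bases for ran(A) and null(A^T)
    V1, V2 = Vt[:r].T, Vt[r:].T              # bases for ran(A^T) and null(A)

    print(r)                                                     # 4
    print(np.linalg.norm(A @ V2), np.linalg.norm(A.T @ U2))      # both ~0
    print(np.allclose(A, U1 @ np.diag(s[:r]) @ V1.T))            # "skinny" SVD rebuilds A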

3.2 Rank-deficient least squares


As with most of our other linear algebra tools, SVD provides yet another way
to solve linear systems.

    Ax = b    =⇒    x = V Σ^-1 U^T b

No problem! It’s just three matrix multiplications. The inverse of Σ is easy:


it is just diag(1/σ1 , 1/σ2 , . . . , 1/σn ). This may be the simplest yet—but not
the fastest. Note that back substitution takes less arithmetic than full matrix
multiplication.
This process is nice and stable, and it’s very clear exactly where the numbers
can get bigger: only when they get divided by small singular values. It’s also
clear up front if the process will fail: it’s when we divide by a zero singular
value.
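A minimal sketch of this solve (illustrative only, assuming a well-conditioned
square A), compared against np.linalg.solve:

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((4, 4))
    b = rng.standard_normal(4)

    U, s, Vt = np.linalg.svd(A)
    x = Vt.T @ ((U.T @ b) / s)        # V Sigma^-1 U^T b: divide by each sigma_i

    print(np.allclose(x, np.linalg.solve(A, b)))    # True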
One of the strengths of the SVD is that it works when the matrix is singular.
How would we go about solving a singular system? More specifically, how can
we solve Ax ≈ b, A square n × n, rank r < n?
This is no longer a system of equations we can expect to have an exact
solution, since ran(A) ⊂ IRn and b might not be in there. It also will not have
a unique solution. If ||Ax∗ − b|| is minimal, so is ||A(x∗ + y) − b|| for any vector
y ∈ null(A). Thus a system like this is both overdetermined and underdetermined.
SVD gives us easy access to the solution space, though. The procedure for
doing this is a combination of the procedures we used for over- and underdeter-
mined systems using QR. First we expand the residual we’re trying to minimize
using the SVD of A and then transform it by U T (which does not change the
norm of the residual):

    ||Ax − b||^2 = ||U Σ V^T x − b||^2 = ||Σ V^T x − U^T b||^2
                 = || [Σ1 0 ; 0 0] V^T x − [U1^T ; U2^T] b ||^2

Breaking apart the top and bottom rows:

    = || [Σ1 0] V^T x − U1^T b ||^2 + || U2^T b ||^2

The first term can be made zero; the second is the residual. But because of rank
deficiency, making the first term zero is an underdetermined problem: the matrix
[Σ1 0] V^T is wider than it is tall. If we let y = V^T x and c = U1^T b, and split
y into [y1 ; y2], the system to be solved is

    [Σ1 0] [y1 ; y2] = c,    that is,    Σ1 y1 = c

Since y2 does not change the answer we’ll go for the minimum-norm solution
and set y2 = 0. Then x = V1 y1 .
So the solution process boils down to just:
1. Compute SVD.
2. c = U1^T b.
3. y1 = Σ1^-1 c.

4. x = V1 y1 .
Or, more briefly,
    x = V1 Σ1^-1 U1^T b.    (1)
This is just like the full-rank square solution, but with U2 and V2 left out.
I can also explain this process in words:
1. Transform b into the U basis.
2. Throw out components we can’t produce anyway, since there’s no point
trying to match them.
3. Undo the scaling effect of A.
4. Fill in zeros for the components that A doesn’t care about, since it doesn’t
matter what they are.
5. Transform out of the V basis to obtain x.
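The whole procedure fits in a few lines of NumPy. This sketch is illustrative
(the helper name svd_lstsq and the rank tolerance are my own choices, not part
of the notes); np.linalg.lstsq should return the same minimum-norm solution.

    import numpy as np

    def svd_lstsq(A, b, tol=1e-10):
        U, s, Vt = np.linalg.svd(A)
        r = int(np.sum(s > tol * s[0]))     # decide on a rank
        c = U[:, :r].T @ b                  # steps 1-2: transform b, keep the reachable part
        y1 = c / s[:r]                      # step 3: undo the scaling
        return Vt[:r].T @ y1                # steps 4-5: pad with zeros, map back out of the V basis

    rng = np.random.default_rng(4)
    A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))   # 5x4, rank 2
    b = rng.standard_normal(5)

    x = svd_lstsq(A, b)
    print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))     # same minimum-norm solution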

3.3 Pseudoinverse
If A is square and full-rank it has an inverse. Its SVD has a square Σ that has
nonzero entries all the way down the diagonal. We can invert Σ easily by taking
the reciprocals of these diagonal entries:

    Σ = diag(σ1 , σ2 , . . . , σn )
    Σ^-1 = diag(1/σ1 , 1/σ2 , . . . , 1/σn )

and we can write down the inverse of A directly:

    A^-1 = (U Σ V^T)^-1 = (V^T)^-1 Σ^-1 U^-1 = V Σ^-1 U^T    (2)

So the SVD of A−1 is closely related to the SVD of A. We simply have swapped
the roles of U and V and inverted the scale factors in Σ. Earlier we interpreted
U and V as bases for the “output” and “input” spaces of A, so it makes sense
that they are exchanged in the inverse. A nice thing about this way of getting
the inverse is that it’s easy to see before we start whether and where things will
go badly: the only thing that can go wrong is for the singular values to be zero
or nearly zero. In our other procedures for the inverse we just plowed ahead and
waited to see if things would blow up. (Of course, don’t forget that computing
the SVD is many times more expensive than a procedure like LU or QR.)
If A is non-square and/or rank deficient, it no longer has an inverse, but we
can go ahead and generalize the SVD-based inverse, modeling it on the rank-
deficient least squares process we just saw. Looking back at (1), we can see that
we are treating the matrix
    A+ = V1 Σ1^-1 U1^T    (3)
like an inverse for A, in the sense that we solve the problem Ax ≈ b by writing
x = A+ b. It is also just like the formula in (2) for A−1 except that we have kept
only the first r rows and/or columns of each matrix.
The matrix A+ is known as the pseudoinverse of A. Multiplying a vector by
the pseudoinverse solves a least-squares problem and gives you the minimum-
norm solution. Note that writing down a pseudoinverse for a matrix always
involves deciding on a rank for the matrix first.
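For instance (an illustrative sketch, with the rank decided by an assumed
tolerance of 1e-10), the pseudoinverse built from equation (3) matches NumPy's
SVD-based pinv:

    import numpy as np

    rng = np.random.default_rng(5)
    A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))   # 6x4, rank 3

    U, s, Vt = np.linalg.svd(A)
    r = int(np.sum(s > 1e-10 * s[0]))
    A_pinv = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T           # equation (3)

    print(np.allclose(A_pinv, np.linalg.pinv(A)))                   # True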
There’s another way in which the pseudoinverse is like an inverse: it is the
matrix that comes closest to acting like an inverse, in the sense that multiplying
A by A+ almost produces the identity:

    A+ = arg min over X ∈ IR^(n×m) of ||A X − Im||_F
       = arg min over X ∈ IR^(n×m) of ||X A − In||_F

We can prove this by looking at X in A’s singular vector space:

    X′ = V^T X U    (so that X = V X′ U^T)

so that

    A X − I = (U Σ V^T)(V X′ U^T) − I
    U^T (A X − I) U = Σ X′ − I

Since U is orthogonal, multiplying by it didn’t change the F-norm of AX − I.


This means we've reduced the problem of minimizing ||AX − I||_F to the easier
problem of minimizing ||Σ X′ − I||_F . Since the last m − r rows of Σ are zero,
the closest we can hope to match I is to get the first r rows to match. We can
do that easily by setting

    X′ = Σ+ = [ Σ1^-1 0 ; 0 0 ]

so that

    Σ Σ+ = [ Ir 0 ; 0 0 ].
Now that we know the optimal X′ we can compute the corresponding X, which
is A+ , and you can easily verify that A+ comes out to the expression in (3).
Here are some facts about A+ for special cases:

• A full rank, square: A+ = A−1 (as it has to be if it's going to be the closest
thing to an inverse, when the inverse in fact exists!)
• A full rank, tall: A+ A = In but AA+ ≠ Im . By the normal equations
we learned earlier, A+ = (AT A)−1 AT , since both matrices compute the
(unique) solution to a full-rank least squares problem.

• A full rank, wide: AA+ = Im but A+ A ≠ In . Again by normal equations,
A+ = AT (AAT )−1 .
Said informally, in both cases you can get the small identity but not the big
identity, because the big identity has a larger rank than A and A+ .
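These facts are easy to check numerically; here is a small sketch (illustrative,
not from the notes) for a full-rank tall matrix:

    import numpy as np

    rng = np.random.default_rng(6)
    A = rng.standard_normal((5, 3))         # full rank, tall
    Ap = np.linalg.pinv(A)

    print(np.allclose(Ap @ A, np.eye(3)))   # small identity: True
    print(np.allclose(A @ Ap, np.eye(5)))   # big identity: False
    print(np.allclose(Ap, np.linalg.inv(A.T @ A) @ A.T))   # normal-equations formula: True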

3.4 Conditioning of least-squares problems


With the SVD and pseudoinverse we are equipped to ask about the conditioning
of least squares problems: when we solve min_x ||Ax − b||, what can we say about
the error in x relative to the error in A and b? For square, nonsingular linear
systems the answer was the same for either kind of error: the bound is provided
by cond(A). Let’s look at what happens to these bounds as we move from
equality to least squares.
For errors in b, in the equality case, we assumed there is an error ∆b added
to b and asked how much we have to change x to keep the equality true. We
found that A∆x = ∆b so that:

    ∆x = A^-1 ∆b
    ||∆x|| ≤ ||A^-1|| ||∆b||

Likewise,

    b = Ax
    ||b|| ≤ ||A|| ||x||
    ||x|| ≥ ||b|| / ||A||

and dividing these two inequalities,

    ||∆x|| / ||x|| ≤ ||A|| ||A^-1|| ||∆b|| / ||b|| = cond(A) ||∆b|| / ||b||

In the (full rank) least-squares case, x = A+ b and ∆x = A+ ∆b, but Ax ≠ b.
From the SVD of A+ it's clear that ||A+|| = 1/σn , and we'll define cond(A) =
σ1 /σn = ||A|| ||A+||. We can then follow roughly the same argument to get:

    ∆x = A+ ∆b
    ||∆x|| ≤ ||A+|| ||∆b||

Likewise,

    ||Ax|| ≤ ||A|| ||x||
    ||x|| ≥ ||Ax|| / ||A||

and dividing these two inequalities,


 
    ||∆x|| / ||x|| ≤ ||A|| ||A+|| ||∆b|| / ||Ax|| = cond(A) ||∆b|| / ||Ax||
                  = cond(A) (||b|| / ||Ax||) (||∆b|| / ||b||)

So the condition number of the least squares problem is cond(A) times the ratio
||b|| / ||Ax||. This ratio is often called 1/ cos θ, where θ is the angle between the
vector b and the range of A. If b ∈ ran(A) then cos θ = 1 and the conditioning
is the same as the equality system. As b approaches orthogonal to ran(A), cos θ
decreases towards zero and the condition number of the least squares problem
becomes arbitrarily large. This is not too surprising: in this case the output of
the problem (x) becomes small but the propagated error only depends on ||A+||
which doesn’t depend on b, so the relative error can be very large.
On the other hand, it takes a special b to have cos θ very near zero. The
sensitivity of the solution to errors in b is similar to the equality case for “non-
special” values of b.
Let’s examine the effects of errors in A. In the equality case we saw that
these errors behave the same as errors in b. This time we change A to A + E
and ask how we have to change x to maintain equality:

(A + E)(x + ∆x) = b
Ax + Ex + A∆x + E∆x = b
A∆x ≈ −Ex

In this last step we have discarded E∆x on the basis that it is second-order
in terms of the errors and is therefore negligible for small errors relative to the
other terms, and we have canceled Ax with b. Carrying on and applying the
usual matrix-norm bounds:

    ∆x ≈ −A^-1 E x
    ||∆x|| ≤ ||A^-1|| ||E|| ||x||
    ||∆x|| / ||x|| ≤ ||A^-1|| ||E|| = cond(A) ||E|| / ||A||

So relative errors in A propagate to relative errors in x the same way as errors
in b do.
Now let’s look at this in the least-squares case. For this we will use the
normal equations, and we’ll need a fact about the matrix AT A:

    A^T A = V Σ U^T U Σ V^T = V Σ^2 V^T

(where U, Σ, V is the SVD of A). So its singular values are the squares of A's
singular values. This means ||A^T A|| = σ1^2 = ||A||^2 , ||(A^T A)^-1|| = 1/σn^2 =
||A+||^2 , and therefore cond(A^T A) = σ1^2 /σn^2 = cond(A)^2 .
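A quick numerical check (illustrative) of this squaring of the condition number:

    import numpy as np

    rng = np.random.default_rng(7)
    A = rng.standard_normal((8, 3))

    print(np.linalg.cond(A) ** 2)       # cond(A)^2 ...
    print(np.linalg.cond(A.T @ A))      # ... agrees with cond(A^T A) up to roundoff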
Now, following the approach used for equality systems, we ask about the
effect of changing A to A + E, using the normal equations to get an equation
for its effect on ∆x:

    (A + E)^T (A + E)(x + ∆x) = (A + E)^T b
    A^T A x + A^T E x + E^T A x + A^T A ∆x ≈ A^T b + E^T b

In the second line we have discarded four terms that are second or third order
in the errors (e.g. AT E∆x). Canceling AT Ax with AT b we get

    (A^T A) ∆x ≈ E^T (b − Ax) − A^T E x
    ∆x ≈ (A^T A)^-1 E^T r − A+ E x

where r = b − Ax is the residual, and we’ve substituted A+ for its definition in


terms of normal equations. Applying matrix norms:

    ||∆x|| ≤ ||(A^T A)^-1|| ||E|| ||r|| + ||A+|| ||E|| ||x||
    ||∆x|| ≤ ||A+||^2 ||E|| ||r|| + cond(A) (||E|| / ||A||) ||x||
    ||∆x|| / ||x|| ≤ (||A+||^2 / ||x||) ||E|| ||r|| + cond(A) ||E|| / ||A||
                  = cond(A)^2 (||E|| / ||A||) (||r|| / (||A|| ||x||)) + cond(A) ||E|| / ||A||

Since ||A|| ||x|| ≥ ||Ax||,

    ||∆x|| / ||x|| ≤ ( cond(A)^2 ||r|| / ||Ax|| + cond(A) ) ||E|| / ||A||

so we see in this case that the condition number for error in x propagated from
error in E is the sum of two pieces: the familiar condition number of A, and a
second term proportional to the square of that condition number and the ratio
||r|| / ||Ax||. This ratio is known as tan θ since it is the tangent of the angle
between b and the range of A. This one is more troubling because it will be
large except for “special” values of b that are very near ran(A).

3.5 Numerical rank
Sometimes we have to deal with rank-deficient matrices that are corrupted by
noise. In fact we always have to do this, because when you take some rank-
deficient matrix and represent it in floating point numbers, you’ll usually end
up with a full-rank matrix! (This is just like representing a point on a plane
in 3D – it's only if we are lucky that the point's floating-point representation
happens to lie exactly on the plane; usually it has to be approximated with one
near the plane instead.)
When our matrix comes from some kind of measurement that has uncertainty
associated with it, it will be quite far from rank-deficient even if the underlying
“true” matrix is rank-deficient.
SVD is a good tool for dealing with numerical rank issues because it answers
the question:
Given a rank-k matrix, how far is it from the nearest rank-(k − 1)
matrix?

That is, How rank-k is it?, or How precisely do we have to know the matrix in
order to be sure that it's not actually rank-(k − 1)?
SVD lets us get at these questions because, since rank and distances are
unaffected by the orthogonal transformations U and V , numerical rank questions
about A are reduced to questions about the diagonal matrix Σ.
If Σ has k nonzero entries on its diagonal, we can make it rank-(k − 1) by zeroing
the k-th diagonal entry. The distance between the two matrices is equal to the
entry we set to zero, namely σk , in both the 2-norm and the F-norm. It's not
surprising (though I won't prove it) that this is the closest rank-(k − 1) matrix.
If we have a rank-k matrix and perturb it by adding a noise matrix that has
norm less than ε, it can't change the zero singular values by more than ε. This
means noise has a nicely definable effect on our ability to detect rank: if the
singular values are more than ε we know they are for real (they did not just
come from the noise), and if they are less than ε we don't know whether they
came from the “real” matrix or from the noise.
We can roll these ideas up in a generalization of rank known as numerical
rank: whereas rank(A) is the (exact) rank of the matrix A, we can write
rank(A, ε) for the highest rank we're sure A has if it could be corrupted by noise
of norm ε. That is, rank(A, ε) is the highest r for which there are no matrices
of rank less than r within a distance ε of A.
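In code, the numerical rank is just the number of singular values above the
noise level. This sketch is illustrative; the noise level of 1e-8 and the
threshold ε = 1e-6 are assumed example values.

    import numpy as np

    rng = np.random.default_rng(8)
    A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 6))   # exact rank 2
    noisy = A + 1e-8 * rng.standard_normal((6, 6))                  # full rank in floating point

    s = np.linalg.svd(noisy, compute_uv=False)
    eps = 1e-6

    print(np.linalg.matrix_rank(noisy))     # 6: the noise makes the matrix technically full rank
    print(int(np.sum(s > eps)))             # 2: rank(noisy, eps) recovers the underlying rank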

3.6 Principal component analysis


Another way of viewing the SVD is that it gives you a sequence of low-rank
approximations to a data matrix A. These approximations become accurate as
the rank of the approximation approaches the “true” dimension of the data. To
do this, think of the product U ΣV T as a sum of outer products.

[Aside: recall the outer product version of matrix multiply, from way back
at the beginning:
    C = AB ;    c_ij = ∑_k a_ik b_kj ;    C = ∑_k a_:k b_k:

We can use the same idea even with Σ in the middle.]


We can express the SVD as a sum of outer products of corresponding left
and right singular vectors:
    A = [ u1 · · · um ] Σ [ v1^T ; . . . ; vn^T ] = ∑_i σi ui vi^T

This is a sum of a bunch of rank-1 matrices (outer product matrices are always
rank 1), and the norms of these matrices are steadily decreasing.
By truncating this sum and including only the first r terms, we wind up
with a matrix of rank r that approximates A. In fact, it is the best rank-r
approximation, in both the 2-norm and the Frobenius norm.
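A quick check (illustrative, not from the notes) that the truncated sum really
does achieve a 2-norm error equal to the first singular value left out:

    import numpy as np

    rng = np.random.default_rng(9)
    A = rng.standard_normal((8, 6))
    U, s, Vt = np.linalg.svd(A)

    r = 3
    A_r = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))    # first r outer products

    print(np.linalg.norm(A - A_r, 2))   # equals ...
    print(s[r])                         # ... the first discarded singular value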
Interpreting this low-rank approximation in terms of column vectors leads
to a powerful and widely used statistical tool known as principal component
analysis. If we have a lot of vectors that we think are somehow related—they
came from some process that only generates certain kinds of vectors—then one
way to try to identify the structure in the data is to try and find a subspace that
all the vectors are close to. If the subspace fits the data well, then that means
the data are low rank: they don’t have as many degrees of freedom as there
are measurements for each data point. Then, with a basis for that subspace in
hand, we can talk about our data in terms of combinations of those basis vectors,
which is more efficient, especially if we can make do with a fairly small number
of vectors. It can make the data easier to understand by reducing the dimension
to where we can manage to think about it, or it can make computations more
efficient by allowing us to work with the (short) vectors of coefficients in place
of the (long) original vectors.
To be a little more specific, let's say we have a bunch of m-vectors called
a1 , . . . , an . Stack them, columnwise, into a matrix A, and then compute the
SVD of A:

    A = U Σ V^T
The first n columns of U are the principal components of the data vectors
a1 , . . . , an . The entries of Σ tell us the relative importance of the principal
components. The rows of V, scaled by the corresponding singular values, give the
coefficients of the data vectors in the principal components basis. Once we have
this representation of the data, we
can look at the singular values to see how many of the principal components
seem to be important. If we use only the first k principal components, leaving
out the remaining n − k components, we’ve chosen a k-dimensional subspace,

and the error of approximating our data vectors by their projections into that
subspace is limited by the magnitude of the singular values belonging to the
components we left off. This subspace is optimal in the 2-norm sense, and in
the F-norm sense.
One adjustment to this process, which sacrifices a bit of optimality in ex-
change for more meaningful principal components, is to first subtract the mean
of the ai s from each ai . Of course, you need to keep the mean vector around and
add it back in whenever you are approximating vectors from the components.
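Putting the whole recipe together, here is a minimal PCA sketch along these
lines (illustrative; the synthetic data set, the number of components k = 2,
and the noise level are assumptions of mine, not part of the notes):

    import numpy as np

    rng = np.random.default_rng(10)
    m, n, k = 5, 200, 2
    offset = rng.standard_normal((m, 1))
    A = (offset
         + rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
         + 0.01 * rng.standard_normal((m, n)))        # columns are the data vectors

    mean = A.mean(axis=1, keepdims=True)              # subtract the mean first
    U, s, Vt = np.linalg.svd(A - mean, full_matrices=False)

    print(s)                               # k dominant singular values, the rest near the noise level
    components = U[:, :k]                  # principal components
    coeffs = components.T @ (A - mean)     # k-dimensional coordinates of each data vector
    approx = mean + components @ coeffs    # approximate the data from k components plus the mean
    print(np.linalg.norm(A - approx))      # small residual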

Sources
• Our textbook: Heath, Scientific Computing: An introductory survey, 2e.
Section 3.6.
• Golub and Van Loan, Matrix Computations, Third edition. Johns Hopkins
Univ. Press, 1996. Chapter 5.
