Linear Algebra Review
These notes give a review of basic concepts from linear algebra. Data are often represented
and manipulated as matrices, and linear algebra becomes the natural tool.
Definition 1 A linear vector space $\mathcal{X}$ is a collection of vectors satisfying the following properties.
addition: $\forall\, x, y, z \in \mathcal{X}$
a) $x + y \in \mathcal{X}$
b) $x + y = y + x$
c) $(x + y) + z = x + (y + z)$
d) $\exists\, 0 \in \mathcal{X}$ such that $x + 0 = x$
e) $\forall\, x \in \mathcal{X}$, $\exists\, {-x} \in \mathcal{X}$ such that $x + (-x) = 0$
multiplication: $\forall\, x, y \in \mathcal{X}$ and $a, b \in \mathbb{R}$
a) $a x \in \mathcal{X}$
b) $a(b x) = (ab) x$
c) $1x = x$, $0x = 0$
d) $a(x + y) = ax + ay$
Example 1 Here are two examples of linear vector spaces: the familiar $d$-dimensional Euclidean space $\mathbb{R}^d$ and the space of finite-energy signals/functions supported on the interval $[0, T]$,
$$L_2([0, T]) := \left\{ x : \int_0^T x^2(t)\, dt < \infty \right\}$$
Definition 3 An inner product is a mapping from $\mathcal{X} \times \mathcal{X}$ to $\mathbb{R}$. The inner product between any $x, y \in \mathcal{X}$ is denoted by $\langle x, y \rangle$ and it satisfies the following properties for all $x, y, z \in \mathcal{X}$ and $a \in \mathbb{R}$:
a) $\langle x, y \rangle = \langle y, x \rangle$
b) $\langle a x, y \rangle = a \langle x, y \rangle$
c) $\langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle$
d) $\langle x, x \rangle \ge 0$
Definition 4 An inner product space that contains all its limits is called a Hilbert Space and in this case we often denote the space by $\mathcal{H}$; i.e., if $x_1, x_2, \ldots$ are in $\mathcal{H}$ and $\lim_{n \to \infty} x_n$ exists, then the limit is also in $\mathcal{H}$.
It is easy to verify that $\mathbb{R}^n$, $L_2([0, T])$, and $\ell_2(\mathbb{Z})$, the set of all finite-energy sequences (e.g., discrete-time signals), are all Hilbert Spaces.
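As a concrete numerical illustration (the particular vectors, functions, grid, and $T = 1$ below are arbitrary choices for the sketch), the inner product on $\mathbb{R}^n$ is the usual dot product, and the standard $L_2([0, T])$ inner product $\langle f, g \rangle = \int_0^T f(t) g(t)\, dt$ can be approximated on a fine grid:

```python
import numpy as np

# Inner product in R^n: the usual dot product.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
print(np.dot(x, y))            # <x, y> = 32.0

# Inner product in L2([0, T]), <f, g> = integral of f(t) g(t) dt,
# approximated numerically on a fine grid.
T, N = 1.0, 10_001
t = np.linspace(0.0, T, N)
f = np.sin(2 * np.pi * t)
g = np.cos(2 * np.pi * t)
print(np.trapz(f * g, t))      # approximately 0: f and g are orthogonal in L2([0, 1])
```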
Definition 6 The set of all vectors that can be generated by taking linear combinations of $\{x_1, \ldots, x_k\}$, i.e., all vectors of the form
$$v = \sum_{i=1}^{k} \theta_i x_i, \quad \theta_i \in \mathbb{R},$$
is called the span of $\{x_1, \ldots, x_k\}$.
Definition 7 A set of linearly independent vectors $\{\phi_i\}_{i \ge 1}$ is a basis for $\mathcal{H}$ if every $x \in \mathcal{H}$ can be represented as a unique linear combination of $\{\phi_i\}$. That is, every $x \in \mathcal{H}$ can be expressed as
$$x = \sum_{i \ge 1} \theta_i \phi_i$$
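A short numerical sketch of this uniqueness (the particular basis and vector are arbitrary choices): if the columns of a matrix $\Phi$ form a basis for $\mathbb{R}^3$, the coefficients $\theta$ with $x = \Phi \theta$ are found by solving a linear system, and the solution is unique.

```python
import numpy as np

# Columns of Phi form a (non-orthogonal) basis for R^3.
Phi = np.array([[1.0, 1.0, 0.0],
                [0.0, 1.0, 1.0],
                [0.0, 0.0, 1.0]])
x = np.array([2.0, 3.0, 4.0])

# The representation x = Phi @ theta has exactly one coefficient vector theta.
theta = np.linalg.solve(Phi, x)
print(theta)                          # the unique coefficients
print(np.allclose(Phi @ theta, x))    # True
```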
Any basis can be converted into an orthonormal basis using Gram-Schmidt Orthogonalization.
Let $\{\phi_i\}$ be a basis for a vector space $\mathcal{X}$. An orthobasis $\{\psi_i\}$ for $\mathcal{X}$ can be constructed as follows.
Gram-Schmidt Orthogonalization
1. $\psi_1 := \phi_1 / \|\phi_1\|$
2. $\psi_2' := \phi_2 - \langle \psi_1, \phi_2 \rangle \psi_1$; $\quad \psi_2 := \psi_2' / \|\psi_2'\|$
$\vdots$
k. $\psi_k' := \phi_k - \sum_{i=1}^{k-1} \langle \psi_i, \phi_k \rangle \psi_i$; $\quad \psi_k := \psi_k' / \|\psi_k'\|$
$\vdots$

3 Orthogonal Projections

One of the most important tools that we will use from linear algebra is the notion of an orthogonal projection. Let $\mathcal{H}$ be a Hilbert space and let $\mathcal{M} \subset \mathcal{H}$ be a subspace. Every $x \in \mathcal{H}$ can be written as $x = y + z$, where $y \in \mathcal{M}$ and $z \perp \mathcal{M}$, which is shorthand for $z$ orthogonal to $\mathcal{M}$; that is, $\langle v, z \rangle = 0$ for all $v \in \mathcal{M}$. The vector $y$ is the optimal approximation to $x$ in terms of vectors in $\mathcal{M}$ in the following sense:
$$\|x - y\| = \min_{v \in \mathcal{M}} \|x - v\|$$
and this projection can be viewed as a sort of filter that removes all components of the signal that are orthogonal to $\mathcal{M}$.
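A minimal numerical sketch of both ideas, assuming a small hypothetical example in $\mathbb{R}^3$ (the function name `gram_schmidt` and the particular vectors are illustration choices, not part of the notes): Gram-Schmidt builds an orthobasis $\{\psi_i\}$ for a subspace $\mathcal{M}$, and the resulting projection $y = P_{\mathcal{M}} x$ leaves a residual $z = x - y$ that is orthogonal to $\mathcal{M}$.

```python
import numpy as np

def gram_schmidt(Phi):
    """Orthonormalize the columns of Phi (assumed linearly independent),
    following the Gram-Schmidt procedure in the box above."""
    Psi = np.zeros_like(Phi, dtype=float)
    for k in range(Phi.shape[1]):
        # subtract the components of phi_k along the previously built psi_i
        psi = Phi[:, k] - Psi[:, :k] @ (Psi[:, :k].T @ Phi[:, k])
        Psi[:, k] = psi / np.linalg.norm(psi)
    return Psi

# A non-orthogonal basis for a 2-dimensional subspace M of R^3.
Phi = np.array([[1.0, 1.0],
                [1.0, 0.0],
                [0.0, 1.0]])
Psi = gram_schmidt(Phi)                      # orthobasis for M
P_M = Psi @ Psi.T                            # projection matrix onto M

x = np.array([1.0, 2.0, 3.0])
y = P_M @ x                                  # best approximation of x in M
z = x - y                                    # residual
print(np.allclose(Psi.T @ Psi, np.eye(2)))   # columns of Psi are orthonormal
print(np.allclose(Psi.T @ z, 0.0))           # z is orthogonal to M
```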
Example 7 Let $\mathcal{H} = \mathbb{R}^2$. Consider the canonical coordinate system $\phi_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $\phi_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$. Consider the subspace spanned by $\phi_1$. The projection of any $x = [x_1 \; x_2]^T \in \mathbb{R}^2$ onto this subspace is
$$P_1 x = \left\langle x, \begin{bmatrix} 1 \\ 0 \end{bmatrix} \right\rangle \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \left( [x_1 \; x_2] \begin{bmatrix} 1 \\ 0 \end{bmatrix} \right) \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} x_1 \\ 0 \end{bmatrix}$$
The projection operator $P_1$ is just a matrix and it is given by
$$P_1 := \phi_1 \phi_1^T = \begin{bmatrix} 1 \\ 0 \end{bmatrix} [1 \; 0] = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$$
It is also easy to check that $\phi_1 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}$ and $\phi_2 = \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix}$ is an orthobasis for $\mathbb{R}^2$.
What is the projection operator onto the span of $\phi_1$ in this case?
More generally, suppose we are considering $\mathbb{R}^n$ and we have an orthobasis $\{\phi_i\}_{i=1}^{r}$ for some $r$-dimensional, $r < n$, subspace $\mathcal{M}$ of $\mathbb{R}^n$. Then the projection matrix is given by $P_{\mathcal{M}} = \sum_{i=1}^{r} \phi_i \phi_i^T$.
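The question above can be checked numerically; a short sketch (with the sign convention for $\phi_2$ as written above, which does not affect the projector):

```python
import numpy as np

# Orthobasis of R^2 from Example 7 (rotated by 45 degrees).
phi1 = np.array([1.0, 1.0]) / np.sqrt(2)
phi2 = np.array([1.0, -1.0]) / np.sqrt(2)

# Projection onto span{phi1}: P_1 = phi1 phi1^T.
P1 = np.outer(phi1, phi1)
print(P1)                                         # [[0.5, 0.5], [0.5, 0.5]]
print(np.allclose(P1 @ P1, P1))                   # idempotent: P^2 = P
print(np.allclose(P1, P1.T))                      # symmetric

# General rule P_M = sum_i phi_i phi_i^T; using both vectors recovers the identity.
print(np.allclose(P1 + np.outer(phi2, phi2), np.eye(2)))
```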
Example 8 Let $\mathcal{H} = L_2([0, 1])$ and let $\mathcal{M} = \{\text{linear functions on } [0, 1]\}$. Since all linear functions have the form $at + b$, for $t \in [0, 1]$, here is a basis for $\mathcal{M}$: $\phi_1(t) = 1$, $\phi_2(t) = t$. Note that this means that $\mathcal{M}$ is two-dimensional. That makes sense since every line is defined by its slope and intercept (two real numbers). Using the Gram-Schmidt procedure we can construct the orthobasis $\psi_1(t) = 1$, $\psi_2(t) = \sqrt{12}\,(t - 1/2)$. Now, consider any function $x \in L_2([0, 1])$. The projection of $x$ onto $\mathcal{M}$ is
$$P_{\mathcal{M}} x = \langle x, \psi_1 \rangle \psi_1 + \langle x, \psi_2 \rangle \psi_2 = \int_0^1 x(\tau)\, d\tau + 12\,(t - 1/2) \int_0^1 (\tau - 1/2)\, x(\tau)\, d\tau$$
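A numerical sanity check of this formula, using the arbitrary choice $x(t) = t^2$ and grid approximations of the integrals; the result should match an ordinary least-squares line fit to $x$ on the same grid (for $x(t) = t^2$ the projection works out to $t - 1/6$):

```python
import numpy as np

t = np.linspace(0.0, 1.0, 20_001)
x = t**2                                    # function to project (chosen for illustration)

def inner(f, g):
    """Grid approximation of the L2([0,1]) inner product."""
    return np.trapz(f * g, t)

psi1 = np.ones_like(t)                      # orthobasis from the Gram-Schmidt step above
psi2 = np.sqrt(12.0) * (t - 0.5)
proj = inner(x, psi1) * psi1 + inner(x, psi2) * psi2

# Compare with a direct least-squares fit of a line to x(t) on the same grid.
b, a = np.polyfit(t, x, 1)                  # x(t) is approximated by a + b t
print(np.max(np.abs(proj - (a + b * t))))   # close to 0 (projection is t - 1/6)
```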
4 Eigenanalysis
Suppose $A$ is an $m \times n$ matrix with entries from a field (e.g., $\mathbb{R}$ or $\mathbb{C}$, the latter being the complex numbers). Then there exists a factorization of the form
$$A = U D V^*$$
where $U = [u_1 \cdots u_m]$ is $m \times m$ with orthonormal columns, $V = [v_1 \cdots v_n]$ is $n \times n$ with orthonormal columns, the superscript $*$ means transposition and conjugation (if complex-valued), and the matrix $D$ is $m \times n$ and has the form
$$D = \begin{bmatrix} \sigma_1 & 0 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & \sigma_2 & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots & & \vdots \\ 0 & 0 & \cdots & \sigma_m & 0 & \cdots & 0 \end{bmatrix}$$
The values $\sigma_1, \ldots, \sigma_m$ are called the singular values of $A$ and the factorization is called the singular value decomposition (SVD). Because of the orthonormality of the columns of $U$ and $V$ we have $A v_i = \sigma_i u_i$ and $A^* u_i = \sigma_i v_i$, $i = 1, \ldots, m$.
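These relations are easy to verify numerically; a small sketch with an arbitrary random real matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))            # an arbitrary 3 x 5 real matrix (m = 3, n = 5)

U, s, Vh = np.linalg.svd(A)                # A = U @ D @ Vh, singular values in s
for i in range(len(s)):
    u_i, v_i = U[:, i], Vh[i, :]
    print(np.allclose(A @ v_i, s[i] * u_i),       # A v_i = sigma_i u_i
          np.allclose(A.T @ u_i, s[i] * v_i))     # A^* u_i = sigma_i v_i (real case)
```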
A vector $u$ is called an eigenvector of $A$ if $A u = \lambda u$ for some scalar $\lambda$. The scalar $\lambda$ is called the eigenvalue associated with $u$. Symmetric matrices (which are always square) always have real eigenvalues and have an eigendecomposition of the form $A = U D U^*$, where the columns of $U$ are the orthonormal eigenvectors of $A$, $D$ is a diagonal matrix written $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, and the diagonal entries are the eigenvalues. When the eigenvalues are non-negative, this is just a special case of the SVD. A symmetric positive-semidefinite matrix satisfies the property $v^T A v \ge 0$ for all $v$. This implies that the eigenvalues of symmetric positive-semidefinite matrices are non-negative.
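A quick numerical check (forming $A = B^T B$ is just one convenient way to generate a symmetric positive-semidefinite example):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 3))
A = B.T @ B                                     # symmetric positive-semidefinite, 3 x 3

lam, U = np.linalg.eigh(A)                      # real eigenvalues and orthonormal eigenvectors
print(np.all(lam >= -1e-12))                    # non-negative (up to round-off)
print(np.allclose(A @ U, U @ np.diag(lam)))     # A u_i = lambda_i u_i
print(np.allclose(U @ np.diag(lam) @ U.T, A))   # A = U D U^T
```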
Example 9 Let $X$ be a random vector taking values in $\mathbb{R}^n$ and recall the definition of the covariance matrix:
$$\Sigma := E[(X - \mu)(X - \mu)^T]$$
It is easy to see that $v^T \Sigma v \ge 0$, and of course $\Sigma$ is symmetric. Therefore, every covariance matrix has an eigendecomposition of the form $\Sigma = U D U^*$, where $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ and $\lambda_i \ge 0$ for $i = 1, \ldots, n$.
The Karhunen-Loève Transform (KLT), which is also called Principal Component Analysis (PCA), is based on transforming a random vector $X$ into the coordinate system associated with the eigendecomposition of the covariance of $X$. Let $X$ be an $n$-dimensional random vector with covariance matrix $\Sigma = U D U^*$. Let $u_1, \ldots, u_n$ be the eigenvectors. Assume that the eigenvectors and eigenvalues are ordered such that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. The KLT or PCA representation of the random vector $X$ is given by
$$X = \sum_{i=1}^{n} (u_i^T X)\, u_i$$
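A brief sketch of the computation on synthetic data, using the sample covariance in place of the true $\Sigma$ (the data model below is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 5, 10_000
# N samples of an n-dimensional random vector with unequal component variances.
X = rng.standard_normal((n, N)) * np.arange(1, n + 1)[:, None]

Sigma = np.cov(X)                       # sample covariance (n x n)
lam, U = np.linalg.eigh(Sigma)          # eigh returns eigenvalues in ascending order ...
lam, U = lam[::-1], U[:, ::-1]          # ... reorder so lambda_1 >= ... >= lambda_n

theta = U.T @ X                         # KLT/PCA coefficients u_i^T X for every sample
print(np.allclose(U @ theta, X))        # exact reconstruction from all n coefficients
```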
Keeping only the first $r < n$ terms gives the $r$-term approximation $X_r = \sum_{i=1}^{r} (u_i^T X)\, u_i$. Note that this approximation involves only $r$ scalar random variables $\{u_i^T X\}_{i=1}^{r}$ rather than $n$. In fact, it is easy to show that among all $r$-term linear approximations of $X$ in terms of $r$ random variables, $X_r$ has the smallest mean square error; that is, if we let $\mathcal{S}_r$ denote all $r$-term linear approximations to $X$, then