
2.7 Linear Mappings

the kth column of T is the coordinate representation of c̃k with respect to C. Note that both S and T are regular.
We are going to look at Φ(b̃j) from two perspectives. First, applying the mapping Φ, we get that for all j = 1, . . . , n
$$\Phi(\tilde{b}_j) \overset{(2.107)}{=} \sum_{k=1}^{m} \tilde{a}_{kj}\,\underbrace{\tilde{c}_k}_{\in W} = \sum_{k=1}^{m} \tilde{a}_{kj}\left(\sum_{l=1}^{m} t_{lk}\, c_l\right) = \sum_{l=1}^{m}\left(\sum_{k=1}^{m} t_{lk}\,\tilde{a}_{kj}\right) c_l\,, \tag{2.108}$$

where we first expressed the new basis vectors c̃k ∈ W as linear combinations of the basis vectors cl ∈ W and then swapped the order of summation.
Alternatively, when we express the b̃j ∈ V as linear combinations of bj ∈ V , we arrive at
$$\Phi(\tilde{b}_j) \overset{(2.106)}{=} \Phi\!\left(\sum_{i=1}^{n} s_{ij}\, b_i\right) = \sum_{i=1}^{n} s_{ij}\,\Phi(b_i) = \sum_{i=1}^{n} s_{ij} \sum_{l=1}^{m} a_{li}\, c_l \tag{2.109a}$$
$$= \sum_{l=1}^{m}\left(\sum_{i=1}^{n} a_{li}\, s_{ij}\right) c_l\,, \quad j = 1,\dots,n\,, \tag{2.109b}$$

where we exploited the linearity of Φ. Comparing (2.108) and (2.109b), it follows for all j = 1, . . . , n and l = 1, . . . , m that
$$\sum_{k=1}^{m} t_{lk}\,\tilde{a}_{kj} = \sum_{i=1}^{n} a_{li}\, s_{ij} \tag{2.110}$$

and, therefore,

T ÃΦ = AΦ S ∈ Rm×n , (2.111)

such that

ÃΦ = T −1 AΦ S , (2.112)

which proves Theorem 2.20.

Theorem 2.20 tells us that with a basis change in V (B is replaced with B̃) and W (C is replaced with C̃), the transformation matrix AΦ of a
linear mapping Φ : V → W is replaced by an equivalent matrix ÃΦ with

ÃΦ = T −1 AΦ S. (2.113)

[Figure 2.11: For a homomorphism Φ : V → W and ordered bases B, B̃ of V and C, C̃ of W (marked in blue), we can express the mapping ΦC̃B̃ with respect to the bases B̃, C̃ equivalently as a composition of the homomorphisms ΦC̃B̃ = ΞC̃C ◦ ΦCB ◦ ΨBB̃ with respect to the bases in the subscripts. The corresponding transformation matrices are in red.]

Figure 2.11 illustrates this relation: Consider a homomorphism Φ : V → W and ordered bases B, B̃ of V and C, C̃ of W . The mapping ΦCB is an instantiation of Φ and maps basis vectors of B onto linear combinations of basis vectors of C . Assume that we know the transformation matrix AΦ of ΦCB with respect to the ordered bases B, C . When we perform a basis change from B to B̃ in V and from C to C̃ in W , we can determine the corresponding transformation matrix ÃΦ as follows: First, we find the matrix representation of the linear mapping ΨBB̃ : V → V that maps coordinates with respect to the new basis B̃ onto the (unique) coordinates with respect to the "old" basis B (in V ). Then, we use the transformation matrix AΦ of ΦCB : V → W to map these coordinates onto the coordinates with respect to C in W . Finally, we use a linear mapping ΞC̃C : W → W to map the coordinates with respect to C onto coordinates with respect to C̃ . Therefore, we can express the linear mapping ΦC̃B̃ as a composition of linear mappings that involve the "old" basis:
$$\Phi_{\tilde{C}\tilde{B}} = \Xi_{\tilde{C}C}\circ\Phi_{CB}\circ\Psi_{B\tilde{B}} = \Xi^{-1}_{C\tilde{C}}\circ\Phi_{CB}\circ\Psi_{B\tilde{B}}\,. \tag{2.114}$$
Concretely, we use ΨB B̃ = idV and ΞC C̃ = idW , i.e., the identity mappings
that map vectors onto themselves, but with respect to a different basis.

Definition 2.21 (Equivalence). Two matrices A, Ã ∈ Rm×n are equivalent if there exist regular matrices S ∈ Rn×n and T ∈ Rm×m , such that Ã = T −1 AS .

Definition 2.22 (Similarity). Two matrices A, Ã ∈ Rn×n are similar if there exists a regular matrix S ∈ Rn×n with Ã = S −1 AS .

Remark. Similar matrices are always equivalent. However, equivalent matrices are not necessarily similar. ♢
Remark. Consider vector spaces V, W, X . From the remark that follows
Theorem 2.17, we already know that for linear mappings Φ : V → W
and Ψ : W → X the mapping Ψ ◦ Φ : V → X is also linear. With
transformation matrices AΦ and AΨ of the corresponding mappings, the
overall transformation matrix is AΨ◦Φ = AΨ AΦ . ♢
In light of this remark, we can look at basis changes from the perspec-
tive of composing linear mappings:

AΦ is the transformation matrix of a linear mapping ΦCB : V → W with respect to the bases B, C .
ÃΦ is the transformation matrix of the linear mapping ΦC̃B̃ : V → W with respect to the bases B̃, C̃ .
S is the transformation matrix of a linear mapping ΨBB̃ : V → V (automorphism) that represents B̃ in terms of B . Normally, Ψ = idV is the identity mapping in V .


T is the transformation matrix of a linear mapping ΞCC̃ : W → W (automorphism) that represents C̃ in terms of C . Normally, Ξ = idW is the identity mapping in W .

If we (informally) write down the transformations just in terms of bases, then AΦ : B → C , ÃΦ : B̃ → C̃ , S : B̃ → B , T : C̃ → C and T⁻¹ : C → C̃ , and
$$\tilde{B} \to \tilde{C} = \tilde{B} \to B \to C \to \tilde{C} \tag{2.115}$$
$$\tilde{A}_\Phi = T^{-1} A_\Phi S\,. \tag{2.116}$$
Note that the execution order in (2.116) is from right to left because vectors are multiplied at the right-hand side so that x ↦ Sx ↦ AΦ(Sx) ↦ T⁻¹AΦ(Sx) = ÃΦx.


Example 2.24 (Basis Change)


Consider a linear mapping Φ : R3 → R4 whose transformation matrix is
$$A_\Phi = \begin{bmatrix} 1 & 2 & 0\\ -1 & 1 & 3\\ 3 & 7 & 1\\ -1 & 2 & 4 \end{bmatrix} \tag{2.117}$$
with respect to the standard bases
$$B = \left(\begin{bmatrix}1\\0\\0\end{bmatrix}, \begin{bmatrix}0\\1\\0\end{bmatrix}, \begin{bmatrix}0\\0\\1\end{bmatrix}\right), \quad C = \left(\begin{bmatrix}1\\0\\0\\0\end{bmatrix}, \begin{bmatrix}0\\1\\0\\0\end{bmatrix}, \begin{bmatrix}0\\0\\1\\0\end{bmatrix}, \begin{bmatrix}0\\0\\0\\1\end{bmatrix}\right). \tag{2.118}$$
We seek the transformation matrix ÃΦ of Φ with respect to the new bases
$$\tilde{B} = \left(\begin{bmatrix}1\\1\\0\end{bmatrix}, \begin{bmatrix}0\\1\\1\end{bmatrix}, \begin{bmatrix}1\\0\\1\end{bmatrix}\right) \in \mathbb{R}^3, \quad \tilde{C} = \left(\begin{bmatrix}1\\1\\0\\0\end{bmatrix}, \begin{bmatrix}1\\0\\1\\0\end{bmatrix}, \begin{bmatrix}0\\1\\1\\0\end{bmatrix}, \begin{bmatrix}1\\0\\0\\1\end{bmatrix}\right). \tag{2.119}$$
Then,
$$S = \begin{bmatrix}1 & 0 & 1\\ 1 & 1 & 0\\ 0 & 1 & 1\end{bmatrix}, \quad T = \begin{bmatrix}1 & 1 & 0 & 1\\ 1 & 0 & 1 & 0\\ 0 & 1 & 1 & 0\\ 0 & 0 & 0 & 1\end{bmatrix}, \tag{2.120}$$
where the ith column of S is the coordinate representation of b̃i in terms of the basis vectors of B . Since B is the standard basis, the coordinate representation is straightforward to find. For a general basis B , we would need to solve a linear equation system to find the λi such that $\sum_{i=1}^{3}\lambda_i b_i = \tilde{b}_j$, j = 1, . . . , 3. Similarly, the jth column of T is the coordinate representation of c̃j in terms of the basis vectors of C .
Therefore, we obtain
$$\tilde{A}_\Phi = T^{-1} A_\Phi S = \frac{1}{2}\begin{bmatrix}1 & 1 & -1 & -1\\ 1 & -1 & 1 & -1\\ -1 & 1 & 1 & 1\\ 0 & 0 & 0 & 2\end{bmatrix}\begin{bmatrix}3 & 2 & 1\\ 0 & 4 & 2\\ 10 & 8 & 4\\ 1 & 6 & 3\end{bmatrix} \tag{2.121a}$$
$$= \begin{bmatrix}-4 & -4 & -2\\ 6 & 0 & 0\\ 4 & 8 & 4\\ 1 & 6 & 3\end{bmatrix}. \tag{2.121b}$$
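The computation in Example 2.24 is easy to check numerically. The following short sketch is not part of the book; it assumes NumPy is available and simply re-evaluates the basis-change formula (2.112) with the matrices from the example:

```python
import numpy as np

# Transformation matrix of Phi with respect to the standard bases B, C
A_Phi = np.array([[1, 2, 0],
                  [-1, 1, 3],
                  [3, 7, 1],
                  [-1, 2, 4]])

# Columns of S: coordinates of the new basis vectors of B-tilde w.r.t. B
S = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]])

# Columns of T: coordinates of the new basis vectors of C-tilde w.r.t. C
T = np.array([[1, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]])

# Basis change (2.112): A-tilde = T^{-1} A_Phi S
A_tilde = np.linalg.inv(T) @ A_Phi @ S
print(A_tilde)
# [[-4. -4. -2.]
#  [ 6.  0.  0.]
#  [ 4.  8.  4.]
#  [ 1.  6.  3.]]
```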

In Chapter 4, we will be able to exploit the concept of a basis change


to find a basis with respect to which the transformation matrix of an en-
domorphism has a particularly simple (diagonal) form. In Chapter 10, we
will look at a data compression problem and find a convenient basis onto
which we can project the data while minimizing the compression loss.

2.7.3 Image and Kernel


The image and kernel of a linear mapping are vector subspaces with cer-
tain important properties. In the following, we will characterize them
more carefully.
Definition 2.23 (Image and Kernel).
kernel For Φ : V → W , we define the kernel/null space
null space
ker(Φ) := Φ−1 (0W ) = {v ∈ V : Φ(v) = 0W } (2.122)
image and the image/range
range
Im(Φ) := Φ(V ) = {w ∈ W |∃v ∈ V : Φ(v) = w} . (2.123)
domain We also call V and W also the domain and codomain of Φ, respectively.
codomain
Intuitively, the kernel is the set of vectors v ∈ V that Φ maps onto the
neutral element 0W ∈ W . The image is the set of vectors w ∈ W that
can be “reached” by Φ from any vector in V . An illustration is given in
Figure 2.12.
Remark. Consider a linear mapping Φ : V → W , where V, W are vector
spaces.
It always holds that Φ(0V ) = 0W and, therefore, 0V ∈ ker(Φ). In
particular, the null space is never empty.
Im(Φ) ⊆ W is a subspace of W , and ker(Φ) ⊆ V is a subspace of V .


[Figure 2.12: Kernel and image of a linear mapping Φ : V → W .]

Φ is injective (one-to-one) if and only if ker(Φ) = {0}.



Remark (Null Space and Column Space). Let us consider A ∈ Rm×n and a linear mapping Φ : Rn → Rm , x ↦ Ax.
For A = [a1 , . . . , an ], where ai are the columns of A, we obtain
$$\operatorname{Im}(\Phi) = \{Ax : x \in \mathbb{R}^n\} = \left\{\sum_{i=1}^{n} x_i a_i : x_1, \dots, x_n \in \mathbb{R}\right\} \tag{2.124a}$$
$$= \operatorname{span}[a_1, \dots, a_n] \subseteq \mathbb{R}^m\,, \tag{2.124b}$$
i.e., the image is the span of the columns of A, also called the column space. Therefore, the column space (image) is a subspace of Rm , where m is the "height" of the matrix.
rk(A) = dim(Im(Φ)).
The kernel/null space ker(Φ) is the general solution to the homoge-
neous system of linear equations Ax = 0 and captures all possible
linear combinations of the elements in Rn that produce 0 ∈ Rm .
The kernel is a subspace of Rn , where n is the “width” of the matrix.
The kernel focuses on the relationship among the columns, and we can
use it to determine whether/how we can express a column as a linear
combination of other columns.

Example 2.25 (Image and Kernel of a Linear Mapping)


The mapping
$$\Phi: \mathbb{R}^4 \to \mathbb{R}^2, \quad \begin{bmatrix}x_1\\x_2\\x_3\\x_4\end{bmatrix} \mapsto \begin{bmatrix}1 & 2 & -1 & 0\\ 1 & 0 & 0 & 1\end{bmatrix}\begin{bmatrix}x_1\\x_2\\x_3\\x_4\end{bmatrix} = \begin{bmatrix}x_1 + 2x_2 - x_3\\ x_1 + x_4\end{bmatrix} \tag{2.125a}$$
$$= x_1\begin{bmatrix}1\\1\end{bmatrix} + x_2\begin{bmatrix}2\\0\end{bmatrix} + x_3\begin{bmatrix}-1\\0\end{bmatrix} + x_4\begin{bmatrix}0\\1\end{bmatrix} \tag{2.125b}$$
is linear. To determine Im(Φ), we can take the span of the columns of the transformation matrix and obtain
$$\operatorname{Im}(\Phi) = \operatorname{span}\!\left[\begin{bmatrix}1\\1\end{bmatrix}, \begin{bmatrix}2\\0\end{bmatrix}, \begin{bmatrix}-1\\0\end{bmatrix}, \begin{bmatrix}0\\1\end{bmatrix}\right]. \tag{2.126}$$
To compute the kernel (null space) of Φ, we need to solve Ax = 0, i.e., we need to solve a homogeneous equation system. To do this, we use Gaussian elimination to transform A into reduced row-echelon form:
$$\begin{bmatrix}1 & 2 & -1 & 0\\ 1 & 0 & 0 & 1\end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix}1 & 0 & 0 & 1\\ 0 & 1 & -\tfrac12 & -\tfrac12\end{bmatrix}. \tag{2.127}$$
This matrix is in reduced row-echelon form, and we can use the Minus-1 Trick to compute a basis of the kernel (see Section 2.3.3). Alternatively, we can express the non-pivot columns (columns 3 and 4) as linear combinations of the pivot columns (columns 1 and 2). The third column a3 is equivalent to −½ times the second column a2 . Therefore, 0 = a3 + ½a2 . In the same way, we see that a4 = a1 − ½a2 and, therefore, 0 = a1 − ½a2 − a4 . Overall, this gives us the kernel (null space) as
$$\ker(\Phi) = \operatorname{span}\!\left[\begin{bmatrix}0\\ \tfrac12\\ 1\\ 0\end{bmatrix}, \begin{bmatrix}-1\\ \tfrac12\\ 0\\ 1\end{bmatrix}\right]. \tag{2.128}$$
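A quick numerical cross-check of Example 2.25 (a sketch outside the book, assuming NumPy and SciPy are available): the rank of A gives dim(Im(Φ)), and scipy.linalg.null_space returns an orthonormal basis of the kernel, which spans the same subspace as (2.128).

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1, 2, -1, 0],
              [1, 0, 0, 1]])

rank = np.linalg.matrix_rank(A)      # dim(Im(Phi)) = 2
kernel = null_space(A)               # orthonormal basis of ker(Phi), shape (4, 2)

# Rank-nullity: dim(ker) + dim(Im) = dim(V) = 4
assert rank + kernel.shape[1] == A.shape[1]

# Every kernel basis vector is mapped to 0
print(np.allclose(A @ kernel, 0))    # True
```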

Theorem 2.24 (Rank-Nullity Theorem). For vector spaces V, W and a linear mapping Φ : V → W it holds that
$$\dim(\ker(\Phi)) + \dim(\operatorname{Im}(\Phi)) = \dim(V)\,. \tag{2.129}$$

The rank-nullity theorem is also referred to as the fundamental theorem of linear mappings (Axler, 2015, theorem 3.22). The following are direct consequences of Theorem 2.24:

If dim(Im(Φ)) < dim(V ), then ker(Φ) is non-trivial, i.e., the kernel contains more than 0V and dim(ker(Φ)) ⩾ 1.
If AΦ is the transformation matrix of Φ with respect to an ordered basis
and dim(Im(Φ)) < dim(V ), then the system of linear equations AΦ x =
0 has infinitely many solutions.
If dim(V ) = dim(W ), then the three-way equivalence

Φ is injective ⇐⇒ Φ is surjective ⇐⇒ Φ is bijective


holds since Im(Φ) ⊆ W .


2.8 Affine Spaces


In the following, we will take a closer look at spaces that are offset from
the origin, i.e., spaces that are no longer vector subspaces. Moreover, we
will briefly discuss properties of mappings between these affine spaces,
which resemble linear mappings.
Remark. In the machine learning literature, the distinction between linear
and affine is sometimes not clear so that we can find references to affine
spaces/mappings as linear spaces/mappings. ♢

2.8.1 Affine Subspaces


Definition 2.25 (Affine Subspace). Let V be a vector space, x0 ∈ V and
U ⊆ V a subspace. Then the subset
L = x0 + U := {x0 + u : u ∈ U } (2.130a)
= {v ∈ V |∃u ∈ U : v = x0 + u} ⊆ V (2.130b)
is called affine subspace or linear manifold of V . U is called direction or direction space, and x0 is called support point. In Chapter 12, we refer to such a subspace as a hyperplane.

Note that the definition of an affine subspace excludes 0 if x0 ∉ U . Therefore, an affine subspace is not a (linear) subspace (vector subspace) of V for x0 ∉ U .
Examples of affine subspaces are points, lines, and planes in R3 , which do not (necessarily) go through the origin.
Remark. Consider two affine subspaces L = x0 + U and L̃ = x̃0 + Ũ of a vector space V . Then, L ⊆ L̃ if and only if U ⊆ Ũ and x0 − x̃0 ∈ Ũ .
Affine subspaces are often described by parameters: Consider a k-dimensional affine space L = x0 + U of V . If (b1 , . . . , bk ) is an ordered basis of U , then every element x ∈ L can be uniquely described as
$$x = x_0 + \lambda_1 b_1 + \dots + \lambda_k b_k\,, \tag{2.131}$$
where λ1 , . . . , λk ∈ R. This representation is called the parametric equation of L with directional vectors b1 , . . . , bk and parameters λ1 , . . . , λk . ♢

Example 2.26 (Affine Subspaces)

One-dimensional affine subspaces are called lines and can be written as y = x0 + λb1 , where λ ∈ R and U = span[b1 ] ⊆ Rn is a one-dimensional subspace of Rn . This means that a line is defined by a support point x0 and a vector b1 that defines the direction. See Figure 2.13 for an illustration.
Two-dimensional affine subspaces of Rn are called planes. The parametric equation for planes is y = x0 + λ1 b1 + λ2 b2 , where λ1 , λ2 ∈ R and U = span[b1 , b2 ] ⊆ Rn . This means that a plane is defined by a support point x0 and two linearly independent vectors b1 , b2 that span the direction space.
In Rn , the (n − 1)-dimensional affine subspaces are called hyperplanes, and the corresponding parametric equation is $y = x_0 + \sum_{i=1}^{n-1}\lambda_i b_i$, where b1 , . . . , bn−1 form a basis of an (n − 1)-dimensional subspace U of Rn . This means that a hyperplane is defined by a support point x0 and (n − 1) linearly independent vectors b1 , . . . , bn−1 that span the direction space. In R2 , a line is also a hyperplane. In R3 , a plane is also a hyperplane.

[Figure 2.13: Lines are affine subspaces. Vectors y on a line x0 + λb1 lie in an affine subspace L with support point x0 and direction b1.]

Remark (Inhomogeneous systems of linear equations and affine subspaces). For A ∈ Rm×n and x ∈ Rm , the solution of the system of linear equations Aλ = x is either the empty set or an affine subspace of Rn of dimension n − rk(A). In particular, the solution of the linear equation λ1 b1 + . . . + λn bn = x, where (λ1 , . . . , λn ) ≠ (0, . . . , 0), is a hyperplane in Rn .
In Rn , every k-dimensional affine subspace is the solution of an inhomogeneous system of linear equations Ax = b, where A ∈ Rm×n , b ∈ Rm and rk(A) = n − k . Recall that for homogeneous equation systems Ax = 0 the solution was a vector subspace, which we can also think of as a special affine space with support point x0 = 0. ♢
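The remark can be made concrete with a small numerical sketch (not from the book; it assumes NumPy and SciPy are available and that the system is solvable): the solution set of Ax = b is a particular solution (support point) plus the null space of A (direction space), i.e., an affine subspace of dimension n − rk(A).

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 2.0, -1.0, 0.0],
              [1.0, 0.0,  0.0, 1.0]])
b = np.array([2.0, 1.0])

# Particular solution (support point) via least squares
x_p, *_ = np.linalg.lstsq(A, b, rcond=None)

# Direction space: ker(A), of dimension n - rk(A)
U = null_space(A)

# Any point x_p + U @ lam solves Ax = b
lam = np.array([0.7, -1.3])
x = x_p + U @ lam
print(np.allclose(A @ x, b))   # True
```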

2.8.2 Affine Mappings


Similar to linear mappings between vector spaces, which we discussed
in Section 2.7, we can define affine mappings between two affine spaces.
Linear and affine mappings are closely related. Therefore, many properties
that we already know from linear mappings, e.g., that the composition of
linear mappings is a linear mapping, also hold for affine mappings.

Definition 2.26 (Affine Mapping). For two vector spaces V, W , a linear mapping Φ : V → W , and a ∈ W , the mapping
$$\phi : V \to W \tag{2.132}$$
$$x \mapsto a + \Phi(x) \tag{2.133}$$
is an affine mapping from V to W . The vector a is called the translation vector of ϕ.

Every affine mapping ϕ : V → W is also the composition of a linear


mapping Φ : V → W and a translation τ : W → W in W , such that
ϕ = τ ◦ Φ. The mappings Φ and τ are uniquely determined.
The composition ϕ′ ◦ ϕ of affine mappings ϕ : V → W , ϕ′ : W → X is
affine.
If ϕ is bijective, affine mappings keep the geometric structure invariant.
They then also preserve the dimension and parallelism.
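As a small illustration of these properties (not from the book; the matrices and vectors below are arbitrary examples, NumPy assumed), an affine mapping is a linear map followed by a translation, and the composition of two affine mappings is again affine:

```python
import numpy as np

Phi = np.array([[1.0, 2.0],
                [0.0, 1.0]])        # linear part of phi
a = np.array([3.0, -1.0])           # translation vector of phi

def phi(x):
    # affine mapping phi(x) = a + Phi(x)
    return a + Phi @ x

Psi = np.array([[2.0, 0.0],
                [1.0, 1.0]])        # linear part of phi'
c = np.array([0.5, 0.5])            # translation vector of phi'

def phi_prime(x):
    return c + Psi @ x

# The composition phi'(phi(x)) is affine with linear part Psi @ Phi
# and translation vector c + Psi @ a
x = np.array([1.0, 2.0])
lhs = phi_prime(phi(x))
rhs = (c + Psi @ a) + (Psi @ Phi) @ x
print(np.allclose(lhs, rhs))   # True
```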

2.9 Further Reading


There are many resources for learning linear algebra, including the text-
books by Strang (2003), Golan (2007), Axler (2015), and Liesen and
Mehrmann (2015). There are also several online resources that we men-
tioned in the introduction to this chapter. We only covered Gaussian elim-
ination here, but there are many other approaches for solving systems of
linear equations, and we refer to numerical linear algebra textbooks by
Stoer and Burlirsch (2002), Golub and Van Loan (2012), and Horn and
Johnson (2013) for an in-depth discussion.
In this book, we distinguish between the topics of linear algebra (e.g.,
vectors, matrices, linear independence, basis) and topics related to the
geometry of a vector space. In Chapter 3, we will introduce the inner
product, which induces a norm. These concepts allow us to define angles,
lengths and distances, which we will use for orthogonal projections. Pro-
jections turn out to be key in many machine learning algorithms, such as
linear regression and principal component analysis, both of which we will
cover in Chapters 9 and 10, respectively.


Exercises
2.1 We consider (R\{−1}, ⋆), where

a ⋆ b := ab + a + b, a, b ∈ R\{−1} (2.134)

a. Show that (R\{−1}, ⋆) is an Abelian group.


b. Solve

3 ⋆ x ⋆ x = 15

in the Abelian group (R\{−1}, ⋆), where ⋆ is defined in (2.134).


2.2 Let n be in N\{0}. Let k, x be in Z. We define the congruence class k̄ of the
integer k as the set

$$\bar{k} = \{x \in \mathbb{Z} \mid x - k = 0\ (\mathrm{mod}\ n)\} = \{x \in \mathbb{Z} \mid \exists a \in \mathbb{Z} : (x - k = n \cdot a)\}\,.$$

We now define Z/nZ (sometimes written Zn ) as the set of all congruence


classes modulo n. Euclidean division implies that this set is a finite set con-
taining n elements:

Zn = {0, 1, . . . , n − 1}

For all $\bar{a}, \bar{b} \in \mathbb{Z}_n$, we define

$$\bar{a} \oplus \bar{b} := \overline{a + b}$$

a. Show that (Zn , ⊕) is a group. Is it Abelian?


b. We now define another operation ⊗ for all $\bar{a}$ and $\bar{b}$ in Zn as
$$\bar{a} \otimes \bar{b} = \overline{a \times b}\,, \tag{2.135}$$
where a × b represents the usual multiplication in Z.


Let n = 5. Draw the times table of the elements of Z5 \{0} under ⊗, i.e.,
calculate the products a ⊗ b for all a and b in Z5 \{0}.
Hence, show that Z5 \{0} is closed under ⊗ and possesses a neutral
element for ⊗. Display the inverse of all elements in Z5 \{0} under ⊗.
Conclude that (Z5 \{0}, ⊗) is an Abelian group.
c. Show that (Z8 \{0}, ⊗) is not a group.
d. We recall that the Bézout theorem states that two integers a and b are
relatively prime (i.e., gcd(a, b) = 1) if and only if there exist two integers
u and v such that au + bv = 1. Show that (Zn \{0}, ⊗) is a group if and
only if n ∈ N\{0} is prime.
2.3 Consider the set G of 3 × 3 matrices defined as follows:
$$G = \left\{\begin{bmatrix}1 & x & z\\ 0 & 1 & y\\ 0 & 0 & 1\end{bmatrix} \in \mathbb{R}^{3\times 3} \;\middle|\; x, y, z \in \mathbb{R}\right\}$$

We define · as the standard matrix multiplication.


Is (G, ·) a group? If yes, is it Abelian? Justify your answer.
2.4 Compute the following matrix products, if possible:


a.
$$\begin{bmatrix}1 & 2\\ 4 & 5\\ 7 & 8\end{bmatrix}\begin{bmatrix}1 & 1 & 0\\ 0 & 1 & 1\\ 1 & 0 & 1\end{bmatrix}$$
b.
$$\begin{bmatrix}1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9\end{bmatrix}\begin{bmatrix}1 & 1 & 0\\ 0 & 1 & 1\\ 1 & 0 & 1\end{bmatrix}$$
c.
$$\begin{bmatrix}1 & 1 & 0\\ 0 & 1 & 1\\ 1 & 0 & 1\end{bmatrix}\begin{bmatrix}1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9\end{bmatrix}$$
d.
$$\begin{bmatrix}1 & 2 & 1 & 2\\ 4 & 1 & -1 & -4\end{bmatrix}\begin{bmatrix}0 & 3\\ 1 & -1\\ 2 & 1\\ 5 & 2\end{bmatrix}$$
e.
$$\begin{bmatrix}0 & 3\\ 1 & -1\\ 2 & 1\\ 5 & 2\end{bmatrix}\begin{bmatrix}1 & 2 & 1 & 2\\ 4 & 1 & -1 & -4\end{bmatrix}$$

2.5 Find the set S of all solutions in x of the following inhomogeneous linear
systems Ax = b, where A and b are defined as follows:
a.
$$A = \begin{bmatrix}1 & 1 & -1 & -1\\ 2 & 5 & -7 & -5\\ 2 & -1 & 1 & 3\\ 5 & 2 & -4 & 2\end{bmatrix}, \quad b = \begin{bmatrix}1\\ -2\\ 4\\ 6\end{bmatrix}$$
b.
$$A = \begin{bmatrix}1 & -1 & 0 & 0 & 1\\ 1 & 1 & 0 & -3 & 0\\ 2 & -1 & 0 & 1 & -1\\ -1 & 2 & 0 & -2 & -1\end{bmatrix}, \quad b = \begin{bmatrix}3\\ 6\\ 5\\ -1\end{bmatrix}$$

2.6 Using Gaussian elimination, find all solutions of the inhomogeneous equation system Ax = b with
$$A = \begin{bmatrix}0 & 1 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 & 1 & 0\\ 0 & 1 & 0 & 0 & 0 & 1\end{bmatrix}, \quad b = \begin{bmatrix}2\\ -1\\ 1\end{bmatrix}.$$

2.7 Find all solutions in $x = \begin{bmatrix}x_1\\ x_2\\ x_3\end{bmatrix} \in \mathbb{R}^3$ of the equation system Ax = 12x, where
$$A = \begin{bmatrix}6 & 4 & 3\\ 6 & 0 & 9\\ 0 & 8 & 0\end{bmatrix}$$
and $\sum_{i=1}^{3} x_i = 1$.
2.8 Determine the inverses of the following matrices if possible:
a.
$$A = \begin{bmatrix}2 & 3 & 4\\ 3 & 4 & 5\\ 4 & 5 & 6\end{bmatrix}$$
b.
$$A = \begin{bmatrix}1 & 0 & 1 & 0\\ 0 & 1 & 1 & 0\\ 1 & 1 & 0 & 1\\ 1 & 1 & 1 & 0\end{bmatrix}$$

2.9 Which of the following sets are subspaces of R3 ?


a. A = {(λ, λ + µ³, λ − µ³) | λ, µ ∈ R}
b. B = {(λ², −λ², 0) | λ ∈ R}
c. Let γ be in R.
C = {(ξ1 , ξ2 , ξ3 ) ∈ R3 | ξ1 − 2ξ2 + 3ξ3 = γ}
d. D = {(ξ1 , ξ2 , ξ3 ) ∈ R3 | ξ2 ∈ Z}
2.10 Are the following sets of vectors linearly independent?
a.
$$x_1 = \begin{bmatrix}2\\ -1\\ 3\end{bmatrix}, \quad x_2 = \begin{bmatrix}1\\ 1\\ -2\end{bmatrix}, \quad x_3 = \begin{bmatrix}3\\ -3\\ 8\end{bmatrix}$$
b.
$$x_1 = \begin{bmatrix}1\\ 2\\ 1\\ 0\\ 0\end{bmatrix}, \quad x_2 = \begin{bmatrix}1\\ 1\\ 0\\ 1\\ 1\end{bmatrix}, \quad x_3 = \begin{bmatrix}1\\ 0\\ 0\\ 1\\ 1\end{bmatrix}$$
2.11 Write
$$y = \begin{bmatrix}1\\ -2\\ 5\end{bmatrix}$$
as linear combination of
$$x_1 = \begin{bmatrix}1\\ 1\\ 1\end{bmatrix}, \quad x_2 = \begin{bmatrix}1\\ 2\\ 3\end{bmatrix}, \quad x_3 = \begin{bmatrix}2\\ -1\\ 1\end{bmatrix}$$


2.12 Consider two subspaces of R4 :
$$U_1 = \operatorname{span}\!\left[\begin{bmatrix}1\\ 1\\ -3\\ 1\end{bmatrix}, \begin{bmatrix}2\\ -1\\ 0\\ -1\end{bmatrix}, \begin{bmatrix}-1\\ 1\\ -1\\ 1\end{bmatrix}\right], \quad U_2 = \operatorname{span}\!\left[\begin{bmatrix}-1\\ -2\\ 2\\ 1\end{bmatrix}, \begin{bmatrix}2\\ -2\\ 0\\ 0\end{bmatrix}, \begin{bmatrix}-3\\ 6\\ -2\\ -1\end{bmatrix}\right].$$
Determine a basis of U1 ∩ U2 .
2.13 Consider two subspaces U1 and U2 , where U1 is the solution space of the
homogeneous equation system A1 x = 0 and U2 is the solution space of the
homogeneous equation system A2 x = 0 with
   
$$A_1 = \begin{bmatrix}1 & 0 & 1\\ 1 & -2 & -1\\ 2 & 1 & 3\\ 1 & 0 & 1\end{bmatrix}, \quad A_2 = \begin{bmatrix}3 & -3 & 0\\ 1 & 2 & 3\\ 7 & -5 & 2\\ 3 & -1 & 2\end{bmatrix}.$$

a. Determine the dimension of U1 , U2 .


b. Determine bases of U1 and U2 .
c. Determine a basis of U1 ∩ U2 .
2.14 Consider two subspaces U1 and U2 , where U1 is spanned by the columns of
A1 and U2 is spanned by the columns of A2 with
   
$$A_1 = \begin{bmatrix}1 & 0 & 1\\ 1 & -2 & -1\\ 2 & 1 & 3\\ 1 & 0 & 1\end{bmatrix}, \quad A_2 = \begin{bmatrix}3 & -3 & 0\\ 1 & 2 & 3\\ 7 & -5 & 2\\ 3 & -1 & 2\end{bmatrix}.$$

a. Determine the dimension of U1 , U2


b. Determine bases of U1 and U2
c. Determine a basis of U1 ∩ U2
2.15 Let F = {(x, y, z) ∈ R3 | x+y−z = 0} and G = {(a−b, a+b, a−3b) | a, b ∈ R}.
a. Show that F and G are subspaces of R3 .
b. Calculate F ∩ G without resorting to any basis vector.
c. Find one basis for F and one for G, calculate F ∩G using the basis vectors
previously found and check your result with the previous question.
2.16 Are the following mappings linear?
a. Let a, b ∈ R.
$$\Phi : L^1([a, b]) \to \mathbb{R}$$
$$f \mapsto \Phi(f) = \int_a^b f(x)\,dx\,,$$

where L1 ([a, b]) denotes the set of integrable functions on [a, b].
b.
Φ : C1 → C0
f 7→ Φ(f ) = f ′ ,

where for k ⩾ 1, C k denotes the set of k times continuously differen-


tiable functions, and C 0 denotes the set of continuous functions.


c.
Φ:R→R
x 7→ Φ(x) = cos(x)

d.
$$\Phi : \mathbb{R}^3 \to \mathbb{R}^2$$
$$x \mapsto \begin{bmatrix}1 & 2 & 3\\ 1 & 4 & 3\end{bmatrix} x$$

e. Let θ be in [0, 2π[ and
$$\Phi : \mathbb{R}^2 \to \mathbb{R}^2$$
$$x \mapsto \begin{bmatrix}\cos(\theta) & \sin(\theta)\\ -\sin(\theta) & \cos(\theta)\end{bmatrix} x$$

2.17 Consider the linear mapping


$$\Phi : \mathbb{R}^3 \to \mathbb{R}^4$$
$$\Phi\!\left(\begin{bmatrix}x_1\\ x_2\\ x_3\end{bmatrix}\right) = \begin{bmatrix}3x_1 + 2x_2 + x_3\\ x_1 + x_2 + x_3\\ x_1 - 3x_2\\ 2x_1 + 3x_2 + x_3\end{bmatrix}$$

Find the transformation matrix AΦ .


Determine rk(AΦ ).
Compute the kernel and image of Φ. What are dim(ker(Φ)) and dim(Im(Φ))?
2.18 Let E be a vector space. Let f and g be two automorphisms on E such that
f ◦ g = idE (i.e., f ◦ g is the identity mapping idE ). Show that ker(f ) =
ker(g ◦ f ), Im(g) = Im(g ◦ f ) and that ker(f ) ∩ Im(g) = {0E }.
2.19 Consider an endomorphism Φ : R3 → R3 whose transformation matrix
(with respect to the standard basis in R3 ) is
 
$$A_\Phi = \begin{bmatrix}1 & 1 & 0\\ 1 & -1 & 0\\ 1 & 1 & 1\end{bmatrix}.$$

a. Determine ker(Φ) and Im(Φ).


b. Determine the transformation matrix ÃΦ with respect to the basis
     
$$B = \left(\begin{bmatrix}1\\ 1\\ 1\end{bmatrix}, \begin{bmatrix}1\\ 2\\ 1\end{bmatrix}, \begin{bmatrix}1\\ 0\\ 0\end{bmatrix}\right),$$

i.e., perform a basis change toward the new basis B .


2.20 Let us consider b1 , b2 , b′1 , b′2 , 4 vectors of R2 expressed in the standard basis
of R2 as
       
$$b_1 = \begin{bmatrix}2\\ 1\end{bmatrix}, \quad b_2 = \begin{bmatrix}-1\\ -1\end{bmatrix}, \quad b'_1 = \begin{bmatrix}2\\ -2\end{bmatrix}, \quad b'_2 = \begin{bmatrix}1\\ 1\end{bmatrix}$$

and let us define two ordered bases B = (b1 , b2 ) and B ′ = (b′1 , b′2 ) of R2 .


a. Show that B and B ′ are two bases of R2 and draw those basis vectors.
b. Compute the matrix P 1 that performs a basis change from B ′ to B .
c. We consider c1 , c2 , c3 , three vectors of R3 defined in the standard basis
of R3 as
     
$$c_1 = \begin{bmatrix}1\\ 2\\ -1\end{bmatrix}, \quad c_2 = \begin{bmatrix}0\\ -1\\ 2\end{bmatrix}, \quad c_3 = \begin{bmatrix}1\\ 0\\ -1\end{bmatrix}$$

and we define C = (c1 , c2 , c3 ).


(i) Show that C is a basis of R3 , e.g., by using determinants (see
Section 4.1).
(ii) Let us call C ′ = (c′1 , c′2 , c′3 ) the standard basis of R3 . Determine
the matrix P 2 that performs the basis change from C to C ′ .
d. We consider a homomorphism Φ : R2 −→ R3 , such that
Φ(b1 + b2 ) = c2 + c3
Φ(b1 − b2 ) = 2c1 − c2 + 3c3

where B = (b1 , b2 ) and C = (c1 , c2 , c3 ) are ordered bases of R2 and R3 ,


respectively.
Determine the transformation matrix AΦ of Φ with respect to the or-
dered bases B and C .
e. Determine A′ , the transformation matrix of Φ with respect to the bases
B ′ and C ′ .
f. Let us consider the vector x ∈ R2 whose coordinates in B ′ are [2, 3]⊤ .
In other words, x = 2b′1 + 3b′2 .
(i) Calculate the coordinates of x in B .
(ii) Based on that, compute the coordinates of Φ(x) expressed in C .
(iii) Then, write Φ(x) in terms of c′1 , c′2 , c′3 .
(iv) Use the representation of x in B ′ and the matrix A′ to find this
result directly.



3 Analytic Geometry

In Chapter 2, we studied vectors, vector spaces, and linear mappings at a general but abstract level. In this chapter, we will add some geometric interpretation and intuition to all of these concepts. In particular, we
will look at geometric vectors and compute their lengths and distances
or angles between two vectors. To be able to do this, we equip the vec-
tor space with an inner product that induces the geometry of the vector
space. Inner products and their corresponding norms and metrics capture
the intuitive notions of similarity and distances, which we use to develop
the support vector machine in Chapter 12. We will then use the concepts
of lengths and angles between vectors to discuss orthogonal projections,
which will play a central role when we discuss principal component anal-
ysis in Chapter 10 and regression via maximum likelihood estimation in
Chapter 9. Figure 3.1 gives an overview of how concepts in this chapter
are related and how they are connected to other chapters of the book.

[Figure 3.1: A mind map of the concepts introduced in this chapter, along with when they are used in other parts of the book.]

[Figure 3.3: For different norms, the red lines indicate the set of vectors with norm 1. Left: Manhattan norm (∥x∥1 = 1); Right: Euclidean distance (∥x∥2 = 1).]

3.1 Norms
When we think of geometric vectors, i.e., directed line segments that start
at the origin, then intuitively the length of a vector is the distance of the
“end” of this directed line segment from the origin. In the following, we
will discuss the notion of the length of vectors using the concept of a norm.

Definition 3.1 (Norm). A norm on a vector space V is a function
$$\|\cdot\| : V \to \mathbb{R}\,, \tag{3.1}$$
$$x \mapsto \|x\|\,, \tag{3.2}$$
which assigns each vector x its length ∥x∥ ∈ R, such that for all λ ∈ R and x, y ∈ V the following hold:

Absolutely homogeneous: ∥λx∥ = |λ|∥x∥
Triangle inequality: ∥x + y∥ ⩽ ∥x∥ + ∥y∥
Positive definite: ∥x∥ ⩾ 0 and ∥x∥ = 0 ⇐⇒ x = 0

[Figure 3.2: Triangle inequality, c ⩽ a + b.]

In geometric terms, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side; see Figure 3.2 for an illustration. Definition 3.1 is in terms of a general vector space V (Section 2.4), but in this book we will only consider a finite-dimensional vector space Rn . Recall that for a vector x ∈ Rn we denote the elements of the vector using a subscript, that is, xi is the ith element of the vector x.

Example 3.1 (Manhattan Norm)


The Manhattan norm on Rn is defined for x ∈ Rn as
$$\|x\|_1 := \sum_{i=1}^{n} |x_i|\,, \tag{3.3}$$
where | · | is the absolute value. The left panel of Figure 3.3 shows all vectors x ∈ R2 with ∥x∥1 = 1. The Manhattan norm is also called ℓ1 norm.


Example 3.2 (Euclidean Norm)


The Euclidean norm of x ∈ Rn is defined as
$$\|x\|_2 := \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x^\top x} \tag{3.4}$$
and computes the Euclidean distance of x from the origin. The right panel of Figure 3.3 shows all vectors x ∈ R2 with ∥x∥2 = 1. The Euclidean norm is also called ℓ2 norm.

Remark. Throughout this book, we will use the Euclidean norm (3.4) by
default if not stated otherwise. ♢
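Both norms are directly available in NumPy; the following is a small sketch (not from the book) that evaluates Examples 3.1 and 3.2 for an arbitrary vector:

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])

l1 = np.linalg.norm(x, ord=1)   # Manhattan norm: |1| + |-2| + |3| = 6
l2 = np.linalg.norm(x, ord=2)   # Euclidean norm: sqrt(1 + 4 + 9)

print(l1, l2)                   # 6.0 3.7416...
assert np.isclose(l1, np.sum(np.abs(x)))
assert np.isclose(l2, np.sqrt(x @ x))
```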

3.2 Inner Products


Inner products allow for the introduction of intuitive geometrical con-
cepts, such as the length of a vector and the angle or distance between
two vectors. A major purpose of inner products is to determine whether
vectors are orthogonal to each other.

3.2.1 Dot Product


We may already be familiar with a particular type of inner product, the scalar product/dot product in Rn , which is given by
$$x^\top y = \sum_{i=1}^{n} x_i y_i\,. \tag{3.5}$$

We will refer to this particular inner product as the dot product in this
book. However, inner products are more general concepts with specific
properties, which we will now introduce.

3.2.2 General Inner Products


Recall the linear mapping from Section 2.7, where we can rearrange the mapping with respect to addition and multiplication with a scalar. A bilinear mapping Ω is a mapping with two arguments, and it is linear in each argument, i.e., when we look at a vector space V then it holds that for all x, y, z ∈ V , λ, ψ ∈ R that
$$\Omega(\lambda x + \psi y, z) = \lambda\Omega(x, z) + \psi\Omega(y, z) \tag{3.6}$$
$$\Omega(x, \lambda y + \psi z) = \lambda\Omega(x, y) + \psi\Omega(x, z)\,. \tag{3.7}$$
Here, (3.6) asserts that Ω is linear in the first argument, and (3.7) asserts
that Ω is linear in the second argument (see also (2.87)).


Definition 3.2. Let V be a vector space and Ω : V × V → R be a bilinear mapping that takes two vectors and maps them onto a real number. Then
Ω is called symmetric if Ω(x, y) = Ω(y, x) for all x, y ∈ V , i.e., the order of the arguments does not matter.
Ω is called positive definite if
$$\forall x \in V\setminus\{0\}: \Omega(x, x) > 0\,, \quad \Omega(0, 0) = 0\,. \tag{3.8}$$

Definition 3.3. Let V be a vector space and Ω : V × V → R be a bilinear mapping that takes two vectors and maps them onto a real number. Then
A positive definite, symmetric bilinear mapping Ω : V × V → R is called an inner product on V . We typically write ⟨x, y⟩ instead of Ω(x, y).
The pair (V, ⟨·, ·⟩) is called an inner product space or (real) vector space with inner product. If we use the dot product defined in (3.5), we call (V, ⟨·, ·⟩) a Euclidean vector space.

We will refer to these spaces as inner product spaces in this book.

Example 3.3 (Inner Product That Is Not the Dot Product)


Consider V = R2 . If we define
⟨x, y⟩ := x1 y1 − (x1 y2 + x2 y1 ) + 2x2 y2 (3.9)
then ⟨·, ·⟩ is an inner product but different from the dot product. The proof
will be an exercise.

3.2.3 Symmetric, Positive Definite Matrices


Symmetric, positive definite matrices play an important role in machine
learning, and they are defined via the inner product. In Section 4.3, we
will return to symmetric, positive definite matrices in the context of matrix
decompositions. The idea of symmetric positive semidefinite matrices is
key in the definition of kernels (Section 12.4).
Consider an n-dimensional vector space V with an inner product ⟨·, ·⟩ : V × V → R (see Definition 3.3) and an ordered basis B = (b1 , . . . , bn ) of V . Recall from Section 2.6.1 that any vectors x, y ∈ V can be written as linear combinations of the basis vectors so that $x = \sum_{i=1}^{n}\psi_i b_i \in V$ and $y = \sum_{j=1}^{n}\lambda_j b_j \in V$ for suitable ψi , λj ∈ R. Due to the bilinearity of the inner product, it holds for all x, y ∈ V that
$$\langle x, y\rangle = \left\langle \sum_{i=1}^{n}\psi_i b_i,\ \sum_{j=1}^{n}\lambda_j b_j \right\rangle = \sum_{i=1}^{n}\sum_{j=1}^{n}\psi_i\,\langle b_i, b_j\rangle\,\lambda_j = \hat{x}^\top A \hat{y}\,, \tag{3.10}$$

where Aij := ⟨bi , bj ⟩ and x̂, ŷ are the coordinates of x and y with respect
to the basis B . This implies that the inner product ⟨·, ·⟩ is uniquely deter-
mined through A. The symmetry of the inner product also means that A


is symmetric. Furthermore, the positive definiteness of the inner product implies that
$$\forall x \in V\setminus\{0\}: x^\top A x > 0\,. \tag{3.11}$$

Definition 3.4 (Symmetric, Positive Definite Matrix). A symmetric matrix A ∈ Rn×n that satisfies (3.11) is called symmetric, positive definite, or just positive definite. If only ⩾ holds in (3.11), then A is called symmetric, positive semidefinite.

Example 3.4 (Symmetric, Positive Definite Matrices)


Consider the matrices
$$A_1 = \begin{bmatrix}9 & 6\\ 6 & 5\end{bmatrix}, \quad A_2 = \begin{bmatrix}9 & 6\\ 6 & 3\end{bmatrix}. \tag{3.12}$$
A1 is positive definite because it is symmetric and
$$x^\top A_1 x = \begin{bmatrix}x_1 & x_2\end{bmatrix}\begin{bmatrix}9 & 6\\ 6 & 5\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix} \tag{3.13a}$$
$$= 9x_1^2 + 12x_1x_2 + 5x_2^2 = (3x_1 + 2x_2)^2 + x_2^2 > 0 \tag{3.13b}$$
for all x ∈ V \{0}. In contrast, A2 is symmetric but not positive definite because x⊤A2x = 9x1² + 12x1x2 + 3x2² = (3x1 + 2x2)² − x2² can be less than 0, e.g., for x = [2, −3]⊤ .
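For a symmetric matrix, positive definiteness can also be checked numerically via its eigenvalues (all strictly positive). The following sketch is not from the book and assumes NumPy is available:

```python
import numpy as np

A1 = np.array([[9.0, 6.0],
               [6.0, 5.0]])
A2 = np.array([[9.0, 6.0],
               [6.0, 3.0]])

# A symmetric matrix is positive definite iff all eigenvalues are > 0
print(np.linalg.eigvalsh(A1))   # approx [ 0.68, 13.32]  -> positive definite
print(np.linalg.eigvalsh(A2))   # approx [-0.71, 12.71]  -> not positive definite

# The counterexample from Example 3.4: the quadratic form can be negative
x = np.array([2.0, -3.0])
print(x @ A2 @ x)               # -9.0 < 0
```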

If A ∈ Rn×n is symmetric, positive definite, then

⟨x, y⟩ = x̂⊤ Aŷ (3.14)

defines an inner product with respect to an ordered basis B , where x̂ and


ŷ are the coordinate representations of x, y ∈ V with respect to B .

Theorem 3.5. For a real-valued, finite-dimensional vector space V and an


ordered basis B of V , it holds that ⟨·, ·⟩ : V × V → R is an inner product if
and only if there exists a symmetric, positive definite matrix A ∈ Rn×n with

⟨x, y⟩ = x̂⊤ Aŷ . (3.15)

The following properties hold if A ∈ Rn×n is symmetric and positive


definite:

The null space (kernel) of A consists only of 0 because x⊤Ax > 0 for all x ≠ 0. This implies that Ax ≠ 0 if x ≠ 0.
The diagonal elements aii of A are positive because $a_{ii} = e_i^\top A e_i > 0$, where ei is the ith vector of the standard basis in Rn .


3.3 Lengths and Distances


In Section 3.1, we already discussed norms that we can use to compute the length of a vector. Inner products and norms are closely related in the sense that any inner product induces a norm
$$\|x\| := \sqrt{\langle x, x\rangle} \tag{3.16}$$
in a natural way, such that we can compute lengths of vectors using the inner product. However, not every norm is induced by an inner product. The Manhattan norm (3.3) is an example of a norm without a corresponding inner product. In the following, we will focus on norms that are induced by inner products and introduce geometric concepts, such as lengths, distances, and angles.
Remark (Cauchy-Schwarz Inequality). For an inner product vector space (V, ⟨·, ·⟩) the induced norm ∥ · ∥ satisfies the Cauchy-Schwarz inequality
$$|\langle x, y\rangle| \leqslant \|x\|\,\|y\|\,. \tag{3.17}$$

Example 3.5 (Lengths of Vectors Using Inner Products)


In geometry, we are often interested in lengths of vectors. We can now use an inner product to compute them using (3.16). Let us take x = [1, 1]⊤ ∈ R2 . If we use the dot product as the inner product, with (3.16) we obtain
$$\|x\| = \sqrt{x^\top x} = \sqrt{1^2 + 1^2} = \sqrt{2} \tag{3.18}$$
as the length of x. Let us now choose a different inner product:
$$\langle x, y\rangle := x^\top\begin{bmatrix}1 & -\tfrac12\\ -\tfrac12 & 1\end{bmatrix} y = x_1y_1 - \tfrac12(x_1y_2 + x_2y_1) + x_2y_2\,. \tag{3.19}$$
If we compute the norm of a vector, then this inner product returns smaller values than the dot product if x1 and x2 have the same sign (and x1x2 > 0); otherwise, it returns greater values than the dot product. With this inner product, we obtain
$$\langle x, x\rangle = x_1^2 - x_1x_2 + x_2^2 = 1 - 1 + 1 = 1 \implies \|x\| = \sqrt{1} = 1\,, \tag{3.20}$$
such that x is "shorter" with this inner product than with the dot product.
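The two computations in Example 3.5 can be reproduced numerically; a sketch, not from the book, assuming NumPy:

```python
import numpy as np

x = np.array([1.0, 1.0])

# Dot product as inner product: ||x|| = sqrt(x^T x) = sqrt(2)
print(np.sqrt(x @ x))            # 1.4142...

# Inner product from (3.19): <x, y> = x^T A y with A = [[1, -1/2], [-1/2, 1]]
A = np.array([[1.0, -0.5],
              [-0.5, 1.0]])
print(np.sqrt(x @ A @ x))        # 1.0
```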

Definition 3.6 (Distance and Metric). Consider an inner product space (V, ⟨·, ·⟩). Then
$$d(x, y) := \|x - y\| = \sqrt{\langle x - y,\ x - y\rangle} \tag{3.21}$$
is called the distance between x and y for x, y ∈ V . If we use the dot product as the inner product, then the distance is called Euclidean distance.


The mapping
$$d : V \times V \to \mathbb{R} \tag{3.22}$$
$$(x, y) \mapsto d(x, y) \tag{3.23}$$
is called a metric.

Remark. Similar to the length of a vector, the distance between vectors does not require an inner product: a norm is sufficient. If we have a norm induced by an inner product, the distance may vary depending on the choice of the inner product. ♢

A metric d satisfies the following:
1. d is positive definite, i.e., d(x, y) ⩾ 0 for all x, y ∈ V and d(x, y) = 0 ⇐⇒ x = y .
2. d is symmetric, i.e., d(x, y) = d(y, x) for all x, y ∈ V .
3. Triangle inequality: d(x, z) ⩽ d(x, y) + d(y, z) for all x, y, z ∈ V .

Remark. At first glance, the lists of properties of inner products and met-
rics look very similar. However, by comparing Definition 3.3 with Defini-
tion 3.6 we observe that ⟨x, y⟩ and d(x, y) behave in opposite directions.
Very similar x and y will result in a large value for the inner product and
a small value for the metric. ♢

3.4 Angles and Orthogonality


[Figure 3.4: When restricted to [0, π], f(ω) = cos(ω) returns a unique number in the interval [−1, 1].]

In addition to enabling the definition of lengths of vectors, as well as the distance between two vectors, inner products also capture the geometry of a vector space by defining the angle ω between two vectors. We use the Cauchy-Schwarz inequality (3.17) to define angles ω in inner product spaces between two vectors x, y , and this notion coincides with our intuition in R2 and R3 . Assume that x ≠ 0, y ≠ 0. Then
$$-1 \leqslant \frac{\langle x, y\rangle}{\|x\|\,\|y\|} \leqslant 1\,. \tag{3.24}$$
Therefore, there exists a unique ω ∈ [0, π], illustrated in Figure 3.4, with
$$\cos\omega = \frac{\langle x, y\rangle}{\|x\|\,\|y\|}\,. \tag{3.25}$$
The number ω is the angle between the vectors x and y . Intuitively, the angle between two vectors tells us how similar their orientations are. For example, using the dot product, the angle between x and y = 4x, i.e., y is a scaled version of x, is 0: Their orientation is the same.

Draft (2024-01-15) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.


3.4 Angles and Orthogonality 77

Example 3.6 (Angle between Vectors)


[Figure 3.5: The angle ω between two vectors x, y is computed using the inner product.]

Let us compute the angle between x = [1, 1]⊤ ∈ R2 and y = [1, 2]⊤ ∈ R2 ; see Figure 3.5, where we use the dot product as the inner product. Then we get
$$\cos\omega = \frac{\langle x, y\rangle}{\sqrt{\langle x, x\rangle\langle y, y\rangle}} = \frac{x^\top y}{\sqrt{x^\top x\; y^\top y}} = \frac{3}{\sqrt{10}}\,, \tag{3.26}$$
and the angle between the two vectors is arccos(3/√10) ≈ 0.32 rad, which corresponds to about 18°.

A key feature of the inner product is that it also allows us to characterize vectors that are orthogonal.

Definition 3.7 (Orthogonality). Two vectors x and y are orthogonal if and only if ⟨x, y⟩ = 0, and we write x ⊥ y . If additionally ∥x∥ = 1 = ∥y∥, i.e., the vectors are unit vectors, then x and y are orthonormal.

An implication of this definition is that the 0-vector is orthogonal to


every vector in the vector space.
Remark. Orthogonality is the generalization of the concept of perpendic-
ularity to bilinear forms that do not have to be the dot product. In our
context, geometrically, we can think of orthogonal vectors as having a
right angle with respect to a specific inner product. ♢

Example 3.7 (Orthogonal Vectors)

[Figure 3.6: The angle ω between two vectors x, y can change depending on the inner product.]

Consider two vectors x = [1, 1]⊤ , y = [−1, 1]⊤ ∈ R2 ; see Figure 3.6. We are interested in determining the angle ω between them using two different inner products. Using the dot product as the inner product yields an angle ω between x and y of 90°, such that x ⊥ y . However, if we choose the inner product
$$\langle x, y\rangle = x^\top\begin{bmatrix}2 & 0\\ 0 & 1\end{bmatrix} y\,, \tag{3.27}$$


we get that the angle ω between x and y is given by
$$\cos\omega = \frac{\langle x, y\rangle}{\|x\|\,\|y\|} = -\frac{1}{3} \implies \omega \approx 1.91\ \text{rad} \approx 109.5^\circ\,, \tag{3.28}$$
and x and y are not orthogonal. Therefore, vectors that are orthogonal with respect to one inner product do not have to be orthogonal with respect to a different inner product.
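The angle computations of Examples 3.6 and 3.7 can be verified numerically. The following sketch is not from the book and assumes NumPy; the helper function `angle` is a hypothetical convenience wrapper around (3.25):

```python
import numpy as np

def angle(x, y, A=None):
    """Angle between x and y for the inner product <x, y> = x^T A y.
    A = None means the dot product (A = I)."""
    if A is None:
        A = np.eye(len(x))
    inner = lambda u, v: u @ A @ v
    cos_omega = inner(x, y) / np.sqrt(inner(x, x) * inner(y, y))
    return np.arccos(cos_omega)

# Example 3.6: dot product, x = [1, 1], y = [1, 2]
print(angle(np.array([1.0, 1.0]), np.array([1.0, 2.0])))          # 0.3217... rad

# Example 3.7: x = [1, 1], y = [-1, 1]
x, y = np.array([1.0, 1.0]), np.array([-1.0, 1.0])
print(angle(x, y))                                                # pi/2: orthogonal
print(angle(x, y, A=np.array([[2.0, 0.0], [0.0, 1.0]])))          # 1.9106... rad
```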

Definition 3.8 (Orthogonal Matrix). A square matrix A ∈ Rn×n is an orthogonal matrix if and only if its columns are orthonormal so that
$$AA^\top = I = A^\top A\,, \tag{3.29}$$
which implies that
$$A^{-1} = A^\top\,, \tag{3.30}$$
i.e., the inverse is obtained by simply transposing the matrix. (It is convention to call these matrices "orthogonal", but a more precise description would be "orthonormal".)

Transformations by orthogonal matrices are special because the length of a vector x is not changed when transforming it using an orthogonal matrix A. For the dot product, we obtain
$$\|Ax\|^2 = (Ax)^\top(Ax) = x^\top A^\top A x = x^\top I x = x^\top x = \|x\|^2\,. \tag{3.31}$$
Moreover, the angle between any two vectors x, y , as measured by their inner product, is also unchanged when transforming both of them using an orthogonal matrix A. Assuming the dot product as the inner product, the angle of the images Ax and Ay is given as
$$\cos\omega = \frac{(Ax)^\top(Ay)}{\|Ax\|\,\|Ay\|} = \frac{x^\top A^\top A y}{\sqrt{x^\top A^\top A x\; y^\top A^\top A y}} = \frac{x^\top y}{\|x\|\,\|y\|}\,, \tag{3.32}$$
which gives exactly the angle between x and y . This means that orthogonal matrices A with A⊤ = A−1 preserve both angles and distances. It turns out that orthogonal matrices define transformations that are rotations (with the possibility of flips). In Section 3.9, we will discuss more details about rotations.
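A numerical check of (3.31) and (3.32) with a rotation matrix; a sketch, not from the book, assuming NumPy and an arbitrary rotation angle:

```python
import numpy as np

theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation, hence orthogonal

print(np.allclose(A.T @ A, np.eye(2)))            # True: A^T A = I

x = np.array([1.0, 2.0])
y = np.array([-3.0, 0.5])

# Lengths are preserved (3.31)
print(np.isclose(np.linalg.norm(A @ x), np.linalg.norm(x)))   # True

# Angles are preserved (3.32)
cos_before = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
cos_after = (A @ x) @ (A @ y) / (np.linalg.norm(A @ x) * np.linalg.norm(A @ y))
print(np.isclose(cos_before, cos_after))          # True
```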

3.5 Orthonormal Basis


In Section 2.6.1, we characterized properties of basis vectors and found
that in an n-dimensional vector space, we need n basis vectors, i.e., n
vectors that are linearly independent. In Sections 3.3 and 3.4, we used
inner products to compute the length of vectors and the angle between
vectors. In the following, we will discuss the special case where the basis
vectors are orthogonal to each other and where the length of each basis
vector is 1. We will call this basis then an orthonormal basis.


Let us introduce this more formally.

Definition 3.9 (Orthonormal Basis). Consider an n-dimensional vector space V and a basis {b1 , . . . , bn } of V . If
$$\langle b_i, b_j\rangle = 0 \quad\text{for } i \neq j \tag{3.33}$$
$$\langle b_i, b_i\rangle = 1 \tag{3.34}$$
for all i, j = 1, . . . , n then the basis is called an orthonormal basis (ONB). If only (3.33) is satisfied, then the basis is called an orthogonal basis. Note that (3.34) implies that every basis vector has length/norm 1.
that (3.34) implies that every basis vector has length/norm 1.

Recall from Section 2.6.1 that we can use Gaussian elimination to find a basis for a vector space spanned by a set of vectors. Assume we are given a set {b̃1 , . . . , b̃n } of non-orthogonal and unnormalized basis vectors. We concatenate them into a matrix B̃ = [b̃1 , . . . , b̃n ] and apply Gaussian elimination to the augmented matrix (Section 2.3.2) [B̃B̃⊤ | B̃] to obtain an orthonormal basis. This constructive way to iteratively build an orthonormal basis {b1 , . . . , bn } is called the Gram-Schmidt process (Strang, 2003).

Example 3.8 (Orthonormal Basis)


The canonical/standard basis for a Euclidean vector space Rn is an orthonormal basis, where the inner product is the dot product of vectors. In R2 , the vectors
$$b_1 = \frac{1}{\sqrt{2}}\begin{bmatrix}1\\ 1\end{bmatrix}, \quad b_2 = \frac{1}{\sqrt{2}}\begin{bmatrix}1\\ -1\end{bmatrix} \tag{3.35}$$
form an orthonormal basis since $b_1^\top b_2 = 0$ and ∥b1∥ = 1 = ∥b2∥.

We will exploit the concept of an orthonormal basis in Chapter 12 and


Chapter 10 when we discuss support vector machines and principal com-
ponent analysis.

3.6 Orthogonal Complement


Having defined orthogonality, we will now look at vector spaces that are
orthogonal to each other. This will play an important role in Chapter 10,
when we discuss linear dimensionality reduction from a geometric per-
spective.
Consider a D-dimensional vector space V and an M -dimensional subspace U ⊆ V . Then its orthogonal complement U⊥ is a (D − M )-dimensional subspace of V and contains all vectors in V that are orthogonal to every vector in U . Furthermore, U ∩ U⊥ = {0} so that any vector x ∈ V can be uniquely decomposed into
$$x = \sum_{m=1}^{M}\lambda_m b_m + \sum_{j=1}^{D-M}\psi_j b_j^\perp\,, \quad \lambda_m, \psi_j \in \mathbb{R}\,, \tag{3.36}$$
where $(b_1, \dots, b_M)$ is a basis of U and $(b_1^\perp, \dots, b_{D-M}^\perp)$ is a basis of U⊥ .

Therefore, the orthogonal complement can also be used to describe a plane U (two-dimensional subspace) in a three-dimensional vector space. More specifically, the vector w with ∥w∥ = 1, which is orthogonal to the plane U , is the basis vector of U⊥ . Figure 3.7 illustrates this setting. All vectors that are orthogonal to w must (by construction) lie in the plane U . The vector w is called the normal vector of U .

[Figure 3.7: A plane U in a three-dimensional vector space can be described by its normal vector, which spans its orthogonal complement U⊥.]
Generally, orthogonal complements can be used to describe hyperplanes
in n-dimensional vector and affine spaces.
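Numerically, a basis of U⊥ can be obtained as the null space of the matrix whose rows span U. This is a sketch, not from the book, assuming NumPy and SciPy and an arbitrary plane in R3:

```python
import numpy as np
from scipy.linalg import null_space

# U is the plane in R^3 spanned by b1, b2
b1 = np.array([1.0, 0.0, 1.0])
b2 = np.array([0.0, 1.0, 1.0])
B = np.vstack([b1, b2])          # rows span U

# U-perp = null space of B: all vectors orthogonal to b1 and b2
w = null_space(B)[:, 0]          # normal vector of the plane, ||w|| = 1
print(np.allclose(B @ w, 0))     # True
print(np.isclose(np.linalg.norm(w), 1.0))   # True
```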

3.7 Inner Product of Functions


Thus far, we looked at properties of inner products to compute lengths,
angles and distances. We focused on inner products of finite-dimensional
vectors. In the following, we will look at an example of inner products of
a different type of vectors: inner products of functions.
The inner products we discussed so far were defined for vectors with a
finite number of entries. We can think of a vector x ∈ Rn as a function
with n function values. The concept of an inner product can be generalized
to vectors with an infinite number of entries (countably infinite) and also
continuous-valued functions (uncountably infinite). Then the sum over
individual components of vectors (see Equation (3.5) for example) turns
into an integral.
An inner product of two functions u : R → R and v : R → R can be defined as the definite integral
$$\langle u, v\rangle := \int_a^b u(x)\,v(x)\,dx \tag{3.37}$$


for lower and upper limits a, b < ∞, respectively. As with our usual inner
product, we can define norms and orthogonality by looking at the inner
product. If (3.37) evaluates to 0, the functions u and v are orthogonal. To
make the preceding inner product mathematically precise, we need to take
care of measures and the definition of integrals, leading to the definition of
a Hilbert space. Furthermore, unlike inner products on finite-dimensional
vectors, inner products on functions may diverge (have infinite value). All
this requires diving into some more intricate details of real and functional
analysis, which we do not cover in this book.

Example 3.9 (Inner Product of Functions)


If we choose u = sin(x) and v = cos(x), the integrand f (x) = u(x)v(x) of (3.37) is shown in Figure 3.8. We see that this function is odd, i.e., f (−x) = −f (x). Therefore, the integral with limits a = −π, b = π of this product evaluates to 0. Therefore, sin and cos are orthogonal functions.

[Figure 3.8: f (x) = sin(x) cos(x).]

Remark. It also holds that the collection of functions
$$\{1, \cos(x), \cos(2x), \cos(3x), \dots\} \tag{3.38}$$
is orthogonal if we integrate from −π to π , i.e., any pair of functions are
orthogonal to each other. The collection of functions in (3.38) spans a
large subspace of the functions that are even and periodic on [−π, π), and
projecting functions onto this subspace is the fundamental idea behind
Fourier series. ♢
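The integral in Example 3.9 can be checked by numerical integration. This is a sketch, not from the book, assuming NumPy and SciPy; `inner` is a hypothetical helper implementing (3.37):

```python
import numpy as np
from scipy.integrate import quad

def inner(u, v, a=-np.pi, b=np.pi):
    """Inner product of functions, <u, v> = integral_a^b u(x) v(x) dx."""
    value, _ = quad(lambda x: u(x) * v(x), a, b)
    return value

print(inner(np.sin, np.cos))                    # ~0: sin and cos are orthogonal on [-pi, pi]
print(inner(np.cos, lambda x: np.cos(2 * x)))   # ~0 as well, cf. (3.38)
```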
In Section 6.4.6, we will have a look at a second type of unconventional
inner products: the inner product of random variables.

3.8 Orthogonal Projections


Projections are an important class of linear transformations (besides rota-
tions and reflections) and play an important role in graphics, coding the-
ory, statistics and machine learning. In machine learning, we often deal
with data that is high-dimensional. High-dimensional data is often hard
to analyze or visualize. However, high-dimensional data quite often pos-
sesses the property that only a few dimensions contain most information,
and most other dimensions are not essential to describe key properties
of the data. When we compress or visualize high-dimensional data, we
will lose information. To minimize this compression loss, we ideally find
the most informative dimensions in the data. As discussed in Chapter 1, data can be represented as vectors ("feature" is a common expression for data representation), and in this chapter, we will discuss some of the fundamental tools for data compression. More specifically, we can project the original high-dimensional data onto a lower-dimensional
feature space and work in this lower-dimensional space to learn more
about the dataset and extract relevant patterns. For example, machine


[Figure 3.9: Orthogonal projection (orange dots) of a two-dimensional dataset (blue dots) onto a one-dimensional subspace (straight line).]

learning algorithms, such as principal component analysis (PCA) by Pear-


son (1901) and Hotelling (1933) and deep neural networks (e.g., deep
auto-encoders (Deng et al., 2010)), heavily exploit the idea of dimension-
ality reduction. In the following, we will focus on orthogonal projections,
which we will use in Chapter 10 for linear dimensionality reduction and
in Chapter 12 for classification. Even linear regression, which we discuss
in Chapter 9, can be interpreted using orthogonal projections. For a given
lower-dimensional subspace, orthogonal projections of high-dimensional
data retain as much information as possible and minimize the difference/
error between the original data and the corresponding projection. An il-
lustration of such an orthogonal projection is given in Figure 3.9. Before
we detail how to obtain these projections, let us define what a projection
actually is.

Definition 3.10 (Projection). Let V be a vector space and U ⊆ V a subspace of V . A linear mapping π : V → U is called a projection if π² = π ◦ π = π.

Since linear mappings can be expressed by transformation matrices (see Section 2.7), the preceding definition applies equally to a special kind of transformation matrices, the projection matrices $P_\pi$, which exhibit the property that $P_\pi^2 = P_\pi$.

In the following, we will derive orthogonal projections of vectors in the inner product space (Rn , ⟨·, ·⟩) onto subspaces. We will start with one-dimensional subspaces, which are also called lines. If not mentioned otherwise, we assume the dot product ⟨x, y⟩ = x⊤y as the inner product.

3.8.1 Projection onto One-Dimensional Subspaces (Lines)


Assume we are given a line (one-dimensional subspace) through the ori-
gin with basis vector b ∈ Rn . The line is a one-dimensional subspace
U ⊆ Rn spanned by b. When we project x ∈ Rn onto U , we seek the
vector πU (x) ∈ U that is closest to x. Using geometric arguments, let

[Figure 3.10: Examples of projections onto one-dimensional subspaces. (a) Projection of x ∈ R2 onto a subspace U with basis vector b. (b) Projection of a two-dimensional vector x with ∥x∥ = 1 onto a one-dimensional subspace spanned by b.]

us characterize some properties of the projection πU (x) (Figure 3.10(a) serves as an illustration):

The projection πU (x) is closest to x, where "closest" implies that the distance ∥x − πU (x)∥ is minimal. It follows that the segment πU (x) − x from πU (x) to x is orthogonal to U , and therefore the basis vector b of U . The orthogonality condition yields ⟨πU (x) − x, b⟩ = 0 since angles between vectors are defined via the inner product.
The projection πU (x) of x onto U must be an element of U and, therefore, a multiple of the basis vector b that spans U . Hence, πU (x) = λb, for some λ ∈ R. (λ is then the coordinate of πU (x) with respect to b.)

In the following three steps, we determine the coordinate λ, the projection πU (x) ∈ U , and the projection matrix $P_\pi$ that maps any x ∈ Rn onto U :

1. Finding the coordinate λ. The orthogonality condition yields
$$\langle x - \pi_U(x), b\rangle = 0 \overset{\pi_U(x) = \lambda b}{\iff} \langle x - \lambda b, b\rangle = 0\,. \tag{3.39}$$
We can now exploit the bilinearity of the inner product and arrive at
$$\langle x, b\rangle - \lambda\langle b, b\rangle = 0 \iff \lambda = \frac{\langle x, b\rangle}{\langle b, b\rangle} = \frac{\langle b, x\rangle}{\|b\|^2}\,. \tag{3.40}$$
(With a general inner product, we get λ = ⟨x, b⟩ if ∥b∥ = 1.) In the last step, we exploited the fact that inner products are symmetric. If we choose ⟨·, ·⟩ to be the dot product, we obtain
$$\lambda = \frac{b^\top x}{b^\top b} = \frac{b^\top x}{\|b\|^2}\,. \tag{3.41}$$
If ∥b∥ = 1, then the coordinate λ of the projection is given by b⊤x.


2. Finding the projection point πU (x) ∈ U . Since πU (x) = λb, we immediately obtain with (3.40) that
$$\pi_U(x) = \lambda b = \frac{\langle x, b\rangle}{\|b\|^2}\, b = \frac{b^\top x}{\|b\|^2}\, b\,, \tag{3.42}$$
where the last equality holds for the dot product only. We can also compute the length of πU (x) by means of Definition 3.1 as
$$\|\pi_U(x)\| = \|\lambda b\| = |\lambda|\,\|b\|\,. \tag{3.43}$$
Hence, our projection is of length |λ| times the length of b. This also adds the intuition that λ is the coordinate of πU (x) with respect to the basis vector b that spans our one-dimensional subspace U .
If we use the dot product as an inner product, we get
$$\|\pi_U(x)\| \overset{(3.42)}{=} \frac{|b^\top x|}{\|b\|^2}\,\|b\| \overset{(3.25)}{=} |\cos\omega|\,\|x\|\,\|b\|\,\frac{\|b\|}{\|b\|^2} = |\cos\omega|\,\|x\|\,. \tag{3.44}$$
Here, ω is the angle between x and b. This equation should be familiar from trigonometry: If ∥x∥ = 1, then x lies on the unit circle. It follows that the projection onto the horizontal axis spanned by b is exactly cos ω , and the length of the corresponding vector πU (x) = |cos ω|. (The horizontal axis is a one-dimensional subspace.) An illustration is given in Figure 3.10(b).
3. Finding the projection matrix $P_\pi$. We know that a projection is a linear mapping (see Definition 3.10). Therefore, there exists a projection matrix $P_\pi$, such that $\pi_U(x) = P_\pi x$. With the dot product as inner product and
$$\pi_U(x) = \lambda b = b\lambda = b\,\frac{b^\top x}{\|b\|^2} = \frac{bb^\top}{\|b\|^2}\, x\,, \tag{3.45}$$
we immediately see that
$$P_\pi = \frac{bb^\top}{\|b\|^2}\,. \tag{3.46}$$
Note that bb⊤ (and, consequently, $P_\pi$) is a symmetric matrix (of rank 1), and ∥b∥² = ⟨b, b⟩ is a scalar; projection matrices are always symmetric.

The projection matrix $P_\pi$ projects any vector x ∈ Rn onto the line through the origin with direction b (equivalently, the subspace U spanned by b).

Remark. The projection πU (x) ∈ Rn is still an n-dimensional vector and not a scalar. However, we no longer require n coordinates to represent the projection, but only a single one if we want to express it with respect to the basis vector b that spans the subspace U : λ. ♢
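The three steps above translate directly into code. The following sketch is not from the book and assumes NumPy; b and x are arbitrary example vectors:

```python
import numpy as np

b = np.array([1.0, 2.0, 2.0])        # basis vector of the line U = span[b]
x = np.array([1.0, 1.0, 1.0])

# Coordinate lambda (3.41), projection (3.42), projection matrix (3.46)
lam = b @ x / (b @ b)
pi_x = lam * b
P = np.outer(b, b) / (b @ b)

print(np.allclose(P @ x, pi_x))              # True
print(np.allclose(P @ P, P))                 # True: P is idempotent
print(np.isclose((x - pi_x) @ b, 0.0))       # True: residual is orthogonal to b
```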
