
2.7 Linear Mappings

the kth column of T is the coordinate representation of c̃k with respect to C. Note that both S and T are regular.
We are going to look at Φ(b̃j) from two perspectives. First, applying the mapping Φ, we get that for all j = 1, . . . , n
$$\Phi(\tilde{b}_j) \overset{(2.107)}{=} \sum_{k=1}^{m} \tilde{a}_{kj}\,\underbrace{\tilde{c}_k}_{\in W} = \sum_{k=1}^{m} \tilde{a}_{kj}\left(\sum_{l=1}^{m} t_{lk}\, c_l\right) = \sum_{l=1}^{m}\left(\sum_{k=1}^{m} t_{lk}\,\tilde{a}_{kj}\right) c_l\,, \tag{2.108}$$

where we first expressed the new basis vectors c̃k ∈ W as linear combinations of the basis vectors cl ∈ W and then swapped the order of summation.
Alternatively, when we express the b̃j ∈ V as linear combinations of bj ∈ V , we arrive at
$$\Phi(\tilde{b}_j) \overset{(2.106)}{=} \Phi\!\left(\sum_{i=1}^{n} s_{ij}\, b_i\right) = \sum_{i=1}^{n} s_{ij}\,\Phi(b_i) = \sum_{i=1}^{n} s_{ij} \sum_{l=1}^{m} a_{li}\, c_l \tag{2.109a}$$
$$= \sum_{l=1}^{m}\left(\sum_{i=1}^{n} a_{li}\, s_{ij}\right) c_l\,, \quad j = 1,\dots,n\,, \tag{2.109b}$$

where we exploited the linearity of Φ. Comparing (2.108) and (2.109b), it follows for all j = 1, . . . , n and l = 1, . . . , m that
$$\sum_{k=1}^{m} t_{lk}\,\tilde{a}_{kj} = \sum_{i=1}^{n} a_{li}\, s_{ij} \tag{2.110}$$

and, therefore,

T ÃΦ = AΦ S ∈ Rm×n , (2.111)

such that

ÃΦ = T −1 AΦ S , (2.112)

which proves Theorem 2.20.

Theorem 2.20 tells us that with a basis change in V (B is replaced with B̃) and W (C is replaced with C̃), the transformation matrix AΦ of a
linear mapping Φ : V → W is replaced by an equivalent matrix ÃΦ with

ÃΦ = T −1 AΦ S. (2.113)

[Figure 2.11: For a homomorphism Φ : V → W and ordered bases B, B̃ of V and C, C̃ of W (marked in blue), we can express the mapping ΦC̃B̃ with respect to the bases B̃, C̃ equivalently as a composition of the homomorphisms ΦC̃B̃ = ΞC̃C ◦ ΦCB ◦ ΨBB̃ with respect to the bases in the subscripts. The corresponding transformation matrices are in red.]

Figure 2.11 illustrates this relation: Consider a homomorphism Φ : V → W and ordered bases B, B̃ of V and C, C̃ of W . The mapping ΦCB is an instantiation of Φ and maps basis vectors of B onto linear combinations of basis vectors of C . Assume that we know the transformation matrix AΦ of ΦCB with respect to the ordered bases B, C . When we perform a basis change from B to B̃ in V and from C to C̃ in W , we can determine the corresponding transformation matrix ÃΦ as follows: First, we find the matrix representation of the linear mapping ΨBB̃ : V → V that maps coordinates with respect to the new basis B̃ onto the (unique) coordinates with respect to the "old" basis B (in V ). Then, we use the transformation matrix AΦ of ΦCB : V → W to map these coordinates onto the coordinates with respect to C in W . Finally, we use a linear mapping ΞC̃C : W → W to map the coordinates with respect to C onto coordinates with respect to C̃ . Therefore, we can express the linear mapping ΦC̃B̃ as a composition of linear mappings that involve the "old" basis:
$$\Phi_{\tilde{C}\tilde{B}} = \Xi_{\tilde{C}C}\circ\Phi_{CB}\circ\Psi_{B\tilde{B}} = \Xi^{-1}_{C\tilde{C}}\circ\Phi_{CB}\circ\Psi_{B\tilde{B}}\,. \tag{2.114}$$
Concretely, we use ΨB B̃ = idV and ΞC C̃ = idW , i.e., the identity mappings
that map vectors onto themselves, but with respect to a different basis.

Definition 2.21 (Equivalence). Two matrices A, Ã ∈ Rm×n are equivalent if there exist regular matrices S ∈ Rn×n and T ∈ Rm×m , such that Ã = T −1 AS .

Definition 2.22 (Similarity). Two matrices A, Ã ∈ Rn×n are similar if there exists a regular matrix S ∈ Rn×n with Ã = S −1 AS .

Remark. Similar matrices are always equivalent. However, equivalent matrices are not necessarily similar. ♢
Remark. Consider vector spaces V, W, X . From the remark that follows
Theorem 2.17, we already know that for linear mappings Φ : V → W
and Ψ : W → X the mapping Ψ ◦ Φ : V → X is also linear. With
transformation matrices AΦ and AΨ of the corresponding mappings, the
overall transformation matrix is AΨ◦Φ = AΨ AΦ . ♢
In light of this remark, we can look at basis changes from the perspec-
tive of composing linear mappings:

AΦ is the transformation matrix of a linear mapping ΦCB : V → W with respect to the bases B, C .
ÃΦ is the transformation matrix of the linear mapping ΦC̃B̃ : V → W with respect to the bases B̃, C̃ .
S is the transformation matrix of a linear mapping ΨBB̃ : V → V (automorphism) that represents B̃ in terms of B . Normally, Ψ = idV is the identity mapping in V .


T is the transformation matrix of a linear mapping ΞCC̃ : W → W (automorphism) that represents C̃ in terms of C . Normally, Ξ = idW is the identity mapping in W .

If we (informally) write down the transformations just in terms of bases, then AΦ : B → C , ÃΦ : B̃ → C̃ , S : B̃ → B , T : C̃ → C and T⁻¹ : C → C̃ , and
$$\tilde{B} \to \tilde{C} = \tilde{B} \to B \to C \to \tilde{C} \tag{2.115}$$
$$\tilde{A}_\Phi = T^{-1} A_\Phi S\,. \tag{2.116}$$
Note that the execution order in (2.116) is from right to left because vectors are multiplied at the right-hand side so that x ↦ Sx ↦ AΦ(Sx) ↦ T⁻¹AΦ(Sx) = ÃΦx.


Example 2.24 (Basis Change)


Consider a linear mapping Φ : R3 → R4 whose transformation matrix is
$$A_\Phi = \begin{bmatrix} 1 & 2 & 0\\ -1 & 1 & 3\\ 3 & 7 & 1\\ -1 & 2 & 4 \end{bmatrix} \tag{2.117}$$
with respect to the standard bases
$$B = \left(\begin{bmatrix}1\\0\\0\end{bmatrix}, \begin{bmatrix}0\\1\\0\end{bmatrix}, \begin{bmatrix}0\\0\\1\end{bmatrix}\right), \quad C = \left(\begin{bmatrix}1\\0\\0\\0\end{bmatrix}, \begin{bmatrix}0\\1\\0\\0\end{bmatrix}, \begin{bmatrix}0\\0\\1\\0\end{bmatrix}, \begin{bmatrix}0\\0\\0\\1\end{bmatrix}\right). \tag{2.118}$$
We seek the transformation matrix ÃΦ of Φ with respect to the new bases
$$\tilde{B} = \left(\begin{bmatrix}1\\1\\0\end{bmatrix}, \begin{bmatrix}0\\1\\1\end{bmatrix}, \begin{bmatrix}1\\0\\1\end{bmatrix}\right) \in \mathbb{R}^3, \quad \tilde{C} = \left(\begin{bmatrix}1\\1\\0\\0\end{bmatrix}, \begin{bmatrix}1\\0\\1\\0\end{bmatrix}, \begin{bmatrix}0\\1\\1\\0\end{bmatrix}, \begin{bmatrix}1\\0\\0\\1\end{bmatrix}\right). \tag{2.119}$$
Then,
$$S = \begin{bmatrix}1 & 0 & 1\\ 1 & 1 & 0\\ 0 & 1 & 1\end{bmatrix}, \quad T = \begin{bmatrix}1 & 1 & 0 & 1\\ 1 & 0 & 1 & 0\\ 0 & 1 & 1 & 0\\ 0 & 0 & 0 & 1\end{bmatrix}, \tag{2.120}$$
where the ith column of S is the coordinate representation of b̃i in terms of the basis vectors of B . Since B is the standard basis, the coordinate representation is straightforward to find. For a general basis B , we would need to solve a linear equation system to find the λi such that $\sum_{i=1}^{3}\lambda_i b_i = \tilde{b}_j$, j = 1, . . . , 3. Similarly, the jth column of T is the coordinate representation of c̃j in terms of the basis vectors of C .
Therefore, we obtain
$$\tilde{A}_\Phi = T^{-1} A_\Phi S = \frac{1}{2}\begin{bmatrix}1 & 1 & -1 & -1\\ 1 & -1 & 1 & -1\\ -1 & 1 & 1 & 1\\ 0 & 0 & 0 & 2\end{bmatrix}\begin{bmatrix}3 & 2 & 1\\ 0 & 4 & 2\\ 10 & 8 & 4\\ 1 & 6 & 3\end{bmatrix} \tag{2.121a}$$
$$= \begin{bmatrix}-4 & -4 & -2\\ 6 & 0 & 0\\ 4 & 8 & 4\\ 1 & 6 & 3\end{bmatrix}. \tag{2.121b}$$
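The computation in Example 2.24 is easy to check numerically. The following short sketch is not part of the book; it assumes NumPy is available and simply re-evaluates the basis-change formula (2.112) with the matrices from the example:

```python
import numpy as np

# Transformation matrix of Phi with respect to the standard bases B, C
A_Phi = np.array([[1, 2, 0],
                  [-1, 1, 3],
                  [3, 7, 1],
                  [-1, 2, 4]])

# Columns of S: coordinates of the new basis vectors of B-tilde w.r.t. B
S = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]])

# Columns of T: coordinates of the new basis vectors of C-tilde w.r.t. C
T = np.array([[1, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]])

# Basis change (2.112): A-tilde = T^{-1} A_Phi S
A_tilde = np.linalg.inv(T) @ A_Phi @ S
print(A_tilde)
# [[-4. -4. -2.]
#  [ 6.  0.  0.]
#  [ 4.  8.  4.]
#  [ 1.  6.  3.]]
```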

In Chapter 4, we will be able to exploit the concept of a basis change


to find a basis with respect to which the transformation matrix of an en-
domorphism has a particularly simple (diagonal) form. In Chapter 10, we
will look at a data compression problem and find a convenient basis onto
which we can project the data while minimizing the compression loss.

2.7.3 Image and Kernel


The image and kernel of a linear mapping are vector subspaces with cer-
tain important properties. In the following, we will characterize them
more carefully.
Definition 2.23 (Image and Kernel).
kernel For Φ : V → W , we define the kernel/null space
null space
ker(Φ) := Φ−1 (0W ) = {v ∈ V : Φ(v) = 0W } (2.122)
image and the image/range
range
Im(Φ) := Φ(V ) = {w ∈ W |∃v ∈ V : Φ(v) = w} . (2.123)
domain We also call V and W also the domain and codomain of Φ, respectively.
codomain
Intuitively, the kernel is the set of vectors v ∈ V that Φ maps onto the
neutral element 0W ∈ W . The image is the set of vectors w ∈ W that
can be “reached” by Φ from any vector in V . An illustration is given in
Figure 2.12.
Remark. Consider a linear mapping Φ : V → W , where V, W are vector
spaces.
It always holds that Φ(0V ) = 0W and, therefore, 0V ∈ ker(Φ). In
particular, the null space is never empty.
Im(Φ) ⊆ W is a subspace of W , and ker(Φ) ⊆ V is a subspace of V .


[Figure 2.12: Kernel and image of a linear mapping Φ : V → W .]

Φ is injective (one-to-one) if and only if ker(Φ) = {0}.



Remark (Null Space and Column Space). Let us consider A ∈ Rm×n and a linear mapping Φ : Rn → Rm , x ↦ Ax.
For A = [a1 , . . . , an ], where ai are the columns of A, we obtain
$$\operatorname{Im}(\Phi) = \{Ax : x \in \mathbb{R}^n\} = \left\{\sum_{i=1}^{n} x_i a_i : x_1, \dots, x_n \in \mathbb{R}\right\} \tag{2.124a}$$
$$= \operatorname{span}[a_1, \dots, a_n] \subseteq \mathbb{R}^m\,, \tag{2.124b}$$
i.e., the image is the span of the columns of A, also called the column space. Therefore, the column space (image) is a subspace of Rm , where m is the "height" of the matrix.
rk(A) = dim(Im(Φ)).
The kernel/null space ker(Φ) is the general solution to the homoge-
neous system of linear equations Ax = 0 and captures all possible
linear combinations of the elements in Rn that produce 0 ∈ Rm .
The kernel is a subspace of Rn , where n is the “width” of the matrix.
The kernel focuses on the relationship among the columns, and we can
use it to determine whether/how we can express a column as a linear
combination of other columns.

Example 2.25 (Image and Kernel of a Linear Mapping)


The mapping
$$\Phi: \mathbb{R}^4 \to \mathbb{R}^2, \quad \begin{bmatrix}x_1\\x_2\\x_3\\x_4\end{bmatrix} \mapsto \begin{bmatrix}1 & 2 & -1 & 0\\ 1 & 0 & 0 & 1\end{bmatrix}\begin{bmatrix}x_1\\x_2\\x_3\\x_4\end{bmatrix} = \begin{bmatrix}x_1 + 2x_2 - x_3\\ x_1 + x_4\end{bmatrix} \tag{2.125a}$$
$$= x_1\begin{bmatrix}1\\1\end{bmatrix} + x_2\begin{bmatrix}2\\0\end{bmatrix} + x_3\begin{bmatrix}-1\\0\end{bmatrix} + x_4\begin{bmatrix}0\\1\end{bmatrix} \tag{2.125b}$$
is linear. To determine Im(Φ), we can take the span of the columns of the transformation matrix and obtain
$$\operatorname{Im}(\Phi) = \operatorname{span}\!\left[\begin{bmatrix}1\\1\end{bmatrix}, \begin{bmatrix}2\\0\end{bmatrix}, \begin{bmatrix}-1\\0\end{bmatrix}, \begin{bmatrix}0\\1\end{bmatrix}\right]. \tag{2.126}$$
To compute the kernel (null space) of Φ, we need to solve Ax = 0, i.e., we need to solve a homogeneous equation system. To do this, we use Gaussian elimination to transform A into reduced row-echelon form:
$$\begin{bmatrix}1 & 2 & -1 & 0\\ 1 & 0 & 0 & 1\end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix}1 & 0 & 0 & 1\\ 0 & 1 & -\tfrac12 & -\tfrac12\end{bmatrix}. \tag{2.127}$$
This matrix is in reduced row-echelon form, and we can use the Minus-1 Trick to compute a basis of the kernel (see Section 2.3.3). Alternatively, we can express the non-pivot columns (columns 3 and 4) as linear combinations of the pivot columns (columns 1 and 2). The third column a3 is equivalent to −½ times the second column a2 . Therefore, 0 = a3 + ½a2 . In the same way, we see that a4 = a1 − ½a2 and, therefore, 0 = a1 − ½a2 − a4 . Overall, this gives us the kernel (null space) as
$$\ker(\Phi) = \operatorname{span}\!\left[\begin{bmatrix}0\\ \tfrac12\\ 1\\ 0\end{bmatrix}, \begin{bmatrix}-1\\ \tfrac12\\ 0\\ 1\end{bmatrix}\right]. \tag{2.128}$$
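A quick numerical cross-check of Example 2.25 (a sketch outside the book, assuming NumPy and SciPy are available): the rank of A gives dim(Im(Φ)), and scipy.linalg.null_space returns an orthonormal basis of the kernel, which spans the same subspace as (2.128).

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1, 2, -1, 0],
              [1, 0, 0, 1]])

rank = np.linalg.matrix_rank(A)      # dim(Im(Phi)) = 2
kernel = null_space(A)               # orthonormal basis of ker(Phi), shape (4, 2)

# Rank-nullity: dim(ker) + dim(Im) = dim(V) = 4
assert rank + kernel.shape[1] == A.shape[1]

# Every kernel basis vector is mapped to 0
print(np.allclose(A @ kernel, 0))    # True
```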

Theorem 2.24 (Rank-Nullity Theorem). For vector spaces V, W and a linear mapping Φ : V → W it holds that
$$\dim(\ker(\Phi)) + \dim(\operatorname{Im}(\Phi)) = \dim(V)\,. \tag{2.129}$$

The rank-nullity theorem is also referred to as the fundamental theorem of linear mappings (Axler, 2015, theorem 3.22). The following are direct consequences of Theorem 2.24:

If dim(Im(Φ)) < dim(V ), then ker(Φ) is non-trivial, i.e., the kernel contains more than 0V and dim(ker(Φ)) ⩾ 1.
If AΦ is the transformation matrix of Φ with respect to an ordered basis
and dim(Im(Φ)) < dim(V ), then the system of linear equations AΦ x =
0 has infinitely many solutions.
If dim(V ) = dim(W ), then the three-way equivalence

Φ is injective ⇐⇒ Φ is surjective ⇐⇒ Φ is bijective


holds since Im(Φ) ⊆ W .


2.8 Affine Spaces


In the following, we will take a closer look at spaces that are offset from
the origin, i.e., spaces that are no longer vector subspaces. Moreover, we
will briefly discuss properties of mappings between these affine spaces,
which resemble linear mappings.
Remark. In the machine learning literature, the distinction between linear
and affine is sometimes not clear so that we can find references to affine
spaces/mappings as linear spaces/mappings. ♢

2.8.1 Affine Subspaces


Definition 2.25 (Affine Subspace). Let V be a vector space, x0 ∈ V and
U ⊆ V a subspace. Then the subset
L = x0 + U := {x0 + u : u ∈ U } (2.130a)
= {v ∈ V |∃u ∈ U : v = x0 + u} ⊆ V (2.130b)
is called affine subspace or linear manifold of V . U is called direction or direction space, and x0 is called support point. In Chapter 12, we refer to such a subspace as a hyperplane.

Note that the definition of an affine subspace excludes 0 if x0 ∉ U . Therefore, an affine subspace is not a (linear) subspace (vector subspace) of V for x0 ∉ U .
Examples of affine subspaces are points, lines, and planes in R3 , which do not (necessarily) go through the origin.
Remark. Consider two affine subspaces L = x0 + U and L̃ = x̃0 + Ũ of a vector space V . Then, L ⊆ L̃ if and only if U ⊆ Ũ and x0 − x̃0 ∈ Ũ .
Affine subspaces are often described by parameters: Consider a k-dimensional affine space L = x0 + U of V . If (b1 , . . . , bk ) is an ordered basis of U , then every element x ∈ L can be uniquely described as
$$x = x_0 + \lambda_1 b_1 + \dots + \lambda_k b_k\,, \tag{2.131}$$
where λ1 , . . . , λk ∈ R. This representation is called the parametric equation of L with directional vectors b1 , . . . , bk and parameters λ1 , . . . , λk . ♢

Example 2.26 (Affine Subspaces)

One-dimensional affine subspaces are called lines and can be written as y = x0 + λb1 , where λ ∈ R and U = span[b1 ] ⊆ Rn is a one-dimensional subspace of Rn . This means that a line is defined by a support point x0 and a vector b1 that defines the direction. See Figure 2.13 for an illustration.
Two-dimensional affine subspaces of Rn are called planes. The parametric equation for planes is y = x0 + λ1 b1 + λ2 b2 , where λ1 , λ2 ∈ R and U = span[b1 , b2 ] ⊆ Rn . This means that a plane is defined by a support point x0 and two linearly independent vectors b1 , b2 that span the direction space.
In Rn , the (n − 1)-dimensional affine subspaces are called hyperplanes, and the corresponding parametric equation is $y = x_0 + \sum_{i=1}^{n-1}\lambda_i b_i$, where b1 , . . . , bn−1 form a basis of an (n − 1)-dimensional subspace U of Rn . This means that a hyperplane is defined by a support point x0 and (n − 1) linearly independent vectors b1 , . . . , bn−1 that span the direction space. In R2 , a line is also a hyperplane. In R3 , a plane is also a hyperplane.

[Figure 2.13: Lines are affine subspaces. Vectors y on a line x0 + λb1 lie in an affine subspace L with support point x0 and direction b1.]

Remark (Inhomogeneous systems of linear equations and affine subspaces). For A ∈ Rm×n and x ∈ Rm , the solution of the system of linear equations Aλ = x is either the empty set or an affine subspace of Rn of dimension n − rk(A). In particular, the solution of the linear equation λ1 b1 + . . . + λn bn = x, where (λ1 , . . . , λn ) ≠ (0, . . . , 0), is a hyperplane in Rn .
In Rn , every k-dimensional affine subspace is the solution of an inhomogeneous system of linear equations Ax = b, where A ∈ Rm×n , b ∈ Rm and rk(A) = n − k . Recall that for homogeneous equation systems Ax = 0 the solution was a vector subspace, which we can also think of as a special affine space with support point x0 = 0. ♢
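The remark can be made concrete with a small numerical sketch (not from the book; it assumes NumPy and SciPy are available and that the system is solvable): the solution set of Ax = b is a particular solution (support point) plus the null space of A (direction space), i.e., an affine subspace of dimension n − rk(A).

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 2.0, -1.0, 0.0],
              [1.0, 0.0,  0.0, 1.0]])
b = np.array([2.0, 1.0])

# Particular solution (support point) via least squares
x_p, *_ = np.linalg.lstsq(A, b, rcond=None)

# Direction space: ker(A), of dimension n - rk(A)
U = null_space(A)

# Any point x_p + U @ lam solves Ax = b
lam = np.array([0.7, -1.3])
x = x_p + U @ lam
print(np.allclose(A @ x, b))   # True
```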

2.8.2 Affine Mappings


Similar to linear mappings between vector spaces, which we discussed
in Section 2.7, we can define affine mappings between two affine spaces.
Linear and affine mappings are closely related. Therefore, many properties
that we already know from linear mappings, e.g., that the composition of
linear mappings is a linear mapping, also hold for affine mappings.

Definition 2.26 (Affine Mapping). For two vector spaces V, W , a linear mapping Φ : V → W , and a ∈ W , the mapping
$$\phi : V \to W \tag{2.132}$$
$$x \mapsto a + \Phi(x) \tag{2.133}$$
is an affine mapping from V to W . The vector a is called the translation vector of ϕ.

Every affine mapping ϕ : V → W is also the composition of a linear


mapping Φ : V → W and a translation τ : W → W in W , such that
ϕ = τ ◦ Φ. The mappings Φ and τ are uniquely determined.
The composition ϕ′ ◦ ϕ of affine mappings ϕ : V → W , ϕ′ : W → X is
affine.
If ϕ is bijective, affine mappings keep the geometric structure invariant.
They then also preserve the dimension and parallelism.
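As a small illustration of these properties (not from the book; the matrices and vectors below are arbitrary examples, NumPy assumed), an affine mapping is a linear map followed by a translation, and the composition of two affine mappings is again affine:

```python
import numpy as np

Phi = np.array([[1.0, 2.0],
                [0.0, 1.0]])        # linear part of phi
a = np.array([3.0, -1.0])           # translation vector of phi

def phi(x):
    # affine mapping phi(x) = a + Phi(x)
    return a + Phi @ x

Psi = np.array([[2.0, 0.0],
                [1.0, 1.0]])        # linear part of phi'
c = np.array([0.5, 0.5])            # translation vector of phi'

def phi_prime(x):
    return c + Psi @ x

# The composition phi'(phi(x)) is affine with linear part Psi @ Phi
# and translation vector c + Psi @ a
x = np.array([1.0, 2.0])
lhs = phi_prime(phi(x))
rhs = (c + Psi @ a) + (Psi @ Phi) @ x
print(np.allclose(lhs, rhs))   # True
```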

2.9 Further Reading


There are many resources for learning linear algebra, including the text-
books by Strang (2003), Golan (2007), Axler (2015), and Liesen and
Mehrmann (2015). There are also several online resources that we men-
tioned in the introduction to this chapter. We only covered Gaussian elim-
ination here, but there are many other approaches for solving systems of
linear equations, and we refer to numerical linear algebra textbooks by
Stoer and Burlirsch (2002), Golub and Van Loan (2012), and Horn and
Johnson (2013) for an in-depth discussion.
In this book, we distinguish between the topics of linear algebra (e.g.,
vectors, matrices, linear independence, basis) and topics related to the
geometry of a vector space. In Chapter 3, we will introduce the inner
product, which induces a norm. These concepts allow us to define angles,
lengths and distances, which we will use for orthogonal projections. Pro-
jections turn out to be key in many machine learning algorithms, such as
linear regression and principal component analysis, both of which we will
cover in Chapters 9 and 10, respectively.


Exercises
2.1 We consider (R\{−1}, ⋆), where

a ⋆ b := ab + a + b, a, b ∈ R\{−1} (2.134)

a. Show that (R\{−1}, ⋆) is an Abelian group.


b. Solve

3 ⋆ x ⋆ x = 15

in the Abelian group (R\{−1}, ⋆), where ⋆ is defined in (2.134).


2.2 Let n be in N\{0}. Let k, x be in Z. We define the congruence class k̄ of the
integer k as the set

$$\bar{k} = \{x \in \mathbb{Z} \mid x - k = 0\ (\mathrm{mod}\ n)\} = \{x \in \mathbb{Z} \mid \exists a \in \mathbb{Z} : (x - k = n \cdot a)\}\,.$$

We now define Z/nZ (sometimes written Zn ) as the set of all congruence


classes modulo n. Euclidean division implies that this set is a finite set con-
taining n elements:

Zn = {0, 1, . . . , n − 1}

For all $\bar{a}, \bar{b} \in \mathbb{Z}_n$, we define

$$\bar{a} \oplus \bar{b} := \overline{a + b}$$

a. Show that (Zn , ⊕) is a group. Is it Abelian?


b. We now define another operation ⊗ for all $\bar{a}$ and $\bar{b}$ in Zn as
$$\bar{a} \otimes \bar{b} = \overline{a \times b}\,, \tag{2.135}$$
where a × b represents the usual multiplication in Z.


Let n = 5. Draw the times table of the elements of Z5 \{0} under ⊗, i.e.,
calculate the products a ⊗ b for all a and b in Z5 \{0}.
Hence, show that Z5 \{0} is closed under ⊗ and possesses a neutral
element for ⊗. Display the inverse of all elements in Z5 \{0} under ⊗.
Conclude that (Z5 \{0}, ⊗) is an Abelian group.
c. Show that (Z8 \{0}, ⊗) is not a group.
d. We recall that the Bézout theorem states that two integers a and b are
relatively prime (i.e., gcd(a, b) = 1) if and only if there exist two integers
u and v such that au + bv = 1. Show that (Zn \{0}, ⊗) is a group if and
only if n ∈ N\{0} is prime.
2.3 Consider the set G of 3 × 3 matrices defined as follows:
$$G = \left\{\begin{bmatrix}1 & x & z\\ 0 & 1 & y\\ 0 & 0 & 1\end{bmatrix} \in \mathbb{R}^{3\times 3} \;\middle|\; x, y, z \in \mathbb{R}\right\}$$

We define · as the standard matrix multiplication.


Is (G, ·) a group? If yes, is it Abelian? Justify your answer.
2.4 Compute the following matrix products, if possible:


a.
$$\begin{bmatrix}1 & 2\\ 4 & 5\\ 7 & 8\end{bmatrix}\begin{bmatrix}1 & 1 & 0\\ 0 & 1 & 1\\ 1 & 0 & 1\end{bmatrix}$$
b.
$$\begin{bmatrix}1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9\end{bmatrix}\begin{bmatrix}1 & 1 & 0\\ 0 & 1 & 1\\ 1 & 0 & 1\end{bmatrix}$$
c.
$$\begin{bmatrix}1 & 1 & 0\\ 0 & 1 & 1\\ 1 & 0 & 1\end{bmatrix}\begin{bmatrix}1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9\end{bmatrix}$$
d.
$$\begin{bmatrix}1 & 2 & 1 & 2\\ 4 & 1 & -1 & -4\end{bmatrix}\begin{bmatrix}0 & 3\\ 1 & -1\\ 2 & 1\\ 5 & 2\end{bmatrix}$$
e.
$$\begin{bmatrix}0 & 3\\ 1 & -1\\ 2 & 1\\ 5 & 2\end{bmatrix}\begin{bmatrix}1 & 2 & 1 & 2\\ 4 & 1 & -1 & -4\end{bmatrix}$$

2.5 Find the set S of all solutions in x of the following inhomogeneous linear
systems Ax = b, where A and b are defined as follows:
a.
$$A = \begin{bmatrix}1 & 1 & -1 & -1\\ 2 & 5 & -7 & -5\\ 2 & -1 & 1 & 3\\ 5 & 2 & -4 & 2\end{bmatrix}, \quad b = \begin{bmatrix}1\\ -2\\ 4\\ 6\end{bmatrix}$$
b.
$$A = \begin{bmatrix}1 & -1 & 0 & 0 & 1\\ 1 & 1 & 0 & -3 & 0\\ 2 & -1 & 0 & 1 & -1\\ -1 & 2 & 0 & -2 & -1\end{bmatrix}, \quad b = \begin{bmatrix}3\\ 6\\ 5\\ -1\end{bmatrix}$$

2.6 Using Gaussian elimination, find all solutions of the inhomogeneous equation system Ax = b with
$$A = \begin{bmatrix}0 & 1 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 & 1 & 0\\ 0 & 1 & 0 & 0 & 0 & 1\end{bmatrix}, \quad b = \begin{bmatrix}2\\ -1\\ 1\end{bmatrix}.$$

2.7 Find all solutions in $x = \begin{bmatrix}x_1\\ x_2\\ x_3\end{bmatrix} \in \mathbb{R}^3$ of the equation system Ax = 12x, where
$$A = \begin{bmatrix}6 & 4 & 3\\ 6 & 0 & 9\\ 0 & 8 & 0\end{bmatrix}$$
and $\sum_{i=1}^{3} x_i = 1$.
2.8 Determine the inverses of the following matrices if possible:
a.
$$A = \begin{bmatrix}2 & 3 & 4\\ 3 & 4 & 5\\ 4 & 5 & 6\end{bmatrix}$$
b.
$$A = \begin{bmatrix}1 & 0 & 1 & 0\\ 0 & 1 & 1 & 0\\ 1 & 1 & 0 & 1\\ 1 & 1 & 1 & 0\end{bmatrix}$$

2.9 Which of the following sets are subspaces of R3 ?


a. A = {(λ, λ + µ³, λ − µ³) | λ, µ ∈ R}
b. B = {(λ², −λ², 0) | λ ∈ R}
c. Let γ be in R.
C = {(ξ1 , ξ2 , ξ3 ) ∈ R3 | ξ1 − 2ξ2 + 3ξ3 = γ}
d. D = {(ξ1 , ξ2 , ξ3 ) ∈ R3 | ξ2 ∈ Z}
2.10 Are the following sets of vectors linearly independent?
a.
$$x_1 = \begin{bmatrix}2\\ -1\\ 3\end{bmatrix}, \quad x_2 = \begin{bmatrix}1\\ 1\\ -2\end{bmatrix}, \quad x_3 = \begin{bmatrix}3\\ -3\\ 8\end{bmatrix}$$
b.
$$x_1 = \begin{bmatrix}1\\ 2\\ 1\\ 0\\ 0\end{bmatrix}, \quad x_2 = \begin{bmatrix}1\\ 1\\ 0\\ 1\\ 1\end{bmatrix}, \quad x_3 = \begin{bmatrix}1\\ 0\\ 0\\ 1\\ 1\end{bmatrix}$$
2.11 Write
$$y = \begin{bmatrix}1\\ -2\\ 5\end{bmatrix}$$
as linear combination of
$$x_1 = \begin{bmatrix}1\\ 1\\ 1\end{bmatrix}, \quad x_2 = \begin{bmatrix}1\\ 2\\ 3\end{bmatrix}, \quad x_3 = \begin{bmatrix}2\\ -1\\ 1\end{bmatrix}$$


2.12 Consider two subspaces of R4 :
$$U_1 = \operatorname{span}\!\left[\begin{bmatrix}1\\ 1\\ -3\\ 1\end{bmatrix}, \begin{bmatrix}2\\ -1\\ 0\\ -1\end{bmatrix}, \begin{bmatrix}-1\\ 1\\ -1\\ 1\end{bmatrix}\right], \quad U_2 = \operatorname{span}\!\left[\begin{bmatrix}-1\\ -2\\ 2\\ 1\end{bmatrix}, \begin{bmatrix}2\\ -2\\ 0\\ 0\end{bmatrix}, \begin{bmatrix}-3\\ 6\\ -2\\ -1\end{bmatrix}\right].$$
Determine a basis of U1 ∩ U2 .
2.13 Consider two subspaces U1 and U2 , where U1 is the solution space of the
homogeneous equation system A1 x = 0 and U2 is the solution space of the
homogeneous equation system A2 x = 0 with
   
$$A_1 = \begin{bmatrix}1 & 0 & 1\\ 1 & -2 & -1\\ 2 & 1 & 3\\ 1 & 0 & 1\end{bmatrix}, \quad A_2 = \begin{bmatrix}3 & -3 & 0\\ 1 & 2 & 3\\ 7 & -5 & 2\\ 3 & -1 & 2\end{bmatrix}.$$

a. Determine the dimension of U1 , U2 .


b. Determine bases of U1 and U2 .
c. Determine a basis of U1 ∩ U2 .
2.14 Consider two subspaces U1 and U2 , where U1 is spanned by the columns of
A1 and U2 is spanned by the columns of A2 with
   
$$A_1 = \begin{bmatrix}1 & 0 & 1\\ 1 & -2 & -1\\ 2 & 1 & 3\\ 1 & 0 & 1\end{bmatrix}, \quad A_2 = \begin{bmatrix}3 & -3 & 0\\ 1 & 2 & 3\\ 7 & -5 & 2\\ 3 & -1 & 2\end{bmatrix}.$$

a. Determine the dimension of U1 , U2


b. Determine bases of U1 and U2
c. Determine a basis of U1 ∩ U2
2.15 Let F = {(x, y, z) ∈ R3 | x+y−z = 0} and G = {(a−b, a+b, a−3b) | a, b ∈ R}.
a. Show that F and G are subspaces of R3 .
b. Calculate F ∩ G without resorting to any basis vector.
c. Find one basis for F and one for G, calculate F ∩G using the basis vectors
previously found and check your result with the previous question.
2.16 Are the following mappings linear?
a. Let a, b ∈ R.
$$\Phi : L^1([a, b]) \to \mathbb{R}$$
$$f \mapsto \Phi(f) = \int_a^b f(x)\,dx\,,$$

where L1 ([a, b]) denotes the set of integrable functions on [a, b].
b.
Φ : C1 → C0
f 7→ Φ(f ) = f ′ ,

where for k ⩾ 1, C k denotes the set of k times continuously differen-


tiable functions, and C 0 denotes the set of continuous functions.


c.
Φ:R→R
x 7→ Φ(x) = cos(x)

d.
$$\Phi : \mathbb{R}^3 \to \mathbb{R}^2$$
$$x \mapsto \begin{bmatrix}1 & 2 & 3\\ 1 & 4 & 3\end{bmatrix} x$$

e. Let θ be in [0, 2π[ and
$$\Phi : \mathbb{R}^2 \to \mathbb{R}^2$$
$$x \mapsto \begin{bmatrix}\cos(\theta) & \sin(\theta)\\ -\sin(\theta) & \cos(\theta)\end{bmatrix} x$$

2.17 Consider the linear mapping


$$\Phi : \mathbb{R}^3 \to \mathbb{R}^4$$
$$\Phi\!\left(\begin{bmatrix}x_1\\ x_2\\ x_3\end{bmatrix}\right) = \begin{bmatrix}3x_1 + 2x_2 + x_3\\ x_1 + x_2 + x_3\\ x_1 - 3x_2\\ 2x_1 + 3x_2 + x_3\end{bmatrix}$$

Find the transformation matrix AΦ .


Determine rk(AΦ ).
Compute the kernel and image of Φ. What are dim(ker(Φ)) and dim(Im(Φ))?
2.18 Let E be a vector space. Let f and g be two automorphisms on E such that
f ◦ g = idE (i.e., f ◦ g is the identity mapping idE ). Show that ker(f ) =
ker(g ◦ f ), Im(g) = Im(g ◦ f ) and that ker(f ) ∩ Im(g) = {0E }.
2.19 Consider an endomorphism Φ : R3 → R3 whose transformation matrix
(with respect to the standard basis in R3 ) is
 
$$A_\Phi = \begin{bmatrix}1 & 1 & 0\\ 1 & -1 & 0\\ 1 & 1 & 1\end{bmatrix}.$$

a. Determine ker(Φ) and Im(Φ).


b. Determine the transformation matrix ÃΦ with respect to the basis
     
$$B = \left(\begin{bmatrix}1\\ 1\\ 1\end{bmatrix}, \begin{bmatrix}1\\ 2\\ 1\end{bmatrix}, \begin{bmatrix}1\\ 0\\ 0\end{bmatrix}\right),$$

i.e., perform a basis change toward the new basis B .


2.20 Let us consider b1 , b2 , b′1 , b′2 , 4 vectors of R2 expressed in the standard basis
of R2 as
       
$$b_1 = \begin{bmatrix}2\\ 1\end{bmatrix}, \quad b_2 = \begin{bmatrix}-1\\ -1\end{bmatrix}, \quad b'_1 = \begin{bmatrix}2\\ -2\end{bmatrix}, \quad b'_2 = \begin{bmatrix}1\\ 1\end{bmatrix}$$

and let us define two ordered bases B = (b1 , b2 ) and B ′ = (b′1 , b′2 ) of R2 .


a. Show that B and B ′ are two bases of R2 and draw those basis vectors.
b. Compute the matrix P 1 that performs a basis change from B ′ to B .
c. We consider c1 , c2 , c3 , three vectors of R3 defined in the standard basis
of R3 as
     
$$c_1 = \begin{bmatrix}1\\ 2\\ -1\end{bmatrix}, \quad c_2 = \begin{bmatrix}0\\ -1\\ 2\end{bmatrix}, \quad c_3 = \begin{bmatrix}1\\ 0\\ -1\end{bmatrix}$$

and we define C = (c1 , c2 , c3 ).


(i) Show that C is a basis of R3 , e.g., by using determinants (see
Section 4.1).
(ii) Let us call C ′ = (c′1 , c′2 , c′3 ) the standard basis of R3 . Determine
the matrix P 2 that performs the basis change from C to C ′ .
d. We consider a homomorphism Φ : R2 −→ R3 , such that
Φ(b1 + b2 ) = c2 + c3
Φ(b1 − b2 ) = 2c1 − c2 + 3c3

where B = (b1 , b2 ) and C = (c1 , c2 , c3 ) are ordered bases of R2 and R3 ,


respectively.
Determine the transformation matrix AΦ of Φ with respect to the or-
dered bases B and C .
e. Determine A′ , the transformation matrix of Φ with respect to the bases
B ′ and C ′ .
f. Let us consider the vector x ∈ R2 whose coordinates in B ′ are [2, 3]⊤ .
In other words, x = 2b′1 + 3b′2 .
(i) Calculate the coordinates of x in B .
(ii) Based on that, compute the coordinates of Φ(x) expressed in C .
(iii) Then, write Φ(x) in terms of c′1 , c′2 , c′3 .
(iv) Use the representation of x in B ′ and the matrix A′ to find this
result directly.



3 Analytic Geometry

In Chapter 2, we studied vectors, vector spaces, and linear mappings at a general but abstract level. In this chapter, we will add some geometric interpretation and intuition to all of these concepts. In particular, we
will look at geometric vectors and compute their lengths and distances
or angles between two vectors. To be able to do this, we equip the vec-
tor space with an inner product that induces the geometry of the vector
space. Inner products and their corresponding norms and metrics capture
the intuitive notions of similarity and distances, which we use to develop
the support vector machine in Chapter 12. We will then use the concepts
of lengths and angles between vectors to discuss orthogonal projections,
which will play a central role when we discuss principal component anal-
ysis in Chapter 10 and regression via maximum likelihood estimation in
Chapter 9. Figure 3.1 gives an overview of how concepts in this chapter
are related and how they are connected to other chapters of the book.

[Figure 3.1: A mind map of the concepts introduced in this chapter, along with when they are used in other parts of the book.]

[Figure 3.3: For different norms, the red lines indicate the set of vectors with norm 1. Left: Manhattan norm (∥x∥1 = 1); Right: Euclidean distance (∥x∥2 = 1).]

3.1 Norms
When we think of geometric vectors, i.e., directed line segments that start
at the origin, then intuitively the length of a vector is the distance of the
“end” of this directed line segment from the origin. In the following, we
will discuss the notion of the length of vectors using the concept of a norm.

Definition 3.1 (Norm). A norm on a vector space V is a function
$$\|\cdot\| : V \to \mathbb{R}\,, \tag{3.1}$$
$$x \mapsto \|x\|\,, \tag{3.2}$$
which assigns each vector x its length ∥x∥ ∈ R, such that for all λ ∈ R and x, y ∈ V the following hold:

Absolutely homogeneous: ∥λx∥ = |λ|∥x∥
Triangle inequality: ∥x + y∥ ⩽ ∥x∥ + ∥y∥
Positive definite: ∥x∥ ⩾ 0 and ∥x∥ = 0 ⇐⇒ x = 0

[Figure 3.2: Triangle inequality, c ⩽ a + b.]

In geometric terms, the triangle inequality states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side; see Figure 3.2 for an illustration. Definition 3.1 is in terms of a general vector space V (Section 2.4), but in this book we will only consider a finite-dimensional vector space Rn . Recall that for a vector x ∈ Rn we denote the elements of the vector using a subscript, that is, xi is the ith element of the vector x.

Example 3.1 (Manhattan Norm)


The Manhattan norm on Rn is defined for x ∈ Rn as
$$\|x\|_1 := \sum_{i=1}^{n} |x_i|\,, \tag{3.3}$$
where | · | is the absolute value. The left panel of Figure 3.3 shows all vectors x ∈ R2 with ∥x∥1 = 1. The Manhattan norm is also called ℓ1 norm.


Example 3.2 (Euclidean Norm)


The Euclidean norm of x ∈ Rn is defined as
$$\|x\|_2 := \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x^\top x} \tag{3.4}$$
and computes the Euclidean distance of x from the origin. The right panel of Figure 3.3 shows all vectors x ∈ R2 with ∥x∥2 = 1. The Euclidean norm is also called ℓ2 norm.

Remark. Throughout this book, we will use the Euclidean norm (3.4) by
default if not stated otherwise. ♢
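Both norms are directly available in NumPy; the following is a small sketch (not from the book) that evaluates Examples 3.1 and 3.2 for an arbitrary vector:

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])

l1 = np.linalg.norm(x, ord=1)   # Manhattan norm: |1| + |-2| + |3| = 6
l2 = np.linalg.norm(x, ord=2)   # Euclidean norm: sqrt(1 + 4 + 9)

print(l1, l2)                   # 6.0 3.7416...
assert np.isclose(l1, np.sum(np.abs(x)))
assert np.isclose(l2, np.sqrt(x @ x))
```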

3.2 Inner Products


Inner products allow for the introduction of intuitive geometrical con-
cepts, such as the length of a vector and the angle or distance between
two vectors. A major purpose of inner products is to determine whether
vectors are orthogonal to each other.

3.2.1 Dot Product


We may already be familiar with a particular type of inner product, the scalar product/dot product in Rn , which is given by
$$x^\top y = \sum_{i=1}^{n} x_i y_i\,. \tag{3.5}$$

We will refer to this particular inner product as the dot product in this
book. However, inner products are more general concepts with specific
properties, which we will now introduce.

3.2.2 General Inner Products


Recall the linear mapping from Section 2.7, where we can rearrange the mapping with respect to addition and multiplication with a scalar. A bilinear mapping Ω is a mapping with two arguments, and it is linear in each argument, i.e., when we look at a vector space V then it holds that for all x, y, z ∈ V , λ, ψ ∈ R that
$$\Omega(\lambda x + \psi y, z) = \lambda\Omega(x, z) + \psi\Omega(y, z) \tag{3.6}$$
$$\Omega(x, \lambda y + \psi z) = \lambda\Omega(x, y) + \psi\Omega(x, z)\,. \tag{3.7}$$
Here, (3.6) asserts that Ω is linear in the first argument, and (3.7) asserts
that Ω is linear in the second argument (see also (2.87)).


Definition 3.2. Let V be a vector space and Ω : V × V → R be a bilinear mapping that takes two vectors and maps them onto a real number. Then
Ω is called symmetric if Ω(x, y) = Ω(y, x) for all x, y ∈ V , i.e., the order of the arguments does not matter.
Ω is called positive definite if
$$\forall x \in V\setminus\{0\}: \Omega(x, x) > 0\,, \quad \Omega(0, 0) = 0\,. \tag{3.8}$$

Definition 3.3. Let V be a vector space and Ω : V × V → R be a bilinear mapping that takes two vectors and maps them onto a real number. Then
A positive definite, symmetric bilinear mapping Ω : V × V → R is called an inner product on V . We typically write ⟨x, y⟩ instead of Ω(x, y).
The pair (V, ⟨·, ·⟩) is called an inner product space or (real) vector space with inner product. If we use the dot product defined in (3.5), we call (V, ⟨·, ·⟩) a Euclidean vector space.

We will refer to these spaces as inner product spaces in this book.

Example 3.3 (Inner Product That Is Not the Dot Product)


Consider V = R2 . If we define
⟨x, y⟩ := x1 y1 − (x1 y2 + x2 y1 ) + 2x2 y2 (3.9)
then ⟨·, ·⟩ is an inner product but different from the dot product. The proof
will be an exercise.

3.2.3 Symmetric, Positive Definite Matrices


Symmetric, positive definite matrices play an important role in machine
learning, and they are defined via the inner product. In Section 4.3, we
will return to symmetric, positive definite matrices in the context of matrix
decompositions. The idea of symmetric positive semidefinite matrices is
key in the definition of kernels (Section 12.4).
Consider an n-dimensional vector space V with an inner product ⟨·, ·⟩ : V × V → R (see Definition 3.3) and an ordered basis B = (b1 , . . . , bn ) of V . Recall from Section 2.6.1 that any vectors x, y ∈ V can be written as linear combinations of the basis vectors so that $x = \sum_{i=1}^{n}\psi_i b_i \in V$ and $y = \sum_{j=1}^{n}\lambda_j b_j \in V$ for suitable ψi , λj ∈ R. Due to the bilinearity of the inner product, it holds for all x, y ∈ V that
$$\langle x, y\rangle = \left\langle \sum_{i=1}^{n}\psi_i b_i,\ \sum_{j=1}^{n}\lambda_j b_j \right\rangle = \sum_{i=1}^{n}\sum_{j=1}^{n}\psi_i\,\langle b_i, b_j\rangle\,\lambda_j = \hat{x}^\top A \hat{y}\,, \tag{3.10}$$

where Aij := ⟨bi , bj ⟩ and x̂, ŷ are the coordinates of x and y with respect
to the basis B . This implies that the inner product ⟨·, ·⟩ is uniquely deter-
mined through A. The symmetry of the inner product also means that A


is symmetric. Furthermore, the positive definiteness of the inner product implies that
$$\forall x \in V\setminus\{0\}: x^\top A x > 0\,. \tag{3.11}$$

Definition 3.4 (Symmetric, Positive Definite Matrix). A symmetric matrix A ∈ Rn×n that satisfies (3.11) is called symmetric, positive definite, or just positive definite. If only ⩾ holds in (3.11), then A is called symmetric, positive semidefinite.

Example 3.4 (Symmetric, Positive Definite Matrices)


Consider the matrices
$$A_1 = \begin{bmatrix}9 & 6\\ 6 & 5\end{bmatrix}, \quad A_2 = \begin{bmatrix}9 & 6\\ 6 & 3\end{bmatrix}. \tag{3.12}$$
A1 is positive definite because it is symmetric and
$$x^\top A_1 x = \begin{bmatrix}x_1 & x_2\end{bmatrix}\begin{bmatrix}9 & 6\\ 6 & 5\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix} \tag{3.13a}$$
$$= 9x_1^2 + 12x_1x_2 + 5x_2^2 = (3x_1 + 2x_2)^2 + x_2^2 > 0 \tag{3.13b}$$
for all x ∈ V \{0}. In contrast, A2 is symmetric but not positive definite because x⊤A2x = 9x1² + 12x1x2 + 3x2² = (3x1 + 2x2)² − x2² can be less than 0, e.g., for x = [2, −3]⊤ .
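For a symmetric matrix, positive definiteness can also be checked numerically via its eigenvalues (all strictly positive). The following sketch is not from the book and assumes NumPy is available:

```python
import numpy as np

A1 = np.array([[9.0, 6.0],
               [6.0, 5.0]])
A2 = np.array([[9.0, 6.0],
               [6.0, 3.0]])

# A symmetric matrix is positive definite iff all eigenvalues are > 0
print(np.linalg.eigvalsh(A1))   # approx [ 0.68, 13.32]  -> positive definite
print(np.linalg.eigvalsh(A2))   # approx [-0.71, 12.71]  -> not positive definite

# The counterexample from Example 3.4: the quadratic form can be negative
x = np.array([2.0, -3.0])
print(x @ A2 @ x)               # -9.0 < 0
```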

If A ∈ Rn×n is symmetric, positive definite, then

⟨x, y⟩ = x̂⊤ Aŷ (3.14)

defines an inner product with respect to an ordered basis B , where x̂ and


ŷ are the coordinate representations of x, y ∈ V with respect to B .

Theorem 3.5. For a real-valued, finite-dimensional vector space V and an


ordered basis B of V , it holds that ⟨·, ·⟩ : V × V → R is an inner product if
and only if there exists a symmetric, positive definite matrix A ∈ Rn×n with

⟨x, y⟩ = x̂⊤ Aŷ . (3.15)

The following properties hold if A ∈ Rn×n is symmetric and positive


definite:

The null space (kernel) of A consists only of 0 because x⊤Ax > 0 for all x ≠ 0. This implies that Ax ≠ 0 if x ≠ 0.
The diagonal elements aii of A are positive because $a_{ii} = e_i^\top A e_i > 0$, where ei is the ith vector of the standard basis in Rn .


3.3 Lengths and Distances


In Section 3.1, we already discussed norms that we can use to compute the length of a vector. Inner products and norms are closely related in the sense that any inner product induces a norm
$$\|x\| := \sqrt{\langle x, x\rangle} \tag{3.16}$$
in a natural way, such that we can compute lengths of vectors using the inner product. However, not every norm is induced by an inner product. The Manhattan norm (3.3) is an example of a norm without a corresponding inner product. In the following, we will focus on norms that are induced by inner products and introduce geometric concepts, such as lengths, distances, and angles.
Remark (Cauchy-Schwarz Inequality). For an inner product vector space (V, ⟨·, ·⟩) the induced norm ∥ · ∥ satisfies the Cauchy-Schwarz inequality
$$|\langle x, y\rangle| \leqslant \|x\|\,\|y\|\,. \tag{3.17}$$

Example 3.5 (Lengths of Vectors Using Inner Products)


In geometry, we are often interested in lengths of vectors. We can now use an inner product to compute them using (3.16). Let us take x = [1, 1]⊤ ∈ R2 . If we use the dot product as the inner product, with (3.16) we obtain
$$\|x\| = \sqrt{x^\top x} = \sqrt{1^2 + 1^2} = \sqrt{2} \tag{3.18}$$
as the length of x. Let us now choose a different inner product:
$$\langle x, y\rangle := x^\top\begin{bmatrix}1 & -\tfrac12\\ -\tfrac12 & 1\end{bmatrix} y = x_1y_1 - \tfrac12(x_1y_2 + x_2y_1) + x_2y_2\,. \tag{3.19}$$
If we compute the norm of a vector, then this inner product returns smaller values than the dot product if x1 and x2 have the same sign (and x1x2 > 0); otherwise, it returns greater values than the dot product. With this inner product, we obtain
$$\langle x, x\rangle = x_1^2 - x_1x_2 + x_2^2 = 1 - 1 + 1 = 1 \implies \|x\| = \sqrt{1} = 1\,, \tag{3.20}$$
such that x is "shorter" with this inner product than with the dot product.
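The two computations in Example 3.5 can be reproduced numerically; a sketch, not from the book, assuming NumPy:

```python
import numpy as np

x = np.array([1.0, 1.0])

# Dot product as inner product: ||x|| = sqrt(x^T x) = sqrt(2)
print(np.sqrt(x @ x))            # 1.4142...

# Inner product from (3.19): <x, y> = x^T A y with A = [[1, -1/2], [-1/2, 1]]
A = np.array([[1.0, -0.5],
              [-0.5, 1.0]])
print(np.sqrt(x @ A @ x))        # 1.0
```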

Definition 3.6 (Distance and Metric). Consider an inner product space (V, ⟨·, ·⟩). Then
$$d(x, y) := \|x - y\| = \sqrt{\langle x - y,\ x - y\rangle} \tag{3.21}$$
is called the distance between x and y for x, y ∈ V . If we use the dot product as the inner product, then the distance is called Euclidean distance.


The mapping
$$d : V \times V \to \mathbb{R} \tag{3.22}$$
$$(x, y) \mapsto d(x, y) \tag{3.23}$$
is called a metric.

Remark. Similar to the length of a vector, the distance between vectors does not require an inner product: a norm is sufficient. If we have a norm induced by an inner product, the distance may vary depending on the choice of the inner product. ♢

A metric d satisfies the following:
1. d is positive definite, i.e., d(x, y) ⩾ 0 for all x, y ∈ V and d(x, y) = 0 ⇐⇒ x = y .
2. d is symmetric, i.e., d(x, y) = d(y, x) for all x, y ∈ V .
3. Triangle inequality: d(x, z) ⩽ d(x, y) + d(y, z) for all x, y, z ∈ V .

Remark. At first glance, the lists of properties of inner products and met-
rics look very similar. However, by comparing Definition 3.3 with Defini-
tion 3.6 we observe that ⟨x, y⟩ and d(x, y) behave in opposite directions.
Very similar x and y will result in a large value for the inner product and
a small value for the metric. ♢

3.4 Angles and Orthogonality


[Figure 3.4: When restricted to [0, π], f(ω) = cos(ω) returns a unique number in the interval [−1, 1].]

In addition to enabling the definition of lengths of vectors, as well as the distance between two vectors, inner products also capture the geometry of a vector space by defining the angle ω between two vectors. We use the Cauchy-Schwarz inequality (3.17) to define angles ω in inner product spaces between two vectors x, y , and this notion coincides with our intuition in R2 and R3 . Assume that x ≠ 0, y ≠ 0. Then
$$-1 \leqslant \frac{\langle x, y\rangle}{\|x\|\,\|y\|} \leqslant 1\,. \tag{3.24}$$
Therefore, there exists a unique ω ∈ [0, π], illustrated in Figure 3.4, with
$$\cos\omega = \frac{\langle x, y\rangle}{\|x\|\,\|y\|}\,. \tag{3.25}$$
The number ω is the angle between the vectors x and y . Intuitively, the angle between two vectors tells us how similar their orientations are. For example, using the dot product, the angle between x and y = 4x, i.e., y is a scaled version of x, is 0: Their orientation is the same.

Draft (2024-01-15) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.


3.4 Angles and Orthogonality 77

Example 3.6 (Angle between Vectors)


[Figure 3.5: The angle ω between two vectors x, y is computed using the inner product.]

Let us compute the angle between x = [1, 1]⊤ ∈ R2 and y = [1, 2]⊤ ∈ R2 ; see Figure 3.5, where we use the dot product as the inner product. Then we get
$$\cos\omega = \frac{\langle x, y\rangle}{\sqrt{\langle x, x\rangle\langle y, y\rangle}} = \frac{x^\top y}{\sqrt{x^\top x\; y^\top y}} = \frac{3}{\sqrt{10}}\,, \tag{3.26}$$
and the angle between the two vectors is arccos(3/√10) ≈ 0.32 rad, which corresponds to about 18°.

A key feature of the inner product is that it also allows us to characterize vectors that are orthogonal.

Definition 3.7 (Orthogonality). Two vectors x and y are orthogonal if and only if ⟨x, y⟩ = 0, and we write x ⊥ y . If additionally ∥x∥ = 1 = ∥y∥, i.e., the vectors are unit vectors, then x and y are orthonormal.

An implication of this definition is that the 0-vector is orthogonal to


every vector in the vector space.
Remark. Orthogonality is the generalization of the concept of perpendic-
ularity to bilinear forms that do not have to be the dot product. In our
context, geometrically, we can think of orthogonal vectors as having a
right angle with respect to a specific inner product. ♢

Example 3.7 (Orthogonal Vectors)

[Figure 3.6: The angle ω between two vectors x, y can change depending on the inner product.]

Consider two vectors x = [1, 1]⊤ , y = [−1, 1]⊤ ∈ R2 ; see Figure 3.6. We are interested in determining the angle ω between them using two different inner products. Using the dot product as the inner product yields an angle ω between x and y of 90°, such that x ⊥ y . However, if we choose the inner product
$$\langle x, y\rangle = x^\top\begin{bmatrix}2 & 0\\ 0 & 1\end{bmatrix} y\,, \tag{3.27}$$


we get that the angle ω between x and y is given by
$$\cos\omega = \frac{\langle x, y\rangle}{\|x\|\,\|y\|} = -\frac{1}{3} \implies \omega \approx 1.91\ \text{rad} \approx 109.5^\circ\,, \tag{3.28}$$
and x and y are not orthogonal. Therefore, vectors that are orthogonal with respect to one inner product do not have to be orthogonal with respect to a different inner product.
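The angle computations of Examples 3.6 and 3.7 can be verified numerically. The following sketch is not from the book and assumes NumPy; the helper function `angle` is a hypothetical convenience wrapper around (3.25):

```python
import numpy as np

def angle(x, y, A=None):
    """Angle between x and y for the inner product <x, y> = x^T A y.
    A = None means the dot product (A = I)."""
    if A is None:
        A = np.eye(len(x))
    inner = lambda u, v: u @ A @ v
    cos_omega = inner(x, y) / np.sqrt(inner(x, x) * inner(y, y))
    return np.arccos(cos_omega)

# Example 3.6: dot product, x = [1, 1], y = [1, 2]
print(angle(np.array([1.0, 1.0]), np.array([1.0, 2.0])))          # 0.3217... rad

# Example 3.7: x = [1, 1], y = [-1, 1]
x, y = np.array([1.0, 1.0]), np.array([-1.0, 1.0])
print(angle(x, y))                                                # pi/2: orthogonal
print(angle(x, y, A=np.array([[2.0, 0.0], [0.0, 1.0]])))          # 1.9106... rad
```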

Definition 3.8 (Orthogonal Matrix). A square matrix A ∈ Rn×n is an orthogonal matrix if and only if its columns are orthonormal so that
$$AA^\top = I = A^\top A\,, \tag{3.29}$$
which implies that
$$A^{-1} = A^\top\,, \tag{3.30}$$
i.e., the inverse is obtained by simply transposing the matrix. (It is convention to call these matrices "orthogonal", but a more precise description would be "orthonormal".)

Transformations by orthogonal matrices are special because the length of a vector x is not changed when transforming it using an orthogonal matrix A. For the dot product, we obtain
$$\|Ax\|^2 = (Ax)^\top(Ax) = x^\top A^\top A x = x^\top I x = x^\top x = \|x\|^2\,. \tag{3.31}$$
Moreover, the angle between any two vectors x, y , as measured by their inner product, is also unchanged when transforming both of them using an orthogonal matrix A. Assuming the dot product as the inner product, the angle of the images Ax and Ay is given as
$$\cos\omega = \frac{(Ax)^\top(Ay)}{\|Ax\|\,\|Ay\|} = \frac{x^\top A^\top A y}{\sqrt{x^\top A^\top A x\; y^\top A^\top A y}} = \frac{x^\top y}{\|x\|\,\|y\|}\,, \tag{3.32}$$
which gives exactly the angle between x and y . This means that orthogonal matrices A with A⊤ = A−1 preserve both angles and distances. It turns out that orthogonal matrices define transformations that are rotations (with the possibility of flips). In Section 3.9, we will discuss more details about rotations.
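A numerical check of (3.31) and (3.32) with a rotation matrix; a sketch, not from the book, assuming NumPy and an arbitrary rotation angle:

```python
import numpy as np

theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation, hence orthogonal

print(np.allclose(A.T @ A, np.eye(2)))            # True: A^T A = I

x = np.array([1.0, 2.0])
y = np.array([-3.0, 0.5])

# Lengths are preserved (3.31)
print(np.isclose(np.linalg.norm(A @ x), np.linalg.norm(x)))   # True

# Angles are preserved (3.32)
cos_before = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
cos_after = (A @ x) @ (A @ y) / (np.linalg.norm(A @ x) * np.linalg.norm(A @ y))
print(np.isclose(cos_before, cos_after))          # True
```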

3.5 Orthonormal Basis


In Section 2.6.1, we characterized properties of basis vectors and found
that in an n-dimensional vector space, we need n basis vectors, i.e., n
vectors that are linearly independent. In Sections 3.3 and 3.4, we used
inner products to compute the length of vectors and the angle between
vectors. In the following, we will discuss the special case where the basis
vectors are orthogonal to each other and where the length of each basis
vector is 1. We will call this basis then an orthonormal basis.


Let us introduce this more formally.

Definition 3.9 (Orthonormal Basis). Consider an n-dimensional vector space V and a basis {b1 , . . . , bn } of V . If
$$\langle b_i, b_j\rangle = 0 \quad\text{for } i \neq j \tag{3.33}$$
$$\langle b_i, b_i\rangle = 1 \tag{3.34}$$
for all i, j = 1, . . . , n then the basis is called an orthonormal basis (ONB). If only (3.33) is satisfied, then the basis is called an orthogonal basis. Note that (3.34) implies that every basis vector has length/norm 1.
that (3.34) implies that every basis vector has length/norm 1.

Recall from Section 2.6.1 that we can use Gaussian elimination to find a basis for a vector space spanned by a set of vectors. Assume we are given a set {b̃1 , . . . , b̃n } of non-orthogonal and unnormalized basis vectors. We concatenate them into a matrix B̃ = [b̃1 , . . . , b̃n ] and apply Gaussian elimination to the augmented matrix (Section 2.3.2) [B̃B̃⊤ | B̃] to obtain an orthonormal basis. This constructive way to iteratively build an orthonormal basis {b1 , . . . , bn } is called the Gram-Schmidt process (Strang, 2003).

Example 3.8 (Orthonormal Basis)


The canonical/standard basis for a Euclidean vector space Rn is an orthonormal basis, where the inner product is the dot product of vectors. In R2 , the vectors
$$b_1 = \frac{1}{\sqrt{2}}\begin{bmatrix}1\\ 1\end{bmatrix}, \quad b_2 = \frac{1}{\sqrt{2}}\begin{bmatrix}1\\ -1\end{bmatrix} \tag{3.35}$$
form an orthonormal basis since $b_1^\top b_2 = 0$ and ∥b1∥ = 1 = ∥b2∥.

We will exploit the concept of an orthonormal basis in Chapter 12 and


Chapter 10 when we discuss support vector machines and principal com-
ponent analysis.

3.6 Orthogonal Complement


Having defined orthogonality, we will now look at vector spaces that are
orthogonal to each other. This will play an important role in Chapter 10,
when we discuss linear dimensionality reduction from a geometric per-
spective.
Consider a D-dimensional vector space V and an M -dimensional subspace U ⊆ V . Then its orthogonal complement U⊥ is a (D − M )-dimensional subspace of V and contains all vectors in V that are orthogonal to every vector in U . Furthermore, U ∩ U⊥ = {0} so that any vector x ∈ V can be uniquely decomposed into
$$x = \sum_{m=1}^{M}\lambda_m b_m + \sum_{j=1}^{D-M}\psi_j b_j^\perp\,, \quad \lambda_m, \psi_j \in \mathbb{R}\,, \tag{3.36}$$
where $(b_1, \dots, b_M)$ is a basis of U and $(b_1^\perp, \dots, b_{D-M}^\perp)$ is a basis of U⊥ .

Therefore, the orthogonal complement can also be used to describe a plane U (two-dimensional subspace) in a three-dimensional vector space. More specifically, the vector w with ∥w∥ = 1, which is orthogonal to the plane U , is the basis vector of U⊥ . Figure 3.7 illustrates this setting. All vectors that are orthogonal to w must (by construction) lie in the plane U . The vector w is called the normal vector of U .

[Figure 3.7: A plane U in a three-dimensional vector space can be described by its normal vector, which spans its orthogonal complement U⊥.]
Generally, orthogonal complements can be used to describe hyperplanes
in n-dimensional vector and affine spaces.
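Numerically, a basis of U⊥ can be obtained as the null space of the matrix whose rows span U. This is a sketch, not from the book, assuming NumPy and SciPy and an arbitrary plane in R3:

```python
import numpy as np
from scipy.linalg import null_space

# U is the plane in R^3 spanned by b1, b2
b1 = np.array([1.0, 0.0, 1.0])
b2 = np.array([0.0, 1.0, 1.0])
B = np.vstack([b1, b2])          # rows span U

# U-perp = null space of B: all vectors orthogonal to b1 and b2
w = null_space(B)[:, 0]          # normal vector of the plane, ||w|| = 1
print(np.allclose(B @ w, 0))     # True
print(np.isclose(np.linalg.norm(w), 1.0))   # True
```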

3.7 Inner Product of Functions


Thus far, we looked at properties of inner products to compute lengths,
angles and distances. We focused on inner products of finite-dimensional
vectors. In the following, we will look at an example of inner products of
a different type of vectors: inner products of functions.
The inner products we discussed so far were defined for vectors with a
finite number of entries. We can think of a vector x ∈ Rn as a function
with n function values. The concept of an inner product can be generalized
to vectors with an infinite number of entries (countably infinite) and also
continuous-valued functions (uncountably infinite). Then the sum over
individual components of vectors (see Equation (3.5) for example) turns
into an integral.
An inner product of two functions u : R → R and v : R → R can be defined as the definite integral
$$\langle u, v\rangle := \int_a^b u(x)\,v(x)\,dx \tag{3.37}$$


for lower and upper limits a, b < ∞, respectively. As with our usual inner
product, we can define norms and orthogonality by looking at the inner
product. If (3.37) evaluates to 0, the functions u and v are orthogonal. To
make the preceding inner product mathematically precise, we need to take
care of measures and the definition of integrals, leading to the definition of
a Hilbert space. Furthermore, unlike inner products on finite-dimensional
vectors, inner products on functions may diverge (have infinite value). All
this requires diving into some more intricate details of real and functional
analysis, which we do not cover in this book.

Example 3.9 (Inner Product of Functions)


If we choose u = sin(x) and v = cos(x), the integrand f (x) = u(x)v(x) of (3.37) is shown in Figure 3.8. We see that this function is odd, i.e., f (−x) = −f (x). Therefore, the integral with limits a = −π, b = π of this product evaluates to 0. Therefore, sin and cos are orthogonal functions.

[Figure 3.8: f (x) = sin(x) cos(x).]

Remark. It also holds that the collection of functions
$$\{1, \cos(x), \cos(2x), \cos(3x), \dots\} \tag{3.38}$$
is orthogonal if we integrate from −π to π , i.e., any pair of functions are
orthogonal to each other. The collection of functions in (3.38) spans a
large subspace of the functions that are even and periodic on [−π, π), and
projecting functions onto this subspace is the fundamental idea behind
Fourier series. ♢
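The integral in Example 3.9 can be checked by numerical integration. This is a sketch, not from the book, assuming NumPy and SciPy; `inner` is a hypothetical helper implementing (3.37):

```python
import numpy as np
from scipy.integrate import quad

def inner(u, v, a=-np.pi, b=np.pi):
    """Inner product of functions, <u, v> = integral_a^b u(x) v(x) dx."""
    value, _ = quad(lambda x: u(x) * v(x), a, b)
    return value

print(inner(np.sin, np.cos))                    # ~0: sin and cos are orthogonal on [-pi, pi]
print(inner(np.cos, lambda x: np.cos(2 * x)))   # ~0 as well, cf. (3.38)
```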
In Section 6.4.6, we will have a look at a second type of unconventional
inner products: the inner product of random variables.

3.8 Orthogonal Projections


Projections are an important class of linear transformations (besides rota-
tions and reflections) and play an important role in graphics, coding the-
ory, statistics and machine learning. In machine learning, we often deal
with data that is high-dimensional. High-dimensional data is often hard
to analyze or visualize. However, high-dimensional data quite often pos-
sesses the property that only a few dimensions contain most information,
and most other dimensions are not essential to describe key properties
of the data. When we compress or visualize high-dimensional data, we
will lose information. To minimize this compression loss, we ideally find
the most informative dimensions in the data. As discussed in Chapter 1, data can be represented as vectors ("feature" is a common expression for data representation), and in this chapter, we will discuss some of the fundamental tools for data compression. More specifically, we can project the original high-dimensional data onto a lower-dimensional
feature space and work in this lower-dimensional space to learn more
about the dataset and extract relevant patterns. For example, machine


[Figure 3.9: Orthogonal projection (orange dots) of a two-dimensional dataset (blue dots) onto a one-dimensional subspace (straight line).]

learning algorithms, such as principal component analysis (PCA) by Pear-


son (1901) and Hotelling (1933) and deep neural networks (e.g., deep
auto-encoders (Deng et al., 2010)), heavily exploit the idea of dimension-
ality reduction. In the following, we will focus on orthogonal projections,
which we will use in Chapter 10 for linear dimensionality reduction and
in Chapter 12 for classification. Even linear regression, which we discuss
in Chapter 9, can be interpreted using orthogonal projections. For a given
lower-dimensional subspace, orthogonal projections of high-dimensional
data retain as much information as possible and minimize the difference/
error between the original data and the corresponding projection. An il-
lustration of such an orthogonal projection is given in Figure 3.9. Before
we detail how to obtain these projections, let us define what a projection
actually is.

Definition 3.10 (Projection). Let V be a vector space and U ⊆ V a subspace of V . A linear mapping π : V → U is called a projection if π² = π ◦ π = π.

Since linear mappings can be expressed by transformation matrices (see Section 2.7), the preceding definition applies equally to a special kind of transformation matrices, the projection matrices $P_\pi$, which exhibit the property that $P_\pi^2 = P_\pi$.

In the following, we will derive orthogonal projections of vectors in the inner product space (Rn , ⟨·, ·⟩) onto subspaces. We will start with one-dimensional subspaces, which are also called lines. If not mentioned otherwise, we assume the dot product ⟨x, y⟩ = x⊤y as the inner product.

3.8.1 Projection onto One-Dimensional Subspaces (Lines)


Assume we are given a line (one-dimensional subspace) through the ori-
gin with basis vector b ∈ Rn . The line is a one-dimensional subspace
U ⊆ Rn spanned by b. When we project x ∈ Rn onto U , we seek the
vector πU (x) ∈ U that is closest to x. Using geometric arguments, let

[Figure 3.10: Examples of projections onto one-dimensional subspaces. (a) Projection of x ∈ R2 onto a subspace U with basis vector b. (b) Projection of a two-dimensional vector x with ∥x∥ = 1 onto a one-dimensional subspace spanned by b.]

us characterize some properties of the projection πU (x) (Figure 3.10(a) serves as an illustration):

The projection πU (x) is closest to x, where "closest" implies that the distance ∥x − πU (x)∥ is minimal. It follows that the segment πU (x) − x from πU (x) to x is orthogonal to U , and therefore the basis vector b of U . The orthogonality condition yields ⟨πU (x) − x, b⟩ = 0 since angles between vectors are defined via the inner product.
The projection πU (x) of x onto U must be an element of U and, therefore, a multiple of the basis vector b that spans U . Hence, πU (x) = λb, for some λ ∈ R. (λ is then the coordinate of πU (x) with respect to b.)

In the following three steps, we determine the coordinate λ, the projection πU (x) ∈ U , and the projection matrix $P_\pi$ that maps any x ∈ Rn onto U :

1. Finding the coordinate λ. The orthogonality condition yields
$$\langle x - \pi_U(x), b\rangle = 0 \overset{\pi_U(x) = \lambda b}{\iff} \langle x - \lambda b, b\rangle = 0\,. \tag{3.39}$$
We can now exploit the bilinearity of the inner product and arrive at
$$\langle x, b\rangle - \lambda\langle b, b\rangle = 0 \iff \lambda = \frac{\langle x, b\rangle}{\langle b, b\rangle} = \frac{\langle b, x\rangle}{\|b\|^2}\,. \tag{3.40}$$
(With a general inner product, we get λ = ⟨x, b⟩ if ∥b∥ = 1.) In the last step, we exploited the fact that inner products are symmetric. If we choose ⟨·, ·⟩ to be the dot product, we obtain
$$\lambda = \frac{b^\top x}{b^\top b} = \frac{b^\top x}{\|b\|^2}\,. \tag{3.41}$$
If ∥b∥ = 1, then the coordinate λ of the projection is given by b⊤x.


2. Finding the projection point πU (x) ∈ U . Since πU (x) = λb, we immediately obtain with (3.40) that
$$\pi_U(x) = \lambda b = \frac{\langle x, b\rangle}{\|b\|^2}\, b = \frac{b^\top x}{\|b\|^2}\, b\,, \tag{3.42}$$
where the last equality holds for the dot product only. We can also compute the length of πU (x) by means of Definition 3.1 as
$$\|\pi_U(x)\| = \|\lambda b\| = |\lambda|\,\|b\|\,. \tag{3.43}$$
Hence, our projection is of length |λ| times the length of b. This also adds the intuition that λ is the coordinate of πU (x) with respect to the basis vector b that spans our one-dimensional subspace U .
If we use the dot product as an inner product, we get
$$\|\pi_U(x)\| \overset{(3.42)}{=} \frac{|b^\top x|}{\|b\|^2}\,\|b\| \overset{(3.25)}{=} |\cos\omega|\,\|x\|\,\|b\|\,\frac{\|b\|}{\|b\|^2} = |\cos\omega|\,\|x\|\,. \tag{3.44}$$
Here, ω is the angle between x and b. This equation should be familiar from trigonometry: If ∥x∥ = 1, then x lies on the unit circle. It follows that the projection onto the horizontal axis spanned by b is exactly cos ω , and the length of the corresponding vector πU (x) = |cos ω|. (The horizontal axis is a one-dimensional subspace.) An illustration is given in Figure 3.10(b).
3. Finding the projection matrix $P_\pi$. We know that a projection is a linear mapping (see Definition 3.10). Therefore, there exists a projection matrix $P_\pi$, such that $\pi_U(x) = P_\pi x$. With the dot product as inner product and
$$\pi_U(x) = \lambda b = b\lambda = b\,\frac{b^\top x}{\|b\|^2} = \frac{bb^\top}{\|b\|^2}\, x\,, \tag{3.45}$$
we immediately see that
$$P_\pi = \frac{bb^\top}{\|b\|^2}\,. \tag{3.46}$$
Note that bb⊤ (and, consequently, $P_\pi$) is a symmetric matrix (of rank 1), and ∥b∥² = ⟨b, b⟩ is a scalar; projection matrices are always symmetric.

The projection matrix $P_\pi$ projects any vector x ∈ Rn onto the line through the origin with direction b (equivalently, the subspace U spanned by b).

Remark. The projection πU (x) ∈ Rn is still an n-dimensional vector and not a scalar. However, we no longer require n coordinates to represent the projection, but only a single one if we want to express it with respect to the basis vector b that spans the subspace U : λ. ♢
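The three steps above translate directly into code. The following sketch is not from the book and assumes NumPy; b and x are arbitrary example vectors:

```python
import numpy as np

b = np.array([1.0, 2.0, 2.0])        # basis vector of the line U = span[b]
x = np.array([1.0, 1.0, 1.0])

# Coordinate lambda (3.41), projection (3.42), projection matrix (3.46)
lam = b @ x / (b @ b)
pi_x = lam * b
P = np.outer(b, b) / (b @ b)

print(np.allclose(P @ x, pi_x))              # True
print(np.allclose(P @ P, P))                 # True: P is idempotent
print(np.isclose((x - pi_x) @ b, 0.0))       # True: residual is orthogonal to b
```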
