Contents

Acknowledgements
Notation
1 Introduction
1.1 Machine Learning Exemplified
1.2 About This Book
1.3 Further Reading
Bibliography
Index

This material is published by Cambridge University Press. This pre-publication version is free to view
and download for personal use only. Not for re-distribution, re-sale, or use in derivative works.
© Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, and Thomas B. Schön 2022.
Acknowledgements
Many people have helped us throughout the writing of this book. First of all, we
want to mention David Sumpter, who, in addition to giving feedback from using the
material for teaching, contributed the entire Chapter 12 on ethical aspects. We have
also received valuable feedback from many students and other teacher colleagues.
We are, of course, very grateful for each and every comment we have received; in
particular, we want to mention David Widmann, Adrian Wills, Johannes Hendricks,
Mattias Villani, Dmitrijs Kass, and Joel Oskarsson. We have also received useful
feedback on the technical content of the book, including the practical insights in
Chapter 11, from Agrin Hilmkil (at Peltarion), Salla Franzén and Alla Tarighati
(at SEB), Lawrence Murray (at Uber), James Hensman and Alexis Boukouvalas
(at Secondmind), Joel Kronander and Nand Dalal (at Nines), and Peter Lindskog
and Jacob Roll (at Arriver). We also received valuable comments from Arno
Solin on Chapters 8 and 9, and from Joakim Lindblad on Chapter 6. Several people
helped us with the figures illustrating the examples in Chapter 1, namely Antônio
Ribeiro (Figure 1.1), Fredrik K. Gustafsson (Figure 1.4), and Theodoros Damoulas
(Figure 1.5). Thank you all for your help!
During the writing of this book, we enjoyed financial support from AI Competence
for Sweden, the Swedish Research Council (projects: 2016-04278, 2016-06079,
2017-03807, 2020-04122), the Swedish Foundation for Strategic Research (projects:
ICA16-0015, RIT12-0012), the Wallenberg AI, Autonomous Systems and Software
Program (WASP) funded by the Knut and Alice Wallenberg Foundation, ELLIIT,
and the Kjell och Märta Beijer Foundation.
Finally, we are thankful to Lauren Cowles at Cambridge University Press for
helpful advice and guidance through the publishing process and to Chris Cartwright
for careful and helpful copyediting.
Notation
Symbol          Meaning

General mathematics
𝑏               a scalar
b               a vector
B               a matrix
(·)ᵀ            transpose
sign(𝑥)         the sign operator; +1 if 𝑥 > 0, −1 if 𝑥 < 0
∇               del operator; ∇𝑓 is the gradient of 𝑓
‖b‖₂            Euclidean norm of b
‖b‖₁            taxicab norm of b
𝑝(𝑧)            probability density (if 𝑧 is a continuous random variable) or probability mass (if 𝑧 is a discrete random variable)
𝑝(𝑧 | 𝑥)        the probability density (or mass) for 𝑧 conditioned on 𝑥
N(𝑧; 𝑚, 𝜎²)     the normal probability distribution for the random variable 𝑧 with mean 𝑚 and variance 𝜎²
𝐿               loss function
𝐽               cost function

Supervised methods
𝜽               parameters to be learned from training data
𝑔(x)            model of 𝑝(𝑦 | x) (most classification methods)
𝜆               regularisation parameter
𝜙               link function (generalised linear models)
ℎ               activation function (neural networks)
1 Introduction
Machine learning is about learning, reasoning, and acting based on data. This
is done by constructing computer programs that process the data, extract useful
information, make predictions regarding unknown properties, and suggest actions to
take or decisions to make. What turns data analysis into machine learning is that the
process is automated and that the computer program is learnt from data. This means
that generic computer programs are used, which are adapted to application-specific
circumstances by automatically adjusting the settings of the program based on
observed, so-called training data. It can therefore be said that machine learning
is a way of programming by example. The beauty of machine learning is that it is
quite arbitrary what the data represents, and we can design general methods that are
useful for a wide range of practical applications in different domains. We illustrate
this via a range of examples below.
The ‘generic computer program’ referred to above corresponds to a mathematical
model of the data. That is, when we develop and describe different machine
learning methods, we do this using the language of mathematics. The mathematical
model describes a relationship between the quantities involved, or variables, that
correspond to the observed data and the properties of interest (such as predictions,
actions, etc.). Hence, the model is a compact representation of the data that, in a
precise mathematical form, captures the key properties of the phenomenon we are
studying. Which model to make use of is typically guided by the machine learning
engineer’s insights generated when looking at the available data and the practitioner’s
general understanding of the problem. When implementing the method in practice,
this mathematical model is translated into code that can be executed on a computer.
However, to understand what the computer program actually does, it is important
also to understand the underlying mathematics.
As mentioned above, the model (or computer program) is learnt based on the
available training data. This is accomplished by using a learning algorithm which
is capable of automatically adjusting the settings, or parameters, of the model to
agree with the data. In summary, the three cornerstones of machine learning are:
1. The data.
2. The mathematical model.
3. The learning algorithm.
In this introductory chapter, we will give a taste of the machine learning problem
by illustrating these cornerstones with a few examples. They come from different
application domains and have different properties, but nevertheless, they can all
be addressed using similar techniques from machine learning. We also give some
advice on how to proceed through the rest of the book and, at the end, provide
references to good books on machine learning for the interested reader who wants
to dig further into this topic.
1.1 Machine Learning Exemplified

The leading cause of death globally is conditions that affect the heart and blood
vessels, collectively referred to as cardiovascular diseases. Heart problems often
influence the electrical activity of the heart, which can be measured using electrodes
attached to the body. The electrical signals are reported in an electrocardiogram
(ECG). In Figure 1.1 we show examples of (parts of) the measured signals from
three different hearts. The measurements stem from a healthy heart (top), a heart
suffering from atrial fibrillation (middle), and a heart suffering from right bundle
branch block (bottom). Atrial fibrillation makes the heart beat without rhythm,
making it hard for the heart to pump blood in a normal way. Right bundle branch
block corresponds to a delay or blockage in the electrical pathways of the heart.
[Figure 1.1: examples of measured ECG signals from a healthy heart (top), a heart suffering from atrial fibrillation (middle), and a heart suffering from right bundle branch block (bottom).]
To improve the diagnostic accuracy, as well as to save time for the cardiologists,
we can ask ourselves if this process can be automated to some extent. That is, can
we construct a computer program which reads in the ECG signals, analyses the data,
and returns a prediction regarding the normality or abnormality of the heart? Such
models, capable of accurately interpreting an ECG examination in an automated
fashion, will find applications globally, but the needs are most acute in low- and
middle-income countries. An important reason for this is that the population in these
countries often do not have easy and direct access to highly skilled cardiologists
capable of accurately carrying out ECG diagnoses. Furthermore, more than 75% of
deaths caused by cardiovascular diseases occur in low- and middle-income countries.
The key challenge in building such a computer program is that it is far from
obvious which computations are needed to turn the raw ECG signal into a prediction
about the heart condition. Even if an experienced cardiologist were to try to explain
to a software developer which patterns in the data to look for, translating the
cardiologist’s experience into a reliable computer program would be extremely
challenging.
To tackle this difficulty, the machine learning approach is to instead teach the
computer program through examples. Specifically, instead of asking the cardiologist
to specify a set of rules for how to classify an ECG signal as normal or abnormal,
we simply ask the cardiologist (or a group of cardiologists) to label a large number
of recorded ECG signals with labels corresponding to the underlying heart
condition. This is a much easier (albeit possibly tedious) way for the cardiologists
to communicate their experience and encode it in a way that is interpretable by a
computer.
The task of the learning algorithm is then to automatically adapt the computer
program so that its predictions agree with the cardiologists’ labels on the labelled
training data. The hope is that, if it succeeds on the training data (where we already
know the answer), then it should be possible to trust the predictions made by the
program on previously unseen data (where we do not know the answer) as well.
This is the approach taken by Ribeiro et al. (2020), who developed a machine
learning model for ECG prediction. In their study, the training data consists of
more than 2 300 000 ECG records from almost 1 700 000 different patients from the
state of Minas Gerais in Brazil. More specifically, each ECG corresponds to 12
time series (one from each of the 12 electrodes that were used in conducting the
exam), each with a duration of between 7 and 10 seconds, sampled at frequencies ranging
from 300 Hz to 600 Hz. These ECGs can be used to provide a full evaluation of
the electrical activity of the heart, and it is indeed the most commonly used test
in evaluating the heart. Importantly, each ECG in the dataset also comes with a
label sorting it into different classes – no abnormalities, atrial fibrillation, right
bundle branch block, etc. – according to the status of the heart. Based on this data,
a machine learning model is trained to automatically classify a new ECG recording
without requiring a human doctor to be involved. The model used is a deep neural
network, more specifically a so-called residual network, which is commonly used
for images. The researchers adapted this to work for the ECG signals of relevance
for this study. In Chapter 6, we introduce deep learning models and their training
algorithms.
Evaluating how a model like this will perform in practice is not straightforward.
The approach taken in this study was to ask three different cardiologists with
Before we move on, let us pause and reflect on the example introduced above. In
fact, many concepts that are central to machine learning can be recognised in this
example.
As we mentioned above, the first cornerstone of machine learning is the data.
Taking a closer look at what the data actually is, we note that it comes in different
forms. First, we have the training data which is used to train the model. Each
training data point consists of both the ECG signal, which we refer to as the input,
and its label corresponding to the type of heart condition seen in this signal, which
we refer to as the output. To train the model, we need access to both the inputs
and the outputs, where the latter had to be manually assigned by domain experts
(or possibly some auxiliary examination). Training a model from labelled data
points is therefore referred to as supervised learning. We think of the learning
as being supervised by the domain expert, and the learning objective is to obtain
a computer program that can mimic the labelling done by the expert. Second,
we have the (unlabelled) ECG signals that will be fed to the program when it is
used ‘in production’. It is important to remember that the ultimate goal of the
model is to obtain accurate predictions in this second phase. We say that the
predictions made by the model must generalise beyond the training data. How to
train models that are capable of generalising, and how to evaluate to what extent
they do so, is a central theoretical question studied throughout this book (see in
particular Chapter 4).
We illustrate the training of the ECG prediction model in Figure 1.2. The general
structure of the training procedure is, however, the same (or at least very similar)
for all supervised machine learning problems.
Another key concept that we encountered in the ECG example is the notion of a
classification problem. Classification is a supervised machine learning task which
amounts to predicting a certain class, or label, for each data point. Specifically, for
classification problems, there are only a finite number of possible output values.
In the ECG example, the classes correspond to the type of heart condition. For
instance, the classes could be ‘normal’ or ‘abnormal’, in which case we refer to
it as a binary classification problem (only two possible classes). More generally,
we could design a model for classifying each signal as either ‘normal’, or assign it
to one of a predetermined set of abnormalities. We then face a (more ambitious)
multi-class classification problem.
Classification is, however, not the only application of supervised machine learning
that we will encounter. Specifically, we will also study another type of problem
referred to as regression problems. Regression differs from classification in that the
output (that is, the quantity that we want the model to predict) is a numerical value.

Figure 1.2: Illustration of the supervised machine learning process, with the training to the left
and the use of the trained model to the right. Left: values for the unknown parameters of
the model are set by the learning algorithm such that the model best describes the available
training data. Right: the learned model is used on new, previously unseen data, where we
hope to obtain a correct classification. It is thus essential that the model is able to generalise
to new data that is not present in the training data.
We illustrate with an example from material science.
Comparing the two examples discussed above, we can make a few interesting
observations. As already pointed out, one difference is that the ECG model is asked
to predict a certain class (say, normal or abnormal), whereas the materials discovery
model is asked to predict a numerical value (the formation energy of a crystal).
These are the two main types of prediction problems that we will study in this book,
referred to as classification and regression, respectively. While conceptually similar,
we often use slight variations of the underpinning mathematical models, depending
on the problem type. It is therefore instructive to treat them separately.
Both types are supervised learning problems, though. That is, we train a predictive
model to mimic the predictions made by a ‘supervisor’. However, it is interesting to
note that the supervision is not necessarily done by a human domain expert. Indeed,
for the formation energy model, the training data was obtained by running automated
(but costly) density functional theory computations. In other situations, we might
obtain the output values naturally when collecting the training data. For instance,
assume that you want to build a model for predicting the outcome of a soccer match
based on data about the players in the two teams. This is a classification problem
(the output is ‘win’, ‘lose’, or ‘tie’), but the training data does not have to be manually
labelled, since we get the labels directly from historical matches. Similarly, if you
want to build a regression model for predicting the price of an apartment based
on its size, location, condition, etc., then the output (the price) is obtained directly
from historical sales.
Finally, it is worth noting that, although the examples discussed above correspond
to very different application domains, the problems are quite similar from a machine
learning perspective. Indeed, the general procedure outlined in Figure 1.2 is also
applicable to both of these problems.
Soccer is a sport where a great deal of data has been collected on how individual
players act throughout a match, how teams collaborate, how they perform over time,
etc. All this data is used to better understand the game and to help players reach
their full potential.
Consider the problem of predicting whether or not a shot results in a goal. To this
end, we will use a rather simple model, where the prediction is based only on the
player’s position on the field when taking the shot. Specifically, the input is given by
the distance from the goal and the angle between two lines drawn from the player’s
position to the goal posts; see Figure 1.3. The output corresponds to whether or not
the shot results in a goal, meaning that this is a binary classification problem.
[Figure 1.3: the angle 𝜑 and the distance to the goal that make up the input (left), and a heat map of the frequency of goals, on a scale from 0 to 1 (right).]
Clearly, knowing the player’s position is not enough to definitely say if the shot
will be successful. Still, it is reasonable to assume that it provides some information
about the chance of scoring a goal. Indeed, a shot close to the goal line with a large
angle is intuitively more likely to result in a goal than one made from a position
close to the sideline. To acknowledge this fact when constructing a machine learning
model, we will not ask the model to predict the outcome of the shot but rather to
predict the probability of a goal. This is accomplished by using a probabilistic
model which is trained by maximising the total probability of the observed training
data with respect to the probabilistic predictions. For instance, using a so-called
logistic regression model (see Chapter 3) we obtain a predicted probability of
scoring a goal from any position, illustrated using a heat map in the right panel in
Figure 1.3.
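To make this concrete, here is a minimal Python sketch of how such a probabilistic prediction could look for this example. The parameter values in theta are hypothetical placeholders, not values learned from the study's data; Chapter 3 describes how logistic regression parameters are actually learned by maximising the probability of the training data.

import math

def goal_probability(distance, angle, theta=(-0.5, -0.08, 1.2)):
    # Logistic regression: a linear combination of the inputs is squashed through
    # the logistic function so that the output is a probability in (0, 1).
    # theta = (intercept, weight for distance, weight for angle) -- hypothetical values.
    z = theta[0] + theta[1] * distance + theta[2] * angle
    return 1.0 / (1.0 + math.exp(-z))

# A shot from 10 metres away with a 0.8 rad angle (illustrative numbers only).
print(goal_probability(distance=10.0, angle=0.8))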
The bottom part of Figure 1.4 shows the prediction generated by such an algorithm,
where the aim is to classify each pixel as either car (blue), traffic sign (yellow),
pavement (purple), or tree (green). The best performing solutions for this task today
rely on cleverly crafted deep neural networks (see Chapter 6).
In the final example, we raise the bar even higher, since here the model needs
to be able to explain dependencies not only over space, but also over time, in a
so-called spatio-temporal problem. These problems are finding more and more
applications as we get access to more and more data. More precisely, we look into
the problem of how to build probabilistic models capable of better estimating and
forecasting air pollution across time and space in a city, in this case London.
Roughly 91% of the world’s population lives in places where the air quality levels are
worse than those recommended by the World Health Organization. Recent estimates
indicate that 4.2 million people die each year from stroke, heart disease, lung cancer,
and chronic respiratory diseases caused by ambient air pollution.
A natural first step in dealing with this problem is to develop technology to
measure and aggregate information about the air pollution levels across time and
space. Such information enables the development of machine learning models to
better estimate and accurately forecast air pollution, which in turn permits suitable
interventions. The work that we feature here sets out to do this for the city of London,
where more than 9 000 people die early every year as a result of air pollution.
Air quality sensors are now – as opposed to the situation in the recent past –
available at relatively low cost. This, combined with an increasing awareness of the
Figure 1.5 illustrates the output from the Gaussian process model in terms of
spatio-temporal estimation and forecasting of NO2 levels in London. To the left,
we have the situation on 19 February 2019 at 11:00, using observations from both
ground sensors providing hourly readings of NO2 and from satellite data. To the
right, we have the situation on 19 February 2019 at 17:00 using only the satellite data.
The Gaussian process is a non-parametric and probabilistic model for nonlinear
functions. Non-parametric means that it does not rely on any particular parametric
functional form to be postulated. The fact that it is a probabilistic model means that
it is capable of representing and manipulating uncertainty in a systematic way.
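As a rough illustration of what 'probabilistic' means here, the sketch below computes the posterior mean and variance of a Gaussian process regression model with a squared-exponential kernel. The kernel, its hyperparameters, the noise level, and the toy data are assumptions made only for this sketch; they are not taken from the London study.

import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Squared-exponential covariance between two sets of one-dimensional inputs.
    d2 = (A[:, None] - B[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise_std=0.1):
    # Posterior mean and variance of a zero-mean GP conditioned on noisy observations.
    K = sq_exp_kernel(x_train, x_train) + noise_std**2 * np.eye(len(x_train))
    K_s = sq_exp_kernel(x_train, x_test)
    K_ss = sq_exp_kernel(x_test, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                      # point predictions
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss - v.T @ v)             # predictive variance: the model's uncertainty
    return mean, var

# Tiny synthetic example: three observations, predictions at two new inputs.
x_obs = np.array([0.0, 1.0, 2.0]); y_obs = np.array([0.1, 0.9, 0.2])
mean, var = gp_posterior(x_obs, y_obs, np.array([0.5, 3.0]))

The predictive variance grows for test inputs far from the observations, which is one concrete sense in which such a model represents its own uncertainty.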
1.2 About This Book

The aim of this book is to convey the spirit of supervised machine learning,
without requiring any previous experience in the field. We focus on the underlying
mathematics as well as the practical aspects. This book is a textbook; it is not
a reference work or a programming manual. It therefore contains only a careful
selection of machine learning methods.
[Figure 1.6 shows the chapters as blocks, including 10: Generative Models and Learning from Unlabelled Data, 11: User Aspects of Machine Learning, and 12: Ethics in Machine Learning.]
Figure 1.6: The structure of this book, illustrated by blocks (chapters) and arrows (recommended
order in which to read the chapters). We recommend that everyone first read (or at
least skim) the fundamental material in Chapters 2, 3, and 4. The path through the
technically more advanced Chapters 5–9 can be chosen to match the particular interests of
the reader. For Chapters 10, 11, and 12, we recommend reading the fundamental chapters
first.
2 Supervised Learning: A First Approach
In this chapter, we will introduce the supervised machine learning problem as well as
two basic machine learning methods for solving it. The methods we will introduce
are called 𝑘-nearest neighbours and decision trees. These two methods are relatively
simple, and we will derive them on intuitive grounds. Still, these methods are useful
in their own right and are therefore a good place to start. Understanding their inner
workings, advantages, and shortcomings also lays a good foundation for the more
advanced methods that are to come in later chapters.
1 The input is commonly also called feature, attribute, predictor, regressor, covariate, explanatory
variable, controlled variable, and independent variable.
2 The output is commonly also called response, regressand, label, explained variable, predicted
variable, or dependent variable.
The variables contained in our data (input as well as output) can be of two different
types: numerical or categorical. A numerical variable has a natural ordering.
We can say that one instance of a numerical variable is larger or smaller than
another instance of the same variable. A numerical variable could, for instance,
be represented by a continuous real number, but it could also be discrete, such
as an integer. Categorical variables, on the other hand, are always discrete and,
importantly, they lack a natural ordering. In this book, we assume that any categorical
variable can take only a finite number of different values. A few examples are given
in Table 2.1.
3 For image-based problems, it is often more convenient to represent the input as a matrix of size
ℎ × 𝑤 than as a vector of length 𝑝 = ℎ𝑤, but the dimension is nevertheless the same. We will get
back to this in Chapter 6 when discussing the convolutional neural network, a model structure
tailored to image-type inputs.
The distinction between numerical and categorical is sometimes somewhat
arbitrary. We could, for instance, argue that having no children is qualitatively
different from having children, and use the categorical variable ‘children: yes/no’
instead of the numerical ‘0, 1 or 2 children’. It is therefore a decision for the machine
learning engineer whether a certain variable is to be considered as numerical or
categorical.
The notion of categorical vs. numerical applies to both the output variable 𝑦 and
to the 𝑝 elements 𝑥𝑗 of the input vector x = [𝑥₁ 𝑥₂ · · · 𝑥ₚ]ᵀ. The 𝑝 input variables
do not all have to be of the same type. It is perfectly fine (and common in practice)
to have a mix of categorical and numerical inputs.
Regression means that the output is numerical, and classification means that the
output is categorical.
The reason for this distinction is that the regression and classification problems have
somewhat different properties, and different methods are used for solving them.
Note that the 𝑝 input variables x = [𝑥₁ 𝑥₂ · · · 𝑥ₚ]ᵀ can be either numerical or
categorical for both regression and classification problems. It is only the type of the
output that determines whether a problem is a regression or a classification problem.
A method for solving a classification problem is called a classifier.
For classification, the output is categorical and can therefore only take val-
ues in a finite set. We use 𝑀 to denote the number of elements in the set of
possible output values. It could, for instance, be {false, true} (𝑀 = 2) or
{Sweden, Norway, Finland, Denmark} (𝑀 = 4). We will refer to these elements
as classes or labels. The number of classes 𝑀 is assumed to be known in the
classification problem. To prepare for a concise mathematical notation, we use
Say that we want to build a ‘song categoriser’ app, where the user records a song,
and the app answers by reporting whether the song has the artistic style of either the
Beatles, Kiss, or Bob Dylan. At the heart of this fictitious app, there has to be a
mechanism that takes an audio recording as an input and returns an artist’s name.
If we first collect some recordings with songs from the three groups/artists
(where we know which artist is behind each song: a labelled dataset), we could use
supervised machine learning to learn the characteristics of their different styles and
therefrom predict the artist of the new user-provided song. In supervised machine
learning terminology, the artist name (the Beatles, Kiss, or Bob Dylan) is the
output 𝑦. In this problem, 𝑦 is categorical, and we are hence facing a classification
problem.
One of the important design choices for a machine learning engineer is a detailed
specification of what the input x really is. It would in principle be possible to consider
the raw audio information as input, but that would give a very high-dimensional x
which (unless an audio-specific machine learning method is used) would most likely
require an unrealistically large amount of training data in order to be successful (we
will discuss this aspect in detail in Chapter 4). A better option could therefore be to
define some summary statistics of audio recordings and use those so-called features
as input x instead. As input features, we could, for example, use the length of the
audio recording and the ‘perceived energy’ of the song. The length of a recording
is easy to measure. Since it can differ quite a lot between different songs, we take
the logarithm of the actual length (in seconds) to get values in the same range for all
songs. Such feature transformations are commonly used in practice to make the
input data more homogeneous.
The energy of a songᵃ is a bit trickier to define, and the exact definition may even
be ambiguous. However, we leave that to the audio experts and re-use a piece
of software that they have written for this purposeᵇ without bothering too much
about its inner workings. As long as this piece of software returns a number for
any recording that is fed to it, and always returns the same number for the same
recording, we can use it as an input to a machine learning method.
In Figure 2.1 we have plotted a dataset with 230 songs from the three artists. Each
song is represented by a dot, where the horizontal axis is the logarithm of its length
(measured in seconds) and the vertical axis the energy (on a scale 0–1). When
we later return to this example and apply different supervised machine learning
methods to it, this data will be the training data.
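As a small illustration of how such a feature vector could be assembled in code, the sketch below builds the two-dimensional input (logarithm of the length, energy) from some hypothetical song metadata; the numbers and labels are made up and are not the 230-song dataset plotted in Figure 2.1.

import numpy as np

# Hypothetical metadata: (length in seconds, energy on a 0-1 scale, artist).
songs = [
    (163.0, 0.74, "Kiss"),
    (418.0, 0.35, "Bob Dylan"),
    (185.0, 0.62, "The Beatles"),
]

# Input matrix X (one row per song, columns: log-length and energy) and output vector y.
X = np.array([[np.log(length), energy] for length, energy, _ in songs])
y = np.array([artist for _, _, artist in songs])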
a We use this term to refer to the perceived musical energy, not the signal energy in a strict
sense.
b Specifically, we use http://api.spotify.com/ here.
Ezekiel and Fox (1959) present a dataset with 62 observations of the distance needed
for various cars at different initial speeds to brake to a full stop.ᵃ The dataset has
the following two variables:
- Speed: The speed of the car when the brake signal is given.
- Distance: The distance traveled after the signal is given until the car has
reached a full stop.
[Figure 2.2: the car stopping distance data, with the speed (mph) on the horizontal axis and the distance (feet) on the vertical axis.]
Since the output (the stopping distance) is numerical, this is a regression
problem. We then ask ourselves what the stopping distance would be if the initial
speed were, for example, 33 mph or 45 mph, respectively (two speeds at which
no data has been recorded). Another way to frame this question is to ask for the
prediction ŷ(𝑥★) for 𝑥★ = 33 and 𝑥★ = 45.
a The data is somewhat dated, so the conclusions are perhaps not applicable to modern cars.
(i) To reason about and explore how input and output variables are connected.
An often-encountered task in sciences such as medicine and sociology is
to determine whether a correlation between a pair of variables exists or not
(‘does eating seafood increase life expectancy?’). Such questions can be
addressed by learning a mathematical model and carefully reasoning about
the likelihood that the learned relationships between input x and output 𝑦 are
due only to random effects in the data or if there appears to be some substance
to the proposed relationships.
(ii) To predict the output value 𝑦★ for some new, previously unseen input x★.
By using some mathematical method which generalises the input–output
examples seen in the training data, we can make a prediction ŷ(x★) for a
previously unseen test input x★. The hat ( ˆ ) indicates that the prediction is an
estimate of the output.
These two objectives are sometimes used to roughly distinguish between classical
statistics, focusing more on objective (i), and machine learning, where objective
(ii) is more central. However, this is not a clear-cut distinction since predictive
modelling is a topic in classical statistics too, and explainable models are also
studied in machine learning. The primary focus in this book, however, is on making
predictions, objective (ii) above, which is the foundation of supervised machine
learning. Our overall goal is to obtain as accurate predictions ŷ(x★) as possible
(measured in some appropriate way) for a wide range of possible test inputs x★. We
say that we are interested in methods that generalise well beyond the training data.
A method that generalises well for the music example above would be able to
correctly tell the artist of a new song which was not in the training data (assuming
that the artist of the new song is one of the three that was present in the training
data, of course). The ability to generalise to new data is a key concept of machine
learning. It is not difficult to construct models or methods that give very accurate
predictions if they are only evaluated on the training data (we will see an example
in the next section). However, if the model is not able to generalise, meaning that
the predictions are poor when the model is applied to new test data points, then
the model is of little use in practice for making predictions. If this is the case, we
say that the model is overfitting to the training data. We will illustrate the issue
of overfitting for a specific machine learning model in the next section, and in
Chapter 4 we will return to this concept using a more general and mathematical
approach.
4 The Euclidean distance between a test point x★ and a training data point x𝑖 is ‖x𝑖 − x★‖₂ =
√((𝑥𝑖1 − 𝑥★1)² + (𝑥𝑖2 − 𝑥★2)²). Other distance functions can also be used and will be discussed in
Chapter 8. Categorical input variables can be handled, as we will discuss in Chapter 3.
5 Ties can be handled in different ways, for instance by a coin-flip, or by reporting the actual vote
count to the end user, who gets to decide what to do with it.
Methods that explicitly use the training data when making predictions are referred
to as nonparametric, and the 𝑘-NN method is one example of this. This is in contrast
with parametric methods, where the prediction is given by some function (a model)
governed by a fixed number of parameters. For parametric methods, the training
data is used to learn the parameters in an initial training phase, but once the model
has been learned, the training data can be discarded since it is not used explicitly
when making predictions. We will introduce parametric modelling in Chapter 3.
and we are interested in predicting the output for x★ = [1 2]ᵀ. For this purpose, we
will explore two different 𝑘-NN classifiers, one using 𝑘 = 1 and one using 𝑘 = 3.
First, we compute the Euclidean distance ‖x𝑖 − x★‖₂ between each training data
point x𝑖 (red and blue dots) and the test data point x★ (black dot), and then sort
them in ascending order.
Since the closest training data point to x★ is the data point 𝑖 = 6 (Red), this
means that for 𝑘-NN with 𝑘 = 1, we get the prediction ŷ(x★) = Red. For 𝑘 = 3,
the three nearest neighbours are 𝑖 = 6 (Red), 𝑖 = 2 (Blue), and 𝑖 = 4 (Blue).
Taking a majority vote among these three training data points, Blue wins with 2
votes against 1, so our prediction becomes ŷ(x★) = Blue. The sorted distances are
summarised in the table below, and both classifiers are illustrated in Figure 2.3.

𝑖   ‖x𝑖 − x★‖₂   𝑦𝑖
6   √1           Red
2   √2           Blue
4   √4           Blue
1   √5           Red
5   √8           Blue
3   √9           Red

[Figure 2.3: the training data points (red and blue dots) and the test point x★ (black dot) in the (𝑥₁, 𝑥₂) plane, together with the neighbourhoods used by 𝑘 = 1 and 𝑘 = 3.]
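A minimal Python sketch of the computation just carried out (Euclidean distances, sorting, majority vote) is given below. The coordinates are illustrative values chosen only so that they reproduce the distances √1, . . . , √9 and labels above; they are not the original training data table.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_star, k):
    dists = np.linalg.norm(X_train - x_star, axis=1)       # Euclidean distances
    nearest = np.argsort(dists)[:k]                         # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority vote

# Illustrative coordinates consistent with the distances and labels listed above.
X_train = np.array([[-1.0, 1.0], [2.0, 1.0], [1.0, -1.0],
                    [-1.0, 2.0], [-1.0, 0.0], [1.0, 1.0]])
y_train = np.array(["Red", "Blue", "Red", "Blue", "Blue", "Red"])
x_star = np.array([1.0, 2.0])

print(knn_classify(X_train, y_train, x_star, k=1))   # Red
print(knn_classify(X_train, y_train, x_star, k=3))   # Blue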
[Figure 2.4: the decision boundaries for Example 2.3 with 𝑘 = 1 (left) and 𝑘 = 3 (right); the coloured regions show the prediction, ŷ(x★) = Red or ŷ(x★) = Blue, for every possible test input.]
Choosing k
The number of neighbours 𝑘 that are considered when making a prediction with
𝑘-NN is an important choice the user has to make. Since 𝑘 is not learned by
𝑘-NN itself but is a design choice left to the user, we refer to it as a hyperparameter.
Throughout the book, we will use the term ‘hyperparameter’ for similar tuning
parameters for other methods.
The choice of the hyperparameter 𝑘 has a big impact on the predictions made by
𝑘-NN. To understand the impact of 𝑘, we study how the decision boundary changes
as 𝑘 changes in Figure 2.5, where 𝑘-NN is applied to the music classification
Example 2.1 and the car stopping distance Example 2.2, both with 𝑘 = 1 and 𝑘 = 20.
With 𝑘 = 1, all training data points will, by construction, be correctly predicted,
and the model is adapted to the exact x and 𝑦 values of the training data. In the
classification problem there are, for instance, small green (Bob Dylan) regions
within the red (the Beatles) area that are most likely misleading when it comes to
accurately predicting the artist of a new song. In order to make good predictions, it
would probably be better to instead predict red (the Beatles) for a new song in the
entire middle-left region since the vast majority of training data points in that area
are red. For the regression problem, 𝑘 = 1 gives quite shaky behaviour, and also for
this problem, it is intuitively clear that this does not describe an actual effect, but
rather that the prediction is adapting to the noise in the data.
The drawbacks of using 𝑘 = 1 are not specific to these two examples. In most
real world problems there is a certain amount of randomness in the data, or at
least insufficient information, which can be thought of as a random effect. In the
music example, the 𝑛 = 230 songs were selected from all songs ever recorded
from these artists, and since we do not know how this selection was made, we may
consider it random. Furthermore, and more importantly, if we want our classifier to
generalise to completely new data, like new releases from the artists in our example
(overlooking the obvious complication for now), then it is not reasonable to assume
that the length and energy of a song will give a complete picture of the artistic
styles. Hence, even with the best possible model, there is some ambiguity about
which artist has recorded a song if we only look at these two input variables. This
ambiguity is modelled as random noise. Also for the car stopping distance, there
appear to be some random effects, not only in 𝑥 but also in 𝑦. By
using 𝑘 = 1 and thereby adapting very closely to the training data, the predictions
will depend not only on the interesting patterns in the problem but also on the (more
or less) random effects that have shaped the training data. Typically we are not
interested in capturing these effects, and we refer to this as overfitting.

Figure 2.5: 𝑘-NN applied to the music classification Example 2.1 (a and b) and the car
stopping distance Example 2.2 (c and d). For both problems, 𝑘-NN is applied with 𝑘 = 1
and 𝑘 = 20.
With the 𝑘-NN classifier, we can mitigate overfitting by increasing the region
of the neighbourhood used to compute the prediction, that is, increasing the
hyperparameter 𝑘. With, for example, 𝑘 = 20, the predictions are no longer based
only on the closest neighbour but are instead a majority vote among the 20 closest
neighbours. As a consequence, all training data points are no longer perfectly
classified, but some of the songs end up in the wrong region in Figure 2.5b. The
predictions are, however, less adapted to the peculiarities of the training data and
thereby less overfitted, and Figure 2.5b and d are indeed less ‘noisy’ than Figure 2.5a
and c. However, if we make 𝑘 too large, then the averaging effect will wash out
all interesting patterns in the data as well. Indeed, for sufficiently large 𝑘 the
neighbourhood will include all training data points, and the model will reduce to
predicting the mean of the data for any input.
Selecting 𝑘 is thus a trade-off between flexibility and rigidity. Since selecting
𝑘 either too big or too small will lead to a meaningless classifier, there must
exist a sweet spot for some moderate 𝑘 (possibly 20, but it could be less or more)
where the classifier generalises best. Unfortunately, there is no general answer as
to which 𝑘 this happens for, and it is different for different problems. In
the music classification problem, it seems reasonable that 𝑘 = 20 will predict new
test data points better than 𝑘 = 1, but there might very well be an even better
choice of 𝑘. For the car stopping problem, the behaviour is also more reasonable
for 𝑘 = 20 than 𝑘 = 1, except for the boundary effect for large 𝑥, where 𝑘-NN is
unable to capture the trend in the data as 𝑥 increases (simply because the 20 nearest
neighbours are the same for all test points 𝑥★ around and above 35). A systematic
way of choosing a good value for 𝑘 is to use cross-validation, which we will discuss
in Chapter 4.
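Although cross-validation is the topic of Chapter 4, the sketch below hints at the basic idea: hold some data out from the 𝑘-NN 'training' set and compare candidate values of 𝑘 on it. It reuses the knn_classify function from the earlier sketch, and X, y are assumed to be whatever labelled data is available.

import numpy as np

def holdout_accuracy(X, y, k, train_fraction=0.7, seed=0):
    # Randomly set aside a validation part that is not used as neighbours.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(train_fraction * len(X))
    train, val = order[:n_train], order[n_train:]
    hits = sum(knn_classify(X[train], y[train], X[i], k) == y[i] for i in val)
    return hits / len(val)

# Compare a few candidate values of k and keep the one with the best validation accuracy.
# for k in (1, 5, 10, 20):
#     print(k, holdout_accuracy(X, y, k))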
Time to reflect 2.1 The prediction ŷ(x★) obtained using the 𝑘-NN method is
a piecewise constant function of the input x★. For a classification problem,
this is natural, since the output is categorical (see, for example, Figure 2.5
where the coloured regions correspond to areas of the input space where
the prediction is constant according to the colour of that region). However,
𝑘-NN will also have piecewise constant predictions for regression problems.
Why?
Input Normalisation
A final important practical aspect when using 𝑘-NN is the importance of normal-
isation of the input data. Imagine a training dataset with 𝑝 = 2 input variables
x = [𝑥 1 𝑥 2 ] T where all values of 𝑥 1 are in the range [100, 1100] and the values
for 𝑥 2 are in the much smaller range [0, 1]. It could, for example, be that 𝑥 1 and
𝑥₂ are measured in different units. The Euclidean distance between a test point
x★ and a training data point x𝑖 is ‖x𝑖 − x★‖₂ = √((𝑥𝑖1 − 𝑥★1)² + (𝑥𝑖2 − 𝑥★2)²). This
expression will typically be dominated by the first term (𝑥𝑖1 − 𝑥★1)², whereas the
second term (𝑥𝑖2 − 𝑥★2)² tends to have a much smaller effect, simply due to the
different magnitude of 𝑥 1 and 𝑥 2 . That is, the different ranges lead to 𝑥 1 being
considered much more important than 𝑥 2 by 𝑘-NN.
To avoid this undesired effect, we can re-scale the input variables. One option, in
the mentioned example, could be to subtract 100 from 𝑥₁ and thereafter divide it by
1 000, creating 𝑥𝑖1ⁿᵉʷ = (𝑥𝑖1 − 100)/1 000, such that 𝑥₁ⁿᵉʷ and 𝑥₂ are both in the range [0, 1].
More generally, this normalisation procedure for the input data can be written as

    𝑥𝑖𝑗ⁿᵉʷ = (𝑥𝑖𝑗 − minℓ(𝑥ℓ𝑗)) / (maxℓ(𝑥ℓ𝑗) − minℓ(𝑥ℓ𝑗)),  for all 𝑗 = 1, . . . , 𝑝, 𝑖 = 1, . . . , 𝑛.   (2.1)

Another common option is standardisation, 𝑥𝑖𝑗ⁿᵉʷ = (𝑥𝑖𝑗 − 𝑥̄𝑗)/𝜎𝑗,
where 𝑥̄𝑗 and 𝜎𝑗 are the mean and standard deviation for each input variable,
respectively.
It is crucial for 𝑘-NN to apply some type of input normalisation (as was indeed
done in Figure 2.5), but it is a good practice to apply this also when using other
methods, for numerical stability if nothing else. It is, however, important to compute
the scaling factors (minℓ (𝑥ℓ 𝑗 ), 𝑥¯ 𝑗 , etc.) using training data only and to also apply
that scaling to future test data points. Failing to do this, for example by performing
normalisation before setting test data aside (which we will discuss more in Chapter 4),
might lead to wrong conclusions on how well the method will perform in predicting
future (not yet seen) data points.
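A minimal sketch of this procedure in Python: the scaling factors in (2.1) are computed from the training inputs only and then applied unchanged to any future test inputs. The training matrix here is a hypothetical stand-in with the ranges from the example above.

import numpy as np

def fit_minmax(X_train):
    # Scaling factors, computed from the training data only.
    return X_train.min(axis=0), X_train.max(axis=0)

def apply_minmax(X, col_min, col_max):
    # Equation (2.1), applied with the training scaling factors.
    return (X - col_min) / (col_max - col_min)

# Hypothetical training inputs with x1 in [100, 1100] and x2 in [0, 1].
X_train = np.array([[100.0, 0.2], [600.0, 0.9], [1100.0, 0.5]])
col_min, col_max = fit_minmax(X_train)
X_train_new = apply_minmax(X_train, col_min, col_max)
# Apply the same transformation, with the same factors, to future test points:
# X_test_new = apply_minmax(X_test, col_min, col_max)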
[Figures 2.6 and 2.7: a classification tree with the internal splits 𝑥₂ < 3.0 and 𝑥₁ < 5.0 and the leaf predictions ŷ₁ = Blue, ŷ₂ = Blue, and ŷ₃ = Red (left), and the corresponding partition of the input space into the regions 𝑅₁, 𝑅₂, and 𝑅₃ (right).]

Figure 2.6: A classification tree. At each internal node, a rule of the form 𝑥𝑗 < 𝑠𝑘 indicates
the left branch coming from that split, and the right branch then consequently corresponds
to 𝑥𝑗 ≥ 𝑠𝑘. This tree has two internal nodes (including the root) and three leaf nodes.

Figure 2.7: A region partition, where each region corresponds to a leaf node in the tree.
Each border between regions corresponds to a split in the tree. Each region is coloured
with the prediction corresponding to that region, and the boundary between red and blue
is therefore the decision boundary.
The decision tree partitions the input space into axis-aligned ‘boxes’, as shown in
Figure 2.7. By increasing the depth of the tree (the number of steps from the root to
the leaves), the partitioning can be made finer and finer and thereby describes more
complicated functions of the input variable.
Pseudo-code for predicting a test input with the tree in Figure 2.6 would look like:
if 𝑥2 < 3.0 then
    return Blue
else
    if 𝑥1 < 5.0 then
        return Blue
    else
        return Red
    end
end
To set the terminology, the endpoint of each branch 𝑅1 , 𝑅2 , and 𝑅3 in Example 2.5
are called leaf nodes, and the internal splits, 𝑥2 < 3.0 and 𝑥 1 < 5.0, are known
as internal nodes. The lines that connect the nodes are referred to as branches.
The tree is referred to as binary since each internal node splits into exactly two
branches.
With more than two input variables, it is difficult to illustrate the partitioning
of the input space into regions (Figure 2.7), but the tree representation can still be
used in the very same way. Each internal node corresponds to a rule where one of
the 𝑝 input variables 𝑥 𝑗 , 𝑗 = 1, . . . , 𝑝, is compared to a threshold 𝑠. If 𝑥 𝑗 < 𝑠, we
continue along the left branch, and if 𝑥 𝑗 ≥ 𝑠, we continue along the right branch.
The constant predictions that we associate with the leaf nodes can be either
categorical (as in Example 2.5 above) or numerical. Decision trees can thus be used
to address both classification and regression problems.
Example 2.5 illustrated how a decision tree can be used to make a prediction. We
will now turn to the question of how a tree can be learned from training data.
We will start by discussing how to learn (or, equivalently, train) a decision tree for a
regression problem. The classification problem is conceptually similar and will be
explained later.
As mentioned above, the prediction ŷ(x★) from a regression tree is a piecewise
constant function of the input x★. We can write this mathematically as

    ŷ(x★) = Σ_{ℓ=1}^{𝐿} ŷℓ I{x★ ∈ 𝑅ℓ},    (2.3)

where 𝐿 is the total number of regions (leaf nodes) in the tree, 𝑅ℓ is the ℓth region,
and ŷℓ is the constant prediction for the ℓth region. Note that in the regression
setting, ŷℓ is a numerical variable, and we will consider it to be a real number for
simplicity. In the equation above, we have used the indicator function, I{x ∈ 𝑅ℓ} = 1
if x ∈ 𝑅ℓ and I{x ∈ 𝑅ℓ} = 0 otherwise.
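As a small illustration of (2.3), the sketch below evaluates a regression tree stored as a list of regions together with their constant predictions. The regions and values are hypothetical and chosen only to mirror the structure of the formula; for a one-dimensional input, each region is simply an interval.

import math

# Hypothetical regions R_ell (intervals on x) with their constant predictions y_hat_ell.
regions = [
    ((-math.inf, 3.0), 1.5),
    ((3.0, 5.0), 0.2),
    ((5.0, math.inf), 2.7),
]

def tree_predict(x_star):
    # Equation (2.3): the sum over regions has exactly one non-zero term,
    # namely the one for the region that contains x_star.
    return sum(y_hat * (low <= x_star < high) for (low, high), y_hat in regions)

print(tree_predict(4.2))   # 0.2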
Learning the tree from data corresponds to finding suitable values for the
parameters defining the function (2.3), namely the regions 𝑅ℓ and the constant
predictions ŷℓ. If we assume that the shape of the tree, that is the regions, is known, we
can compute the constants {ŷℓ}_{ℓ=1}^{𝐿} in a natural way, simply as the average of the
training data outputs in each region:

    ŷℓ = Average{𝑦𝑖 : x𝑖 ∈ 𝑅ℓ}.
It remains to find the shape of the tree, the regions 𝑅ℓ , which requires a bit more
work. The basic idea is, of course, to select the regions so that the tree fits the
training data. This means that the output predictions from the tree should match the
output values in the training data. Unfortunately, even when restricting ourselves
to seemingly simple regions such as the ‘boxes’ obtained from a decision tree,
finding the tree (a collection of splitting rules) that optimally partitions the input
space to fit the training data as well as possible turns out to be computationally
infeasible. The problem is that there is a combinatorial explosion in the number
of ways in which we can partition the input space. Searching through all possible
binary trees is not possible in practice unless the tree size is so small that it is not of
practical use.
To handle this situation, we use a heuristic algorithm known as recursive binary
splitting for learning the tree. The word recursive means that we will determine
the splitting rules one after the other, starting with the first split at the root and
then building the tree from top to bottom. The algorithm is greedy, in the sense
that the tree is constructed one split at a time, without having the complete tree ‘in
mind’. That is, when determining the splitting rule at the root node, the objective is
to obtain a model that explains the training data as well as possible after a single
split, without taking into consideration that additional splits may be added before
arriving at the final model. When we have decided on the first split of the input
space (corresponding to the root node of the tree), this split is kept fixed, and we
continue in a similar way for the two resulting half-spaces (corresponding to the
two branches of the tree), etc.
To see in detail how one step of this algorithm works, consider the situation when
we are about to do our very first split at the root of the tree. Hence, we want to
select one of the 𝑝 input variables 𝑥 1 , . . . , 𝑥 𝑝 and a corresponding cutpoint 𝑠 which
divide the input space into two half-spaces,

    𝑅₁(𝑗, 𝑠) = {x | 𝑥𝑗 < 𝑠} and 𝑅₂(𝑗, 𝑠) = {x | 𝑥𝑗 ≥ 𝑠}.    (2.4)
Note that the regions depend on the index 𝑗 of the splitting variable as well as the
value of the cutpoint 𝑠, which is why we write them as functions of 𝑗 and 𝑠. This is
the case also for the predictions associated with the two regions,
    ŷ₁(𝑗, 𝑠) = Average{𝑦𝑖 : x𝑖 ∈ 𝑅₁(𝑗, 𝑠)} and ŷ₂(𝑗, 𝑠) = Average{𝑦𝑖 : x𝑖 ∈ 𝑅₂(𝑗, 𝑠)},
since the averages in these expressions range over different data points depending
on the regions.
For each training data point (x𝑖, 𝑦𝑖), we can compute a prediction error by first
determining which region the data point falls in and then computing the difference
between 𝑦 𝑖 and the constant prediction associated with that region. Doing this for
all training data points, the sum of squared errors can be written as
    Σ_{𝑖: x𝑖 ∈ 𝑅₁(𝑗,𝑠)} (𝑦𝑖 − ŷ₁(𝑗, 𝑠))² + Σ_{𝑖: x𝑖 ∈ 𝑅₂(𝑗,𝑠)} (𝑦𝑖 − ŷ₂(𝑗, 𝑠))².    (2.5)
The square is added to ensure that the expression above is non-negative and that both
positive and negative errors are counted equally. The squared error is a common
loss function used for measuring the closeness of a prediction to the training data,
but other loss functions can also be used. We will discuss the choice of loss function
in more detail in later chapters.
To find the optimal split, we select the values for 𝑗 and 𝑠 that minimise the squared
error (2.5). This minimisation problem can be solved easily by looping through
all possible values for 𝑗 = 1, . . . , 𝑝. For each 𝑗, we can scan through the finite
number of possible splits and pick the pair ( 𝑗, 𝑠) for which the expression above
is minimised. As pointed out above, when we have found the optimal split at the
root node, this splitting rule is fixed. We then continue in the same way for the left
and right branches independently. Each branch (corresponding to a half-space) is
split again by minimising the squared prediction error over all training data points
following that branch.
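The split-selection step just described can be written compactly. The sketch below tries every input variable 𝑗 and, for each, every cutpoint lying halfway between two consecutive distinct values, and returns the pair that minimises the squared error (2.5). X and y are assumed to be the training inputs (one row per data point) and the numerical outputs.

import numpy as np

def best_split(X, y):
    best_j, best_s, best_error = None, None, np.inf
    n, p = X.shape
    for j in range(p):
        values = np.unique(X[:, j])
        # Only cutpoints between consecutive distinct values need to be checked.
        for s in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            error = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if error < best_error:
                best_j, best_s, best_error = j, s, error
    return best_j, best_s   # index of the splitting variable and the cutpoint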
In principle, we can continue in this way until there is only a single training data
point in each of the regions – that is, until 𝐿 = 𝑛. Such a fully grown tree will result
in predictions that exactly match the training data points, and the resulting model is
quite similar to 𝑘-NN with 𝑘 = 1. As pointed out above, this will typically result in
too erratic a model that has overfitted to (possibly noisy) training data. To mitigate
this issue, it is common to stop the growth of the tree at an earlier stage using some
stopping criterion, for instance by deciding on 𝐿 beforehand, limiting the maximum
depth (number of splits in any branch), or adding a constraint on the minimum
number of training data points associated with each leaf node. Forcing the model to
have more training data points in each leaf will result in an averaging effect, similar
to increasing the value of 𝑘 in the 𝑘-NN method. Using such a stopping criterion
means that the value of 𝐿 is not set manually but is determined adaptively based on
the result of the learning procedure.
A high-level summary of the method is given in Method 2.2. Note that the
learning in Method 2.2 includes a recursive call, where in each recursion we grow
one branch of the tree one step further.
Classification Trees
Trees can also be used for classification. We use the same procedure of recursive
binary splitting but with two main differences. Firstly, we use a majority vote instead
of an average to compute the prediction associated with each region:
    ŷℓ = MajorityVote{𝑦𝑖 : x𝑖 ∈ 𝑅ℓ}.
Method 2.2: Learning a decision tree with recursive binary splitting

Function Split(𝑅, T):
    if the stopping criterion is fulfilled then
        return 𝑅
    else
        Go through all possible splits 𝑥𝑗 < 𝑠 for all input variables 𝑗 = 1, . . . , 𝑝.
        Pick the pair (𝑗, 𝑠) that minimises (2.5)/(2.6) for regression/classification problems.
        Split region 𝑅 into 𝑅₁ and 𝑅₂ according to (2.4).
        Split data T into T₁ and T₂ accordingly.
        return Split(𝑅₁, T₁), Split(𝑅₂, T₂)
    end
end
Secondly, when learning the tree, we need a different splitting criterion than the
squared prediction error to take into account the fact that the output is categorical.
To define these criteria, note first that the split at any internal node is computed by
solving an optimisation problem of the form

    min_{𝑗,𝑠}  𝑛₁𝑄₁ + 𝑛₂𝑄₂,    (2.6)

where 𝑛₁ and 𝑛₂ denote the number of training data points in the left and right nodes
of the current split, respectively, and 𝑄₁ and 𝑄₂ are the costs (derived from the
prediction errors) associated with these two nodes. The variables 𝑗 and 𝑠 denote the
index of the splitting variable and the cutpoint as before. All of the terms 𝑛1 , 𝑛2 ,
𝑄 1 , and 𝑄 2 depend on these variables, but we have dropped the explicit dependence
from the notation for brevity. Comparing (2.6) with (2.5), we see that we recover
the regression case if 𝑄 ℓ corresponds to the mean-squared error in node ℓ.
To generalise this to the classification case, we still solve the optimisation problem
(2.6) to compute the split, but choose 𝑄 ℓ in a different way which respects the
categorical nature of a classification problem. To this end, we first introduce
𝜋̂ℓ𝑚 = (1/𝑛ℓ) ∑_{𝑖: x𝑖 ∈ 𝑅ℓ} I{𝑦𝑖 = 𝑚}
to be the proportion of training observations in the ℓth region that belong to the
𝑚th class. We can then define the splitting criterion, 𝑄 ℓ , based on these class
proportions. One simple alternative is the misclassification rate
𝑄ℓ = 1 − max𝑚 𝜋̂ℓ𝑚,   (2.7a)
which is simply the proportion of data points in region 𝑅ℓ which do not belong to
the most common class. Other common splitting criteria are the Gini index
𝑄ℓ = ∑_{𝑚=1}^{𝑀} 𝜋̂ℓ𝑚 (1 − 𝜋̂ℓ𝑚)   (2.7b)

and the entropy

𝑄ℓ = − ∑_{𝑚=1}^{𝑀} 𝜋̂ℓ𝑚 ln 𝜋̂ℓ𝑚.   (2.7c)
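As a concrete illustration (our own sketch, not part of the book's presentation), all three criteria can be computed directly from the class proportions of the labels that fall in a region:

    import numpy as np

    def splitting_criteria(labels):
        """Misclassification rate, Gini index, and entropy of one region."""
        _, counts = np.unique(np.asarray(labels), return_counts=True)
        pi = counts / counts.sum()          # class proportions
        misclass = 1.0 - pi.max()           # (2.7a)
        gini = np.sum(pi * (1.0 - pi))      # (2.7b)
        entropy = -np.sum(pi * np.log(pi))  # (2.7c); classes absent from the region contribute zero
        return misclass, gini, entropy

    # Two Blue points and one Red point, as in region R1 of the first split below:
    print(splitting_criteria(["Blue", "Blue", "Red"]))   # entropy ≈ 0.64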
The procedure is illustrated in Example 2.6: we consider the same setup as in Example 2.5, but now with a new dataset of ten data points.
We want to learn a classification tree by using the entropy criterion in (2.7c) and
growing the tree until there are no regions with more than five data points left.
First split: There are infinitely many possible splits we can make, but all splits
which give the same partition of the data points are equivalent. Hence, in practice
we only have nine different splits to consider in this dataset. The data (dots) and
these possible splits (dashed lines) are visualised in Figure 2.8.
We consider all nine splits in turn. We start with the split at 𝑥1 = 2.5, which splits
the input space into two regions, 𝑅1 = 𝑥1 < 2.5 and 𝑅2 = 𝑥1 ≥ 2.5. In region 𝑅1 we
have two blue data points and one red, in total 𝑛1 = 3 data points. The proportion
of the two classes in region 𝑅1 will therefore be 𝜋̂1B = 2/3 and 𝜋̂1R = 1/3. The
entropy is calculated as

𝑄1 = −𝜋̂1B ln(𝜋̂1B) − 𝜋̂1R ln(𝜋̂1R) = −(2/3) ln(2/3) − (1/3) ln(1/3) = 0.64.
In region 𝑅2 we have 𝑛2 = 7 data points with the proportions 𝜋̂2B = 3/7 and
𝜋̂2R = 4/7. The entropy for this region will be

𝑄2 = −𝜋̂2B ln(𝜋̂2B) − 𝜋̂2R ln(𝜋̂2R) = −(3/7) ln(3/7) − (4/7) ln(4/7) = 0.68,
and inserted in (2.6), the total weighted entropy for this split becomes
𝑛1 𝑄 1 + 𝑛2 𝑄 2 = 3 · 0.64 + 7 · 0.68 = 6.69.
We compute the costs for all other splits in the same manner and summarise them
in the table below:
Split        𝑛1   𝜋̂1B   𝜋̂1R   𝑄1     𝑛2   𝜋̂2B   𝜋̂2R   𝑄2     𝑛1𝑄1 + 𝑛2𝑄2
𝑥1 < 2.5      3   2/3   1/3   0.64    7   3/7   4/7   0.68   6.69
𝑥1 < 5.0      5   4/5   1/5   0.50    5   1/5   4/5   0.50   5.00
𝑥1 < 6.5      6   4/6   2/6   0.64    4   1/4   3/4   0.56   6.07
𝑥1 < 8.0      7   4/7   3/7   0.68    3   1/3   2/3   0.64   6.69
𝑥2 < 1.5      1   1/1   0/1   0.00    9   4/9   5/9   0.69   6.18
𝑥2 < 3.0      3   3/3   0/3   0.00    7   2/7   5/7   0.60   4.18
𝑥2 < 5.0      5   4/5   1/5   0.50    5   1/5   4/5   0.50   5.00
𝑥2 < 7.0      7   5/7   2/7   0.60    3   0/3   3/3   0.00   4.18
𝑥2 < 8.5      9   5/9   4/9   0.69    1   0/1   1/1   0.00   6.18
From the table, we can read that the two splits at 𝑥 2 < 3.0 and 𝑥 2 < 7.0 are both
equally good. We choose to continue with 𝑥2 < 3.0.
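The numbers in the table are easy to reproduce in code (our own sketch; the class counts are read off directly from the table):

    import numpy as np

    def node_entropy(counts):
        """Entropy of a node computed from its class counts; empty classes contribute zero."""
        counts = np.asarray(counts, dtype=float)
        pi = counts[counts > 0] / counts.sum()
        return -np.sum(pi * np.log(pi))

    def split_cost(counts_1, counts_2):
        """Total weighted cost n1*Q1 + n2*Q2 of a candidate split, as in (2.6)."""
        return sum(counts_1) * node_entropy(counts_1) + sum(counts_2) * node_entropy(counts_2)

    # Split x1 < 2.5: 2 Blue and 1 Red in R1, 3 Blue and 4 Red in R2.
    print(round(split_cost([2, 1], [3, 4]), 2))   # 6.69
    # Split x2 < 3.0: 3 Blue in R1, 2 Blue and 5 Red in R2.
    print(round(split_cost([3, 0], [2, 5]), 2))   # 4.19 (4.18 in the table, up to rounding)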
[Figure 2.9: left – the partition after the first split (𝑥2 < 3.0), with the remaining candidate splits shown as dashed lines; right – the partition into 𝑅1, 𝑅2, and 𝑅3 after the second split. Axes: 𝑥1 (horizontal) and 𝑥2 (vertical).]
Second split: We note that only the upper region has more than five data points.
Also, there is no point splitting region 𝑅1 further since it only contains data points
from the same class. In the next step, we therefore split the upper region into two
new regions, 𝑅2 and 𝑅3 . All possible splits are displayed in Figure 2.9 to the left
(dashed lines), and we compute their costs in the same manner as before:
Split        𝑛2   𝜋̂2B   𝜋̂2R   𝑄2     𝑛3   𝜋̂3B   𝜋̂3R   𝑄3     𝑛2𝑄2 + 𝑛3𝑄3
𝑥1 < 2.5      2   1/2   1/2   0.69    5   1/5   4/5   0.50   3.89
𝑥1 < 5.0      3   2/3   1/3   0.64    4   0/4   4/4   0.00   1.91
𝑥1 < 6.5      4   2/4   2/4   0.69    3   0/3   3/3   0.00   2.77
𝑥1 < 8.0      5   2/5   3/5   0.67    2   0/2   2/2   0.00   3.37
𝑥2 < 5.0      2   1/2   1/2   0.69    5   1/5   4/5   0.50   3.89
𝑥2 < 7.0      4   2/4   2/4   0.69    3   0/3   3/3   0.00   2.77
𝑥2 < 8.5      6   2/6   4/6   0.64    1   0/1   1/1   0.00   3.82
The best split is the one at 𝑥1 < 5.0, visualised in Figure 2.9 to the right. None of the three
regions has more than five data points. Therefore, we terminate the training. The
final tree and its partitions were displayed in Example 2.5. If we want to use the
tree for prediction, we predict blue if x★ ∈ 𝑅1 or x★ ∈ 𝑅2 since the blue training
data points are in the majority in each of these two regions. Similarly, we predict
red if x★ ∈ 𝑅3 .
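Written as code (a direct transcription of the tree learned above; the function name is our own), the resulting prediction rule is:

    def predict_example_tree(x1, x2):
        """Prediction rule of the classification tree learned in this example."""
        if x2 < 3.0:
            return "Blue"   # region R1
        elif x1 < 5.0:
            return "Blue"   # region R2
        else:
            return "Red"    # region R3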
When choosing between the different splitting criteria mentioned above, the
misclassification rate sounds like a reasonable choice since that is typically the
criterion we want the final model to do well on.6 However, one drawback is that it
does not favour pure nodes. By pure nodes we mean nodes where most of the data
points belong to a single class. It is usually an advantage to favour pure nodes in
the greedy procedure that we use to grow the tree, since this can lead to fewer splits in total.
6 This is not always true, for example for imbalanced and asymmetric classification problems; see
Section 4.5.
Figure 2.10: Three splitting criteria for classification trees (the misclassification rate, the Gini index, and the entropy) as a function of the proportion of the first class 𝑟 = 𝜋ℓ1 in a certain region 𝑅ℓ, as given in (2.8). The entropy criterion has been scaled such that it passes through (0.5, 0.5).
Both the entropy criterion and the Gini index favour node purity more than
the misclassification rate does.
This advantage can also be illustrated in Example 2.6. Consider the first split in
this example. If we were to use the misclassification rate as the splitting criterion,
both the split 𝑥 2 < 5.0 and the split 𝑥 2 < 3.0 would provide a total misclassification
rate of 0.2. However, the split at 𝑥 2 < 3.0, which the entropy criterion favoured,
provides a pure node 𝑅1 . If we instead went with the split 𝑥2 < 5.0, the misclassification
rate after the second split would still be 0.2. If we continued to grow the tree until no
data points were misclassified, we would need three splits if we used the entropy
criterion, whereas we would need five splits if we used the misclassification criterion
and started with the split at 𝑥 2 < 5.0.
To generalise this discussion, consider a problem with two classes, where we
denote the proportion of the first class as 𝜋ℓ1 = 𝑟 and hence the proportion of the
second class as 𝜋ℓ2 = 1 − 𝑟. The three criteria (2.7) can then be expressed in terms
of 𝑟 as

𝑄ℓ = 1 − max(𝑟, 1 − 𝑟)   (misclassification rate),
𝑄ℓ = 2𝑟 (1 − 𝑟)   (Gini index),   (2.8)
𝑄ℓ = −𝑟 ln 𝑟 − (1 − 𝑟) ln(1 − 𝑟)   (entropy).
These functions are shown in Figure 2.10. All three criteria are similar in the
sense that they give zero loss if all data points in the region belong to a single class
(𝑟 = 0 or 𝑟 = 1) and maximum loss if the data points are equally divided between the two classes (𝑟 = 0.5).
However, the Gini index and entropy have a higher loss for all other proportions.
In other words, the gain of having a pure node (𝑟 close to 0 or 1) is higher for the
Gini index and the entropy than for the misclassification rate. As a consequence,
the Gini index and the entropy both tend to favour making one of the two nodes
pure (or close to pure), since that provides a smaller total loss, which combines well
with the greedy nature of the recursive binary splitting.
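As a small numerical check (our own sketch), the two-class expressions in (2.8) can be evaluated on a grid of 𝑟 values; the scaling of the entropy in the last line is the one used in Figure 2.10:

    import numpy as np

    def two_class_criteria(r):
        """The three splitting criteria in (2.8) as functions of the proportion r of the first class."""
        r = np.asarray(r, dtype=float)
        misclass = 1.0 - np.maximum(r, 1.0 - r)
        gini = 2.0 * r * (1.0 - r)
        # Use the convention 0*ln(0) = 0 at the endpoints r = 0 and r = 1.
        with np.errstate(divide="ignore", invalid="ignore"):
            entropy = -np.where(r > 0, r * np.log(r), 0.0) \
                      - np.where(r < 1, (1.0 - r) * np.log(1.0 - r), 0.0)
        return misclass, gini, entropy

    r = np.linspace(0.0, 1.0, 101)
    misclass, gini, entropy = two_class_criteria(r)
    # All three curves are zero at r = 0 and r = 1 and maximal at r = 0.5; scaling the
    # entropy by 0.5/ln(2) makes it pass through (0.5, 0.5), as in Figure 2.10.
    entropy_scaled = entropy * 0.5 / np.log(2.0)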
Figure 2.11: Decision trees applied to the music classification Example 2.1 (a and b) and the car stopping distance Example 2.2 (c and d). (a) Decision boundaries for the music classification problem for a fully grown classification tree learned with the Gini index; this model overfits the data. (b) The same problem and data as in (a), for which a tree restricted to depth 4 has been learned, again using the Gini index; this model will hopefully make better predictions for new data. (c) The prediction for a fully grown regression tree; as for the classification problem above, this model overfits to the training data. (d) The same problem and data as in (c), for which a tree restricted to depth 3 has been learned.
The depth of a decision tree (the maximum distance between the root node and any
leaf node) has a big impact on the final predictions. The tree depth impacts the
predictions in a somewhat similar way to the hyperparameter 𝑘 in 𝑘-NN. We again
use the music classification and car stopping distance problems from Examples 2.1
and 2.2 to study how the decision boundaries change depending on the depth of the
trees. In Figure 2.11, the decision boundaries are illustrated for two different trees.
In Figure 2.11a and c, we have not restricted the depth of the tree and have grown it
until each region contains only data points with the same output value – a so-called
fully grown tree. In Figure 2.11b and d, the maximum depth is restricted to 4 and 3,
respectively.
Similarly to choosing 𝑘 = 1 in 𝑘-NN, for a fully grown tree, all training data
points will, by construction, be correctly predicted since each region only contains
data points with the same output. As a result, for the music classification problem,
we get thin and small regions adapted to single training data points, and for the car
stopping distance problem, we get a very irregular line passing exactly through the
observations. Even though these trees give excellent performance on the training
data, they are not likely to be the best models for new, as yet unseen data. As we
discussed previously in the context of 𝑘-NN, we refer to this as overfitting.
In decision trees, we can mitigate overfitting by using shallower trees. Conse-
quently, we get fewer and larger regions with an increased averaging effect, resulting
in decision boundaries that are less adapted to the noise in the training data. This is
illustrated in Figure 2.11b and d for the two example problems. As for 𝑘 in 𝑘-NN,
the optimal size of the tree depends on many properties of the problem, and it is
a trade-off between flexibility and rigidity. Similar trade-offs have to be made for
almost all methods presented in this book, and they will be discussed systematically
in Chapter 4.
How can the user control the growth of the tree? There are different
strategies. The most straightforward one is to adjust the stopping criterion, that
is, the condition that must be fulfilled for not proceeding with further splits in a
certain node. As mentioned earlier, this criterion could be that we do not attempt
further splits if there are fewer than a certain number of training data points in the
corresponding region, or, as in Figure 2.11, we can stop splitting when we reach a
certain depth. Another strategy to control the depth is to use pruning. In pruning,
we start with a fully grown tree, and then in a second post-processing step, prune it
back to a smaller one. We will, however, not discuss pruning further here.
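As a practical aside (a hedged sketch using scikit-learn, which is not used in the book), both strategies for controlling the size of the tree map onto hyperparameters of standard implementations:

    from sklearn.tree import DecisionTreeClassifier

    # Strategy 1: stop the growth early via a stopping criterion.
    shallow_tree = DecisionTreeClassifier(
        criterion="gini",     # splitting criterion; "entropy" is also available
        max_depth=4,          # maximum depth, as in Figure 2.11b
        min_samples_leaf=5,   # minimum number of training data points in each leaf
    )

    # Strategy 2: grow a large tree and prune it back in a post-processing step
    # (cost-complexity pruning, controlled by ccp_alpha). With hypothetical
    # training data X_train, y_train:
    # path = DecisionTreeClassifier().cost_complexity_pruning_path(X_train, y_train)
    # pruned_tree = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[1]).fit(X_train, y_train)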