
MACHINE LEARNING

A First Course for Engineers and Scientists

Andreas Lindholm, Niklas Wahlström,


Fredrik Lindsten, Thomas B. Schön

This version: July 8, 2022

This material is published by Cambridge University Press.


Printed copies can be ordered via
http://www.cambridge.org.
This pre-publication version is free to view and download for
personal use only. Not for re-distribution, re-sale or use in
derivative works. © The authors, 2022.
Contents

Acknowledgements ix

Notation xi

1 Introduction 1
1.1 Machine Learning Exemplified . . . . . . . . . . . . . . . . . . . 2
1.2 About This Book . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2 Supervised Learning: A First Approach 13


2.1 Supervised Machine Learning . . . . . . . . . . . . . . . . . . . 13
2.2 A Distance-Based Method: k-NN . . . . . . . . . . . . . . . . . 19
2.3 A Rule-Based Method: Decision Trees . . . . . . . . . . . . . . . 25
2.4 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3 Basic Parametric Models and a Statistical Perspective on Learning 37


3.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Classification and Logistic Regression . . . . . . . . . . . . . . . 45
3.3 Polynomial Regression and Regularisation . . . . . . . . . . . . . 54
3.4 Generalised Linear Models . . . . . . . . . . . . . . . . . . . . . 57
3.5 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.A Derivation of the Normal Equations . . . . . . . . . . . . . . . . 60

4 Understanding, Evaluating, and Improving Performance 63


4.1 Expected New Data Error 𝐸 new : Performance in Production . . . . 63
4.2 Estimating 𝐸 new . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3 The Training Error–Generalisation Gap Decomposition of 𝐸 new . . 71
4.4 The Bias–Variance Decomposition of 𝐸 new . . . . . . . . . . . . 79
4.5 Additional Tools for Evaluating Binary Classifiers . . . . . . . . . 86
4.6 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5 Learning Parametric Models 91


5.1 Principles of Parametric Modelling . . . . . . . . . . . . . . . . . 91
5.2 Loss Functions and Likelihood-Based Models . . . . . . . . . . . 96
5.3 Regularisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.4 Parameter Optimisation . . . . . . . . . . . . . . . . . . . . . . . 112
5.5 Optimisation with Large Datasets . . . . . . . . . . . . . . . . . . 124

5.6 Hyperparameter Optimisation . . . . . . . . . . . . . . . . . . . 129


5.7 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

6 Neural Networks and Deep Learning 133


6.1 The Neural Network Model . . . . . . . . . . . . . . . . . . . . . 133
6.2 Training a Neural Network . . . . . . . . . . . . . . . . . . . . . 140
6.3 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . 147
6.4 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.5 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.A Derivation of the Backpropagation Equations . . . . . . . . . . . 160

7 Ensemble Methods: Bagging and Boosting 163


7.1 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
7.2 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
7.3 Boosting and AdaBoost . . . . . . . . . . . . . . . . . . . . . . . 174
7.4 Gradient Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . 182
7.5 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

8 Non-linear Input Transformations and Kernels 189


8.1 Creating Features by Non-linear Input Transformations . . . . . . 189
8.2 Kernel Ridge Regression . . . . . . . . . . . . . . . . . . . . . . 192
8.3 Support Vector Regression . . . . . . . . . . . . . . . . . . . . . 197
8.4 Kernel Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
8.5 Support Vector Classification . . . . . . . . . . . . . . . . . . . . 208
8.6 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
8.A The Representer Theorem . . . . . . . . . . . . . . . . . . . . . . 213
8.B Derivation of Support Vector Classification . . . . . . . . . . . . 214

9 The Bayesian Approach and Gaussian Processes 217


9.1 The Bayesian Idea . . . . . . . . . . . . . . . . . . . . . . . . . . 217
9.2 Bayesian Linear Regression . . . . . . . . . . . . . . . . . . . . . 220
9.3 The Gaussian Process . . . . . . . . . . . . . . . . . . . . . . . . 226
9.4 Practical Aspects of the Gaussian Process . . . . . . . . . . . . . 237
9.5 Other Bayesian Methods in Machine Learning . . . . . . . . . . . 242
9.6 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
9.A The Multivariate Gaussian Distribution . . . . . . . . . . . . . . 243

10 Generative Models and Learning from Unlabelled Data 247


10.1 The Gaussian Mixture Model and Discriminant Analysis . . . . . 248
10.2 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
10.3 Deep Generative Models . . . . . . . . . . . . . . . . . . . . . . 268
10.4 Representation Learning and Dimensionality Reduction . . . . . . 275
10.5 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 285

11 User Aspects of Machine Learning 287


11.1 Defining the Machine Learning Problem . . . . . . . . . . . . . . 287
11.2 Improving a Machine Learning Model . . . . . . . . . . . . . . . 291
11.3 What If We Cannot Collect More Data? . . . . . . . . . . . . . . 299
11.4 Practical Data Issues . . . . . . . . . . . . . . . . . . . . . . . . 303
11.5 Can I Trust my Machine Learning Model? . . . . . . . . . . . . . 307
11.6 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 308

12 Ethics in Machine Learning 309


12.1 Fairness and Error Functions . . . . . . . . . . . . . . . . . . . . 309
12.2 Misleading Claims about Performance . . . . . . . . . . . . . . . 314
12.3 Limitations of Training Data . . . . . . . . . . . . . . . . . . . . 322
12.4 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 326

Bibliography 327

Index 335

Acknowledgements
Many people have helped us throughout the writing of this book. First of all, we
want to mention David Sumpter, who, in addition to giving feedback from using the
material for teaching, contributed the entire Chapter 12 on ethical aspects. We have
also received valuable feedback from many students and other teacher colleagues.
We are, of course, very grateful for each and every comment we have received; in
particular, we want to mention David Widmann, Adrian Wills, Johannes Hendricks,
Mattias Villani, Dmitrijs Kass, and Joel Oskarsson. We have also received useful
feedback on the technical content of the book, including the practical insights in
Chapter 11, from Agrin Hilmkil (at Peltarion), Salla Franzén and Alla Tarighati
(at SEB), Lawrence Murray (at Uber), James Hensman and Alexis Boukouvalas
(at Secondmind), Joel Kronander and Nand Dalal (at Nines), and Peter Lindskog
and Jacob Roll (at Arriver). We also received valuable comments from Arno
Solin on Chapters 8 and 9, and from Joakim Lindblad on Chapter 6. Several people
helped us with the figures illustrating the examples in Chapter 1, namely Antônio
Ribeiro (Figure 1.1), Fredrik K. Gustafsson (Figure 1.4), and Theodoros Damoulas
(Figure 1.5). Thank you all for your help!
During the writing of this book, we enjoyed financial support from AI Competence
for Sweden, the Swedish Research Council (projects: 2016-04278, 2016-06079,
2017-03807, 2020-04122), the Swedish Foundation for Strategic Research (projects:
ICA16-0015, RIT12-0012), the Wallenberg AI, Autonomous Systems and Software
Program (WASP) funded by the Knut and Alice Wallenberg Foundation, ELLIIT,
and the Kjell och Märta Beijer Foundation.
Finally, we are thankful to Lauren Cowles at Cambridge University Press for
helpful advice and guidance through the publishing process and to Chris Cartwright
for careful and helpful copyediting.

Notation

Symbol Meaning
General mathematics
𝑏 a scalar
b a vector
B a matrix
ᵀ transpose
sign(𝑥) the sign operator; +1 if 𝑥 > 0, −1 if 𝑥 < 0
∇ del operator; ∇ 𝑓 is the gradient of 𝑓
‖b‖₂ Euclidean norm of b
‖b‖₁ taxicab norm of b
𝑝(𝑧) probability density (if 𝑧 is a continuous random variable)
or probability mass (if 𝑧 is a discrete random variable)
𝑝(𝑧|𝑥) the probability density (or mass) for 𝑧 conditioned on 𝑥
N (𝑧; 𝑚, 𝜎²) the normal probability distribution for the random variable
𝑧 with mean 𝑚 and variance 𝜎²

The supervised learning problem


x input
𝑦 output
x★ test input
𝑦★ test output
ŷ(x★) a prediction of 𝑦★
𝜀 noise
𝑛 number of data points in training data
T training data {x𝑖 , 𝑦𝑖 }_{𝑖=1}^{𝑛}

𝐿 loss function
𝐽 cost function

Supervised methods
𝜽 parameters to be learned from training data
𝑔(x) model of 𝑝(𝑦 | x) (most classification methods)
𝜆 regularisation parameter
𝜙 link function (generalised linear models)
ℎ activation function (neural networks)

W weight matrix (neural networks)


b offset vector (neural networks)
𝛾 learning rate
𝐵 number of members in an ensemble method
𝜅 kernel
𝝓 nonlinear feature transformation (kernel methods)
𝑑 dimension of 𝝓; number of features (kernel methods)

Evaluation of supervised methods


𝐸 error function
𝐸 new new data error
𝐸 train training data error
𝐸 𝑘-fold estimate of 𝐸 new from 𝑘-fold cross validation
𝐸 hold-out estimate of 𝐸 new from hold-out validation data

1 Introduction

Machine learning is about learning, reasoning, and acting based on data. This
is done by constructing computer programs that process the data, extract useful
information, make predictions regarding unknown properties, and suggest actions to
take or decisions to make. What turns data analysis into machine learning is that the
process is automated and that the computer program is learnt from data. This means
that generic computer programs are used, which are adapted to application-specific
circumstances by automatically adjusting the settings of the program based on
observed, so-called training data. It can therefore be said that machine learning
is a way of programming by example. The beauty of machine learning is that it is
quite arbitrary what the data represents, and we can design general methods that are
useful for a wide range of practical applications in different domains. We illustrate
this via a range of examples below.
The ‘generic computer program’ referred to above corresponds to a mathematical
model of the data. That is, when we develop and describe different machine
learning methods, we do this using the language of mathematics. The mathematical
model describes a relationship between the quantities involved, or variables, that
correspond to the observed data and the properties of interest (such as predictions,
actions, etc.). Hence, the model is a compact representation of the data that, in a
precise mathematical form, captures the key properties of the phenomenon we are
studying. Which model to make use of is typically guided by the machine learning
engineer’s insights generated when looking at the available data and the practitioner’s
general understanding of the problem. When implementing the method in practice,
this mathematical model is translated into code that can be executed on a computer.
However, to understand what the computer program actually does, it is important
also to understand the underlying mathematics.
As mentioned above, the model (or computer program) is learnt based on the
available training data. This is accomplished by using a learning algorithm which
is capable of automatically adjusting the settings, or parameters, of the model to
agree with the data. In summary, the three cornerstones of machine learning are:

1. The data 2. The mathematical model 3. The learning algorithm.

In this introductory chapter, we will give a taste of the machine learning problem
by illustrating these cornerstones with a few examples. They come from different
application domains and have different properties, but nevertheless, they can all
be addressed using similar techniques from machine learning. We also give some

advice on how to proceed through the rest of the book and, at the end, provide
references to good books on machine learning for the interested reader who wants
to dig further into this topic.

1.1 Machine Learning Exemplified


Machine learning is a multifaceted subject. We gave a brief and high-level description
of what it entails above, but this will become much more concrete as we proceed
throughout this book and introduce specific methods and techniques for solving
various machine learning problems. However, before digging into the details, we
will try to give an intuitive answer to the question ‘What is machine learning?’, by
discussing a few application examples of where it can be (and has been) used.
We start with an example related to medicine, more precisely cardiology.

Example 1.1 Automatically diagnosing heart abnormalities

The leading cause of death globally is conditions that affect the heart and blood
vessels, collectively referred to as cardiovascular diseases. Heart problems often
influence the electrical activity of the heart, which can be measured using electrodes
attached to the body. The electrical signals are reported in an electrocardiogram
(ECG). In Figure 1.1 we show examples of (parts of) the measured signals from
three different hearts. The measurements stem from a healthy heart (top), a heart
suffering from atrial fibrillation (middle), and a heart suffering from right bundle
branch block (bottom). Atrial fibrillation makes the heart beat without rhythm,
making it hard for the heart to pump blood in a normal way. Right bundle branch
block corresponds to a delay or blockage in the electrical pathways of the heart.

[Figure 1.1: ECG signals from a healthy heart (top), a heart with atrial fibrillation (middle), and a heart with right bundle branch block (bottom).]

By analysing the ECG signal, a cardiologist gains valuable information about the condition of the heart, which can be used to diagnose the patient and plan the treatment.

To improve the diagnostic accuracy, as well as to save time for the cardiologists,
we can ask ourselves if this process can be automated to some extent. That is, can
we construct a computer program which reads in the ECG signals, analyses the data,
and returns a prediction regarding the normality or abnormality of the heart? Such
models, capable of accurately interpreting an ECG examination in an automated
fashion, will find applications globally, but the needs are most acute in low- and
middle-income countries. An important reason for this is that the populations of these countries often do not have easy and direct access to highly skilled cardiologists
capable of accurately carrying out ECG diagnoses. Furthermore, cardiovascular
diseases in these countries are linked to more than 75% of deaths.
The key challenge in building such a computer program is that it is far from
obvious which computations are needed to turn the raw ECG signal into a prediction
about the heart condition. Even if an experienced cardiologist were to try to explain
to a software developer which patterns in the data to look for, translating the
cardiologist’s experience into a reliable computer program would be extremely
challenging.
To tackle this difficulty, the machine learning approach is to instead teach the
computer program through examples. Specifically, instead of asking the cardiologist
to specify a set of rules for how to classify an ECG signal as normal or abnormal,
we simply ask the cardiologist (or a group of cardiologists) to label a large number
of recorded ECG signals with labels corresponding to the underlying heart
condition. This is a much easier (albeit possibly tedious) way for the cardiologists
to communicate their experience and encode it in a way that is interpretable by a
computer.
The task of the learning algorithm is then to automatically adapt the computer
program so that its predictions agree with the cardiologists’ labels on the labelled
training data. The hope is that, if it succeeds on the training data (where we already
know the answer), then it should be possible to use the predictions made by the
program on previously unseen data (where we do not know the answer) as well.
This is the approach taken by Ribeiro et al. (2020), who developed a machine
learning model for ECG prediction. In their study, the training data consists of
more than 2 300 000 ECG records from almost 1 700 000 different patients from the
state of Minas Gerais in Brazil. More specifically, each ECG corresponds to 12
time series (one from each of the 12 electrodes that were used in conducting the
exam), each with a duration of between 7 and 10 seconds, sampled at frequencies ranging
from 300 Hz to 600 Hz. These ECGs can be used to provide a full evaluation of
the electrical activity of the heart; indeed, the ECG is the most commonly used test for evaluating the heart. Importantly, each ECG in the dataset also comes with a
label sorting it into different classes – no abnormalities, atrial fibrillation, right
bundle branch block, etc. – according to the status of the heart. Based on this data,
a machine learning model is trained to automatically classify a new ECG recording
without requiring a human doctor to be involved. The model used is a deep neural
network, more specifically a so-called residual network, which is commonly used
for images. The researchers adapted this to work for the ECG signals of relevance
for this study. In Chapter 6, we introduce deep learning models and their training
algorithms.
Evaluating how a model like this will perform in practice is not straightforward.
The approach taken in this study was to ask three different cardiologists with experience in electrocardiography to examine and classify 827 ECG recordings from distinct patients. This dataset was then evaluated by the algorithm, two 4th
year cardiology residents, two 3rd year emergency residents, and two 5th year
medical students. The average performance was then compared. The result was that the algorithm performed as well as or better than the humans at classifying six types of abnormalities.

Before we move on, let us pause and reflect on the example introduced above. In
fact, many concepts that are central to machine learning can be recognised in this
example.
As we mentioned above, the first cornerstone of machine learning is the data.
Taking a closer look at what the data actually is, we note that it comes in different
forms. First, we have the training data which is used to train the model. Each
training data point consists of both the ECG signal, which we refer to as the input,
and its label corresponding to the type of heart condition seen in this signal, which
we refer to as the output. To train the model, we need access to both the inputs
and the outputs, where the latter had to be manually assigned by domain experts
(or possibly some auxiliary examination). Training a model from labelled data
points is therefore referred to as supervised learning. We think of the learning
as being supervised by the domain expert, and the learning objective is to obtain
a computer program that can mimic the labelling done by the expert. Second,
we have the (unlabelled) ECG signals that will be fed to the program when it is
used ‘in production’. It is important to remember that the ultimate goal of the
model is to obtain accurate predictions in this second phase. We say that the
predictions made by the model must generalise beyond the training data. How to
train models that are capable of generalising, and how to evaluate to what extent
they do so, is a central theoretical question studied throughout this book (see in
particular Chapter 4).
We illustrate the training of the ECG prediction model in Figure 1.2. The general
structure of the training procedure is, however, the same (or at least very similar)
for all supervised machine learning problems.
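
To make this procedure concrete, here is a minimal, hypothetical Python sketch of the train-then-predict pattern (the book itself contains no code). The randomly generated 'ECG features', the 0/1 labels, and the k-nearest-neighbour classifier are placeholders chosen only to show the two phases in Figure 1.2; they are not the residual network or data of the study above.

```python
# A minimal sketch of the supervised learning workflow: train on labelled data,
# then predict on unseen data. All data and the model choice are placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Training phase: inputs X_train with expert-assigned labels y_train.
n, p = 200, 5                            # n labelled ECGs, p made-up features each
X_train = rng.normal(size=(n, p))        # stand-in for features extracted from ECG signals
y_train = rng.integers(0, 2, size=n)     # stand-in labels: 0 = normal, 1 = abnormal

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)              # the learning algorithm adapts the model to the data

# 'In production': the trained model predicts labels for new, unseen inputs.
X_unseen = rng.normal(size=(3, p))
print(model.predict(X_unseen))           # predicted class for each unseen input
```
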
Another key concept that we encountered in the ECG example is the notion of a
classification problem. Classification is a supervised machine learning task which
amounts to predicting a certain class, or label, for each data point. Specifically, for
classification problems, there are only a finite number of possible output values.
In the ECG example, the classes correspond to the type of heart condition. For
instance, the classes could be ‘normal’ or ‘abnormal’, in which case we refer to
it as a binary classification problem (only two possible classes). More generally,
we could design a model for classifying each signal as either ‘normal’, or assign it
to one of a predetermined set of abnormalities. We then face a (more ambitious)
multi-class classification problem.
Classification is, however, not the only application of supervised machine learning
that we will encounter. Specifically, we will also study another type of problem
referred to as regression problems.

[Figure 1.2 diagram: labelled training data (labels such as healthy, atrial fibrillation, RBBB) are fed to the learning algorithm, which updates the model; the trained model is then used to predict labels for unseen data.]
Figure 1.2: Illustrating the supervised machine learning process with training to the left and
then the use of the trained model to the right. Left: Values for the unknown parameters of
the model are set by the learning algorithm such that the model best describes the available
training data. Right: The learned model is used on new, previously unseen data, where we
hope to obtain a correct classification. It is thus essential that the model is able to generalise
to new data that is not present in the training data.
Regression differs from classification in that the output (that is, the quantity that we want the model to predict) is a numerical value.
We illustrate with an example from material science.

Example 1.2 Formation energy of crystals

Much of our technological development is driven by the discovery of new materials with unique properties. Indeed, technologies such as touch screens and batteries for
electric vehicles have emerged due to advances in materials science. Traditionally,
materials discovery was largely done through experiments, but this is both time
consuming and costly, which limited the number of new materials that could be
found. Over the past few decades, computational methods have therefore played an
increasingly important role. The basic idea behind computational materials science
is to screen a very large number of hypothetical materials, predict various properties
of interest by computational methods, and then attempt to experimentally synthesise
the most promising candidates.
Crystalline solids (or, simply, crystals) are a central type of inorganic material.
In a crystal, the atoms are arranged in a highly ordered microscopic structure.
Hence, to understand the properties of such a material, it is not enough to know the
proportion of each element in the material, but we also need to know how these
elements (or atoms) are arranged into a crystal. A basic property of interest when
considering a hypothetical material is therefore the formation energy of the crystal.
The formation energy can be thought of as the energy that nature needs to spend to
form the crystal from the individual elements. Nature strives to find a minimum
energy configuration. Hence, if a certain crystal structure is predicted to have a
formation energy that is significantly larger than alternative crystals composed of
the same elements, then it is unlikely that it can be synthesised in a stable way in
practice.
A classical method (going back to the 1960s) that can be used for computing the
formation energy is so-called density functional theory (DFT). The DFT method,
which is based on quantum mechanical modelling, paved the way for the first breakthrough in computational materials science, enabling high-throughput screening for materials discovery. That being said, the DFT method is computationally very
expensive, and even with modern supercomputers, only a small fraction of all
potentially interesting materials have been analysed.
To handle this limitation, there has been much recent interest in using machine
learning for materials discovery, with the potential to result in a second computational
revolution. By training a machine learning model to, for instance, predict the
formation energy – but in a fraction of the computational time required by DFT – a
much larger range of candidate materials can be investigated.
As a concrete example, Faber et al. (2016) used a machine learning method
referred to as kernel ridge regression (see Chapter 8) to predict the formation energy
of around 2 million so-called elpasolite crystals. The machine learning model is a
computer program which takes a candidate crystal as input (essentially, a description
of the positions and elemental types of the atoms in the crystal) and is asked to
return a prediction of the formation energy. To train the model, 10 000 crystals
were randomly selected, and their formation energies were computed using DFT.
The model was then trained to predict formation energies to agree as closely as
possible with the DFT output on the training set. Once trained, the model was used
to predict the energy on the remaining ∼99.5% of the potential elpasolites. Among
these, 128 new crystal structures were found to have a favourable energy, thereby
being potentially stable in nature.
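
As a sketch of this kind of workflow (not the actual descriptors, kernel choice, or settings of Faber et al. (2016)), kernel ridge regression is available off the shelf in, for example, scikit-learn. The crystal descriptors and 'DFT energies' below are random placeholders.

```python
# Illustrative only: kernel ridge regression trained on crystals labelled by "DFT",
# then used to predict the formation energy of new candidates. All numbers are fake.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(1)

n_train, d = 1000, 10                    # the study used 10 000 crystals; scaled down here
X_dft = rng.normal(size=(n_train, d))    # stand-in crystal descriptors
e_dft = rng.normal(size=n_train)         # stand-in DFT formation energies (training labels)

model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.1)
model.fit(X_dft, e_dft)                  # fit to agree as closely as possible with the DFT output

X_candidates = rng.normal(size=(5, d))   # descriptors of not-yet-screened candidate crystals
print(model.predict(X_candidates))       # cheap predicted formation energies
```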

Comparing the two examples discussed above, we can make a few interesting
observations. As already pointed out, one difference is that the ECG model is asked
to predict a certain class (say, normal or abnormal), whereas the materials discovery
model is asked to predict a numerical value (the formation energy of a crystal).
These are the two main types of prediction problems that we will study in this book,
referred to as classification and regression, respectively. While conceptually similar,
we often use slight variations of the underpinning mathematical models, depending
on the problem type. It is therefore instructive to treat them separately.
Both types are supervised learning problems, though. That is, we train a predictive
model to mimic the predictions made by a ‘supervisor’. However, it is interesting to
note that the supervision is not necessarily done by a human domain expert. Indeed,
for the formation energy model, the training data was obtained by running automated
(but costly) density functional theory computations. In other situations, we might
obtain the output values naturally when collecting the training data. For instance,
assume that you want to build a model for predicting the outcome of a soccer match
based on data about the players in the two teams. This is a classification problem
(the output is ‘win’, ‘lose’, or ‘tie’), but the training data does not have to be manually
labelled, since we get the labels directly from historical matches. Similarly, if you
want to build a regression model for predicting the price of an apartment based
on its size, location, condition, etc., then the output (the price) is obtained directly
from historical sales.
Finally, it is worth noting that, although the examples discussed above correspond
to very different application domains, the problems are quite similar from a machine
learning perspective. Indeed, the general procedure outlined in Figure 1.2 is also applicable, with minor modifications, to the materials discovery problem. This generality and versatility of the machine learning methodology is one of its main
strengths and beauties.
In this book, we will make use of statistics and probability theory to describe
the models used for making predictions. Using probabilistic models allows us to
systematically represent and cope with the uncertainty in the predictions. In the
examples above, it is perhaps not obvious why this is needed. It could (perhaps) be
argued that there is a ‘correct answer’ both in the ECG problem and the formation
energy problem. Therefore, we might expect that the machine learning model
should be able to provide a definite answer in its prediction. However, even in
situations when there is a correct answer, machine learning models rely on various
assumptions, and they are trained from data using computational learning algorithms.
With probabilistic models, we are able to represent the uncertainty in the model’s
predictions, whether it originates from the data, the modelling assumptions, or the
computation. Furthermore, in many applications of machine learning, the output is
uncertain in itself, and there is no such thing as a definite answer. To highlight the
need for probabilistic predictions, let us consider an example from sports analytics.

Example 1.3 Probability of scoring a goal in soccer

Soccer is a sport where a great deal of data has been collected on how individual
players act throughout a match, how teams collaborate, how they perform over time,
etc. All this data is used to better understand the game and to help players reach
their full potential.
Consider the problem of predicting whether or not a shot results in a goal. To this
end, we will use a rather simple model, where the prediction is based only on the
player’s position on the field when taking the shot. Specifically, the input is given by
the distance from the goal and the angle between two lines drawn from the player’s
position to the goal posts; see Figure 1.3. The output corresponds to whether or not
the shot results in a goal, meaning that this is a binary classification problem.
[Figure 1.3: Left: the shot angle 𝝋 between the lines drawn from the player's position to the goal posts. Right: a heat map of the frequency of goals (scale 0.0–1.0) as a function of position.]
Clearly, knowing the player’s position is not enough to say definitively whether the shot
will be successful. Still, it is reasonable to assume that it provides some information
about the chance of scoring a goal. Indeed, a shot close to the goal line with a large
angle is intuitively more likely to result in a goal than one made from a position
close to the sideline. To acknowledge this fact when constructing a machine learning
model, we will not ask the model to predict the outcome of the shot but rather to
predict the probability of a goal. This is accomplished by using a probabilistic
model which is trained by maximising the total probability of the observed training
data with respect to the probabilistic predictions. For instance, using a so-called
logistic regression model (see Chapter 3) we obtain a predicted probability of
scoring a goal from any position, illustrated using a heat map in the right panel in
Figure 1.3.
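
Below is a minimal sketch of such a probabilistic classifier: a logistic regression on the two inputs (distance and angle), fitted by maximising the likelihood of the observed outcomes. The shot data is synthetic and the coefficients used to generate it are invented; a real model would be trained on recorded shots.

```python
# Sketch: logistic regression predicting the probability of a goal from the
# distance and angle of the shot. The data below is synthetic, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

n = 500
distance = rng.uniform(1, 40, size=n)    # metres from the goal (made up)
angle = rng.uniform(0.05, 1.5, size=n)   # angle between the lines to the goal posts (radians)
X = np.column_stack([distance, angle])

# Synthetic outcomes: goals more likely close to the goal and at large angles.
p_goal = 1 / (1 + np.exp(0.15 * distance - 2.0 * angle))
y = rng.binomial(1, p_goal)              # 1 = goal, 0 = no goal

model = LogisticRegression()
model.fit(X, y)                          # fitted by maximising the (log-)likelihood

shot = np.array([[11.0, 0.6]])           # a hypothetical shot position
print(model.predict_proba(shot)[0, 1])   # predicted probability of a goal
```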

The supervised learning problems mentioned above were categorised as either classification or regression problems, depending on the type of output. These
problem categories are the most common and typical instances of supervised
machine learning, and they will constitute the foundation for most methods discussed
in this book. However, machine learning is in fact much more general and can be
used to build complex predictive models that do not naturally fit into either the
classification or the regression category. To whet the appetite for further exploration
of the field of machine learning, we provide two such examples below. These
examples go beyond the specific problem formulations that we explicitly study in
this book, but they nevertheless build on the same core methodology.
In the first of these two examples, we illustrate a computer vision capability,
namely how to classify each individual pixel of an image into a class describing the
object that the pixel belongs to. This has important applications in, for example,
autonomous driving and medical imaging. When compared to the earlier examples,
this introduces an additional level of complexity, in that the model needs to be able
to handle spatial dependencies across the image in its classifications.

Example 1.4 Pixel-wise class prediction

When it comes to machine vision, an important capability is to be able to associate each pixel in an image with a corresponding class; see Figure 1.4 for an illustration
in an autonomous driving application. This is referred to as semantic segmentation.
In autonomous driving, it is used to separate cars, road, pedestrians, etc. The output
is then used as input to other algorithms, for instance for collision avoidance. When
it comes to medical imaging, semantic segmentation is used, for instance, to tell
apart different organs and tumors.
To train a semantic segmentation model, the training data consist of a large number
of images (inputs). For each such image, there is a corresponding output image of
the same size, where each pixel has been labelled by hand to belong to a certain class.
The supervised machine learning problem then amounts to using this data to find a
mapping that is capable of taking a new, unseen image and producing a corresponding output in the form of a predicted class for each pixel. Essentially, this is a type of classification problem, but all pixels need to be classified simultaneously while respecting the spatial dependencies across the image to result in a coherent segmentation.

[Figure 1.4: pixel-wise class predictions (semantic segmentation) in an autonomous driving application.]

The bottom part of Figure 1.4 shows the prediction generated by such an algorithm,
where the aim is to classify each pixel as either car (blue), traffic sign (yellow),
pavement (purple), or tree (green). The best performing solutions for this task today
rely on cleverly crafted deep neural networks (see Chapter 6).
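
At the level of array shapes, the prediction step can be sketched as turning per-pixel class probabilities into one class label per pixel. The 'model output' below is random noise standing in for a trained segmentation network; only the shapes are the point.

```python
# Sketch: pixel-wise classification at the level of array shapes. A segmentation
# model outputs, for every pixel, a probability over M classes; the predicted
# segmentation picks the most probable class per pixel. The "model output" here
# is random, standing in for a trained deep neural network.
import numpy as np

rng = np.random.default_rng(3)
h, w, M = 120, 160, 4                                 # image size and number of classes (made up)
class_names = ["car", "traffic sign", "pavement", "tree"]

scores = rng.random((h, w, M))
probs = scores / scores.sum(axis=-1, keepdims=True)   # per-pixel class probabilities

segmentation = probs.argmax(axis=-1)                  # shape (h, w): one class index per pixel
print(segmentation.shape, class_names[segmentation[0, 0]])
```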

In the final example, we raise the bar even higher, since here the model needs
to be able to explain dependencies not only over space, but also over time, in a
so-called spatio-temporal problem. These problems are finding more and more
applications as we get access to more and more data. More precisely, we look into
the problem of how to build probabilistic models capable of better estimating and
forecasting air pollution across time and space in a city, in this case London.

Example 1.5 Estimating air pollution levels across London

Roughly 91% of the world’s population lives in places where the air quality levels are
worse than those recommended by the World Health Organization. Recent estimates
indicate that 4.2 million people die each year from stroke, heart disease, lung cancer,
and chronic respiratory diseases caused by ambient air pollution.
A natural first step in dealing with this problem is to develop technology to
measure and aggregate information about the air pollution levels across time and
space. Such information enables the development of machine learning models to
better estimate and accurately forecast air pollution, which in turn permits suitable
interventions. The work that we feature here sets out to do this for the city of London,
where more than 9 000 people die prematurely every year as a result of air pollution.
Air quality sensors are now – as opposed to the situation in the recent past –
available at relatively low cost. This, combined with an increasing awareness of the problem, has caused interested companies, individuals, non-profit organisations, and community groups to contribute by setting up sensors and making the data
available. More specifically, the data in this example comes from a sensor network
of ground sensors providing hourly readings of NO2 and hourly satellite data at
a spatial resolution of 7 km × 7 km. The resulting supervised machine learning
problem is to build a model that can deliver forecasts of the air pollution level
across time and space. Since the output – pollution level – is a continuous
variable, this is a type of regression problem. The particularly challenging aspect
here is that the measurements are reported at different spatial resolutions and on
varying timescales.
The technical challenge in this problem amounts to merging the information
from many sensors of different kinds reporting their measurements on different
spatial scales, sometimes referred to as a multi-sensor multi-resolution problem.
Besides the problem under consideration here, problems of this kind find many
different applications. The basis for the solution providing the estimates exemplified
in Figure 1.5 is the Gaussian process (see Chapter 9).

[Figure 1.5: spatio-temporal estimates and forecasts of NO2 levels across London.]

Figure 1.5 illustrates the output from the Gaussian process model in terms of
spatio-temporal estimation and forecasting of NO2 levels in London. To the left,
we have the situation on 19 February 2019 at 11:00 using observations from both
ground sensors providing hourly readings of NO2 and from satellite data. To the
right, we have the situation on 19 February 2019 at 17:00 using only the satellite data.
The Gaussian process is a non-parametric and probabilistic model for nonlinear
functions. Non-parametric means that it does not rely on any particular parametric
functional form to be postulated. The fact that it is a probabilistic model means that
it is capable of representing and manipulating uncertainty in a systematic way.
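
As a toy illustration of this model class (not the multi-sensor, multi-resolution model of the study above), the following sketch fits a Gaussian process to noisy one-dimensional data and returns both a predictive mean and a predictive standard deviation.

```python
# Sketch: Gaussian process regression on toy 1-D data, giving a predictive mean
# and a predictive standard deviation. Not the London air pollution model.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(4)
X_train = rng.uniform(0, 10, size=(30, 1))                    # e.g. time of day (made up)
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=30)   # noisy toy signal

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X_train, y_train)

X_test = np.linspace(0, 10, 5).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)   # probabilistic prediction: mean and uncertainty
print(mean, std)
```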

1.2 About This Book

The aim of this book is to convey the spirit of supervised machine learning,
without requiring any previous experience in the field. We focus on the underlying
mathematics as well as the practical aspects. This book is a textbook; it is not
a reference work or a programming manual. It therefore contains only a careful (yet comprehensive) selection of supervised machine learning methods and no programming code. There are by now many well-written and well-documented
code packages available, and it is our firm belief that with a good understanding of
the mathematics and the inner workings of the methods, the reader will be able to
make the connection between this book and his/her favorite code package in his/her
favorite programming language.
We take a statistical perspective in this book, meaning that we discuss and
motivate methods in terms of their statistical properties. It therefore requires some
previous knowledge in statistics and probability theory, as well as calculus and
linear algebra. We hope that reading the book from start to end will give the reader
a good starting point for working as a machine learning engineer and/or pursuing
further studies within the subject.
The book is written such that it can be read back to back. There are, however,
multiple possible paths through the book that are more selective depending on the
interest of the reader. Figure 1.6 illustrates the major dependencies between the
chapters. In particular, the most fundamental topics are discussed in Chapters 2, 3,
and 4, and we do recommend the reader to read those chapters before proceeding
to the later chapters that contain technically more advanced topics (Chapters 5–9).
Chapter 10 goes beyond the supervised setting of machine learning, and Chapter 11
focuses on some of the more practical aspects of designing a successful machine
learning solution and has a less technical nature than the preceding chapters. Finally,
Chapter 12 (written by David Sumpter) discusses certain ethical aspects of modern
machine learning.
[Figure 1.6 diagram: Fundamental chapters: 2 (Supervised Learning: A First Approach), 3 (Basic Parametric Models and a Statistical Perspective on Learning), and 4 (Understanding, Evaluating, and Improving Performance). Advanced chapters: 5 (Learning Parametric Models), 6 (Neural Networks and Deep Learning), 7 (Ensemble Methods: Bagging and Boosting), 8 (Non-linear Input Transformations and Kernels), and 9 (The Bayesian Approach and Gaussian Processes). Special chapters: 10 (Generative Models and Learning from Unlabelled Data), 11 (User Aspects of Machine Learning), and 12 (Ethics in Machine Learning).]

Figure 1.6: The structure of this book, illustrated by blocks (chapters) and arrows (recom-
mended order in which to read the chapters). We do recommend everyone to read (or at
least skim) the fundamental material in Chapters 2, 3, and 4 first. The path through the
technically more advanced Chapters 5–9 can be chosen to match the particular interest of the reader. For Chapters 10, 11, and 12, we recommend reading the fundamental chapters
first.

1.3 Further Reading


There are by now quite a few extensive textbooks available on the topic of machine
learning, which introduce the area in different ways compared to how we do so in this
book. We will only mention a few here. The book of Hastie et al. (2009) introduces
the area of statistical machine learning in a mathematically solid and accessible
manner. A few years later, the authors released a different version of their book
(James et al. 2013), which is mathematically significantly lighter, conveying the main
ideas in an even more accessible manner. These books do not venture long either
into the world of Bayesian methods or the world of neural networks. However, there
are several complementary books that do exactly that – see e.g. Bishop (2006) and
Murphy (2021). MacKay (2003) provides a rather early account drawing interesting
and useful connections to information theory. It is still very much worth looking
into. The book by Shalev-Shwartz and Ben-David (2014) provides an introduction
with a clear focus on the underpinning theoretical constructions, connecting very
deep questions – such as ‘what is learning?’ and ‘how can a machine learn?’ – with
mathematics. It is a perfect book for those of our readers who would like to deepen
their understanding of the theoretical background of that area. We also mention the
work of Efron and Hastie (2016), where the authors take a constructive historical
approach to the development of the area, covering the revolution in data analysis
that emerged with computers. Contemporary introductions to the mathematics of
machine learning are provided by Strang (2019) and Deisenroth et al. (2019).
For a full account of the work on automatic diagnosis of heart abnormalities, see
Ribeiro et al. (2020), and for a general introduction to the use of machine learning –
in particular deep learning – in medicine, we point the reader to Topol (2019). The
application of kernel ridge regression to elpasolite crystals was borrowed from
Faber et al. (2016). Other applications of machine learning in materials science are
reviewed in the collection edited by Schütt et al. (2020). The London air pollution
study was published by Hamelijnck et al. (2019), where the authors introduce
interesting and useful developments of the Gaussian process model that we explain
in Chapter 9. When it comes to semantic segmentation, the ground-breaking work
of Long et al. (2015) has received massive interest. The two main bases for the
current development in semantic segmentation are Zhao et al. (2017) and L.-C. Chen
et al. (2017). A thorough introduction to the mathematics of soccer is provided in
the book by D. Sumpter (2016), and a starting point to recent ideas on how to assess
the impact of player actions is given in Decroos et al. (2019).

2 Supervised Learning: A First Approach

In this chapter, we will introduce the supervised machine learning problem as well as
two basic machine learning methods for solving it. The methods we will introduce
are called 𝑘-nearest neighbours and decision trees. These two methods are relatively
simple, and we will derive them on intuitive grounds. Still, these methods are useful
in their own right and are therefore a good place to start. Understanding their inner
workings, advantages, and shortcomings also lays a good foundation for the more
advanced methods that are to come in later chapters.

2.1 Supervised Machine Learning


In supervised machine learning, we have some training data that contains examples
of how some input¹ variable x relates to an output² variable 𝑦. By using some
mathematical model or method, which we adapt to the training data, our goal is to
predict the output 𝑦 for a new, previously unseen, set of test data for which only x is
known. We usually say that we learn (or train) a model from the training data, and
that process involves some computations implemented on a computer.

Learning from Labelled Data

In most interesting supervised machine learning applications, the relationship between input x and output 𝑦 is difficult to describe explicitly. It may be too
cumbersome or complicated to fully unravel from application domain knowledge,
or even unknown. The problem can therefore usually not be solved by writing a
traditional computer program that takes x as input and returns 𝑦 as output from
a set of rules. The supervised machine learning approach is instead to learn the
relationship between x and 𝑦 from data, which contains examples of observed pairs
of input and output values. In other words, supervised machine learning amounts to
learning from examples.
The data used for learning is called training data, and it has to consist of several
input–output data points (samples) (x𝑖 , 𝑦 𝑖 ), in total 𝑛 of them. We will compactly

¹ The input is commonly also called feature, attribute, predictor, regressor, covariate, explanatory variable, controlled variable, and independent variable.
² The output is commonly also called response, regressand, label, explained variable, predicted variable, or dependent variable.

write the training data as T = {x𝑖 , 𝑦𝑖 }_{𝑖=1}^{𝑛}. Each data point in the training data provides a snapshot of how 𝑦 depends on x, and the goal in supervised machine learning is to squeeze as much information as possible out of T. In this book, we
will only consider problems where the individual data points are assumed to be
(probabilistically) independent. This excludes, for example, applications in time
series analysis, where it is of interest to model the correlation between x𝑖 and x𝑖+1 .
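
In code, such a training set is commonly stored as an n × p input matrix together with a length-n output vector; a minimal sketch with made-up numbers:

```python
# Sketch: a training set T = {x_i, y_i}, i = 1, ..., n, stored as arrays.
import numpy as np

n, p = 4, 3
X = np.array([[0.1,  1.2, -0.3],    # x_1
              [2.0,  0.7,  0.5],    # x_2
              [1.1, -0.4,  0.9],    # x_3
              [0.3,  0.8, -1.2]])   # x_4: each row is one p-dimensional input
y = np.array([1.5, 0.2, -0.7, 0.9]) # one output per data point

assert X.shape == (n, p) and y.shape == (n,)
```
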
The fact that the training data contains not only input values x𝑖 but also output
values 𝑦 𝑖 is the reason for the term ‘supervised’ machine learning. We may say
that each input x𝑖 is accompanied by a label 𝑦 𝑖 , or simply that we have labelled
data. For some applications, it is only a matter of jointly recording x and 𝑦. In
other applications, the output 𝑦 has to be created by labelling of the training data
inputs x by a domain expert. For instance, to construct a training dataset for the
cardiovascular disease application introduced in Chapter 1, a cardiologist needs to
look at all training data inputs (ECG signals) x𝑖 and label them by assigning to the variable 𝑦𝑖 the heart condition that is seen in the signal. The entire
learning process is thus ‘supervised’ by the domain expert.
We use a vector boldface notation x to denote the input, since we assume it to
be a 𝑝-dimensional vector, x = [𝑥 1 𝑥 2 · · · 𝑥 𝑝 ] T , where T denotes the transpose.
Each element of the input vector x represents some information that is considered
to be relevant for the application at hand, for example the outdoor temperature or
the unemployment rate. In many applications, the number of inputs 𝑝 is large, or
put differently, the input x is a high-dimensional vector. For instance, in a computer
vision application where the input is a greyscale image, x can be all pixel values in
the image, so 𝑝 = ℎ × 𝑤 where ℎ and 𝑤 denote the height and width of the input
image.³ The output 𝑦, on the other hand, is often of low dimension, and throughout
most of this book, we will assume that it is a scalar value. The type of the output
value, numerical or categorical, turns out to be important and is used to distinguish
between two subtypes of the supervised machine learning problems: regression and
classification. We will discuss this next.
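
The greyscale-image case above can be made concrete with a small NumPy sketch; the image here is random noise, used only to show the dimensions.

```python
# Sketch: a greyscale image as an input vector x of dimension p = h * w.
import numpy as np

rng = np.random.default_rng(5)
h, w = 28, 32
image = rng.random((h, w))   # stand-in greyscale image, pixel values in [0, 1)

x = image.reshape(-1)        # flatten into a p-dimensional input vector
p = x.shape[0]
print(p == h * w)            # True: p = h * w
```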

Numerical and Categorical Variables

The variables contained in our data (input as well as output) can be of two different
types: numerical or categorical. A numerical variable has a natural ordering.
We can say that one instance of a numerical variable is larger or smaller than
another instance of the same variable. A numerical variable could, for instance,
be represented by a continuous real number, but it could also be discrete, such
as an integer. Categorical variables, on the other hand, are always discrete, and

³ For image-based problems it is often more convenient to represent the input as a matrix of size
ℎ × 𝑤 than as a vector of length 𝑝 = ℎ𝑤, but the dimension is nevertheless the same. We will get
back to this in Chapter 6 when discussing the convolutional neural network, a model structure
tailored to image-type inputs.

Table 2.1: Examples of numerical and categorical variables.


Variable type                               Example                               Handled as
Number (continuous)                         32.23 km/h, 12.50 km/h, 42.85 km/h    Numerical
Number (discrete) with natural ordering     0 children, 1 child, 2 children       Numerical
Number (discrete) without natural ordering  1 = Sweden, 2 = Denmark, 3 = Norway   Categorical
Text string                                 Hello, Goodbye, Welcome               Categorical

importantly, they lack a natural ordering. In this book we assume that any categorical
variable can take only a finite number of different values. A few examples are given
in Table 2.1 above.
The distinction between numerical and categorical is sometimes somewhat
arbitrary. We could, for instance, argue that having no children is qualitatively
different from having children, and use the categorical variable ‘children: yes/no’
instead of the numerical ‘0, 1 or 2 children’. It is therefore a decision for the machine
learning engineer whether a certain variable is to be considered as numerical or
categorical.
The notion of categorical vs. numerical applies to both the output variable 𝑦 and
to the 𝑝 elements 𝑥 𝑗 of the input vector x = [𝑥 1 𝑥 2 · · · 𝑥 𝑝 ] T . All 𝑝 input variables
do not have to be of the same type. It is perfectly fine (and common in practice) to
have a mix of categorical and numerical inputs.
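
Since categorical inputs lack a natural ordering, they are usually encoded numerically before being combined with numerical inputs; one common choice is one-hot encoding. A small sketch with made-up data:

```python
# Sketch: combining a numerical input with a one-hot encoded categorical input.
# The data is made up; one-hot encoding is one common choice, not the only one.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

speed = np.array([[32.23], [12.50], [42.85]])              # numerical input (km/h)
country = np.array([["Sweden"], ["Denmark"], ["Norway"]])  # categorical input

encoder = OneHotEncoder()
country_onehot = encoder.fit_transform(country).toarray()  # one binary column per category

X = np.hstack([speed, country_onehot])                     # combined n x p input matrix
print(X)
```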

Classification and Regression


We distinguish between different supervised machine learning problems by the type
of the output 𝑦.

Regression means that the output is numerical, and classification means that the
output is categorical.

The reason for this distinction is that the regression and classification problems have
somewhat different properties, and different methods are used for solving them.
Note that the 𝑝 input variables x = [𝑥 1 𝑥 2 · · · 𝑥 𝑝 ] T can be either numerical or
categorical for both regression and classification problems. It is only the type of the
output that determines whether a problem is a regression or a classification problem.
A method for solving a classification problem is called a classifier.
For classification, the output is categorical and can therefore only take val-
ues in a finite set. We use 𝑀 to denote the number of elements in the set of
possible output values. It could, for instance, be {false, true} (𝑀 = 2) or
{Sweden, Norway, Finland, Denmark} (𝑀 = 4). We will refer to these elements
as classes or labels. The number of classes 𝑀 is assumed to be known in the
classification problem. To prepare for a concise mathematical notation, we use


integers 1, 2, . . . , 𝑀 to denote the output classes if 𝑀 > 2. The ordering of the
integers is arbitrary and does not imply any ordering of the classes. When there are
only 𝑀 = 2 classes, we have the important special case of binary classification. In
binary classification we use the labels −1 and 1 (instead of 1 and 2). Occasionally
we will also use the equivalent terms negative and positive class. The only reason for
using a different convention for binary classification is that it gives a more compact
mathematical notation for some of the methods, and it carries no deeper meaning.
Let us now have a look at a classification and a regression problem, both of which
will be used throughout this book.

Example 2.1 Classifying songs

Say that we want to build a ‘song categoriser’ app, where the user records a song,
and the app answers by reporting whether the song has the artistic style of either the
Beatles, Kiss, or Bob Dylan. At the heart of this fictitious app, there has to be a
mechanism that takes an audio recording as an input and returns an artist’s name.
If we first collect some recordings with songs from the three groups/artists
(where we know which artist is behind each song: a labelled dataset), we could use
supervised machine learning to learn the characteristics of their different styles and
therefrom predict the artist of the new user-provided song. In supervised machine
learning terminology, the artist name (the Beatles, Kiss, or Bob Dylan) is the
output 𝑦. In this problem, 𝑦 is categorical, and we are hence facing a classification
problem.
One of the important design choices for a machine learning engineer is a detailed
specification of what the input x really is. It would in principle be possible to consider
the raw audio information as input, but that would give a very high-dimensional x
which (unless an audio-specific machine learning method is used) would most likely
require an unrealistically large amount of training data in order to be successful (we
will discuss this aspect in detail in Chapter 4). A better option could therefore be to
define some summary statistics of audio recordings and use those so-called features
as input x instead. As input features, we could, for example, use the length of the
audio recording and the ‘perceived energy’ of the song. The length of a recording
is easy to measure. Since it can differ quite a lot between different songs, we take
the logarithm of the actual length (in seconds) to get values in the same range for all
songs. Such feature transformations are commonly used in practice to make the
input data more homogeneous.
The energy of a songa is a bit more tricky, and the exact definition may even
be ambiguous. However, we leave that to the audio experts and re-use a piece
of software that they have written for this purposeb without bothering too much
about its inner workings. As long as this piece of software returns a number for
any recording that is fed to it, and always returns the same number for the same
recording, we can use it as an input to a machine learning method.
In Figure 2.1 we have plotted a dataset with 230 songs from the three artists. Each
song is represented by a dot, where the horizontal axis is the logarithm of its length
(measured in seconds) and the vertical axis the energy (on a scale 0–1). When
we later return to this example and apply different supervised machine learning
methods to it, this data will be the training data.


Figure 2.1: The training data for the music classification example. Each of the 230 songs is shown as a dot, with the logarithm of its length (ln s) on the horizontal axis and its energy (scale 0–1) on the vertical axis, coloured by artist (the Beatles, Kiss, Bob Dylan). Three individual songs are marked: 'Rock and roll all nite', 'Help!', and 'A hard rain's a-gonna fall'.

a We use this term to refer to the perceived musical energy, not the signal energy in a strict
sense.
b Specifically, we use http://api.spotify.com/ here.

Example 2.2 Car stopping distances

Ezekiel and Fox (1959) present a dataset with 62 observations of the distance needed
for various cars at different initial speeds to brake to a full stop.a The dataset has
the following two variables:
- Speed: The speed of the car when the brake signal is given.
- Distance: The distance traveled after the signal is given until the car has
reached a full stop.

Figure 2.2: The car stopping distance data: the distance (feet) required to come to a full stop, plotted against the speed (mph) at which the brake signal was given, for the 62 observations.

To make a supervised machine learning problem out of this, we interpret Speed as


the input variable 𝑥 and Distance as the output variable 𝑦, as shown in Figure 2.2.
Note that we use a non-bold symbol for the input here since it is a scalar value and
not a vector of inputs in this example. Since 𝑦 is numerical, this is a regression


problem. We then ask ourselves what the stopping distance would be if the initial
speed were, for example, 33 mph or 45 mph, respectively (two speeds at which
no data has been recorded). Another way to frame this question is to ask for the
prediction ŷ(𝑥★) for 𝑥★ = 33 and 𝑥★ = 45.
a The data is somewhat dated, so the conclusions are perhaps not applicable to modern cars.

Generalising Beyond Training Data


There are two primary reasons why it can be of interest to mathematically model
the input–output relationships from training data.

(i) To reason about and explore how input and output variables are connected.
An often-encountered task in sciences such as medicine and sociology is
to determine whether a correlation between a pair of variables exists or not
(‘does eating seafood increase life expectancy?’). Such questions can be
addressed by learning a mathematical model and carefully reasoning about
the likelihood that the learned relationships between input x and output 𝑦 are
due only to random effects in the data or if there appears to be some substance
to the proposed relationships.

(ii) To predict the output value 𝑦★ for some new, previously unseen input x★.
By using some mathematical method which generalises the input–output
examples seen in the training data, we can make a prediction ŷ(x★) for a
previously unseen test input x★. The hat on ŷ indicates that the prediction is an
estimate of the output.

These two objectives are sometimes used to roughly distinguish between classical
statistics, focusing more on objective (i), and machine learning, where objective
(ii) is more central. However, this is not a clear-cut distinction since predictive
modelling is a topic in classical statistics too, and explainable models are also
studied in machine learning. The primary focus in this book, however, is on making
predictions, objective (ii) above, which is the foundation of supervised machine
learning. Our overall goal is to obtain as accurate predictions b 𝑦 (x★) as possible
(measured in some appropriate way) for a wide range of possible test inputs x★. We
say that we are interested in methods that generalise well beyond the training data.
A method that generalises well for the music example above would be able to
correctly tell the artist of a new song which was not in the training data (assuming
that the artist of the new song is one of the three that was present in the training
data, of course). The ability to generalise to new data is a key concept of machine
learning. It is not difficult to construct models or methods that give very accurate
predictions if they are only evaluated on the training data (we will see an example
in the next section). However, if the model is not able to generalise, meaning that
the predictions are poor when the model is applied to new test data points, then


the model is of little use in practice for making predictions. If this is the case, we
say that the model is overfitting to the training data. We will illustrate the issue
of overfitting for a specific machine learning model in the next section, and in
Chapter 4 we will return to this concept using a more general and mathematical
approach.

2.2 A Distance-Based Method: k-NN


It is now time to encounter our first actual machine learning method. We will start
with the relatively simple 𝑘-nearest neighbours (𝑘-NN) method, which can be used
for both regression and classification. Remember that the setting is that we have
access to training data {x𝑖 , 𝑦𝑖 }𝑛𝑖=1, which consists of 𝑛 data points with input x𝑖 and
corresponding output 𝑦𝑖 . From this we want to construct a prediction ŷ(x★) for what
we believe the output 𝑦★ would be for a new x★, which we have not seen previously.

The k-Nearest Neighbours Method


Most methods for supervised machine learning build on the intuition that if the test
data point x★ is close to training data point x𝑖 , then the prediction b 𝑦 (x★) should be
close to 𝑦 𝑖 . This is a general idea, but one simple way to implement it in practice is
the following: first, compute the Euclidean distance4 between the test input and all
training inputs, kx𝑖 − x★ k 2 for 𝑖 = 1, . . . , 𝑛; second, find the data point x 𝑗 with the
shortest distance to x★, and use its output as the prediction, b 𝑦 (x★) = 𝑦 𝑗 .
This simple prediction method is referred to as the 1-nearest neighbour method.
It is not very complicated, but for most machine learning applications of interest it
is too simplistic. In practice we can rarely say for certain what the output value
𝑦 will be. Mathematically, we handle this by describing 𝑦 as a random variable.
That is, we consider the data as noisy, meaning that it is affected by random errors
referred to as noise. From this perspective, the shortcoming of 1-nearest neighbour
is that the prediction relies on only one data point from the training data, which
makes it quite ‘erratic’ and sensitive to noisy training data.
To improve the 1-nearest neighbour method, we can extend it to make use
of the 𝑘 nearest neighbours instead. Formally, we define the set N★ = {𝑖 : x𝑖
is one of the 𝑘 training data points closest to x★ } and aggregate the information
from the 𝑘 outputs 𝑦 𝑗 for 𝑗 ∈ N★ to make the prediction. For regression problems,
we take the average of all 𝑦 𝑗 for 𝑗 ∈ N★, and for classification problems, we
use a majority vote.5 We illustrate the 𝑘-nearest neighbours (𝑘-NN) method by
Example 2.3 and summarise it in Method 2.1.

4 The Euclidean distance between a test point x★ and a training data point x𝑖 is ‖x𝑖 − x★‖2 =
√((𝑥𝑖1 − 𝑥★1)² + (𝑥𝑖2 − 𝑥★2)²). Other distance functions can also be used and will be discussed in
Chapter 8. Categorical input variables can be handled, as we will discuss in Chapter 3.
5 Ties can be handled in different ways, for instance by a coin-flip, or by reporting the actual vote
count to the end user, who gets to decide what to do with it.


Methods that explicitly use the training data when making predictions are referred
to as nonparametric, and the 𝑘-NN method is one example of this. This is in contrast
with parametric methods, where the prediction is given by some function (a model)
governed by a fixed number of parameters. For parametric methods, the training
data is used to learn the parameters in an initial training phase, but once the model
has been learned, the training data can be discarded since it is not used explicitly
when making predictions. We will introduce parametric modelling in Chapter 3.

Data: Training data {x𝑖 , 𝑦𝑖 }𝑛𝑖=1 and test input x★
Result: Predicted test output ŷ(x★)
1 Compute the distances ‖x𝑖 − x★‖2 for all training data points 𝑖 = 1, . . . , 𝑛
2 Let N★ = {𝑖 : x𝑖 is one of the 𝑘 data points closest to x★}
3 Compute the prediction ŷ(x★) as
      ŷ(x★) = Average{𝑦𝑗 : 𝑗 ∈ N★}          (Regression problems)
      ŷ(x★) = MajorityVote{𝑦𝑗 : 𝑗 ∈ N★}     (Classification problems)

Method 2.1: 𝑘-nearest neighbour, 𝑘-NN

Example 2.3 Predicting colours with 𝑘-NN

We consider a synthetic binary classification problem (𝑀 = 2). We are given a


training dataset with 𝑛 = 6 observations of 𝑝 = 2 input variables 𝑥1 , 𝑥2 and one
categorical output 𝑦, the colour Red or Blue,
𝑖 𝑥1 𝑥2 𝑦
1 −1 3 Red
2 2 1 Blue
3 −2 2 Red
4 −1 2 Blue
5 −1 0 Blue
6 1 1 Red

and we are interested in predicting the output for x★ = [1 2] T . For this purpose we
will explore two different 𝑘-NN classifiers, one using 𝑘 = 1 and one using 𝑘 = 3.
First, we compute the Euclidean distance kx𝑖 − x★ k 2 between each training data
point x𝑖 (red and blue dots) and the test data point x★ (black dot), and then sort
them in ascending order.
Since the closest training data point to x★ is the data point 𝑖 = 6 (Red), this
means that for 𝑘-NN with 𝑘 = 1, we get the prediction b 𝑦 (x★) = Red. For 𝑘 = 3,
the three nearest neighbours are 𝑖 = 6 (Red), 𝑖 = 2 (Blue), and 𝑖 = 4 (Blue).
Taking a majority vote among these three training data points, Blue wins with 2
votes against 1, so our prediction becomes b 𝑦 (x★) = Blue. In Figure 2.3, 𝑘 = 1 is


represented by the inner circle and 𝑘 = 3 by the outer circle.

𝑖     ‖x𝑖 − x★‖2     𝑦𝑖
6     √1             Red
2     √2             Blue
4     √4             Blue
1     √5             Red
5     √8             Blue
3     √9             Red

Figure 2.3: The training data points (red and blue dots) and the test input x★ (black dot) in the (𝑥1, 𝑥2) plane, together with the distances sorted in ascending order (table). The inner circle encloses the 𝑘 = 1 nearest neighbour and the outer circle the 𝑘 = 3 nearest neighbours.
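To make Method 2.1 concrete, here is a minimal sketch of the procedure in Python with NumPy (the code and function name are our own illustration, not part of any established library), applied to the data in Example 2.3:

import numpy as np
from collections import Counter

# Training data from Example 2.3: inputs x = [x1, x2] and colour labels y.
X_train = np.array([[-1, 3], [2, 1], [-2, 2], [-1, 2], [-1, 0], [1, 1]], dtype=float)
y_train = np.array(["Red", "Blue", "Red", "Blue", "Blue", "Red"])

def knn_classify(X, y, x_star, k):
    """Predict the label of x_star by a majority vote among its k nearest neighbours."""
    dist = np.linalg.norm(X - x_star, axis=1)   # Euclidean distances ||x_i - x_star||_2
    neighbours = np.argsort(dist)[:k]           # indices of the k closest training points
    votes = Counter(y[neighbours])              # count the labels among these neighbours
    return votes.most_common(1)[0][0]           # majority class (ties would need a rule, cf. footnote 5)

x_star = np.array([1.0, 2.0])
print(knn_classify(X_train, y_train, x_star, k=1))   # Red  (only i = 6 is used)
print(knn_classify(X_train, y_train, x_star, k=3))   # Blue (majority among i = 6, 2, and 4)

For a regression problem, the majority vote on the last line of the function would simply be replaced by the average of the 𝑘 neighbouring outputs.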

Decision Boundaries for a Classifier


In Example 2.3 we only computed a prediction for one single test data point x★.
That prediction might indeed be the ultimate goal of the application, but in order to
visualise and better understand a classifier, we can also study its decision boundary,
which illustrates the prediction for all possible test inputs. We introduce the decision
boundary using Example 2.4. It is a general concept for classifiers, not only 𝑘-NN,
but it is only possible to visualise easily when the dimension of x is 𝑝 = 2.

Example 2.4 Decision boundaries for the colour example

In Example 2.3 we computed the prediction for x★ = [1 2] T . If we were to shift that


test point by one unit to the left at x★alt = [0 2] T , the three closest training data points
would still include 𝑖 = 6 and 𝑖 = 4, but now 𝑖 = 2 is exchanged for 𝑖 = 1. For 𝑘 = 3
this would give two votes for Red and one vote for Blue, and we would therefore
predict b𝑦 = Red. In between these two test data points x★ and x★alt , at [0.5 2] T , it
is equally far to 𝑖 = 1 as to 𝑖 = 2, and it is undecided if the 3-NN classifier should
predict Red or Blue. (In practice this is most often not a problem, since the test
data points rarely end up exactly at the decision boundary. If they do, this can be
handled by a coin-flip.) For all classifiers, we always end up with points in the input
space where the class prediction abruptly changes from one class to another. These
points are said to be on the decision boundary of the classifier.
Continuing in a similar way, changing the location of the test input across the
entire input space and recording the class prediction, we can compute the complete
decision boundaries for Example 2.3. We plot the decision boundaries for 𝑘 = 1
and 𝑘 = 3 in Figure 2.4.
In Figure 2.4 the decision boundaries are the points in input space where the class
prediction changes, that is, the borders between red and blue. This type of figure
gives a concise summary of a classifier. However, it is only possible to draw such a
plot in the simple case when the problem has a 2-dimensional input x. As we can
see, the decision boundaries of 𝑘-NN are not linear. In the terminology we will


introduce later, 𝑘-NN is thereby a non-linear classifier.


Figure 2.4: The decision boundaries of the 𝑘-NN classifier for the colour example, for 𝑘 = 1 (left) and 𝑘 = 3 (right). The red and blue regions show the prediction ŷ over the (𝑥1, 𝑥2) input space, and the borders between them are the decision boundaries.
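In practice, a plot like Figure 2.4 can be produced by evaluating the classifier on a dense grid of test inputs and recording where the prediction changes. The following sketch uses the third-party scikit-learn library purely for illustration (the choice of software is ours, not something prescribed here):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[-1, 3], [2, 1], [-2, 2], [-1, 2], [-1, 0], [1, 1]], dtype=float)
y_train = np.array(["Red", "Blue", "Red", "Blue", "Blue", "Red"])

# A dense grid of test inputs covering the plotted part of the input space.
x1_grid, x2_grid = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-1, 4, 200))
grid = np.c_[x1_grid.ravel(), x2_grid.ravel()]

for k in (1, 3):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    pred = clf.predict(grid).reshape(x1_grid.shape)
    # The decision boundary is wherever neighbouring grid cells receive different
    # predictions; plotting `pred` as coloured regions reproduces panels like Figure 2.4.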

Choosing k
The number of neighbours 𝑘 that are considered when making a prediction with
𝑘-NN is an important choice the user has to make. Since 𝑘 is not learned by
𝑘-NN itself, but is a design choice left to the user, we refer to it as a hyperparameter.
Throughout the book, we will use the term ‘hyperparameter’ for similar tuning
parameters for other methods.
The choice of the hyperparameter 𝑘 has a big impact on the predictions made by
𝑘-NN. To understand the impact of 𝑘, we study how the decision boundary changes
as 𝑘 changes in Figure 2.5, where 𝑘-NN is applied to the music classification
Example 2.1 and the car stopping distance Example 2.2, both with 𝑘 = 1 and 𝑘 = 20.
With 𝑘 = 1, all training data points will, by construction, be correctly predicted,
and the model is adapted to the exact x and 𝑦 values of the training data. In the
classification problem there are, for instance, small green (Bob Dylan) regions
within the red (the Beatles) area that are most likely misleading when it comes to
accurately predicting the artist of a new song. In order to make good predictions, it
would probably be better to instead predict red (the Beatles) for a new song in the
entire middle-left region since the vast majority of training data points in that area
are red. For the regression problem, 𝑘 = 1 gives quite shaky behaviour, and also for
this problem, it is intuitively clear that this does not describe an actual effect, but
rather that the prediction is adapting to the noise in the data.
The drawbacks of using 𝑘 = 1 are not specific to these two examples. In most
real world problems there is a certain amount of randomness in the data, or at
least insufficient information, which can be thought of as a random effect. In the
music example, the 𝑛 = 230 songs were selected from all songs ever recorded
by these artists, and since we do not know how this selection was made, we may
consider it random. Furthermore, and more importantly, if we want our classifier to
generalise to completely new data, like new releases from the artists in our example
(overlooking the obvious complication for now), then it is not reasonable to assume
that the length and energy of a song will give a complete picture of the artistic
styles. Hence, even with the best possible model, there is some ambiguity about


which artist has recorded a song if we only look at these two input variables. This
ambiguity is modelled as random noise. Also for the car stopping distance, there
appears to be a certain amount of randomness, not only in 𝑥 but also in 𝑦. By
using 𝑘 = 1 and thereby adapting very closely to the training data, the predictions
will depend not only on the interesting patterns in the problem but also on the (more
or less) random effects that have shaped the training data. Typically we are not
interested in capturing these effects, and we refer to this as overfitting.

Figure 2.5: 𝑘-NN applied to the music classification Example 2.1 (a and b) and the car stopping distance Example 2.2 (c and d). For both problems 𝑘-NN is applied with 𝑘 = 1 and 𝑘 = 20.
(a) Decision boundaries for the music classification problem using 𝑘 = 1. This is a typical example of overfitting, meaning that the model has adapted too much to the training data so that it does not generalise well to new previously unseen data.
(b) The music classification problem again, now using 𝑘 = 20. A higher value of 𝑘 gives a smoother behaviour which, hopefully, predicts the artist of new songs more accurately.
(c) The black dots are the car stopping distance data, and the blue line shows the prediction for 𝑘-NN with 𝑘 = 1 for any 𝑥. As for the classification problem above, 𝑘-NN with 𝑘 = 1 overfits to the training data.
(d) The car stopping distance, this time with 𝑘 = 20. Except for the boundary effect at the right, this seems like a much more useful model which captures the interesting effects of the data and ignores the noise.


With the 𝑘-NN classifier, we can mitigate overfitting by increasing the region
of the neighbourhood used to compute the prediction, that is, increasing the
hyperparameter 𝑘. With, for example, 𝑘 = 20, the predictions are no longer based
only on the closest neighbour but are instead a majority vote among the 20 closest
neighbours. As a consequence, all training data points are no longer perfectly
classified, but some of the songs end up in the wrong region in Figure 2.5b. The
predictions are, however, less adapted to the peculiarities of the training data and
thereby less overfitted, and Figure 2.5b and d are indeed less ‘noisy’ than Figure 2.5a
and c. However, if we make 𝑘 too large, then the averaging effect will wash out
all interesting patterns in the data as well. Indeed, for sufficiently large 𝑘, the
neighbourhood will include all training data points, and the model will reduce to
predicting the mean of the data for any input.
Selecting 𝑘 is thus a trade-off between flexibility and rigidity. Since selecting
𝑘 either too big or too small leads to a meaningless classifier, there must
exist a sweet spot for some moderate 𝑘 (possibly 20, but it could be less or more)
where the classifier generalises best. Unfortunately, there is no general answer
as to which 𝑘 this happens for; it differs from problem to problem. In
the music classification problem, it seems reasonable that 𝑘 = 20 will predict new
test data points better than 𝑘 = 1, but there might very well be an even better
choice of 𝑘. For the car stopping problem, the behaviour is also more reasonable
for 𝑘 = 20 than 𝑘 = 1, except for the boundary effect for large 𝑥, where 𝑘-NN is
unable to capture the trend in the data as 𝑥 increases (simply because the 20 nearest
neighbours are the same for all test points 𝑥★ around and above 35). A systematic
way of choosing a good value for 𝑘 is to use cross-validation, which we will discuss
in Chapter 4.

Time to reflect 2.1 The prediction b 𝑦 (x★) obtained using the 𝑘-NN method is
a piecewise constant function of the input x★. For a classification problem,
this is natural, since the output is categorical (see, for example, Figure 2.5
where the coloured regions correspond to areas of the input space where
the prediction is constant according to the colour of that region). However,
𝑘-NN will also have piecewise constant predictions for regression problems.
Why?

Input Normalisation

A final important practical aspect when using 𝑘-NN is the importance of normal-
isation of the input data. Imagine a training dataset with 𝑝 = 2 input variables
x = [𝑥 1 𝑥 2 ] T where all values of 𝑥 1 are in the range [100, 1100] and the values
for 𝑥 2 are in the much smaller range [0, 1]. It could, for example, be that 𝑥 1 and
𝑥2 are measured in different units. The Euclidean distance between a test point
x★ and a training data point x𝑖 is ‖x𝑖 − x★‖2 = √((𝑥𝑖1 − 𝑥★1)² + (𝑥𝑖2 − 𝑥★2)²). This


expression will typically be dominated by the first term (𝑥 𝑖1 − 𝑥★1 ) 2 , whereas the
second term (𝑥 𝑖2 − 𝑥★2 ) 2 tends to have a much smaller effect, simply due to the
different magnitude of 𝑥 1 and 𝑥 2 . That is, the different ranges lead to 𝑥 1 being
considered much more important than 𝑥 2 by 𝑘-NN.
To avoid this undesired effect, we can re-scale the input variables. One option, in
the mentioned example, could be to subtract 100 from 𝑥1 and thereafter divide it by
1 000, creating 𝑥𝑖1^new = (𝑥𝑖1 − 100)/1 000 such that 𝑥1^new and 𝑥2 both are in the range [0, 1].
More generally, this normalisation procedure for the input data can be written as

    𝑥𝑖𝑗^new = (𝑥𝑖𝑗 − minℓ(𝑥ℓ𝑗)) / (maxℓ(𝑥ℓ𝑗) − minℓ(𝑥ℓ𝑗)),    for all 𝑗 = 1, . . . , 𝑝, 𝑖 = 1, . . . , 𝑛.    (2.1)

Another common normalisation approach (sometimes called standardising) is by
using the mean and standard deviation in the training data:

    𝑥𝑖𝑗^new = (𝑥𝑖𝑗 − 𝑥̄𝑗) / 𝜎𝑗,    for all 𝑗 = 1, . . . , 𝑝, 𝑖 = 1, . . . , 𝑛,    (2.2)

where 𝑥̄𝑗 and 𝜎𝑗 are the mean and standard deviation for each input variable, respectively.
It is crucial for 𝑘-NN to apply some type of input normalisation (as was indeed
done in Figure 2.5), but it is a good practice to apply this also when using other
methods, for numerical stability if nothing else. It is, however, important to compute
the scaling factors (minℓ (𝑥ℓ 𝑗 ), 𝑥¯ 𝑗 , etc.) using training data only and to also apply
that scaling to future test data points. Failing to do this, for example by performing
normalisation before setting test data aside (which we will discuss more in Chapter 4),
might lead to wrong conclusions on how well the method will perform in predicting
future (not yet seen) data points.
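As a concrete sketch of (2.1) and (2.2) in Python with NumPy (the function and variable names are our own), the scaling factors are computed from the training inputs only and then reused, unchanged, for any test input:

import numpy as np

def fit_minmax(X_train):
    # Per-column scaling factors for (2.1), computed from training data only.
    return X_train.min(axis=0), X_train.max(axis=0)

def apply_minmax(X, x_min, x_max):
    # Apply the training-data scaling to any (training or test) inputs.
    return (X - x_min) / (x_max - x_min)

def fit_standardise(X_train):
    # Per-column mean and standard deviation for (2.2), from training data only.
    return X_train.mean(axis=0), X_train.std(axis=0)

def apply_standardise(X, mean, std):
    return (X - mean) / std

# Example: x1 takes values around [100, 1100] while x2 lies in [0, 1].
X_train = np.array([[100.0, 0.1], [600.0, 0.5], [1100.0, 0.9]])
x_test = np.array([[350.0, 0.2]])
x_min, x_max = fit_minmax(X_train)
print(apply_minmax(x_test, x_min, x_max))   # scaled with the *training* min and max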

2.3 A Rule-Based Method: Decision Trees


The 𝑘-NN method results in a prediction b 𝑦 (x★) that is a piecewise constant function
of the input x★. That is, the method partitions the input space into disjoint regions,
and each region is associated with a certain (constant) prediction. For 𝑘-NN, these
regions are given implicitly by the 𝑘-neighbourhood of each possible test input. An
alternative approach, that we will study in this section, is to come up with a set of
rules that defines the regions explicitly. For instance, considering the music data in
Example 2.1, a simple set of high-level rules for constructing a classifier would be:
inputs to the right in Figure 2.1 are classified as green (Bob Dylan), in the left as
red (The Beatles), and in the upper part as blue (Kiss). We will now see how such
rules can be learned systematically from the training data.
The rule-based models that we consider here are referred to as decision trees.
The reason is that the rules used to define the model can be organised in a graph
structure referred to as a binary tree. The decision tree effectively divides the input
space into multiple disjoint regions, and in each region, a constant value is used for
the prediction b𝑦 (x★). We illustrate this with an example.


Example 2.5 Predicting colours with a decision tree

We consider a classification problem with two numerical input variables x = [𝑥1 𝑥2 ] T


and one categorical output 𝑦, the colour Red or Blue. For now, we do not consider
any training data or how to actually learn the tree but only how an already existing
decision tree can be used to predict b 𝑦 (x★).
The rules defining the model are organised in the graph in Figure 2.6, which
is referred to as a binary tree. To use this tree to predict a label for the test input
x★ = [𝑥★1 𝑥★2 ] T , we start at the top, referred to as the root node of the tree (in the
metaphor, the tree is growing upside down, with the root at the top and the leaves at
the bottom). If the condition stated at the root is true, that is, if 𝑥★2 < 3.0, then we
proceed down the left branch, otherwise along the right branch. If we reach a new
internal node of the tree, we check the rule associated with that node and pick the
left or the right branch accordingly. We continue and work our way down until we
reach the end of a branch, called a leaf node. Each such final node corresponds to a
constant prediction b 𝑦 𝑚 , in this case one of the two classes Red or Blue.

Figure 2.6: A classification tree. At each internal node, a rule of the form 𝑥𝑗 < 𝑠𝑘 indicates the left branch coming from that split, and the right branch then consequently corresponds to 𝑥𝑗 ≥ 𝑠𝑘. This tree has two internal nodes (including the root) and three leaf nodes: the root splits on 𝑥2 < 3.0, its left branch ends in the leaf ŷ1 = Blue, and its right branch splits on 𝑥1 < 5.0 into the leaves ŷ2 = Blue and ŷ3 = Red.

Figure 2.7: A region partition, where each region corresponds to a leaf node in the tree, and each border between regions corresponds to a split in the tree: 𝑅1 lies below 𝑥2 = 3.0, while 𝑅2 and 𝑅3 lie above it, to the left and right of 𝑥1 = 5.0, respectively. Each region is coloured with the prediction corresponding to that region, and the boundary between red and blue is therefore the decision boundary.

The decision tree partitions the input space into axis-aligned ‘boxes’, as shown in
Figure 2.7. By increasing the depth of the tree (the number of steps from the root to
the leaves), the partitioning can be made finer and finer and thereby describes more
complicated functions of the input variable.
Pseudo-code for predicting a test input with the tree in Figure 2.6 would look like:

if x_2 < 3.0 then
    return Blue
else
    if x_1 < 5.0 then
        return Blue
    else
        return Red
    end
end

As an example, if we have x★ = [2.5 3.5]T, in the first split we would take
the right branch since 𝑥★2 = 3.5 ≥ 3.0, and in the second split we would take
the left branch since 𝑥★1 = 2.5 < 5.0. The prediction for this test point would be
ŷ(x★) = Blue.
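The pseudo-code above translates directly into an executable function. The following Python sketch (our own illustration) encodes the tree from Figure 2.6 and reproduces the prediction for x★ = [2.5 3.5]T:

def predict_tree(x1, x2):
    # The tree in Figure 2.6: each internal node compares one input to a threshold.
    if x2 < 3.0:
        return "Blue"       # leaf corresponding to R1
    else:
        if x1 < 5.0:
            return "Blue"   # leaf corresponding to R2
        else:
            return "Red"    # leaf corresponding to R3

print(predict_tree(2.5, 3.5))   # Blue, as in the worked example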

To set the terminology, the endpoints of the branches, 𝑅1, 𝑅2, and 𝑅3 in Example 2.5,
are called leaf nodes, and the internal splits, 𝑥2 < 3.0 and 𝑥1 < 5.0, are known
as internal nodes. The lines that connect the nodes are referred to as branches.
The tree is referred to as binary since each internal node splits into exactly two
branches.
With more than two input variables, it is difficult to illustrate the partitioning
of the input space into regions (Figure 2.7), but the tree representation can still be
used in the very same way. Each internal node corresponds to a rule where one of
the 𝑝 input variables 𝑥 𝑗 , 𝑗 = 1, . . . , 𝑝, is compared to a threshold 𝑠. If 𝑥 𝑗 < 𝑠, we
continue along the left branch, and if 𝑥 𝑗 ≥ 𝑠, we continue along the right branch.
The constant predictions that we associate with the leaf nodes can be either
categorical (as in Example 2.5 above) or numerical. Decision trees can thus be used
to address both classification and regression problems.
Example 2.5 illustrated how a decision tree can be used to make a prediction. We
will now turn to the question of how a tree can be learned from training data.

Learning a Regression Tree

We will start by discussing how to learn (or, equivalently, train) a decision tree for a
regression problem. The classification problem is conceptually similar and will be
explained later.
As mentioned above, the prediction b 𝑦 (x★) from a regression tree is a piecewise
constant function of the input x★. We can write this mathematically as,

    ŷ(x★) = ∑_{ℓ=1}^{𝐿} ŷℓ I{x★ ∈ 𝑅ℓ},          (2.3)

where 𝐿 is the total number of regions (leaf nodes) in the tree, 𝑅ℓ is the ℓth region,
and b𝑦 ℓ is the constant prediction for the ℓth region. Note that in the regression
setting, b𝑦 ℓ is a numerical variable, and we will consider it to be a real number for
simplicity. In the equation above, we have used the indicator function, I{x ∈ 𝑅ℓ } = 1
if x ∈ 𝑅ℓ and I{x ∈ 𝑅ℓ } = 0 otherwise.
Learning the tree from data corresponds to finding suitable values for the
parameters defining the function (2.3), namely the regions 𝑅ℓ and the constant


predictions ŷℓ, ℓ = 1, . . . , 𝐿, as well as the total size of the tree 𝐿. If we start by
assuming that the shape of the tree, the partition (𝐿, {𝑅ℓ}_{ℓ=1}^{𝐿}), is known, then we
can compute the constants {ŷℓ}_{ℓ=1}^{𝐿} in a natural way, simply as the average of the
training data points falling in each region:

    ŷℓ = Average{𝑦𝑖 : x𝑖 ∈ 𝑅ℓ}.

It remains to find the shape of the tree, the regions 𝑅ℓ , which requires a bit more
work. The basic idea is, of course, to select the regions so that the tree fits the
training data. This means that the output predictions from the tree should match the
output values in the training data. Unfortunately, even when restricting ourselves
to seemingly simple regions such as the ‘boxes’ obtained from a decision tree,
finding the tree (a collection of splitting rules) that optimally partitions the input
space to fit the training data as well as possible turns out to be computationally
infeasible. The problem is that there is a combinatorial explosion in the number
of ways in which we can partition the input space. Searching through all possible
binary trees is not possible in practice unless the tree size is so small that it is not of
practical use.
To handle this situation, we use a heuristic algorithm known as recursive binary
splitting for learning the tree. The word recursive means that we will determine
the splitting rules one after the other, starting with the first split at the root and
then building the tree from top to bottom. The algorithm is greedy, in the sense
that the tree is constructed one split at a time, without having the complete tree ‘in
mind’. That is, when determining the splitting rule at the root node, the objective is
to obtain a model that explains the training data as well as possible after a single
split, without taking into consideration that additional splits may be added before
arriving at the final model. When we have decided on the first split of the input
space (corresponding to the root node of the tree), this split is kept fixed, and we
continue in a similar way for the two resulting half-spaces (corresponding to the
two branches of the tree), etc.
To see in detail how one step of this algorithm works, consider the situation when
we are about to do our very first split at the root of the tree. Hence, we want to
select one of the 𝑝 input variables 𝑥 1 , . . . , 𝑥 𝑝 and a corresponding cutpoint 𝑠 which
divide the input space into two half-spaces,

𝑅1 ( 𝑗, 𝑠) = {x | 𝑥 𝑗 < 𝑠} and 𝑅2 ( 𝑗, 𝑠) = {x | 𝑥 𝑗 ≥ 𝑠}. (2.4)

Note that the regions depend on the index 𝑗 of the splitting variable as well as the
value of the cutpoint 𝑠, which is why we write them as functions of 𝑗 and 𝑠. This is
the case also for the predictions associated with the two regions,

    ŷ1(𝑗, 𝑠) = Average{𝑦𝑖 : x𝑖 ∈ 𝑅1(𝑗, 𝑠)}   and   ŷ2(𝑗, 𝑠) = Average{𝑦𝑖 : x𝑖 ∈ 𝑅2(𝑗, 𝑠)},

since the averages in these expressions range over different data points depending
on the regions.


For each training data point (x𝑖 , 𝑦𝑖 ), we can compute a prediction error by first
determining which region the data point falls in and then computing the difference
between 𝑦𝑖 and the constant prediction associated with that region. Doing this for
all training data points, the sum of squared errors can be written as

    ∑_{𝑖: x𝑖 ∈ 𝑅1(𝑗,𝑠)} (𝑦𝑖 − ŷ1(𝑗, 𝑠))² + ∑_{𝑖: x𝑖 ∈ 𝑅2(𝑗,𝑠)} (𝑦𝑖 − ŷ2(𝑗, 𝑠))².          (2.5)

The square is added to ensure that the expression above is non-negative and that both
positive and negative errors are counted equally. The squared error is a common
loss function used for measuring the closeness of a prediction to the training data,
but other loss functions can also be used. We will discuss the choice of loss function
in more detail in later chapters.
To find the optimal split, we select the values for 𝑗 and 𝑠 that minimise the squared
error (2.5). This minimisation problem can be solved easily by looping through
all possible values for 𝑗 = 1, . . . , 𝑝. For each 𝑗, we can scan through the finite
number of possible splits and pick the pair ( 𝑗, 𝑠) for which the expression above
is minimised. As pointed out above, when we have found the optimal split at the
root node, this splitting rule is fixed. We then continue in the same way for the left
and right branches independently. Each branch (corresponding to a half-space) is
split again by minimising the squared prediction error over all training data points
following that branch.
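One step of this procedure, finding the pair (𝑗, 𝑠) that minimises (2.5) for a given set of training data points, can be sketched as follows in Python with NumPy (the function name and the choice of candidate cutpoints midway between sorted data values are our own assumptions):

import numpy as np

def best_split(X, y):
    # Return the pair (j, s) minimising the squared error (2.5), and the attained error.
    n, p = X.shape
    best_j, best_s, best_sse = None, None, np.inf
    for j in range(p):
        values = np.unique(X[:, j])
        cutpoints = (values[:-1] + values[1:]) / 2   # candidate cutpoints between observed values
        for s in cutpoints:
            left = X[:, j] < s                        # data points falling in R1(j, s)
            y_hat_1 = y[left].mean()                  # constant prediction in R1(j, s)
            y_hat_2 = y[~left].mean()                 # constant prediction in R2(j, s)
            sse = np.sum((y[left] - y_hat_1) ** 2) + np.sum((y[~left] - y_hat_2) ** 2)
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s, best_sse

After the first call, the same function would be applied recursively to the data points falling in each of the two resulting half-spaces, until a stopping criterion is met.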
In principle, we can continue in this way until there is only a single training data
point in each of the regions – that is, until 𝐿 = 𝑛. Such a fully grown tree will result
in predictions that exactly match the training data points, and the resulting model is
quite similar to 𝑘-NN with 𝑘 = 1. As pointed out above, this will typically result in
too erratic a model that has overfitted to (possibly noisy) training data. To mitigate
this issue, it is common to stop the growth of the tree at an earlier stage using some
stopping criterion, for instance by deciding on 𝐿 beforehand, limiting the maximum
depth (number of splits in any branch), or adding a constraint on the minimum
number of training data points associated with each leaf node. Forcing the model to
have more training data points in each leaf will result in an averaging effect, similar
to increasing the value of 𝑘 in the 𝑘-NN method. Using such a stopping criterion
means that the value of 𝐿 is not set manually but is determined adaptively based on
the result of the learning procedure.
A high-level summary of the method is given in Method 2.2. Note that the
learning in Method 2.2 includes a recursive call, where in each recursion we grow
one branch of the tree one step further.

Classification Trees
Trees can also be used for classification. We use the same procedure of recursive
binary splitting but with two main differences. Firstly, we use a majority vote instead
of an average to compute the prediction associated with each region:
    ŷℓ = MajorityVote{𝑦𝑖 : x𝑖 ∈ 𝑅ℓ}.


Learn a decision tree using recursive binary splitting

Data: Training data T = {x𝑖 , 𝑦𝑖 }𝑛𝑖=1
Result: Decision tree with regions 𝑅1, . . . , 𝑅𝐿 and corresponding predictions ŷ1, . . . , ŷ𝐿
1 Let 𝑅 denote the whole input space
2 Compute the regions (𝑅1, . . . , 𝑅𝐿) = Split(𝑅, T )
3 Compute the predictions ŷℓ for ℓ = 1, . . . , 𝐿 as
      ŷℓ = Average{𝑦𝑖 : x𝑖 ∈ 𝑅ℓ}          (Regression problems)
      ŷℓ = MajorityVote{𝑦𝑖 : x𝑖 ∈ 𝑅ℓ}     (Classification problems)
4 Function Split(𝑅, T ):
5     if stopping criterion fulfilled then
6         return 𝑅
7     else
8         Go through all possible splits 𝑥𝑗 < 𝑠 for all input variables 𝑗 = 1, . . . , 𝑝.
9         Pick the pair (𝑗, 𝑠) that minimises (2.5)/(2.6) for regression/classification problems.
10        Split region 𝑅 into 𝑅1 and 𝑅2 according to (2.4).
11        Split data T into T1 and T2 accordingly.
12        return Split(𝑅1, T1), Split(𝑅2, T2)
13    end
14 end

Predict from a decision tree

Data: Decision tree with regions 𝑅1, . . . , 𝑅𝐿, training data T = {x𝑖 , 𝑦𝑖 }𝑛𝑖=1, test data point x★
Result: Predicted test output ŷ(x★)
1 Find the region 𝑅ℓ which x★ belongs to.
2 Return the prediction ŷ(x★) = ŷℓ.

Method 2.2: Decision trees

Secondly, when learning the tree, we need a different splitting criterion than the
squared prediction error to take into account the fact that the output is categorical.
To define these criteria, note first that the split at any internal node is computed by
solving an optimisation problem of the form

    arg min_{𝑗,𝑠}  𝑛1𝑄1 + 𝑛2𝑄2,          (2.6)


where 𝑛1 and 𝑛2 denote the number of training data points in the left and right nodes
of the current split, respectively, and 𝑄1 and 𝑄2 are the costs (derived from the
prediction errors) associated with these two nodes. The variables 𝑗 and 𝑠 denote the
index of the splitting variable and the cutpoint as before. All of the terms 𝑛1 , 𝑛2 ,
𝑄 1 , and 𝑄 2 depend on these variables, but we have dropped the explicit dependence
from the notation for brevity. Comparing (2.6) with (2.5), we see that we recover
the regression case if 𝑄 ℓ corresponds to the mean-squared error in node ℓ.
To generalise this to the classification case, we still solve the optimisation problem
(2.6) to compute the split, but choose 𝑄 ℓ in a different way which respects the
categorical nature of a classification problem. To this end, we first introduce
    π̂ℓ𝑚 = (1/𝑛ℓ) ∑_{𝑖: x𝑖 ∈ 𝑅ℓ} I{𝑦𝑖 = 𝑚}

to be the proportion of training observations in the ℓth region that belong to the
𝑚th class. We can then define the splitting criterion, 𝑄 ℓ , based on these class
proportions. One simple alternative is the misclassification rate

    𝑄ℓ = 1 − max𝑚 π̂ℓ𝑚,          (2.7a)

which is simply the proportion of data points in region 𝑅ℓ which do not belong to
the most common class. Other common splitting criteria are the Gini index
∑︁
𝑀
𝑄ℓ = b
𝜋ℓ𝑚 (1 − b
𝜋ℓ𝑚 ) (2.7b)
𝑚=1

and the entropy criterion,


    𝑄ℓ = − ∑_{𝑚=1}^{𝑀} π̂ℓ𝑚 ln π̂ℓ𝑚.          (2.7c)
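As a small sketch in Python with NumPy (the function names are ours), all three criteria can be computed from the vector of class proportions π̂ℓ1, . . . , π̂ℓ𝑀 of a node:

import numpy as np

def misclassification_rate(pi):   # (2.7a)
    return 1.0 - np.max(pi)

def gini_index(pi):               # (2.7b)
    return np.sum(pi * (1.0 - pi))

def entropy(pi):                  # (2.7c); classes with proportion 0 contribute 0
    pi = pi[pi > 0]
    return -np.sum(pi * np.log(pi))

# A node with 3 data points of one class and 4 of the other:
pi = np.array([3 / 7, 4 / 7])
print(misclassification_rate(pi), gini_index(pi), entropy(pi))   # approx. 0.43, 0.49, 0.68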

In Example 2.6 we illustrate how to construct a classification tree using recursive


binary splitting and with the entropy as the splitting criterion.

Example 2.6 Learning a classification tree (continuation of Example 2.5)

We consider the same setup as in Example 2.5, but now with the following dataset:
We want to learn a classification tree by using the entropy criterion in (2.7c) and
growing the tree until there are no regions with more than five data points left.
𝑥1     𝑥2     𝑦
9.0    2.0    Blue
1.0    4.0    Blue
4.0    6.0    Blue
4.0    1.0    Blue
1.0    2.0    Blue
1.0    8.0    Red
6.0    4.0    Red
7.0    9.0    Red
9.0    8.0    Red
9.0    6.0    Red

Figure 2.8: The training data (dots) in the (𝑥1, 𝑥2) plane and the nine possible splits (dashed lines).

First split: There are infinitely many possible splits we can make, but all splits
which give the same partition of the data points are equivalent. Hence, in practice
we only have nine different splits to consider in this dataset. The data (dots) and
these possible splits (dashed lines) are visualised in Figure 2.8.
We consider all nine splits in turn. We start with the split at 𝑥1 = 2.5, which splits
the input space into two regions, 𝑅1 = 𝑥1 < 2.5 and 𝑅2 = 𝑥1 ≥ 2.5. In region 𝑅1 we

have two blue data points and one red, in total 𝑛1 = 3 data points. The proportion
of the two classes in region 𝑅1 will therefore be π̂1B = 2/3 and π̂1R = 1/3. The
entropy is calculated as

    𝑄1 = −π̂1B ln(π̂1B) − π̂1R ln(π̂1R) = −(2/3) ln(2/3) − (1/3) ln(1/3) = 0.64.

In region 𝑅2 we have 𝑛2 = 7 data points with the proportions π̂2B = 3/7 and
π̂2R = 4/7. The entropy for this region will be

    𝑄2 = −π̂2B ln(π̂2B) − π̂2R ln(π̂2R) = −(3/7) ln(3/7) − (4/7) ln(4/7) = 0.68,

and inserted in (2.6), the total weighted entropy for this split becomes

    𝑛1𝑄1 + 𝑛2𝑄2 = 3 · 0.64 + 7 · 0.68 = 6.69.
We compute the costs for all other splits in the same manner and summarise them
in the table below:

Split (𝑅1)    𝑛1   π̂1B   π̂1R   𝑄1      𝑛2   π̂2B   π̂2R   𝑄2      𝑛1𝑄1 + 𝑛2𝑄2
𝑥1 < 2.5       3    2/3   1/3   0.64     7    3/7   4/7   0.68    6.69
𝑥1 < 5.0       5    4/5   1/5   0.50     5    1/5   4/5   0.50    5.00
𝑥1 < 6.5       6    4/6   2/6   0.64     4    1/4   3/4   0.56    6.07
𝑥1 < 8.0       7    4/7   3/7   0.68     3    1/3   2/3   0.64    6.69
𝑥2 < 1.5       1    1/1   0/1   0.00     9    4/9   5/9   0.69    6.18
𝑥2 < 3.0       3    3/3   0/3   0.00     7    2/7   5/7   0.60    4.18
𝑥2 < 5.0       5    4/5   1/5   0.50     5    1/5   4/5   0.50    5.00
𝑥2 < 7.0       7    5/7   2/7   0.60     3    0/3   3/3   0.00    4.18
𝑥2 < 8.5       9    5/9   4/9   0.69     1    0/1   1/1   0.00    6.18

From the table, we can read that the two splits at 𝑥 2 < 3.0 and 𝑥 2 < 7.0 are both
equally good. We choose to continue with 𝑥2 < 3.0.
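The numbers in the table can be verified with a few lines of code. The following Python sketch (our own illustration, using the convention that a pure node has entropy 0) computes the weighted entropy 𝑛1𝑄1 + 𝑛2𝑄2 for any candidate split of the ten data points:

import numpy as np

# The ten data points (x1, x2) and their labels from the dataset above.
X = np.array([[9, 2], [1, 4], [4, 6], [4, 1], [1, 2],
              [1, 8], [6, 4], [7, 9], [9, 8], [9, 6]], dtype=float)
y = np.array(["B", "B", "B", "B", "B", "R", "R", "R", "R", "R"])

def node_entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    pi = counts / counts.sum()
    return -np.sum(pi * np.log(pi))   # a pure node gives 0, since pi = [1]

def split_cost(j, s):
    left = X[:, j] < s
    return left.sum() * node_entropy(y[left]) + (~left).sum() * node_entropy(y[~left])

print(split_cost(1, 3.0))   # x2 < 3.0: approx. 4.19 (tabulated as 4.18, with Q rounded to two decimals)
print(split_cost(1, 7.0))   # x2 < 7.0: the equally good split
print(split_cost(0, 2.5))   # x1 < 2.5: approx. 6.69, matching the first row of the table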


Figure 2.9: The data and the partition of the input space after the first split (left), with the possible second splits shown as dashed lines, and after the second split (right), where the upper region has been divided into 𝑅2 and 𝑅3 while the lower region 𝑅1 is left unchanged.

Second split: We note that only the upper region has more than five data points.
Also, there is no point splitting region 𝑅1 further since it only contains data points
from the same class. In the next step, we therefore split the upper region into two
new regions, 𝑅2 and 𝑅3 . All possible splits are displayed in Figure 2.9 to the left
(dashed lines), and we compute their costs in the same manner as before:

Split (𝑅2)    𝑛2   π̂2B   π̂2R   𝑄2      𝑛3   π̂3B   π̂3R   𝑄3      𝑛2𝑄2 + 𝑛3𝑄3
𝑥1 < 2.5       2    1/2   1/2   0.69     5    1/5   4/5   0.50    3.89
𝑥1 < 5.0       3    2/3   1/3   0.64     4    0/4   4/4   0.00    1.91
𝑥1 < 6.5       4    2/4   2/4   0.69     3    0/3   3/3   0.00    2.77
𝑥1 < 8.0       5    2/5   3/5   0.67     2    0/2   2/2   0.00    3.37
𝑥2 < 5.0       2    1/2   1/2   0.69     5    1/5   4/5   0.50    3.89
𝑥2 < 7.0       4    2/4   2/4   0.69     3    0/3   3/3   0.00    2.77
𝑥2 < 8.5       6    2/6   4/6   0.64     1    0/1   1/1   0.00    3.82

The best split is the one at 𝑥1 < 5.0, visualised above to the right. None of the three
regions has more than five data points. Therefore, we terminate the training. The
final tree and its partitions were displayed in Example 2.5. If we want to use the
tree for prediction, we predict blue if x★ ∈ 𝑅1 or x★ ∈ 𝑅2 since the blue training
data points are in the majority in each of these two regions. Similarly, we predict
red if x★ ∈ 𝑅3 .

When choosing between the different splitting criteria mentioned above, the
misclassification rate sounds like a reasonable choice since that is typically the
criterion we want the final model to do well on.6 However, one drawback is that it
does not favour pure nodes. By pure nodes we mean nodes where most of the data
points belong to a certain class. It is usually an advantage to favour pure nodes in
the greedy procedure that we use to grow the tree, since this can lead to fewer splits

6 This is not always true, for example for imbalanced and asymmetric classification problems; see
Section 4.5.



Figure 2.10: Three splitting criteria for classification trees as a function of the proportion
of the first class 𝑟 = 𝜋ℓ1 in a certain region 𝑅ℓ as given in (2.8). The entropy criterion has
been scaled such that it passes through (0.5,0.5).

in total. Both the entropy criterion and the Gini index favour node purity more than
the misclassification rate does.
This advantage can also be illustrated in Example 2.6. Consider the first split in
this example. If we were to use the misclassification rate as the splitting criterion,
both the split 𝑥 2 < 5.0 and the split 𝑥 2 < 3.0 would provide a total misclassification
rate of 0.2. However, the split at 𝑥 2 < 3.0, which the entropy criterion favoured,
provides a pure node 𝑅1 . If we now went with the split 𝑥 2 < 5.0, the misclassification
after the second split would still be 0.2. If we continued to grow the tree until no
data points were misclassified, we would need three splits if we used the entropy
criterion, whereas we would need five splits if we used the misclassification criterion
and started with the split at 𝑥 2 < 5.0.
To generalise this discussion, consider a problem with two classes, where we
denote the proportion of the first class as 𝜋ℓ1 = 𝑟 and hence the proportion of the
second class as 𝜋ℓ2 = 1 − 𝑟. The three criteria (2.7) can then be expressed in terms
of 𝑟 as

    Misclassification rate:  𝑄ℓ = 1 − max(𝑟, 1 − 𝑟),
    Gini index:              𝑄ℓ = 2𝑟(1 − 𝑟),                              (2.8)
    Entropy:                 𝑄ℓ = −𝑟 ln 𝑟 − (1 − 𝑟) ln(1 − 𝑟).

These functions are shown in Figure 2.10. All three criteria are similar in the
sense that they provide zero loss if all data points belong to either of the two classes
and maximum loss if the data points are equally divided between the two classes.
However, the Gini index and entropy have a higher loss for all other proportions.
In other words, the gain of having a pure node (𝑟 close to 0 or 1) is higher for the
Gini index and the entropy than for the misclassification rate. As a consequence,
the Gini index and the entropy both tend to favour making one of the two nodes
pure (or close to pure) since that provides a smaller total loss, which can make a
good combination with the greedy nature of the recursive binary splitting.


Figure 2.11: Decision trees applied to the music classification Example 2.1 (a and b) and the car stopping distance Example 2.2 (c and d).
(a) Decision boundaries for the music classification problem for a fully grown classification tree with the Gini index. This model overfits the data.
(b) The same problem and data as in 2.11a, for which a tree restricted to depth 4 has been learned, again using the Gini index. This model will hopefully make better predictions for new data.
(c) The prediction for a fully grown regression tree. As for the classification problem above, this model overfits to the training data.
(d) The same problem and data as in 2.11c, for which a tree restricted to depth 3 has been learned.

How Deep Should a Decision Tree be?

The depth of a decision tree (the maximum distance between the root node and any
leaf node) has a big impact on the final predictions. The tree depth impacts the
predictions in a somewhat similar way to the hyperparameter 𝑘 in 𝑘-NN. We again
use the music classification and car stopping distance problems from Examples 2.1
and 2.2 to study how the decision boundaries change depending on the depth of the
trees. In Figure 2.11, the decision boundaries are illustrated for two different trees.
In Figure 2.11a and c, we have not restricted the depth of the tree and have grown it
until each region contains only data points with the same output value – a so-called
fully grown tree. In Figure 2.11b and d, the maximum depth is restricted to 4 and 3,
respectively.


Similarly to choosing 𝑘 = 1 in 𝑘-NN, for a fully grown tree, all training data
points will, by construction, be correctly predicted since each region only contains
data points with the same output. As a result, for the music classification problem,
we get thin and small regions adapted to single training data points, and for the car
stopping distance problem, we get a very irregular line passing exactly through the
observations. Even though these trees give excellent performance on the training
data, they are not likely to be the best models for new, as yet unseen data. As we
discussed previously in the context of 𝑘-NN, we refer to this as overfitting.
In decision trees, we can mitigate overfitting by using shallower trees. Conse-
quently, we get fewer and larger regions with an increased averaging effect, resulting
in decision boundaries that are less adapted to the noise in the training data. This is
illustrated in Figure 2.11b and d for the two example problems. As for 𝑘 in 𝑘-NN,
the optimal size of the tree depends on many properties of the problem, and it is
a trade-off between flexibility and rigidity. Similar trade-offs have to be made for
almost all methods presented in this book, and they will be discussed systematically
in Chapter 4.
How can the user control the growth of the tree? Here we have different
strategies. The most straightforward strategy is to adjust the stopping criterion, that
is, the condition that should be fulfilled for not proceeding with further splits in a
certain node. As mentioned earlier, this criterion could be that we do not attempt
further splits if there are fewer than a certain number of training data points in the
corresponding region, or, as in Figure 2.11, we can stop splitting when we reach a
certain depth. Another strategy to control the depth is to use pruning. In pruning,
we start with a fully grown tree, and then in a second post-processing step, prune it
back to a smaller one. We will, however, not discuss pruning further here.
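In software libraries, these stopping criteria are typically exposed as tuning parameters. As an illustration, the following sketch uses the third-party scikit-learn library (our own choice of software, not one prescribed here) to grow trees with a restricted depth or a minimum number of training data points per leaf:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Synthetic placeholder data; in this chapter's examples, X would hold the song
# features or the car speeds, and y the artist labels or the stopping distances.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y_class = np.where(X[:, 0] + X[:, 1] > 10, "Red", "Blue")
y_reg = X[:, 0] ** 2 + rng.normal(0, 5, size=100)

# A classification tree grown with the Gini index, restricted to depth 4 (cf. Figure 2.11b).
clf = DecisionTreeClassifier(criterion="gini", max_depth=4).fit(X, y_class)

# A regression tree where every leaf must contain at least 5 training data points.
reg = DecisionTreeRegressor(min_samples_leaf=5).fit(X, y_reg)

print(clf.predict([[3.0, 8.0]]), reg.predict([[3.0, 8.0]]))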

2.4 Further Reading


The reason why we started this book with 𝑘-NN is that it is perhaps the most intuitive
and straightforward way to solve a classification problem. The idea is at least a
thousand years old and was described already by Ḥassan Ibn al-Haytham (latinised
as Alhazen) around the year 1030 in Kitāb al-Manāẓir (Book of Optics) (Pelillo
2014), as an explanation of how the human brain perceives objects. As with many
good ideas, the nearest neighbour idea has been re-invented many times, and a more
modern description of 𝑘-NN can be found in Cover and Hart (1967).
Also, the basic idea of decision trees is relatively simple, but there are many
possible ways to improve and extend them as well as different options for how to
implement them in detail. A somewhat longer introduction to decision trees is found
in Hastie et al. (2009), and a historically oriented overview can be found in Loh
(2014). Of particular significance are perhaps CART (Classification and Regression
Trees; Breiman et al. 1984), as well as ID3 (Quinlan 1986) and C4.5 (Quinlan 1993).
