LN ML Rug
Lecture Notes
V 1.19, Jan 25, 2023
A Elementary mathematical structure-forming operations
A.1 Pairs, tuples and indexed families
A.2 Products of sets
A.3 Products of functions
A note on the mathematical background that is required. Machine learning
programs process “training data”, and these data typically come, in their original
form, as Excel files (yes! empirical scientists who actually generate those valuable
“raw data” often use Excel!), which are just matrices when seen with the eye
of a machine learner. Furthermore, neural networks are shaped by connecting
neurons with weighted “synaptic links”, and these weights are again naturally
sorted in matrices. And the main operation that a neural network actually does
is formalized by a matrix-vector multiplication. Besides neural networks, many
other (if not most) machine learning systems operate on vector-formatted data.
So it’s matrices and vectors all over the place, no escape possible. You will need at
least a basic, robust understanding of linear algebra to survive or even enjoy this
course. We will arrange a linear algebra crash refresher early in the course. A good
free online resource is the book “Mathematics for Machine Learning” (Deisenroth,
Faisal, and Ong 2019).
Furthermore, though to a lesser degree, some familiarity with statistics and
probability is also needed. You will find a summary of the must-knows in the appendix of
these lecture notes, and again a tutorial exposition in Deisenroth, Faisal, and Ong
2019.
Finally, a little (not much) calculus is needed to top it off. If you are familiar
with the notion of partial derivatives, that should do it. In case of doubt - again
it’s all (and more) in Deisenroth, Faisal, and Ong 2019.
1 Introduction
1.1 Human Versus Machine Learning
Humans learn. Animals learn. Societies learn. Machines learn. It looks as if
“learning” were a universal phenomenon and all we had to do were to develop a solid
scientific theory of “learning”, turn that into algorithms, and then let “learning”
happen on computers. Wrong, wrong, wrong. Human learning is very different
from animal learning (and amoebas learn different things in different ways than
chimpanzees do), societal learning is quite another thing than human or animal learning,
and machine learning (ML) is as different from any of the former as cars are
different from horses.
Human learning is incredibly elusive. It is as complex and impossible to un-
derstand as you are yourself — look into a mirror and think of all the things you
can do, all of your body motions from tying your shoes to playing the guitar; all
the thoughts you can think from “aaagrhhh!” to “I think therefore I am”; achieve-
ments personal, social, academic; all the things you can remember including your
first kiss and what you did 20 seconds ago (you started reading this paragraph,
in case you forgot); your plans for tomorrow and the next 40 years; well, just
everything about you — almost everything of that wild collection is the result of a
fabulous mixing of some kind of learning with other miracles and wonders of life.
To fully understand human learning, a scientist would have to integrate a daunting
number of fields and phenomena at once.
Recent spectacular advances in machine learning may have nurtured the impression
that machines already come somewhat close. Specifically, neural networks
with many cascaded internal processing stages (so-called deep networks)
have been trained to solve problems that were considered close to impossible
only a few years back. A showcase example (one that got me hooked) is automated
image captioning (technical report: Kiros, Salakhutdinov, and Zemel 2014).
At http://www.cs.toronto.edu/~nitish/nips2014demo you can find stunning
examples of caption phrases that have been automatically generated by a neural
network based system which was given photographic images as input. Well, while
most examples are amazing, some other examples are also stunningly wrong. Fig-
ure 1 shows some screenshots. This is a demo from 2014. Since deep learning
is evolving incredibly fast, it’s already quite outdated today and current image
caption generators come much closer to perfection. But back in 2014 this was
a revelation. Other fascinating examples of deep learning are face recognition
(Parkhi, Vedaldi, and Zisserman 2015), online text translation (Bahdanau, Cho,
and Bengio 2015), inferring a Turing machine (almost) from input-output exam-
ples (Graves et al. 2016), or playing the game of Go at and beyond the level of
human grand-masters (Silver et al. 2016). I listed these examples when I wrote
the first edition of these lecture notes and I am too lazy to update this list every
year (though I should); instead, this year (2022) I just point out the current best
deep learning system, a text generation system called GPT-3 (read more about it
on Wikipedia and in this instructively critical blog).
So, apparently machine learning algorithms come close to human performance
in several tasks or even surpass humans, and these performance achievements have
been learnt by the algorithms; thus, machines today can learn like humans??!?
The answer is NO. ML researchers (that is, the really good ones, not the average
TensorFlow user) are highly aware of this. Outside ML however, naive specta-
tors (from the popular press, politics, or other sciences) often conclude that since
learning machines can achieve similar feats as humans, they also learn like hu-
mans. It takes some effort to argue why this is not so (read Edelman 2015 for
a refutation from the perspective of cognitive psychology). I cannot embark on
this fascinating discussion here. Very roughly speaking, it’s the same story again
as with chess-playing algorithms: the best chess programs win against the best
human chess players, but not by fair means — chess programs are based on larger
amounts of data (recorded chess matches) than humans can memorize, and chess
programs can do vastly more computational operations per second than a human
can do. Brute force wins over human brains at some point when there is enough
data and processing bandwidth. Progress has accelerated in the last years be-
cause increasingly large training datasets have become available and fast enough
computing systems have become cheap enough.
This is not to say that powerful “deep learning” just means large datasets and
fast machines. These conditions are necessary but not sufficient. In addition,
numerous algorithmic refinements and theoretical insights in the area of
statistical modeling had to be developed. Some of these algorithmic/theoretical
concepts will be presented in this course.
Take-home message: The astonishing learning achievements of today’s ML
are based on statistical modeling insight, very cleverly refined algorithms, raw
processing power and a lot of personal experience and trial-and-error optimization.
It’s technology and maths, not neurobiology or psychology. Dismiss any romantic
ideas about ML that you may have had. ML is data technology stuff for sober
engineers. But you are allowed to become very excited about that stuff, and that
stuff can move mountains.
Figure 1: Three screenshots from the image caption demo at http://www.cs.
toronto.edu/~nitish/nips2014demo. A “deep learning” system was trained on
some tens of thousands of photos showing everyday scenes. Each photo in the
training set came with a few short captions provided by humans. From these
training data, the system learnt to generate tags and captions for new photos.
The tags and captions on the left were produced by the trained system upon
input of the photos at the right.
1.2 The two super challenges of ML - from an eagle’s eye
In this section I want to explain, on an introductory, pre-mathematical level,
that large parts of ML can be understood as the art of estimating probability
distributions from data. And that this art faces a double super challenge: the
unimaginably complex geometry of real-world data distributions, and the extreme
scarcity of information provided by real-world data. I hope that after reading this
section you will be convinced that machine learning is impossible. Since however
GPT-3 exists, where is the trick? ... A nice cliff hanger, you see — I want to make
you read on.
dimensional word vectors (I am too lazy to check out the exact dimension that was
used). Thus, a caption can be represented by a sequence of ten 300-dimensional
vectors, which boils down to a single 3000-dimensional vector, that is, a point
c ∈ R^3000.
Similarly, an input picture sized 600 × 800 pixels with 3 color channels is
represented to TICS as a vector u of dimension 3 × 600 × 800 = 1,440,000.
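To make the dimension bookkeeping concrete, here is a minimal Python sketch of how such vectors could be assembled (my own illustration, not the actual TICS code; the word embedding word_vec is a random stand-in for whatever semantic word vectors TICS really used):

import numpy as np

# Hypothetical semantic word embedding: every known word maps to a
# 300-dimensional vector (random numbers here, standing in for a real embedding).
rng = np.random.default_rng(0)
vocabulary = ["a", "group", "of", "people", "on", "bridge", "beside", "boat", "ship"]
word_vec = {w: rng.standard_normal(300) for w in vocabulary}

# A 10-word caption becomes the concatenation of its ten word vectors:
caption = ["a", "group", "of", "people", "on", "a", "bridge", "beside", "a", "boat"]
c = np.concatenate([word_vec[w] for w in caption])   # shape (3000,)

# A 600 x 800 RGB image becomes one long pixel vector:
image = rng.random((600, 800, 3))     # stand-in for a photo
u = image.reshape(-1)                 # shape (1440000,)

print(c.shape, u.shape)               # (3000,) (1440000,)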
Now TICS generates, upon a presentation of an input picture vector u, a list
of what it believes to be the five most probable captions for this input. That is,
TICS must have an idea of the probability ratios of different caption candidates.
Formally, if C denotes the random variable (RV) that returns captions c, and if U
denotes the RV that produces sample input pictures u, TICS must compute ratios
of conditional probabilities of the form P(C = ci | U = u) / P(C = cj | U = u). In the
semantic word vector space, these ratios become ratios of the values of probability
density functions (pdf) over the caption vector space R^3000. For every input image
u, TICS must have some representation of the 3000-dimensional pdf describing
the probabilities of caption candidates for image u.
Now follow me on a little excursion into geometric imagination.
Consider some specific vector c ∈ R^3000 which represents a plausible 10-word
caption for image u, that is, the pdf value p(c) is relatively large. What happens
to the pdf value if we move away from c by taking a little step δ ∈ R^3000 of
length, say, ‖δ‖ = 0.1, that is, how does p(c + δ) compare to p(c)? This depends
on the direction in which δ points. In a few directions, p(c + δ) will be about
as large as p(c). This happens when δ points toward another caption vector c′
which has one word replaced by another word that is semantically close. For
example, consider the caption “A group of people on a bridge beside a boat”. The
last 300 elements in the 3000-dimensional vector c coding this caption stand for
the word boat. Replacing this caption by “a group of people on a bridge beside a
ship” gives another codevector c′ ∈ R^3000 which is the same as c except in the last
300 components, which have been replaced by the semantic word vector for ship.
Then, if δ points from c toward c′ (that is, δ is a fraction of c′ − c), p(c + δ) will
not differ from p(c) in a major way.
If you think a little about it, you will come to the conclusion that such δ
which leave p(c + δ) roughly at the same level are always connected with replacing
words by semantically related words. Other change directions δ* will either make
no semantic sense or destroy the grammatical structure of the caption phrase
encoded by c. The pdf value p(c + δ*) will drop dramatically compared to p(c) in
those cases.
Now, in a 10-word caption, how many replacements of some word with a related
one exist? Some words will be grammatical function words (“a”, “of”, etc.) which
admit only a small number of replacements, or none at all. The words that carry
semantic meaning (“group”, “people”, etc.) typically allow for a few sense-making
replacements. Let us be generous and assume that a word in a 10-word caption, on
average, can be replaced by 5 alternative words such that after the replacement,
the new caption still is a reasonable description of the input image.
This means that around c there will be 5 · 10 = 50 directions in which the
relatively large value of p(c) stays large. Assuming these 50 directions are given
by linearly independent δ vectors, we find that around c there is a 50-dimensional
affine linear subspace S of R^3000 in which we can find high p values in the vicinity
of c, while in the 2950 directions orthogonal to S, the value of p will drop fast if
one moves away from c.
Ok., this was a long journey to a single geometric finding: locally around
some point c where the pdf is relatively large, the pdf will stay relatively large
only in a small fraction of the directions - these directions span a low-dimensional
linear subspace around c. If you move a little further away from c on this low-dimensional
“sheet”, following the lead of high pdf values, you will find that this
high-probability surface takes you on a curved path - these high-probability
sheets in R^3000 are not flat but curved.
The mathematical abstraction of such relatively high-probability, low-dimensional,
curved sheets embedded in R^3000 is the concept of a manifold. Machine learning
professionals often speak of the “data manifold” in a general way, in order to in-
dicate that the geometry of high-probability areas of real-world pdfs consists of
“thin” (low-dimensional), curved sheet-like domains curled into the embedding
data space. It is good for ML students to know the clean mathematical definition
of a manifold. Although the geometry of real-world pdfs will be less clean than
this mathematical concept, it provides useful intuitions, and advanced research in
DL algorithms frequently starts off from considering manifold models.
So, here is a brief intro to the mathematical concept of a manifold. Consider
an n-dimensional real vector space R^n (for the TICS output space of caption
encodings this would be n = 3000). Let m ≤ n be a positive integer not larger
than n. An m-dimensional manifold M is a subset of R^n which locally can be
smoothly mapped to R^m, that is, at each point of M one can smoothly map a
neighborhood of that point to a neighborhood of the origin in the m-dimensional
Euclidean coordinate system (Figure 2A).
1-dimensional manifolds are just lines embedded in some higher-dimensional R^n
(Figure 2B), 2-dimensional manifolds are surfaces, etc. Manifolds can be wildly
curved, knotted (as in Figure 2C), or fragmented (as in Figure 2B). Humans
cannot visually imagine manifolds of dimension greater than 2.
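To make this a little more tangible, the following toy sketch (my own construction, not taken from any of the cited examples) samples points that lie, up to a little noise, on a 1-dimensional manifold, a curved thread, embedded in R^50. The local coordinate t plays the role of the smooth map to R^m from the definition, with m = 1 and n = 50:

import numpy as np

rng = np.random.default_rng(1)
n = 50                                   # dimension of the embedding space R^n

# A smooth map from a local coordinate t into R^50: every embedding dimension
# is a different sinusoid of t, so the image is a curved 1-dimensional "thread".
freqs = rng.uniform(0.5, 3.0, size=n)
phases = rng.uniform(0.0, 2 * np.pi, size=n)

def embed(t):
    """Map local manifold coordinates t (shape (k,)) to points in R^n (shape (k, n))."""
    return np.sin(np.outer(t, freqs) + phases)

t = rng.uniform(0.0, 10.0, size=2000)    # local coordinates of 2000 sample points
points = embed(t)                        # a point cloud hugging a 1-D manifold in R^50

# Real-world data would sit only approximately on the manifold: add a little noise.
noisy_points = points + 0.01 * rng.standard_normal(points.shape)
print(noisy_points.shape)                # (2000, 50)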
The “data manifold” of a real-world data source is not a uniquely or even
well-defined thing. For instance, returning to the TICS example, the dimension
of the manifold around a caption c would depend on an arbitrary threshold fixed
by the researcher – a direction δ would / would not lead out of the manifold if
the pdf decreases in that direction faster than this threshold. Also (again using
the caption scenario) around some good captions c the number of “good” local
change directions will differ from the number around another, equally good, but
differently structured caption c′. For these and other reasons, claiming that data
distributions are shaped like manifolds is a strongly simplifying abstraction.
Figure 2: Examples of manifolds (panels A, B, C; see the references to this figure in the text). Image sources: kahrstrom.com/mathematics/illustrations.php, en.wikipedia.org/wiki/Manifold, www.math.utah.edu/carlson60/.
Despite the fact that this abstraction does not completely capture the geometric-
statistical complexity of real-world data points, the geometric intuitions behind
the manifold concept have led to substantial insight into the (mal-)functioning of
machine learning methods. Adversarial attacks on deep networks is a good exam-
ple of the usefulness of the data manifold concept. Take a look at Figure 3 – taken
from a much-cited paper (I. J. Goodfellow, Shlens, and Szegedy 2014, 13500 cites
on Google Scholar 2022) which explores the phenomenon of adversarial attacks.
The left panel shows a photo of a panda (one would think). It is given as input to
a deep network that had been trained to classify images. The network correctly
classifies it as “panda”, with a “confidence” of 58%. The middle panel shows a
pattern that looks like noise. If a very small multiple (factor 0.007) of this pattern
is added to the panda picture, one gets the picture shown in the right panel. For
the human eye there is no change. However, the neural network now classifies it
as “gibbon” with a dead-sure confidence level of 99%.
What has happened here? Well, this is a manifold story. Let’s say that the
panda image is sized 600 × 600 pixels (I didn’t check) with three color channels.
Figure 3: Manifold magic: Turning a panda into a gibbon. For explanation see
text. Picture taken from I. J. Goodfellow, Shlens, and Szegedy 2014
Thus, mathematically, such images are points in R^1080000. When the neural network
was trained, it was presented with a large number of example images, that
is, a point cloud in R^1080000. The outcome of the learning algorithm is an estimate
(up to an undetermined scaling factor) of a pdf in R^1080000. Geometrically speaking,
this pdf is highly concentrated along a low-dimensional manifold, call it M.
The dimension of this manifold is given by the number of neurons in the most
narrow layer of the neural network, say m = 1000 (this would be a standard order
of magnitude in such deep networks; I didn’t check). Thus, the original panda picture
corresponds to a point u on a 1000-dimensional manifold which is curled into
a 1,080,000-dimensional embedding space. In terms of dimensions, the manifold
uses only about one out of a thousand dimensions! Now, add that noise-like image
(call it δ) to u, getting the rightmost panel picture as u + δ. If δ is prepared in a
way that it points in a direction orthogonal to M, the value of the pdf will shrink
dramatically. The trained network, when presented with input u + δ, will “think”
that it has never seen anything like this u + δ image, and will return a random
classification – “gibbon” in this case. In fact, the “noise” pattern in Figure 3 isn’t
random noise at all; it has been cleverly computed to lead the neural network state
away from the trained manifold.
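The perturbation pattern in Figure 3 was computed with what I. J. Goodfellow, Shlens, and Szegedy 2014 call the fast gradient sign method: take the gradient of the network's training loss with respect to the input image and step a tiny bit in the direction of its sign. Here is a minimal PyTorch sketch of that idea (my own code, assuming some pretrained classifier model and the correct label of the image; not the authors' original implementation):

import torch

def fgsm_perturbation(model, image, label, epsilon=0.007):
    """Fast gradient sign method: return an adversarially perturbed image.

    image: tensor of shape (1, 3, H, W) with values in [0, 1]
    label: tensor of shape (1,) holding the correct class index
    """
    image = image.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(image), label)
    loss.backward()
    # Move every pixel a tiny step in the direction that increases the loss;
    # only the sign of the gradient is used, hence "gradient sign" method.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

To the human eye the returned image is indistinguishable from the original, but, as Figure 3 shows, the tiny step can be enough to change the network's classification.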
Adversarial examples are today widely used in a number of different ways to
improve the quality of deep learning applications. Check out Section 7.13 in the
bible of deep learning (I. Goodfellow, Bengio, and Courville 2016) for a primer if
you are interested.
Back to our TICS example. At the first stage in their neural processing
pipeline, the designers of this learning system set up a layered neural network
whose input layer (the “retina” of the network) had 1,440,000 neurons (one neu-
ron per color pixel), and whose output layer had 4096 neurons. This means that
the 1,440,000 dimensional raw input image vectors were projected on a manifold
that had at most 4096 dimensions (or less, if another layer in this network had
fewer neurons, like the 1000 that we assumed above). The right “folding” of this
manifold was effected by the neural network training.
Takehome summary: real-world data distributions are typically concentrated
along “thin” (low-dimensional), curved, sheet-like manifolds curled into their high-dimensional embedding data spaces.
Figure 4: Two images and their annotations from the training dataset. Taken
from Young et al. 2014.
3-dimensional projection of this 1,443,000-dimensional space – three dimensions
being the largest spatial dimension that our brains can handle. The rendering
in Figure 5 corresponds to a dataset where each “photo” is made of merely two
grayscale pixels. Each such “photo” is thus a point in the square [0,1]^2 (light blue
area in the figure, spanned by the two pixel intensities x1, x2). Each caption is
coded by a single number y ∈ [0, 1]. We furthermore simplify the situation by
assuming that the training dataset contains only two photos u1 , u2 . The caption
coding vector of a photo is here reduced to a single number, plotted above the
photo’s coordinates on the y-axis (blue crosses). The two blue crosses in the figure
thus represent the information contained in the training data.
[Figure 5 graphic; the example caption displayed in it reads: “A green-shirted man with a butcher's apron uses a knife to carve out the hanging carcass of a cow.”]
Figure 5: Training data for TICS (highly simplified). The photo dataspace (light
blue square spanned by pixel intensities x1, x2) is here reduced to 2 dimensions,
and the caption dataspace to a single dimension (y-axis). The training dataset is
assumed to contain two photos only (blue diamonds u1, u2), each with one caption
(blue crosses with y-values y1, y2). A test image u∗ (orange diamond) must be
associated by TICS, after it has been trained, with a suitable caption which lies
somewhere in the y-direction above u∗ (dashed orange ??-line).
Now consider a new test image u∗ (orange diamond in the figure). The TICS
system must determine a suitable caption for u∗ , that is, an appropriate value y ∗
– a point somewhere on the orange broken ??-line in the figure.
But all that TICS knows about captions is contained in the two training data
points (blue crosses). If you think about it, there seems to be no way for TICS to
infer from these two points where y ∗ should lie. Any placement along the ??-line
is logically possible!
In order to determine a caption position along the ??-line, TICS must add
some optimization criterion to the scenario. For instance, one could require one
of the following conditions:
1. Make y ∗ that point on the ??-line which has the smallest total distance to
all the training points.
2. One might wish to grant closer-by training images a bigger impact on the
caption determination than further-away training images. Thus, if d(ui , u∗ )
denotes the distance between training image ui and test image u∗ , set y ∗
to the weighted mean of the training captions yi , where the weights are
inversely proportional to the distance of the respective images:
y∗ = (1 / ∑_i d(ui, u∗)^(−a)) ∑_i d(ui, u∗)^(−a) yi .

(A small numerical sketch of this weighting scheme is given after the discussion points below.)
• All of these optimization criteria make some intuitive sense, but they would
lead to different results. The generated captions will differ depending on
which criterion is chosen.
• The criteria rest on quite different intuitions. It is unclear which one would
be better than another one, and on what grounds one would compare their
relative merits. A note in passing: finding comparison criteria to assess
relative merits of different statistical estimation procedures is a task that
constitutes an entire, important branch of statistics. For an introduction you
might take a look at my lecture notes on “Principles of statistical modeling”,
Section 19.3.1 “Comparing statistical procedures” (https://www.ai.rug.
nl/minds/uploads/LN_PSM.pdf).
• Criteria 2 and 3 require the choice of design parameters (a, σ^2). The other
criteria could be upgraded to include reasonable design parameters too. It
is absolutely typical for machine learning algorithms to have such hyper-
parameters which must be set by the experimenter. The effects of these
hyperparameters can be very strong, determining whether the final solution
is brilliant or useless.
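To see criterion 2 in action, here is a small numerical sketch of the inverse-distance weighting formula (my own toy numbers, loosely following the two-training-photo scenario of Figure 5; the exponent a is one of the design parameters just mentioned):

import numpy as np

def idw_caption(u_star, train_images, train_captions, a=2.0):
    """Criterion 2: weight each training caption y_i by d(u_i, u*)^(-a)."""
    d = np.linalg.norm(train_images - u_star, axis=1)   # distances d(u_i, u*)
    w = d ** (-a)                                       # inverse-distance weights
    return np.sum(w * train_captions) / np.sum(w)       # weighted mean of the y_i

# Toy version of Figure 5: two training photos with two pixels each.
u1, u2 = np.array([0.2, 0.3]), np.array([0.7, 0.8])
y1, y2 = 0.4, 0.9
u_star = np.array([0.6, 0.2])

print(idw_caption(u_star, np.stack([u1, u2]), np.array([y1, y2])))

Choosing a larger exponent a gives closer-by training images an even stronger influence; in the limit the method degenerates into copying the caption of the single nearest training image.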
Some methods in ML are cleanly derived from optimization criteria, with math-
ematical theory backing up the design of algorithms. An example of such well-
understood methods are decision trees, which will be presented first in our course.
Other branches employ learning algorithms that are so complex that one loses
control over what is actually optimized, and one has only little insight into the
geometrical and statistical properties of the models that are delivered by these
methods. This is sadly true for deep learning methods.
There is something like an irreducible arbitrariness of choosing an optimiza-
tion condition, or, if one gives up on fully understanding what is happening, an
arbitrariness in the design of complex learning algorithms. This turns machine
learning into a personal preference thing, even into an art.
To put this all in a nutshell: the available training data do not include the
information of how the information contained in them should be “optimally” ex-
tracted, or what kind of information should be extracted. This could be called a
lack of epistemic information (my private terminology; epistemology is the branch
of philosophy that is concerned with the question by which methods of reasoning
humans acquire knowledge, and with the question what “knowledge” is in the
first place - check out https://en.wikipedia.org/wiki/Epistemology if you
are interested in these deep, ancient, unsolved questions).
The discussion of lacking epistemic information is hardly pursued in today’s
ML, although there have been periods in the last decades when fierce debates on
such issues were waged.
But there is also another lack of information which is a standard theme in
today’s ML textbooks. This issue is called the curse of dimensionality. I will
highlight this curse with our TICS demo again.
In mathematical terminology, our super-simplified 3-dimensional TICS system
from Figure 5 is able (after training) to compute a function f : [0,1]^2 → [0, 1]
which computes a caption f(u∗) for every input test image u∗ ∈ [0,1]^2. The only
information TICS has at learning time is contained in the training data points
(blue crosses in our figure).
Looking at Figure 5, estimating a function f : [0,1]^2 → [0, 1] from just the
information contained in the two blue crosses is clearly a dramatically underdetermined
task. You may argue that the version of the TICS learning task which I
gave in Figure 5 has been simplified to an unfair degree, and that the real TICS
system had not just 2, but 30,000 training images to learn on.
But, in fact, for the full-size TICS the situation is even much, much worse than
what appears in Figure 5:
• The real TICS had about 30,000 training images, but its image data space has 1,440,000 dimensions: there are far fewer data points than dimensions.
• In such a high-dimensional space, these 30,000 training points furthermore lie very far away from each other.
The two conditions together – fewer data points than dimensions, and large
distances between the points – make it appear impossible to estimate a function
on [0,1]^1440000 from just those 30,000 points.
This is the dreaded “curse of dimensionality”.
In plain English, the curse of dimensionality says that in high-dimensional data
spaces, the available training data points will be spread exceedingly thinly in the
embedding data space, and they will lie far away from each other. Metaphorically
speaking, training data points are a few flickering stars in a vast and awfully
empty universe. But, most learning tasks require one to fill the empty spaces with
information — a pdf.
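The "flickering stars in an empty universe" picture can be checked with a few lines of code. The following sketch (my own illustration) draws a fixed number of uniformly random points from [0,1]^d and reports how far, on average, each point lies from its nearest neighbour as the dimension d grows:

import numpy as np

rng = np.random.default_rng(2)
n_points = 500

for d in [2, 10, 100, 1000]:
    x = rng.random((n_points, d))                   # n_points samples in [0,1]^d
    sq = (x * x).sum(axis=1)                        # squared norms |x_i|^2
    # Squared pairwise distances via |x_i|^2 + |x_j|^2 - 2 x_i . x_j
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    np.fill_diagonal(d2, np.inf)                    # ignore the distance to oneself
    nn = np.sqrt(np.maximum(d2, 0.0).min(axis=1))   # nearest-neighbour distances
    print(d, nn.mean())

With 500 points, the mean nearest-neighbour distance is tiny in 2 dimensions but grows to the same order of magnitude as the diameter of the whole data cube once d reaches the hundreds: the data points have become isolated stars.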
After what we have seen above, how is it possible for a human brain to learn
a meaningful and reliable representation of the world (that is, a manifold in brain
state space) at all?
In fact, we do not know. But neuroscientists, psychologists and cognitive
scientists have come up with a number of ideas.
An adult human has had a childhood and youth’s time to learn most of his/her
world model. The “training data” are the sensory impressions collected in, say,
the first 25 years of the human’s life. Thinking of a sensory impression as a vector,
it is hard to say what is the dimension of this vector. A lower bound could be
found in the number of sensory neural fibers reaching the brain. The optical
nerve has about 1,000,000 fibres. Having two of them, plus all sorts of other
sensory fibers reaching the brain (from the nose, ears, body), let us boldly declare
that the dimension of sensor inputs to the brain is 3 million. Now, the learning
process for a human differs from TICS learning in that sensory input arrives in
a continuous stream, not in isolated training images. In order to make the two
scenarios comparable, assume furthermore that a human organizes the continuous
input stream into “snapshots” at a rate of 1 snapshot “picture” of the current
sensory input per second (cognitive psychologists will tell us that the rate is likely
less). In the course of 25 years, with 12 wake hours per day, this makes about 25
× 360 × 12 × 3600 ≈ 390 million “training snapshot images”. This gives a ratio
of number of training points over data dimension of 390 million / 3 million = 130,
which looks so much better than the ratio of 30,000 / 1,443,000 ≈ 0.02 in TICS’s
case.
But even the ratio of 130 data points per dimension is still hopeless, and the
curse of dimensionality strikes just as well. Why?
For simplicity we again assume that the 3-million dimensional sensor image
vectors are normalized to a range of [0, 1], making a sensor impression a point in
[0,1]^3000000. Statisticians, machine learners, and many cognitive scientists would
tell us that the “world model” of a human can be considered to be a probability
distribution over the sensory image space [0,1]^3000000. This is a hypercube
with 2^3000000 corners. In order to estimate a probability distribution over an n-dimensional
hypercube, a statistician would demand to have many more datapoints
than the cube has corners (to see why, think about estimating a probability
distribution over the one-dimensional unit hypercube [0, 1] from data points on
that interval, then extrapolate to higher dimensions). That is, a brain equipped
with ordinary statistical methods would demand to have many more than 2^3000000
training data points to distil a world model from that collection of experiences.
But there are only 390 million such datapoints collected in 25 years. The ratio
390 million / 2^3000000 is about 2^(−2999972). That is, a human would have to have a
youth lifetime of about 25 × 2^2999972 ≈ 2^2999977 years in order to learn a statistically
defendable world model.
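If you want to check these back-of-the-envelope numbers yourself, a few lines suffice (the 25 years, 360 days, 12 waking hours and one snapshot per second are the assumptions made above):

import math

snapshots = 25 * 360 * 12 * 3600     # years x days x waking hours x seconds
print(snapshots)                     # 388800000, i.e. about 390 million

dim = 3_000_000                      # assumed dimension of the sensory input space
# log2 of the ratio (number of snapshots) / (number of hypercube corners 2^dim):
print(math.log2(snapshots) - dim)    # about -2999971.5, i.e. roughly 2^(-2999972)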
Still, the human brain (and the human around it, with the world around the
human around that brain) somehow can do it in 25 years. Cognitive scientists
believe that the key to this magic lies in the evolutionary history of man. Through
millions of years of incremental, evolutionary brain structure optimization, starting
from worm brains or earlier, the human brain is pre-structured in exactly such
ways that it comes with a built-in-by-birth data manifold geometry which reduces
the 3 million-dimensional raw sensor data format to a much lower-dimensional
manifold surface. Then, 390 million data points may be enough to cover this
manifold densely enough for meaningful distribution estimates.
The question of which built-in experience pre-shapings a human brings to the
table at birth has a long tradition in philosophy and psychology. A recent
line of work that brings this tradition to bear on machine learning is in the research
of Joshua Tenenbaum – check out, for instance, Tenenbaum, Griffiths, and Kemp
2006 if you are interested.
In recent years, besides evolutionary predisposition, another line of thinking
has started to be discussed as a solution to the question why humans (and maybe
machines) can learn from embarrassingly small training samples. This is the ap-
proach of causal modeling. It has its origins not in biology or psychology, but in
statistics, more specifically in the technique to model complex probability distri-
butions with graphical models. We will treat them later in this course (Section
10). For now, I can only give an explanation in vague intuitive terms. The core
idea of causal modeling is that in our real world, any phenomenon or experience
is causally influencing only a very small portion of other phenomena or experiences.
When I walk and my foot hits a stone, the stone will be kicked away and
my foot will hurt and I will make an extra step. That’s it. The rest of the universe
remains unaffected, and all of the other zillions of experiences that form my per-
sonal life training dataset simply have nothing to do with this stone hitting event.
When walking through our lives, the huge collection of sensory experiences that we
accumulate does not form an unorganized point cloud in some high-dimensional
space of sensory experiences. In addition to the points themselves, we know which of these data points
are causally connected to which others. To stay in the abstract geometrical pic-
ture: The data points are connected with causal impact lines. Each data point is
causally connected only to a small number of others. These connecting lines span
low-dimensional neighborhoods around each data point: a guide for guessing a
local manifold geometry. — We will be able to formalize this after we have learnt
about graphical models. If you already know a little about them and want to
study ahead: a beautiful introduction to causal modeling is given by Schölkopf,
Locatello, et al. 2021.
ality; the models should be veridical (from Latin, “saying the truth”). The inner
workings of a model should reflect the inner workings of reality. For instance, a de-
tailed chemical model of a reaction, which changes a substance A into a substance
B, should give an account of the kinetic and quantum-mechanical substeps that
actually take place in this reaction. Newton’s laws of motion explain the dance
of planets by formulas made of variables which should correspond to physically
real quantities – like gravitational forces, masses or velocities. Models whose inner
mechanisms or mathematical variables are intended to capture real mechanisms
and real quantities are called analytical models.
Machine learners, in contrast, are application oriented. They want useful mod-
els. ML models of pieces of reality must function well in their respective appli-
cation context. The inner workings of an ML model need not mirror the inner
workings of the modeled system. Machine learning models are thus almost always
blackbox models (a possible exception being Bayesian networks, which will be treated
in Session 9 of this course). A blackbox model of some real-world system just
captures the externally observable input-output behavior of the modeled system,
but it may use any internal mathematical or algorithmical tricks that its designer
can think of. Neural networks are a striking example. A neural network trained
for predicting the next day’s stock index will be made of hundreds or even mil-
lions of interacting variables (“neuron activations”) which have no corresponding
counterpart in the reality of stock markets.
Figure 6 sketches how blackbox models work. They are derived (“learnt”,
“estimated”) from data emitted by the source system, and they should generate
synthetic data that “fit” (have a similar distribution as) the real-world data. And
that’s it. The structure of the blackbox model need not agree in any way with the
structure of the source system – in Figure 6, this is a robot, while the model is
a neural network whose structure has no correspondence whatsoever in the robot
design.
Analytical and blackbox models have complementary merits:
• A blackbox model requires training data. There are cases where such data
are not available in sufficient quantity. Then analytical modeling is the only
way to go.
• As a rule, analytical models are much more compact than blackbox models.
Compare E = mc^2 with the TICS system, which consists of several modules,
some of which are neural networks with hundreds of thousands of variables.
• When the target system that one wants to model is very complex, there is
typically no chance for an analytical model. Example: an analytical model
Figure 6: How blackbox models work. For explanation see text.
• The great – really very great – advantage of analytical models is that they
generalize to all situations within their scope. The laws of gravitation can be
applied to falling apples as well as to the majestic whirling of galaxies. If a
blackbox model had been trained by Newton on data collected from
the motions of the sun and the planets, this model would be exclusively applicable to
our planetary system, not to apples and not to galaxies.
An interesting and very relevant modeling task is to model the earth’s atmo-
sphere for weather forecasting. This modeling problem has been intensely worked
on for many decades, and an interesting mix of analytical and blackbox methods
marks the state of the art. If there is time and interest I will expand on this in
the tutorial session.
I conclude this section with a tale from my professional life, which nicely illus-
trates the difference between analytical and blackbox modeling. I was once called
to consult a company in the chemical industry. They wanted a machine learning
solution for the following problem. In one of their factories they had built a pro-
duction line whose output was a certain artificial resin. Imagine a large hall full
of vessels, heaters, pumps and valves and tubes. The production process of that
resin was prone to a particularly nasty possible failure: if the process was not
controlled correctly, some intermediate products might solidify. The affected
vessels, tubes and valves would be blocked and could not be cleared – necessitating
disassembly of the facility and replacement of the congested parts with new ones. Very
expensive! The chemical engineers had heard of the magic of neural networks
and wanted one of these, which should give them early warnings if the production
process was drifting toward the danger zone. I told them that this was
(maybe) possible if they could provide training data. What training data, please?
Well, in order to predict failure, a neural network needs examples of this failure.
So, could the engineers please run the facility through a reasonably large number
of solidification accidents, a few hundred maybe, for good statistics? Obviously,
that was that. Only analytical modeling would do here. A good analytical model
would be able to predict any kind of imminent solidification situation. But that
wasn’t an option either because the entire production process was too complex for
an accurate enough analytical model. Now put yourself in the shoes of the responsible
chief engineer. What should he/she do to prevent the dreaded solidification
from ever happening? Another nice discussion item for our tutorial session.
it very soberly as the craft of estimating complex probability distributions with
efficient algorithms and powerful computers.
My personal map of the ML landscape divides it into four main segments with
distinct academic communities, research goals and methods:
Segment 1: Theoretical ML. Here one asks what are the fundamental possi-
bilities and limitations of inferring knowledge from observation data. This
is the most abstract and “pure maths” strand of ML. There are cross-
connections to the theory of computational complexity. Practical applicabil-
ity of results and efficient algorithms are secondary. Check out https://en.
wikipedia.org/wiki/Computational_learning_theory and https://en.
wikipedia.org/wiki/Statistical_learning_theory for an impression of
this line of research.
Segment 2: Symbolic-logic learning, data mining. Here the goal is to infer
symbolic knowledge from data, to extract logical rules from data, to infer
facts about the real world expressed in fragments of first-order logic or other
logic formalisms, often enriched with probabilistic information. Neural net-
works are rarely used. A main motive is that the resulting models should be
human-understandable and directly useful for human end-users. Key terms
are “knowledge discovery”, “data mining”, or “automated knowledge base
construction”. This is the area of my brother’s research. Check out https:
//en.wikipedia.org/wiki/Data_mining or https://en.wikipedia.org/
wiki/Inductive_logic_programming or Suchanek et al. 2013 for getting
the flavor. This is an application-driven field, with applications e.g. in bioin-
formatics, drug discovery, web mining, document analysis, decision support
systems.
A beautiful example is the PaleoDeepDive project described in Peters et al.
2014. This large-scale project aimed at making paleontological knowledge
easily searchable and more reliable. Paleontology is the science of extinct
animal (and plant) species. Its raw data are fossil bones. It is obviously difficult
to reliably classify a handful of freshly excavated bones as belonging
to a particular species – first, because one usually doesn’t dig out a com-
plete skeleton, and second because extinct species are not known in the first
place. The field is plagued by misclassifications and terminological uncer-
tainties – often a newly found set of bones is believed to belong to a newly
discovered species, for which a new name is created, although in reality other
fossil findings, already named differently, belong to the same species. In the
PaleoDeepDive project, the web was crawled to retrieve virtually all scien-
tific pdf documents relating to paleontology – including documents that had
been published in pre-digital times and were just image scans. Using optical
character recognition and image analysis methods at the front end, these
documents were made machine readable, including information contained in
tables and images. Then, unsupervised, logic-based methods were used to
identify suspects for double naming of the same species, and also the oppo-
site: single names for distinct species – an important contribution to purge
the evolutionary tree of the animal kingdom.
Segment 3: Signal and pattern modeling. This is the most diverse sector in
my private partition of ML and it is difficult to characterize it globally. The
basic attitude here is one of quantitative-numerical blackbox modeling. Our
TICS demo would go here. The raw data are mostly numerical (like physical
measurement timeseries, audio signals, images and video). When they are
symbolic (texts in particular), one of the first processing steps typically en-
codes symbols to some numerical vector format. Neural networks are widely
used and there are some connections to computational neuroscience. The
general goal is to distil from raw data a numerical representation (often
implicit) of the data distribution which lends itself to efficient application
purposes, like pattern classification, time series prediction, or motor control,
to name a few. Human-user interpretability of the distribution representa-
tion is not easy to attain, but has lately become an important subject of
research. Like Segment 2, this field is decidedly application-driven. Under
the catchword “deep learning” a subfield of this area has recently received a
lot of attention.
It is hard to judge how “big” these four segments are in mutual comparison.
Surely Segment 1 receives much less funding and is pursued by substantially fewer
researchers than segments 2 and 3. In this material sense, segments 2 and 3 are
both “big”. Segment 4 is bigger than Segment 1 but smaller than 2 or 3. My own
research lies in 3 and 4. In this course I focus on the third segment — you should
be aware that you only get a partial glimpse of ML.
A common subdivision of ML, partly orthogonal to my private 4-section par-
tition, is based on three fundamental kinds of learning tasks:
Supervised learning. Training data are “labelled pairs” (xn , yn ), where x is
some kind of “input” and y is some kind of “target output” or “desired /
correct output”. TICS is a typical example, where the xn are images and
the yn are captions. The learning objective is to obtain a mechanism which,
when fed with new test inputs xtest, returns outputs ytest that generalize in
a meaningful way from the training sample. The underlying mathematical
task is to estimate the conditional distributions P_{Y|X} from the training sample
(check the end of Appendix B for a brief explanation of this notation).
The learnt input-output mechanism is “good” to the extent that upon input
xtest it generates outputs that are distributed according to the true conditional
distribution P(Y = y | X = xtest), just as we have seen in the TICS
demo. Typical and important special cases of supervised learning are pattern
classification (the y are correct class labels for the input patterns x)
or timeseries prediction (the y are correct continuations of initial timeseries
x). Segment 3 from my private segmentation of ML is the typical stage for
supervised learning.
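As a concrete (if trivial) illustration of this setting, here is a sketch in Python with scikit-learn; the data and the labelling rule are made up, and a decision tree classifier is used only because decision trees will be treated in Section 2:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)

# Training sample (x_n, y_n): 200 two-dimensional inputs with class labels 0/1.
X_train = rng.random((200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 1.0).astype(int)   # made-up labelling rule

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# New test inputs x_test: the trained mechanism returns outputs y_test that
# should generalize the labelling seen in the training sample.
X_test = rng.random((5, 2))
print(model.predict(X_test))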
Unsupervised learning. Training data are just data points xn . The task is to
discover some kind of “structure” (regularities, symmetries, redundancies...)
in the data distribution which can be used to create a compressed repre-
sentation of the data. Unsupervised learning can become very challenging
when data points are high-dimensional and/or when the distribution has a
complex shape. Unsupervised learning is often used for dimension reduction.
The result of an unsupervised learning process is a dimension-reducing, en-
coding function e which takes high-dimensional data points x as inputs and
returns low-dimensional encodings e(x). This encoding should preserve most
of the information contained in the original inputs x. That is, there should
also exist a decoding function d which takes encodings e(x) as inputs and
transforms them back to the high-dimensional format of x. The overall loss
in the encoding-decoding process should be small, that is, one wishes to
obtain x ≈ d(e(x)). A discovery of underlying rules and regularities is the
typical goal for data mining applications, hence unsupervised learning is the
main mode for Segment 2 from my private dissection of ML. Unsupervised
methods are often used for data preprocessing in other ML scenarios, be-
cause most ML techniques suffer from curse of dimensionality effects and
work better with dimension-reduced input data.
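A standard way to obtain such an encoding/decoding pair is principal component analysis (PCA), which yields linear functions e and d with x ≈ d(e(x)). A minimal sketch with made-up data (not tied to any example from the text):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

# Made-up "high-dimensional" data that really only varies along 5 directions.
latent = rng.standard_normal((500, 5))
X = latent @ rng.standard_normal((5, 100)) + 0.01 * rng.standard_normal((500, 100))

pca = PCA(n_components=5).fit(X)
e = pca.transform            # encoding function: R^100 -> R^5
d = pca.inverse_transform    # decoding function: R^5  -> R^100

reconstruction = d(e(X))
print(np.abs(X - reconstruction).mean())   # small, i.e. x ≈ d(e(x))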
Reinforcement learning. The set-up for reinforcement learning (RL) is quite
distinct from the above two. It is always related to an agent that can choose
between different actions which in turn change the state of the environment
the agent is in, and furthermore the agent may or may not receive rewards
in certain environment states. RL thus involves at least the following three
types of random variables:
• action random variables A,
• world state random variables S,
• reward random variables R.
The learning goal is to find a rule for choosing actions which maximizes the reward accumulated over time;
that is, the next action is chosen on the basis of the “lifetime experience” of
previous actions and the resulting world states. RL is naturally connected to
my Segment 4. Furthermore there are strong ties to neuroscience, because
neuroscientists have reason to believe that individual neurons in a brain can
adapt their functioning on the basis of neural or hormonal reward signals.
Last but not least, RL has intimate mathematical connections to a classical
subfield of control engineering called optimal control, where the (engineer-
ing) objective is to steer some system in a way that some long-term objective
is optimized. An advanced textbook example is to steer an interplanetary
missile from earth to some other planet such that fuel consumption is min-
imized. Actions here are navigation maneuvers, the (negative) reward is
fuel consumption, the world state is the missile’s position and velocity in
interplanetary space.
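To make the roles of A, S and R concrete, here is a bare-bones sketch of the agent-environment interaction loop in a made-up one-dimensional world; the agent follows a fixed random policy rather than a learnt one, which is exactly what a real RL algorithm would improve on the basis of the collected rewards:

import random

random.seed(0)

state = 0                      # world state S: a position on a line from -5 to 5
history = []                   # the agent's "lifetime experience"

for t in range(20):
    action = random.choice([-1, +1])          # action A: step left or right
    state = max(-5, min(5, state + action))   # the action changes the world state
    reward = 1.0 if state == 5 else 0.0       # reward R: only the right end pays off
    history.append((action, state, reward))

print(sum(r for _, _, r in history))          # total reward earned by this policy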
are high-dimensional; and an agent will be more capable of choosing good actions
if it possesses an input-output model (= supervised learning) of the environment
— inputs are actions, outputs are next states. Deep reinforcement learning is a
flourishing branch of deep learning where learning agents encode some of their
knowledge in deep neural networks.
Today machine learning is often identified with deep learning — at least, in the
general public perception. Deep learning is (almost) exclusively based on neural
networks as “learning machines”. In our course we will treat neural networks only
in one of thirteen sections, and only on a very basic level. I hope that the other
sections reveal ML as a field that is much broader than deep learning. I also want
to bring to the front a number of general insights and challenges that are the same
in deep learning and other branches of ML. If you are particularly interested in
deep learning, I must refer you to the specialization course on deep learning in
our Master program, or to textbooks. There are two deep learning textbooks,
both open-access and online, which I want to give a special recommendation. The
first is the original deep learning “bible” which was written by some of the very
pioneers of the field in 2016, when the basic methods of deep learning had been
established, had matured, and had become famous and widely used. It’s the book
Deep Learning by I. Goodfellow, Bengio, and Courville 2016. The second is the
book Dive into Deep Learning by Zhang et al. 2021, which, like the 2016 bible,
gives a comprehensive introduction to the basics of deep learning, but furthermore
an extensive overview of the current state-of-the-art architectures and algorithms,
with detailed instructions concerning their efficient implementation. Since the
field is developing very fast, the 2021 book has grown to 1000+ pages, 220 pages
more than the 2016 book. Both books include extended introductions to the
mathematical methods that underpin deep learning.
2 Decision trees and random forests
This section describes a class of machine learning models that is classical and
simple and intuitive and useful. Decision trees and random forests are not super
fashionable in these deep learning times, but practitioners in data analysis use them
on a daily basis. The main inventions in this field have been made around 1980-
2000. In this chapter I rely heavily on the decision tree chapters in the classical
ML textbooks of Mitchell 1997 and Duda, P. E. Hart, and Stork 2001, and for
the random forest part my source is the landmark paper by Breiman 2001 which,
as of Nov 15, 2021, has been cited 81,000 times (Google Scholar). Good stuff to
know, apparently.
Note: In this chapter of the lecture notes I will largely adhere to the notation
used in Breiman 2001 in order to make it easier for you to read that key paper
if you want to dig deeper. If in your professional future you want to use decision
tree methods, you will invariably use them in random forests, and in order to
understand what you are actually doing, you will have to read Breiman 2001 (like
hundreds of thousands before you). Unfortunately, the notation used by Breiman is
inconsistent, which makes the mathematical part of that paper hard to understand.
My hunch is that most readers skipped the two mathy sections and read only the
experimental sections with the concrete helpful hints for algorithm design, and
with gorgeous demo results. Inconsistent or even plainly incorrect mathematical
notation (and mathematical thinking) happens a lot in “engineering applied math”
papers and makes it difficult to really understand what the author wants to say (see
my ramblings in my lecture notes on “Principles of Statistical Modeling”, Chapter
14, “A note on what you find in textbooks”, online at https://www.ai.rug.nl/
minds/uploads/LN_PSM.pdf). Therefore I will not use Breiman’s notation one-to-one,
but modify it a little to make it mathematically more consistent.
• Start at the root node of the tree. The root node is labeled with a property
that fruit have (color? is the root property used in the figure).
• Underneath the root node you find child nodes, one for each of the three
possible color attributes green, yellow, red. Decide which color your fruit has
and proceed to that child node. Let us assume that your fruit was yellow.
Then you are now at the child node labelled shape?.
• Continue this game of moving downwards in the tree according to the direc-
tion decisions taken at each node according to the attributes of the fruit in
your hand. If you reach a leaf node of this tree, it will be labeled with the
type of your fruit. If your fruit is yellow, round and small, it’s a lemon!
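Computationally, the traversal just described is nothing more than a walk through nested attribute tests. Here is a minimal sketch with a made-up tree fragment along the lines of the fruit example (the actual tree in Figure 7 may differ):

# A decision tree as nested dictionaries: inner nodes ask for a property,
# leaves name a fruit class. (Made-up fragment in the spirit of Figure 7.)
tree = {
    "property": "color",
    "children": {
        "green": {"class": "watermelon"},
        "red":   {"class": "apple"},
        "yellow": {
            "property": "shape",
            "children": {
                "thin": {"class": "banana"},
                "round": {
                    "property": "size",
                    "children": {
                        "big":   {"class": "grapefruit"},
                        "small": {"class": "lemon"},
                    },
                },
            },
        },
    },
}

def classify(tree, attributes):
    """Walk from the root to a leaf, following the fruit's attribute values."""
    node = tree
    while "class" not in node:
        node = node["children"][attributes[node["property"]]]
    return node["class"]

print(classify(tree, {"color": "yellow", "shape": "round", "size": "small"}))  # lemon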
Figure 8: A binary version of the decision tree shown in Figure 7. A tree with arbitrary branching factor at different nodes can always be represented by a functionally equivalent binary tree, that is, one having branching factor B = 2 throughout; by convention the “yes” branch is on the left, the “no” branch on the right. The binary tree contains the same information and implements the same classification as the original tree. Taken from Duda, P. E. Hart, and Stork 2001 (their Figure 8.2; copyright 2001 by John Wiley & Sons, Inc.).
Here we get a first inkling that decision tree learning might not be as trivial as
the final result in Figure 7 makes it appear. The above fruit training data table has
missing values (marked by “?”); not all properties from the training data are used
in the decision tree (property “Weight” is ignored); some examples of the same
class have different attribute vectors (first two rows give different characterizations
of watermelons); some identical attribute vectors have different classes (rows 3
and 4). In summary: real-world training data will be partly redundant, partly
inconsistent, and will contain errors and gaps. All of this points in the same
direction: statistical analyses will be needed to learn decision trees.
Figure 9: Another decision tree. The right image shows a page from this botany
field guide. What you see on this page is, in fact, a small section of a large binary
classification tree. Image source: iberlibro.com, booklooker.de.
this course: in a rigorous “true math” version, and in an intuitive “makes some
sense” version.
I called the fruit attribute data table above a sample. Samples are mathemat-
ical objects of key importance in statistics and machine learning (where they are
also called “training data”). Samples are always connected with random variables
(RVs). Here is how.
First, the intuitive version. As an empirical fruit scientist, you would obtain a
“random draw” to get the training data table in a very concrete way: you would
go to the fruit market, collect 3000 fruit “at random”, observe and note down
their color, size, taste, weight and shape attributes in an Excel table, and for each
of the 3000 fruit you also ask the fruit vendor for the name of the fruit to get an
almost but not quite surely correct class label which you also note down in the
last column of the table.
The mathematical representation of observing and noting down the attributes
of the i-th fruit that you have picked (where i = 1, . . . , 3000) is Xi (ω). Xi is
the random variable which is the mathematical model of carrying out the i-th
observation that you do on a fruit market. This Xi could be described as the
procedure “pick a random fruit and observe and report the attribute vector”. The
ω in Xi (ω) is the occasion when you actually executed the procedure Xi – say,
ω stands for the concrete data collection event when you went to the Vismarkt
in Groningen last Tuesday and visited the fruit stands. If the entire procedure of
collecting 3000 fruit specimen is executed on another occasion – for instance, a
week earlier, or by your friend on the same day – this would be mathematically
represented by another ω. For instance, Xi(ω) might be the attribute vector that
you observed for the i-th fruit last Tuesday, Xi(ω′) would be the attribute vector
that you observed for the i-th fruit one week earlier, Xi(ω″) would be the attribute
vector that your friend observed for the i-th fruit last Tuesday when he did the whole
thing in parallel with you, etc. In mathematical terminology, these “observation
occasions” ω are called elementary events.
Similarly, Yi(ω) is the fruit class name that you were told by the vendor when
you did the data sampling last Tuesday; Yi(ω′) would be the i-th fruit name you
were told when you did the whole exercise a week earlier, etc.
The three thousand pairs (Xi (ω), Yi (ω)) correspond to the rows in the data
table. I will call each of these data (row) vectors (Xi (ω), Yi (ω)) a data point. Data
base people and practical field workers will call these vectors records. To get the
entire table as a single mathematical object – namely, the sample – one combines
these single fruit data points by a product operation, obtaining (X(ω), Y(ω)),
where X = X1 ⊗ . . . ⊗ XN and Y = Y1 ⊗ . . . ⊗ YN. Products of random variables are
used abundantly in probability — you find them explained in a little more detail
in Appendix A.
And here is the rigorous probability theory account of the (X(ω), Y(ω)) nota-
tion. (This material is not mandatory and will not be asked in exercises or exams
— it is just a glimpse of the raw beauty of mathematics, for the ones who are
interested. The sketch I can give here is super condensed. I tell the same story,
with many detailed examples and explanations, in my lecture notes on 100 pages.)
As always in probability theory and statistics, we have an underlying probability
space (Ω, A, P ). In this structure, Ω is the set of all possible elementary events
ω ∈ Ω. A is a subset structure imposed on the set Ω called a σ-field, and P
is a probability measure on (Ω, A). The random variable X is a function which
returns samples, that is, collections of N training data points. The expression
X(ω) is a function call: the argument ω is (intuitively speaking) an observation
event, and the result X(ω) of this function call is a sample. We assume that all
samples which could be drawn have N data points; this assumption is a matter
of convenience and simplifies notation and mathematical analysis a lot. – It is a
good exercise to formalize the structure of a sample in more detail. A single data
point consists of an attribute vector x and a class label y. In formal notation, we
have m properties Q1 , . . . , Qm , and property Qj (where j ∈ {1, . . . , m}) has a set
Aj of possible attribute values. Thus, x ∈ A1 × . . . × Am . We denote the set of
possible class labels by C. A data point (Xi (ω), Yi (ω)) is thus an element of the
set (A1 × . . . × Am × C). See Appendix A for the mathematical notation used
in creating product data structures. With SZ we denote the sample space of any
random variable Z. The sample space for Xi is thus SXi = A1 × . . . × Am and for
Yi it is SYi = C. Because SXi = SXj , SYi = SYj for all 1 ≤ i, j ≤ N , for simplicity
we also write SX for A1 × . . . × Am and SY for C.
The entire training data table is an element of the sample space SX⊗Y of the
product RV X⊗Y = (⊗i=1,...,N Xi )⊗(⊗i=1,...,N Yi ). Just to exercise formalization
skills: do you see that SX⊗Y = (A1 × . . . × Am )N × C N ?
To round off our rehearsal of elementary concepts and notation, I repeat the
basic connections between random variables, probability spaces and sample spaces:
a random variable Z always comes with an underlying probability space (Ω, A, P )
and a sample space SZ . The RV is a function Z : Ω → SZ , and induces a probability
distribution on SZ . This distribution is denoted by PZ .
I torture you with these exercises in notation like a Russian piano teacher will
torture his students with technical finger-exercising études. It is technical and
mechanical and the suffering student may not happily appreciate the necessity of
it all, – and yet this torture is a precondition for becoming a virtuoso. The music
of machine learning is played in tunes of probability, no escape.
I am aware that, if you did not know these probability concepts before, this
condensed rehearsal cannot possibly be understandable. Probability theory is one
of the most difficult sectors of mathematics to learn, and it takes weeks of digesting
and exercising to embrace the concepts of probability spaces, random variables and
sample spaces (not to speak of σ-fields). Besides that I want to recommend again
my lecture notes “Principles of Statistical Modeling” which I mentioned before.
They come from a graduate course whose purpose was to give a slow, detailed,
understandable, yet fully correct introduction to the concepts of probability theory
and statistics.
Two things must be given before a learning algorithm can be assembled and
learning can start: (A) the training data, and (B) an optimization criterion. While
(A) is obvious, (B) deserves a few words of explanation.
We observe that decision tree learning is a case of supervised learning. The
training data is a collection of labeled samples (xi , yi )i=1,...,N = (Xi (ω), Yi (ω))i=1,...,N ,
where xi ∈ A1 × . . . × Am = SX is an attribute vector and yi ∈ C = SY is a class
label. A decision tree represents a function h : SX → SY in an obvious way: using
the sequential decision procedure outlined in Section 2.1, a vector of observed at-
tributes leads you from the root node to some leaf, and that leaf gives a class label.
In statistics, such functions h : SX → SY are generally called decision functions
(also in other supervised learning settings that do not involve decision trees), a
term that I will also sometimes use.
Decision tree learning is an instance of the general case of supervised learning:
the training data are labeled pairs (xi , yi )i=1,...,N , with xi ∈ SX , yi ∈ SY , and the
learning aims at finding a decision function h : SX → SY which is “optimal” in
some sense.
In what sense can such a function h : SX → SY (in this section: such a decision
tree) be “optimal”? How can one quantify the “goodness” of functions of the type
h : SX → SY ?
This is a very non-trivial question as we will learn to appreciate as the course
goes on. For supervised learning tasks, the key to optimizing a learning procedure
is to declare a loss function at the very outset of a learning project. A loss function
is a function
L : SY × SY → R≥0 , (1)
which assigns a nonnegative real number to a pair of class labels. The loss function
is used to compare the classification h(x), returned by a decision function h on
input x, with the correct class label y. After having learnt a decision function h from a
training sample (xi , yi )i=1,...,N , one can quantify the performance of h averaged over
the training data, obtaining a quantity Remp that is called the empirical risk :
R_{\mathrm{emp}}(h) = \frac{1}{N} \sum_{i=1,\ldots,N} L(h(x_i), y_i). \qquad (2)
Most supervised machine learning procedures are built around some optimiza-
tion procedure which searches for decision functions that minimize the empirical
risk, that is, which try to give a small average loss on the training data.
I emphasize that minimizing the empirical loss (or in another common wording,
minimizing the “training error”) by a clever learning algorithm will usually not
lead to a very useful decision function. This is due to the problem of overfitting.
“Overfitting” means, intuitively speaking, that if the learning algorithm tries ev-
erything to be good on the training data (= small empirical risk), it will attempt
to minimize the loss individually for each training data point. This means, in some
way, to learn the training data “by heart” – encode an exact memory of the train-
ing data points in the learnt decision function h. This is not a good idea because
later test data points will be different, and the decision function obtained by “rote
learning” will not know how to generalize to the new data points. The ultimate
goal of supervised machine learning is not to minimize the empirical loss (that
is, to have small loss on the training data), but to minimize the loss on future
test data which were not available for training. Thus, the central optimization
criterion for supervised learning methods is to find decision functions h that have
a small risk
R(h) = E[L(h(X), Y )], (3)
where E is the statistical expectation operator (see Appendix D). That is, a good
decision function h should incur a small expected loss – it should have small “testing
error”. At a later point in this course we will analyse the problem of avoiding
overfitting in more detail. For now it is enough to be aware that overfitting is a
very serious threat, and that in order to avoid it one should not allow the models
h to dress themselves too closely around the training data points.
For a classification task with a finite number of class labels, like in our fruit
example, a natural loss function is one that simply counts misclassifications:
L^{\mathrm{count}}(h(x), y) = \begin{cases} 0, & \text{if } h(x) = y \\ 1, & \text{if } h(x) \neq y \end{cases} \qquad (4)
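To make these definitions concrete, here is a minimal Python sketch of the counting loss (4) and the empirical risk (2); the function names are mine, chosen only for this illustration.

    def counting_loss(y_pred, y_true):
        # counting loss (4): 0 for a correct classification, 1 for a misclassification
        return 0.0 if y_pred == y_true else 1.0

    def empirical_risk(h, xs, ys, loss=counting_loss):
        # empirical risk (2): average loss of the decision function h
        # over the N training data points (xs[i], ys[i])
        return sum(loss(h(x), y) for x, y in zip(xs, ys)) / len(xs)
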
This counting loss is often a natural choice, but in many situations it is not
appropriate. Consider, for instance, medical diagnostic decision making. Assume
you visit a doctor with some vague complaints. Now compare two scenarios.
• Scenario 1: after doing his diagnostics the doctor says, “sorry to tell you but
you have cancer and you should think about making your will” – and this is
a wrong diagnosis; in fact you are quite healthy.
• Scenario 2: after doing his diagnostics the doctor says, “good news, old boy:
there’s nothing wrong with you except you ate too much yesterday” – and
this is a wrong diagnosis; in fact you have intestinal cancer.
These two errors are called the “false positive” and the “false negative” deci-
sions. A simple counting loss would optimize medical decision making such that
the average number of any sort of error is minimized, regardless of whether it is a false
negative or a false positive. But in medicine, false negatives should be avoided as
much as possible because their consequences can be lethal, whereas false positives
will only cause a passing anxiety and inconvenience. Accordingly, a loss function
used to optimize medical decision making should put a higher penalty (= larger
L values) on false negatives than on false positives.
Generally, and specifically so in operations research and decision support sys-
tems, loss functions can become quite involved, including a careful balancing of
ethical, financial or other factors. However, we will not consider complex loss func-
tions here and stick to the simple counting loss. This loss is the one that guided
the design of the classical decision tree learning algorithms. And now, finally, we
have set the stage for actually discussing learning algorithms for decision trees!
2.4 Learning decision trees: the core algorithm
OK, let us get concrete. We are given training data (xi , yi )i=1,...,N (in real life
often an Excel table), and – opting for the counting loss – we want to find an
optimal (more realistically: a rather good) decision tree hopt which will minimize
misclassifications on new “test” data.
In order to distil this hopt from the training data, we need to set up a learning
algorithm which, on input (xi , yi )i=1,...,N , outputs hopt .
All known learning algorithms for decision trees incrementally build hopt from
the root node downwards. The first thing a decision tree learning algorithm
(DTLA) must do is therefore to decide which property is queried first, making
it the root node.
To understand the following procedures, notice that a decision tree iteratively
splits the training dataset into increasingly smaller, disjoint subsets. The root node
can be associated with the entire training dataset – call it D. If the root node νroot
queries the property Qj and this property has kj attributes a^j_1 , . . . , a^j_{kj} , there will
be kj child nodes ν1 , . . . , νkj , where node νl is associated with all training data points
that have attribute a^j_l ; and so forth down the tree. In detail: if νl1 l2 ···lr is a tree
node at level r (counting from the root), and this node is associated with the
subset Dl1 l2 ···lr of the training data, and this node queries property Qu which has
attributes a^u_1 , . . . , a^u_{ku} , then the child node νl1 l2 ···lr s is associated with that subset
of Dl1 l2 ···lr which contains those training data points that have attribute a^u_s of property Qu .
The classical solution to this problem of selecting a property Q for the root
node is to choose that property which leads to a “maximally informative” split of
the training dataset in the first child node level. Intuitively, the data point subsets
associated with the child nodes should be as “pure” as possible with respect to
the classes c ∈ C.
In the best of all cases, if there are q different classes (that is, |C| = q), there
would be a property Qsupermagic with q attributes which already uniquely identify
classes, such that each first-level child node is already associated with a “pure”
set of training examples all from the same class – say, the first child node covers
only apples, the second only bananas, etc.
This will usually not be possible, among other reasons because normally there
is no property with exactly as many attributes as there are classes. Thus, “purity”
of data point sets associated with child nodes needs to be measured in a way that
tolerates class mixtures in each node. The measure that is traditionally invoked
in this situation comes from information theory. It is the entropy of the class
distribution within a child node. If Dl is the set of training points associated with
a child node νl , and nli is the number of data points in Dl that are from class i
(where i = 1, . . . , q), and the total size of Dl is nl , then the entropy Sl of the “class
mixture” in Dl is given by
S_l = - \sum_{i=1,\ldots,q} \frac{n^l_i}{n^l} \log_2 \frac{n^l_i}{n^l} . \qquad (5)
If in this sum a term n^l_i happens to be zero, by convention the product
(n^l_i /n^l ) log2 (n^l_i /n^l ) is set to zero, too. I quickly rehearse two properties of entropy which
are relevant here:
• The more “mixed” the set Dl , the higher the entropy Sl . In one extreme case,
Dl is 100% clean, that is, it contains only data points of a single class. Then
Sl = 0. The other extreme is the greatest possible degree of mixing, which
occurs when there is an equal number of data points from each class in Dl .
Then Sl attains its maximal possible value of −q (1/q) log2 (1/q) = log2 q.
When Sl is zero, the set Dl is “pure” — it contains examples of only one class.
The larger Sl , the less pure Dl . The entropy measure Sl is often called an impurity
measure.
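As a small illustration (my own sketch in Python, not part of the classical algorithm descriptions), the entropy impurity of a set of class labels can be computed like this:

    import math
    from collections import Counter

    def entropy_impurity(labels):
        # labels: class labels of the data points associated with a node
        n = len(labels)
        counts = Counter(labels)            # class counts n_i
        # classes with n_i = 0 do not appear in the Counter, which implements
        # the convention that a 0 * log2(0) term counts as 0
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    # entropy_impurity(["apple"] * 10)         -> 0.0   (pure set)
    # entropy_impurity(["apple", "pear"] * 5)  -> 1.0   (= log2 2, maximally mixed)
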
The entropy of the root node is
S_{\mathrm{root}} = - \sum_{i=1,\ldots,q} \frac{n^{\mathrm{root}}_i}{N} \log_2 \frac{n^{\mathrm{root}}_i}{N} ,
where n^root_i is the number of examples of class i in the total training data set D
and N is the number of training data points. Following the terminology of Duda,
P. E. Hart, and Stork 2001, I will denote the impurity measure of the root by
ientropy (νroot ) := Sroot .
Generally, if ν is some node in the tree, and this node is associated with the
training data point set Dν , and |Dν | = n, and the q classes are represented in Dν
by subsets of size n1 , . . . , nq , the entropy impurity of ν is given by
i_{\mathrm{entropy}}(\nu) = - \sum_{i=1,\ldots,q} \frac{n_i}{n} \log_2 \frac{n_i}{n} . \qquad (6)
If the root node νroot queries the property Qj and this property has k attributes,
then the mixing of classes averaged over all child nodes ν1 , . . . , νk is given by
\sum_{l=1,\ldots,k} \frac{|D_l|}{N} \, i_{\mathrm{entropy}}(\nu_l),
and the information gain obtained by querying Qj at the root is the reduction in entropy impurity,
\Delta i_{\mathrm{entropy}}(\nu_{\mathrm{root}}, Q_j) = i_{\mathrm{entropy}}(\nu_{\mathrm{root}}) - \sum_{l=1,\ldots,k} \frac{|D_l|}{N} \, i_{\mathrm{entropy}}(\nu_l). \qquad (7)
It can be shown that this quantity is always nonnegative. It is maximal when
all child nodes have pure data point sets associated with them.
For any other node ν labeled with property Q, where Q has k attributes, and
the size of the set associated with ν is n and the sizes of the sets associated with
the k child nodes ν1 , . . . , νk are n1 , . . . , nk , the information gain for node ν is
\Delta i_{\mathrm{entropy}}(\nu, Q) = i_{\mathrm{entropy}}(\nu) - \sum_{l=1,\ldots,k} \frac{n_l}{n} \, i_{\mathrm{entropy}}(\nu_l). \qquad (8)
The procedure to choose a property for the root node is to compute the infor-
mation gain for all properties and select the one which maximizes the information
gain.
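A hedged sketch of this selection step (the helper names are mine; it reuses the entropy_impurity function from above):

    def information_gain(xs, ys, j):
        # xs: attribute vectors, ys: class labels, j: index of the candidate property
        n = len(ys)
        groups = {}
        for x, y in zip(xs, ys):
            groups.setdefault(x[j], []).append(y)   # split labels by attribute value
        avg_child = sum(len(g) / n * entropy_impurity(g) for g in groups.values())
        return entropy_impurity(ys) - avg_child     # formulas (7)/(8)

    def choose_query_property(xs, ys, candidate_properties):
        # pick the property index with maximal information gain
        return max(candidate_properties, key=lambda j: information_gain(xs, ys, j))
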
This procedure of choosing that query property which leads to the greatest
information gain is repeated tree-downwards as the tree is grown by the DTLA.
A node is not further expanded and thus becomes a leaf node if (i) either the
training dataset associated with this node is 100% pure (contains only examples
from a single class), or if (ii) one has reached level m, that is, all available
properties have been queried on the path from the root to that node. In case (i)
the leaf is labeled with the unique class of its associated data point set. In case
(ii) it is labeled with the class that has the largest number of representatives in
the associated data point set.
Information gain is the most popular and historically first criterion used for
determining the query property of a node. But other criteria are used too. I
mention three.
The first is a normalized version of the information gain. The motivation is
that the information gain criterion (8) favours properties that have more attributes
over properties with fewer attributes. This requires a little explanation. Generally
speaking, properties with many attributes statistically tend to lead to purer data
point sets in the children nodes than properties with only few attributes. To see
this, compare the extreme case of a property with only a single attribute, which
will lead to a zero information gain, with the extreme case of a property that has
many more attributes than there are data points in the training dataset, which
will (statistically) lead to many children node data point sets which contain only
a single example, that is, they are 100% pure. A split of the data point set Dν
into a large number of very small subsets is undesirable because it is a door-
opener for overfitting. This preference for properties with more attributes can be
compensated if the information gain is normalized by the entropy of the data set
splitting, leading to the information gain ratio criterion, which here is given for
the root node:
\Delta_{\mathrm{ratio}} i_{\mathrm{entropy}} = \frac{\Delta i_{\mathrm{entropy}}(\nu_{\mathrm{root}}, Q_j)}{- \sum_{l=1,\ldots,k} \frac{|D_l|}{N} \log_2 \frac{|D_l|}{N}} . \qquad (9)
The second alternative that I mention measures the impurity of a node by its
Gini impurity, which for a node ν associated with set Dν , where |Dν | = n and the
subsets of points in Dν of classes 1, . . . , q have sizes n1 , . . . , nq , is
i_{\mathrm{Gini}}(\nu) = \sum_{1 \le i, j \le q;\, i \ne j} \frac{n_i}{n} \frac{n_j}{n} = 1 - \sum_{1 \le i \le q} \left( \frac{n_i}{n} \right)^2 ,
which is the error rate when a category decision for any example point in Dν is
made randomly according to the class distribution within Dν .
The third alternative is called the misclassification impurity in Duda, P. E.
Hart, and Stork 2001 and is given by
i_{\mathrm{misclass}}(\nu) = 1 - \max \{ n_l / n \mid l = 1, \ldots, q \}.
This impurity measure is the minimum (taken over all classes c) probability
that a point from Dν which is of class c is misclassified if the classification is
randomly done according to the class distribution in Dν .
Like the entropy impurity, the Gini and misclassification impurities are non-
negative and equal to zero if and only if the set Dν is 100% pure. Like it was done
with the entropy impurity, these other impurity measures can be used to choose
a property for a node ν through the gain formula (8), plugging in the respective
impurity measure for ientropy .
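For completeness, the two alternative impurity measures in the same illustrative Python style (again reusing the Counter import from above):

    def gini_impurity(labels):
        n = len(labels)
        # 1 minus the sum of squared class frequencies
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def misclassification_impurity(labels):
        n = len(labels)
        # 1 minus the relative frequency of the most common class
        return 1.0 - max(Counter(labels).values()) / n
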
This concludes the presentation of the core DTLA. It is a greedy procedure
which incrementally constructs the tree, starting from its root, by always choos-
ing that property for the node currently being constructed which maximizes the
information gain (7) or one of the alternative impurity measures.
You will notice some unfortunate facts:
• We started from claiming that our goal is to learn a tree which minimizes
the counting loss (or any other loss we might opt for). However, neither the
impurity measures nor the local gain optimization built on them is connected
in a mathematically transparent way to that loss. In fact, no computa-
tionally tractable method for learning an empirical-loss-minimal tree exists:
the problem is NP-complete (Hyafil and Rivest 1976).
• It is not clear which of the impurity measures is best for the goal of mini-
mizing the empirical loss.
2.5 Dealing with overfitting
If a decision tree is learnt according to the core algorithm, there is the danger
that it overfits. This means that it performs well on the training data – it has
small empirical loss, which boils down to a condition where all leaf nodes are
rather pure – but generalizes poorly to new data. A zero empirical loss is easily attained if the number of properties
and attributes is large. Then it will be the case that every training example has
a unique combination of attributes, which leads to 100% pure leaf nodes, and
every leaf having only a small set of training examples associated with it, maybe
singleton sets. Then one has zero misclassifications of the training examples. But,
intuitively speaking, the tree has just memorized the training set. If there is some
“noise” in the attributes of data points, new examples (not in the training set)
will likely have attribute combinations that lead the learnt decision tree on wrong
tracks.
We will learn how to analyze and systematically fight overfitting later in the
course. At this point I only point out that when a learnt tree T overfits, one
can typically obtain from it another tree T 0 which overfits less, by pruning T .
Pruning a tree means to select some internal nodes and delete all of their children.
This makes intuitive sense: the deeper one goes down in a tree learnt by the core
algorithm, the more the branch reflects “individual” attribute combinations
found in the training data.
To make this point particularly clear, consider a case where there are many
properties, but only one of them carries information relevant for classification,
while all others are purely random. For instance, for a classification of patients into
the two classes “healthy” and “has cancer”, the binary property “has increased
leukocyte count” carries classification relevant information, while the binary prop-
erty “given name starts with A” is entirely unconnected to the clinical status. If
there are enough such irrelevant properties to uniquely identify each patient in
the training sample, the core algorithm will (i) most likely find that the relevant
property “has increased leukocyte count” leads to the greatest information gain
at the root and thus use it for the root decision, and (ii) subsequently generate
tree branches that lead to 100% pure leaf nodes. Zero training error and poor
generalization to new patients results. The best tree here would be the one that
only expands the root node once, exploiting the only relevant property.
With this insight in mind, there are two strategies to end up with trees that
are not too deep.
The first strategy is early stopping. One does not carry the core algorithm
to its end but at each node that one has created, one decides whether a further
expansion would lead to overfitting. A number of statistical criteria are known to
decide when it is time to stop expanding the tree; or one can use cross-validation
(explained in later chapters in these lecture notes). We do not discuss this further
here – the textbook by Duda explains a few of these statistical criteria. A problem
with early stopping is that it suffers from the horizon effect: if one stops early at
some node, a more fully expanded tree might exploit further properties that are,
in fact, relevant for classification refinement underneath the stopped node.
The second strategy is pruning. The tree is first fully built with the core
algorithm, then it is incrementally shortened by cutting away end sections of
branches. An advantage of pruning over early stopping is that it avoids the horizon
effect. Duda says (I am not a decision tree expert and can’t judge) that pruning
should be preferred over early stopping “in small problems” (whatever “small”
means).
Missing values. Real-world datasets typically have missing values, that is, not
all training or test examples will have attribute values filled in for all proper-
ties. Missing values require adjustments both in training (an adapted version
of impurity measures which accounts for missing values) and in testing (if a
test example leads to a node with property Q and the example misses the
required attribute for Q, the normal classification procedure would abort).
The Duda book dedicates a subsection to recovery algorithms.
All in all, there is a large number of design decisions to be made when setting
up a decision tree learning algorithm, and a whole universe of design criteria and
algorithmic sub-procedures is available in the literature. Some combinations of
such design decisions have led to final algorithms which have become branded with
names and are widely used. Two algorithms which are invariably cited (and which
are available in statistical and ML programming toolboxes) are the ID3 algorithm
and its more complex and higher-performing successor, the C4.5 algorithm. They
were introduced by decision tree pioneer Ross Quinlan in 1986 and 1993,
respectively. The Duda book gives brief descriptions of these classical algorithms.
2.7 Random forests
Decision tree learning is a subtle affair. If one designs a DTLA by choosing
a specific combination of the many design options, the tree that one gets from
a training dataset D will be influenced by that choice of options. It will not
be “the optimal” tree which one might obtain by some other choice of options.
Furthermore, if one uses pruning or early stopping to fight overfitting, one might
well end up with trees that, while not overfitting, are underfitting – that is, they do
not exploit all the information that is in the training data. In summary, whatever
one does, one is likely to obtain a decision tree that is significantly sub-optimal.
This is a common situation in machine learning. There are only very few
machine learning techniques where one has full mathematical control over getting
the best possible model from a given training dataset, and decision tree learning
is not among them. Fortunately, there is also a common escape strategy for
minimizing the quality deficit inherent in most learning designs: ensemble methods.
The idea is to train a whole collection (called “ensemble”) of models (here: trees),
each of which is likely suboptimal (jargon: each model is obtained from a “weak
learner”), but if their results on a test example are diligently combined, the merged
result is much better than what one gets from each of the models in the ensemble.
It’s the idea of crowd intelligence.
“Ensemble methods” is an umbrella term for a wide range of techniques. They
differ, obviously, in the kind of the individual models, like for instance decision
trees or neural networks. Second, they vary in the methods of how one generates
diversity in the ensemble. Ensemble methods work well only to the extent that the
individual models in the ensemble probe different aspects in the training data –
they should look at the data from different angles, so to speak. Thirdly, ensemble
methods differ in the way how the results of the individual models are combined.
The most common way is majority voting — the final classification decision is the
one made most often by the individual models. A general theory for setting up an
ensemble learning scheme is not available – we are again thrown back to heuristics.
The Wikipedia articles on “Ensemble learning” and “Ensemble averaging (machine
learning)” give a condensed overview.
Obviously, ensemble methods can only be used when the computational cost
of training a single model is rather low. This is the case for decision tree learning.
Because training an individual tree will likely not give a competitive result, but
is cheap, it is common practice to train not a single decision tree but an entire
ensemble – which in the case of tree learning is called a random forest. In fact,
random forests can yield competitive results (for instance, compared to neural
networks) at moderate cost, and are therefore often used in practical applications.
The definitive reference for random forests is Breiman 2001. The paper has two
parts. In the first part, Breiman gives a mathematical analysis of why combining
decision trees does not lead to overfitting, and derives an instructive upper bound
on the expected generalization error (= risk, see Section 2.3). These results were
in synchrony with mainstream work in other areas of machine learning theory at
the time and made this paper the theory anchor for random forests. However, I
personally find the math notation used by Breiman opaque and hard to penetrate,
and I am not sure how many readers could understand it. The second, larger part
of this paper describes several variants and extensions of random forest algorithms,
discusses their properties and benchmarks some of them against the leading clas-
sification learning algorithms of the time, with favourable outcomes. This part
is an easy read and the presented algorithms are not difficult to implement. My
hunch is that it is the second rather than the first part which led to the immense
impact of this paper.
Here I give a summary of the paper, in reverse order, starting with the practical
algorithm recommended by Breiman. After that I give an account of the most
conspicuous theoretic results in transparent standard probability notation.
In ensemble learning one must construct many different models in an auto-
mated fashion. One way to achieve this is to employ a stochastic learning algo-
rithm. A stochastic learning algorithm can be seen as a learning algorithm which
takes two arguments. The first argument is the training data D, the same as in
ordinary, non-stochastic learning algorithms. The second argument is a random
vector, denoted by Θ in Breiman’s paper, which is set to different random values
in each run of the algorithm. Θ can be seen as a vector of control parameters
in the algorithm; different random settings of these control parameters lead to
different outcomes of the learning algorithm although the training data are always
the same, namely D.
Breiman proposes two ways in order to make the tree learning stochastic:
Bagging. In each run of the learning algorithm, the training dataset is resampled
with replacement. That is, from D one creates a new training dataset D0
of the same size as D by randomly copying elements from D into D0 . This
is a general approach for ensemble learning, called bagging. The Wikipedia
article on “Bootstrap aggregating” gives an easy introduction if you are
interested in learning more.
Random feature selection is the term used by Breiman for a randomization
technique where, at each node whose query property has to be chosen dur-
ing tree growing, a small subset of all still unqueried properties is randomly
drawn as candidates for the query property of this node. The winning prop-
erty among them is determined by the information gain criterion.
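To make these two randomization mechanisms concrete, here is a minimal sketch in Python (my own illustrative code, not from Breiman's paper); it shows only how the random choices that Θ stands for could be generated, not the complete tree learning:

    import random

    def bootstrap_sample(xs, ys):
        # bagging: resample the training set with replacement, keeping its size
        idx = [random.randrange(len(xs)) for _ in range(len(xs))]
        return [xs[i] for i in idx], [ys[i] for i in idx]

    def candidate_properties(unqueried, subset_size):
        # random feature selection: draw a small random subset of the still
        # unqueried properties; the information gain criterion then picks
        # the winner among these candidates only
        return random.sample(unqueried, min(subset_size, len(unqueried)))
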
Combining these two randomness-producing mechanisms with a number of
other design decisions not mentioned here, Breiman obtains a stochastic learning
algorithm which, with ensembles of size 100 and majority voting, outperformed
other ML methods that were state-of-the-art at that time.
The random vector Θ, which is needed for a concise specification and a math-
ematical analysis of the resulting stochastic tree learning algorithm, is some es-
sentially arbitrary encoding of the random choices made for bagging and random
feature selection.
The main theoretical result in Breiman’s paper is his Theorem 2.3:
PE∗ ≤ ϱ̄ (1 − s²)/s² .
This theorem certainly added much to the impact of the paper, because it gives
an intuitive guidance for the proper design of random forests, and also because it
connected random decision tree forests to other machine learning methods which
were being mathematically explored at the time when the paper appeared. I will
now try to give a purely intuitive explanation, and then conclude this section with
a clean-math explanation of this theorem.
Note that the factor (1 − s²)/s² ranges between zero and infinity and is monotonically
increasing with decreasing s. It is zero for maximal strength s = 1, and it is
infinite if the strength is zero.
The theorem gives an upper bound on the generalization error observed in
(asymptotically infinitely large) random forests. The main message is that this
bound is a product of a factor ϱ̄ which is smaller when the different trees in a
forest vary more in their response to test inputs, with a factor (1 − s²)/s² which is
smaller when the trees in the forest place a larger probability gap between the correct
and the second most common decision across the forest. The suggestive message
of all of this is that in designing a stochastic decision tree learning algorithm one
should
• aim for diversity, that is, generate trees whose decisions are as little correlated
with each other as possible, and
• attempt to ensure that trees mostly come out with correct decisions.
If one succeeds in generating trees that always give correct decisions, one has
maximal strength s = 1 and the generalization error is obviously zero. This will
usually be impossible. Instead, the stochastic tree learning algorithm will produce
trees that have a residual error probability, that is, s < 1. Then the first factor
implies that one should aim at a stochastic tree generation mechanism which (at a
given strength) produces trees that show great variability in their response behavior.
The remainder of this section is only for those of you who are familiar with
probability theory, and this material will not be required in the final exam. I will
give a mathematically transparent account of Breiman’s Theorem 2.3. This boils
down to an exercise in clean mathematical formalism. We start by formulating the
ensemble learning scenario in rigorous probability terms. There are two probability
spaces involved, one for generating data points, and the other for generating the
random vectors Θ. I will denote these two spaces by (Ω, A, P ) and (Ω∗ , A∗ , P ∗ )
respectively, with the second one used for generating Θ. The elements of Ω, Ω∗
will be denoted by ω, ω ∗ , respectively.
Let X be the RV which generates attribute vectors,
X : Ω → A1 × . . . × Am ,
and let Y be the RV which generates class labels,
Y : Ω → C.
Furthermore, for j = 1, . . . , K let
Zj : Ω∗ → T
be the RV which generates the j-th parameter vector, that is, Zj (ω ∗ ) = Θj . The
RVs Zj (j = 1, . . . , K) are independent and identically distributed (iid).
In the remainder of this section, we fix some training dataset D. Let h(·, Θ)
denote the tree that is obtained by running the tree learning algorithm with pa-
rameter vector Θ. This tree is a function which outputs a class decision if the
input is an attribute vector, that is, h(x, Θ) ∈ C for every attribute vector x.
Composed with the data-generating RV X, the expression
h(X, Θ) : Ω → C
is a random variable.
Now the stage is prepared to re-state Breiman’s central result (Theorem 2.3)
in a transparent notation.
Breiman starts by introducing a function mr, called margin function for a
random forest,
mr : (A1 × . . . × Am ) × C → R,
(x, y) ↦ P∗(h(x, Z) = y) − max{P∗(h(x, Z) = ỹ) | ỹ ≠ y},
and from it the strength
s = E[mr(X, Y )].
s is a measure to what extent, averaged over data point examples, the correct
answer probability (over all possible decision trees) is larger than the highest
probability of deciding for any other answer. Breiman calls s the strength of the
parametrized stochastic learning algorithm. The stronger the stochastic learning
algorithm, the greater the probability margin between correct and wrong classifi-
cations on average over data point examples.
Breiman furthermore introduces a raw margin function rmgΘ , which is a func-
tion of X and Y parametrized by Θ, through
rmg_\Theta(\omega) = \begin{cases} 1, & \text{if } h(X(\omega), \Theta) = Y(\omega), \\ -1, & \text{if } h(X(\omega), \Theta) \text{ gives the maximally probable among the wrong answers in an asymptotically infinitely large forest,} \\ 0, & \text{else.} \end{cases} \qquad (10)
Define ϱ(Θ, Θ′) to be the correlation (= covariance normalized by division with
standard deviations) between rmgΘ and rmgΘ′ . For given Θ, Θ′, this is a number
in [−1, 1]. Seen as a function of Θ, Θ′, ϱ(Θ, Θ′) maps each pair Θ, Θ′ into [−1, 1],
which gives a RV ϱ(Z, Z′), denoted with the same symbol ϱ for convenience,
where Z, Z′ are two independent RVs with the same distribution as the Zj .
Let
ϱ̄ = E[ϱ(Z, Z′)]
be the expected value of ϱ(Z, Z′). It measures to what extent, on average across
random choices for Θ, Θ′, the resulting two trees h(·, Θ) and h(·, Θ′) arrive at the
same correct or wrong decision, averaged over data examples.
And here is, finally, Breiman’s Theorem 2.3:
PE∗ ≤ ϱ̄ (1 − s²)/s², (11)
whose intuitive message I discussed earlier.
3 Elementary supervised temporal learning
In this section I give an introduction to an elementary temporal data learning
method. It combines simplicity with a very broad practical usefulness: using
linear regression to learn a map which transforms temporal input signals to output
signals. In many scenarios, this simple technique is all one needs. It can be
programmed and executed in a few minutes (really!) and you should run this
technique as a first baseline whenever you start a serious learning task that involves
time series data.
This section deals with numerical timeseries where the data format for each
time point is a real-valued scalar or vector. This includes the majority of all
learning tasks that arise in the natural sciences, in engineering, robotics, speech
and in video processing.
Methods for dealing with symbolic timeseries (in particular texts, but also dis-
crete action sequences of intelligent agents / robots, DNA sequences and more) can
be obtained by encoding symbols into numerical vectors and then applying numeri-
cal methods. Often, however, one uses methods that operate on symbol sequences
directly (all kinds of discrete-state dynamical systems, deterministic or stochastic,
like finite automata, Markov chains, hidden Markov models, dynamical Bayesian
networks, and more). I will not consider such methods in this section.
My secret educational agenda in this section is that this will let you rehearse
linear regression - which is such a basic, simple, yet widely useful method that
everybody should have a totally 100% absolutely unshakeable secure solid firm
hold on it.
In machine learning terminology, this is a supervised learning task, because the
training data points include teacher outputs yi . Note that the abstract data format
(xi , yi )i=1,...,N is the same as for decision trees, but here the data are numerical
while for decision trees they are most often symbolic.
By a small modification one can make linear regression much more versatile.
Note that any weight vector w∗ in (12) will map the zero vector 0 ∈ Rn on zero.
This is often not what one wants to happen. If one enriches (12) by also training
a bias b ∈ R, via
(w, b) = \mathrm{argmin}_{w^*, b^*} \sum_{i=1}^{N} (w^* x_i + b^* - y_i)^2 ,
one obtains an affine linear function (linear plus constant offset). The common
way to set up linear regression such that affine linear solutions become possible
is to pad the original input vectors xi with a last component of constant size 1,
that is, in (12) one uses (n + 1)-dimensional vectors [x; 1] (using Matlab notation
for the vertical concatenation of vectors). Then a solution of (12) on the basis
of the padded input vectors will be a regression weight vector [w, b] ∈ Rn+1 , with
the last component b giving the offset.
We now derive a solution formula for the minimization problem (12). Most
textbook explanations start from the observation that the objective function
\sum_{i=1}^{N} (w x_i - y_i)^2 is a quadratic function in the weights w and then one uses calculus to find
the minimum of this quadratic function by setting its partial derivatives to zero.
I will present another derivation which does not need calculus and reveals the
underlying geometry of the problem more clearly.
Let xi = (x^1_i , . . . , x^n_i )′ be the i-th input vector. The key to understanding linear
regression is to realize that the N values x^j_1 , . . . , x^j_N of the j-th component (j =
1, . . . , n), collected across all input vectors, make for an N -dimensional vector ϕj =
(x^j_1 , . . . , x^j_N )′ ∈ RN . Similarly, the N target values y1 , . . . , yN can be combined into
an N -dimensional vector y. Figure 10 Top shows a case with N = 10 input vectors
of dimension n = 4.
Using these N -dimensional vectors as a point of departure, geometric insight
gives us a nice clue how w should be computed. To admit a visualization, we
consider a case where we have only N = 3 input vectors that each have n = 2
components. This gives two 3-dimensional vectors ϕ1 , ϕ2 (Figure 10 Bottom).
The target values y1 , y2 , y3 are combined in a 3-dimensional vector y.
Notice that in machine learning, one should ideally have more input vectors than
the input vectors have components, that is, N > n. In fact, a very coarse rule of
thumb – with many exceptions – says that one should aim at N > 10 n (if this is
not warranted, use unsupervised dimension reduction methods to reduce n). We
will thus assume that we have fewer vectors ϕj than training data points. The
vectors ϕj thus span an n-dimensional subspace in RN (greenish shaded area in
Figure 10 Bottom).
Notice (easy exercise, do it!) that the minimization problem (12) is equivalent
Figure 10: Two visualizations of linear regression. Top. This visualization shows
a case where there are N = 10 input vectors xi , each one having n = 4 vector
components x1i , . . . , x4i (green circles). The fourth component is a constant-1 bias.
The ten values x^j_1 , . . . , x^j_{10} of the j-th component (where j = 1, . . . , 4) form a
10-dimensional (row) vector ϕj , indicated by a green connecting line. Similarly,
the ten target values yi give a 10-dimensional vector y (shown in red). The linear
combination yopt = w[ϕ1 ; ϕ2 ; ϕ3 ; ϕ4 ] which gives the best approximation to y in
the least mean square error sense is shown in orange. Bottom. The diagram
shows a case where the input vector dimension is n = 2 and there are N = 3
input vectors x1 , x2 , x3 in the training set. The three values x^1_1 , x^1_2 , x^1_3 of the first
component give a three-dimensional vector ϕ1 , and the three values of the second
component give ϕ2 (green). These two vectors span a 2-dimensional subspace F
in RN = R3 , shown in green shading. The three target values y1 , y2 , y3 similarly
make for a vector y (red). The linear combination yopt = w1 ϕ1 + w2 ϕ2 which has
the smallest distance to y is given by the projection of y on this plane F (orange).
The vectors u1 , u2 shown in blue are a pair of orthonormal basis vectors which span
the same subspace F.
to
w = \mathrm{argmin}_{w^*} \left\| \Big( \sum_{j=1}^{n} w^*_j \varphi_j \Big) - y \right\|^2 , \qquad (13)
U U′ y = U U′ X′ w′ .
U′ y = U′ X′ w′ . (17)
It remains to find a weight vector w which satisfies (17). I claim that w′ =
(XX′)^{-1} X y does the trick, that is, U′ y = U′ X′ (XX′)^{-1} X y holds.
To see this, first observe that XX′ is nonsingular, thus (XX′)^{-1} is defined.
Furthermore, observe that U′ y and U′ X′ (XX′)^{-1} X y are n-dimensional vectors,
and that the N × n matrix U Σ has rank n. Therefore,
Step 1. Sort the input vectors as columns into an n × N matrix X and the
targets into an N -dimensional vector y.
• What we have derived here generalizes easily to cases where the data are of
the form (xi , yi )i=1,...,N where xi ∈ Rn , yi ∈ Rk . That is, the output data
are vectors, not scalars. The objective is to find a k × n regression weight
matrix W which solves
W = \mathrm{argmin}_{W^*} \sum_{i=1}^{N} \| W^* x_i - y_i \|^2 . \qquad (20)
In analogy to (19), the solution is W′ = (XX′)^{-1} X Y, where Y is the N × k matrix
whose rows are the target vectors y′_i .
• Computing the inversion (XX′)^{-1} may suffer from numerical instability
when XX′ is close to singular. Remark: this happens more often than
you would think - in fact, XX′ matrices obtained from real-world, high-
dimensional data are often ill-conditioned (= close to singular). You should
always feel uneasy when your program code contains a matrix inverse! A
quick fix is to always add a small multiple of the n × n identity matrix before
inverting, that is, replace (19) by
w′_{\mathrm{opt}} = (XX′ + \alpha^2 I_{n \times n})^{-1} X y. \qquad (21)
This is called ridge regression (a small code sketch follows after this list of
remarks). We will see later in this course that ridge regression not only helps
to circumvent numerical issues, but also offers a solution to the problem of
overfitting.
• A note on terminology. Here we have described linear regression. The word
“regression” is used in much more general scenarios. The general setting
goes like this:
Given: Training data (xi , yi )i=1,...,N , where xi ∈ Rn , yi ∈ Rk .
Also given: a search space H containing candidate functions h : Rn → Rk .
Also given: a loss function L : Rk × Rk → R≥0 .
Wanted: A solution to the optimization problem
h_{\mathrm{opt}} = \mathrm{argmin}_{h \in H} \sum_{i=1}^{N} L(h(x_i), y_i).
In the case of linear regression, the search space H consists of all linear
functions from Rn to Rk , that is, it consists of all k × n matrices. The loss
function is the quadratic loss which you see in (20). When one speaks of
linear regression, the use of the quadratic loss is implied.
Search spaces H can be arbitrarily large and rich in modeling options –
for instance, H might be the space of all deep neural networks of a given
structure and size.
Classification tasks look similar to regression tasks at first sight: training
data there have the format (xi , ci )i=1,...,N . The difference is that the
target values ci are not numerical but symbolic — they are class labels.
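Referring back to the ridge regression formula (21): a minimal numpy sketch of it (variable and function names are mine). Using a linear solver instead of an explicit matrix inverse is the numerically preferable way to evaluate (21).

    import numpy as np

    def ridge_regression(X, y, alpha=0.01):
        # X: n x N matrix with the (bias-padded) inputs as columns,
        # y: N-dimensional target vector, alpha: regularization strength
        n = X.shape[0]
        # solve (X X' + alpha^2 I) w = X y instead of inverting explicitly
        return np.linalg.solve(X @ X.T + alpha**2 * np.eye(n), X @ y)

    # prediction for a new (padded) input x: y_hat = w @ x
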
In a supervised temporal learning task, the training data consist of two paired signals:
an input signal (u(t))t∈T , where T is an ordered set of time points and for every
t ∈ T , u(t) ∈ Rk ;
a “teacher” output signal (y(t))t∈T , where y(t) ∈ Rm .
For simplicity we will only consider discrete time with equidistant unit timesteps,
that is T = N (unbounded time) or T = {0, 1, . . . , T } (finite time).
The learning task consists in training a system which operates in time and, if
it is fed with the input signal u(t), produces an output signal ŷ(t) which approx-
imates the teacher signal y(t) (in the sense of minimizing a loss L(ŷ(t) − y(t)) in
time average over the training data). A few examples for illustration:
• Input signal: a noisy radio signal with lots of statics and echos. Desired
output signal: the input signal in a version which has been de-noised and
where echos have been cancelled.
These are quite different sorts of tasks. The ECG monitoring task would be
called a temporal classification or fault monitoring task; the 0-1 switching of the
badly designed thermostat is too simplistic to have a respectable name; radio
engineers would speak of de-noising and equalizing, and speech translation is too
royally complex to have a name besides “online speech translation”. Input-output
signal transformation tasks are as many as there are wonders under the sun.
In many cases, one may assume that the current output data point y(t) only
depends on inputs up to that time point, that is, on the input history . . . , u(t −
2), u(t − 1), u(t). Specifically, y(t) does not depend on future inputs. Input-
output systems where the output does not depend on future inputs are called
causal systems in the signal processing world.
A causal system can be said to have memory if the current output y(t) is not
fully determined by the current input u(t), but is influenced by earlier inputs as
well. I must leave the meaning of “influenced by” vague at this point; we will make
it precise in a later section when we investigate stochastic processes in more detail.
All examples except the (poorly designed) thermostat example have memory.
Often, the output y(t) is influenced by long-ago input only to a negligible ex-
tent, and it can be explained very well from only the input history extending back
to a certain limited duration. All examples in the list above except the English
translation task have such limited relevant memory spans. In causal systems with
bounded memory span, the current output y(t) thus depends on an input window
u(t − d + 1), u(t − d + 2), . . . , u(t − 1), u(t) of d steps duration. Figure 11 (top)
gives an impression.
[Figure 11: Window-based input-output transformation, schematic. Top: the output y(t) is computed from the window of recent inputs u(t − 4), . . . , u(t). Bottom: for timeseries prediction, the next input u(t + 1) is predicted from the window u(t − 4), . . . , u(t).]
If the input signal is vector-valued, with u(t) ∈ Rk , this is not
a problem; all one has to do is to flatten the collection of d k-dimensional input
vectors which lie in a window into a single d · k dimensional vector, and then apply
(20) as before.
Linear regression is often surprisingly accurate, especially when one uses large
windows and a careful regularization (to be discussed later in this course) through
ridge regression. When confronted with a new supervised temporal learning task,
the first thing a seasoned pro does is to run it through the machinery of window-
based linear regression. This takes a few minutes of programming and gives, at the
very least, a baseline for comparing more sophisticated methods against — and
often it already gives a very good solution without more effort.
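Here is a hedged sketch of this window-based baseline for scalar signals, combining the window construction with the ridge formula (21); all names are my own.

    import numpy as np

    def learn_window_regression(u, y, d, alpha=0.01):
        # u, y: input and teacher signal (1-dimensional arrays of equal length),
        # d: window length; returns weights for the bias-padded windows
        rows = [np.concatenate([u[t - d + 1 : t + 1], [1.0]])   # window plus bias 1
                for t in range(d - 1, len(u))]
        X = np.array(rows).T                  # (d+1) x N matrix, windows as columns
        Y = np.asarray(y)[d - 1 :]            # targets aligned with the windows
        return np.linalg.solve(X @ X.T + alpha**2 * np.eye(d + 1), X @ Y)

    # output estimate at time t:
    # y_hat = w @ np.concatenate([u[t - d + 1 : t + 1], [1.0]])
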
But linear regression can only give linear regression functions. This is not good
enough if the dynamical input-output system behavior has significant nonlinear
components. Then one must find a nonlinear regression function f .
If that occurs, one can resort to a simple method which yields nonlinear
regression functions while not renouncing the conveniences of the basic linear re-
gression learning formula (19). I discuss this for the case of scalar inputs u(t) ∈ R.
The trick is to add fixed nonlinear transforms to the collection of input arguments
u(t−d+1), u(t−d+2), . . . , u(t). A common choice is to add polynomials. To make
notation easier, let us rename u(t − d + 1), u(t − d + 2), . . . , u(t) to u1 , u2 , . . . , ud .
If one adds all polynomials of degree 2, one obtains a collection of d + d(d + 1)/2
input components for the regression, namely u1 , . . . , ud together with all products
ui uj where 1 ≤ i ≤ j ≤ d.
If one wants even more nonlinearity, one can add further, higher-order polynomi-
als. The idea to approximate nonlinear regression functions by linear combinations
of polynomial terms is a classical technique in signal processing, where it is treated
under the name of Volterra expansion or Volterra series. Very general classes of
nonlinear regression functions can be approximated to arbitrary degrees of preci-
sion with Volterra expansions.
Adding increasingly higher-order terms to a Volterra expansion obviously leads
to a combinatorial explosion. Thus one will have to use some pruning scheme to
keep only those polynomial terms which lead to an increase of accuracy. There
is a substantial literature in signal processing dealing with pruning strategies for
Volterra series (google “Volterra pruning”). I personally would never try to use
polynomials of degree higher than 2. If that doesn’t give satisfactory results, I
would switch to other modeling techniques, using neural networks for instance.
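A sketch of the degree-2 expansion of a window u1 , . . . , ud (illustrative only; names are mine):

    from itertools import combinations_with_replacement

    def quadratic_expansion(window):
        # window: the d values u_1, ..., u_d
        # returns the d original values plus all d(d+1)/2 products u_i * u_j (i <= j)
        products = [a * b for a, b in combinations_with_replacement(window, 2)]
        return list(window) + products

    # the expanded windows are then fed into the linear regression (19)/(21) as before
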
Time series prediction tasks come in many variants and I will not attempt
to draw the large picture but restrict this treatment to timeseries with integer
timesteps and vector-valued observations, as in the previous subsections.
Re-using terminology and notation, the input signal in the training data is, as
before, (u(t))t∈T . If one wants to predict this sequence of observations h timesteps
ahead (h is called prediction horizon), the desired output y(t) is just the input,
shifted by h timesteps:
(y(t))t∈T = (u(t + h))t∈T .
A little technical glitch is that due to the timeshift h, the last h data points
in the input signal (u(t))t∈T cannot be used as training data because their h-step
future would lie beyond the maximal time T .
Framed as a u(t)–to–y(t) input-output signal transformation task, all meth-
ods that can be applied for the latter can be used for timeseries prediction too.
Specifically, simple window-based linear regression (as in Figure 11 bottom) is
again a highly recommendable first choice for getting a baseline predictor when-
ever you face a new timeseries prediction problem with numerical timeseries.
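In code, framing h-step prediction as a transformation task only requires shifting the targets; a tiny sketch (names mine):

    def prediction_training_pairs(u, h):
        # the target for time t is u(t + h); the last h points of u
        # cannot serve as training inputs because their target is unknown
        return [(u[t], u[t + h]) for t in range(len(u) - h)]

    # each input u[t] would in practice be replaced by the window
    # u[t-d+1], ..., u[t], and then window-based regression is applied as above
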
“in the middle”: the physical world between the two antennas, inducing
all kinds of static noise, echos, distortions, such that the received radio
signal is a highly corrupted version of the cleanly transmitted one. Signal
processing engineers call this part of the world-in-the-middle the channel.
Does it not seem hopeless to model the input-output transformation of such
super-complex physical channels? – There can hardly be a more classical
problem than this one; analysing signal transmission channels gave rise to
Shannon’s information theory (Shannon 1948).
Abstracting from these examples, consider how natural scientists and math-
ematicians describe dynamical systems with input and output. We stay in line
with earlier parts of this section and consider only discrete-time models with a time
index set T = N or T = Z. Three timeseries are considered simultaneously: the
input signal u(t), the (typically not directly observable) system state x(t), and the
output signal y(t).
In the soccer reporting example, x(t) would refer to some kind of brain state —
for instance, the vector of activations of all the reporter’s brain’s neurons. In the
signal transmission example, x(t) would be the state vector of some model of the
physical world stretched out between the sending and receiving antenna.
A note in passing: variable naming conventions are a little confusing. In the
machine learning literature, x’s and y’s in (x, y) pairs usually mean the arguments
and values (or inputs and outputs) of classification/regression functions. Both x
and y are observed data. In the dynamical systems and signal processing literature
(mathematics, physics and engineering), the variable name x typically is reserved
for the state of a physical system that generates or “channels” (= “transduces”,
“filters”) signals. The internal physical state of these systems is normally not
fully observable and not part of training data. Only the input and output signals
u(t), y(t) are observable data which are available for training models.
In a discrete-time setting, the temporal evolution of u(t), x(t), y(t) is governed
by two functions: the state update map f ,
x(t + 1) = f (x(t), u(t)),
which describes how the internal states x(t) develop over time under the influence
of input, and the observation function, which describes which outputs y(t) can be
observed when the physical system is in state x(t).
The input signal u(t) is not specified by any equation, it is just “given”.
Figure 12 visualizes the structural difference between the signal-based and the
state-based input-output transformation models.
There are many other types of state update maps and observation functions,
for instance ODEs and PDEs for continuous-time systems, automata models for
discrete-state systems, or a host of probabilistic formalisms for random dynamical
systems. For our present discussion, considering only discrete-time state update
maps is good enough.
A core difference between signal-based and state-based timeseries transfor-
mations is the achievable memory timespans. In windowed signal transformations
through regression functions, the memory depth is bounded by the window length.
In contrast, the dynamical system state x(t) of the intermediary system is poten-
tially co-determined by input that was fed to the dynamical system in an arbitrarily
deep past – the memory span can be unbounded! This may seem counterintuitive
if one looks at Figure 12 because at each time point t, only the input data point
u(t) from that same timepoint is fed to the dynamical system. But u(t) leaves
some trace on the state x(t), and this effect is forwarded to the next timestep
through f , thus x(t + 1) is affected by u(t), too; and so forth. Thus, if one expects
long-range or even unbounded memory effects, using state-based transformation
models is often the best way to go.
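A toy illustration of this unbounded-memory effect (a made-up leaky-integrator system, not taken from the text): every input leaves a geometrically fading trace in the state and thereby in all later outputs, which no finite window captures exactly.

    def simulate_leaky_integrator(inputs, decay=0.9):
        # state update: x(t+1) = decay * x(t) + u(t); observation: y(t) = x(t)
        x, outputs = 0.0, []
        for u in inputs:
            outputs.append(x)          # y(t) = x(t)
            x = decay * x + u          # x(t+1) depends on all past inputs
        return outputs
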
Machine learning offers a variety of state-based models for timeseries transfor-
mations, together with learning algorithms. The most powerful ones are hidden
Markov models (which we’ll get to know in this course) and other dynamical
graphical models (which we will not touch), and recurrent neural networks (which
we’ll briefly meet I hope).
In some applications it is important that the input-output transformation can
be learnt in an online adaptive fashion. The input-output transformation is not
trained just once, on the basis on a given, fixed training dataset. Instead, training
never ends; while the system is being used, it continues to adapt to changing
input-output relationships in the data that it processes. My favourite example
is the online adaptive filtering (denoising, echo cancellation) of the radio signal
received by a mobile phone. When the phone is physically moving while it is
being used (phonecall in a train, or just while walking up and down in a room),
the signal channel from the transmitter antenna to the phone’s antenna keeps
changing its characteristics because the radiowaves will take different mixes of
reflection pathways all the time. The denoising, echo-cancelling filter has to be re-
learnt every few milliseconds. This is done with a window-based linear regression
(window size several tens of thousands) and ingeniously simplified/accelerated
algorithms. Because this is powerful stuff, we machine learners should not leave
these methods only to the electrical engineers (who invented them) but
learn to use them ourselves. I will devote a session later in this course to these
online adaptive signal processing methods.
Floris Takens’ original theorem was formulated in a context of continuous-time
dynamics governed by differential equations. Many variants and extensions of
Takens theorem are known today. To stay in tune with earlier parts of this section
I present a discrete-time version. Consider the input-free dynamical system
Figure 13: Takens theorem visualized. Left plot shows the Lorenz attractor, a
chaotic attractor with a state sequence x(t) defined in an n = 3 dimensional
state space. The center plot shows a 1-dimensional observation y(t) thereof. The
right plot (orange) shows the state sequence obtained from y(t) by three-
dimensional delay embedding (I forget which delay δ I used to generate this plot).
Figure 14: Getting nice graphics from delay embeddings. Left: a timeseries
recorded from the “Mackey-Glass” chaotic attractor. Right: plotting the tra-
jectory of a delay-embedded version of the left signal.
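For the curious, a minimal sketch of the delay embedding used to produce such plots (embedding dimension and delay δ are parameters; the function name is mine):

    def delay_embed(y, dim=3, delta=1):
        # map the scalar series y(0), y(1), ... to the vectors
        # (y(t), y(t - delta), ..., y(t - (dim-1)*delta)) for all admissible t
        start = (dim - 1) * delta
        return [[y[t - k * delta] for k in range(dim)] for t in range(start, len(y))]
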
4 Basic methods for dimension reduction
One way to take up the fight with the “curse of dimensionality”, which I high-
lighted in Section 1.2.2, is to reduce the dimensionality of the raw input data
before they are fed to subsequent learning algorithms. The dimension reduction
ratio can be enormous.
In this Section I will introduce two standard, elementary methods for dimension
reduction: K-means clustering and principal component analysis; and a method
which is not used in practice except sometimes for data visualization, but which is
very likely used by our brains, too: self-organizing feature maps. All three methods
reduce the dimension of vector data, which come in the form of points x ∈ Rn ,
that is, this section is only about dimension reduction methods for numerical data.
Dimension reduction is the archetypical unsupervised learning task.
Such a dimension reduction is effected by a map f : Rn → Rm applied to the data
points xi , where
• m < n, that is, we indeed reduce the number of dimensions — maybe even
dramatically;
• the low-dimensional vectors f (xi ) should preserve from xi the specific in-
formation that is needed to solve the learning task that comes after this
dimension-reducing data “preprocessing”.
• The number m of features should be small — after all, one of the reasons
for using features is dimension reduction.
• Each feature fi should be relevant for the task at hand. For example, when
the task is to distinguish helicopter images from winged aircraft photos (a
2-class classification task), the brightness of the background sky would be
an irrelevant feature; but the binary feature “has wings” would be extremely
relevant.
• A general intuition about features is that they should be rather cheap and
easy to compute at the front end where the ML system meets the raw data.
The “has wings” feature for helicopter vs. winged aircraft classification more
or less amounts to actually solving the classification task and presumably
is neither cheap nor easy to compute. Such highly informative, complex
features are sometimes called high-level features; they are usually computed
on the basis of more elementary, low-level features.
• Often features are computed stage-wise, low-level features first (directly from
data), then stage by stage more complex, more directly task-solving, more
“high-level cognition” features are built by combining the lower-level ones.
Feature hierarchies are often found in ML systems. Example: in face recog-
nition from photos, low-level features might extract coordinates of isolated
black dots from the photo (candidates for the pupils of the person’s eyes); in-
termediate features might give distance ratios between eyes, nose-tip, center-
of-mouth; high-level features might indicate gender or age. Such feature hier-
archies are implicit in feedforward neural networks. Each neuron in the first
hidden layer can be seen as computing a feature from the input patterns;
each neuron in the next hidden layer computes next-higher-level features
that are computed from the first-level features, etc.
Figure 15: Some examples from the Digits dataset.
Mean brightness. f1 (x) = 1n′ x / n (1n is the vector of n ones). This is just the
mean brightness of all pixels. Might be useful e.g. for distinguishing “1”
images from “8” images because we might suspect that for drawing an “8”
one needs more black ink than for drawing a “1”. Cheap to compute but
not very relevant for the task of digit classification.
πj′ x. We might hope that f3j has a high value for patterns of class j and low
values for other patterns.
rated. In fact, for two decades the classical ML approach to speech recognition
relied on low-level acoustic features that had been hand-designed by insightful
phonologists. The MP3 sound coding format is based on features that reflect char-
acteristics of the human auditory system. Many of the first functional computer
vision and optical character recognition systems relied heavily on visual feature
hierarchies which grew from the joint efforts of signal processing engineers and
cognitive neuroscience experts.
However, since hand-designing good features requires good insight on the side of
the engineer, and good engineers are rare and have little time, the practice of ML
today relies much more on features that are obtained from learning algorithms.
Numerous methods exist. In the following subsections we will inspect three such
methods.
Given: a training data set (xi )i=1,...,N ∈ Rn , and a number K of clusters that one
maximally wishes to obtain.
Initialization: randomly assign the training points to K sets Sj (j = 1, . . . , K).
Figure 16: Clusters obtained from K-means clustering (schematic): For a training
set of data points (light blue dots), a spatial grouping into clusters Cj is determined
by the K-means algorithm. Each cluster becomes represented by a codebook vector
(dark blue crosses). The figure shows three clusters. The light blue straight lines
mark the cluster boundaries. A test data point xtest (red cross) may then become
coded in terms of the distances αj of that point to the codebook vectors. Since
this xtest falls into the second cluster C2 , it could also be compressed into the
codebook index “2” of this cluster.
Repeat: For each set Sj , compute the mean µj = |Sj |−1 Σ_{x∈Sj} x. This mean
vector µj is the “center of gravity” of the vector cluster Sj . Create new sets Sj′ by
putting each data point xi into that set Sj′ for which ‖xi − µj ‖ is minimal. If some
Sj′ remains empty, dismiss it and reduce K to K′ by subtracting the number of
dismissed empty sets (this happens rarely). Put Sj = Sj′ (for the nonempty sets)
and K = K′.
Termination: Stop when in one iteration the sets remain unchanged.
will not increase. The algorithm typically converges quickly and works well in
practice. It finds a local minimum or saddle point of J. The final clusters Sj
may depend on the random initialization. The clusters are bounded by straight-
line boundaries; each cluster forms a Voronoi cell. K-means cannot find clusters
defined by curved boundaries. Figure 17 shows an example of a clustering run
using K-means.
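A minimal numpy implementation of this procedure (a sketch, not tuned for efficiency; the function and variable names are mine) might look as follows. The final comment shows the distance-based encoding of a test point sketched in Figure 16.

import numpy as np

def kmeans(X, K, rng=np.random.default_rng(0)):
    """K-means clustering as described above.
    X: N x n array of data points. Returns the codebook vectors (cluster
    means) and the cluster index assigned to each data point."""
    assign = rng.integers(0, K, size=X.shape[0])      # random initial assignment
    while True:
        # means of the current nonempty clusters (empty clusters are dropped)
        mus = np.array([X[assign == j].mean(axis=0) for j in np.unique(assign)])
        # re-assign every point to its nearest cluster mean
        dists = np.linalg.norm(X[:, None, :] - mus[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):        # sets unchanged: stop
            return mus, assign
        assign = new_assign

# encoding a test point by its distances to the codebook vectors (Figure 16):
# alphas = np.linalg.norm(mus - x_test, axis=1)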
K-means clustering and other clustering methods have many uses besides di-
mension reduction. Clustering can also be seen as a stand-alone technique of
unsupervised learning. The detected clusters and their corresponding codebook
Figure 17: Running K-means with K = 3 on two-dimensional training points.
Thick dots mark cluster means µj , lines mark cluster boundaries. The algorithm
terminates after three iterations, whose boundaries are shown in light gray, dark
gray, red. (Picture taken from Chapter 10 of the textbook Duda, P. E. Hart, and
Stork 2001).
vectors are of interest in their own right. They reveal a basic structuring of a set
of patterns {xi } into subsets of mutually similar patterns. These clusters may
be further analyzed individually and given meaningful names, helping a human
data analyst to make useful sense of the original unstructured data cloud. For
instance, when the patterns {xi } are customer profiles, finding a good grouping
into subgroups may help to design targeted marketing strategies.
ear functions and when the similarity x ≈ d ◦ f (x) is measured by mean square
error, the optimal solution for f and d can be easily and cheaply computed by
a method that has been known since the early days of statistics, principal component
analysis (PCA). It was first found, in 1901, by Karl Pearson, one of the fathers
of modern mathematical statistics. The same idea has been independently re-
discovered under many other names in other fields and for a variety of purposes
(check out https://en.wikipedia.org/wiki/Principal_component_analysis
for the history). Because of its simplicity, analytical transparency, modest compu-
tational cost, and numerical robustness PCA is widely used — it is the first-choice
default method for dimension reduction that is tried almost by reflex, before more
elaborate methods are maybe considered.
PCA is best explained alongside a visualization (Figure 18). Assume the
patterns are 3-dimensional vectors, and assume we are given a sample of N = 200
raw patterns x1 , . . . , x200 . We will go through the steps of computing a PCA for
this demo dataset.
Figure 18: Visualization of PCA. A. Centered data points and the first principal
component vector u1 (blue). The origin of R3 is marked by a red cross. B. Pro-
jecting all points to the orthogonal subspace of u1 and computing the second PC
u2 (green). C. Situation after all three PCs have been determined. D. Summary
visualization: the original data cloud with the three PCs and an ellipsoid aligned
with the PCs whose main axes are scaled to the standard deviations of the data
points in the respective axis direction. E. A new dimension-reduced coordinate
system obtained by the projection of data on the subspace Um spanned by the m
first PCs (here: the first two).
The first step in PCA is to center the training patterns xi , that is, subtract their
mean µ = 1/N Σ_{i} xi from each pattern, obtaining centered patterns x̄i = xi − µ.
The centered patterns form a point cloud in Rn whose center of gravity is the
origin (see Figure 18A).
This point cloud will usually not be perfectly spherically shaped, but instead
extend in some directions more than in others. “Directions” in Rn are character-
ized by unit-norm “direction” vectors u ∈ Rn . The distance of a point x̄i from the
origin in the direction of u is given by the projection of x̄i on u, that is, the inner
product u′x̄i (see Figure 19).
Figure 19: Projecting a point x̄i on a direction vector u: the inner product u′x̄i
(length of the green vector) is the distance of x̄i from the origin along the direction
given by u.
Notice that since the cloud x̄i is centered, the mean of all u′x̄i is zero, and
hence the number 1/N Σ_{i} (u′x̄i )² is the variance of the numbers u′x̄i . The first
PC is the unit vector u1 pointing in the direction of maximal variance, that is,
u1 = argmax_{u, ‖u‖=1} 1/N Σ_{i} (u′x̄i )².
Inspecting Figure 18A, one sees how u1 points in the “longest” direction of the
pattern cloud. The vector u1 is called the first principal component (PC) of the
centered point cloud.
Next step: project patterns on the (n − 1)-dimensional linear subspace of Rn
that is orthogonal to u1 (Figure 18B). That is, map pattern points x̄ to x̄∗ =
x̄ − (u1′ x̄) · u1 . Within this “flattened” pattern cloud, again find the direction
vector of greatest variance

    u2 = argmax_{u, ‖u‖=1} 1/N Σ_{i} (u′x̄∗i )²
and call it the second PC of the centered pattern sample. From this procedure it
is clear that u1 and u2 are orthogonal, because u2 lies in the orthogonal subspace
of u1 .
Now repeat this procedure: In iteration k, the k-th PC uk is constructed by
projecting pattern points to the linear subspace that is orthogonal to the already
computed PCs u1 , . . . , uk−1 , and uk is obtained as the unit-length vector pointing
in the “longest” direction of the current (n − k + 1)-dimensional pattern point
distribution. This can be repeated until n PCs u1 , . . . , un have been determined.
They form an orthonormal coordinate system of Rn . Figure 18C shows this situ-
ation, and Figure 18D visualizes the PCs plotted into the original data cloud.
Now define features fk (where 1 ≤ k ≤ n) by

    fk (x̄) = uk′ x̄,

that is, fk (x̄) is the projection component of x̄ on uk . Since the n PCs form an
orthonormal coordinate system, any point x ∈ Rn can be perfectly reconstructed
from its feature values by

    x = µ + Σ_{k=1,...,n} fk (x) uk .     (29)
The PCs and the corresponding features fk can be used for dimension reduction
as follows. We select the first (“leading”) PCs u1 , . . . , um up to some index m.
Then we obtain a feature map

    f : Rn → Rm,  x ↦ (f1 (x), . . . , fm (x))′ = Um′ (x − µ),     (30)

where Um = [u1 , . . . , um ], and a corresponding decoding map

    d : Rm → Rn,  z ↦ µ + Um z.     (31)
How “good” is this dimension reduction, that is, how similar are the original
patterns xi to their reconstructions d ◦ f (xi )? When dissimilarity of two patterns
x1 , x2 ∈ Rn is measured in the square error sense by
δ(x1 , x2 ) := ‖x1 − x2 ‖²,
and σk² = 1/N Σ_{i} fk (xi )² denotes the variance of the feature values fk (xi ) (notice
that the mean of the fk (xi ), taken over all patterns, is zero, so σk² is indeed their
variance), then the mean square distance between patterns and their reconstructions is

    1/N Σ_{i} ‖xi − d ◦ f (xi )‖² = Σ_{k=m+1,...,n} σk² .     (32)
2. The feature variances σ1², . . . , σn² are the eigenvalues associated with these eigenvectors.
Computing a set of unit-norm eigenvectors and eigenvalues from C can be
most conveniently done by computing the singular value decomposition (SVD) of
C. Algorithms for computing SVDs of arbitrary matrices are shipped with all
numerical or statistical mathematics software packages, like Matlab, R, or Python
with numpy. At this point let it suffice to say that every covariance matrix C is a
so-called positive semi-definite matrix. These matrices have many nice properties.
Specifically, their eigenvectors are orthogonal and real, and their eigenvalues are
real and nonnegative.
In general, when an SVD algorithm is run on an n-dimensional positive semi-
definite matrix C, it returns a factorization
C = U Σ U′,
where U is an n×n matrix whose columns are the normed orthogonal eigenvectors
u1 , . . . , un of C and where Σ is an n × n diagonal matrix which has the eigenvalues
λ1 , . . . , λn on its diagonal. They are usually arranged in descending order. Thus,
computing the SVD of C = U Σ U′ directly gives us the desired PC vectors uk ,
lined up in U , and the variances σk2 , which appear as the eigenvalues of C, collected
in Σ.
This enables convenient control over how good an approximation one wants
to ensure. For example, if one wishes to preserve 98% of the variance information
from the original patterns, one can use the r.h.s. of (33) to determine the “cutoff”
m such that the ratio in this equation is about 0.02.
Procedure.
Step 1. Compute the pattern mean µ and center the patterns to obtain
a centered pattern matrix X̄ = [x̄1 , . . . , x̄N ].
Step 2. Compute the SVD U Σ U′ of C = 1/N X̄ X̄′ and keep from U
only the first m columns, making for an n × m matrix Um .
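In numpy, the whole procedure (together with the projection on the leading PCs and the reconstruction (29), truncated to m PCs) fits in a few lines. This is only a sketch; function and variable names are mine.

import numpy as np

def pca_reduce(X, m):
    """PCA dimension reduction. X: n x N matrix, one pattern per column.
    Returns the mean, the n x m matrix Um of leading PCs, the feature
    variances sigma_k^2, and the m x N feature matrix F = Um' Xbar."""
    N = X.shape[1]
    mu = X.mean(axis=1, keepdims=True)
    Xbar = X - mu                          # centered patterns
    C = Xbar @ Xbar.T / N                  # covariance matrix
    U, variances, _ = np.linalg.svd(C)     # C = U Sigma U'
    Um = U[:, :m]
    return mu, Um, variances, Um.T @ Xbar

# choosing m so that ~98% of the total variance is preserved:
# _, _, variances, _ = pca_reduce(X, X.shape[0])
# m = np.searchsorted(np.cumsum(variances) / variances.sum(), 0.98) + 1
# reconstruction from the m features (cf. Equation 29): X_rec = mu + Um @ F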
4.6 Eigendigits
For a demonstration of dimension reduction by PCA, consider the “3” digit images.
After reshaping the images into 240-dimensional grayscale vectors and centering
and computing the PCA on the basis of N = 100 training examples, we obtain
240 PCs uk associated with variances σk2 . Only the first 99 of these variances are
nonzero (because the 100 image vectors xi span a 100-dimensional subspace in
R240 ; after centering the x̄i however span only a 99-dimensional subspace – why?
homework exercise! – thus the matrix C = 1/N X̄ X̄′ has rank at most the rank
of X̄, which is 99), thus only the first 99 PCs are usable. Figure 20 A shows
some of these eigenvectors ui rendered as 15×16 grayscale images. It is customary
to call such PC re-visualizations eigenimages, in our case “eigendigits”. (If you
have some spare time, do a Google image search for “eigenfaces” and you will find
weird-looking visualizations of PC vectors obtained from PCA carried out on face
pictures.)
Figure 20 B shows the variances σi2 of the first 99 PCs. You can see the rapid
(roughly exponential) decay. Aiming for a dissimilarity ratio (Equation 33) of 0.1
gives a value of m = 32. Figure 20 C shows the reconstructions of some “3”
patterns from the first m PC features using (31).
Figure 20: A. Visualization of a PCA computed from the “3” training images.
Top left panel shows the mean µ, the next 7 panels (row-wise) show the first 7
PCs. Third row shows PCs 20–23, last row PCs 96-99. Grayscale values have
been automatically scaled per panel such that they spread from pure white to
pure black; they do not indicate absolute values of the components of PC vectors.
B. The (log10 of) variances of the PC features on the “3” training examples. C.
Reconstructions of digit “3” images from the first m = 32 features, corresponding
to a re-constitution of 90% of the original image dataset variance. First row: 4
original images from the training set. Second row: their reconstructions. Third
row: 4 original images from the test set. Last row: their reconstruction.
inspired models continue to be explored. Furthermore, the SOM model is simple
and intuitive. I think all of this is reason enough to include it in this lecture.
The learning task for SOMs is defined as follows (for the case of m = 2):
Four comments: (i) In classical SOM models, input patterns x are always as-
sumed to have unit norm. (ii) Here I use a grid with a rectangular neighborhood
structure. Classical SOM papers and many applications of SOMs use a hexagonal
neighborhood structure instead, where each neuron has 6 neighbors, all at the same
distance. (iii) The “learning objective” sounds vague. It is. At the time when
Kohonen introduced SOMs, tailoring learning algorithms along well-defined loss
functions was not standard. Kohonen’s modeling attitude was oriented toward
biological modeling, and a mathematical analysis of the SOM algorithm was not
part of his agenda. In fact, the problem of finding a loss function which is minimized by the orig-
inal SOM algorithm was still unsolved in the year 2008 (Yin 2008). I don’t know
what the current state of research in this respect is — I would guess not much has
happened since. (iv) The learning task is impossible to solve. One cannot map the
n-dimensional pattern space Rn to the lower-dimensional (even just 2-dimensional)
space Rm while preserving neighborhood relations. For a graphical illustration of
this impossibility, consider the case where the patterns are 3-dimensional and are
uniformly spread in the unit cube [0, 1] × [0, 1] × [0, 1]. Then, in order to let every
grid neuron have about the same number of pattern points which are mapped on it
(condition 2), the SOM learning task would require that the 2-dimensional neuron
grid — think of it as a large square sheet of paper with the gridlines printed on
it — becomes “wrinkled up” in the 3-dimensional cube such that in every place in
the cube the surrounding “neural sheet density” is about the same (Figure 21).
When this condition 2 is met (as in the figure), condition 1 is necessarily violated:
there will be points in the high-dimensional space which are close to each other
but which will become mapped to grid neurons that are far from each other on
the grid. Thus, SOM training always means finding a compromise.
Figure 21: Trying to uniformly fill a cube volume with a 2-dimensional grid sheet
will invariably lead to some points in the cube which are close to two (or more)
“folds” of the sheet. That is, points that are far away from each other in the
2-dimensional sheet (here for instance: points on the red gridline vs. points on the
blue gridline segment) will be close to each other in the 3-dimensional cube space;
or stated the other way round: some points that are close in the 3D cube will
become separated to distant spots on the neuron grid. “Crumpled grid” graphic
taken from Obermayer, Ritter, and Schulten 1990.
In a trained SOM, each grid neuron vkl is associated with a weight vector
w(vkl ) ∈ Rn . This weight vector represents a point in the space Rn from which
the training patterns xi came. One could say that w(vkl ) ∈ Rn “locates” the grid
neuron vkl in the input pattern space (in Figure 21, think of every gridline crossing
point as a location of a grid neuron vkl ; the 3-dimensional position of this crossing
point in the cube volume gives the weight vector w(vkl )).
The weight vectors associated with the grid neurons are the basis for defining
the function κ : P → V which assigns a grid neuron to every pattern x ∈ P, and
also to any new test pattern. If some pattern x ∈ Rn is presented to the trained
SOM, its κ-image is computed by
In words, the neuron v whose weight vector w(v) best matches the input
pattern x is chosen. “Best matching” here means: maximal inner product w′x.
In other variants of an SOM, one uses minimal metric distance as the “best matching”
criterion, i.e. instead of w′(vkl ) x one would write ‖w(vkl ) − x‖ in Equation (34).
Kohonen used the “best matching” criterion (34) in his classical papers. In the
SOM literature the neuron determined by (34) is called the best matching unit
(BMU) for pattern x. Clearly the map κ is determined by the weight vectors
w(v). In order to train them on the basis of the training data set P, the basic
SOM learning algorithm works as follows:
Initialization: The weights w(vkl ) are set to small random values.
This mechanism has the effect that after convergence, the training dataset P
will be covered by grid neurons rather evenly (see objective nr. 2 stated above).
To get an intuition why this is so, let us consider a specific scenario. Assume that
the pattern set P contains a dense cluster of mutually quite similar patterns x,
besides a number of other, dissimilar patterns. Furthermore assume that we are in
an early stage of the learning process, where the radius r is still rather large, and
also assume that at this early stage of learning, each pattern from the cluster
yields the same BMU v0 . Due to the large number of members in the cluster,
patterns from that cluster will be drawn by the learning algorithm rather often.
With r large, this will have the effect that neurons in the wider neighborhood of
the BMU v0 will grow their weight vectors toward w(v0 ). After some time, v0
will be surrounded by grid neurons whose weight vectors are all similar to w(v0 ),
and w(v0 ) will roughly be the mean of all patterns x in the cluster. Now, some
patterns x0 in the cluster will start to have as their BMU not v0 any longer, but
some of its surrounding neighbors (why?). As a consequence, increasingly many
patterns in the cluster will best-match the weight vectors of an increasing number
of neighbors of v0 : the subpopulation of grid neurons which respond to cluster
patterns has grown from the singleton population {v0 } to a larger one. This
growth will continue until the population of grid neurons responding to cluster
patterns has become so large that each member’s BMU-response-rate has become
too low to further drive this population’s expansion.
The radius r is set to large values initially in order to let all (or most) patterns in
P compete with each other, leading to a coarse global organization of the emerging
map κ. In later iterations, increasingly smaller r leads to a fine-balancing of the
BMU responses of patterns that are similar to each other.
SOM learning algorithms come in many variations. I sketched an arbitrary
exemplar. The core idea is always the same. Setting up a SOM learning algorithm
and tuning it is not always easy – the weight initialization, the decrease schedule
for r, the learning rate, the random sampling strategy of training patterns from
P, the grid dimension (2 or 3 or even more... 2 is the most common choice),
or the pattern preprocessing (for instance, normalizing all patterns to the same
norm) are all design decisions that can have a strong impact on the convergence
properties and final result quality of the learning process.
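For concreteness, here is a compact sketch of one such variant (not the exact one used for the figures below): Euclidean-distance BMU selection, a Gaussian neighborhood function on a rectangular grid, and linearly shrinking radius and learning rate. All schedules and constants are illustrative choices of mine.

import numpy as np

def train_som(P, grid=(8, 8), iters=20000, rng=np.random.default_rng(0)):
    """Basic SOM training on patterns P (N x n array). Returns the learnt
    weight vectors (one per grid neuron) and the neurons' grid coordinates."""
    gx, gy = grid
    coords = np.array([(k, l) for k in range(gx) for l in range(gy)])
    W = rng.normal(scale=0.01, size=(gx * gy, P.shape[1]))   # small random weights
    for t in range(iters):
        frac = t / iters
        r = 0.5 + 0.5 * max(gx, gy) * (1 - frac)    # shrinking neighborhood radius
        eta = 0.01 + 0.5 * (1 - frac)               # shrinking learning rate
        x = P[rng.integers(len(P))]                 # draw a random training pattern
        bmu = np.argmin(np.linalg.norm(W - x, axis=1))        # best matching unit
        d = np.linalg.norm(coords - coords[bmu], axis=1)      # grid distances to BMU
        h = np.exp(-d ** 2 / (2 * r ** 2))          # neighborhood efficacy f_r(d)
        W += eta * h[:, None] * (x - W)             # pull weights toward x
    return W, coords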
Yin 2008 includes a brief survey of SOM algorithm variants. I mention in
passing an algorithm which has its roots in SOMs but is significantly different: The
Neural Gas algorithm (good brief intro in https://en.wikipedia.org/wiki/
Neural_gas), like SOMs, leads to a collection of neurons which respond to patterns
from a training set through trainable weight vectors. The main difference is that
the neurons are not spatially arranged on a grid but are spatially uncoupled from
each other (hence, neural “gas”). The spatially defined distance d appearing in
the adaptation efficacy term fr (d(vkl , vBMU )) is replaced by a rank ordering: the
neuron with the best response to training pattern x (i.e., the BMU) is adapted
most, the unit v with the second best response (i.e., second largest value of w(v)′x)
is adapted second most strongly, etc.
For a quick SOM demo I used a version of a legacy Matlab toolbox published
by Kohonen’s own research group (I downloaded it 20 years ago). As a pattern
dataset P I used the Digits dataset that I also used before in this section. I used
100 examples from each digit class. Figure 23 shows the result of training an 8 × 8
neuron grid on this dataset. As expected, the ten digit classes become represented
each by approximately the same number of grid neurons, reflecting the fact that
the classes were represented in P in equal shares.
Figure 23: Result of training an 8 × 8 SOM grid on the balanced Digits dataset:
each grid cell is labelled with the digit class to which its neuron responds most
strongly.
I also generated an unbalanced pattern set that had 180 examples of class “5”
and 20 examples of every other class. The result is shown in Figure 24. As it should
be, roughly half of the SOM neurons are covering the “5” class examples, while the
other SOM neurons reach out their weight vectors into the remaining classes. This
is a desirable effect: SOM training on the basis of unbalanced datasets will lead
to a higher resolution for the pattern sorts that occur more often in the dataset
(more grid neurons covering the more often found sorts of data points).
Practical uses of SOMs appear to have been mostly 2-dimensional visualiza-
tions for exploring (labelled) high-dimensional datasets in two-dimensional graph-
Figure 24: Similar to previous figure, but with an unbalanced pattern set where
the number of “5” examples was the same as the total number of all other classes.
ics. Specifically, such SOM visualizations can give insight into metric similarities
between pattern classes. For instance, inspecting the bottom right 3 × 3 panels in
Figure 23, one finds “morphable” similarities between certain versions of “5” and
“9” patterns. For the fun of it, in Figure 25 you find how a 1-dimensional SOM
curls up in a 2-dimensional pattern space.
What I find more relevant and interesting about SOMs is their use in neuro-
science. Learning mechanisms similar to the SOM learning algorithm have been
(and are being) invoked to explain the 2-dimensional spatial organization of cor-
tical maps. They can explain how neurons on a surface patch of the cortical sheet
align their specific responsiveness to high-dimensional input (from sensors or other
brain areas) with their local 2-dimensional metric neighborhood. Pictures which
put synthetic SOM grids side by side with images recorded from small patches of
cortical surface have been published on various occasions. Figure 26 gives an example from
a study by Swindale and Bauer 1998.
If you are interested in such themes, I can recommend the recent review of
Bednar and Wilson 2016 on cortical maps as a starting point.
Figure 25: A 1-dim SOM trying to be a true representation of a 2-dim image.
Figure taken from https://eintrittverboten.wordpress.com/2012/10/04/kohonen-
art/.
Figure 26: The SOM algorithm reproducing biological cortical response patterns.
The scenario: an anaesthetized but eye-open ferret is shown moving images of a
bar with varying orientation and direction of movement, while response activity
from neurons on a patch (about 10 square mm) of visual cortex is recorded. A.
A color-coded recording of neural response activity depending on the orientation
of the visually presented bar. For instance, if the bar was shown in horizontal
orientation, the neurons who responded to this orientation with maximal activity
are rendered in red. B. Like in panel A., but for the motion direction of the bar
(same cortical patch). C., D. Artificial similes of A., B. generated with a SOM
algorithm. Figures taken from Swindale and Bauer 1998, who in turn took the
panels A. and B. from Weliky, Bosking, and Fitzpatrick 1996.
Model reduction is a term used when it comes to trim down not just static
“data points” but entire dynamical “system models”. All branches of sci-
ence and engineering today deal with models of complex physical dynamical
systems which are instantiated as systems of coupled differential equations
— often millions, sometimes billions ... or more ... of them. One obtains
such gargantuan systems of coupled ODEs almost immediately when one
discretizes system models expressed in partial differential equations (PDEs)
or so-called finite-element models (used by engineers to model complex me-
chanical systems, like skyscrapers or airplane airframes). Such systems
cannot be numerically solved on today’s computing hardware and need to
be dramatically shrunk before a simulation can be attempted. That is, one
must replace a system model made from (say) millions of ODE’s by another
system model made of only a few hundred ODE’s. The reduced-size system,
when simulated on a computer, should still give very much the same dynam-
ical responses to external stimulation as the full-scale system. I once had
an office across the corridor from the office of Anastasios Antoulas. He was
a model reduction expert whose expertise was sought by actual skyscraper
building companies who needed mathematical simulations of their planned
supertowers, in particular how they responded to earthquake and storm (ob-
viously, they should be demonstrated not to break, and that in turn boils
down to showing that they will not start swinging / oscillating / vibrating
when externally “perturbed”). I am not familiar with the mathematics of
this field, but what I understood from Anastasios’ explanations is that these
model reductions ultimately boil down to a linear algebra exercise whose
core is again the hero of practical linear algebra, the singular value decom-
position (SVD). I mention this field for completeness and because the name
“model reduction” invites analogies with dimension reduction, although the
latter is for static data points and model reduction is for entire dynamical
models. But there is a common mathematical core. Antoulas and Sorensen
2001 give a tutorial overview with instructive examples.
In this section we took a look at three methods for dimension reduction of high-
dimensional “raw” data vectors, namely K-means clustering, PCA, and SOMs.
While at first sight these methods appear quite different from each other, there is
a unifying view which connects them. In all three of them, the reduction was done
by introducing a comparatively simple kind of geometric object in the original
high-dimensional pattern space Rn , which was then used to re-express raw data
points in a lightweight encoding:
1. In K-means-clustering, this object is the set of codebook vectors, which can
be used to compress a test data point to the mere natural number index
of its associated codebook vector; or which can be used to give a reduced
K-dimensional vector comprised of the distances αj to the codebook vectors.
2. In PCA, this object is the m-dimensional (affine) linear subspace spanned
by the first m eigenvectors of the data covariance matrix. An n-dimensional
test point is represented by its m coordinates in this subspace.
Thus, the SOM shares properties with both K-means clustering and PCA. In
fact, one can systematically explore a whole spectrum of dimension reduction /
data compression algorithms which are located between K-means clustering and
PCA, in the sense that they describe m-dimensional manifolds of different degrees
of nonlinearity through codebook vectors. K-means clustering is the extreme case
that uses only codebook vectors and no manifolds; PCA is the other extreme with
only manifolds and no codebook vectors; and SOMs are somewhere in between.
The extensive Preface to the collection volume Principal Manifolds for Data Vi-
sualization and Dimension Reduction (Gorban et al. 2008) gives a readable intro
to this interesting field.
In today’s deep learning practice one often ignores the traditional methods
treated in this section. Instead one immediately fires the big cannon, training a
deep neural network wired up in an auto-encoder architecture. An autoencoder
network is a multilayer feedforward network whose output layer has the same large
dimension n as the input layer. It is trained in a supervised way, using training
data (xi , yi ) to approximate the identity function: the training output data yi are
identical to the input patterns xi (possibly up to some noise added to the inputs).
The trick is to insert a “bottleneck” layer with only m ≪ n neurons into the
layer sequence of the network. In order to achieve a good approximation of the
n-dimensional identity map on the training data, the network has to discover an
n → m-dimensional compression mapping which preserves most of the information
that is needed to describe the training data points. I will not give an introduc-
tion to autoencoder networks in this course (it’s a topic for the “Deep Learning”
course given by Matthia Sabatelli). The Deep Learning standard reference of I.
Goodfellow, Bengio, and Courville 2016 has an entire section on autoencoders.
5 Discrete symbolic versus continuous real-valued
I hope this section will be as useful as it will be short and simple. Underneath
it, however, lurks a mysterious riddle of mathematics, philosophy and the neuro-
sciences.
Some data are given in symbolic form, for instance
• financial data,
fields of mathematics arise from crossover formalisms between the Discrete and
the Continuous. The fundamental difference between the two is not dissolved in
these theories, but the tension between the Discrete and the Continuous sets free
new forms of mathematical energy. Sadly, these lines of research are beyond what
I understand and what I can explain (or even name), and certainly beyond what
is currently used in machine learning.
The hiatus (an educated word of Latin origin, meaning “dividing gap”) between
the Discrete and the Continuous is also the source of one of the great unresolved
riddles in the neurosciences, cognitive science and AI: how can symbolic reasoning
(utterly discrete) emerge from the continuous matter and signal processing in
our material brains (very physical, very continuous)? This question has kept AI
researchers and philosophers busy (and sometimes aggressively angry with one
another) for 5 decades now and is not resolved; if you are interested, you can get
a first flavor in the Wikipedia article on “Physical symbol system” or by reading
up on the overview articles listed in http://www.neural-symbolic.org/.
Back to our down-to-earth business. Machine learning formalisms and al-
gorithms likewise are often either discrete-flavored or continuous-flavored. The
former feed on symbolic data and create symbolic results, using tools like de-
cision trees, Bayesian networks and graphical models (including hidden Markov
models), inductive logic, and certain sorts of neural networks where neurons have
0-1-valued activations (Hopfield networks, Boltzmann machines). The latter di-
gest vector data and generate vector output, like neural networks, support vector
machines and various sorts of regression learning “machines”.
The great built-in advantage of discrete formalisms is that they often lend
themselves well to explainable, human-understandable solutions. Their typical
disadvantage is that learning or inference algorithms are often based on combina-
torial search, which quickly lets computing times explode. In contrast, continuous
formalisms typically lead to results that cannot be intuitively interpreted – vec-
tors don’t talk – but lend themselves to nicely, smoothly converging optimization
algorithms.
When one speaks of “machine learning” today, one mostly has vector processing
methods in mind. This RUG course, too, focuses on vector data. The discrete
strands of machine learning are more associated with what one often calls “data
mining”. This terminology is however not clearly defined (see Section 1.5).
Sometimes one has vector data but wants to exploit benefits that come with
discrete methods, or conversely, one has symbolic data and wants to use a neural
network (because everybody else seems to be using them, or because one doesn’t
want to fight with combinatorial explosions). Furthermore, many an interesting
dataset comes as a mix of symbolic-discrete and numerical-continuous data – for
instance, data originating from questionnaires or financial/business/admin data
often are mixed-sort.
Then, one way to go is to convert discrete data to vectors or vice versa. It is a
highly empowering professional skill to know about basic methods of discrete ↔
continuous conversions.
Here are some discrete-to-continuous transformations:
One-hot encodings. Given: data points aν that are symbols from a finite “al-
phabet” A = {a1 , . . . , ak }. Examples: yes/no answers in a questionnaire;
words from a vocabulary; nucleic acids A, C, G, T occurring in DNA. Turn
each aν into the k-dimensional binary vector vν ∈ {0, 1}^k which is zero every-
where except at position ν, where it is 1. This is a very common way to present symbolic
input to neural networks. On the output side of a neural network (or any
other regression learning machine), one-hot encodings are also often used to
give vector teacher data in classification tasks: if (xi , ci ) is a classification-
task training dataset, where ci is a symbolic class label from a class set
C = {c1 , . . . , ck }, transform each ci to its k-dimensional one-hot vector vi
and get a purely vector-type training dataset (xi , vi ).
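In code, one-hot encoding is a two-line affair; the following sketch (function name and toy data are mine) turns a list of symbols over a known alphabet into a matrix of one-hot row vectors.

import numpy as np

def one_hot(symbols, alphabet):
    """Return a len(symbols) x len(alphabet) binary matrix whose i-th row is
    the one-hot encoding of symbols[i] over the given alphabet."""
    index = {a: nu for nu, a in enumerate(alphabet)}
    V = np.zeros((len(symbols), len(alphabet)))
    for i, s in enumerate(symbols):
        V[i, index[s]] = 1.0
    return V

# example: class labels 0..9 of a digit classification task
V = one_hot([3, 1, 4, 1, 5], alphabet=list(range(10)))   # 5 x 10 binary matrix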
A = {certainly not, rather not, don’t know, rather yes, certainly yes}.
texts. A large collection of English texts is processed, collecting statistics
about similar sub-phrases in those texts that differed only in the two words
whose similarity one wished to assess (plus, there was another trick: can
you think of an important improvement of this basic idea?). The current
de-facto standard algorithm for learning a vector embedding for words is
documented in Devlin et al. 2018.
Multi-dimensional discretization by hierarchical refinement. If one wants
to discretize a set {xi } of n-dimensional vectors, one has to split the n-
dimensional volume which contains the points {xi } into a finite set of dis-
crete regions Rν . A common approach is to let these regions be n-dimensional
hypercubes. By a process of hierarchical refinement one constructs these re-
gions Rν such that in areas where there is a higher point density, or in areas
where there is much fine-grained information encoded in the local point dis-
tribution, one uses smaller hypercubes to increase resolution. This leads to
a tree structure, which in the case n = 2 is called a quadtree (because every
non-leaf node has four children) and in the case n = 3 an octree of hierar-
chically nested hypercubes. This tree structure enables a computationally
efficient indexing of the hypercubes. The left panel in Figure 27 shows an
example. The Wikipedia article on quadtrees is quite instructive. One may
also opt for regions Rν of other polygonal shape (see right panel in Figure
27). There are many such mesh refinement methods, with task-specific opti-
mization criteria. They are not typically used in machine learning but rather
in methods for simulating spatiotemporal (fluid) dynamics by numerically
solving PDEs. Still, it is good to know that such methods exist.
Vector quantization. Using K-means or other clustering algorithms, the vector
set {xi } is partitioned into k cells whose center of gravity vectors are indexed
and the indices are used as symbolic encodings of the {xi }. We have seen this
in Section 4.2. This is a typical machine learning method for discretization.
Turning neural dynamics into symbol sequences. When we (I really mean
“we”, = us humans!) speak or write, the continuous-time, continuous-valued
neural brain dynamics leads to a discrete sequence of words. Somehow, this
symbolic sequence is encoded in the “subsymbolic”, continuous brain dy-
namics. It is unknown how, exactly, this encoding is realized. Numerous
proposals based on nonlinear dynamical systems theory have been made.
This is an area of research in which I am personally engaged. If you are in-
terested: some approaches are listed in Durstewitz, Seamans, and Sejnowski
2000, Pascanu and Jaeger 2011, Fusi and Wang 2016. In machine learning,
the problem of transforming continuous-valued neural state sequences to se-
quences of words (or letters) arises in applications like speech recognition
(“speech-to-text”) or gesture recognition. Here the most common solution
(which is not necessarily biologically plausible) is to use the core neural net-
work to generate a sequence of continuous-valued hypothesis output vectors
with as many components as there are possible target symbols. At each
point in time, the numerical value of each vector component reflects a cur-
rent “degree of belief” about which symbol should be generated. With some
postprocessing mechanism (not easy to set up), this hypothesis stream is
denoised and turned into a symbol sequence, for instance by selecting at
each point in time the symbol that has the largest degree of belief.
Figure 27: Left: A hierarchical hypercube mesh for a 2-dimensional dataset
consisting of points homogeneously distributed inside a circle. Right: adaptive
non-orthogonal meshing, here used in an airflow simulation. Sources: left panel:
http://www.questinygroup.com/tag/quad-tree/; right panel: Luchinsky et al. 2012.
Figure 28: He gave the reason why machine learning works (from https:
//commons.wikimedia.org/wiki/File:Albert_Einstein_-_Colorized.jpg)
question is, which law. In machine learning, this question is known as the problem
of overfitting, or in more educated terms, the bias-variance dilemma. Welcome
to this chapter which is all about this question. For all your practical exploits
of machine learning in your professional future, this is the most important and
enabling chapter in this course.
First stage: dimension reduction by PCA:

1. Center the set of image vectors (x_i^train)_{i=1,...,1000} by subtracting the mean
vector µ, obtaining (x̄_i^train)_{i=1,...,1000}.

2. Assemble the centered image vectors column-wise in a 240 × 1000 matrix X̄.
Compute the covariance matrix C = 1/1000 X̄ X̄′ and factorize C into its
SVD C = U Σ U′. The columns of U are the 240 principal component vectors of C.

3. Decide how strongly you want to reduce the dimension, shrinking it from
n = 240 to m < n. Let Um be the matrix made from the first m columns of U.

4. Project the centered patterns x̄_i^train on the m first principal components,
obtaining m-dimensional feature vectors f_i^train = Um′ x̄_i^train.

Vectorize the class labels by one-hot encoding: Each c_i^train is re-written as
a binary 10-dimensional vector v_i^train which has a 1 entry at the position corresponding
to the class ci . Assemble these vectors column-wise in a 10 × 1000 matrix V.

Compute a linear regression classifier: Assemble the 1000 m-dimensional feature
vectors f_i^train into an m × 1000 matrix F and obtain a 10 × m regression weight
matrix W by

    W′ = (F F′)^{−1} F V′.
Compute the training MSE and training error rate: The training mean square
error (MSE) is given by

    MSE^train = 1/1000 Σ_{i=1,...,1000} ‖v_i^train − W f_i^train‖²

and the training misclassification rate by

    %^train = 1/1000 |{i | maxInd(W f_i^train) − 1 ≠ c_i^train}|,

where maxInd picks the index of the maximal element in a vector. (Note:
the “−1” is owed to the fact that the vector indices range from 1 to 10, while
the class labels go from 0 to 9).
Similarly, compute the test MSE and error rate by

    MSE^test = 1/1000 Σ_{i=1,...,1000} ‖v_i^test − W f_i^test‖²

and

    %^test = 1/1000 |{i | maxInd(W f_i^test) − 1 ≠ c_i^test}|,

where f_i^test = Um′ (x_i^test − µ). Note: for centering the test data, use the mean
µ obtained from the training data!
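Here is a compact numpy sketch of this baseline procedure. Loading the Digits data is not shown; the arrays Xtrain, Xtest (one 240-dimensional image vector per column) and the integer label vectors ctrain, ctest are assumed to be given. Since Python indexing is 0-based, the “−1” offset mentioned above is not needed.

import numpy as np

def pca_linreg_baseline(Xtrain, ctrain, Xtest, ctest, m):
    """PCA feature extraction + linear regression on one-hot targets.
    Returns (MSE, error rate) on the training and on the test set."""
    ctrain, ctest = np.asarray(ctrain), np.asarray(ctest)
    N = Xtrain.shape[1]
    mu = Xtrain.mean(axis=1, keepdims=True)
    Xbar = Xtrain - mu
    U, _, _ = np.linalg.svd(Xbar @ Xbar.T / N)      # PCs of the covariance matrix
    Um = U[:, :m]
    F = Um.T @ Xbar                                 # m x N training features
    V = np.eye(10)[:, ctrain]                       # 10 x N one-hot targets
    W = np.linalg.solve(F @ F.T, F @ V.T).T         # W' = (F F')^-1 F V'

    def evaluate(X, c):
        feats = Um.T @ (X - mu)                     # center with the TRAINING mean!
        H = W @ feats                               # 10 x N hypothesis vectors
        mse = np.mean(np.sum((np.eye(10)[:, c] - H) ** 2, axis=0))
        err = np.mean(H.argmax(axis=0) != c)
        return mse, err

    return evaluate(Xtrain, ctrain), evaluate(Xtest, ctest)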
This is a procedure made from basic linear operations that you should com-
mand even when sleepwalking; with some practice the entire thing should not take
you more than 30 minutes to program and run. Altogether a handy
quick-and-not-very-dirty routine that you should consider carrying out in every
classification task, in order to get a baseline before you start exploring more so-
phisticated methods.
And now, let us draw what is probably the most helpful graphic in these
lecture notes, worth burning into your subconscious. Figure 29 shows these
diagnostics for all possible choices of the number m = 1, . . . , 240 of PC features
used.
Figure 29: Dashed: Train (blue) and test (red) MSE obtained for m = 1, . . . , 240
PCs. Solid: Train (blue) and test (red) misclassification rates. The y-axis is
logarithmic base 10.
This plot visualizes one of the most important issues in (supervised) machine
learning and deserves a number of comments.
• The analogous performance curves for the testing MSE and misclassification
first exhibit a decrease, followed by an increasing tail. The testing misclas-
sification rate is minimal for m = 34.
• This “first decrease, then increase” behavior of testing MSE (or classifica-
tion rate) is always observed in supervised learning tasks when models are
compared which have growing degrees of data fitting flexibility. In our digit
example, this increase in flexibility was afforded by growing numbers of PC
features, which in turn gave the final linear regression a richer repertoire of
feature values to combine into the hypothesis vectors.
• The increasing tail of testing MSE (or classification rate) is the hallmark
of overfitting. When the learning algorithm admits too much flexibility, the
resulting model can fit itself not only to what is “lawful” in the training data,
but also to the random fluctuations in the training data. Intuitively and
geometrically speaking, a learning algorithm that can shape many degrees
of freedom in its learnt models allows the models to fold in curls and wiggles
to accommodate the random whims of the training data. But then, the random
curls and wiggles of the learnt model will be at odds with fresh testing data.
Figure 30: An example of training data (red squares) obtained from a noisy ob-
servation of an underlying “correct” function sin(2 π x) (dashed blue line).
p(x) = w0 + w1 x + · · · + wm x^m .     (36)
Figure 31: Fitting polynomials (green lines) for polynomial orders 1, 3, 10 (from
left to right).
If we compute the MSE’s for the three orders m = 1, 3, 10, we get MSEtrain =
0.4852, 0.0703, 0.0000 respectively. Some observations:
• If we increase the order m, we get increasingly lower MSEtrain .
• For m = 3, we get a polynomial that hits our target sine apparently quite
well.
• For m = 10, we get a polynomial that perfectly matches the training data,
but apparently misses the target sine function (overfitting).
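These observations are easy to reproduce with a few lines of numpy. The exact numbers will differ from those quoted above, since the training sample of Figure 30 is not reproduced here and a fresh noisy sample is generated.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 11)                                     # training inputs
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)     # noisy observations

x_test = rng.uniform(0, 1, 200)                               # fresh test inputs
y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.normal(size=200)

for m in (1, 3, 10):
    w = np.polyfit(x, y, deg=m)                               # order-m polynomial fit
    mse_train = np.mean((np.polyval(w, x) - y) ** 2)
    mse_test = np.mean((np.polyval(w, x_test) - y_test) ** 2)
    print(m, round(mse_train, 4), round(mse_test, 4))         # watch the train/test gap grow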
Figure 32: Estimating a pdf from 6 data points. Model flexibility grows from left
to right. Note the different scalings of the z-axis: the integral of the pdf is 1 in
each of the three cases.
presumably underfitting. When the flexibility is too large, each individual training
point can be “lasso-ed” by a sling of the decision boundary, presumably overfitting.
Do I need to repeat that while these graphics seem to indicate under- or over-
fitting, we do not actually know?
Figure 34: A nightmare case of overfitting. Picture spotted by Yasin Cibuk (2018
ML course participant) on http://dominicwilcox.com/portfolio/bed/ – now
no longer accessible –, designed and crafted by artist Dominic Wilcox, 1999. Quote
from the artist’s description of this object: “I used my own body as a template
for the mattress”. From a ML point of view, this means a size N = 1 training
data set.
overfitting problem manifests itself in different geometrical ways. But the flavor is
the same in all cases. The model flexibility determines how “wiggly” the geometry
of the learnt model can become. Very large flexibility ultimately admits to fold
the model snugly around each individual training point. This leads to small, even
zero, training error; but it is likely disastrous for generalization to new test data
points. Very low flexibility can hardly adapt to the structure of the training data
at all, likewise leading to poor test performance (and poor training performance
too). Some intermediate flexibility will strike the best balance.
Properly speaking, flexibility is not a characteristic of the final model that
one obtains after learning, but of the learning algorithm. If one uses the term
precisely, one should speak, for example, of “the flexibility of the procedure to
train a third-order polynomial function”, or of “the flexibility of the PCA-based
linear regression learning scheme which uses m PCA features”.
I introduced this way of speaking about “flexibility” ad hoc. In statistics
and machine learning textbooks you will hardly find this term. Instead, specific
methods to measure and tune the flexibility of a learning algorithm have their
specific names, and it is these names that you will find in the literature. The
most famous among them is model capacity. This concept has been developed in a
field now called statistical learning theory, and (not only) I consider it a highlight
of modern ML theory. We will however not treat the concept of model capacity
in this course, since it is not an easy concept, and it needs to be spelled out in
different versions for different types of learning tasks and algorithms. Check out
en.wikipedia.org/wiki/Vapnik-Chervonenkis_theory if you want to get an
impression. Instead, in Sections 6.4.1 and 6.4.2 I will present two simpler methods
for handling modeling flexibility which, while they lack the analytical beauty and
depth of the model capacity concept, are immensely useful in practice.
I emphasize that finding the right flexibility for a learning algorithm is ab-so-lu-
te-ly crucial for good performance of ML algorithms. Our little visual examples do
not do justice to the dismal effects that overfitting may have in real-life learning
tasks where a high dimension of patterns is combined with a small number of
training examples — which is a situation faced very often by ML engineers in
practical applications.
helpful if I repeat this here with more detail.
The decision function h obtained by a learning algorithm should be optimized
toward some objective. One introduces the concept of a loss function. A loss
function is a function
L : EY × EY → R≥0 . (37)
The idea is that a loss function measures the “cost” of a mismatch between
the target values y and the values h(x) returned by a decision function. Higher
cost means lower quality of h. We have met two concrete loss functions so far:
This loss is often just called “quadratic loss”. We used it as a basis for
deriving the algorithm for linear regression in Section 3.1.
which is just the mean loss calculated over the training examples. Minimizing this
empirical risk is an achievable goal, and a host of optimization algorithms for all
kinds of supervised learning tasks exist which do exactly this, that is, they find
    hopt = argmin_{h∈H} 1/N Σ_{i=1,...,N} L(h(xi ), yi ).     (42)
The set H is the hypothesis space – the search space within which a learning
algorithm may look for an optimal h.
It is important to realize that every learning algorithm comes with a specific
hypothesis space. For instance, in decision tree learning H is the set of all decision
trees that use a given set of properties and attributes. Or, in linear regression, H
is the set of all affine linear functions from Rn to Rk . Or, if one sets up a neural
network learning algorithm, H is typically the set of all neural networks that have
a specific connection structure (number of neuron layers, number of neurons per
layer); the networks in H then differ from each other by the weights associated
with the synaptic connections.
The empirical risk is often – especially in numerical function approximation
tasks – also referred to as training error.
While minimizing the empirical loss is a natural way of coping with the im-
possibility of minimizing the risk, it may lead to decision functions that combine
a low empirical risk with a high risk. This is the ugly face of overfitting which I
highlighted in the previous subsection. In extreme cases, one may learn a decision
function which has zero empirical risk and yet has an extremely large expected
testing error, which makes it absolutely useless.
There is no easy or general solution for this conundrum. It has spurred statis-
ticians and mathematicians to develop a rich body of theories which analyze the
relationships between risk and empirical risk, and suggest insightful strategies
to manage as well as one can in order to keep the risk within provable bounds.
These mathematical theories, sometimes referred to as statistical learning theory
(or better, theories), are beyond the scope of this lecture.
If you are in a hardcore mood and if you have some background in probability
theory, you can inspect Section 18 of lecture notes of my legacy “Principles of Sta-
tistical Modeling” course (online at https://www.ai.rug.nl/minds/uploads/
LN_PSM.pdf). You will find that the definitions of loss and risk given there in the
spirit of mathematical statistics are a bit more involved than what I presented
above, but the definitions of loss and risk that I gave here are used in textbooks
on machine learning.
Let us briefly return to the overfitting diagram in Figure 29. In that demo, the
flexibility regulation (moving left or right on the horizontal axis) was effected by
moving in a model class inclusion hierarchy — more flexible learning algorithms
were more flexible because they searched larger hypothesis spaces H. But similar
diagrams would be obtained in any other ML exercise where other methods for
navigating on the flexibility axis might be used. Because the essential message of
Figure 29 is universal across all machine learning tasks, I redraw that figure and
annotate it generically (Figure 35).
Figure 35: The generic, universal, core challenge of machine learning: finding the
right model flexibility which gives the minimal risk.
• Increasing the model flexibility from very low to very high will lead to a risk
that first decreases (less and less underfitting) and then rises again (more
and more overfitting).
• The best model flexibility is where the risk curve reaches its minimum. The
problem is that one does not know this precious risk curve - it is defined on
the distribution of all future “test” data, which one does not have at training
time.
There are many, quite different, ways of tuning flexibility of a learning algo-
rithm. Note that the word “flexibility” is only intuitive; how this is concretely
formalized, implemented and measured differs between the methods of adjusting
this “flexibility”. I outline the most common ones.
6.4.1 Tuning learning flexibility through model class size
In the Digits classification demo from Section 6.1, we tuned flexibility by changing
the dimension of the PCA features. In the polynomial curve fitting demo in Section
6.2.1 we changed the order of polynomials. Let us re-consider these two examples
to get a clearer picture:
• In the Digits example (Figure 29) we found that the number m of principal
component features that were extracted from the raw image vectors was de-
cisive for the testing error. When m was too small, the resulting models were
too simple to distinguish properly between different digit image classes (un-
derfitting). When m was too large, overfitting resulted. Fixing a particular
m determines the class H of candidate decision functions within which the
empirical risk (42) is minimized. Specifically, using a fixed m meant that the
optimal decision function hopt was selected from the set Hm which contains
all decision functions which first extract m principal component features be-
fore carrying out the linear regression. It is clear that Hm−1 is contained
in Hm , because decision functions that only combine the first m − 1 princi-
pal component features into the hypothesis vector can be regarded as special
cases of decision functions that combine m principal component features into
the hypothesis vector, namely those whose linear combination weight for the
m-th feature is zero.
• In the polynomial curve fitting example from Section 6.2.1, the model pa-
rameters were the monomial coefficients w0 , . . . , wm (compare Equation 36).
After fixing the polynomial order m, the optimal decision function p(x) was
selected from the set Hm = {p : R → R | p(x) = Σ_{j=0,...,m} wj x^j }. Again it is
clear that Hm−1 is contained in Hm .
From now on we assume that a class inclusion sequence H1 ⊂ . . . ⊂ HL is given.
We furthermore assume that we have a loss function L and are in possession of a
learning algorithm which for every class Hm can solve the problem of minimizing
the empirical risk
    hopt m = argmin_{h∈Hm} 1/N Σ_{i=1,...,N} L(h(xi ), yi ).     (43)
So... how can we find the best model class mopt which gives us the best risk
– note: not the best empirical risk? Or stated in more basic terms, which model
class will give us the smallest expected test error? Expressed formally, how can
we find
    mopt = argmin_{m} R(hopt m )?     (44)
Such a high-flexibility model type would inevitably lead to overfitting when an
“optimal” model would be learnt using the basic learning equation (42) which I
repeat here for convenience:
    hopt = argmin_{h∈H} 1/N Σ_{i=1,...,N} L(h(xi ), yi ).
1. In the polynomial fit task from Section 6.2.1 one might consider for H all 10th
order polynomials, but penalize the “oscillations” seen in the right panel of
Figure 31, that is, penalize such 10th order polynomials that exhibit strong
oscillations. The degree of “oscillativity” can be measured, for instance, by
the integral over the (square of the) second derivative of the polynomial p,
    R(θ) = R((w0 , . . . , w10 )) = ∫_0^1 (d²p(x)/dx²)² dx.

Investing a little calculus (good exercise! not too difficult), it can be seen
that this integral resolves to a quadratic form R(θ) = θ′ C θ where C is an
11 × 11 sized positive semi-definite matrix. That format is more convenient
to use than the original integral version.
2. A popular regularizer that often works well is just the squared sum of all
model parameters,
    R(θ) = Σ_{w∈θ} w².
This regularizer favors models with small absolute parameters, which often
amounts to “geometrically soft” models. This regularizer is popular among
other reasons because it supports simple algorithmic solutions for minimizing
risk functions that contain it. It is called the L2 -norm regularizer because
it measures the (squared) L2 -norm of the parameter vector θ.
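As a small illustration of the curvature regularizer from item 1 (the sketch and its function names are my own, not part of the lecture demos): with p″(x) = Σ_{j≥2} j(j−1) w_j x^{j−2}, the integral works out to C[j,k] = j(j−1)k(k−1)/(j+k−3) for j, k ≥ 2 and zero otherwise, which the following Python snippet computes.

import numpy as np

def curvature_regularizer_matrix(order=10):
    # C such that w' C w = integral_0^1 (p''(x))^2 dx for p(x) = sum_j w_j x^j
    C = np.zeros((order + 1, order + 1))
    for j in range(2, order + 1):
        for k in range(2, order + 1):
            # (d^2/dx^2) x^j = j(j-1) x^(j-2);  integral_0^1 x^(j+k-4) dx = 1/(j+k-3)
            C[j, k] = j * (j - 1) * k * (k - 1) / (j + k - 3)
    return C

w = np.random.randn(11)              # some coefficient vector w_0, ..., w_10
C = curvature_regularizer_matrix(10)
penalty = w @ C @ w                  # equals the integral of (p'')^2 over [0, 1]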
where (xi , yi )i=1,...,N is a set of training data with xi ∈ Rn , yi ∈ R. Like any other
supervised learning algorithm, linear regression may lead to overfitting solutions
wopt . It is always advisable to control the flexibility of linear regression with an
L2 norm regularizer, that is, instead of solving (46) go for
    w_opt = argmin_{w∈R^n} 1/N Σ_{i=1}^N (w′x_i − y_i)² + α² ||w||²     (47)
and find the best regularization coefficient α2 . The optimization problem (47)
admits a closed-form solution, namely the ridge regression formula that we have
already met in Equation 21. Rewriting it a little to make it match with the current
general scenario, here it is again:
    w_opt = ( 1/N X X′ + α² I_{n×n} )^{−1} 1/N X Y = ( X X′ + N α² I_{n×n} )^{−1} X Y,     (48)
where X = [x_1, . . . , x_N] and Y = (y_1, . . . , y_N)′.
In Section 3.1 I motivated the ridge regression formula by its numerical stability. Now we see that a more fundamental reason to prefer ridge regression over the basic kind of regression (46) is that it implements L2 norm regularization. The usefulness of ridge regression as an all-round simple baseline tool for supervised learning tasks can hardly be overrated.
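For concreteness, here is a minimal numpy sketch of ridge regression in the sense of (47)/(48); the function name and the packing of the data into arrays are my own choices, and a linear system solve is used instead of an explicit matrix inverse:

import numpy as np

def ridge_regression(X, Y, alpha2):
    # Minimize 1/N sum_i (w' x_i - y_i)^2 + alpha2 * ||w||^2,
    # where X is the n x N matrix [x_1, ..., x_N] and Y the length-N target vector.
    n, N = X.shape
    A = X @ X.T / N + alpha2 * np.eye(n)
    b = X @ Y / N
    return np.linalg.solve(A, b)

# Toy usage: noisy linear targets in 5 dimensions
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 200))
w_true = rng.standard_normal(5)
Y = w_true @ X + 0.1 * rng.standard_normal(200)
w_opt = ridge_regression(X, Y, alpha2=0.01)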
seen an iterative algorithm in this course so far, therefore I cannot demon-
strate this by example here. The “dropout regularization” trick, which is
widely used in deep learning, is of this kind. The effect is that the stronger
the algorithm is noisified, the stronger the regularization, that is the further
one steers to the left end of Figure 35.
Use stochastic ensemble learning. We have seen this strategy in the presen-
tation of random forests. A stochastic learning algorithm is repeatedly ex-
ecuted with different random seeds (called Θ in Section 2.7). The stronger
the randomness of the stochastic learning algorithm, and the more members
are included in the ensemble, the stronger the regularization effect.
A clever way to answer this question is to split D into K subsets Dj of equal size
(j = 1, ..., K). Then, for every regularization strength m, run K model learning
procedures, where in the j-th run the subset Dj is withheld as a validation set, and
the remaining K − 1 sets joined together make for a training set. After these K
runs, average the validation errors in order to find a good estimate for the expected
test error at this regularization strength m. After this K-fold cross-validation has
been done for all regularization strengths m = 1, . . . , M , pick that mopt which
had the lowest validation error. Here is the procedure in detail (where I use class
inclusion as the method for navigating through the flexibilities m):
Given: A set (xi , yi )i=1,...,N of training data, and a loss function L.
Also given: Some method which allows one to steer the model flexibility
along a regularization strength parameter m = 1, . . . , M. The weakest
regularization should be weak enough to allow overfitting.
Step 1. Split the training data into K disjoint subsets Dj = (x_i, y_i)_{i∈I_j} of roughly equal size N′ = N/K.
Step 2. Repeat for m = 1, . . . , M :
Step 2.1 Repeat for j = 1, . . . , K:
    Step 2.1.1 Designate Dj as validation set Vj and the union of the other Dj′ as training set Tj.
    Step 2.1.2 Compute the model with minimal training error on Tj,
        h_opt^{m,j} = argmin_{h∈H_m} 1/|Tj| Σ_{(x_i,y_i)∈Tj} L(h(x_i), y_i),
    and compute its validation risk R^val_{m,j} = 1/|Vj| Σ_{(x_i,y_i)∈Vj} L(h_opt^{m,j}(x_i), y_i).
Step 2.2 Average the K validation risks R^val_{m,j} obtained from the “folds” carried out for this m, obtaining
        R^val_m = 1/K Σ_{j=1,...,K} R^val_{m,j}.
Step 3. Find the optimal class by looking for that m which minimizes the
averaged validation risk:
    m_opt = argmin_m R^val_m.
Step 4. Compute h_{m_opt} using the complete original training data set:
    h_{m_opt} = argmin_{h∈H_{m_opt}} 1/N Σ_{i=1,...,N} L(h(x_i), y_i).
This procedure contains two nested loops and looks expensive. For economy,
one starts with the low-end m and increases it stepwise, assessing the generaliza-
tion quality through cross-validation for each regularization strength m, until the
validation risk starts to rise. The strength mopt reached at that point is likely to
be about the right one.
The best assessment of the optimal class is achieved when the original training
data set is split into singleton subsets – that is, each Dj contains just a single
training example. This is called leave-one-out cross-validation. It looks like a
horribly expensive procedure, but yet it may be advisable when one has only a
small training data set, which incurs a particularly large danger of ending up with
poorly generalizing models when a wrong model flexibility was used.
K-fold cross-validation is widely used – it is the de facto standard procedure in supervised learning tasks when the computational cost of learning a model is affordable.
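A compact sketch of this procedure, with ridge regression as the learning algorithm and a grid of regularization strengths α² playing the role of the flexibility parameter m (all names, the quadratic loss and the grid are my own illustrative choices):

import numpy as np

def ridge(X, Y, alpha2):
    n, N = X.shape
    return np.linalg.solve(X @ X.T / N + alpha2 * np.eye(n), X @ Y / N)

def kfold_cv(X, Y, alphas, K=5, seed=0):
    # Returns the regularization strength with the lowest average validation risk
    # and the model retrained on the full data set (Step 4).
    N = X.shape[1]
    folds = np.array_split(np.random.default_rng(seed).permutation(N), K)   # the subsets D_j
    avg_val_risk = []
    for a2 in alphas:
        risks = []
        for j in range(K):
            val = folds[j]                                                  # validation set V_j
            train = np.concatenate([folds[l] for l in range(K) if l != j])  # training set T_j
            w = ridge(X[:, train], Y[train], a2)
            risks.append(np.mean((w @ X[:, val] - Y[val]) ** 2))            # quadratic loss on V_j
        avg_val_risk.append(np.mean(risks))
    best = alphas[int(np.argmin(avg_val_risk))]
    return best, ridge(X, Y, best)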
could give us a formal derivation of the distribution of ĥ(x) from PX,Y and knowl-
edge of A, but we don’t need that for our purposes here. The only insight that
we need to take home at this point is that for a fixed x, ĥ(x) is a random variable
whose value is determined by the drawn training sample (xi , yi )i=1,...,N , and which
has an expectation which we write as Eretrain [ĥ(x)] to indicate that the expectation
is taken over all possible training runs with freshly drawn training data.
Understanding this point is the key to understanding the inner nature of un-
der/overfitting.
If you feel that you have made friends with this Eretrain [ĥ(x)] object, we can
proceed. The rest is easy compared to this first conceptual clarification.
Without proof I note the following, intuitively plausible fact. Among all deci-
sion functions (from any candidate space H), the quadratic risk (49) is minimized
by the function
    ∆(x) = E_{Y|X=x}[Y],     (50)
that is, by the expectation of Y given x. This function ∆ : R^n → R, x ↦ E[Y|X = x] is the gold standard for minimizing the quadratic risk; no learning algorithm
can give a better result than this. Unfortunately, of course, ∆ remains unknown
because the underlying true distribution PX,Y cannot be exactly known.
Now fix some x and ask by how much ĥ(x) deviates, on average and in the squared error sense, from the optimal value ∆(x). This expected squared error is
    E_retrain[(ĥ(x) − ∆(x))²].     (51)
We can learn more about this error if we re-write (ĥ(x) − ∆(x))² as follows:
    (ĥ(x) − ∆(x))² = (ĥ(x) − E_retrain[ĥ(x)])² + 2 (ĥ(x) − E_retrain[ĥ(x)]) (E_retrain[ĥ(x)] − ∆(x)) + (E_retrain[ĥ(x)] − ∆(x))²,     (52)
and
    E_retrain[ĥ(x) − E_retrain[ĥ(x)]] = E_retrain[ĥ(x)] − E_retrain[E_retrain[ĥ(x)]]
                                      = E_retrain[ĥ(x)] − E_retrain[ĥ(x)]
                                      = 0.
Inserting (52) into (51) and taking the expectation on both sides finally gives us
    E_retrain[(ĥ(x) − ∆(x))²] = E_retrain[(ĥ(x) − E_retrain[ĥ(x)])²] + (E_retrain[ĥ(x)] − ∆(x))²,
that is, the expected squared error splits into a variance term and a squared bias term.
7 Representing and learning distributions
Almost all machine learning tasks are based on training data that have some
random component. Having completely noise-free data from deterministic sources
observed with high-precision measurements is a rare exception. Thus, machine
learning algorithms are almost all designed to cope with stochastic data. Their
ultimate functionality (classification, prediction, control, ...) will be served well
or poorly to the extent that the probability distribution of the training data has
been properly accounted for. We have seen this in the previous section. If one
(wrongly) believes that there is little randomness in the training data, one will
take the given training points as almost correct – and end up with an overfitted
model. Conversely, if one (wrongly) thinks the training data are almost completely
“white noise randomness”, the learnt model will under-exploit the information in
the training data and underfit. Altogether it is fair to say that machine learning
is the art of probability distribution modeling (plus subsequent use of that learnt
distribution for classification etc.)
In many machine learning algorithms, the distribution model remains implicit –
there is no place or data structure within the algorithm which explicitly models the
data distribution. In some algorithms however, and for some tasks, the probability
distribution of the training data is explicitly modeled. We will encounter some of
these algorithms later in this course (hidden Markov models, Bayesian networks,
Boltzmann machines are of that kind). At any rate, it is part of the standard
knowledge of a machine learning professional to know some ways to represent
probability distributions and to estimate these representations from data.
data? Or, expressing the same question in terms of a loss function, which decision
function minimizes the risk connected to the counting loss
    L(y, c) = 1 if y ≠ c,   0 if y = c ?
It turns out that the minimal-risk decision function is in fact well-defined and
unique, and it can (and must) be expressed in terms of the distribution of our
data-generating RVs X and Y .
Our starting point is the true joint distribution PX,Y of patterns and labels.
This joint distribution is given by all the probabilities of the kind
P (X ∈ A, Y = c), (54)
where A is some subvolume of P and c ∈ C. The subvolumes A can be n-
dimensional hypercubes within P, but they also can be arbitrarily shaped “volume
bodies”, for instance balls or donuts or whatever. Note that the probabilities
P (X ∈ A, Y = c) are numbers between 0 and 1, while the distribution PX,Y is
the function which assigns to every choice of A ⊆ P, c ∈ C the number P (X ∈
A, Y = c) (probability theory gives us a rigorous formal way to define and handle
this strange object, PX,Y — it is explained with loving care in my lecture notes
for “Principles of Statistical Modeling”).
The joint distribution PX,Y is the “ground truth” – it is the real-world statis-
tical distribution of pattern-label pairs of the kind we are interested in. In the
Digits example, it would be the distribution of pairs made of (i) a handwritten
digit and (ii) a human-expert provided class label. Test digit images and their
class labels would be randomly “drawn” from this distribution.
A decision function h : P → {c1 , . . . , ck } partitions the pattern space P into k
disjoint decision regions R1 , . . . , Rk by
Ri = {x ∈ P | h(x) = ci }. (55)
A test pattern xtest is classified by h as class i if and only if it falls into the decision
region Ri .
Now we are prepared to analyze and answer our ambitious question, namely
which decision functions yield the lowest possible rate of misclassifications. Since
two decision functions yield identical classifications if and only if their decision
regions are the same, we will focus our attention on these regions and reformulate
our question: which decision regions yield the lowest rate of misclassifications, or
expressed in its mirror version, which decision regions give the highest probability
of correct classifications?
Let fi be the pdf for the conditional distribution PX | Y =ci . It is called the
class-conditional distribution.
The probability to obtain a correct classification for a random test pattern, when the decision regions are R_i, is equal to Σ_{i=1}^k P(X ∈ R_i, Y = c_i). Rewriting this expression using the pdfs of the class-conditional distributions gives
    Σ_{i=1}^k P(X ∈ R_i, Y = c_i) = Σ_{i=1}^k P(X ∈ R_i | Y = c_i) P(Y = c_i)
                                  = Σ_{i=1}^k P(Y = c_i) ∫_{R_i} f_i(x) dx.     (56)
Note that the integral is taken over a region that possibly has curved boundaries,
and the integration variable x is a vector. The boundaries between the decision
regions are called decision boundaries. For patterns x that lie exactly on such
boundaries, two or more classifications are equally probable. For instance, the digit pattern shown in the third-from-last column of the second row in Figure 15 would likely be classified by humans as a “1” or a “4” with roughly the same probability; this pattern would lie close to a decision boundary.
The expression (56) obviously becomes maximal if the decision regions are given by
    R_i = {x ∈ P | P(Y = c_i) f_i(x) ≥ P(Y = c_j) f_j(x) for all j = 1, . . . , k}.
Thus we have found the decision function which is optimal in the sense that it maximizes the probability of correct classifications: namely
    h_opt(x) = argmax_{i=1,...,k} P(Y = c_i) f_i(x).
A learning algorithm that finds the optimal decision function (or some function approximating it) must learn (implicitly or explicitly) estimates of the class-conditional distributions P_{X|Y=c_i} and the class probabilities P(Y = c_i).
The class probabilities are also called the class priors. Figure 36 visualizes
optimal decision regions and decision boundaries. In higher dimensions, the ge-
ometric shapes of decision regions can become exceedingly complex, fragmented
and “folded into one another” — disentangling them during a learning process is
one of the eternal challenges of ML.
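As an illustration of this plug-in rule (my own sketch; it assumes the class priors and the class-conditional pdfs are known Gaussians, which of course is exactly what a learning algorithm would have to estimate in practice):

import numpy as np

def gaussian_pdf(x, mu, Sigma):
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

def bayes_classify(x, priors, means, covs):
    # Return argmax_i P(Y = c_i) f_i(x) for Gaussian class-conditional pdfs f_i
    scores = [p * gaussian_pdf(x, mu, S) for p, mu, S in zip(priors, means, covs)]
    return int(np.argmax(scores))

# Two classes in R^2 with (assumed known) class priors and class-conditionals
priors = [0.7, 0.3]
means = [np.zeros(2), np.array([2.0, 2.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(bayes_classify(np.array([1.5, 1.2]), priors, means, covs))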
Figure 36: Optimal decision regions Ri . A case with a one-dimensional pattern
space P and k = 3 classes is shown. Broken lines indicate decision boundaries.
Decision regions need not be connected!
Quite generally, any machine learning task can be solved “optimally” (in terms
of minimizing some risk) only if the solution takes the true distribution of all
task-relevant RVs into account. As I mentioned before, many learning algorithms
estimate a model of the underlying data distribution only implicitly. But some
ML algorithms generate explicit models of probability distributions, and in the
wider fields of statistical modeling, explicit models of probability distributions are
often the final modeling target.
Representing a probability distribution in mathematical formalism or by an
algorithm is not always easy. Real-world probability distributions can be utterly
complex and high-dimensional objects which one cannot just “write down” in
a formula. Over the last 60 or so years, ML and statistics research has devel-
oped a wide range of formalisms and algorithms for representing and estimating
(“learning”) probability distributions. A machine learning professional should
know about some basic kinds of such formalisms and algorithms. In this section I present a selection, ranging from elementary to supremely complex, powerful — and computationally costly.
Definition 7.1 Given a discrete sample space S (finite or countably infinite), a probability mass function on S is a function p : S → [0, 1] whose total mass is 1, that is, which satisfies Σ_{s∈S} p(s) = 1.
Bernoulli distribution. The Bernoulli distribution arises when one deals with
observations that have only two possible outcomes, like tail – head, female – male,
pass – fail, 0 – 1. Putting this in abstract notation, we have a two-element sample
space S = {s1 , s2 }, on which a Bernoulli distribution is defined by its pmf, which
in this case has its own standard terminology:
The factor (N choose s) is called the binomial coefficient. Figure 37 shows some binomial pmf’s.
Figure 37: The pmf’s of Bi(20, 0.1) (blue), Bi(20, 0.5) (red), and Bi(20, 0.9)
(green). Figure taken from www.boost.org.
The expected number of events E[X] is called the rate of the Poisson distribution,
and is commonly denoted by λ. The pmf of a Poisson distribution with rate λ is
given by
    p(k) = λ^k e^{−λ} / k! .     (59)
Figure 38 depicts the pmf’s for three different rates.
Figure 38: The pmf of the Poisson distribution for various values of the parameter λ. The connecting lines between the dots are drawn only for better visual appearance (image source: https://commons.wikimedia.org/wiki/File:Poisson_pmf.svg).
in Rn . The simplest point distribution is defined on the real line S = R and has
the defining property that for any subset A ⊆ R it holds that
    P(X ∈ A) = 1 if 0 ∈ A,   0 if 0 ∉ A.
There are several ways to write down such a distribution in mathematical for-
malism. In the machine learning literature (and throughout the natural sciences,
especially physics), the above point distribution would be represented by a weird
kind of pdf-like function, called the Dirac delta function δ (the Wikipedia article
https://en.wikipedia.org/wiki/Dirac_delta_function is recommendable if
you want to understand this function better). The Dirac delta function is used
inside an integral just like a normal pdf is used in (60). Thus, for a subset A ⊆ R
one has
    P(X ∈ A) = ∫_A δ(x) dx.
The Dirac delta is also used in R^n, where likewise it is a “pdf” which describes a
probability distribution concentrated in a single point, the origin.
If one wants to have a multi-point distribution one can combine Dirac deltas.
For example, if you want to create a probability measure on the real line that
places a probability of 1/2 on the point 1.0, and probabilities 1/4 each on the
points 2.0 and 3.0, you can do this by a linear combination of shifted Dirac deltas:
    P(X ∈ A) = ∫_A ( 1/2 δ(x − 1) + 1/4 δ(x − 2) + 1/4 δ(x − 3) ) dx.
The uniform distribution. We don’t need to make a big fuss about this. If I = [a_1, b_1] × . . . × [a_n, b_n] is an n-dimensional interval in R^n, the uniform distribution on I is given by the pdf
    p(x) = 1 / ((b_1 − a_1) · . . . · (b_n − a_n))  if x ∈ I,   and  p(x) = 0  if x ∉ I.     (61)
The exponential distribution. This distribution is defined for S = [0, ∞)
and could be paraphrased as “the distribution of waiting times until the next
of these things happens”. Consider any of the kinds of temporal events listed
for the Poisson distribution, for instance the event “meteorite hits earth”. The
exponential distribution characterizes how long you have to wait for the next
impact, given that one impact has just happened. As with the Poisson distribution, such random event processes have an average rate of events per unit reference time interval. For instance, meteorites of a certain minimum size hit the earth with a rate of 2.34 per year (just guessing). This rate is again denoted by λ. The pdf of the exponential distribution is given by p(t) = λ e^{−λ t} for t ≥ 0.
Figure 39: The pdf of the exponential distribution for various values of
the parameter λ (image source: https://commons.wikimedia.org/wiki/File:
Exponential_pdf.svg).
The exponential distribution plays a big role in spiking neural networks (SNNs).
Biological neurons communicate with each other by sending short “point-like”
electrical pulses, called spikes. Many people believe that Nature invented commu-
nication by spikes to let the brain save energy – your head shouldn’t heat up (like
your PC does) when you do a hefty thinking job! For the same reason, microchip
engineers have teamed up with deep learning researchers to design novel kinds of
microchips for deep learning applications. These microchips contain neuron-like
processing elements which communicate with each other by spikes. IBM and Intel
have actually built such chips (check out “IBM TrueNorth” and “Intel Loihi” if
you want to learn more). Research about artificial SNNs for machine learning ap-
plications goes hand in hand with research in neuroscience. In both domains, one
often uses models based on the assumption that the temporal pattern in a spike
sequence sent by a neuron is a stochastic process called a Poisson process. In a
Poisson process, the waiting times between two consecutive spikes are exponen-
tially distributed. Recordings from real biological neurons often show a temporal
randomness of spikes that can be almost perfectly modeled by a Poisson process.
Learning / estimation: The rate λ is the only parameter characterizing an exponential distribution. Training data: a sequence t_1, t_2, . . . , t_N of time points where the event in question was observed. Then the rate can be estimated by the inverse of the average waiting time,
    λ̂ = ( 1/(N − 1) Σ_{i=1,...,N−1} (t_{i+1} − t_i) )^{−1}.
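A tiny sketch of this estimator on synthetic data (illustrative only; the event times are generated with the true rate 2.34 so that the estimate can be checked):

import numpy as np

def estimate_rate(event_times):
    # lambda-hat = inverse of the average waiting time between consecutive events
    waiting = np.diff(np.sort(np.asarray(event_times)))
    return 1.0 / waiting.mean()

rng = np.random.default_rng(1)
waits = rng.exponential(scale=1 / 2.34, size=10000)   # waiting times with rate 2.34
print(estimate_rate(np.cumsum(waits)))                # should come out close to 2.34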
Figure 40: pdf of a normal distribution with mean 2 and standard deviation 1.
The majestic power of the normal distribution, which makes her reign almost
universally over almost all natural phenomena, comes from one of the most central
theorems of probability theory, the central limit theorem. It is stated in textbooks
in a variety of (not always exactly equivalent) versions. It says, in brief, that one
gets the normal distribution whenever random effects of many independent small-
sized causes sum up to large-scale observable effects. The following definition
makes this precise (you do not have to understand it — that would need a math-
centered full course in probability theory — but you should at least have seen this
definition):
Definition 7.5 Let (Xi )i∈N be a sequence of independent, real-valued, square in-
tegrable random variables with nonzero variances Var(Xi ) = E[(Xi − E[Xi ])2 ].
Then we say that the central limit theorem holds for (Xi )i∈N if the distributions
PSn of the standardized sum variables
    S_n = Σ_{i=1}^n (X_i − E[X_i]) / σ(Σ_{i=1}^n X_i)     (64)
converge weakly to N (0, 1).
Explanations (if you wish to give it a try — these concepts will not be queried in
homeworks or exams):
• A real-valued random variable with pdf p is square integrable if its [uncentered] second moment, that is the integral E[X²] = ∫_R x² p(x) dx, is finite.
• The distributions P_{S_n} converge weakly to N(0, 1) if ∫ f dP_{S_n} → ∫ f dN(0, 1) for all continuous, bounded functions f : R → R. You will find the nota-
tion of these integrals unfamiliar, and indeed you see here cases of Lebesgue
integrals – a far-reaching generalization of the Riemann integrals that you
know. Lebesgue integrals can deal with a far greater range of functions
than the Riemann integral. Mathematical probability theory is formulated
exclusively with the Lebesgue integral. We cannot give an introduction to
Lebesgue integration theory in this course. Therefore, simply ignore the
precise meaning of “weak convergence” and take home that sequences of
distributions are required to converge to a target distribution in some subtly
defined way.
A sequence (X_i)_{i∈N} of random variables (or, equivalently, its associated sequence of distributions (P_{X_i})_{i∈N}) obeys the central limit theorem under rather weak conditions – or in other words, for many such sequences the central limit theorem holds.
A simple, important class of (Xi ) for which the central limit theorem holds is
obtained when the Xi are identically distributed (and, of course, are independent,
square integrable and have nonzero variance). Notice that regardless of the shape
of the distribution of each Xi , the distribution of the normalized sums converges
to N (0, 1)!
The classical demonstration of the central limit theorem is the Galton board, named after Sir Francis Galton (1822–1911), an English polymath. The idea is
to let little balls (or beans, hence this device is sometimes called “bean machine”)
trickle down a grid of obstacles which randomly deflect the ball left or right (Figure
41). It does not matter how, exactly, these deflections act — in the simplest case,
the ball is just kicked right or left by one space grid unit with equal probability.
The deeper the trickling grid, the closer will the resulting distribution be to a
normal distribution. A nice video can be watched at https://www.youtube.com/
watch?v=PM7z_03o_kk.
Figure 41: The Galton board. Compare text for explanation. Figure taken from https://janav.wordpress.com/2013/09/26/power-law/.
However, this simple case does not explain the far-reaching, general importance
of the central limit theorem (rather, property). In textbooks one often finds state-
ments like, “if the outcomes of some measurement procedure can be conceived to
be the combined effect of many independent causal effects, then the outcomes will
be approximately normal distributed”. The “many independent causal effects”
that are here referred to are the random variables (Xi ); they will typically not
be identically distributed at all. Still the central limit theorem holds under mild
assumptions. Intuitively, all that one has to require is that none of the individual
random variables X_i dominates all the others – the effects of any single X_i must asymptotically be “washed out” as an increasing number of other X_{i′} is entered into the sum variable S_n. In mathematical textbooks on probability you may find
numerous mathematical conditions which amount to this “washing out”. A special
case that captures many real-life cases is the condition that the Xi are uniformly
bounded, that is, there exists some b > 0 such that all possible values that Xi
can take are in the interval [−b, b]. However, there exist much more general (non-
trivial to state) conditions that likewise imply the central limit theorem. For our
purposes, a good enough take-home message is
1. Transform the problem from its original version N (µ, σ 2 ) to the standard
normal distribution N (0, 1), by using
    ∫_a^b 1/(√(2π) σ) e^{−(x−µ)²/(2σ²)} dx = ∫_{(a−µ)/σ}^{(b−µ)/σ} 1/√(2π) e^{−x²/2} dx.     (66)
2. Compute the numerical value of the r.h.s. in (66) by using the cumulative distribution function of N(0, 1), which is commonly denoted by Φ:
    ∫_{(a−µ)/σ}^{(b−µ)/σ} 1/√(2π) e^{−x²/2} dx = Φ((b − µ)/σ) − Φ((a − µ)/σ).
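In code, the two steps collapse into one line once Φ is available; the following sketch (my own helper names) expresses Φ through the error function:

from math import erf, sqrt

def Phi(z):
    # cumulative distribution function of N(0, 1)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_prob(a, b, mu, sigma):
    # P(a <= X <= b) for X ~ N(mu, sigma^2), via the standardization (66)
    return Phi((b - mu) / sigma) - Phi((a - mu) / sigma)

print(normal_prob(1.0, 3.0, mu=2.0, sigma=1.0))   # about 0.6827 (one sigma on either side)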
estimation formulas for these two parameters are
    µ̂ = 1/N Σ_i x_i ,
    σ̂² = 1/(N − 1) Σ_i (x_i − µ̂)².
The n–dimensional normal distribution. If data points are not just real
numbers but vectors x = (x1 , . . . , xn )0 ∈ Rn , whose component RVs Xi fulfill
the central limit theorem, the joint distribution of the RVs X1 , . . . , Xn is the
multidimensional normal distribution. It has the pdf
    p(x) = 1/((2π)^{n/2} det(Σ)^{1/2}) exp( −1/2 (x − µ)′ Σ^{−1} (x − µ) ).     (67)
Here µ is the expectation E[(X1 , . . . , Xn )0 ] and Σ is the covariance matrix of the
n component variables, that is Σ(i, j) = E[(Xi − E[Xi ])(Xj − E[Xj ])]. Figure
42 shows the pdf of a 2-dimensional normal distribution. In geometrical terms, a
multidimensional normal distribution is shaped as an ellipsoid, whose main axes
coincide with the eigenvectors u_i of the covariance matrix Σ. Like in PCA one can obtain them from the SVD: if U D U′ = Σ is the singular value decomposition of Σ, the eigenvectors u_i are the columns of U.
binations of Gaussians to approximate complex distributions - we will see this
later in today’s lecture. At the advanced end, there is an entire modern branch
of ML, Gaussian processes, where complex distributions (and hyperdistributions)
are modeled by infinitely many, interacting Gaussians. This is beyond the scope
of our course.
Learning / estimation: Given a set of n-dimensional training data points
(xi )i=1,...,N , the expectation µ and the covariance matrix Σ can be estimated from
the training data in the obvious way:
    µ̂ = 1/N Σ_i x_i   and   Σ̂ = 1/(N − 1) Σ_i (x_i − µ̂)(x_i − µ̂)′ .
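In numpy this estimation takes a few lines; the following sketch (illustrative; it assumes the data points are stored as the rows of an array) recovers µ̂ and Σ̂ from a synthetic sample:

import numpy as np

def estimate_gaussian(X):
    # X: N x n array, one data point x_i per row. Returns (mu_hat, Sigma_hat).
    N = X.shape[0]
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / (N - 1)
    return mu, Sigma

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=5000)
mu_hat, Sigma_hat = estimate_gaussian(X)   # should be close to the true mean and covariance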
... and many more! The few common, named distributions that I displayed
in this section are only meant to be illustrative picks from a much, much larger
reservoir of well-known, completely analyzed, tabulated, pre-computed, and in-
dividually named distributions. The online book “Field Guide to Continuous
Probability Distributions” (Crooks 2017) attempts a systematic overview. You
should take home the following message:
5. finally allows you to state something like, “given the observation data,
with a probability of 0.95, the true distribution θtrue differs from the
estimate θ̂ by less than 0.01 percent.”
• Professionally documented statistical analyses will always state not only the
estimated model, but also in addition quantify how accurate the model esti-
mate is. This can take many forms, like error bars drawn around estimated
parameters or stating “significance levels”.
Figure 43: A. A richly structured 2-dimensional pdf (darker: larger values of pdf; white: zero value of pdf). B. Same, discretized to a pmf. C. A MoG coverage of the pmf with 20 Gaussians, showing the ellipsoids corresponding to standard deviations. D. Contour plot of the pdf represented by the mixture of Gaussians. (The picture I used here is an iconic photograph from the history of aviation, July 1909. It shows Latham departing from Calais, attempting to cross the Channel in an aeroplane for the first time in history. The motor of his Antoinette IV monoplane gave up shortly before he reached England and he had to ditch in the water. A few days later his rival Blériot succeeded. https://en.wikipedia.org/wiki/Hubert_Latham)
where θj = (µj , Σj ) are the parameters of the jth Gaussian, and where the mixture
coefficients P (j) satisfy
    0 ≤ P(j)   and   Σ_{j=1}^m P(j) = 1.
For notational simplification we will mostly write Σ_{j=1}^m p(x|j) P(j) instead of Σ_{j=1}^m p(x|θ_j) P(j).
MoG’s are used in unsupervised learning situations: given a data set (xi )i=1,...,N
of points in Rn , one wants to find a “good” MoG approximation of the distribution
from which the xi have been sampled. But... what does “good” mean?
This is the point to introduce the loss function which is invariably used when
the task is to find a “good” distribution model for a set of training data points.
Because this kind of loss function and this way of thinking about optimizing distri-
bution models is not limited to MoG models but is the universal way of handling
distribution modeling throughout machine learning, I devote a separate subsection
to it and discuss it in general terms.
This pdf f (D|θ) value gives the probability density of getting the sample D
if the underlying distribution is the one modeled by θ. Conversely, one says that
this value f (D|θ) is the likelihood of the distribution θ given the data D, and
writes L(θ | D). The words “probability” and “likelihood” are two different words
which refer to the same mathematical object, namely f(D|θ), but this object is read from left to right when speaking of “probability” and from right to left when speaking of “likelihood”:
• “the probability of the data D given the model θ”, written f(D | θ);
• “the likelihood of the model θ given a dataset D”, written L(θ | D).
If you meet someone who is able to correctly use the words “probability” vs.
“likelihood”, you can trust that person — s/he has received a good education in
machine learning.
The likelihood pdf (69) is not suitable for doing numerical computations, because a product of very many small numbers (the p(x_i | θ)) will lead to numerical underflow and be squashed to zero on digital computers. To avoid this, one always works with the log likelihood
    L(θ|D) = log Π_{i=1}^N p(x_i | θ) = Σ_{i=1}^N log p(x_i | θ)     (70)
which obviously is equivalent. If one writes out the L(θ | D) one sees the structural
similarity of this unsupervised learning task with the supervised loss-minimization
task which we met earlier in this course (e.g. (42)):
    θ^ML = argmin_{θ∈H} Σ_{i=1}^N − log p(x_i | θ).
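As a small numerical illustration of working with log likelihoods (my own example, with a one-dimensional Gaussian as the parametric family θ = (µ, σ²)):

import numpy as np

def log_likelihood(D, mu, sigma2):
    # L(theta | D) = sum_i log p(x_i | theta) for a 1-dimensional Gaussian model
    D = np.asarray(D)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (D - mu) ** 2 / (2 * sigma2))

D = np.array([0.9, 1.1, 1.4, 0.7, 1.0])
print(log_likelihood(D, mu=1.0, sigma2=0.1))   # a model that fits the data well ...
print(log_likelihood(D, mu=3.0, sigma2=0.1))   # ... has a much larger log likelihood than this one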
7.3.2 Maximum-likelihood estimation of MoG models by an EM algo-
rithm
Back to our concrete business of modeling a distribution represented by a dataset
D = (xi )i=1,...,N by a MoG. In order to to find a maximum-likelihood solution for
this MoG modeling task, one has to answer two questions:
1. What is the appropriate number m of mixture components? Setting m de-
termines the class of candidate solutions in which an optimization algorithm
can search for the maximum-likelihood solution. The set Hm contains all
mixtures of m Gaussians. We have here a case of a model class inclusion
sequence as discussed in Section 6.4. For determining an appropriate model
flexibility m the routine way would be to carry out a cross-validation.
2. If an appropriate m has been determined, how should the parameters θj
(where j = 1, . . . , m) of the participating Gaussians be set in order to solve
the optimization problem
    θ^ML = argmax_{θ∈H_m} L(θ | D) ?     (73)
• for estimating hidden Markov models (HMMs) of stochastic timeseries; HMMs
are the most widely used models for stochastic processes with memory;
• for estimating models of complex systems that are described by many inter-
acting random variables, some of which are not observable — an example
from human-machine interfacing: modeling the decisions expected from a
user assuming that these decisions are influenced by emotional states that
are not directly observable; such models are called Bayesian networks or
more generally, graphical models;
• for a host of other practical modeling tasks, like our MoG optimization task.
Figure 44: A point set sampled from a mixture of Gaussians. Left: what we
(don’t) know if we don’t know from which Gaussian a point was drawn. Right:
the “hidden” variable made visible. (Figure copied from an online presentation of
Christopher M. Bishop, no longer accessible)
The MoG case has a mix of types: X gives values in the continuous space (Rn )N
and Y gives values in the discrete space {1, . . . , m}N . This makes it necessary to
use a mix of pdf’s and pmf’s when working out the MoG case in detail.
The difficulty we are facing is that the pdf value pX (D|θ) depends on the values
that the hidden variables take, that is,
    L(θ | D) = log p_X(D|θ) = log ∫_{S_Y} p_{X,Y}(D, y | θ) dy.
In the MoG case, the integral over SY becomes a sum over SY :
    p_X(D|θ) = Σ_{(j_1,...,j_N)∈{1,...,m}^N} P(Y = (j_1, . . . , j_N)) p_{X|Y=(j_1,...,j_N)}(D)
             = Σ_{(j_1,...,j_N)∈{1,...,m}^N} Π_{i=1,...,N} P(Y_i = j_i) p_{X_i|Y_i=j_i}(x_i),
where pX|Y =(j1 ,...,jN ) is the pdf of the conditional distribution of X given that
the hidden variables take the value (j1 , . . . , jN ) and pXi |Yi =ji is the pdf of the
conditional distribution of Xi given that the hidden variable Yi takes the value ji .
Now let q be any pdf for the hidden variables Y . Then we can obtain a lower
bound on log pX (D|θ) by
    L(θ | D) = log ∫_{S_Y} p_{X,Y}(D, y | θ) dy
             = log ∫_{S_Y} q(y) [ p_{X,Y}(D, y | θ) / q(y) ] dy
             ≥ ∫_{S_Y} q(y) log [ p_{X,Y}(D, y | θ) / q(y) ] dy
             = ∫_{S_Y} q(y) log p_{X,Y}(D, y | θ) dy − ∫_{S_Y} q(y) log q(y) dy     (75)
             =: F(q, θ),     (76)
where the inequality follows from a version of Jensen’s inequality which states
that
E[f ◦ Y ] ≤ f (E[Y ])
for any concave function f (the log is concave).
EM algorithms maximize the lower bound F(q, θ) by alternatingly and itera-
tively maximizing F(q, θ) first with respect to q, then with respect to θ, starting
from an initial guess q (0) , θ(0) which is then updated by:
    F(p_{Y|X=D,θ^(k)}, θ^(k)) =
    = ∫_{S_Y} p_{Y|X=D,θ^(k)}(y) log p_{X,Y}(D, y | θ^(k)) dy − ∫_{S_Y} p_{Y|X=D,θ^(k)}(y) log p_{Y|X=D,θ^(k)}(y) dy
    = ∫_{S_Y} p_{Y|X=D,θ^(k)}(y) log [ p_{X,Y}(D, y | θ^(k)) / p_{Y|X=D,θ^(k)}(y) ] dy
    = ∫_{S_Y} p_{Y|X=D,θ^(k)}(y) log p_X(D | θ^(k)) dy
    = ∫_{S_Y} [ p_{X,Y}(D, y | θ^(k)) / p_X(D | θ^(k)) ] log p_X(D | θ^(k)) dy
    = [ log p_X(D | θ^(k)) / p_X(D | θ^(k)) ] ∫_{S_Y} p_{X,Y}(D, y | θ^(k)) dy
    = log p_X(D | θ^(k))
    = L(θ^(k) | D).

    θ^(k+1) = argmax_θ ∫_{S_Y} q^(k+1)(y) log p_{X,Y}(D, y | θ) dy
            = argmax_θ ∫_{S_Y} p_{Y|X=D,θ^(k)}(y) log p_{X,Y}(D, y | θ) dy     (80)
How the M-step is concretely computed (i.e. how the argmax problem (80) is solved) depends on the particular kind of model. It can be tricky, and worth a publication, to find an M-step algorithm for a new kind of model.
Because we have F = L(θ(k) | D) before each M-step, and the E-step does
not change θ(k) , and F cannot decrease in an EM-double-step, the sequence
L(θ^(0) | D), L(θ^(1) | D), . . . grows monotonically toward a supremum. The iterations are stopped when L(θ^(k) | D) = L(θ^(k+1) | D), or more realistically, when a predefined number of iterations is reached or when the growth rate falls below a predetermined threshold. The last parameter set θ^(k) that was computed is taken as the outcome of the EM algorithm.
It must be emphasized that EM algorithms steer toward a local maximum of
the likelihood. If started from another initial guess, another final parameter set
may be found. Here is a summary of the EM principle:
E-step: Estimate the distribution pY |X=D,θ(k) (y) of the hidden variables, given
data D and the parameters θ(k) of a preliminary model. This can be intu-
itively understood as inferring knowledge about the hidden variables, thereby
completing the data.
There are other ways to compute maximum likelihood solutions in tasks that
involve visible and hidden variables. Specifically, one can often invoke a gradient
descent optimization. A big advantage of EM over gradient descent is that EM
does not need a careful tuning of algorithm control parameters (like learning rates,
an eternal bummer in gradient descent methods), simply because there are no
tuning parameters. Furthermore, EM algorithms are typically numerically robust,
which cannot be said about gradient descent algorithms.
Now let us put EM to practice for the MoG estimation. For better didactic
transparency, I will restrict my treatment to a special case where we require all
Gaussians to be spheric, that is, their covariance matrices are of the form Σ =
σ 2 In where σ 2 is the variance of the Gaussian in every direction and In is the n-
dimensional identity matrix. The general case of mixtures of Gaussians composed
of member Gaussians with arbitrary Σ is described in textbooks, for instance Duda,
P. E. Hart, and Stork 2001. And anyway, you would probably use a ready-made
online tool for MoG estimation... Here we go.
A MoG pdf with m spherical components is given by the vector of parameters θ = (µ_1, . . . , µ_m, σ²_1, . . . , σ²_m, P(1), . . . , P(m))′. This gives the following concrete optimization problem:
Assume that we are after iteration k and want to estimate θ(k+1) . In the E-step
we have to compute the conditional distribution of the hidden variable Y , given
data D and the preliminary model
    θ^(k) = (µ_1^(k), . . . , µ_m^(k), σ²_1^(k), . . . , σ²_m^(k), P^(k)(1), . . . , P^(k)(m))′.
Unlike in the treatment that I gave for the general case, where we assumed Y
to be continuous, now Y is discrete. Its conditional distribution is given by the
probabilities P (Yi = j | Xi = xi , θ(k) ), where i = 1, . . . , N and j = 1, . . . , m. These
probabilities are
    P(Y_i = j | X_i = x_i, θ^(k)) = p^(k)(x_i | Y_i = j) P^(k)(j) / p_X^(k)(x_i),     (82)
where p^(k)(x_i | Y_i = j) is the pdf of the j-th Gaussian with parameters µ_j^(k), σ²_j^(k), and
    p_X^(k)(x_i) = Σ_{j=1}^m P^(k)(j) p^(k)(x_i | Y_i = j).
In the M-step we have to find maximum likelihood estimates for all parameters
in θ. I do not give a derivation here but just report the results, which are intuitive
enough:
    µ_j^(k+1) = Σ_{i=1}^N P(Y_i = j | X_i = x_i, θ^(k)) x_i / Σ_{i=1}^N P(Y_i = j | X_i = x_i, θ^(k)),

    σ²_j^(k+1) = (1/n) Σ_{i=1}^N P(Y_i = j | X_i = x_i, θ^(k)) ||µ_j^(k+1) − x_i||² / Σ_{i=1}^N P(Y_i = j | X_i = x_i, θ^(k)),   and

    P^(k+1)(j) = 1/N Σ_{i=1}^N P(Y_i = j | X_i = x_i, θ^(k)).
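The following is a compact, illustrative implementation of these E- and M-step formulas for spherical Gaussians (my own sketch: fixed number of iterations, no log-domain underflow protection, random data points as initial means):

import numpy as np

def spherical_gauss_pdf(X, mu, sigma2):
    # pdf of N(mu, sigma2 * I_n), evaluated for every row of X (shape N x n)
    n = X.shape[1]
    sq = np.sum((X - mu) ** 2, axis=1)
    return np.exp(-sq / (2 * sigma2)) / (2 * np.pi * sigma2) ** (n / 2)

def em_spherical_mog(X, m, iterations=100, seed=0):
    N, n = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(N, m, replace=False)]      # initial means
    sigma2 = np.full(m, X.var())                 # initial variances
    P = np.full(m, 1.0 / m)                      # initial mixture coefficients
    for _ in range(iterations):
        # E-step: responsibilities P(Y_i = j | X_i = x_i, theta^(k)), Equation (82)
        dens = np.stack([P[j] * spherical_gauss_pdf(X, mu[j], sigma2[j]) for j in range(m)], axis=1)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: the three update formulas stated above
        weights = resp.sum(axis=0)
        mu = (resp.T @ X) / weights[:, None]
        for j in range(m):
            sq = np.sum((X - mu[j]) ** 2, axis=1)
            sigma2[j] = (resp[:, j] @ sq) / (n * weights[j])
        P = weights / N
    return mu, sigma2, P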
Figure 45: A two-dimensional dataset.
which makes H the indicator function of a unit hypercube centered at the origin.
Using H, we get the n-dimensional analog of the staircase pdf in Figure 47 for a
sample D = (xi )i=1,...,N by
    p^(D)(x) = 1/N Σ_{i=1}^N (1/d^n) H((x − x_i)/d),     (84)
observing that the volume of such a cube is d^n. The superscript (D) in p^(D) is
meant to indicate that the pdf depends on the sample D.
Clearly, given some sample (xi )i=1,...,N , we do not believe that such a rugged
staircase reflects the true probability distribution the sample was drawn from. We
would rather prefer a smoother version. This can be easily done if we use smoother
kernel functions.
Figure 46: The EM algorithm at work on the Old Faithful dataset.
A standard choice is to use multivariate Gaussians with diagonal covariance matrix and uniform standard deviations σ =: d for H. This turns (84) into
    p^(D)(x) = 1/N Σ_{i=1}^N 1/(2πd²)^{n/2} exp( −||x − x_i||² / (2d²) )
             = 1/N Σ_{i=1}^N (1/d^n) (1/(2π)^{n/2}) exp( −1/2 ||(x − x_i)/d||² ),     (85)
where the second line brings the expression to a format that is analogous to (84).
It is clear that any nonnegative kernel function H which integrates to unity
can be used in an equation of this sort such that the resulting p(D) will be a pdf.
The scaling factor d determines the width of the Parzen window and thereby the
amount of smoothing. Figure 48 illustrates the effect of varying d.
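A minimal sketch of evaluating the Gaussian Parzen pdf (85) at a query point (the function and variable names are my own):

import numpy as np

def parzen_gaussian_pdf(x, sample, d):
    # Equation (85): mean of Gaussian kernels of width d centered on the sample points
    N, n = sample.shape
    sq = np.sum((sample - x) ** 2, axis=1)
    return np.mean(np.exp(-sq / (2 * d ** 2)) / (2 * np.pi * d ** 2) ** (n / 2))

rng = np.random.default_rng(3)
sample = rng.standard_normal((500, 2))                    # a 2-dimensional sample
print(parzen_gaussian_pdf(np.zeros(2), sample, d=0.3))    # roughly 0.15 near the origin here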
Comments:
• Parzen window representations of pdfs are ”non-parametric” in the sense
that the shape of such a pdf is determined by a sample (plus, of course, by
the shape of the kernel function, which however mainly serves a smoothing
purpose). This can render Parzen window representations computationally
expensive, because if the sample size is large, a large number of data points
have to be stored (and accessed if the pdf is going to be used).
• The basic Parzen windowing scheme, as introduced here, can be refined in
many ways. A natural way to improve on it is to use different widths d
Figure 47: Rectangular Parzen window representation of a distribution given by a
sample of 5 real numbers. The sample points are marked by colored circles. Each
data point lies in the middle of a square ”Parzen window”, that is, a rectangular
pdf centered on the point. Weighted by 1/5 (colored rectangles) and summed
(solid black staircase line) they give a pdf.
Figure 48: The effect of choosing different widths d in representing a 5-point, 2-dimensional sample by Gaussian windows. (Taken from the online set of figures of the book by Duda, Hart & Stork, ftp://ftp.wiley.com/public/sci_tech_med/pattern/)
for different sample points xi . One then makes d narrow in regions of the
sample set which are densely populated, and wide in regions that are only
thinly covered by sample points. One way of doing that (which I invented
while I was writing this — there are many ways to go) would be to (i) choose
a reasonably small integer K; (ii) for each sample point xi determine its K
nearest neighbors xi1 , ..., xiK ; (iii) compute the mean squared distance δ of
xi from these neighbors, (iv) set d proportional to this δ for this sample
point xi .
V is the validation set, T the training set, and p(T ) (v | d) the Parzen pdf
obtained from the training set using width d.
Still this will not usually solve all underflow problems, because the sum-exp
terms within the log still may underflow. Here is a trick to circumvent this
problem, known as the ”log-sum-exp” trick. Exploit the following:
log(exp(−A) + exp(−B)) =
= log (exp(−A + C) exp(−C) + exp(−B + C) exp(−C))
= −C + log (exp(−A + C) + exp(−B + C)) ,
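In code the trick looks as follows (a small sketch; C is chosen as the smallest exponent so that the largest term inside the log becomes exp(0) = 1):

import numpy as np

def log_sum_exp(A):
    # Compute log(sum_i exp(-A_i)) without underflow, for large positive exponents A_i
    A = np.asarray(A, dtype=float)
    C = A.min()
    return -C + np.log(np.sum(np.exp(-(A - C))))

A = np.array([800.0, 801.0, 805.0])
print(np.log(np.sum(np.exp(-A))))   # naive evaluation underflows to -inf
print(log_sum_exp(A))               # about -799.69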
8 Bayesian model estimation
Machine learning is based on probability, and probability is ... ? — We don’t know
what probability is! After centuries of thinking, philosophers and mathematicians
have not arrived at a general consensus of how to best understand randomness.
The proposals can be broadly grouped into objectivistic and subjectivistic inter-
pretations of “probability”.
According to the objectivistic view, probability resides physically in the real
world — it is a phenomenon of nature, as fundamentally a property of physical
reality as for instance “time” or “energy”.
According to the subjectivistic view, probability describes an observer’s sub-
jective opinion on something observed — a degree of belief, of uncertainty, of
plausibility of judgement, or missing information etc.
Calling the two views “objectivistic” versus “subjectivistic” is the philosophers’
terminology. Statisticians and machine learners rather call it frequentist statistics
versus Bayesian statistics.
The duality of understandings of “probability” has left a strong mark on ma-
chine learning. The two ways of thinking about probability lead to two different
ways of designing learning algorithms. Both are important and both are in daily
use. A machine learning professional should be aware of the fundamental intuitions
behind the two sorts of algorithms, and when to use which. In this section I ex-
plain the fundamental ideas behind the two understandings of probability, outline
the general principles of constructing Bayesian algorithms, and demonstrate them
in a case study of great importance in bioinformatics, namely the classification of
proteins.
6) = 1/5 would then become “measurable in principle” by
If one looks at this “definition” with a critical mind one will find that it is laden with difficulties.
Second, it does not inform us about how, exactly, the “repeated trials” of exe-
cuting X should be done in order to be “unbiased”. What does that mean
in terms of experimental procedures? This is a very critical issue. To appreciate its impact, consider the example of a bank that wishes to estimate the probability that a customer will fail to pay back a loan. In order to get an estimate of this probability, the bank can only use customer data collected in the past, but wants to base creditworthiness decisions for future customers on those data. Picking only past data points to base probability estimates on hardly can qualify as an “absolutely unbiased” sampling of data, and in fact
the bank may grossly miscalculate credit risks when the general customer
body or their economical conditions change over time. These difficulties
have of course been recognized in practical applications of statistics. Text-
books and statistics courses for psychologists and economists contain entire
chapters with instructions on how to collect “unbiased samples”.
Third, if one repeats the repeated measurement, say by carrying out one measurement sequence giving x_1, x_2, . . . and then another one giving x′_1, x′_2, . . ., the values P̂_N from Equation (88) are bound to differ between the two series. The limit indicated in Equation (88) must somehow be robust against
different versions of the P̂N . Mathematical probability theory offers several
ways to rigorously define limits of series of probability quantities which we
do not present here. Equation (88) is suggestive only and I marked it with
a (∗ ) to indicate that it is not technically complete.
Among these three difficulties, only the second one is really problematic. The
first one is just a warning that in order to measure a probability with increasing
precision we need to invest an increasingly large effort, — but that is the same
for other measurables in the sciences. The third difficulty can be fully solved by a
careful definition of suitable limit concepts in probability theory. But the second
difficulty is fundamental and raises its ugly head whenever statistical assertions
about reality are made.
In spite of these difficulties, the objectivist view on probability in general, and
the frequentist account of how to measure it in particular, is widely shared among
empirical scientists. It is also the view of probability which is commonly taught
in courses of statistics for mathematicians. A student of mathematics may not
even have heard of Bayesian statistics at the end of his/her studies. In machine
learning, the frequentist view often leads to learning algorithms which are based
on the principle of maximum likelihood estimation of distributions — as we have
seen in Section 7.3.1. In fact, all what I wrote in these lecture notes up to now
was implicitly based on a frequentist understanding of probability.
various kinds of uncertainty in their knowledge and judgements. The formalisms
developed by subjectivists can by and large be seen as generalizations of classical
logic. Classical logic only knows two truth values: true or false. In subjectivist
versions of logic formalisms, a proposition can be assigned graded degrees of “be-
lief”, “plausibility”, etc. For a first impression, contrast a classical-logic syllogism
like
    if A is true, then B is true;  A is true;  therefore, B is true
with a plausibility rule of the kind
    if A is true, then B becomes more plausible;  B is observed to be true;  therefore, A becomes more plausible.
This example is taken from Jaynes 2003, where a situation is described in which a policeman sees a masked man running away from a jeweler’s shop whose window was just smashed in. The plausibility rule captures the policeman’s inference that the runner is a thief (A) because if a person is a thief, it is more likely that the person will run away from a smashed-in shop window (B) than when the person
isn’t a thief. From starting points like this, a number of logic formalisms have been
devised which enrich/modify classical two-valued logic in various ways. If you want
to explore these areas a little further, the Wikipedia articles probabilistic logic,
Dempster-Shafer theory, fuzzy logic, or Bayesian probability are good entry points.
In some of these formalisms the Kolmogorov axioms of frequentist probability re-
appear as part of the respective mathematical apparatus. Applications of such
formalisms arise in artificial intelligence (modeling reasoning under uncertainty),
human-machine interfaces (supporting discourse generation), game theory and
elsewhere.
The discipline of statistics has almost entirely been developed in an objectivist
spirit, firmly rooted in the frequentist interpretation of probability. Machine learn-
ing also in large parts roots in this view. However, a certain subset of machine
learning models and computational procedures have a subjectivist component.
These techniques are referred to as Bayesian model estimation methods. Bayesian modeling is particularly effective and important when training datasets are small. I will explain the principle of Bayesian model estimation with a super-simple synthetic example. A more realistic but also more complex example will be given in Section 8.3.
Consider the general statistical modeling task, here discussed for real-valued
random variables. A measurement which yields real-valued outcomes (like mea-
suring the speed of a diving falcon) is repeated N times, giving a measurement
sequence x1 , . . . , xN . The ith measurement value is obtained from a RV Xi . These
RVs Xi which model the individual measurements are i.i.d. We assume that the
distribution of each Xi can be represented by a pdf pXi : R → R≥0 . The i.i.d.
property of the family (Xi )i=1,...,N implies that all these pXi are the same, and we
call that pdf pX . We furthermore assume that pX is a parametric pdf, that is, it
is a function which is parametrized by a parameter vector θ, for which we often
write pX (θ). Then, the statistical modeling / machine learning task is to estimate
θ from the sample data x1 , . . . , xN . We have seen that this set-up naturally leads
to maximum-likelihood estimation algorithms which need to be balanced with
regards to bias-variance using some regularization and cross-validation scheme.
For concreteness let us consider a case where N = 2, that is two observa-
tions (only two!) have been collected, forming a training sample D = (x1 , x2 ) =
(0.9, 1.0). We assume that the pdf p_X is a normal distribution with unit standard deviation, that is, the pdf has the form p_X(x) = 1/√(2π) exp(−(x − µ)²/2). This
leaves the expectation µ as the only parameter that has to be estimated, thus
θ = µ. The learning task is to estimate µ from D = (x1 , x2 ) = (0.9, 1.0).
The classical frequentist answer to this question is to estimate µ by the sample
mean (which, by the way, is the maximum-likelihood estimator for the expectation
µ). That is, one computes
    µ̂ = 1/N Σ_{i=1}^N x_i ,     (89)
which in our example gives µ̂ = (0.9 + 1.0)/2 = 0.95.
This is the best a classical-frequentist modeler can do. In a certain well-defined
sense which we will not investigate, the sample mean is the optimal estimator for
the true mean of a real-valued distribution. But “best” is not “good”: with only
two data points, this estimate is quite shaky. It has a high variance: if one would repeat the observation experiment, getting a new sample x′_1, x′_2, very likely one would obtain a quite different sample and thus an estimate µ̂′ that is quite different from µ̂.
Bayesian model estimation shows a way how to do better. It is a systematic
method to make use of prior knowledge that the modeler may have beforehand.
This prior knowledge takes the form of “beliefs” about which parameter vectors θ
are more or less plausible. This belief is cast in the form of a probability distribu-
tion over the space Θ of possible parameter vectors θ. In our super-simple example
let us assume that the modeler knows or believes that the true expectation µ can’t
be negative and it can’t be greater than 1. Furthermore, let us assume that the
modeler is lacking any more specific insight — all values of θ (here: θ = µ) are
equally plausible to the modeler. Thus he/she will encode his/her prior belief
as the uniform distribution h over [0, 1]. This distribution is called the Bayesian
prior distribution.
This kind of prior knowledge is often available, and it can be quite weak (as in
our example). Abstracting from our example, this kind of knowledge means to fix
a “belief profile” (in the mathematical format of a probability distribution) over
the space Θ of possible parameters θ for a model family. In our mini-example the
modeler felt confident to restrict the possible range of the single parameter θ1 = µ
to the interval [0, 1], with a uniform (= non-committing) distribution of “belief”
over this interval.
Before I finish the treatment of our baby example, I present the general schema
of Bayesian model estimation.
To start the Bayesian model estimation machinery, available prior knowledge
about parameters θ is cast into the form of a distribution over parameter space.
For a K-parametric pdf pX , the parameter space is RK . Two comments:
• The distribution for model parameters θ is not a distribution in the classical
sense. It is not connected to a random variable and does not model a real-
world outcome of observations. Instead it captures subjective beliefs that
the modeler has about what the true distribution P_{X_i} of data points should look like. It is here that subjectivistic aspects of “probability” intrude into
an otherwise classical-frequentist picture.
• Each parameter vector θ ∈ RK corresponds to one specific pdf pX (θ) — one
of the many candidate distributions in Θ. A distribution over parameter
vectors θ ∈ Θ ⊆ RK is thus a distribution over distributions. It is called a
hyperdistribution.
In order to proceed with our discussion of general principles, we need to lift our view from the pdf’s p_{X_i}(θ), which model the distribution of single data points, to the N-dimensional pdf p_{⊗i X_i} : R^N → R≥0 for the distribution of the product RV ⊗_i X_i. Assuming independence of the X_i, it can be written as
    p_{⊗i X_i}((x_1, . . . , x_N)) = p_{X_1}(x_1) · . . . · p_{X_N}(x_N),     (90)
or in another notation (observing that all pdfs p_{X_i} are identical, so we can use the generic RV X with p_X = p_{X_i} for all i), as
    p_{⊗i X}((x_1, . . . , x_N)) = Π_i p_X(x_i).     (91)
• If θ is fixed, p_{⊗i X}(D | θ) can be seen as a function of data vectors D. This function is a pdf over the training sample data space. For each possible training sample D = (x_1, . . . , x_N) it describes how probable this particular outcome is, assuming the true distribution of X is p_X(θ).
• If, conversely, D is fixed, then p_{⊗i X}(D | θ) can be seen as a function of θ. Seen as a function of θ, p_{⊗i X}(D | θ) is not something like a pdf over θ-space. Its integral over θ will not usually be one. Seen as a function of θ, p_{⊗i X}(D | θ) is called a likelihood function — given data D, it reveals certain models θ as being more likely than others. A model (characterized by θ) “explains” given data D better if p_{⊗i X}(D | θ) is higher. We have met the concept of a likelihood function before in Section 7.3.1.
We thus have two sources of information about the sought-after, unknown true distribution p_X(θ): the likelihood p_{⊗i X}(D | θ) of θ given data D, and the prior plausibility encoded in h(θ). These two sources of information are independent of each other: the prior plausibility is settled by the modeler before data have been observed, and should not be informed by data. Because the two sources of information come from “independent” sources of information (belief and data), it makes sense to combine them by multiplication and consider the product
    p_{⊗i X}(D | θ) h(θ).
This product combines the two available sources of information about the
sought-after true distribution pX (θ). When data D are given, this product is a
function of model candidates θ. High values of this product mean that a candidate
model θ is a good estimate, low values mean it’s bad — in the combined light of
both observed data and prior beliefs.
With fixed D, the product p_{⊗i X}(D | θ) h(θ) is a non-negative function on the K-dimensional parameter space θ ∈ R^K. It will not in general integrate to unity and thus is not a pdf. Dividing this product by its integral however gives a pdf, which we denote by h(θ | D):
    h(θ | D) = p_{⊗i X}(D | θ) h(θ) / ∫_{R^K} p_{⊗i X}(D | θ) h(θ) dθ     (92)
• Here I have considered real-valued distributions that can be represented
by pdfs throughout. If some of the concerned distributions are discrete or
cannot be represented by pdfs for some reason, one gets different versions of
(92).
• If one wishes to obtain a single, definite model estimate from a Bayesian
modeling exercise, a typical procedure is to compute the mean value of the
posterior. The resulting model θ̂ is called the posterior mean estimate
    θ̂ = θ^PME = ∫_{R^K} θ h(θ | D) dθ.
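For our mini-example the posterior (92) and its mean can simply be approximated by numerical integration on a grid; the following sketch (my own, illustrative) reproduces a posterior mean close to the value indicated in the figure:

import numpy as np

# Uniform prior on [0, 1], Gaussian likelihood with unit variance, D = (0.9, 1.0)
mu = np.linspace(0.0, 1.0, 10001)
log_lik = -0.5 * ((0.9 - mu) ** 2 + (1.0 - mu) ** 2)   # log of prod_i p(x_i | mu)
post = np.exp(log_lik)                                  # prior is constant on [0, 1]
post /= np.trapz(post, mu)                              # normalize as in Equation (92)
print(np.trapz(mu * post, mu))                          # posterior mean, about 0.57; sample mean is 0.95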
[Figure: prior h(µ), likelihood p_{⊗i X}(D | µ), and posterior h(µ | D) for the mini-example; the posterior mean estimate is ≈ 0.565, the sample mean is 0.95.]
[Figure: fragments of aligned amino acid sequences (...GHGK..., ...AHGK..., ...KHGV..., ...THAN..., ...WHAE..., ...AHAG..., ...ALGA...), illustrating the variation observed at one site across a protein family.]
156
(actually 19 parameters would be enough — why?).
We start with the probability of observed data, given that the true distribution is θ. This corresponds to the pdf factor $p_{\otimes_i X_i}(D \mid \theta)$ in (93), but here we have discrete counting data, and hence we use a pmf instead of a pdf. The observation data D consist in counts $n_1, \ldots, n_{20}$ of the different amino acids found at a given site in the training population. These count vectors are distributed according to the multinomial distribution
$$P(D \mid \theta) = \frac{N!}{n_1! \cdots n_{20}!} \prod_{j=1}^{20} \theta_j^{n_j}, \qquad (94)$$
where $Z(\alpha) = \int_H \prod_{j=1}^{20} \theta_j^{\alpha_j - 1}\, d\theta$ is the normalization constant which ensures that the integral of h over H is one.
Fortunately the normalization denominator p(D) in (93) need not be analyzed
in more detail because it will later cancel out.
Now we have everything together to calculate the Bayesian posterior distribution on H:
$$\begin{aligned}
p(\theta \mid D, \alpha) &= \frac{P(D \mid \theta)\, h(\theta \mid \alpha)}{p(D)} \\
&= \frac{1}{p(D)} \frac{N!}{n_1! \cdots n_{20}!} \prod_{j=1}^{20} \theta_j^{n_j} \cdot \frac{1}{Z(\alpha)} \prod_{j=1}^{20} \theta_j^{\alpha_j - 1} \\
&= \frac{1}{p(D)} \frac{N!}{n_1! \cdots n_{20}!} \frac{1}{Z(\alpha)} \prod_{j=1}^{20} \theta_j^{n_j + \alpha_j - 1} \\
&= \frac{1}{p(D)} \frac{N!}{n_1! \cdots n_{20}!} \frac{Z(D + \alpha)}{Z(\alpha)}\, h(\theta \mid D + \alpha). \qquad (96)
\end{aligned}$$
In order to get the posterior mean estimate, we integrate over the model can-
didates θ with the posterior distribution. I omit the derivation (can be found in
Durbin et al. 2000) and only report the result:
$$\theta_j^{\mathrm{PME}} = \int_H \theta_j\, h(\theta \mid D + \alpha)\, d\theta = \frac{n_j + \alpha_j}{N + A}, \qquad (98)$$
where $N = n_1 + \cdots + n_{20}$ and $A = \alpha_1 + \cdots + \alpha_{20}$. If one compares this to the maximum-likelihood estimates
$$\theta_j^{\mathrm{ML}} = \frac{n_j}{N}, \qquad (99)$$
one sees that the $\alpha_j$ parameters of the Dirichlet distribution can be understood
as “pseudo-counts” that are added to the actually observed counts. These pseudo-
counts reflect the subjective intuitions of the biologist, and there is no formal rule
of how to set them correctly.
Adding the αj pseudocounts in (98) can also be considered just as a regular-
ization tool in a frequentist setting. One would skip the Bayesian computations
that give rise to the Bayesian posterior (97) and directly use (98), optimizing α
to navigate on the model flexibility scale in a cross-validation scheme. With α set
to all zero, one gets the maximum-likelihood estimate (99) which will usually be
overfitting; with large values for α one smoothes out the information contained in
the data — underfitting.
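As a small numerical illustration of formulas (98) and (99) — with invented counts and pseudo-counts, not the values discussed by Durbin et al. — one can compare the two estimators directly:

```python
import numpy as np

# Sketch of (98) and (99) for one alignment site, with made-up numbers.
counts = np.array([7, 3, 0, 0, 1] + [0] * 15, dtype=float)   # n_1..n_20, N = 11
alphas = np.full(20, 0.5)                                     # assumed pseudo-counts

theta_ml  = counts / counts.sum()                             # (99): zero for unseen amino acids
theta_pme = (counts + alphas) / (counts.sum() + alphas.sum()) # (98): smoothed estimate

print(theta_ml[:5])
print(theta_pme[:5])
```

The maximum-likelihood estimate assigns probability zero to every amino acid not observed at the site, while the pseudo-counts keep all twenty probabilities strictly positive.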
You may ask, why go through all this complex Bayesian thinking and calcu-
lating, and not just do a regularized maximum-likelihood estimation with a solid
cross-validation scheme to get the regularizers αj ’s right? The reason is that the
two methods will not give the same results, although both ultimately use the same formula $\hat{\theta}_j = \frac{n_j + \alpha_j}{N + A}$. In the frequentist, regularized, maximum-likelihood approach, the only information that is made use of is the data D, which is scarce, and the careful cross-validation will merely make sure that the estimated model $\hat{\theta} = \theta^{\mathrm{ML}}$ will be the best (with regard to not over- or underfitting) among the ones that
can be estimated from D. In contrast, a Bayesian modeler inserts additional in-
formation by fixing the αj ’s beforehand. If the Bayesian modeler has the right
intuitions about these $\alpha_j$'s, the resulting model $\hat{\theta} = \theta^{\mathrm{PME}}$ will generalize better than $\theta^{\mathrm{ML}}$, possibly much better.
The textbook of Durbin et al., from which this example is taken, shares some thoughts on how a biosequence modeler should select the pseudo-counts. The
proper investment of such soft knowledge makes all the difference in real-life ma-
chine learning problems when data are not abundant.
neurosciences posit that human cognitive processing (Clark 2013; Tenen-
baum, Griffiths, and Kemp 2006) and even the brain “hardware” (Friston
2003) are organized in functional hierarchies whose processing is informed
by Bayesian priors.
9 Sampling algorithms
Many tasks that arise in machine learning (and in mathematical modeling of com-
plex systems in general) require one to “draw” random samples from a probability
distribution. Algorithms which compute “random” samples from a distribution
are needed, for instance (very incomplete listing),
• when ensembles of models are computed for better accuracy and generaliza-
tion (as in random forests),
• when certain stochastic neural network models are trained and exploited
(Hopfield networks, Boltzmann machines),
• when training and evaluating many sorts of graphical models, like Bayesian
networks (next chapter in these LNs).
Sampling algorithms are an essential “fuel” which powers many modern com-
putational techniques. Entire branches of physics, chemistry, biology (and meteo-
rology and economics and ...) could only grow after efficient sampling algorithms
became available.
Algorithms which produce (pseudo-)random numbers from a given distribution are not easy to design. Even something that looks as elementary as
sampling from the uniform distribution — with pseudo-random number generating
algorithms — becomes mathematically and algorithmically involved if one wants
to do it well.
In this section I describe a number of design principles for sampling algorithms
that you should know about.
Figure 50: Given: a distribution PX of a RV X with sample space SX . In our
graphics: SX is a sector of the real line, and the distribution on SX is represented
by a pdf (blue line). During a sampling process, a potentially infinite sequence
of examples x1 , x2 , . . . , xn , xn+1 , . . . from SX is generated. The sampling process
must re-generate the “profile” of the pdf in the long run (when n → ∞, see panel
C). In the short run, the sampling process can, but need not, generate examples
by independent “draws” from the distribution. In A, it looks like the sampling
created the first examples independently, while in B, it looks like the sampling
first put new examples close to already created ones. Both are perfectly ok for a
sampling process. The only condition that must be fulfilled is that in the limit of
n → ∞, the sampled examples pile up in perfect proportion to the pdf.
In contrast, the random variables X1 , X2 , . . . of a “sampler” for PX need not
individually have the same distribution as X, and they need not be independent
(like in Figure 50B). For instance, let PX be the uniform distribution on E = [0, 1].
Here are some samplers:
• All Xi are i.i.d. with each of them being uniformly distributed on [0, 1]. This
is a dream of a sampler. But nobody knows how to build such a sampler
without using quantum computers.
• The subset X2i of RVs with even indices are identically and uniformly dis-
tributed on [0, 1]. The RVs X2i+1 repeat the value of X2i . Here the RVs Xi
are identically but not independently distributed.
• The subset X2i of RVs with even indices are identically and uniformly dis-
tributed on [0, 1/2] and the subset of X2i+1 are identically and uniformly
distributed on [1/2, 1]. Here the RVs Xi are independently but not identi-
cally distributed.
• Let $Y_i$ be an i.i.d. sequence where each $Y_i$ draws a value $y_i$ from N(0, 1), and let ε be a small positive real number. $X_1$ always evaluates to 0, and the following $X_i$ are inductively and stochastically defined by $x_i = x_{i-1} + \varepsilon\, y_i \pmod 1$. This gives a random walk (more specifically, a Brownian motion process) whose paths slowly and randomly migrate across [0, 1]. This kind of sampler will turn out to be the most important type when it comes to sampling from complex distributions — this is a Markov chain Monte Carlo (MCMC) sampler.
0 and some maximal value. This function can easily be used as a subroutine in a
little program which creates pseudorandom numbers in [0, 1].
It is by no means easy to program a good pseudo-random number generator –
in fact, designing such generators is an active field of research. If you are interested
– the practical guide to using pseudorandom number generators by Jones 2010 is
fun to read and very illuminating.
Assume you have a sampler Ui for the uniform distribution on [0, 1], but you
want to sample from another distribution PX on the measure space E = R, which
has a pdf f (x). Then you can use the sampler Ui indirectly to sample from PX by
a coordinate transformation, as follows.
First, compute the cumulative density function $\varphi : \mathbb{R} \to [0, 1]$, which is defined by $\varphi(x) = \int_{-\infty}^{x} f(u)\, du$, and its inverse $\varphi^{-1}$. The latter may be tricky or impossible to do analytically – then numerical approximations must be used. Now obtain a sampler $X_i$ for $P_X$ from the uniform sampler $U_i$ by
$$X_i = \varphi^{-1} \circ U_i.$$
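As a sketch of this transformation method, consider a target distribution whose inverse cumulative density function is available in closed form; I use the exponential distribution with rate lam as an assumed example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sampling by transformation for the exponential distribution with
# pdf f(x) = lam * exp(-lam * x) on [0, inf). Its cumulative density is
# phi(x) = 1 - exp(-lam * x), hence phi^{-1}(u) = -ln(1 - u) / lam.
lam = 2.0
u = rng.uniform(0.0, 1.0, size=100_000)   # the uniform sampler U_i
x = -np.log(1.0 - u) / lam                # X_i = phi^{-1}(U_i)

print(x.mean())   # should be close to 1/lam = 0.5
```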
Figure 51: Sampling by transformation from the uniform distribution: Let [a, b]
be an interval on the sample space R, and A the area under the pdf f in this
interval. A sampler for the distribution given by the pdf f must visit [a, b] with
a probability A. But A is also the increment that the cumulative density ϕ(x)
builds up from a to b. The uniform sampler Ui for [0, 1] visits the segment A (right
end in the graphic) with probability A. When Ui picks a value from the segment
A, Xi = ϕ−1 ◦ Ui will give a value in the interval [a, b].
The remainder of this subsection is optional reading and will not be tested in
exams (but may turn up in the exercise sheet).
Out of curiosity, I explored a little bit how one can sample from the standard
normal distribution N (0, 1). Computing it by transformation from the uniform
distribution, as outlined above, appears not to be done because the inverse cumula-
tive density function ϕ−1 for N (0, 1) can only be represented through a power series
which converges slowly (my source, I must confess, is Wikipedia). Instead, special
algorithms have been invented for sampling from N (0, 1), which exploit mathemat-
ical properties of this distribution. One algorithm (see http://en.wikipedia.org/wiki/Normal_distribution#Generating_values_for_normal_random_variables) which I found very elegant is the Box-Muller algorithm, which produces an N(0, 1)-
distributed RV C from two independent uniform-[0,1]-distributed RVs A and B:
$$C = \sqrt{-2 \ln A}\, \cos(2 \pi B).$$
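A minimal sketch of the Box-Muller formula, vectorized over many draws (seed and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Box-Muller: two independent uniform-[0,1] RVs A, B give a standard normal C.
a = rng.uniform(size=100_000)
b = rng.uniform(size=100_000)
c = np.sqrt(-2.0 * np.log(a)) * np.cos(2.0 * np.pi * b)

print(c.mean(), c.std())   # should be close to 0 and 1
```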
Then, for sampling one value $(x_1, x_2)$, sample two values $u_1, u_2$ from the uniform sampler U, and transform
$$x_1 = \varphi_1^{-1}(u_1), \qquad x_2 = \varphi_2^{-1}(u_2 \mid x_1).$$
A widely used method for drawing a random vector x from the n-dimensional multivariate normal distribution with mean vector µ and covariance matrix Σ works as follows:
1. Compute a matrix A with $A A' = \Sigma$, for instance by a Cholesky decomposition of Σ.
2. Sample $z_1, \ldots, z_n$ independently from N(0, 1).
3. Output $x = \mu + A\,(z_1, \ldots, z_n)'$.
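A sketch of these three steps for an assumed 2-dimensional mean vector and covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])                       # assumed mean vector
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                   # assumed covariance matrix

A = np.linalg.cholesky(Sigma)                    # step 1: A with A A' = Sigma
z = rng.standard_normal(2)                       # step 2: z_1, ..., z_n i.i.d. N(0,1)
x = mu + A @ z                                   # step 3: output x = mu + A z

print(x)
```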
and for which sampling by transformation works, and with a simple trick get from
that a sampler for the pdf of interest, f. To understand this rejection sampling (also known as acceptance-rejection sampling) we first need to generalize the notion of a pdf:
If one divides a proto-pdf g0 by its integral, one obtains a pdf g. Now consider
a pdf f on Rn that you want to sample from, but sampling by transformation
doesn’t work. However, you find a proto-pdf $g_0 \ge f$ whose associated pdf g you know how to sample from, that is, you have a sampler for g. With that in
your hands you can construct a sampler Xi for f as follows. In order to generate
a random value x for Xi , carry out the following procedure:
While no value x has been accepted, do:
1. Sample a candidate x̃ from g, and sample u from the uniform distribution on [0, 1].
2. If $u \le f(\tilde{x})/g_0(\tilde{x})$, accept the candidate and set $x = \tilde{x}$; otherwise reject it and repeat.
Figure 52: The principle of rejection sampling. Candidates x̃ are first sampled
from g, then accepted (“painted orange”) with probability f (x̃)/g0 (x̃).
The computational efficiency of rejection sampling clearly depends on how close
g0 is to f . If the ratio f /g0 is on average small, there will be many rejections which
slow down the algorithm. In high-dimensional spaces it is often quite difficult to
avoid this.
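Here is a small sketch of the procedure for an invented target: the proto-pdf $f(x) = x^2(1-x)$ on [0, 1], dominated by the constant proto-pdf $g_0 = 0.15$, whose associated pdf g is simply the uniform distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # unnormalized target on [0, 1]; its maximum is about 0.148
    return x**2 * (1.0 - x)

g0 = 0.15   # constant proto-pdf with g0 >= f on [0, 1]; its pdf g is uniform

def sample_f():
    while True:
        x_cand = rng.uniform(0.0, 1.0)               # candidate from g
        if rng.uniform(0.0, 1.0) <= f(x_cand) / g0:  # accept with prob f/g0
            return x_cand

samples = np.array([sample_f() for _ in range(20_000)])
print(samples.mean())   # normalized f is a Beta(3,2) pdf, whose mean is 0.6
```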
9.4 Proto-distributions
The pdf’s of parametrized continuous distributions over Rk are typically repre-
sented by a formula of the kind
$$p(x \mid \theta) = \frac{1}{\int_{\mathbb{R}^k} p_0(x \mid \theta)\, dx}\, p_0(x \mid \theta), \qquad (100)$$
where $p_0 : \mathbb{R}^k \to \mathbb{R}_{\ge 0}$ defines the shape of the pdf and the factor $1/\int_{\mathbb{R}^k} p_0(x \mid \theta)\, dx$ normalizes it such that it integrates to one.
This is also a good moment to mention the softmax function. This is a ubiqui-
tously used machine learning trick to transform any vector y = (y1 , . . . , yn )0 ∈ Rn
into an n-dimensional probability vector p by
$$p = \frac{1}{\sum_{i=1,\ldots,n} \exp(\alpha y_i)} \left(\exp(\alpha y_1), \ldots, \exp(\alpha y_n)\right)'.$$
The softmax is standardly used, for instance, to transform the output vector
of a neural network into a probability vector, which is needed when one wants to
interpret the network output probabilistically.
The factor α ≥ 0 determines the entropy of the resulting probability vector.
If α = 0, p is the uniform distribution on {1, . . . , n}, and in the limit α → ∞, p
becomes the binary vector which is zero everywhere except at the position i where
y is maximal.
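A minimal sketch of the softmax (the subtraction of the maximum is a standard numerical-stability trick, not part of the formula above):

```python
import numpy as np

def softmax(y, alpha=1.0):
    z = alpha * np.asarray(y, dtype=float)
    z = z - z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

y = [2.0, 1.0, -0.5]
print(softmax(y, alpha=0.0))      # uniform: [1/3, 1/3, 1/3]
print(softmax(y, alpha=1.0))      # graded probability vector
print(softmax(y, alpha=50.0))     # nearly one-hot at the position of the maximum
```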
Notes:
1. The word “proto-distribution” is not in general use and can be found in the
literature only rarely.
they play a leading role. In machine learning, Boltzmann distributions oc-
cur, for instance, in Markov random fields which are a generalization of
Markov chains to spatial patterns that are used (among other applications)
in image analysis. They are a core ingredient for Boltzmann machines (Ack-
ley, Hinton, and Sejnowski 1985), a type of neural network which I consider
one of the most elegant and fundamental models for general learning sys-
tems. A computationally simplified version, restricted Boltzmann machines,
became a starting point for what today is known as deep learning (Hinton and Salakhutdinov 2006). Furthermore, Boltzmann distributions also are
instrumental for free-energy models of intelligent agents, a class of models
of brain dynamics popularized by Karl Friston (for example, Friston 2005)
which explain the emergence of adaptive, hierarchical information repre-
sentations in biological brains and artificial intelligent “agents”, and which
today are a mainstream approach in the cognitive neurosciences and cog-
nitive science to understanding adaptive cognition. And last but not least,
the Boltzmann distribution is the root mechanism in simulated annealing
(Kirkpatrick, Gelatt, and Vecchi 1983), a major general-purpose strategy
for finding minimal-cost solutions in complex search spaces.
Sadly in this course there is no time for treating energy-based models of
information processing. I present this material in my course on neural net-
works in the Summer semester.
tory reading. And if you want to inform yourself about the use of MCMC in deep
learning (advanced stuff), a talk of Radford Neal that he recently delivered in
honor of Geoffrey Hinton (Neal 2019) gives you an orientation. And if you think
of entering a spectacular academic career, it is not a bad idea to write a PhD
thesis which “only” gives the first readable, coherent, comprehensive tutorial in-
troduction to some already existing kind of mathematical modeling method which,
until you write that tutorial, has been documented only in scattered technical pa-
pers written from different perspectives and using different notations, making that
knowledge inaccessible to a wider audience.
• right-infinite: n = 1, 2, 3, . . ., or
• left-right infinite: n ∈ Z.
For defining samplers, one has to use n ∈ N: a sampler must, by definition, be able
to be run for arbitrarily long times.
A Markov chain on S is fully characterized by the initial distribution PX1
(needed for starting finite and right-infinite chains but not needed for left-right
infinite chains which have no start) and the conditional transition distributions
for which we also write (following Neal’s notation)
$$T_n(x \mid y). \qquad (104)$$
2. If xn has been generated, generate xn+1 by a random draw from the distri-
bution PXn+1 |Xn =xn which is specified by the transition kernel T (x|xn ).
A note on notation: a mathematically correct and general definition and no-
tation for transition kernels on arbitrary observation spaces S requires tools from
measure theory which we haven’t introduced. Consider the notation T (x|y) as a
somewhat sloppy shorthand. When dealing with discrete distributions where S is
finite, say, S = {s1 , . . . , sk }, then consider T (x|y) as a k × k Markov transition
matrix M where $M(i, j) = P(X_{n+1} = s_j \mid X_n = s_i)$. The i-th row of M holds the probabilities by which the process will transit from state $s_i$ to the states $s_1, \ldots, s_k$ indexing the columns.
An example: consider the two-state observation space S = {x1 , x2 }. Then
the initial distribution PX1 is given by a 2-dimensional probability vector. For
instance, $(0.3, 0.7)'$ would mean that the process starts with an observation $x_1$ with probability 0.3. A transition matrix might be
$$M = \begin{pmatrix} 0.1 & 0.9 \\ 0.4 & 0.6 \end{pmatrix},$$
whose first row means that if at time n the process has generated state x1 , at time
n + 1 one will observe x1 again with probability 0.1 and x2 with probability 0.9.
When dealing with continuous distributions of next states x (given y) which
have a pdf, regard T (x|y) as denoting the conditional pdf pX|Y =y of x. Note that
when we write T (x|y), we refer not to a single pdf but to a family of pdf’s; for
every y we have another conditional pdf T (x|y).
If a Markov chain with finite state set S = {s1 , . . . , sk } and Markov transition
matrix M is executed m times, the transition probabilities to transit from state
si to state sj after m steps can be found in the m-step transition matrix M m :
where $M^m = M \cdot M \cdot \ldots \cdot M$ ($m$ times).
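A quick sketch that simulates this two-state chain and compares the empirical state frequencies with the rows of $M^m$; the initial distribution and transition matrix are taken from the example above, the run length is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)

M = np.array([[0.1, 0.9],
              [0.4, 0.6]])
p_init = np.array([0.3, 0.7])

state = rng.choice(2, p=p_init)          # draw x_1 from the initial distribution
counts = np.zeros(2)
for _ in range(100_000):
    state = rng.choice(2, p=M[state])    # draw x_{n+1} from the row of the current state
    counts[state] += 1

print(counts / counts.sum())             # empirical distribution over the two states
print(np.linalg.matrix_power(M, 50))     # M^m: all rows converge to the same vector
```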
We now consider the sample space S = Rk — this being the state space needed
for most practical uses of MCMC sampling — and we assume that all distributions
of interest are specified by pdf’s. We consider a homogeneous Markov chain. Its
transition kernel T (x|y) can be identified with the pdf pXn+1 |Xn =y , and in the
remainder of this section, we will write T (x|y) for this pdf. Such a Markov chain
with continuous sample space Rk is specified by an initial distribution on Rk which
we denote by its pdf g (1) . The pdf’s g (n+1) of distributions of subsequent RVs Xn+1
can be calculated from the pdf’s g (n) of the preceding RV Xn by
$$g^{(n+1)}(x) = \int_{\mathbb{R}^k} T(x \mid y)\, g^{(n)}(y)\, dy. \qquad (106)$$
Please make sure you understand this equation. It is the key to everything
which follows.
For the theory of MCMC sampling, a core concept is an invariant distribution
of a homogenous Markov chain.
Definition 9.4 Let g be the pdf of some distribution on Rk , and let T (x|y) be the
(pdf of the) transition kernel of a homogeneous Markov chain with values in Rk .
Then g is the pdf of an invariant distribution of T (x|y) if
$$g(x) = \int_{\mathbb{R}^k} T(x \mid y)\, g(y)\, dy. \qquad (107)$$
Except for certain pathological cases, a transition kernel generically has at least
one invariant distribution.
Furthermore, it is often the case that there exists exactly one invariant distri-
bution g of T (x|y), and the sequence of distributions g (n) converges to g from any
initial distribution. We will call the transition kernel T (x|y) ergodic if it has this
property. The (unique) invariant distribution g of an ergodic Markov chain is also
called its asymptotic distribution or its stationary distribution or its equilibrium
distribution.
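In the discrete-state case the integrals in (106) and (107) become matrix-vector products, which makes the invariant distribution easy to inspect numerically; a small sketch using the two-state matrix M from above:

```python
import numpy as np

M = np.array([[0.1, 0.9],
              [0.4, 0.6]])

# Discrete analogue of (106): g_{n+1} = M' g_n.
g = np.array([1.0, 0.0])                 # some initial distribution g^(1)
for _ in range(100):
    g = M.T @ g
print(g)                                  # converges to the invariant distribution

# Equivalently, (107) says the invariant distribution is a left eigenvector
# of M with eigenvalue 1.
evals, evecs = np.linalg.eig(M.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
print(pi / pi.sum())                      # same vector, found as an eigenvector
```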
first, the amount of computation required to simulate each transition; second, the
time for the chain to converge to the equilibrium distribution, which gives the
number of states that must be discarded from the beginning of the chain; third, the
number of transitions needed to move from one state drawn from the equilibrium
distribution to another state that is almost independent, which determines the
number of states taken from the chain at equilibrium that are needed to produce
an estimate of a given accuracy. The latter two factors are related...”
A brief explanation: The best possible sampler for g would be one where each
new sampled value is independent from the previous ones — in other words, one
would like to have an i.i.d. sampler. If observations xn obtained from a sampler
depend on previous observations xn−1 , xn−2 , . . ., there is redundant information in
the sample path. Typically paths (xn )n=1,2,... obtained from running an MCMC
sampler will have more or less strong dependencies between values observed at
nearby times. The first values $x_2, x_3, \ldots$ will depend, to a decreasing de-
gree, on the arbitrary initial value x1 and should be discarded. After this initial
“washout” phase, one usually keeps only those values xn , xn+d , xn+2d , . . . whose
distance d from each other is large enough to warrant that the dependency of
xn+d on xn has washed out to a negligible amount, that is, P (Xn+d = x | Xn =
y) ≈ P (Xn+d = x).
A standard way to construct an MCMC transition kernel T (x|y) which leads
to a Markov chain that has the target distribution g as its invariant distribution is
to ensure that the Markov chain (Xn )n=1,2,... has the property of detailed balance
with respect to g. Detailed balance connects X1 , X2 , . . . to g in a strong way.
It says that if we pick some state x ∈ Rk with the probability given by g and
multiply its probability g(x) with the transition probability density T (y|x) —
that is, we consider the probability density of transiting from x to y weighted
with the probability density of x — then this is the same as the reverse weighted
transiting probability density from y to x:
$$T(y \mid x)\, g(x) = T(x \mid y)\, g(y) \quad \text{for all } x, y \in \mathbb{R}^k. \qquad (108)$$
9.5.3 The Gibbs sampler
Let g be a pdf on $\mathbb{R}^k$, from which we want to sample. For $i = 1, \ldots, k$ and $x \in \mathbb{R}^k$, where $x = (x_1, \ldots, x_k)'$, let
$$g_i(\cdot \mid x) : \mathbb{R} \to \mathbb{R}_{\ge 0}, \qquad g_i(y \mid x) = \frac{g((x_1, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_k)')}{\int_{\mathbb{R}} g((x_1, \ldots, x_{i-1}, z, x_{i+1}, \ldots, x_k)')\, dz}$$
be the conditional density function of the coordinate i given the values of x on the
other coordinates. Let $g^{(1)}$ be the pdf of an initial distribution on $\mathbb{R}^k$, and let an initial value $x^{(1)} = (x_1^{(1)}, \ldots, x_k^{(1)})'$ be drawn from $g^{(1)}$. We define a Markov chain
X1 , X2 , . . . through transition kernels as follows. The idea is to cycle through the
k coordinates and at some time $\nu k + i$ ($0 \le \nu$, $1 \le i \le k$) change the previous state $x^{(\nu k + i - 1)} = (x_1^{(\nu k + i - 1)}, \ldots, x_k^{(\nu k + i - 1)})'$ only in the i-th coordinate, by sampling from $g_i$. That is, at time $\nu k + i$ we set
$$x^{(\nu k + i)} = (x_1^{(\nu k + i - 1)}, \ldots, x_{i-1}^{(\nu k + i - 1)}, y, x_{i+1}^{(\nu k + i - 1)}, \ldots, x_k^{(\nu k + i - 1)})'$$
equal to the previous state except in coordinate i, where the value y is freshly sampled from $g_i(\cdot \mid x^{(\nu k + i - 1)})$.
This method is known as the Gibbs sampler. It uses k different transition
kernels T1 , . . . , Tk , where Ti is employed at times νk + i and updates only the i-th
coordinate. Of course, constructing such a Gibbs sampler requires one to have
i.i.d. samplers for the 1-dimensional coordinate distributions gi in the first place.
This is, however, a simpler problem than the original multi-dimensional one.
This Markov chain is not homogeneous because we cycle through different
transition kernels. However, we can condense a sequence of k successive updates
into a single update that affects all coordinates by putting T = Tk ◦ · · · ◦ T1 , which
yields a homogeneous Markov chain (Yn )n=1,2,... with transition kernel T whose
path is derived from a path (xn )n=1,2,... of the “cycling” Markov chain by
$$y^{(1)} = x^{(1)}, \quad y^{(2)} = x^{(k+1)}, \quad y^{(3)} = x^{(2k+1)}, \ldots$$
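A minimal sketch of a Gibbs sampler for a target where the 1-dimensional conditionals are easy to sample from: a 2-dimensional Gaussian with unit variances and correlation rho (my example; both conditionals are then again Gaussian):

```python
import numpy as np

rng = np.random.default_rng(0)

rho = 0.9
n_steps = 50_000
x = np.zeros(2)                       # initial value x^(1)
samples = np.empty((n_steps, 2))

for n in range(n_steps):
    # one condensed k-cycle: update coordinate 1 from g_1(. | x_2),
    # then coordinate 2 from g_2(. | x_1)
    x[0] = rng.normal(rho * x[1], np.sqrt(1.0 - rho**2))
    x[1] = rng.normal(rho * x[0], np.sqrt(1.0 - rho**2))
    samples[n] = x

print(np.cov(samples.T))              # should approach [[1, rho], [rho, 1]]
```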
Bm of positive radius, the first containing x and the last containing y, such that
Bm intersects with Bm+1 . However, if the support of g is not k-dimensionally
connected, then T may or may not be ergodic, and determining whether it is
needs to be done on a case by case basis. For instance, if g is a distribution
on R2 whose support lies exclusively in the first and third orthant, T would not
be ergodic, because the Gibbs sampler, when started from a point in the third
orthant, would be unable to jump into the first orthant. This situation is depicted
in Figure 53.
Figure 53: A bipartite pdf g where the Gibbs sampler would fail.
The Gibbs sampler is practically applicable only if one can easily sample from
the 1-dimensional conditional distributions gi . Therefore, the Gibbs sampler is
mostly employed in cases where these gi are parametric, analytical distributions,
or in cases where S is finite and the gi thus become simple probability vectors (and
would be represented by pmfs, not pdfs). The Gibbs sampler is attractive for its
simplicity. A number of extensions and refinements of the basic idea are presented in Neal 1993.
in two substeps, which together ensure detailed balance w.r.t. the conditional
distribution gi :
Step 1: Randomly choose a candidate value y ∗ for y. Usually a proposal dis-
tribution Si (y ∗ | x(νk+i−1) ) is used (it may depend on x(νk+i−1) ), which is
symmetric in the sense that
$$S_i\big(y \mid (x_1^{(\nu k+i-1)}, \ldots, x_{i-1}^{(\nu k+i-1)}, y', x_{i+1}^{(\nu k+i-1)}, \ldots, x_k^{(\nu k+i-1)})'\big) = S_i\big(y' \mid (x_1^{(\nu k+i-1)}, \ldots, x_{i-1}^{(\nu k+i-1)}, y, x_{i+1}^{(\nu k+i-1)}, \ldots, x_k^{(\nu k+i-1)})'\big)$$
for all y, y 0 , x.
Step 2: Randomly accept or reject $y^*$ as the new value for $x_i^{(\nu k+i)}$, in a fashion that ensures detailed balance. In case of acceptance, $x_i^{(\nu k+i)}$ is set to $y^*$; in case of rejection, $x_i^{(\nu k+i)}$ is set to $x_i^{(\nu k+i-1)}$, i.e., the observation vector is not changed at all and is identically repeated in the sampling sequence. The probability for accepting $y^*$ is determined by an acceptance probability $A_i(y^* \mid x)$.
$$\forall x = (x_1, \ldots, x_k)' \in \mathbb{R}^k,\ \forall y, y' \in \mathbb{R} : \quad g_i(y \mid x)\, T_i(y' \mid (x_1, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_k)') = g_i(y' \mid x)\, T_i(y \mid (x_1, \ldots, x_{i-1}, y', x_{i+1}, \ldots, x_k)'). \qquad (109)$$
There are several ways to ensure this. For instance, it is not difficult to see (exercise!) that if $S_i(y^* \mid x^{(\nu k+i-1)})$ is symmetric, the acceptance function
$$A_i(y^* \mid (x_1, \ldots, x_k)') = \frac{g_i(y^* \mid (x_1, \ldots, x_k)')}{g_i(y^* \mid (x_1, \ldots, x_k)') + g_i(x_i \mid (x_1, \ldots, x_k)')} \qquad (110)$$
(the Boltzmann acceptance function) ensures (109). Another common choice is the Metropolis acceptance function
$$A_i(y^* \mid (x_1, \ldots, x_k)') = \min\left(1,\ \frac{g_i(y^* \mid (x_1, \ldots, x_k)')}{g_i(x_i \mid (x_1, \ldots, x_k)')}\right). \qquad (111)$$
That is, whenever $g_i(y^* \mid (x_1, \ldots, x_k)') \ge g_i(x_i \mid (x_1, \ldots, x_k)')$ — i.e. the proposed observation in component i has no lower probability than the current one — accept with certainty; else accept with probability $g_i(y^* \mid (x_1, \ldots, x_k)') / g_i(x_i \mid (x_1, \ldots, x_k)')$.
For symmetric proposal distributions, the Metropolis acceptance function also
implies detailed balance. It is clear that in the rejection case, where the current
observation is not changed, the detailed balance condition trivially holds. In the
acceptance case we verify (108) as follows:
Notes:
1. Both the Boltzmann and the Metropolis acceptance functions also work with
proto-pdf’s gi . This is an invaluable asset because often it is infeasible to
compute the normalization factor 1/Z needed to turn a proto-pdf into a pdf.
2. Like the Gibbs sampler, this local Metropolis algorithm can be turned into a
global one, then having a homogeneous Markov chain with transition kernel
T , by condensing one k-cycle through the component updates into a single
observation update.
4. The Metropolis algorithm is more widely applicable than the Gibbs sampler
because it obviates the need to design an i.i.d. sampler for the coordinate
distributions. The price one has to pay is a higher computational cost,
because the rejection events lead to duplicate sample points which obviously
leads to an undesirable “repetition redundancy” in the sample that has to
be compensated by a larger sample size.
6. The quality of Metropolis sampling depends very much on the used proposal
distribution. Specifically, the variance of the proposal distribution should
neither be too small (then exploration of new states is confined to a narrow
neighborhood of the current state, implying that the Markov chain traverses
the distribution very slowly) nor too large (then one will often be propelled
far out in the regions of g where it almost vanishes, leading to numerous
rejection events). A standard choice is a normal distribution centered on the
current state.
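Pulling these pieces together, here is a sketch of a global Metropolis sampler with a symmetric Gaussian proposal, applied to an invented 1-dimensional proto-pdf g0; it also illustrates that the normalization constant of the target is never needed:

```python
import numpy as np

rng = np.random.default_rng(0)

def g0(x):
    # invented unnormalized target: a bimodal mixture of two Gaussian bumps
    return np.exp(-0.5 * (x - 2.0)**2) + 0.5 * np.exp(-0.5 * (x + 2.0)**2)

step_width = 1.0                          # std of the symmetric proposal distribution
x = 0.0
samples = []
for _ in range(100_000):
    x_prop = rng.normal(x, step_width)                  # proposal S(x* | x)
    if rng.uniform() < min(1.0, g0(x_prop) / g0(x)):    # Metropolis acceptance (111)
        x = x_prop                                      # accept
    samples.append(x)                                   # on rejection, x is repeated

samples = np.array(samples[10_000:])      # discard a burn-in ("washout") phase
print(samples.mean())
```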
the same family (cichlids from central African lakes except one cichlid species
from America that served as a control) were represented by DNA sequences of
length 1044. 567 of these 1044 sites were identical across all considered species
and thus carried no information about phylogeny. The remaining N = 477 sites
represented the data D that was entered into the analysis. Reminder: the DNA
symbol alphabet is Σ = {A, C, G, T }.
Task. Infer from data D the most likely phylogenetic tree, assuming that the
considered living species have a common ancestor from which they all descended.
Modeling assumptions. Mutations act on all sites independently. Mutations
occur randomly according to a “molecular clock”, i.e. a probability distribution
of the form
$$P^{\mathrm{clock}}(y \mid x, t, \theta) \qquad (112)$$
specifying the probability that the symbol y ∈ {A, C, G, T} occurs at a given site
where t years earlier the symbol x occurred. θ is a set of parameters specifying
further modeling assumptions about the clock mechanism. Mau et al. used the
molecular clock model proposed in Hasegawa, Kishino, and Yano 1985 which uses
two parameters θ = (φ, κ), the first quantifying an overall rate of mutation, and
the second a difference of rates between the more frequent mutations that leave the
type of nucleic acid (purine or pyrimidine) unchanged (“transitions”) vs. change
the type (“transversions”). All that we need to know here, not pretending to be
biologists, is that (112) can be efficiently computed. Note that θ is not known
beforehand but has to be estimated/optimized in the modeling process, based on
the data D.
Representing phylogenetic trees. A phylogenetic tree is a binary tree Ψ.
The nodes represent species; leaves are living species, internal nodes are extinct
species, the root node is the assumed common ancestor. Mau et al. plot their trees
bottom-up, root node at the bottom. Vertical distances between nodes metrically
represent the evolutionary timespans t between nodes. Clades are subsets of the
leaves that are children of a shared internal node. Figure 54 shows a schematic
phylogenetic tree and some clades.
A given evolutionary history can be represented by trees in $2^{n-1}$ different but equivalent ways (where n is the number of living species), through permuting
the two child branches of an internal node. For computational purposes a more
convenient representation than a tree graph is given by
The graph distance is the length of the connecting path between the two leaves.
In the example tree from Figure 54, the distances between leaves make the distance
vector a = 2(t1 + t2 + t3 + t4 , t1 , t1 + t2 + t3 + t4 + t5 , t1 + t2 + t3 , t1 + t2 , t1 + t2 +
t3 + t4 + t5 + t6 ). A pair (σ, a) is a compact representation for a phylogenetic tree.
Figure 54: An exemplary phylogenetic tree (from the Mau et al. paper). {4, 7}, {1, 4, 7}, {2, 3, 6} are examples of clades in this tree.
We approach this task sideways, assuming first that we know the symbols yν at the
i-th site of the internal nodes ν of Ψ. Let ϱ be the root node and $\pi_0$ a reasonable distribution of symbols in ϱ at site i (for instance the global distribution of all symbols in all sites of all sequences in D). Then we get the probability of observed
data Di joined with the hypothetical data of these (yν )ν∈I (where I is the set of
internal nodes of Ψ) by
$$P(D_i, (y_\nu)_{\nu \in I} \mid \Psi, \theta) = \pi_0(y_\varrho) \prod_{\nu \text{ non-root node of } \Psi} P^{\mathrm{clock}}(y_\nu \mid y_{\mathrm{par}(\nu)}, t_\nu, \theta), \qquad (115)$$
where par(ν) is the parent node of ν and tν is the timespan between par(ν) and
ν. From (115) we could obtain (114) by summing over all possible assignments of
symbols to internal nodes, which is clearly infeasible. Fortunately there is a cheap
recursive way to obtain (114), which works top-down from the leaves, inductively
assigning conditional likelihoods $L_\nu(y) = P(D_i^\nu \mid \Psi, \theta, \text{node } \nu = y)$ to nodes ν, where $y \in \Sigma$ and $D_i^\nu$ is the subset of the $D_i$ located at the leaves below node ν, as
follows:
$$\text{Case 1: } \nu \notin I : \quad L_\nu(y) = \begin{cases} 1, & \text{if } y = y_\nu \\ 0, & \text{else} \end{cases}$$
$$\text{Case 2: } \nu \in I : \quad L_\nu(y) = \left(\sum_{z \in \Sigma} L_\lambda(z)\, P^{\mathrm{clock}}(z \mid y, t_\lambda, \theta)\right) \left(\sum_{z \in \Sigma} L_\mu(z)\, P^{\mathrm{clock}}(z \mid y, t_\mu, \theta)\right),$$
where λ, µ are the two children of ν, tλ is the timespan from ν to λ, and tµ is the
timespan from ν to µ. Then (114) is obtained from
$$L_i(\Psi, \theta) = \sum_{z \in \Sigma} \pi_0(z)\, L_\varrho(z),$$
O(N |Σ|l) flops are needed to compute L(Ψ, θ) – in our example, N = 477, |Σ| = 4,
l = 32.
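A compact sketch of this recursion for a single site on a toy tree; the clock model p_clock below is an invented stand-in for (112), and the tree, timespans and leaf symbols are made up for illustration:

```python
import numpy as np

SIGMA = "ACGT"

def p_clock(t, kappa=0.2):
    # Invented stand-in for the molecular clock (112): a 4x4 matrix whose
    # entry [x, y] is P(symbol y after time t | symbol x); rows sum to 1.
    stay = np.exp(-kappa * t)
    return stay * np.eye(4) + (1.0 - stay) * np.full((4, 4), 0.25)

def L(node, tree, leaf_symbol):
    """Conditional likelihoods L_nu(y) for all y in Sigma, as a length-4 vector."""
    if node in leaf_symbol:                                   # Case 1: leaf node
        vec = np.zeros(4)
        vec[SIGMA.index(leaf_symbol[node])] = 1.0
        return vec
    (left, t_left), (right, t_right) = tree[node]             # Case 2: internal node
    left_term = p_clock(t_left) @ L(left, tree, leaf_symbol)    # sum over z for child lambda
    right_term = p_clock(t_right) @ L(right, tree, leaf_symbol) # sum over z for child mu
    return left_term * right_term

tree = {0: ((1, 1.0), (2, 3.0)), 1: ((3, 2.0), (4, 2.0))}     # node -> (children, timespans)
leaf_symbol = {2: "A", 3: "A", 4: "G"}                        # observed symbols at this site
pi0 = np.full(4, 0.25)                                        # root symbol distribution

print(pi0 @ L(0, tree, leaf_symbol))                          # site likelihood L_i(Psi, theta)
```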
The posterior distribution of trees and mutation parameters. We
are actually not interested in the likelihoods L(Ψ, θ) but rather in the distribution
of Ψ, θ (a Bayesian hyperdistribution!) given D. Bayes' theorem informs us that this desired distribution is proportional to the likelihood times the prior (hyper-)distribution P(Ψ, θ) of Ψ, θ:
$$P(\Psi, \theta \mid D) \propto L(\Psi, \theta)\, P(\Psi, \theta).$$
Lacking a profound theoretical insight, Mau et al. assume for P (Ψ, θ) a very simple,
uniform-like distribution (such uninformedness is perfectly compatible with the
Bayesian approach!). Specifically:
1. They bound the total height of trees by some arbitrary maximum value, that is, all trees Ψ with a greater height are assigned P(Ψ, θ) = 0.
The structure of P (Ψ, θ | D). Before we set forth to compute the posterior
hyperdistribution P (Ψ, θ | D), let us take a closer look at the structure of the
mathematical “space” in which the parameter pairs (Ψ, θ) lie.
Remember that a tree Ψ is specified by (σ, a), where σ is a permutation vector
of (1, . . . , l) and a is a numerical vector of length l − 1. Noticing that there are l!
permutations, a pair (σ, a) reflects a point in a product space {1, . . . , l!} × Rl−1 ;
together with the two real-valued parameters comprised in θ this brings us to a
space {1, . . . , l!} × Rl+1 . Specifically, this space is a product of a discrete space
(the finite but very large set {1, . . . , l!}) with a continuous space (Rl+1 ). As a
consequence, one cannot mathematically describe a probability measure on this
space with a pmf, nor with a pdf! And thus, one cannot “compute” a pmf or pdf
for P (Ψ, θ | D). So — what can we compute for, about, with, or on P (Ψ, θ | D)?
The answer is: we can get arbitrarily precise and exhaustive information about
P (Ψ, θ | D) by ... and only by ... you guess right: sampling.
The Metropolis algorithm at work. Mau et al. use the Metropolis algo-
rithm (in a global version) to sample trees Ψ and evolutionary clock parameters
θ from P (Ψ, θ | D). A crucial design task is to find a good proposal distribution
S((Ψ∗, θ∗) | (Ψ, θ)). It should lead from any plausible (Ψ, θ) [“plausible” means that g((Ψ, θ)) is not very small] to another plausible (Ψ∗, θ∗) which should, however, be as distinct from (Ψ, θ) as possible. The way Mau et al. go about this task is one of the core contributions of their work.
The authors alternate between updating only θ and only Ψ. Proposing θ∗ from θ is done in a straightforward way: the new parameters θ∗ are randomly drawn from a rectangular distribution centered on the current settings θ.
The tricky part is to propose an as different as possible, yet “plausibility-
preserving” new tree Ψ∗ from Ψ. Mau et al. transform Ψ = (σ, a) into Ψ∗ =
(σ ∗ , a∗ ) in two steps:
1. The current tree Ψ is transformed into one of its $2^{n-1}$ equivalent topological versions by randomly reversing, with probability 0.5, each of its internal branches, getting Ψ′ = (σ′, a′).
2. In Ψ′ the evolutionary inter-species time spans t are varied by changing the old values by a random increment drawn from the uniform distribution over [−δ, δ], where δ is a fixed bound (see Figure 55). This gives Ψ∗ = (σ′, a∗).
Mau et al. show that this method yields a symmetric proposal distribution, and
that every tree Ψ′ can be reached from every other tree Ψ in a bounded number
of such transformations – the Markov chain is thus ergodic.
Concretely, the Metropolis algorithm was run for 1,100,000 steps (20 hours
CPU time on a 1998 Pentium 200 PC). The first 100,000 steps were discarded
to wash out possible distortions resulting from the arbitrary starting tree. The
remaining 1,000,000 trees were subsampled by 200 (reflecting the finding that
after every 200 steps, trees were empirically independent [zero empirical cross-
correlation at 200 step distance]), resulting in a final tree sample of size 5000.
Figure 55: Proposal candidate trees, attainable from the current tree, are found
within timeshift intervals of size 2δ, centered at the current internal nodes, that
constrain the repositioning of the internal nodes. Note that if the two rightmost
internal nodes are shifted such that their relative heights become reversed (dashed
blue circles), the topology of the tree would change (dashed blue lines). Figure
adapted from the Mau et al paper.
Final findings. Recall that the overall goal of this entire effort was to determine which tree topology best explains the current genetic data D, under the assumptions of the Bayesian prior P(Ψ, θ). After the sampling was done, 5000
trees had been collected whose sample frequencies reflect their Bayesian posterior
probabilities. The rest is easy: sort these sampled trees into different subsets, each
of which is defined by a specific tree topology, then interpret the relative sizes of
these sets as probability estimates.
The 600 (!) most frequent topologies account for 90% of the total probability mass. This high variability, however, is almost entirely due to minute variations
within 6 clades (labelled A, ..., F) that are stable across different topologies. Figure
56 shows the two most frequently found topologies resolved at the clade level, with
a very strong posterior probability indication for the first of the two.
For quality control, this was repeated 10 times, with no change of the final
outcome, which makes the authors confident of their work’s statistical reliability.
Figure 56: The two clade tree structures of highest posterior probability found by
Mau et al.
10 Graphical models
My favourite motivational example for introducing graphical models is the Space
Shuttle. In the launch phase, the engineering staff in the control center and the
astronauts on board have to make many critical decisions under severe time pres-
sure. If anything on board of the roaring Space Shuttle seems to go wrong, it
must be immediately determined, for instance, whether to shut down an engine
(means ditching the Space Shuttle in the ocean but might save the astronauts) or
not (might mean explosion of the engine and death of astronauts — or it might be
just the right thing to do if it’s a case of sensor malfunction and in fact all is fine
with the engine). The functioning of the Space Shuttle is monitored with many
dozens of sensors which under the enormous vibration stress are more prone to
malfunction than you would believe. They are delivering a massive flow of informa-
tion which is impossible for human operators to evaluate for fast decision-making.
So-called decision support systems are needed which automatically calculate prob-
abilities for the actual system state from sensor readings and display to the human
operators a condensed, human-understandable view of its most decision-critical
findings. Read the original paper of Horvitz and Barry 1995 on the probabilistic
modeling of the combined SpaceShuttle-Pilots-GroundControlStaff system if you
want to see how thrilling and potentially life-saving statistical modeling can be.
Such decision support systems are designed around hidden and visible random
variables. The visible RVs represent observable variables, for instance the pressure
or temperature readings from Space Shuttle engine sensors, but also actions taken
by the pilot. The hidden RVs represent system state variables which cannot be di-
rectly measured but which are important for decision making, for instance a binary
RV “Sensor 32 is functioning properly — yes / no”, or “pilot-in-command is aware
of excess temperature reading in engine 3 — yes / no”. There are many causal
chains between the hidden and visible RVs which lead to chains of conditional
probability assessments, for instance “If sensor reading 21 is ’normal’ and sensor
reading 122 is ’normal’ and sensor reading 32 is ’excess temperature’, the proba-
bility that sensor 32 is misreading is 0.6”. Such causal dependencies between RVs
are mathematically represented by arrows between the participating RVs, which
in total gives a directed graph whose nodes are the RVs. In such a graph, each
RV X has its own sample space SX which contains the possible values that X
may take. When the graph is used (for instance, for decision support), for each
RV a probability distribution PX on SX is computed. If X1 , . . . , Xk are RVs with
arrows fanning in on a RV Y , the probability distribution PY on SY is calculated
as a conditional distribution which depends on the distributions PX1 , . . . , PXk .
These calculations quickly become expensive when the graph is large and richly
connected (exact statistical inference in such graphs is NP-hard) and require ap-
proximate solution strategies — for instance, through sampling methods. Despite
the substantial complexity of algorithms in the field, such graphical models are to-
day widely used. Many special sorts of graphical models have traditional special
names and have been investigated in a diversity of application contexts long be-
fore the general, unifying theory of graphical models was developed. Such special
kinds of graphical models or research traditions include
• decision support systems, not only in Space Shuttle launching but also in
economics or warfare to name only two;
• Markov random fields which model the interactions between local phenom-
ena in spatially extended systems, for instance the pixels in an image;
as a starting point, Pearl nonetheless found ways to achieve exactly that, infer-
ring causation from (conditional) probability. For a long time, these insights were
hardly taken up by machine learners, although they had an impact in medical and epidemiological statistics. This has changed. Causal modeling is becoming an
important topic in deep learning. Specifically, structuring modular neural network
systems along the lines of causal chains (and cycles) is emerging as a possible key
for achieving artificial general intelligence — machine learning systems that can
cope with many tasks of many sorts (like humans and unlike current deep learning
systems), using only a few training examples (like humans and unlike current deep
learning systems), and generalizing very robustly to out-of-distribution input (like
humans and unlike current deep learning systems). Schölkopf, Locatello, et al.
2021 presents a very well-written tutorial on causal modeling. I will resume this
topic in my final lecture, which will be on new developments in machine learning.
In this section I will give an introduction to the general theory of graphical
models. In my condensed treatment I lean heavily on the two tutorial texts by
Pearl and Russell 2003 and K. Murphy 2001. This material is very useful by
itself (e.g. if you want to help Space Shuttles to be launched safely), and it is also
requisite background for understanding causal modeling.
Graphical models are a heavyweight branch of machine learning and would best
be presented in an entire course of their own. The course Probabilistic Graphical
Models of Stefano Ermon at Stanford University is a (beautifully crafted) exam-
ple. The course homepage https://cs228.stanford.edu/ gives links to litera-
ture and programming toolboxes — and it serves you a transparently written set
of lecture notes on https://ermongroup.github.io/cs228-notes/ which culmi-
nates in explaining deep variational autoencoders, a powerful recent deep learning
method which includes a number of techniques from graphical models.
the following observations:
X3 : indicates whether the lawn sprinkler is on, has values {0, 1},
X4 : indicates whether the pavement (close to the lawn) is wet, values {0, 1},
X5 : indicates whether the pavement is slippery, values from {0, 1}, too.
There are certain causal influences between some of these random variables.
For instance, the season co-determines the probabilities for rain; the sprinkler
state co-determines whether the pavement is wet (but one would not say that
the wetness of the pavement has a causal influence on the sprinkler state), etc.
Such influences can be expressed by arranging the Xi in a directed acyclic graph
(a DAG), such that each random variable becomes a node, with an edge (i, j)
indicating that what is measured by Xi has some causal influence on what is
measured by Xj . Of course there will be some subjective judgement involved
in claiming a causal influence between two observables, and denying it for other
pairs – such dependency graphs are not objectively “true”, they are designed
to represent one’s view of a part of the world (Note that these are not the causal
links that are studied in causal modeling. In causal modeling, causal influences are
not stipulated by a subjective researcher but become mathematically identified).
Figure 57 is the DAG for our example.
Figure 57: A simple Bayesian network. Image taken from Pearl and Russell 2003.
Definition 10.1 A directed acyclic graph with nodes labelled by RVs {X1 , . . . , Xn }
(each with its own sample space Si ) is a Bayesian network (BN) for the joint
distribution PX of X = X1 ⊗ . . . ⊗ Xn if every Xi is conditionally independent of
its non-descendants in the graph given its parents.
For convenience of notation we often identify the nodes Xi of a BN with their
indices i. The descendants of a node i in a DAG G are all nodes j that can be
reached from i on a forward path in G. Descendance is a transitive relation: if
j is a descendant of i and k a descendant of j, then k is a descendant of i. The
non-descendants of i are all j that are not descendants of i.
The parents of a node i are all the immediate graph predecessors of i, that is,
all j such that (j, i) is an edge in the graph.
For instance, in the DAG shown in Figure 57, X3 has parent X1 and descen-
dants X4 , X5 , and the condition stated in the definition requires that P (X3 |X1 )
is independent of X2 , that is P (X3 |X1 ) = P (X3 |X1 , X2 ): our judgement whether
the sprinkler is on or off is not influenced by our knowledge whether it rains or
not, given that we know what season it is. Personal note: I always found this a
weird Californian way of behaving: those guys would set the sprinkler on or off
just depending on the season; I guess in Summer all Californian sprinklers sprin-
kled all Summer, come rain come shine (in the year 2000 when that tutorial was
written; today Californians surely are more water-aware).
The independence relations expressed in a BN in terms of parents and non-
descendants need not be the only independence relations that are actually true
in the joint distribution. In our example, for instance, it may be the case that
the pavement is always slippery because it was made from polished marble. Then
X5 would be unconditionally independent from any of the other variables. The
complete independence relations between the variables figuring in a BN depend
on the particulars of the joint distribution and need not all be represented in the
graph structure. A BN is only a partial model of the independence relations that
may be present in the concerned variables.
Let’s squeeze in a short remark on notation here. By P (X4 | X2 , X3 ) we denote
the conditional distribution of X4 given X2 and X3 . In rigorous mathematical texts
one would denote this distribution by PX4 | X2 ,X3 , but I will follow the notation in
the Pearl/Russell tutorial in this section. For discrete random variables, as in
our example, such conditional distributions can be specified by a table — for
P (X4 | X2 , X3 ) such a table might look like
X2   X3   P(X4 = 0)   P(X4 = 1)
 0    0      1.0         0.0
 0    1      0.1         0.9                  (116)
 1    0      0.1         0.9
 1    1      0.01        0.99
To specify a BN, one must supply such conditional distributions for every
RV Xi in the network. They give the probabilities for values of Xi for all value
combinations of its parents. If a RV Xi has no parents (that is, it is a root node
in the DAG), then this conditional distribution is conditioned on nothing — it is
just a plain probability distribution of the RV Xi .
For continuous-valued random variables, such conditional distributions cannot
in general be specified in a closed form (one would have to specify pdf’s for each
possible combination of values of the conditioning variables), except in certain spe-
cial cases, notably Gaussian distributions. One must then supply a computable
mathematical function which allows one to compute all concrete probability den-
sities like p(X4 = x4 | X2 = x2 , X3 = x3 ), where p is a pdf.
I use lowercase P (x4 | x2 , x3 ) as a shorthand for P (X4 = x4 | X2 = x2 , X3 =
x3 ). This denotes a single probability number for particular values of our random
variables. – End of the remark on notation.
A Bayesian network can be used for reasoning about uncertain causes and
consequences in many ways. Here are three kinds of arguments that are frequently
made, and for which BNs offer algorithmic support:
Prediction. “If the sprinkler is on, the pavement is wet with a probability P (X4 =
1|X3 = 1) = ###”: reasoning from causes to effects, along the arrows of
the BN in forward direction. Also called forward reasoning in AI contexts.
Abduction. “If the pavement is wet, it is more probable that the season is spring
than that it is summer, by a factor of ### percent”: reasoning from effects
to causes, that is, diagnostic reasoning, backwards along the network links.
(By the way, for backward reasoning you need Bayes’ formula, which is what
gave Bayesian networks their name.)
Explaining away, and the rest. “If the pavement is wet and we don’t know whether the sprinkler is on, and then observe that it is raining, the probability of the sprinkler being on, too, drops by ### percent”: in “explaining away”
there are several possible causes C1 , . . . , Ck for some observed effect E, and
when we learn that actually cause Ci holds true, then the probabilities drop
that the other causes are likewise true in the current situation. There are
many other variants of reasoning “sideways”.
Bayesian networks offer inference algorithms to carry out such arguments and
compute the correct probabilities, ratios of probabilities, etc. These inference
algorithms are not trivial at all, and Bayesian networks only began to make their
way into applications after efficient inference algorithms had been discovered in the mid-1980s. I will explain a classical algorithm in this section (the join tree algorithm for exact inference); it is still widely used and efficient in BNs whose connectivity is not too dense. Because exact statistical inference in
BNs is NP-hard, one has to take resort to approximate algorithms in many cases.
There are two main families of such algorithms, one based on sampling and the
other on variational approximation. I will give a hint on inference by sampling
and omit variational inference.
In a sense, the most natural (the only?) thing you can do when it comes to handling the interaction between many random variables is to arrange them in a
causal influence graph. So it is no surprise that related formalisms have been
invented independently in AI, physics, genetics, statistics, and image processing.
However, the most important algorithmic developments have been in AI / machine
learning, where feasible inference algorithms for large-scale BNs have first been
investigated. The unrivalled pioneer in this field is Judea Pearl http://bayes.
cs.ucla.edu/jp_home.html, who laid the foundations for the algorithmic theory
of BNs in the 1980’s. These foundations were later developed to a more general
theory of graphical models, a development promoted (among others) by Michael
I. Jordan https://people.eecs.berkeley.edu/~jordan/. Michael Jordan not
only helped to build the general theory of graphical models but also had a shaping
influence on other areas in machine learning. He is a machine learning superpower.
The list of his past students and postdocs on his homepage is awe-inspiring and
reads like a Who’s Who of machine learning.
In the world of BNs (and graphical models in general) there exist two funda-
mental tasks:
because the pmf must sum to 1). This pmf would not be given in vector format
but as a 5-dimensional array of size 4×2×2×2×2. We will now investigate brute-
force methods for elementary inference in this joint probability array and see how
the graph structure leads to a dramatic reduction in computational complexity
(which however still is too high for practical applications with larger BNs – more
refined algorithms to be presented later).
By a repeated application of the factorization formula (174) (in Appendix B), the joint distribution of our five random variables is
$$P(X) = P\Big(\bigotimes_{i=1,\ldots,5} X_i\Big) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1, X_2)\, P(X_4 \mid X_1, X_2, X_3)\, P(X_5 \mid X_1, X_2, X_3, X_4). \qquad (117)$$
Exploiting the conditional independence relations expressed in the BN of Figure 57, this simplifies to
$$P(X) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1)\, P(X_4 \mid X_2, X_3)\, P(X_5 \mid X_4). \qquad (118)$$
For representing the factors on the right-hand side of (118) by tables like (116),
one would need tables of sizes 1×4, 4×2, 4×2, 4×2, and 2×2, respectively. Because
the entries per row in each of these tables must sum to 1, one entry per row is
redundant, so these tables are specified by 3, 4, 4, 4 and 2 parameters, respectively.
All in all, this makes 17 parameters needed to specify P (X), as opposed to the 63
parameters needed for the naive representation P (X) in a 5-dimensional array.
In general, the number of parameters required to specify the joint distribution of n discrete random variables with maximally ν values each, arranged in a BN with a maximum fan-in of k, is $O(n\, \nu^k)$, as opposed to the raw number of parameters $O(\nu^n)$ needed for a naive characterization of the joint distribution. This is a
reduction from a space complexity that is exponential in n to a space complexity
that is linear in n! This simple fact has motivated many a researcher to devote
his/her life to Bayesian networks.
Any reasoning on BNs (predictive, abductive, sidestepping or other) boils down to calculating conditional or marginal probabilities. For instance, the abductive
question, “If the pavement is wet, by which factor y is it more probable that the
season is spring than that it is summer”, asks one to compute
$$y = \frac{P(X_1 = \mathrm{spring} \mid X_4 = 1)}{P(X_1 = \mathrm{summer} \mid X_4 = 1)}. \qquad (119)$$
Such probability ratios are often sought in diagnostic reasoning — as for in-
stance in “by which factor is it more probable that my symptoms are due to cancer,
than that they are due to having eaten too much sauerkraut yesterday”.
Any conditional probability P (y1 , . . . , ym | z1 , . . . , zl ), where the Yi and Zj are
among the RVs in the BN, can be computed from the joint distribution of all variables in the BN by first transforming $P(y_1, \ldots, y_m \mid z_1, \ldots, z_l)$ into a fraction of two marginal probabilities,
$$P(y_1, \ldots, y_m \mid z_1, \ldots, z_l) = \frac{P(y_1, \ldots, y_m, z_1, \ldots, z_l)}{P(z_1, \ldots, z_l)},$$
and then computing the denominator and the numerator by marginalization from
the joint distribution of all RVs in the BN, exploiting efficient BN factorizations of
the kind exemplified in Equation 118. The probability P (X1 = spring | X4 = 1),
for instance, can be computed by
$$P(X_1 = s \mid X_4 = 1) = \frac{\sum_{x_2, x_3, x_5} P(s)\, P(x_2 \mid s)\, P(x_3 \mid s)\, P(X_4 = 1 \mid x_2, x_3)\, P(x_5 \mid X_4 = 1)}{\sum_{x_1, x_2, x_3, x_5} P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1)\, P(X_4 = 1 \mid x_2, x_3)\, P(x_5 \mid X_4 = 1)}, \qquad (120)$$
where I abbreviated “spring” to “s” in order to squeeze the expression into a single line.
(120) can be computed by a brute-force evaluation of the concerned summa-
tions. The sum to be taken in the denominator would run over 4×2×2×2 =
32 terms, each of which is a product of 5 subterms; we thus would incur 128
multiplications. It is apparent that this approach generally incurs a number of
multiplications that is exponential in the size of the BN.
A speedup strategy is to try pulling the sum into the product as far as possible,
and evaluate the resulting formula from the inside of bracketing levels. This is
called the method of variable elimination. For example, an equivalent formula for
the sum in the denominator of (120) would be
$$\sum_{x_1} P(x_1) \left( \sum_{x_2, x_3} P(x_2 \mid x_1)\, P(x_3 \mid x_1)\, P(X_4 = 1 \mid x_2, x_3) \left( \sum_{x_5} P(x_5 \mid X_4 = 1) \right) \right).$$
To evaluate this expression from the inside out, we note that the sum over x5 in
the innermost term is 1 and need not be explicitly calculated. For the remaining
calculations, 15 sums and 36 multiplications are needed, as opposed to the 31
sums and 128 multiplications needed for the naive evaluation of the denominator
in (120). However, finding a summation order where this pulling-in leads to the
minimal number of summations and multiplications is again NP-hard, although
greedy algorithms for that purpose are claimed to work well in practice.
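To make this concrete, here is a sketch that answers the abductive query (119) on the sprinkler BN by brute-force marginalization over the factorized joint (118); all conditional probability tables are invented, since the notes do not specify them:

```python
import numpy as np
from itertools import product

seasons = ["spring", "summer", "autumn", "winter"]
P1 = {s: 0.25 for s in seasons}                                   # P(X1), assumed uniform
P2 = {s: p for s, p in zip(seasons, [0.5, 0.1, 0.4, 0.6])}        # P(X2 = 1 | X1), invented
P3 = {s: p for s, p in zip(seasons, [0.3, 0.8, 0.2, 0.05])}       # P(X3 = 1 | X1), invented
P4 = {(0, 0): 0.0, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.99}        # P(X4 = 1 | X2, X3), as in (116)
P5 = {0: 0.05, 1: 0.9}                                            # P(X5 = 1 | X4), invented

def joint(x1, x2, x3, x4, x5):
    """Factorized joint (118); b(p, v) is P(V = v) for a binary RV with P(V = 1) = p."""
    b = lambda p, v: p if v == 1 else 1.0 - p
    return (P1[x1] * b(P2[x1], x2) * b(P3[x1], x3)
            * b(P4[(x2, x3)], x4) * b(P5[x4], x5))

def marginal_x1_given_wet(x1):
    # brute force: sum the joint over all free variables with X4 fixed to 1
    return sum(joint(x1, x2, x3, 1, x5)
               for x2, x3, x5 in product([0, 1], repeat=3))

num = {s: marginal_x1_given_wet(s) for s in seasons}
denom = sum(num.values())
print(num["spring"] / num["summer"])            # the ratio y of (119)
print({s: v / denom for s, v in num.items()})   # full posterior P(X1 | X4 = 1)
```

Variable elimination would compute the same numbers while pulling the sums inside the products, as in the bracketed expression above, which pays off for larger networks.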
In such a situation, computer scientists can choose between three options:
1. Restrict the problem by suitable constraints, earning a tractable problem.
For BNs, this was the first strategy that was used with success: early (now
classical) inference algorithms were defined only for BNs with tree graphs.
2. Use heuristic algorithms, i.e. algorithms that embody human insight for com-
putational shortcuts. Heuristic algorithms need not always be successful in
leading to short runtimes; if they do, their result is perfectly accurate. The
goal is to find heuristics which lead to fast runtimes almost always and which
run too long only rarely. The “join tree” algorithm which we will study later
contains heuristic elements.
3. Use approximate algorithms, i.e. algorithms that yield results always and fast,
but with an error margin (which should be controllable). For BN inference,
a convenient class of approximate algorithms is based on sampling. In order
to obtain an estimate of some marginal probability, one samples from the
distribution defined by the BN and uses the sample as a basis to estimate
the desired marginal. There is an obvious tradeoff between runtime and
precision. Another class of approximate algorithms is to use variational
inference. Variational algorithms need some insight into the shape of the
conditional distributions in a BN. Their speedup results from restricting the
admissible shapes of these distributions to analytically tractable classes. I
will not go into this topic in this course — it’s not easy. Jordan, Ghahramani,
et al. 1999 give a tutorial introduction.
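As an illustration of the sampling-based approach, the following sketch (my own, re-using the hypothetical probability tables from the previous code block) draws ancestral samples from the example BN and estimates P(X1 = spring | X4 = 1) by simple counting; the encoding of "spring" as 0 is an assumption.

```python
import random

def sample_from(dist):
    """Draw a value from a dict {value: probability}."""
    values, probs = zip(*dist.items())
    return random.choices(values, weights=probs, k=1)[0]

def ancestral_sample():
    """One joint sample (x1, x2, x3, x4, x5), drawn parents-first along the BN."""
    x1 = sample_from(P_x1)
    x2 = sample_from({v: P_x2_given_x1[(v, x1)] for v in (0, 1)})
    x3 = sample_from({v: P_x3_given_x1[(v, x1)] for v in (0, 1)})
    p4 = P_x4_given_x2_x3[(1, x2, x3)]
    x4 = sample_from({1: p4, 0: 1.0 - p4})
    # P(x5 | X4 = 0) assumed uniform here, since the table above only covers X4 = 1
    x5 = sample_from({v: P_x5_given_x4.get((v, x4), 0.5) for v in (0, 1)})
    return x1, x2, x3, x4, x5

def estimate_conditional(n_samples=100_000, spring=0):
    """Estimate P(X1 = spring | X4 = 1) by counting among samples with X4 = 1."""
    hits = total = 0
    for _ in range(n_samples):
        x1, _, _, x4, _ = ancestral_sample()
        if x4 == 1:
            total += 1
            hits += (x1 == spring)
    return hits / total if total else float('nan')
```

The tradeoff between runtime and precision is visible directly: the estimation error shrinks only with the square root of the number of retained samples.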
Step 2: Add some further undirected edges to Gm which triangulate Gm , ob-
taining Gt . This may destroy some of the original independence relations
between {X1 , . . . , Xn } but will not introduce new ones. This step has no
unique single solution, and heuristics enter the game to decide which trian-
gulation to use.
Step 3: Detect all maximal cliques Ci in Gt . While this is NP-complete for gen-
eral undirected graphs, it can be done efficiently for triangulated undirected
graphs (Note: a clique in an undirected graph is a subset of nodes that are
all pairwise connected to each other. A subset of a clique is again a clique.
A clique C is maximal if no vertex outside C has edges to all vertices in C.)
Step 4: Build an undirected join tree T with nodes Ci . This is the desired target
structure. It represents an elegant factorization of the joint probability P (X)
which in turn can be processed with a fast, local inference algorithm known
as message passing. Again there is no unique way to create the join tree and
heuristics are used to obtain one that leads to fast inference algorithms.
We will go through these steps in a mostly informal manner. The purpose
of this presentation is not to provide you with a detailed recipe for building ex-
ecutable code. There exist convenient BN toolboxes that you would rather use.
The purpose of this subsection is to provide you with a navigation guide to gain
an overview of the general picture, then study the more detailed paper Huang and
Darwiche 1994, then start using (and understanding) a toolbox.
Undirected graphical models. Before we can delve into the join tree algo-
rithm, we have to introduce the concept of undirected graphical models (UGMs),
because these will be constructed as intermediate data structures when a BN is
transformed to a join tree.
The essence of BNs is that conditional independency relationships between
RVs are captured by directed graph structures, which in turn guide efficient in-
ference algorithms. But it is also possible to use undirected graphs. This leads
to undirected graphical models (UGMs), which have a markedly different flavour
from directed BNs. UGMs originated in statistical physics and image processing,
while BNs were first explored in Artificial Intelligence. A highly readable non-
technical overview and comparison of directed and undirected models is given in
Smyth 1997.
We will use the following compact notation for statistical, conditional indepen-
dence between two sets Y and Z of random variables, given another set S of
random variables:
Definition 10.2 Two sets Y and Z of random variables are independent given
S, if
P (Y, Z | S) = P (Y | S) P (Z | S).
We write Y⊥Z | S to denote this conditional independence.
It is easy to see that if Y ⊥ Z | S and Y′ ⊆ Y, Z′ ⊆ Z, then Y′ ⊥ Z′ | S.
In an UGM, independence relations between sets of random variables are de-
fined in terms of a graph separation property:
Definition 10.3 Let G = (V, E) be an undirected graph with vertices V and edges
E ⊆ V × V . Let Y, Z, S ⊂ V be disjoint, nonempty subsets. Then Y is separated
from Z by S if every path from some vertex in Y to any vertex in Z contains a
node from S. S is called a separator for Y and Z.
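The separation test of Definition 10.3 is easy to implement. Here is a small sketch (my own) that checks whether S separates Y from Z in an undirected graph given as an adjacency dictionary, by a breadth-first search that is forbidden to enter nodes of S.

```python
from collections import deque

def separates(adj, Y, Z, S):
    """Return True iff every path from Y to Z passes through S (graph separation).

    adj: dict mapping each vertex to an iterable of its neighbours.
    Y, Z, S: disjoint sets of vertices.
    """
    blocked = set(S)
    visited = set()
    queue = deque(v for v in Y if v not in blocked)
    while queue:
        v = queue.popleft()
        if v in visited:
            continue
        visited.add(v)
        if v in Z:
            return False          # found a path from Y to Z that avoids S
        for w in adj[v]:
            if w not in blocked and w not in visited:
                queue.append(w)
    return True
```

For the moral graph of Figure 58 one would, for instance, call separates(adj, {'A'}, {'H'}, {'E'}) with hypothetical node names.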
Step 1: The moral UGM. After this quick introduction to UGMs we return
to the join tree algorithm for BNs. The first step is to transform the directed
graph structure of a BN into an UGM structure. This can be done in many ways.
The art lies in transforming a BN into an UGM such that as few as possible of the
valuable independence relations expressed in the BN get lost. The standard thing
to do as a first step is to moralize the directed BN graph Gd into an undirected
graph Gm by (i) connecting ("marrying") all pairs of parents of each node by an
undirected edge, and (ii) dropping the directions of all original edges.
See Figure 58 for an example. The peculiar name “moralizing” comes from the
act of “marrying” previously unmarried parents. The moral UGM Gm implies the
same conditional independence relations as the BN Gd from which it was derived
(proof omitted here).
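A moralization routine along these lines can be sketched as follows (my own code; the BN is assumed to be given as a dict mapping each node to the list of its parents):

```python
from itertools import combinations

def moralize(parents):
    """Return the moral undirected graph of a BN.

    parents: dict node -> list of parent nodes.
    Returns: dict node -> set of neighbours in the moral graph.
    """
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    adj = {v: set() for v in nodes}
    for child, ps in parents.items():
        # (ii) drop edge directions: connect the child with each of its parents
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        # (i) "marry" all pairs of parents of the same child
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)
    return adj
```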
Figure 58: A BN and its associated moral UGM. Image taken from Huang and
Darwiche 1994.
Figure 59: Left: A triangulated version Gt of the moral UGM Gm from Figure 58.
Right: the 6 cliques in Gt . Image adopted from Huang and Darwiche 1994.
into junction trees in Step 4.
Step 3: finding the cliques. For a change, finding all maximal cliques in a
triangulated graph is not NP-hard — efficient algorithms for finding all maximal
cliques in a triangulated graph are known. In our running example, we get 6
cliques, namely all the "triangles" ABD, ACE, ADE, DEF, CGE, and EGH (Figure
59 (right)). Our example maybe conveys the wrong impression that we always get
cliques of size 3 in triangulated graphs. In general, one may find maximal cliques
of any size ≥ 2 in such graphs.
Step 4: Building the join tree. This is a more interesting and involved step.
After BNs and UGMs, join trees are our third graph-based representation of in-
dependence relations governing a set X of random variables. We will first discuss
join trees in their own right, and then consider how a join tree can be obtained
from the cliques of a triangulated UGM.
Figure 60: A join tree derived from the triangulated UGM shown in Figure 59.
Image taken from Huang and Darwiche 1994.
In join trees, the belief potentials are the marginal distributions of their vari-
ables:

Proposition 10.3 Let T be a join tree for P(X), and let K = {X1, . . . , Xk} ⊆
X = {X1, . . . , Xk, Xk+1, . . . , Xn} be a cluster or sepset label set. Then for any
value instantiation x1, . . . , xk of the variables from K, it holds that

Σ_{xk+1,...,xn} P(x1, . . . , xn) = ϕK(x1, . . . , xk),    (123)

where the sum on the left denotes marginalization over the variables not in K.
OK, now that we know what a join tree is, we return to Step 4: constructing
a join tree from a triangulated moral UGM Gt. A join tree is specified through
(i) its labelled graph structure and (ii) its belief potentials. We first treat the
question of how one can derive the join tree's graph structure from Gt.
There is much freedom in creating a join tree graph from Gt . One goal for
optimizing the design is that one strives to end up with clusters that are as small
as possible (because the computational cost of using a join tree for inference will
turn out to grow exponentially in the maximal size of clusters). On the other hand,
in order to compute belief potentials later, any clique in Gt must be contained in
some cluster. This suggests turning the cliques identified in Step 3 into the cluster
nodes of the join tree. This is indeed done in a general recipe for constructing a
join tree from a triangulated UGM Gt , which I rephrase from Huang and Darwiche
1994:
1. Begin with an empty set SEP, and a completely unconnected graph whose
nodes are the m maximal cliques Ci found in Step 3.

2. For each pair of distinct cliques Ci and Cj, create a candidate sepset labelled
by the intersection Ci ∩ Cj and put it into SEP.

3. From SEP iteratively choose m − 1 sepsets and use them to create connec-
tions in the node graph, such that each newly chosen sepset connects two
subgraphs that were previously unconnected. This necessarily yields a tree
structure.
A note on the word “tree structure”: in most cases when a computer scientist
talks about tree graphs, there is a special “root” node. Here we use a more general
notion of trees to mean undirected graphs which (i) are connected (there is a path
between any two nodes) and (ii) where these paths are unique, that is between
any two nodes there is exactly one connecting path. In such tree graphs any node
can be designated as “root” if one wishes to see the familiar appearance of a tree.
This general recipe leaves much freedom in choosing the sepsets from SEP. Not
all choices will result in a valid join tree. In order to ensure the join tree properties,
we choose, in every step from 3., the candidate sepset that has the largest mass
(among all those which connect two previously unconnected subgraphs). The mass
of a sepset is the number of variables it contains.
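A sketch of this construction (my own; it assumes the cliques are given as Python frozensets and uses a union-find structure to test whether two cliques are already connected):

```python
def build_join_tree(cliques):
    """Connect cliques by m-1 sepsets of maximal mass (Kruskal-style).

    cliques: list of frozensets of variable names.
    Returns a list of edges (i, j, sepset) forming a tree over clique indices.
    """
    m = len(cliques)
    # candidate sepsets for all pairs, sorted by decreasing mass
    candidates = sorted(((i, j, cliques[i] & cliques[j])
                         for i in range(m) for j in range(i + 1, m)),
                        key=lambda t: len(t[2]), reverse=True)
    parent = list(range(m))            # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = []
    for i, j, sep in candidates:
        ri, rj = find(i), find(j)
        if ri != rj:                   # connects two previously unconnected subgraphs
            parent[ri] = rj
            edges.append((i, j, sep))
            if len(edges) == m - 1:
                break
    return edges
```

Ties between sepsets of equal mass are broken arbitrarily in this sketch; the full specification mentioned below adds further heuristics.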
This is not the only possible way of constructing a join tree, and it is still
underspecified (there may be several maximal mass sepsets at our disposal in a
step from 3.) Huang and Darwiche propose a full specification that heuristically
optimizes the join tree with respect to the ensuing inference algorithms.
If the original BN was not connected, some of the sepsets used in the join tree
will be empty; we get a join forest then.
We now turn to the second subtask in Step 4 and construct belief potentials
ϕC , ϕS for the clusters and sepsets, such that Equations 121 and 122 hold.
Belief potentials which account for (121) and (122) are constructed in two
steps. First, the potentials are initialized in a way such that (122) holds. Second,
by a sequence of message passes, local consistency (121) is achieved.
1. For each clique or sepset K (we use symbol K for cliques C or sepsets S),
set ϕK to the constant function
ϕK ≡ 1.
2. For each variable X of the BN, select one clique CX that contains X together
with its parents ΠX; by construction of the join tree such a clique must exist.
Multiply ϕCX by P(X | ΠX):

ϕCX ← ϕCX · P(X | ΠX).
This is a shorthand notation. In more detail: first, interpret P (X | ΠX )
as a real-valued function f of all variables contained in CX , as follows. Let
CX = {X1 , . . . , Xk , Xk+1 , . . . , Xl } where ΠX = {X1 , . . . , Xk } are the parents
of X = Xk+1 . Then set f (x1 , . . . , xk , xk+1 , . . . , xl ) = P (Xk+1 = xk+1 | X1 =
x1 , . . . , Xk = xk ). Second, for all value combinations x̄ = (x1 , . . . , xl ), update
ϕCX (x̄) ← ϕCX (x̄) · f (x̄).
After this initialization, the conditional distributions P (Xk | ΠXk ) of all vari-
ables (and hence the information from the BN) have been multiplied into the
clique potentials, and (122) is satisfied:
∏_i ϕCi / ∏_j ϕSj = ( ∏_{k=1,...,n} P(Xk | ΠXk) ) / 1 = ∏_{k=1,...,n} P(Xk | ΠXk) = P(X),
where i ranges over all cliques, j over all sepsets, and k over all RVs.
After having initialized the join tree potentials, we make them locally consistent
by propagating the information, which has been locally multiplied-in, across the
entire join tree. This is done through a suite of message passing operations, each of
which makes one clique/sepset pair consistent. We first describe a single message
pass operation and then show how they can be scheduled such that a message pass
does not destroy consistency of clique/sepset pairs that have been made consistent
in an earlier message passing.
A technical detail: if ϕ_S^old(s) = 0, it can be shown that then also ϕ_S(s) = 0; in
this case set ϕ_S(s)/ϕ_S^old(s) = 0.
After this step, C is consistent with S in the sense of (121). To also make
D consistent with S, a message pass in the reverse direction must be carried
out. An obvious condition is that this reverse-direction pass must preserve the
consistency of C with S. This is warranted if a certain order of passes is observed,
to which we now turn our attention.
Consider some connection C – S – D in the join tree. This connection will be hit
twice by a message pass, once in each direction. Assume that the first of the two
passes went from C to D. After this pass, we have potentials ϕ_C^0, ϕ_S^0, ϕ_D^0, and
C is consistent with S:

ϕ_S^0 = Σ_{C\S} ϕ_C^0.
At some later time, another message pass sweeps back from D to C. Before this
happens, the potential of D might have been affected by some other passes, so it
is ϕ_D^1 when the pass from D to C occurs. After this pass, we have

ϕ_S^1 = Σ_{D\S} ϕ_D^1   and   ϕ_C^1 = ϕ_C^0 · ϕ_S^1 / ϕ_S^0.
In order to obtain consistency for all connections in the tree, we must make sure
that after some connection C – S – D has been passed back and forth, neither C
nor D takes part in any further passes, as this might again disrupt the already
achieved consistency. The following global scheduling scheme assures this condition:
1. To start, single out any clique C and call it the “center”.
2. In a first phase (“collect evidence” in the Huang/Darwiche paper), carry out
all passes that are oriented towards the center. Carry out these passes in
any order such that a pass “leaves” some node on some connection only after
all other “incoming” connections have been used for passes.
3. In a second phase (“distribute evidence”), carry out all passes that are ori-
ented away from the center, in an inside-out spreading order.
Figure 62 shows a possible global scheduling for our example join tree.
Figure 62: A scheduling for the global propagation of message passing. The center
is ACE. Figure taken from the Huang/Darwiche paper.
After all of this toiling, we have a magic join tree T — a tree graph adorned
with belief potentials that are locally consistent (Equation 121) and globally rep-
resent the joint distribution P (X) (Equation 122). The join tree is now ready for
use in inference tasks.
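To make the single message pass operation concrete, here is a minimal sketch (my own; it represents a potential as a dict mapping value tuples of its variables to numbers):

```python
def marginalize(phi, vars_phi, keep):
    """Sum a potential phi (dict: value tuple over vars_phi -> float) down to the variables in keep."""
    idx = [vars_phi.index(v) for v in keep]
    out = {}
    for assignment, value in phi.items():
        key = tuple(assignment[i] for i in idx)
        out[key] = out.get(key, 0.0) + value
    return out

def pass_message(phi_C, vars_C, phi_S_old, vars_S, phi_D, vars_D):
    """One message pass C -> S -> D: project phi_C onto S, then rescale phi_D."""
    phi_S_new = marginalize(phi_C, vars_C, vars_S)
    idx = [vars_D.index(v) for v in vars_S]
    phi_D_new = {}
    for assignment, value in phi_D.items():
        key = tuple(assignment[i] for i in idx)
        old = phi_S_old[key]
        ratio = 0.0 if old == 0.0 else phi_S_new[key] / old   # the 0/0 := 0 convention
        phi_D_new[assignment] = value * ratio
    return phi_S_new, phi_D_new
```

Global propagation then consists of calling pass_message along every connection twice, in the collect-evidence / distribute-evidence order described above.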
This follows from Proposition 10.3.
where CE has labels (E, A1, . . . , Al); that is, multiplying ϕCE by ΛE simply resets
ϕCE to zero for all arguments that have a value for E different from the observed
one. With the new potentials, the tree globally encodes P(X) 1e, where
1e : S_X1 × · · · × S_Xn → {0, 1} is the indicator function of the set {(x1, . . . , xn) ∈
S_X1 × · · · × S_Xn | xi = ej for Xi = Ej, j = 1, . . . , k}:

∏_i ϕCi / ∏_j ϕSj = P(X) ΛE1 · · · ΛEk = P(X) 1e =: P(X, e).
Note furthermore that Σ_{X\E} P(X, e) = P(e), and that (for a clique or sepset K)
ϕK = P(K, e).
Figure 63: Example of estimating a probability table from frequency counts.
When data like in Figure 63 are available for all nodes in a BN, estimating
the local conditional probabilities by the obvious frequency counting ratios gives
maximum likelihood estimates of the local conditional probabilities. It can be
shown that this is also the maximum likelihood estimate of the joint distribution
P (X) (not a deep result, the straightforward derivation can be found in the online
lecture notes of Ermon 2019, chapter “Learning in directed models”).
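A sketch of this counting estimator (my own; it assumes the data come as a list of dicts mapping variable names to observed values):

```python
from collections import Counter, defaultdict

def estimate_cpt(data, child, parents):
    """Maximum likelihood estimate of P(child | parents) from fully observed records."""
    joint = Counter()      # counts of (parent values, child value)
    margin = Counter()     # counts of parent values alone
    for record in data:
        key = tuple(record[p] for p in parents)
        joint[(key, record[child])] += 1
        margin[key] += 1
    cpt = defaultdict(dict)
    for (key, value), count in joint.items():
        cpt[key][value] = count / margin[key]
    return dict(cpt)

# Hypothetical usage:
# data = [{'X1': 'spring', 'X2': 1, ...}, ...]
# estimate_cpt(data, child='X2', parents=['X1'])
```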
It is very often the case that for some of the variables, neither empirical obser-
vations nor an expert's opinion is available, either because the observations simply
have not been carried out or because these quantities are in principle unobservable.
Such unobservable variables are called hidden variables. To get an impression of
the nature and virtues of hidden variables, consider the BN in Figure 64.
Figure 64: A BN for use by a social worker (apologies to professionals in the field)
However, nobody has yet found a convincing way to directly measure self-
confidence — it is an explanatory concept, and becomes a hidden variable in a
statistical model. While all other variables in this BN can be readily measured,
self-confidence can’t. Yet the augmented BN is, in an intuitive sense, more valuable
than the first one, because it tries to reveal a causal mechanism whereas the former
one only superficially connects variables by arrows that can hardly be understood.
Besides being intellectually more pleasing, the second BN offers substantial
computational savings: its join tree (construct it!) is much more lightweight than
the first BN’s, so statistical inference algorithms will run much faster.
Generalizing from this simplistic example, it should be clear that hidden vari-
ables are a great asset in modeling reality. But — they are hidden, which means
that the requisite probability tables cannot be directly estimated from empirical
data.
When there are hidden variables in a BN for whose conditional distribution no
data are available, one uses EM algorithms to estimate their probability tables.
The basic version of EM for BNs which seems standard today has been introduced
in a paper by Lauritzen 1995. The algorithm is also described in a number of online
tutorials and open access papers, for instance Mouafo et al. 2016. The algorithm
is complex. In the E-step it uses inference in join trees as a subroutine.
Markov random fields are a quite generic concept (and word) used in statistics,
but also in a more special sense in image processing.
I originally planned to have a section on these models, exploring their deep con-
nection to physics and giving a demo in an image denoising task. But the BN part
of this section has grown sooooo long that I think it would be too much for one ses-
sion of this course. If you are interested — the online course notes of Ermon 2019
include an easy-reading fast introduction to Markov random fields in the chapter
with the same name, and my legacy lecture notes on “Statistical and Algorithmi-
cal Modeling” (https://www.ai.rug.nl/minds/uploads/LN_AlgMod.pdf) have
a substantial intro Section 6.3 on UGMs.
11 Online adaptive modeling
Often an end-user of a machine learning model finds himself/herself in a situation
where the distribution of the input data needed by the model changes over time.
The learnt model will then become inaccurate because it was trained with training
data that had a different input distribution. It would be desirable if the
learnt model could adapt itself to the new situation. Two examples:
• A speech recognition system on a smartphone is used while walking in a
city and doing some shopping — the background noise will change every
few seconds — and the recognition system has to adapt its “de-noising”
continuously to changing sorts of noise.
• A credit risk prediction system is used month after month — but when an
economic crisis changes the loan-takers' payback morale, the system should
change its prediction biases.
Never-ending adaptation of machine learning systems has always been an issue
for machine learning. Currently this theme is receiving renewed attention in deep
learning in a special, somewhat restricted version, where it is discussed under the
headline of continual learning (or continuous learning). Here one seeks solutions
to the problem that, if an already trained neural network is subsequently trained
even more on new incoming training data, the previously learnt competences will
be destructively over-written by the process of continuing learning on new data.
This phenomenon has long been known as catastrophic forgetting. Methods for
counter-acting catastrophic forgetting are developing fast. If you are interested
— the PhD thesis of Xu He ((He 2022), added to Brightspace underneath the
lecture notes) starts with a very readable overview on today’s continual learning
landscape.
In this section I will however not deal with continual (deep) learning, for two
reasons: (i) this material is advanced and requires substantial knowledge of deep
learning methods, (ii) the currently available methods still fall short of the desired
goal to enable ongoing, “life-long” learning.
Instead, I will present methods which have long been explored and successfully
used in the field of signal processing and control. This material is not
normally treated in machine learning courses — I dare say, mostly because ma-
chine learners are just not aware of this body of knowledge, and maybe also
because, if they are, they find these methods too "linear" (no neural networks involved!). But I
find this material most valuable to know,
• because these techniques are broadly applicable, especially in application
contexts that involve signal processing and control — like robotics, industrial
engineering applications, and modeling biological systems;
• because these techniques are mathematically elementary and transparent
and give you a good understanding of conditions when gradient descent
optimization of loss functions becomes challenged — it's a perfect primer
for getting into the learning algorithms of deep learning: you will understand
better why they sometimes become veeeeery slow or numerically unstable;
• and finally, because I think that machine learning is an interdisciplinary
enterprise and I believe that these signal processing flavored methods will
become important in a young and fast-growing field of research called neuromorphic
computing, which has a large overlap with machine learning — my
own field for the last few years.
Throughout this section, I am guided by the textbook Adaptive Filters: Theory
and Applications (Farhang-Boroujeny 1998).
(Figure: a filter H transforming an input signal x into an output signal y.)
Note that here we consider the vector w as a column vector (often in the
literature, this specific weight vector is understood as row vector — as we did in
our earlier treatment of linear regression). Transversal filters are linear filters. A
filter H is called linear if for all a, b ∈ R and signals x1 , x2
H(a x1 + b x2 ) = a H(x1 ) + b H(x2 ). (126)
The proof that transversal filters are linear is an easy exercise.
I remark in passing that the theory of signals and systems works with complex
numbers throughout both for signals and filter parameters; for us it is however
good enough if we only use real-valued signals and model parameters.
The unit impulse δ = (δ(n)) is the signal that is zero everywhere except for
n = 0, where δ(0) = 1. The unit impulse response of a filter H is the signal H(δ).
For a transversal filter Hw , the unit impulse response is the signal which repeats
w = (w1 , . . . , wL )0 at times n = 0, 1, . . . , L − 1:
(Hw(δ))(n) = wi if n = i − 1 (for i = 1, . . . , L), and (Hw(δ))(n) = 0 else.    (127)
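A transversal filter is only a few lines of code. The following sketch (my own, using NumPy) computes the filter output for a whole input signal and can be used to verify the unit impulse response (127).

```python
import numpy as np

def transversal_filter(w, x):
    """Apply a transversal filter with tap weights w = (w1, ..., wL) to a signal x.

    y[n] = sum over i of w[i] * x[n - i], treating x[m] = 0 for m < 0.
    """
    L = len(w)
    y = np.zeros(len(x))
    for n in range(len(x)):
        for i in range(L):
            if n - i >= 0:
                y[n] += w[i] * x[n - i]
    return y

# Feeding the unit impulse reproduces the weight vector, as stated in (127):
delta = np.zeros(8)
delta[0] = 1.0
print(transversal_filter(np.array([0.5, 0.3, 0.2]), delta))  # [0.5, 0.3, 0.2, 0, ...]
```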
Figure 67 shows the structure of a transversal filter in the graphical notation
used in signal processing.
Figure 67: A transversal filter (black parts) and an adaptive linear combiner (blue
parts). Boxes labeled z −1 are what signal processing people call “unit delays” —
elementary filters which delay an input signal by one timestep. The triangular
boxes mean “multiply with”. In such box-and-arrow flow diagrams in the field of
signal processing, diagonal arrows spearing through some box indicate that what
is in the box is becoming adapted on the basis of the information that arrives with
the arrow.
The theme of this section is online adaptive modeling. In the context of fil-
ters, this means that the filter changes over time in order to continue complying
with new situational conditions. Concretely, we consider scenarios of supervised
training adaptation, where a teacher output signal is available. We denote this
“desirable” filter output signal by (d(n)). The objective of online adaptive fil-
tering is to continually adapt the filter weights wi such that the filter output
y(n) = w′ (x(n), . . . , x(n − L + 1))′ stays close to the desired output d(n).
The situational conditions may change both with respect to the input signal
x(n), which might change its statistical properties, and/or with respect to the
teacher d(n), which might also change. For getting continual optimal performance
this implies that the filter weights must change as well: the filter weight vector w
becomes a temporal variable w(n). Using the shorthand x(n) = (x(n), . . . , x(n −
L + 1))′ for the last L inputs up to the current x(n), this leads to the following
online adaptation task:
Given at time n: the filter weight w(n − 1) calculated at the end of the previous
timestep; new input and desired output data points x(n), d(n).

Compute: an updated filter weight w(n) = w(n − 1) + ∆w(n) such that the (squared)
error at the next timestep can be expected to be smaller than without the adap-
tation. Notes:

• The adaptation should not have to re-use data or errors from
the past. This will automatically happen if the modification ∆w(n) is
based only on the current error ε(n) = d(n) − w′(n − 1) x(n). Informa-
tion from earlier errors will already be incorporated in w(n − 1). This
leads to an error feedback based weight adaptation which is schemati-
cally shown in the blue parts of Figure 67. A diagonal arrow through
a box is how parameter adaptation is depicted in such signal
processing diagrams. Such continually adapted transversal filters are
called adaptive linear combiners in the Farhang/Boroujeny textbook.
• The error signal ε2 (n) which the adaptive filter tries to keep low need
not always be obtained by comparing the current model output
w(n − 1)0 x(n) with a teacher d(n). Often the “error” signal is obtained
in other ways than by comparison with a teacher. In all cases, the
objective for the adaptation algorithm is to keep the “error” amplitude
at low levels. The error signal itself is the only source of information
to steer the adaptation (compare Figure 67). Several examples in the
next subsection will feature “error” signals which are not computed by
comparison with a teacher signal.
The following game is played here. Some target system is monitored while it
receives an input signal x and generates an output g which is observed with added
observation noise ν; this observed signal output becomes the teacher d. The model
system is a transversal filter Hw whose weights are adapted such that the model
system’s output y stays close to the observed output d of the target system.
Obtaining and maintaining a model system Hw is a task which occurs ubiq-
uitously in systems engineering. The model system can be used for manifold
purposes because it allows one to simulate the target system. Such simulation mod-
els are needed, for instance, for making predictions about the future behavior
of the original system, or for assessing whether the target system might be run-
ning into malfunction modes. Almost every control or predictive maintenance
task in electric or mechanical engineering requires system models. I proceed to
illustrate the use of system models with two concrete examples taken from the
Farhang/Boroujeny book.
Figure 69: Geological exploration via impulse response of learnt earth model. A.
Physical setup. B. Analysis of impulse response.
At a point A, a strong acoustic input signal x is generated by some sound source
(explosion or large vibrating mass). An earth microphone is placed at a distant
point B, picking up a signal d. A model Hw (a "dummy earth") is learnt. This
model captures how the (simulated) earth transmits the acoustic signal from the
generator to the microphone, capturing all the delays and deformations that
the input signal has suffered when it arrives at the microphone (communication
engineers would call this earth model a channel model). After the model Hw has
been learnt, one analyses the impulse response w of this model (which, as we have
briefly mentioned earlier, in these simple transversal filters happens to coincide
with the tap weight vector). The peaks of w give indications about reflecting
layers in the earth crust between A and B, which correspond to different delayed
responses pj of the input impulse.
(Figure: a control setup with reference signal r, control signal u, and plant output y, which should approximately track r.)
11.2.2 Inverse system identification
This is the second most basic task: given an unknown system H which on input
d produces output x, learn an inverse system that on input x produces output d
(note the reversal of variable roles and names). A typical setup is shown in Figure
71.
(Figure 71: inverse system identification setup, with the unknown system H producing x from input d, the adaptive inverse model producing output y, and delay elements z^−∆.)
Figure 72: Schema of adaptive online channel equalization. Delays (which one
would insert for stability) are omitted. The box on the right with the step function
indicates a filter that converts the continuous-valued equalizer output y to a binary
signal — assuming that we want a channel for binary bitstream signals.
The “feedback controller” box in Figure 73 contains one of them without further
specification.
The scheme shown in the figure (proposed in Jordan and Wolpert 1999 in
a nonlinear control context, using neural networks) trains an open-loop inverse
(feedforward) controller in conjunction with the operation of a fixed, untrainable
feedback controller.
Figure 73: Schema of feedback error learning for a composite control system.
• When the overall control system works perfectly, the feedback controller sees
no deviation between r(n) and y(n) and therefore produces a zero output ufb(n)
— that is, the feedforward controller sees zero error ε and does not change.
• When the overall control system does not work perfectly, the feedback con-
troller will output a nonzero control output ufb (n), which acts as an error
signal for the adaptation of the feedforward controller. The feedforward con-
troller tries to minimize this error — that is, it changes its way to generate
output uff (n) such that the feedback controller’s output is minimized, that is,
such that (r(n) − y(n))2 is minimized, that is, such that the control improves
(an admittedly superficial explanation).
• When the plant characteristics change, or when external disturbances set in,
the feedback controller jumps to action, inducing further adaptation of the
feedforward controller.
(Figure: adaptive noise cancelling setup, with the clean signal s, the noisy observation s + ν0, and the de-noised estimate ŝ.)
Denoising has many applications, for instance in airplane cockpit crew com-
munication (cancelling the acoustic airplane noises from pilot intercom speech),
postprocessing of live concert recordings, or (like in one of the suggested semester
projects) cancelling the mother’s ECG signal from the unborn child’s in prena-
tal diagnostics. In the Powerpoint file denoisingDemo.pptx which you find on
Nestor together with the lecture notes, you can find an acoustic demo that I once
produced.
Explanations:
• The “error” which the adaptive denoising filter tries to minimize is s+ν0 −y,
where y is the filter output.
• The only information that the filter has to achieve this error minimization
is its input ν1. Because this input is (ideally) independent of s, but related
to ν0 via some noise-to-noise filter, all that the filter can do is to subtract
from s + ν0 whatever it finds in s + ν0 that correlates with ν1. Ideally, this is ν0.
Then, the residual "error" ŝ would be just s — the desired de-noised signal.
More sophisticated methods for denoising are known today, often based on
mathematical principles from independent component analysis (ICA). You find an
acoustic demo on https://cnl.salk.edu/~tony/ica.html (the author of this
page, Anthony J. Bell, is an ICA pioneer).
But reality strikes: The optimization problem (?) often cannot be solved di-
rectly, for instance because it is analytically intractable (the case for neural
network training) or because the training data come in as a time series and
their statistical properties change with time (as in adaptive online modeling).
Second best approach: Design an iterative algorithm which produces a sequence
of models (= parameter vectors) θ(0) , θ(1) , θ(2) , . . . with decreasing empirical
risk Remp (θ(0) ) > Remp (θ(1) ) > . . .. The model θ(n+1) is typically computed
by an incremental modification of the previous model θ(n) . The first model
θ(0) is a guess provided by the experimenter.
In neural network training, the hope is that this series converges to a model
θ(∞) = limn→∞ θ(n) whose empirical risk is close to the minimal possible em-
pirical risk. In online adaptive modeling, the hope is that if one incessantly and
iteratively tries to minimize the empirical risk, one stays close to the moving
target of the current best model (we’ll see how that works).
This scheme only treats the approach where one tries to minimize the training
loss. We know that in standard supervised learning settings this invites overfitting
and that one should better employ a regularized loss function together with some
cross-validation procedure, a complication that we ignore here. In online adaptive
modeling, the empirical risk Remp (θ) is time-varying, another complication that
for the time being we will ignore. Such complications notwithstanding, the general
rationale of iterative supervised learning algorithms is to compute a sequence of
models with decreasing empirical risk.
In standard (not online adaptive) settings, such iterative algorithms, if they
converge, can find only locally optimal models. A model θ(∞) is locally optimal
if every slight modification of it will lead to a higher empirical risk. The final
converged model limn→∞ θ(n) will depend on the initial guess θ(0) — coming up
with a method for good initial guesses was the crucial innovation that started the
deep learning revolution (Hinton and Salakhutdinov 2006).
wordings.
Performance surfaces can be folded in structures of stunning complexity. In
neural network training (next section), they are tremendously complex. Figure 75
shows a neural network error landscape randomly picked from the web.
The gradient ∇R(θ) is the vector which points from θ in the direction of
the steepest ascent (“uphill”) of the performance surface. The negative gradient
−∇R(θ) is the direction of the steepest descent (Figure 76).
The idea of model optimization by gradient descent is to iteratively move
toward a minimal-risk solution θ(∞) = limn→∞ θ(n) by always “sliding downhill”
in the direction of steepest descent, starting from an initial model θ(0) . This idea
is as natural and compelling as can be. Figure 77 shows one such itinerary.
The general recipe for iterative gradient descent learning goes like this: start
from an initial model θ(0), then iterate θ(n+1) = θ(n) − µ ∇R(θ(n)) until the risk
stops decreasing noticeably (or some other stopping criterion applies).
Figure 76: A performance surface for a 2-dimensional model family with param-
eters θ1 , θ2 , with its contour plot at the bottom. For a model θ (yellow star in
contour plot) the negative gradient is shown as black solid arrow. It marks the
direction of steepest descent (broken black arrow) on the performance surface.
The adaptation rate (or learning rate) µ is set to a small positive value.
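As a minimal sketch of this recipe (my own; the quadratic risk used as an example is a stand-in), gradient descent looks like this in code:

```python
import numpy as np

def gradient_descent(grad_R, theta0, mu=0.01, n_iter=1000):
    """Plain gradient descent: theta(n+1) = theta(n) - mu * grad_R(theta(n))."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iter):
        theta -= mu * grad_R(theta)
    return theta

# Example with a simple quadratic risk R(theta) = theta' C theta (C positive definite)
C = np.array([[3.0, 0.5], [0.5, 1.0]])
grad_R = lambda theta: 2.0 * C @ theta
print(gradient_descent(grad_R, theta0=[2.0, -1.5]))  # converges toward the minimum at (0, 0)
```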
An obvious weakness of this elegant and natural approach is that the final
model θ(∞) depends on the choice of the initial model θ(0) . In complex risk land-
scapes (as the one shown in Figure 75) there is no hope of guessing an initial
model which guarantees to end in the global minimum. This circumstance is gen-
erally perceived and accepted. There is a substantial mathematical literature that
amounts to “if the initial model is chosen with a good heuristic, the local minimum
that will be reached will be a rather good one with high probability”.
Figure 77: A gradient descent itinerary, re-using the contour map from Figure 76
and starting from the initial point shown in that figure. Notice the variable jump
length and the sad fact that from this initial model θ(0) the global minimum is
missed. Instead, the itinerary slides toward a local minimum at θ(∞) . The blue
arrows show the negative gradient at raster points. They are perpendicular to the
contour lines and their length is inversely proportional to the spacing between the
contour lines.
We will see that the real headaches with gradient descent optimization are of
a different nature — specifically, there is an inherently difficult tradeoff between
speed of convergence (one does not want to invest millions or even billions of
iterations) and stability (the iterations must not lead to erratic large-sized jumps
that lead away from the downhill direction). The adaptation rate µ plays a key role
in this difficult tradeoff. For good performance (stability plus satisfactory speed
of convergence), the adaptation rate must be adapted online while the iterations
proceed. And doing that sagaciously is not trivial. The situation shown in Figure
77 is deceptively simplistic; say goodbye to all hope that gradient descent works
as smoothly as that in real-life applications.
These difficulties raise their ugly head already in the simplest possible risk
surfaces, namely the ones that arise with iterative solutions to linear regression
problems. Making you aware of these difficulties in an analytically tractable,
transparent setting is one of the two main reasons why I believe a machine learning
student should know about adaptive online learning of transversal filters. You will
learn a lot about the challenges in neural network training as a side effect. The
other main reason is that adaptive online learning of transversal filters is really,
really super useful in many practical tasks.
over time. If this condition is met, the expectations

p = E[d(n) x(n)]    (131)

and

R = E[x(n) x′(n)],    (132)

where the expectation is taken with respect to time, are well-defined and indepen-
dent of n. R is called (in the field of signal processing) the correlation matrix of
the input process; it has size L × L. p is an L-dimensional vector.
In a stationary process there is no need to have different filter weights at
different times. We can thus, for now, drop the time dependence from w(n) and
consider unchanging weights w. For any such weight vector w we consider the
expected squared error

R(w) = E[ε²(n)] = E[(d(n) − w′ x(n))²] = E[(d(n))²] − 2 w′ p + w′ R w.    (133)

This expected squared error for a filter Hw is the risk that we want to minimize.
We now take a close look at the performance surface, that is at the graph of the risk
function R : RL → R. Its geometrical properties will turn out key for mastering
the adaptive filtering task. Figure 78 gives a visual impression of the performance
surface for the case of L = 2 dimensional weight vectors w = (w1 , w2 )0 .
I mention without proof some basic geometric properties of the performance
surface. The function (133) is a (multidimensional) quadratic function of w.
The general form of a multidimensional quadratic function F : R^k → R for k-
dimensional vectors x is F(x) = a + x′b + x′Cx, where a ∈ R is a constant
offset, x′b is the linear term, and x′Cx is the quadratic term. C must be a posi-
tive definite matrix or negative definite matrix. The graph of a multidimensional
quadratic function shares many properties with the graph of the one-dimensional
quadratic function f (x) = a + bx + cx2 . The one-dimensional quadratic function
graph has the familiar shape of a parabola, which depending on the sign of c is
opened upwards or downwards. Similarly, the graph of k-dimensional quadratic
function has the shape of a k-dimensional paraboloid, which is opened upwards
if C is positive definite and opened downwards if C is negative definite. A k-
dimensional paraboloid is a “bowl” whose vertical cross-sections are parabolas.
The contour curves of a k-dimensional paraboloid (the horizontal cross-sections)
are k-dimensional ellipsoids whose main axes lie in the directions of the k orthog-
onal eigenvectors u1 , . . . , uk of C.
In the 2-dimensional case of a performance surface shown in Figure 78, the
paraboloid must open upwards because the risk R is an expected squared error
and hence cannot be negative; in fact, the entire surface must be nonnegative. The
figure also shows a projection of the ellipsoid contour curves of the paraboloid on
the w1 -w2 plane, together with the eigenvectors of R.
Figure 78: The performance surface in the case of two-dimensional weight vectors
(black parts of this drawing taken from drip.colorado.edu/~kelvin/links/
Sarto_Chapter2.ps many years ago, page no longer online). An iterative al-
gorithm for weight determination would try to determine a sequence of weights
. . . , w(n) , w(n+1) , w(n+2) , . . . (green) that moves toward wopt (blue). The eigenvec-
tors uj (red) of the correlation matrix R lie on the principal axes of the hyperel-
lipsoids given by the level curves of the performance surface.
If you consider the definitions of R and p (Equations 132 and 131), you will
find that this solution for wopt is the empirical risk version of the linear regression
solution (19) for the minimal empirical risk from a quadratic error loss.
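The optimal weight vector can also be estimated directly from recorded data by plugging sample estimates of R and p into the Wiener-Hopf solution wopt = R⁻¹ p. A small sketch (my own, with NumPy; the test signals are made up):

```python
import numpy as np

def wiener_solution(x, d, L):
    """Estimate wopt = R^{-1} p from an input signal x and a teacher signal d.

    R and p are replaced by time averages over tap vectors x(n) = (x(n), ..., x(n-L+1))'.
    """
    X = np.array([x[n - L + 1 : n + 1][::-1] for n in range(L - 1, len(x))])  # tap vectors
    D = np.asarray(d)[L - 1 :]
    R = X.T @ X / len(X)          # sample correlation matrix, L x L
    p = X.T @ D / len(X)          # sample cross-correlation vector, length L
    return np.linalg.solve(R, p)  # wopt = R^{-1} p

# Hypothetical check: data generated by a known transversal filter are recovered.
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
w_true = np.array([0.5, -0.3, 0.1])
d = np.convolve(x, w_true)[: len(x)] + 0.01 * rng.standard_normal(len(x))
print(wiener_solution(x, d, L=3))   # close to w_true
```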
In order to appreciate the challenges inherent in iterative gradient descent on
quadratic performance surfaces we have to take a closer look at the shape of the
hyperparaboloid.
First we use (133) and the Wiener-Hopf equation to express the expected
residual error Rmin (see Figure 78) that we are left with once we have found wopt:
Rmin = E[(d(n))²] − 2 w′opt p + w′opt R wopt
     = E[(d(n))²] − w′opt p
     = E[(d(n))²] − w′opt R wopt
     = E[(d(n))²] − p′ R⁻¹ p.    (136)
For a more convenient analysis we rewrite the error function R in new coordi-
nates such that it becomes centered at the origin. Observing that the paraboloid
is centered on wopt, that it has "elevation" Rmin over the weight space, and that
the shape of the paraboloid itself is determined by w′opt R wopt, we find that we
can rewrite (133) as

R(v) = Rmin + v′ R v,    (137)

where we introduced shifted weight coordinates v = (w − wopt).
Differentiating (137) with respect to v yields

∂R/∂v = (∂R/∂v1, . . . , ∂R/∂vL)′ = 2 R v.    (138)
Since R is positive semi-definite, its SVD factorization is R = UDU′ =
UDU⁻¹, where the columns of U are made from L orthonormal eigenvectors
of R and D is a diagonal matrix containing the corresponding eigenvalues (which
are real and non-negative) on its diagonal. Note that the eigenvectors uj of R
lie on the principal axes of the hyperellipsoid formed by the contour lines of the
performance surface (see Figure 78, red arrows).
By left-multiplication of the shifted coordinates v = (w − wopt) with U′ we
finally get normal coordinates ṽ = U′ v. The coordinate axes of the ṽ system are
in the direction of the eigenvectors of R, and (138) becomes

∂R/∂ṽ = 2 D ṽ = 2 (λ1 ṽ1, . . . , λL ṽL)′,    (139)
from which we get the second derivatives

∂²R/∂ṽ² = 2 (λ1, . . . , λL)′,    (140)
that is, the eigenvalues of R are (up to a factor of 2) the curvatures of the per-
formance surface in the direction of the central axes of the hyperparabeloid. We
will shortly see that the computational efficiency of gradient descent on the per-
formance surface depends critically on these curvatures.
where the uj are again the orthonormal eigenvectors of R. Inserting (143) gives
us

w(n) = wopt + Σ_{j=1,...,L} ṽj^(0) (1 − 2µλj)^n uj.    (145)
Figure 79: The development of ṽj^(n) (y-axis) versus n (x-axis). The
qualitative behaviour depends on the stepsize parameter µ. a. Overdamped case:
0 < µ < 1/(2λj). b. Underdamped case: 1/(2λj) < µ < 1/λj. c. Unstable with
µ < 0 and d. unstable with 1/λj < µ. All plots start with ṽj^(0) = 1.
This representation reveals that the convergence of w(n) toward wopt is gov-
erned by an additive overlay of L exponential terms, each of which describes
convergence in the direction of the eigenvectors uj and is determined in its con-
vergence speed by λj and the stepsize parameter µ. One speaks of the L modes of
convergence with geometric ratio factors (1 − 2µλj ). If all eigenvalues are roughly
equal, convergence rates are roughly identical in the L directions. If however two
eigenvalues are very different, say λ1 ≫ λ2, and µ is small compared to the eigen-
values, then convergence in the direction of u1 will be much slower than in the
direction of u2 (see Figure 80).
Next we turn to the question how the error R evolves over time. Recall from
(137) that R(v) = Rmin + v′ R v, which can be re-written as R(v) = Rmin + ṽ′ D ṽ.
Thus the error in the n-th iteration is

R(n) = Rmin + ṽ′(n) D ṽ(n) = Rmin + Σ_j λj (1 − 2µλj)^{2n} (ṽj^(0))².    (146)
For suitable µ (considering (144)), R(n) converges to Rmin . Plotting R(n) yields
a graph known as learning curve. Equation 146 reveals that the learning curve is
the sum of L decreasing exponentials (plus Rmin ).
What this learning curve looks like depends on the size of µ relative to the
eigenvalues λj. If 2µλj is close to zero for all j, the learning curve separates into
Figure 80: Two quite different modes of convergence (panel a.) versus rather
similar modes of convergence (panel b.). Plots shows contour lines of performance
surface for two-dimensional weights w = (w1 , w2 ). Violet dotted lines indicate
some initial steps of weight evolution.
sections that each are determined by the convergence of one of the j components.
Figure 81 shows a three-mode learning curve for the case of small µ, rendered in
linear and logarithmic scale.
Figure 81: A learning curve with three modes of convergence, in linear (a.) and
logarithmic (b.) scaling. This plot shows the qualitative behavior of modes of
convergence when µ is small. Rmin is assumed zero in these plots.
This separation of the learning curve into approximately linear sections (in
logarithmic rendering) can be mathematically explained as follows. Each of the
terms (1 − 2µλj)^{2n} is characterized by a time constant τj according to

(1 − 2µλj)^{2n} = exp(−n/τj).    (147)

If 2µλ is close to zero, exp(−2µλ) is close to 1 − 2µλ and thus log(1 − 2µλ) ≈
−2µλ. Using this approximation, solving (147) for τj yields for the j-th mode a
time constant of

τj ≈ 1/(4µλj).
That is, the convergence rate (i.e. the inverse of the time constant) of the j-th
mode is proportional to λj (for small µ).
However, this analysis is meaningless for larger µ. If we want to maximize the
speed of convergence, we should use a significantly larger µ, as we will presently see.
The final rate of convergence is dominated by the slowest mode of convergence,
which is characterized by the geometrical sequence factor

max{|1 − 2µλj| : j = 1, . . . , L} = max{|1 − 2µλmax|, |1 − 2µλmin|}.    (148)
The deep learning revolution is based, among other factors, on an assortment of "tricks of the trade"
to overcome the limitations of large eigenvalue spreads by clever modifications
of gradient descent, which cannot work in its pure form. If you are interested
— Section 8 in the deep learning “bible” (I. Goodfellow, Bengio, and Courville
2016) is all about these refinements and modifications of, and alternatives to, pure
gradient descent.
The gradient descent update formula w(n+1) = w(n) − µ ∇R(w(n)), with the true
gradient replaced by the single-sample estimate −2 ε(n) x(n) (where we simplified
the notation εw(n)(n) to ε(n)), then becomes

w(n+1) = w(n) + 2 µ ε(n) x(n),    (153)
which is the weight update formula of one of the most compact, cheap, powerful
and widely used algorithms I know of. It is called the least mean squares (LMS)
algorithm in signal processing, or Widrow-Hoff learning rule in neuroscience, where
the same weight adaptation rule has been (re-)discovered independently from the
signal processing tradition. In fact, this online weight adaptation algorithm for
linear regression has been independently discovered and re-discovered many times
in many fields.
For completeness, here are all the computations needed to carry out one full
step of online filtering and weight adaptation with the LMS algorithm: form the
current tap vector x(n) = (x(n), . . . , x(n − L + 1))′; compute the filter output
y(n) = w′(n − 1) x(n); compute the error ε(n) = d(n) − y(n); and update the
weights by w(n) = w(n − 1) + 2 µ ε(n) x(n).
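In code, one LMS run over a pair of signals might look like this (my own sketch with NumPy; signal names, the filter length, and the learning rate are placeholders):

```python
import numpy as np

def lms(x, d, L, mu):
    """Run the LMS algorithm over an input signal x and a teacher signal d.

    Returns the final weights and the error signal.
    """
    w = np.zeros(L)
    errors = np.zeros(len(x))
    buf = np.zeros(L)
    for n in range(len(x)):
        buf = np.concatenate(([x[n]], buf[:-1]))   # tap vector (x(n), ..., x(n-L+1))'
        y = w @ buf                                # filter output
        eps = d[n] - y                             # current error
        w += 2 * mu * eps * buf                    # LMS weight update
        errors[n] = eps
    return w, errors

# Hypothetical usage: identify a known 3-tap system from noisy observations.
rng = np.random.default_rng(1)
x = rng.standard_normal(5000)
w_true = np.array([0.5, -0.3, 0.1])
d = np.convolve(x, w_true)[: len(x)] + 0.01 * rng.standard_normal(len(x))
w_hat, _ = lms(x, d, L=3, mu=0.01)
print(w_hat)   # should be close to w_true
```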
One fact about the LMS algorithm should always be kept in mind: being a stochas-
tic version of steepest gradient descent, the LMS algorithm inherits the problems
connected with the eigenvalue spread of the input process Xn . If its eigenvalue
spread is very large, the LMS algorithm will not work satisfactorily.
As an aside, in my working with recurrent neural networks, I once tried out
learning algorithms related to LMS. But the input signal to this learning algorithm
had an eigenvalue spread of 10^14 to 10^16, which resulted from the extremely multi-
curved geometry of the neural network's cost landscape, so the beautiful LMS
algorithm was entirely inapplicable.
Because of its eminent usefulness (if the input vector correlation matrix has
a reasonably small eigenvalue spread), the LMS algorithm has been analysed in
minute detail. I conclude this section by reporting the most important insights
without mathematical derivations. At the same time I introduce some of the
standard vocabulary used in the field of adaptive signal processing.
For starters, we again assume that [X L D] is a stationary process (recall
(130)). The evolution w(n) of weights is now also a stochastic process, because
the LMS weight update depends on the stochastic vectors x(n). One interesting
question is how fast the LMS algorithm converges in comparison with the ideal
steepest gradient descent algorithm ṽ(n+1) = (I−2µD) ṽ(n) from (141). Because we
now have a stochastic update, the vectors ṽ(n) become random variables and one
can only speak about their expected value E[ṽ(n)] at time n. (Intuitive explanation:
this value would be obtained if many (infinitely many in the limit) training runs
of the adaptive filter were carried out, in each of these runs the value
of ṽ(n) at time n were recorded, and an average were formed over all these
ṽ(n).) The following can be shown (using some additional assumptions, namely,
that µ is small and that the signal (x(n)) has no substantial autocorrelation for
time spans larger than L):
Somewhat to our surprise, if the LMS algorithm is used, the weights converge —
on average across different trials — as fast to the optimal weights as when the
ideal algorithm (141) is employed. Figure 82 shows an overlay of the deterministic
development of weights according to (141) with one run of the stochastic gradient
descent using the LMS algorithm.
Figure 82: Illustrating the similar performance of deterministic (pink) and stochas-
tic (red) gradient descent.
The fact that on average the weights converge to the optimal weights by no
means implies that R(n) converges to Rmin . To see why, assume that at some time
n, the LMS algorithm actually would have found the correct optimal weights, that
is, w(n) = wopt . What would happen next? Well, due to the random weight ad-
justment, these optimal weights would become misadjusted again in the next time
step! So the best one can hope for asymptotically is that the LMS algorithm lets
the weights w(n) jitter randomly in the vicinity of wopt. But this means that the
effective best error that can be achieved by the LMS algorithm in the asymptotic
average is not Rmin but Rmin + Rexcess , where Rexcess comes from the random scin-
tillations of the weight update. It is intuitively clear that Rexcess depends on the
stepsize µ. The larger µ, the larger Rexcess . The absolute size of the excess error
is not as interesting as the ratio M = Rexcess/Rmin, that is, the relative size of
the excess error compared to the minimal error. The quantity M is called the
misadjustment and describes what fraction of the residual error Rmin + Rexcess can
be attributed to the random oscillations effected by the stochastic weight update
[i.e., Rexcess ], and what fraction is inevitably due to inherent limitations of the
filter itself [i.e., Rmin ]. Notice that Rexcess can in principle be brought to zero by
tuning down µ toward zero — however, that would be at odds with the objective
of fast convergence.
Under some assumptions (notably, small M) and using some approximations
(Farhang-Boroujeny, Section 6.3), the misadjustment can be approximated by
M ≈ µ trace(R), (155)
where the trace of a square matrix is the sum of its diagonal elements. The
misadjustment is thus proportional to the stepsize and can be steered by setting
the latter, if trace(R) is known. Fortunately, trace(R) can be estimated online
from the sequence (x(n)) simply and robustly [how? — easy exercise].
Another issue that one has always to be concerned about in online adaptive
signal processing is stability. We have seen in the treatment of the ideal case that
the adaptation rate µ must not exceed 1/λmax in order to guarantee convergence.
But this result does not directly carry over to the stochastic version of gradient
descent, because it does not take into account the stochastic jitter of the gradient
descent, which is intuitively likely to be harmful for convergence. Furthermore,
the value of λmax cannot be estimated robustly from few data points in a practical
situation. Using again middle-league maths and several approximations, in the
book of Farhang-Boroujeny the following upper bound for µ is derived:
µ ≤ 1 / (3 trace(R)).    (156)
If this bound is respected, the LMS algorithm converges stably.
In practical applications, one often wishes to achieve an initial convergence
that is as fast as possible: this can be done by using µ close to the stability
boundary (156). After some time, when a reasonable degree of convergence has
been attained, one wishes to control the misadjustment M; then one switches into
a control mode where µ is adapted dynamically according to µ = M/trace(R),
which follows from (155).
Up to here we have been analyzing LMS under the assumption of a stationary
process. But the real working arena for LMS is nonstationary processes, where
the objective is to track the changing statistics of the [X L D]n process by contin-
ually adapting the model w(n) . In this situation one still uses the same LMS rule
(153). However, roughly speaking, the modeling error R(n) is now a sum of three
components: R(n) = Rmin (n) + Rexcess (n) + Rlag (n), all of which are temporally
changing. The new component Rlag (n) reflects the fact that iterative model adap-
tation always needs time to converge to a certain error level — when the signal
statistics change with time, the model adaptation always lags behind the changing
statistics. A rigorous analysis of this situation is beyond the scope of this course.
In practice one must tune µ online such that a good compromise is maintained
between, on the one hand, making Rlag (n) small (which means speeding up con-
vergence by increasing µ, in order to not lag behind the changing input statistics
too much), and on the other hand, minimizing Rexcess (n) (which means small µ);
all this while watching out for staying stable.
The LMS algorithm has been the workhorse of adaptive signal processing for some
50 years, and numerous refinements and variants have been developed.
Here are some:
1. An even simpler stochastic gradient descent algorithm than LMS uses only
the sign of the error in the update: w(n+1) = w(n) + 2 µ sign(ε(n))x(n). If µ
is a power of 2, this algorithm does not need a multiplication (a shift does
it then) and is suitable for very high throughput hardware implementations
which are often needed in communication technology. There exist yet other
“sign-simplified” versions of LMS [cf. Farhang-Boroujeny p. 169].
4. Block implementations: for very long filters (say, L > 10, 000) and high
update rates, even LMS may become too slow. Various computationally
efficient block LMS algorithms have been designed in which the input stream
is partitioned into blocks, which are processed in the frequency domain and
yield weight updates after every block only [Farhang-B. p. 247ff].
To conclude this section, it should be said that besides LMS algorithms there
is another major class of online adaptive algorithms for transversal filters, namely
recursive least squares (RLS) adaptation algorithms. RLS algorithms are not
steepest gradient-descent algorithms. The background metaphor of RLS is not to
minimize Rn(w) but to minimize the accumulated squared error up to the current
time, ζ(n) = Σ_{i=1,...,n} (d(i) − y(i))², so the performance surface we know from LMS
plays no role for RLS. The main advantages and disadvantages of LMS vs. RLS
are:
are:
1. LMS has computational cost O(L) per update step, where L is filter length;
RLS has cost O(L2 ). Also the space complexity of RLS is an issue for long
filters because it is O(L2 ).
3. RLS has a single mode of convergence and converges faster than LMS, very
much faster when the input signal has a high eigenvalue spread.
4. RLS is more complicated than LMS and more difficult to implement in ro-
bust, stable ways.
5. In applications where fast tracking of highly nonstationary systems is re-
quired, LMS may have better tracking performance than RLS (says Farhang-
Boroujeny).
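For orientation, here is a minimal sketch (my own) of the standard exponentially weighted RLS recursion; the forgetting factor lam and the initialization constant delta are conventional but hypothetical choices here.

```python
import numpy as np

def rls(x, d, L, lam=0.99, delta=100.0):
    """Recursive least squares for a transversal filter of length L.

    lam: forgetting factor (close to 1); delta: scale of the initial inverse
    correlation matrix estimate P = delta * I.
    """
    w = np.zeros(L)
    P = delta * np.eye(L)               # running estimate of (scaled) R^{-1}
    buf = np.zeros(L)
    for n in range(len(x)):
        buf = np.concatenate(([x[n]], buf[:-1]))
        Pb = P @ buf
        k = Pb / (lam + buf @ Pb)       # gain vector
        eps = d[n] - w @ buf            # a priori error
        w += k * eps                    # weight update
        P = (P - np.outer(k, Pb)) / lam # update of the inverse correlation matrix
    return w
```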
The use of RLS in signal processing has been boosted by the development of
fast RLS algorithms which reach linear time complexity, on the order of 20 L
operations per update [Farhang-B. Section 13].
Both LMS and RLS algorithms play a role in a field of recurrent neural net-
works called reservoir computing (Jaeger 2007), which happens to be one of my
personal playgrounds. In reservoir computing, the training of neural networks is
reduced to computing a linear regression. Reservoir computing has recently be-
come particularly relevant for low-power microchip hardware implementations of
neural networks. If time permits I will give an introduction to this emerging field
in a tutorial or extra session.
12 Feedforward neural networks: the Multilayer
Perceptron
Artificial neural networks (ANNs) have been investigated for more than half a
century in two scientific domains:

In computational neuroscience, ANNs are used as mathematical models of biological
neural systems, with the goal of understanding how nervous systems process
information.

In machine learning, ANNs are used for creating complex information process-
ing architectures whose function can be shaped by training from sample data.
The goal here is to solve complex learning tasks in a data engineering spirit,
aiming at models that combine good generalization with highly nonlinear
data transformations.
Historically these two branches of ANN research had been united. The ancestor
of all ANNs, the perceptron of Rosenblatt 1958, was a computational model of
optical character recognition (as we would say today) which was explicitly inspired
by design motifs imported from the human visual system (check out Wikipedia
on “perceptron”). In later decades the two branches diverged further and further
from each other, despite repeated and persistent attempts to re-unite them. Today
most ANN research in machine learning has more or less lost its connections to
its biological origins. In this course we only consider ANNs in machine learning.
Even if we only look at machine learning, ANNs come in many kinds and
variations. The common denominator for most (but not all) ANNs in ML can be
summarized as follows.
• The units of an ANN are connected to each other by links called “synaptic
connections” (an echo of the historical past) or just “connections” or “links”.
• Each connection carries a numerical weight. The weights of all connections can be collected in a global weight matrix whose structure mirrors the
connection graph. Often it is more convenient to split the global weight
matrix in submatrices, one for each “layer” of weights. In what follows I will
use the generic symbol θ for the vector of all weights in an ANN.
• The external functionality of an ANN results from the combined local in-
teractions between its interconnected units. Very complex functionalities
may thus arise from the structured local interaction between large numbers
of simple processing units. This is, in a way, analogous to Boolean circuits –
and indeed some ANNs can be mapped onto Boolean circuits. In fact, the
famous paper which can be regarded as the starting shot for computational
neuroscience, A logical calculus of the ideas immanent in nervous activity
(McCulloch and Pitts 1943), compared biological brains directly to Boolean
circuits.
• The hallmark of ANNs is that their functionality is learnt from training data.
Most learning procedures that are in use today rely on some sort of iterative
model optimization with a flavor of gradient descent.
This basic scenario allows for an immense spectrum of different ANNs, which
can be set up for tasks as diverse as dimension reduction and data compression,
approximate solving of NP-hard optimization problems, time series prediction,
nonlinear control, game playing, dynamical pattern generation and many more.
In this course I give an introduction to a particular kind of ANNs called feedforward neural networks. Often they are also called – for historical reasons – multilayer perceptrons (MLPs).
MLPs are used for the supervised learning of input-output tasks where both
input and output come in vector formats. In such tasks the training sample is of
the kind (ui , yi )i=1,...,N , where u ∈ Rn , y ∈ Rk are drawn from a joint distribution
PU,Y .
Note that in this section my notation departs from the one used in earlier
sections: I now use u instead of x to denote input patterns, in order to avoid
confusion with the network states x.
The MLP is trained to produce outputs y ∈ Rk upon inputs u ∈ Rn in such a way that this input-output mapping is similar to the relationships ui ↦ yi found in the training data. Similarity is measured by a suitable loss function.
Supervised learning tasks of this kind – which we have already studied in
previous sections – are generally called function approximation tasks or regression
tasks. It is fair to say that today MLPs and their variations are the most widely
used workhorse in machine learning when it comes to learning nonlinear function
approximation models.
An MLP is a neural network structure equipped with n input units and k
output units. An n-dimensional input pattern u is set as the state vector for the
input neurons, then the MLP does some interesting internal processing, at the end
of which the k-dimensional result vector of the computation can be read from the
activation of the k output units. An MLP N with n input units and k output
units thus instantiates a function N : Rn → Rk . Since this function is shaped
by the synaptic connection weights θ, one could also write Nθ : Rn → Rk if one
wishes to emphasize the dependence of N 's functionality on its weights.
The learning task is defined by a loss function L : Rk × Rk → R≥0 . As we have
seen before, a convenient and sometimes adequate choice for L is the quadratic
loss L(N (u), y) = kN (u) − yk2 , but other loss functions are also widely used.
Chapter 6.2 in the deep learning bible (I. Goodfellow, Bengio, and Courville 2016)
gives advice about which loss functions should be used in which task settings.
Given the loss function, the goal of training an MLP is to find a weight vector
θopt which minimizes the empirical loss, that is
   θ_opt = argmin_{θ∈H} (1/N) Σ_{i=1,...,N} L(N_θ(u_i), y_i).    (157)
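For the quadratic loss, the empirical loss in (157) can be computed in a few lines of Python; the following is a minimal sketch, where mlp stands for any candidate model (a placeholder name, not a particular library function):

    import numpy as np

    def empirical_loss(mlp, U, Y):
        # Mean quadratic loss (1/N) * sum_i ||mlp(u_i) - y_i||^2 over the sample.
        # U: array of shape (N, n) with inputs, Y: array of shape (N, k) with targets,
        # mlp: any callable mapping an n-vector to a k-vector.
        return np.mean([np.sum((mlp(u) - y) ** 2) for u, y in zip(U, Y)])

Function approximation tasks of the kind (157) appear in many guises; here are some typical examples.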
Time series prediction: inputs are vector encodings of a past history of a temporal
process, outputs are vector encodings of future observations of the process.
Examples are stock market timeseries or weather data recordings.
Denoising, restoration and pattern completion: inputs are patterns that are cor-
rupted by noise or other distortions, outputs are cleaned-up or repaired or
completed versions of the same patterns. Important applications can be
found for instance in satellite sensing, medical imaging or audio processing.
Data compression: Inputs are high-dimensional patterns, outputs are low-dimensional
encodings which can be restored to the original patterns using a decoding
MLP. The encoding and decoding MLPs are trained together.
Process control: In control tasks the objective is to send control inputs to a tech-
nological system (called “plant” in control engineering) such that the plant
performs in a desired way. The algorithm which computes the control inputs
is called a “controller”. Control tasks range in difficulty from almost trivial
(like controlling a heater valve such that the room temperature is steered to
a desired value) to almost impossible (like operating hundreds of valves and
heaters and coolers and whatnots in a chemical factory such that the chemi-
cal production process is regulated to optimal quality and yield). The MLP
instantiates the controller. Its inputs are settings for the desired plant be-
havior, plus optionally observation data from the current plant performance.
The outputs are the control actions which are sent to the plant.
This list should convince you that “function approximation” is a worthwhile
topic indeed, and spending effort on learning how to properly handle MLPs is a
good professional investment for any engineer or data analyst.
[Figure 83: schematic of an MLP, showing the layer of output neurons x^K_1, ..., x^K_{L_K}, the hidden layers of neurons x^1, ..., x^{K−1}, the connection weights w^m_{ij} between consecutive layers, and a constant-1 bias unit feeding into each layer.]
1. The input pattern u is written into the activations x^0_j of the input units, which form layer 0.
2. For layers 0 < m < K, assume that the activations x^{m−1}_j of units in layer m − 1 have already been computed (or have been externally set to the input values, in the case of m − 1 = 0). Then the activation x^m_i is computed from the formula

   x^m_i = σ( Σ_{j=1,...,L_{m−1}} w^m_{ij} x^{m−1}_j + w^m_{i0} ).    (158)
That is, x^m_i is obtained from linearly combining the activations of the lower layer with combination weights w^m_{ij}, then adding the bias w^m_{i0} ∈ R, then wrapping the obtained sum with the activation function σ. The activation function is a nonlinear, "S-shaped" function which I explain in more detail below. It is customary to interpret the bias w^m_{i0} as the weight of a synaptic link from a special bias unit in layer m − 1 which always has a constant activation of 1 (as shown in Figure 83).
Equation 158 can be more conveniently written in matrix form. Let x^m = (x^m_1, . . . , x^m_{L_m})' be the activation vector in layer m, let b^m = (w^m_{10}, . . . , w^m_{L_m 0})' be the vector of bias weights, and let W^m = (w^m_{ij})_{i=1,...,L_m; j=1,...,L_{m−1}} be the connection weight matrix for links between layers m − 1 and m. Then (158) becomes

   x^m = σ(W^m x^{m−1} + b^m),    (159)

where the activation function σ is applied component-wise to the activation vector.
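As a minimal sketch (assuming tanh hidden units and linear output units, with illustrative variable names), the forward pass (159) can be written in a few lines of Python:

    import numpy as np

    def forward(u, W, b, sigma=np.tanh):
        # Compute the MLP output for input u by iterating (159).
        # W[m], b[m]: weight matrix and bias vector feeding layer m+1;
        # the last layer is taken to be linear (cf. the output units in (160)).
        x = u
        for m in range(len(W) - 1):
            x = sigma(W[m] @ x + b[m])    # hidden layers: affine map, then sigmoid
        return W[-1] @ x + b[-1]          # linear output layer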
Figure 84: The tanh (blue), the logistic sigmoid (green), and the rectifier function
(red).
When an input pattern u is presented to the network, an activation wave sweeps forward through the layers of the network. The activation vector x^m in each layer m is directly triggered by the activations x^{m−1} according to (159).
The data transformation from xm−1 to xm is a relatively “mild” one: just an
affine linear map Wm xm−1 + bm followed by a wrapping with the gentle sigmoid
σ. But when several such mild transformations are applied in sequence, very
complex “foldings” of the input vector u can be effected. Figure 85 gives a visual
impression of what a sequence of mild transformations can do. Also the term
“feedforward neural network” becomes clear: the activation wave spreads in a
single sweep unidirectionally from the input units to the output units.
(Image source: http://www.motherearthliving.com/cooking-methods/celebrity-chefs-pizza-dough-recipes.aspx)
Figure 85: Illustrating the power of iterating a simple transformation. The baker
transformation (also known as horseshoe transformation) takes a 2-dimensional
rectangle, stretches it and folds it back onto itself. The bottom right diagram
visualizes a set that is obtained after numerous baker transformations (plus some
mild nonlinear distortion). — Diagrams on the right taken from Savi 2016.
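If you want to experiment with this stretch-and-fold idea yourself, here is a minimal sketch of one textbook version of the baker map; it is only an illustration of the principle, not the exact transformation shown in Figure 85:

    import numpy as np

    def baker(points):
        # One baker step on points in the unit square: stretch horizontally by 2,
        # squeeze vertically by 1/2, then cut the right half off and stack it on top.
        x, y = points[:, 0], points[:, 1]
        left = x < 0.5
        new_x = np.where(left, 2 * x, 2 * x - 1)
        new_y = np.where(left, y / 2, y / 2 + 0.5)
        return np.column_stack([new_x, new_y])

    pts = np.random.rand(1000, 2) * 0.1   # a small blob of points
    for _ in range(8):
        pts = baker(pts)                  # after a few iterations the blob is finely folded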
12.2 Universal approximation and “deep” networks
One reason for the popularity of MLPs is that they can approximate arbitrary
functions f : Rn → Rk . Numerous results on the approximation qualities of
MLPs were published in the early 1990s. Such theorems have the following
general format:
Theorem (schematic). Let F be a certain class of functions f : Rn → Rk .
Then for any f ∈ F and any ε > 0 there exists a multilayer perceptron N with
one hidden layer such that kf − N k < ε.
Such theorems differ with respect to the classes F of functions that are approx-
imated and with respect to the norms k · k that measure the mismatch between
two functions. All practically relevant functions belong to classes that are covered
by such approximation theorems. In a summary fashion it is claimed that MLPs
are universal function approximators. Again, don’t let yourself be misled by the
dryness of the word “function approximator”. Concretely, the universal function
approximation property of MLPs would spell out, for example, to the (proven)
statement that any task of classifying pictures can be solved to any degree of
perfection by a suitable MLP.
The proofs for such theorems are typically constructive: for some target func-
tion f and tolerance ε they explicitly construct an MLP N such that kf −N k < ε.
However, these constructions have little practical value because the constructed
MLPs N are far too large for any feasible implementation. You can find more
details concerning such approximation theorems and related results in my legacy
ML lecture notes https://www.ai.rug.nl/minds/uploads/LN_ML_Fall11.pdf,
Section 8.1.
Even when the function f that one wants to train into an MLP is very complex
(highly nonlinear and with many “baker folds”), it can in principle be approxi-
mated with 1-hidden-layer MLPs. However, when one employs MLPs that have
many hidden layers, the required overall size of the MLP (quantified by total
number of weights) is dramatically reduced (Bengio and LeCun 2007). Even for
super-complex target functions f (like photographic image caption generation),
MLPs of feasible size exist when enough layers are used (one of the subnetworks
in the TICS system described in Section 1.2.1 used 17 hidden layers, but today’s
deep networks easily have 50-100 layers). This is the basic insight and motivation
to consider deep networks, which is just another word for “many hidden layers”.
Unfortunately it is not at all easy to train deep networks. Traditional learning
algorithms had made non-deep ("shallow") MLPs popular since the 1980s. But
these shallow MLPs could only cope with relatively simple learning tasks. At-
tempts to scale up to larger numbers of hidden layers and more complex data sets
largely failed, due to numerical instabilities, too slow convergence, or poor model
quality. Since about 2006 an accumulation of clever “tricks of the trade”, plus the
availability of powerful, yet affordable, GPU hardware has overcome these hurdles.
This area of training deep neural networks is now one of the most thriving fields
of ML and has become widely known under the name deep learning.
12.3 Training an MLP with the backpropagation algorithm
In this section I give an overview of the main steps in training a non-deep MLP
for a mildly nonlinear task — tasks that can be solved with one or two hidden
layers. When it comes to unleash deep networks on gigantic training corpora for
hypercomplex tasks, the basic recipes given in this section will not suffice. You
would need to train yourself first in the art of deep learning, investing at least a
full university course's effort (this is an advertisement for Matthia Sabatelli's Deep
Learning course in summer). However, what you can learn in this section is a
necessary minimal outfit for surviving in the wilderness of deep learning.
1. Get a clear idea of the formal nature of your learning task. Do you want
a model output that is a probability vector? or a binary decision? or a max-
imally precise transformation of the input? how should “precision” best be
measured? and so forth. Only proceed with using MLPs if they are really
looking like a suitable model class for your problem. MLPs are raw power
cannons. If you fire them on feeble datasets you make a professional error:
simpler models (like random forests) will give better results at lower cost, with better controllability and interpretability!
5. Fix an MLP architecture. Decide how many hidden layers the MLP shall
have, how many units each layer shall have, what kind of sigmoid is used
and what kind of output function and loss function. The structure should be
rich enough that data overfitting becomes possible and your regularization
method can kick in.
6. Set up a cross-validation scheme in order to optimize the generalization performance of your MLPs. Recommended: implement a semi-automated k-fold cross-validation scheme which is built around two subroutines, (1) "train an MLP of certain structure for minimal training error on given training data", (2) "test MLP on validation data" (a bare-bones skeleton of such a scheme is sketched after this list). If, however, you are aiming for the high end with very large datasets and very large networks, classical cross-validation is not feasible due to the cost of training just a single model; the only escape then is early stopping.
7. Implement the training and testing subroutines. The training will be
done with a suitable version of the error backpropagation algorithm, which
will require you to meddle with some global control parameters (like learning
rate, stopping criterion, initialization scheme, and more).
8. Do the job. Enjoy the powers, and marvel at the wickedness, of ANNs.
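For step 6, a bare-bones k-fold cross-validation skeleton might look as follows; train_mlp and validation_loss are hypothetical placeholders for your own training and testing subroutines:

    import numpy as np

    def k_fold_cv(U, Y, train_mlp, validation_loss, k=5):
        # Estimate the generalization loss of one MLP architecture by k-fold CV.
        # train_mlp(U_train, Y_train) -> trained model
        # validation_loss(model, U_val, Y_val) -> scalar loss
        N = len(U)
        folds = np.array_split(np.random.permutation(N), k)
        losses = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model = train_mlp(U[train_idx], Y[train_idx])
            losses.append(validation_loss(model, U[val_idx], Y[val_idx]))
        return np.mean(losses)

Running this for a number of candidate architectures and picking the one with the lowest cross-validation loss is the semi-automated model selection loop meant in step 6.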
You see that “neural network training” is a multi-faceted thing and requires
you to consider all the issues that always jump at you in supervised machine
learning tasks. An ANN will not miraculously give good results just because it
has “neurons inside”. The actual “learning” part, namely solving the optimization
task (157), is only a subtask, albeit a conspicuous one because it is done with an
algorithm that has risen to fame, namely the backpropagation algorithm.
denoting the empirical risk of the model θ^(n), and with

   ∇R^emp(N_{θ^(n)}) = ( ∂R^emp/∂w_1 (θ^(n)), . . . , ∂R^emp/∂w_l (θ^(n)) )'    (162)

being the gradient of the empirical risk with respect to the parameter vector θ = (w_1, . . . , w_l)' at point θ^(n), update θ^(n) by taking a step against this gradient, scaled by a learning rate.
Stop when a stopping criterion chosen by you is met. This can be reaching a maximum number of iterations, or the empirical risk decrease falling under a predetermined threshold, or some early stopping scheme.
Note that, by linearity, the gradient (162) of the empirical risk is the average of the gradients of the individual losses, and this is also how it is actually computed: the gradient ∇L(N_θ(u_i), y_i) is evaluated for each training data example (u_i, y_i) and the obtained N gradients are averaged.
This means that at every gradient descent iteration θ(n) → θ(n+1) , all training
data points have to be visited individually. In MLP parlance, such a sweep through
all data points is called an epoch. In the neural network literature one finds
statements like “the training was done for 120 epochs”, which means that 120
average gradients were computed, and for each of these computations, N gradients
for individual training example points (ui , yi ) were computed.
When training samples are large — as they should be — one epoch can clearly be too expensive. Therefore one often resorts to minibatch training, where for each gradient descent iteration only a subset of the total sample S is used.
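As a minimal sketch (with illustrative names; grad_loss(theta, u, y) stands for whatever routine computes ∇L(N_θ(u), y), for instance the backpropagation algorithm described below), an epoch- and minibatch-wise gradient descent loop looks like this:

    import numpy as np

    def train(theta, U, Y, grad_loss, mu=0.01, epochs=120, batch_size=32):
        # Plain minibatch gradient descent on the empirical risk.
        # theta: parameter vector; U, Y: training inputs and outputs;
        # grad_loss(theta, u, y): gradient of the loss for a single example.
        N = len(U)
        for epoch in range(epochs):
            order = np.random.permutation(N)      # reshuffle the sample every epoch
            for start in range(0, N, batch_size):
                batch = order[start:start + batch_size]
                # average the per-example gradients over the minibatch
                g = np.mean([grad_loss(theta, U[i], Y[i]) for i in batch], axis=0)
                theta = theta - mu * g            # one gradient descent step
        return theta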
The backpropagation algorithm is a subroutine in the gradient descent game.
It is a particular algorithmic scheme for calculating the gradient ∇L(Nθ (ui ), yi )
for a single data point (u_i, y_i). High-school-math calculation of this quantity incurs a cost of O(l²) (where l is the number of network weights). When l is not extremely small (it will almost never be extremely small — a few hundred weights will be needed for simple tasks, and easily a million for deep networks applied to serious real-life modeling problems), this cost O(l²) is too high for practical use (and it has to be paid N times in a single gradient descent step!). The backprop algorithm is a clever scheme for computing and storing certain auxiliary quantities which cuts the cost down from O(l²) to O(l).
Here is how backprop works.
1. BP works in two stages. In the first stage, called the forward pass, the
current network Nθ is presented with the input u, and the output ŷ = Nθ (u)
is computed using the “forward” formulas (159) and (160). During this
forward pass, for each unit x^m_i that is not a bias unit and not an input unit, the quantity

   a^m_i = Σ_{j=0,...,L_{m−1}} w^m_{ij} x^{m−1}_j    (163)

(called the potential of unit x^m_i) is computed and stored.
2. Some bites of math in between. Applying the chain rule of calculus we have

   ∂L(N_θ(u), y) / ∂w^m_{ij} = ( ∂L(N_θ(u), y) / ∂a^m_i ) ( ∂a^m_i / ∂w^m_{ij} ).

Define

   δ^m_i = ∂L(N_θ(u), y) / ∂a^m_i.    (165)

Since (163) gives ∂a^m_i / ∂w^m_{ij} = x^{m−1}_j, this yields

   ∂L(N_θ(u), y) / ∂w^m_{ij} = δ^m_i x^{m−1}_j.    (167)
3. Computing the δ's for output units. Output units x^K_i are typically set up differently from hidden units, and their corresponding δ values must be computed in ways that depend on the special architecture. For concreteness here I stick with the simple linear units introduced in (160). The potentials a^K_i are thus identical to the output values ŷ_i and we obtain

   δ^K_i = ∂L(N_θ(u), y) / ∂ŷ_i.    (168)

This quantity is thus just the partial derivative of the loss with respect to the i-th output, which is usually simple to compute. For the quadratic loss L(N_θ(u), y) = ||N_θ(u) − y||², for instance, we get

   δ^K_i = 2 (ŷ_i − y_i).    (169)
4. Computing the δ's for hidden units. In order to compute δ^m_i for 1 ≤ m < K we again make use of the chain rule. We find

   δ^m_i = ∂L(N_θ(u), y) / ∂a^m_i = Σ_{l=1,...,L_{m+1}} ( ∂L(N_θ(u), y) / ∂a^{m+1}_l ) ( ∂a^{m+1}_l / ∂a^m_i ),    (170)

which is justified by the fact that the only path by which a^m_i can affect L(N_θ(u), y) is through the potentials a^{m+1}_l of the next higher layer. If we substitute (165) into (170) and observe (163) we get

   δ^m_i = Σ_{l=1,...,L_{m+1}} δ^{m+1}_l ∂a^{m+1}_l / ∂a^m_i
         = Σ_{l=1,...,L_{m+1}} δ^{m+1}_l ∂( Σ_{j=0,...,L_m} w^{m+1}_{lj} σ(a^m_j) ) / ∂a^m_i
         = Σ_{l=1,...,L_{m+1}} δ^{m+1}_l ∂( w^{m+1}_{li} σ(a^m_i) ) / ∂a^m_i
         = σ'(a^m_i) Σ_{l=1,...,L_{m+1}} δ^{m+1}_l w^{m+1}_{li}.    (171)
This formula describes how the δ^m_i in a hidden layer can be computed by "back-propagating" the δ^{m+1}_l from the next higher layer. The formula can be used to compute all δ's, starting from the output layer (where (168) is used — in the special case of a quadratic loss, Equation 169), and then working backwards through the network in the backward pass of the algorithm.
When the logistic sigmoid σ(a) = 1/(1 + exp(−a)) is used, the computation of the derivative σ'(a^m_i) takes a particularly simple form. Observing that for this sigmoid it holds that σ'(a) = σ(a)(1 − σ(a)) leads to

   σ'(a^m_i) = x^m_i (1 − x^m_i).
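Putting the formulas (163), (167), (168) and (171) together, here is a minimal sketch of the forward and backward pass for a single training example, assuming logistic sigmoid hidden units, linear output units (160) and the quadratic loss; all variable names are illustrative:

    import numpy as np

    def sigma(a):                              # logistic sigmoid
        return 1.0 / (1.0 + np.exp(-a))

    def backprop(u, y, W, b):
        # Gradient of the quadratic loss ||N_theta(u) - y||^2 for one example.
        # W[m], b[m]: weights and biases feeding layer m+1 (last layer linear).
        # Returns lists dW, db with the partial derivatives (167).
        xs, potentials = [u], []
        for m in range(len(W) - 1):            # forward pass, storing potentials (163)
            a = W[m] @ xs[-1] + b[m]
            potentials.append(a)
            xs.append(sigma(a))
        y_hat = W[-1] @ xs[-1] + b[-1]         # linear output units (160)

        deltas = [2.0 * (y_hat - y)]           # output deltas (168), quadratic loss (169)
        for m in range(len(W) - 2, -1, -1):    # backward pass (171)
            x_m = xs[m + 1]                    # x^m = sigma(a^m), so sigma'(a^m) = x^m (1 - x^m)
            deltas.insert(0, x_m * (1 - x_m) * (W[m + 1].T @ deltas[0]))

        dW = [np.outer(d, x) for d, x in zip(deltas, xs)]   # gradients (167)
        db = [d.copy() for d in deltas]        # bias gradients (the constant-1 "unit")
        return dW, db

The cost of both passes is proportional to the number of weights, which is the O(l) scaling claimed above.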
Although simple in principle, and readily implemented, using the backprop
algorithm appropriately is something of an art, even in basic shallow MLP training.
I hint at some difficulties:
13 Open doors in ML — leading where?
This section is optional reading and will not be queried in exercises or exams.
Data inefficiency. High-end “cognitive” ML solutions (like GPT3 for text gen-
eration) need vast amounts of training data, orders of magnitude more than
a human ever sees in a lifetime.
Lack of explainability. When a trained neural network outputs a result on a query input, it is unclear on what grounds this output was derived. NNs are very much black boxes. To achieve explainable AI is a burning issue, because human-understandable explanations are often critically needed. Examples: decisions made by self-driving cars (legal liabilities!), medical diagnosis support systems (life vs. death; a blind recommendation of a pill is not good enough), financial investment recommendations (investors want to have good reasons to spend millions), crime prediction or creditworthiness assessment tools (citizens should get explanations when they are pre-emptively detained or denied a credit). Human experts can usually explain why they
come up with an answer to a question, or at least, what they believe are
their reasons.
Slow and brittle learning from rewards. Some of the most stunning DL suc-
cesses, in particular in game playing, arise from reinforcement learning (RL,
which we unfortunately did not cover in this course), that is, learning from
past rewards to maximize future rewards. However, RL learning is excruci-
atingly slow and generalizes poorly to informal real-world applications. For
example, the seemingly simple learning task of optimizing the switching cy-
cles of the traffic lights at a single street crossing marks the limits of what
is currently achievable with RL in real-world settings. I have learnt this
from an external PhD project that I am supervising. A 2-year effort of a
professional developer team was needed to train a RL traffic light controller
(Müller et al. 2021), which in the end was not fielded because of deficits in
sensor reliability and termination of funding. A little girl who has burnt her
finger on a candle just once, will not put her finger into flames thereafter
— not only not into candle flames, but also not into flames from matches,
torches, chimney fires.
(assuming drunken speaker recordings were not in the training corpus). In
the neural network world this problem has long been known under the name of catastrophic forgetting. Only in the last few years have halfway satisfactory solutions to the continual learning problem begun to appear (briefly
introduced in Section 11). However, these solutions function only in limited
scenarios where the new task variants are of the same sort as the originally
learnt one — same input data format, same output data format, for in-
stance. You and I continue to learn all throughout our lives — for instance,
you have learnt about ML in the last weeks and I have learnt that there is
no blackboard in the Aletta Jacobs lecture halls. Furthermore, we are not
single-purpose but general-purpose performers. Not only can we recognize
faces, but also speak, sing, walk, make breakfast and funny grimaces, play
guitar (some of us), and all the rest. Try to come up with a complete list
for yourself: how long would that list be and how crazily diverse? It seems
we can do almost everything!
Why, oh why, is machine learning so far away from what we humans can learn?
What is the secret behind the magnificent efficiency, robustness, adaptability,
versatility and creativity of human learning? If I knew the answer, I would know
how to make gold from lead and brains from C++ code. All I can offer are some
personal speculations about what might be the big blind spot in our view on
“learning”, and even more speculatively, which scientific revolutions may be lying
ahead — maybe just around the next corner.
In this course I tried to draw the essential picture of ML as it is understood
today. All modern ML grows from one root idea, which is elegant, simple, con-
vincing and powerful, and which I tried to carry across in this course in ever so
many repetitions and variations. This idea is that
• learning means extracting useful information from data;
• real-world data are always stochastic;
• all “information” in stochastic data sources is contained in the joint dis-
tribution of the random variables that can be observed (the visible ones)
or inferred (the hidden ones) — this view of “information = properties of a
probability distribution” is also the core of information theory and statistical
physics;
• therefore, all machine learning models should be, and can be, understood as
modeling probability distributions;
• which directly leads to our modern perception of the main challenges of ML,
namely
– finding representation formats for complex distributions that can be
handled on computers — like decision trees, Bayesian networks or neu-
ral networks,
– finding ways to estimate such representations from data, such that the
estimated distributions come close to the true one — this leads to
the core strategies of maximum-likelihood modeling paired with cross-
validation, or Bayesian modeling, and generally connects modern ML
closely to classical statistics.
The powers of this line of thinking become apparent not only in the successes
of ML, but also have left a deep imprint on cognitive and neuroscience, where
many theories that try to explain human cognition and brains are expressed in
statistical frameworks. A small choice of important examples:
• the Bayesian brain approaches (also known as, or closely related to, the
paradigms of the predictive brain or the free energy model of adaptive cogni-
tion, see this Wikipedia article), which cast cognitive processing and neural
processes as adaptive mechanisms that help the brain’s owner to extract
important and reliable information from noisy sensor data, solving the gen-
eralization problem by applying genetically evolved or individually acquired
Bayesian priors;
• the neural sampling hypothesis which interprets the apparent stochasticity of
neural spikes as an MCMC sampling mechanism for computing approximate
inferences about probability distributions (Buesing et al. 2011);
• rigorous statistical models of neural mechanisms that enable reinforcement
learning in animals and humans (Xie and Seung 2004).
Beyond connecting ML to cognitive and neuroscience, the statistical view on learn-
ing also provides a basis for scientific exchange across other disciplinary bound-
aries, in particular statistical physics, signal processing & control, information and
communication theory, and optimization theory — and, of course, mathematical
statistics proper. Researchers in all of these fields share the intuition and the
mathematical techniques to describe reality through probability distributions.
But — do the best known brains (still, that’s ours!) actually process informa-
tion on the basis of statistical inference and learning representations of distribu-
tions? Or — are our brains so good exactly because they don’t?
A first indication that this might indeed be the case is the somewhat dis-
turbing fact that humans are actually quite bad at statistical reasoning, as doc-
umented in the landmark paper of Tversky and Kahneman 1974 (47K Google
Scholar cites, as of January 2023). Famously, in elementary diagnostic reason-
ing (inferring causes Y from symptoms X), students of medicine typically fail at
properly factoring in the base rate P (Y ) when they (should) apply Bayes’ rule
P (Y |X) = P (X|Y ) P (Y )/P (X) to get the conditional probability P (Y |X) of
causes given symptoms, which is why they tend to over-estimate the probability
of a serious, but rare, illness. Good medical schools therefore have mandatory
courses on medical decision making. Tversky and Kahneman present numerous
other, well-documented scenarios that reveal humans as statistically challenged.
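As a small worked illustration of the base-rate effect (with made-up numbers): suppose a disease has base rate P(Y = ill) = 0.001, a diagnostic test detects it with P(X = positive | ill) = 0.99, and gives false alarms with P(X = positive | healthy) = 0.05. Bayes' rule then gives

   P(ill | positive) = P(positive | ill) P(ill) / ( P(positive | ill) P(ill) + P(positive | healthy) P(healthy) )
                     = (0.99 · 0.001) / (0.99 · 0.001 + 0.05 · 0.999) ≈ 0.019,

so even after a positive test the probability of actually being ill is only about 2%, while intuition that ignores the base rate P(Y) suggests something close to 99%.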
Another indication comes from a strand in cognitive science that explains hu-
man conceptual systems and reasoning as something that has grown from bodily
experience and physical action in physical environments. I am a great fan of this
view and would love to give you a sparkling, full account of this tradition, but
that would mean an entire separate course. I must make do with a few links to
standard references. If you are interested, check out
• the book of Port and Gelder 1995, “Mind as Motion: Explorations in the
Dynamics of Cognition”, which collects papers that model cognition as non-
linear dynamical processes (as opposed to symbolic reasoning);
• or at a grander scale of abstraction, the philosophical school of evolutionary
epistemology (Bradie 1986) whose central tenet is that our human conceptual
and reasoning apparatus has biologically evolved and can be continuously
traced back to the prerational intelligence of amoeba, see the 1500 page, 1-
year workshop proceedings “Prerational Intelligence: Adaptive Behavior and
Intelligent Systems Without Symbols and Logic” (Cruse, Dean, and Ritter
2013) (I was a member of that workshop for three months);
• or the PhD thesis book of Drescher 1991, “Made-up Minds: A Construc-
tivist Approach to Artificial Intelligence”, in which the author (who then
left academia, what a pity) develops and implements a computer simulation
version of Piaget’s developmental theory — this work passed under most
radars for two decades and is now being re-discovered by Google Deep Mind
researchers;
• or, finally (my favorite!), the work of cognitive scientists / cognitive linguists
Lakoff and R. E. Núñez 2000, “Where mathematics comes from: How the
embodied mind brings mathematics into being”, who analysed the most ab-
stract concepts that our brains have, namely the concepts of mathematics
itself (summary article: Lakoff and R. Núñez 2013), and trace their origins
back to our abilities and practices of purposeful motor action: walking from
here to there becomes the ultimate root of the mathematical “line” concept,
putting things on a heap is found to be the source of the “set” concept, etc.,
all worked out in much greater detail than I can highlight here.
While there are many differences between these lines of thinking and research,
they all grow forward from one common idea: that our conceptual and reason-
ing system has gradually grown from our bodily interaction with our physical
environment, both in the evolution of our species and again and further in the
development of every individual. As a consequence, if one wants to understand
intelligence, one should not start “top-down” from symbols, concepts, logic, and
adult rational reasoning, but instead reconstruct these precious items “bottom-
up” as an end product that arises from the basic physical-bodily conditions of our
existence. Seems a plausible enough story, no? And yet, classical AI and cognitive
science wanted to do it the other way round for decades, and still often cling to
the “logic first” imperative. But following up on that theme would lead us too far
astray here.
Returning to our main topic. The bottom-up approach to intelligence has no
difficulties explaining the weaknesses and errors that we humans make all the time
when trying to think logically, talk grammatically, or make statistical inferences.
The question however remains — can a proper understanding of human cognition
help us to revolutionize machine learning? Isn’t it so, to the contrary, that trying
to build human-like machine learning systems would just give us ML systems that
are just as weak as humans when they would try to think, talk or predict? Or,
is it so that these weaknesses are the other side of a coin, whose face side would
be the positive powers of human cognition: all that astounding efficiency,
robustness, adaptability, versatility and creativity? We don’t know and I must
leave this question for you to think about. It’s a big one.
All I can do is to report on a recent development in machine learning, which
tries to flip that coin of learning on its other side. This recent work is based on
a 30 year long, specialized research line in Bayesian networks, which in turn
tried to come to terms with an eternal riddle of philosophy. I will tell this story
starting from that philosophical end.
That eternal riddle is the riddle to understand causality — intuitively, philo-
sophically, natural-science-ly, statistically.
Intuitively we recognize causality very well when we meet it. When I hit a nail
with a hammer, this hammer hit causes the nail to sink in a little further. It’s not
the other way round: you would call me crazy if I would claim that the nail’s little
sinking motion caused the hammer to hit it. Yet, both events are tightly — even
completely — correlated. They coincide. And causation seems always to connect
causes with effects directly, both happening simultaneously. The archetype of
cause-effect pairings is when a causing force is applied to effect a change in the
location, speed, or shape of an object. But the change happens exactly at the
same time while the force is applied — when there is no force, there is no change.
Yet, somehow, the force comes “first” and its effect is “dependent”, “secondary”.
But this is not a temporal “first” versus “second”.
A brief digression: when you check out the Wikipedia article on Granger
causality, you will find a statistical method, which is widely used in financial
forecasting, that claims (or tries) to pinpoint causality by exploiting temporal
precedence of causes to their effects. Clive Granger was awarded the Nobel prize
in economics. Though it is called Granger causality, critics and Granger himself
emphasize that this method does not capture the essence of causality, but only
a certain phenomenon of systematic temporal precedence. Nonetheless, Granger
causality analyses have been applied in a less (self)critical fashion, also in neuro-
science where these statistical methods have been used to establish that activity
in certain brain regions “causes” response activity in other regions, leading to a
directed functional connectivity graph of neural modules (run Google Scholar on
“functional connectivity granger causality”).
Returning again to the main thread, I want to confuse you further. There does
seem to be something temporal about causation. When you play a video of any
everyday life scene backwards, you will find that it looks completely weird. You
can very clearly tell whether the movie was played forward or backward: there is
an arrow of time which always flies forward. Yet, no physical law is violated in
the backward version: all fundamental laws of physics are time-reversible (well,
this needs some differentiation: while all models of classical physics are time re-
versible, in subatomic particle and quantum physics one needs to reverse parity
and charge together with time to obtain the time-mirrored equivalent laws). Be-
sides the fundamental laws of physics, also the second law of thermodynamics
has been invoked to explain why real-world time seems to differ in the for-
ward versus the reverse direction. This law states directly that there is an arrow
of time. It roughly says, the entropy of a system can only grow, or in more in-
tuitive terms: the world progresses from order to disorder. However, this is a
statistical law, not a fundamental one, and can explain the arrow of time only
by assuming that the state of the universe at its big bang beginning was highly
ordered, such that there is always enough leeway to continue becoming more dis-
ordered. The problem of the arrow of time becomes a problem of cosmology.
Furthermore, this statistical view cannot explain why you think that it is the
hammer that drives the nail, and not the nail that pulls the hammer. The ar-
row of time problem is not satisfactorily solved by today’s physics. Check out
https://en.wikipedia.org/wiki/T-symmetry for more information on the physics
of time reversal, https://en.wikipedia.org/wiki/Arrow of time for an introductory
(somewhat shallow) discussion of this arrow not only in physics but also in other
domains of science and life, and Callender 2011 for an introduction to the philos-
ophy of time.
Cutting things short and maybe a bit unfairly, let me summarize the above
as follows: while intuitively we see a close connection between causation and the
arrow of time, physicists know that they don’t know what exactly this link is, and
philosophers know that they do not even know what they don’t know. Causation
is as ungraspable as time, space, and randomness. It is an eternal riddle.
After having set the stage in this big way, I come to this 30-year long research
line in Bayesian networks that I announced above. This part of the story is a
tribute to Judea Pearl. You know him as the pioneer of Bayesian networks. As
you have learnt in this course, Bayesian networks (BNs) are our best (in many
ways, only) tool when it comes to reason about joint probability distributions
of many interacting random variables. Specifically, BNs allow us to calculate
conditional probabilities P (Y |X) of causes given symptoms, where symptoms X
are observed and causes Y are inferred. Wait... “causes”?? This is the word that
is customarily used in the BN literature. Furthermore, BNs are also known as
“causal” networks, and the arrows in a BN are said to reflect “causal” influences:
a modeler drawing the graph structure of a BN should draw an arrow Y → X
when he/she thinks there is a "causal" influence from Y to X. This is all right because, and inasmuch as, a human modeler has a clear intuitive grasp on causation.
But above I painted a picture that shows that our intuitive understanding does not
amount to scientific or philosophical understanding. Thus, it seems that calling
BNs and the arrows within “causal” is neither good science, nor good philosophy
— just a selling point and a helper to let modelers think of heuristically okay’ish
graph structures.
The first person to become frustrated by this conceptual blank in BNs was
Pearl himself. Well, I guess so; I don’t know him in person. But it is a fact that
after establishing the foundations of BNs as you saw them in our course, ever after
1991 he directed his efforts to detecting true causality in statistical models (early
publications: Pearl and Verma 1992; Pearl 1993). Apparently his preoccupation
with causality started much earlier. Citing Pearl (after Russell 2011):
“I got my first hint of the dark world of causality during my junior year of
high school. My science teacher, Dr. Feuchtwanger, introduced us to the study of
logic by discussing the 19th century finding that more people died from smallpox
inoculations than from smallpox itself. Some people used this information to argue
that inoculation was harmful when, in fact, the data proved the opposite, that
inoculation was saving lives by eradicating smallpox.
And here is where logic comes in,” concluded Dr. Feuchtwanger, “To protect
us from cause-effect fallacies of this sort.” We were all enchanted by the marvels of
logic, even though Dr. Feuchtwanger never actually showed us how logic protects
us from such fallacies.
It doesn’t, I realized years later as an artificial intelligence researcher. Neither
logic, nor any branch of mathematics had developed adequate tools for managing
problems, such as the smallpox inoculations, involving cause-effect relationships.”
Bayesian networks, as you know them, do not allow one to pin down causality.
Citing Russell 2011: “A Bayesian network such as Smoking → Cancer fails to
capture causal information; indeed, it is mathematically equivalent to the network
Cancer → Smoking.” No model of a joint distribution has information about
true causal dependencies between the modeled random variables. All that can be
inferred from any such model are statistical dependencies of various sorts (corre-
lations, conditional probabilities), but as you surely have heard, Correlation Does
Not Imply Causation, and neither does the stronger brother of correlation, sta-
tistical dependence. In this situation, Pearl extended the mathematical apparatus
of statistics and added a new operator, today called the do-operator.
Here I can only give you an intuition of this operator; the mathematical theory
behind it is not super easy and describing it would turn your reading experience
into real work (if you want to dig deeper: the tutorial by Schölkopf, Locatello, et
al. 2021 gives a brilliantly written introduction). So, — what does the do-operator
do?
One way to understand it is to take a look at clinical studies that aim at
establishing the true causal efficacy of a new drug. To achieve this goal, a double-
blind study must be carried out. The new drug or a placebo is administered
to two balanced populations of patients. If the drug-administered population
shows a statistically significant improved health status compared to the placebo
group, a (positive) causal effect of the drug is established. While this scheme is
sound, it suffers from statistical variance, and from imperfections in balancing the
treatment versus placebo groups. The statistical efficiency of this scheme would be
very much enhanced if one could give the drug to the very same pool of patients as
the placebo. For this, however, one would need two worlds, which at the beginning
of the study would be identical, and where in world A one would give the patient
pool the drug, and in world B the placebo, and then compare the outcomes. But
we have only one world.
Here comes the do-operator. It creates these two worlds. Staying in the medical
domain, consider the classical probability P (X = x | Y = y), which describes the
probability that a health status x is observed after administering a treatment y.
Estimating the conditional distribution PX|Y from data — the standard basic task
of statistics and machine learning — is based on datasets (xi , yi )i=1,...,N , where
the cases i correspond to individual patients. Unfortunately, each patient i is
either given the drug, or the placebo. This bars the way to retrieving causality
from these data, because causal influences are confounded with the assignments
of patients i to the treatment versus placebo populations. We would however
get causality information if we could observe each patient i in two conditions, one
observation giving the health outcome x_i when the drug y_i was given, and the other observation x′_i when the placebo y′_i was given. The do-operator gives us these two
sorts of probabilities. The formal expression P (X = x | do(Y = y)) represents the
probability of getting health outcome x if all patients received treatment y, and
P(X = x | do(Y = y′)) is the probability of getting outcome x under administering the placebo y′ to all patients. In the parlance of causal statistics, the two "do-settings" do(Y = y) and do(Y = y′) are called interventions. If we knew these two kinds of interventional probabilities, we could assert a causal effect of the treatment (if there is any).
Obviously, these interventional statistics are not contained in the data. Each
patient i is observed only under one of the two conditions: the dataset contains
either (x_i, y_i) or (x′_i, y′_i), but not both. To get "data" for estimating P(X = x | do(Y = y)) and P(X = x | do(Y = y′)), for each patient i we would need counterfactual information that complements the actually observed half of the picture, for instance: "if patient i had been given the placebo y′, we would have observed outcome x′_i". The new statistical theory, which was worked out and step-
wise refined by Pearl and others following him, has identified a number of
additional assumptions — often satisfiable in reasonably good approximation —
and computational techniques, which together allow us to derive the interventional
distributions PX|do(Y ) from a classical Bayesian network which was estimated from
“one-world-only” data (xi , yi )i=1,...,N . These techniques involve the construction
of modified versions of the original BN, in which certain links are deleted, and
involved estimations of auxiliary conditional probabilities.
You might ask, why not simply collect data like it is done in double-blind clin-
ical studies, and then have direct access to causality through these data? This is
not an option in general, for several reasons. First, in clinical studies the inde-
pendent variable (drug versus placebo) is binary, which mandates two balanced
patient populations. If one wants to study the causal impact of an independent
variable with n different values, one would need n separate patient populations.
And if one wants to investigate the causal relationships between k variables, each
with n values, one would need nk separate data-donating populations — a dead
end. Furthermore, BNs are typically estimated from data that do not come from
controlled experiments, but from opportunistically collected ad-hoc observations,
motto: “use any data you can get” (as was the case for the Space Shuttle example
in Section 10, and is the case for all the magnificent DL super-machines). The
do-operator methods of causal analysis enable the assessment of the direction and
strength of causal influence between all modeled random variables in all of these
scenarios! (BIG exclamation mark!)
So far, causal analysis techniques have become widely adopted mostly in medicine
and in particular in epidemiology (as far as I can tell; I am not an expert and this
is second-hand but reliable information). A recent Master’s project that I was
lucky enough to accompany from a distance (it was carried out at the Harvard
Medical School by RUG student Richard Fischer) provides a beautiful case study of causal analysis and, if you are interested in these matters, is a recommendable read with transparent methods explanations (Fischer 2023). This
thesis also illustrates the remaining problems of these techniques, among them nu-
merical instabilities and the need for numerous clever algorithm design decisions
from the side of the data analyst. It’s not (yet) a download-and-be-happy tool for
non-expert users.
Before we proceed to the hopes for revolutionizing ML and DL at large, let us
step back and consider the big question of causality once again. Causal analysis
in the spirit of Pearl does not give a philosophical answer to the eternal question
of the essence of causality. It is a computational technique that lets us identify
a specific kind of causality in data. What is the nature of this specific kind
of causality? The key lies in the little word “do”. The do-operator formalizes
the decision to assign a drug or placebo to a patient, or in more general terms,
to set the value of an independent variable by a wilful act of an experimenter.
This is analogous to how experiments in the natural sciences should be carried out:
the values of independent variables are systematically varied while otherwise the
experimental set-up is not changed. In our statistical terminology, these controlled
laboratory conditions amount to repeated observations of the same “individual”
i (the unchanged experiment set-up) under application of different “treatments”.
The workings of the do-operator are also analogous to how human children and adults
learn about causal effects of their actions: in our daily experimentation with the
world around us we infer what our acts causally effect by comparing the outcomes
when we do something versus when we don’t do it. We observe that the nail does
not sink in while we do not use the hammer, and that it sinks in when we do. The
philosophical side of this: we learn about a kind of causality that is connected
to our wilful actions, “wilful” meaning that we decide when and what to do or
not to do. This view on “causality” is philosophically worked out in Kutach 2011
(warning: not an easy read, it’s a philosopher’s text!).
And now, finally, we return to ML, and DL in particular. We started our
thought odyssey by asking the double question, what is it in purely statistical dis-
tribution modeling that makes it so weak with regards to efficiency, generalization,
robustness etc., and what is it in human cognition and learning that makes us so
strong with regards to the same (at the price of being weak statisticians)? One
possible answer is that our human conceptual system is not organized around prob-
abilities, correlations and statistical dependencies, but around chains of causality,
and furthermore, that the sort of causality that we use roots in our experience of
the consequences of wilful action.
This leads to a revolutionary research programme for putting ML on a new
theoretical foundation. Instead of looking at data in search of statistical struc-
ture, look at them in search for causal structures. The hope and promise of this
view is that, if a ML model is structured by representational modules that inter-
act along causal influence chains, not through statistical dependency connections,
information from new data not contained in the original training corpus can be
incorporated much faster and more robustly. For instance, if a child has learnt
that when she hits a small solid object with a heavier one, the impact will cause
the smaller object to be driven away from the impact direction, and this child then
gets a hammer into her hands for the first time in her life, she will predict that
the nail will go down. A purely statistical learner, in turn, will be quite helpless
because hammers were not present in the training data, and a good amount of new
hammering data would be needed to establish a sufficiently significant correlation
between the hammer’s and the nail’s behavior.
This is an intriguing perspective. Pearl’s original do-operator formalism is tied
to the mathematical BN framework, and furthermore, causal analysis starts from a
trained BN and the do-operator does not instruct us how we can improve training
the BN in the first place (as far as I can judge, I must repeat that I am no BN
expert). Thus, Pearl’s ingenious studies do not directly show us how we could
put ML at large on a new foundation. Enter Bernhard Schölkopf. Renowned for his contributions to statistical learning theory, kernel methods in general and support vector machines in particular, Schölkopf has for more than a decade taken up the challenge of making causality principles fertile for ML in general. He has
teamed up with deep neural network researchers. Two recent, extensive, visionary
tutorials written by him together with other ML experts (among them, DL pioneer
Yoshua Bengio) will, as far as I can forecast, become a milestone in spreading the
idea (Schölkopf, Locatello, et al. 2021; Schölkopf and Kügelgen 2022). Deep neural
network models that are organized around causality are beginning to sprout (for
example, Tangemann et al. 2021). Do not expect an immediate total overthrow
of machine learning, however. Causal analysis is subtle and difficult enough even
in the restricted BN setting, and transfer to other branches of ML is far from
straightforward. I conclude by citing the concluding remarks of Schölkopf and
Kügelgen 2022: “Most of the discussed fields are still in their infancy [...]. With
the current hype around machine learning, there is much to say in favor of some
humility towards what machine learning can do, and thus towards the current state
of AI — the hard problems have not been solved yet, making basic research in this
field all the more exciting.”
Appendix
A Elementary mathematical structure-forming
operations
A.1 Pairs, tuples and indexed families
If two mathematical objects O1 , O2 are given, they can be grouped together in a
single new mathematical structure called the ordered pair (or just pair ) of O1 , O2 .
It is written as
(O1 , O2 ).
In many cases, O1 , O2 will be of the same kind, for instance both are integers. But
the two objects need not be of the same kind. For instance, it is perfectly possible
to group integer O1 = 3 together with a random variable (a function!) O2 = X7
in a pair, getting (3, X7 ).
The crucial property of a pair (O1 , O2 ) which distinguishes it from the set
{O1 , O2 } is that the two members of a pair are ordered, that is, it makes sense to
speak of the "first" and the "second" member of a pair. In contrast, it makes no sense to speak of the "first" or "second" element of the set {O1, O2}. Related to
this is the fact that the two members of a pair can be the same, for instance (2, 2)
is a valid pair. In contrast, {2, 2} makes no sense.
A generalization of pairs is N -tuples. For an integer N > 0, an N -tuple of N
objects O1 , O2 , . . . , ON is written as
(O1 , O2 , . . . , ON ).
1-tuples are just individual objects; 2-tuples are pairs, and for N > 2, N -tuples
are also called lists (by computer scientists that is; mathematicians rather don’t
use that term). Again, the crucial property of N -tuples is that one can identify its
i-th member by its position in the tuple, or in more technical terminology, by its
index. That is, in an N -tuple, every index 1 ≤ i ≤ N “picks” one member from
the tuple.
The infinite generalization of N -tuples is provided by indexed families. For any
nonempty set I, called an index set in this context,
(Oi )i∈I
denotes a compound object assembled from as many mathematical objects as
there are index elements i ∈ I, and within this compound object, every individual
member Oi can be “addressed” by its index i. One simply writes
Oi
to denote the ith “component” of (Oi )i∈I . Writing Oi is a shorthand for applying
the ith projection function on (Oi )i∈I , that is, Oi = πi ((Oi )i∈I ).
A.2 Products of sets
We first treat the case of products of a finite number of sets. Let S1 , . . . , SN be
(any) sets. Then the product S1 × . . . × SN is the set of all N -tuples of elements
from the corresponding sets, that is,
S1 × . . . × SN = {(s1 , . . . , sN ) | si ∈ Si }.
This generalizes to infinite products as follows. Let I be any set — we call it
an index set in this context. For every i ∈ I, let Si be some set. Then the product
set indexed by I is the set of functions
   Π_{i∈I} S_i = {ϕ : I → ∪_{i∈I} S_i | ∀i ∈ I : ϕ(i) ∈ S_i}.

For example, taking the index set I = N = {0, 1, 2, . . .} and S_i = R for every i ∈ I, the product Π_{i∈N} R is the set of all right-infinite real-valued timeseries (with discrete time points starting at time n = 0).
B Joint, conditional and marginal probabilities
Note. This little section is only a quick memory refresher of some of the most
basic concepts of probability. It does not replace a textbook chapter!
We first consider the case of two observations of some part of reality that
have discrete values. For instance, an online shop creating customer profiles may
record from their customers their age and gender (among many other items). The
marketing optimizers of that shop are not interested in the exact age but only in
age brackets, say a1 = at most 10 years old, a2 = 11 − 20 years, a3 = 21 − 30
years, a4 = older than 30. Gender is roughly categorized into the possibilities
g1 = f, g2 = m, g3 = o. From their customer data the marketing guys estimate the
following probability table:
P(X = g_i, Y = a_j)    a1      a2     a3     a4
g1                     0.005   0.3    0.2    0.04
g2                     0.005   0.15   0.15   0.04        (172)
g3                     0.0     0.05   0.05   0.01
The cell (i, j) in this 3 × 4 table contains the probability that a customer with
gender gi falls into the age bracket aj . This is the joint probability of the two
observation values gi and aj . Notice that all the numbers in the table sum to 1.
The mathematical tool to formally describe a category of an observable value is
a random variable (RV). We typically use symbols X, Y, Z, . . . for RVs in abstract
mathematical formulas. When we deal with concrete applications, we may also
use “telling names” for RVs. For instance, in Table (172), instead of P (X =
gi , Y = aj ) we could have written P (Gender = gi , Age = aj ). Here we have two
such observation categories: gender and age bracket, and hence we use two RVs
X and Y for gender and age, respectively. In order to specify, for example, that
female customers in the age bracket 11-20 occur with a probability of 0.3 in the
shop’s customer reservoir (the second entry in the top line of the table), we write
P (X = g1 , Y = a2 ) = 0.3.
Some more info bits of concepts and terminology connected with RVs. You
should consider a RV as the mathematical counterpart of a procedure or apparatus
to make observations or measurements. For instance, the real-world counterpart of
the Gender RV could be an electronic questionnaire posted by the online shop, or
more precisely, the "what is your gender?" box on that questionnaire, plus the whole
internet infrastructure needed to send the information entered by the customer
back to the company’s webserver. Or in a very different example (measuring
the speed of a car and showing it to the driver on the speedometer) the real-
world counterpart of a RV Speed would be the total on-board circuitry in a car,
comprising the wheel rotation sensor, the processing DSP microchip, and the
display at the dashboard.
A RV always comes with a set of possible outcomes. This set is called the
sample space of the RV, and I usually denote it with the symbol S. Mathematically,
a sample space is a set. The sample space for the Gender RV would be the set
S = {m, f, o}. The sample space for Age that we used in the table above was S =
{{0, 1, . . . , 10}, {11, . . . , 20}, {21, . . . , 30}, {31, 32, . . .}}. For car speed measuring
we might opt for S = R≥0 , the set of non-negative reals. A sample space can be
larger than the set of measurement values that are realistically possible, but it
must contain at least all the possible values.
Back to our table and the information it contains. If we are interested only in
the age distribution of customers, ignoring the gender aspects, we sum the entries
in each age column and get the marginal probabilities of the RV Y . Formally, we
compute
    P(Y = aj) = Σ_{i=1,2,3} P(X = gi, Y = aj).
              a1       a2       a3       a4
    g1       0.005    0.3      0.2      0.04      0.545
    g2       0.005    0.15     0.15     0.04      0.345          (173)
    g3       0.0      0.05     0.05     0.01      0.110
             0.01     0.5      0.4      0.09
Notice that the marginal probabilities of age 0.01, 0.5, 0.4, 0.09 sum to 1, as do
the gender marginal probabilities.
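In code (an illustration only, reusing the numpy array for Table (172) from the sketch above), marginals are just sums along one axis of the joint table:

    # Sketch: marginal probabilities by summing the joint table over one axis.
    import numpy as np

    P_joint = np.array([[0.005, 0.30, 0.20, 0.04],
                        [0.005, 0.15, 0.15, 0.04],
                        [0.000, 0.05, 0.05, 0.01]])   # Table (172)

    P_age = P_joint.sum(axis=0)       # P(Y = a_j): [0.01, 0.5, 0.4, 0.09]
    P_gender = P_joint.sum(axis=1)    # P(X = g_i): [0.545, 0.345, 0.110]
    print(P_age, P_gender)
    print(P_age.sum(), P_gender.sum())   # each marginal again sums to 1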
Finally, the conditional probability P (X = gi | Y = aj ) that a customer has
gender gi given that the age bracket is aj is computed through dividing the joint
probabilities in column j by the sum of all values in this column:
    P(X = gi | Y = aj) = P(X = gi, Y = aj) / P(Y = aj).          (174)
There are two equivalent versions of this formula:
P (X = gi , Y = aj ) = P (X = gi | Y = aj )P (Y = aj ) (175)
where the righthand side is called a factorization of the joint distribution on
the lefthand side, and
    P(Y = aj) = P(X = gi, Y = aj) / P(X = gi | Y = aj),          (176)
demonstrating that each of the three quantities (joint, conditional, marginal probability) can be expressed by the respective two others. If you memorize one of these formulas – I recommend the second one – you have memorized the very key to mastering “probability arithmetic” and will never get lost when manipulating probability formulas.
The factorization (175) can be done in two ways: P (Y = aj | X = gi )P (X =
gi ) = P (X = gi | Y = aj )P (Y = aj ), which gives rise to Bayes’ formula
    P(Y = aj | X = gi) = P(X = gi | Y = aj) P(Y = aj) / P(X = gi),          (177)
which has many uses in statistical modeling because it shows how one can reverse the conditioning direction.
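The following sketch (illustrative only, again reusing the numpy array for Table (172)) computes the conditional probabilities (174) and reverses the conditioning direction as in Bayes' formula (177):

    # Sketch: conditional probabilities and Bayes' formula from the joint table.
    import numpy as np

    P_joint = np.array([[0.005, 0.30, 0.20, 0.04],
                        [0.005, 0.15, 0.15, 0.04],
                        [0.000, 0.05, 0.05, 0.01]])   # Table (172)

    P_age = P_joint.sum(axis=0)          # P(Y = a_j)
    P_gender = P_joint.sum(axis=1)       # P(X = g_i)

    # Eq. (174): divide each column of the joint table by its column sum
    P_gender_given_age = P_joint / P_age

    # Eq. (177): reverse the conditioning direction
    P_age_given_gender = P_gender_given_age * P_age / P_gender[:, None]

    print(P_gender_given_age[:, 1])        # gender distribution within age bracket a2
    print(P_age_given_gender.sum(axis=1))  # each conditional distribution sums to 1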
Joint, conditional, and marginal probabilities are also defined when there are
more than two categories of observations. For instance, the online shop marketing
people also record how much a customer spends on average, and formalize this by
a third random variable, say Z. The values that Z can take are spending brackets,
say s1 = less than 5 Euros to s20 = more than 5000 Euros. The joint probability
values P (X = gi , Y = aj , Z = sk ) would be arranged in a 3-dimensional array
sized 3 × 4 × 20, and again all values in this array together sum to 1. Now there
are different arrangements for conditional and marginal probabilities, for instance
P (Z = sk | X = gi , Y = aj ) is the probability that among the group of customers
with gender gi and age aj , a person spends an amount in the range sk . Or P (Z =
sk , Y = aj | X = gi ) is the probability that in the gender group gi a person is aged
aj and spends sk . As a last example, the probabilities P (X = gi , Z = sj ) are the
marginal probabilities obtained by summing away the Y variable:
    P(X = gi, Z = sj) = Σ_{k=1,2,3,4} P(X = gi, Y = ak, Z = sj)          (178)
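A sketch of the three-variable case (array size as in the text, but filled with made-up random numbers): the joint probabilities form a 3 × 4 × 20 array, and marginals such as (178) are obtained by summing over the axis that is “summed away”:

    # Sketch: marginalizing a 3 x 4 x 20 joint probability array (gender, age, spending).
    import numpy as np

    rng = np.random.default_rng(0)
    P_xyz = rng.random((3, 4, 20))
    P_xyz /= P_xyz.sum()                 # normalize: all entries together sum to 1

    P_xz = P_xyz.sum(axis=1)             # Eq. (178): sum away the age variable Y
    print(P_xz.shape)                    # (3, 20)
    print(np.isclose(P_xz.sum(), 1.0))   # marginal probabilities again sum to 1: True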
So far I have described cases where all kinds of observations were discrete, that
is, they (i.e. all RVs) yield values from a finite set – for instance the three gender
values or the four age brackets. Equally often one faces continuous random values
which arise from observations that yield real numbers – for instance, measuring
the body height or the weight of a person. Since each such RV can give infinitely
many different observation outcomes, their probabilities cannot be represented in
a table or array. Instead, one uses probability density functions (pdf’s) to write
down and compute probability values.
Let’s start with a single RV, say H = Body Height. Since body heights are
non-negative and, say, never larger than 3 m, the distribution of body heights
within some reference population can be represented by a pdf f : [0, 3] → R≥0
which maps the interval [0, 3] of possible values to the nonnegative reals (Figure
86). We will be using subscripts to make it clear which RV a pdf refers to, so the
pdf describing the distribution of body height will be written fH .
A pdf for the distribution of a continuous RV X can be used to calculate the
probability that this RV takes values within a particular interval, by integrating
the pdf over that interval. For instance, the probability that a measurement of
body height comes out between 1.5 and 2.0 meters is obtained by

    P(H ∈ [1.5, 2.0]) = ∫_{1.5}^{2.0} fH(x) dx.          (179)

[Figure 86: a pdf fH of a body height distribution; horizontal axis: H (body height), ranging from 0 to 3; vertical axis: pdf values, ranging from 0 to 2.]
• Be aware that the values f (x) of a pdf are not probabilities! Pdf’s turn into
probabilities only through integration over intervals.
• Values f (x) can be greater than 1 (as in Figure 86), again indicating that
they cannot be taken as probabilities.
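A small numerical sketch of Eq. (179) (the pdf here is a stand-in Gaussian with made-up mean 1.7 m and standard deviation 0.1 m, not the pdf from Figure 86):

    # Sketch: turning a pdf into a probability by integrating over an interval.
    from scipy.integrate import quad
    from scipy.stats import norm

    f_H = norm(loc=1.7, scale=0.1).pdf        # a stand-in body-height pdf

    p, _ = quad(f_H, 1.5, 2.0)                # numerical version of Eq. (179)
    print(p)                                   # about 0.976
    print(f_H(1.7))                            # about 3.99 -- pdf values may exceed 1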
[Figure: a pdf of a joint distribution of two RVs X and Y, plotted as a surface over the (X, Y) plane (both axes ranging from 0 to 3); vertical axis: pdf values, up to about 0.8.]
In general, a pdf f = fX1,...,Xk for the joint distribution of k continuous RVs X1, . . . , Xk is a nonnegative function on a k-dimensional box [a1, b1] × . . . × [ak, bk] which integrates to 1 over this box; here also the cases ai = −∞ and bi = ∞ are possible. A more compact notation for this integral is

    ∫_D f(x) dx,
where D denotes the k-dimensional box [a1 , b1 ]×. . .×[ak , bk ] and x denotes vectors
in Rk . Mathematicians speak of k-dimensional intervals instead of “boxes”. The
set of points S = {x ∈ Rk | fX1,...,Xk(x) > 0} is called the support of the distribution. Obviously S ⊆ D.
In analogy to the 1-dim case from Figure 86, probabilities are obtained from
a k-dimensional pdf fX1 ,...,Xk by integrating over sub-intervals. For such a k-
dimensional subinterval [r1 , s1 ] × . . . × [rk , sk ] ⊆ [a1 , b1 ] × . . . × [ak , bk ], we get its
probability by
    P(X1 ∈ [r1, s1], . . . , Xk ∈ [rk, sk]) = ∫_{r1}^{s1} · · · ∫_{rk}^{sk} f(x1, . . . , xk) dxk . . . dx1.          (180)
In essentially the same way as we did for discrete distributions, the pdf’s of
marginal distributions are obtained by integrating away the RV’s that one wishes
to expel. In analogy to (178), for instance, one would get
    fX1,X3(x1, x3) = ∫_{a2}^{b2} fX1,X2,X3(x1, x2, x3) dx2.          (181)
Conditional pdfs are likewise defined in analogy to the discrete case; for instance, the pdf of X conditional on Y = c is

    fX | Y=c(x) = fX,Y(x, c) / fY(c).          (183)
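As a numerical sketch (a made-up 2-dimensional Gaussian pdf evaluated on a grid; only meant to illustrate (181) and (183)), marginal pdfs arise from summing out one variable, and conditional pdfs from renormalizing a slice:

    # Sketch: marginal and conditional pdfs from a gridded 2-dimensional pdf.
    import numpy as np
    from scipy.stats import multivariate_normal

    x = np.linspace(-4.0, 4.0, 401)
    y = np.linspace(-4.0, 4.0, 401)
    dx, dy = x[1] - x[0], y[1] - y[0]
    X, Y = np.meshgrid(x, y, indexing="ij")
    f_XY = multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]]).pdf(np.dstack([X, Y]))

    f_X = f_XY.sum(axis=1) * dy                # marginal pdf of X, cf. Eq. (181)
    print(f_X.sum() * dx)                      # integrates to (approximately) 1

    c_idx = np.argmin(np.abs(y - 1.0))         # condition on Y = c with c = 1.0
    f_Y_at_c = f_XY[:, c_idx].sum() * dx       # marginal pdf value f_Y(c)
    f_X_given_c = f_XY[:, c_idx] / f_Y_at_c    # conditional pdf, cf. Eq. (183)
    print(f_X_given_c.sum() * dx)              # again integrates to (approximately) 1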
In this appendix (and in the lecture) I consider only two ways of representing probability distributions: discrete ones by finite probability tables or probability arrays; continuous ones by pdfs. These are the most elementary formats of representing probability distributions. There are many others which ML experts readily command. This large and varied universe of concrete representations of probability distributions is tied together by an abstract mathematical theory of the probability distributions themselves, independent of particular representations. This theory is called probability theory. It is not an easy theory and we don't attempt an introduction to it. If you are mathematically minded, you can get an introduction to probability theory in my graduate lecture notes “Principles of Statistical Modeling” (https://www.ai.rug.nl/minds/uploads/LN_PSM.pdf). At this point I only highlight two core facts from probability theory:
For a function ϕ defined on a domain D, the

    argmax_a ϕ(a)

is that d ∈ D for which ϕ(d) is maximal among all values of ϕ on D. If there are several arguments a for which ϕ gives the same maximal value – that is, ϕ does not have a unique maximum – or if ϕ has no maximum at all, then the argmax is undefined.
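A tiny sketch of the argmax (made-up numbers): for a function given as a Python dict, and for numpy arrays via np.argmax (which, note, silently breaks ties by returning the first maximal index):

    # Sketch: argmax over a finite domain D = {"a", "b", "c"}.
    import numpy as np

    phi = {"a": 0.2, "b": 0.7, "c": 0.1}   # a function phi: D -> R
    print(max(phi, key=phi.get))            # argmax_a phi(a) -> 'b'

    v = np.array([0.2, 0.7, 0.1])
    print(np.argmax(v))                     # index 1; ties are broken by taking the first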
In the online shop example, gender, age, and spending would be formalized by three random variables G, A, S. A random variable always comes together with
a sample space. This is the set of values that might be delivered by the random
variable. For instance, the sample space of the gender RV G could be cast as
{m, f, o} – a symbolic (and finite) set. A reasonable sample space for the age
random variable A would be the set of integers between 0 and 200 – assuming
that no customer will be older than 200 years and that age is measured in integers
(years). Finally, a reasonable sample space for the spending RV S could be just
the real numbers R.
Note that in the A and S examples, the sample spaces that I proposed look very generous. We would not really expect that some customer is 200 years old, nor that a customer will ever spend 10^1000 Euros – although both values are included in the respective sample space. The important thing about a
sample space is that it must contain all the values that might be returned by the
RV; but it may also contain values that will never be observed in practice.
Every mathematical set can serve as a sample space. We just saw symbolic,
integer, and real sample spaces. Real sample spaces are used whenever one is
dealing with an observation procedure that returns numerical values. Real-valued
RVs are of great practical importance, and they allow many insightful statistical
analyses that are not defined for non-numerical RVs. The most important analyt-
ical characteristics of real RVs are expectation, variance, and covariance, which I
will now present in turn.
For the remainder of this appendix section we will be considering random
variables X whose sample space is Rn — that is, observation procedures which
return scalars (case n = 1) or vectors. We will furthermore assume that the
distributions of all RVs X under consideration will be represented by pdf’s fX :
Rn → R≥0 . (In mathematical probability theory, more general numerical sample
spaces are considered, as well as distributions that have no pdf — but we will
focus on this basic scenario of real-valued RVs with pdfs).
The expectation of a RV X with sample space Rn and pdf fX is defined as
    E[X] = ∫_{Rn} x fX(x) dx,          (184)
Given a finite sample (x1, y1), . . . , (xN, yN) drawn from the joint distribution PX,Y of two RVs X, Y, we may compute the average value of the xi by

    mean({x1, . . . , xN}) = 1/N Σ_{i=1}^{N} xi,
but this sample mean is NOT the expectation of X. Had we used another random sample, we would most likely have obtained another sample mean. In contrast, the expectation E[X] of X is defined not on the basis of a finite, random sample of X, but by averaging over the true underlying distribution.
Since in practice we will not have access to the true pdf fX , the expectation
of a RV X cannot usually be determined in full precision. The best one can do is
to estimate it from observed sample data. The sample mean is an estimator for
the expectation of a numerical RV X. Marking estimated quantities by a “hat”
accent, we may write
    Ê[X] = 1/N Σ_{i=1}^{N} xi.
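A short numerical sketch (distribution and numbers made up): every new random sample gives a new sample mean Ê[X], while the expectation itself is a fixed property of the distribution:

    # Sketch: the sample mean as an estimator of the expectation.
    import numpy as np

    rng = np.random.default_rng(1)
    for _ in range(3):
        sample = rng.normal(loc=1.7, scale=0.1, size=100)   # true expectation is 1.7
        print(sample.mean())                                # a different estimate each time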
A random variable X is centered if its expectation is zero. By subtracting the
expectation one gets a centered RV. In these lecture notes I use the bar notation
to mark centered RVs:
X̄ := X − E[X].
The variance of a scalar RV with sample space R is the expected squared deviation from the expectation,

    σ²(X) = E[X̄²],          (185)

which in terms of the pdf fX̄ of X̄ can be written as

    σ²(X) = ∫_R x² fX̄(x) dx.
A straightforward estimator of the variance from a sample x1, . . . , xN is the mean squared deviation from the sample mean, σ̂²({x1, . . . , xN}) = 1/N Σ_{i=1}^{N} (xi − 1/N Σ_{j=1}^{N} xj)², but in fact this estimator is not the best possible – on average (across different samples) it underestimates the true variance. If one wishes to have an estimator that is unbiased, that is, which on average across different samples gives the correct variance, one must use

    σ̂²({x1, . . . , xN}) = 1/(N − 1) Σ_{i=1}^{N} (xi − 1/N Σ_{j=1}^{N} xj)²
instead. The Wikipedia article on “Variance”, section “Population variance and sample variance”, points out a number of other pitfalls and corrections that one should consider when one estimates variance from samples.
The square root of the variance of X, σ(X) = √(σ²(X)), is called the standard deviation of X.
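In numpy, the divisor (N versus N − 1) of the variance estimator is selected through the ddof argument; a quick sketch with made-up data:

    # Sketch: biased (1/N) versus unbiased (1/(N-1)) variance estimation.
    import numpy as np

    rng = np.random.default_rng(2)
    sample = rng.normal(loc=0.0, scale=2.0, size=50)   # true variance is 4.0

    print(sample.var(ddof=0))   # divides by N   (the biased estimator)
    print(sample.var(ddof=1))   # divides by N-1 (the unbiased estimator)
    print(sample.std(ddof=1))   # standard deviation: square root of the variance estimate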
The covariance between two real-valued scalar random variables X, Y is defined
as
Cov(X, Y ) = E[X̄ Ȳ ], (186)
which in terms of a pdf fX̄Ȳ for the joint distribution of the centered RVs spells out to

    Cov(X, Y) = ∫_{R×R} x y fX̄Ȳ((x, y)′) dx dy.
Finally, let us inspect the correlation of two scalar RVs X, Y . Here we have to
be careful because this term is used differently in different fields. In statistics, the
correlation is defined as
    Corr(X, Y) = Cov(X, Y) / (σ(X) σ(Y)).          (187)
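A sketch of estimating covariance (186) and correlation (187) from a sample of pairs (made-up data); numpy's cov and corrcoef return 2 × 2 matrices whose off-diagonal entry is the quantity of interest:

    # Sketch: estimating covariance and correlation from samples of two scalar RVs.
    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(size=1000)
    y = 0.8 * x + 0.6 * rng.normal(size=1000)     # y is positively correlated with x

    print(np.cov(x, y)[0, 1])                     # estimate of Cov(X, Y)
    print(np.corrcoef(x, y)[0, 1])                # estimate of Corr(X, Y), lies in [-1, 1]

    # the same correlation "by hand": center both samples, then normalize
    xc, yc = x - x.mean(), y - y.mean()
    print((xc * yc).mean() / (x.std() * y.std()))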
Some useful rules for computing with expectations and covariances:
1. Expectation is a linear operator: E[α X + β Y] = α E[X] + β E[Y] for all real α, β.
2. Expectation is idempotent: E[E[X]] = E[X].
3. Cov(X, Y) = E[X Y] − E[X] E[Y].
E Derivation of Equation 32
    1/N Σ_i ‖xi − d ◦ f(xi)‖²
        = 1/N Σ_i ‖x̄i − Σ_{k=1}^{m} (x̄′i uk) uk‖²
        = 1/N Σ_i ‖Σ_{k=1}^{n} (x̄′i uk) uk − Σ_{k=1}^{m} (x̄′i uk) uk‖²
        = 1/N Σ_i ‖Σ_{k=m+1}^{n} (x̄′i uk) uk‖²
        = 1/N Σ_i Σ_{k=m+1}^{n} (x̄′i uk)² = 1/N Σ_{k=m+1}^{n} Σ_i (x̄′i uk)²
        = Σ_{k=m+1}^{n} σk².
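A numerical sanity check of this derivation (a sketch under the assumption that d ◦ f is the PCA projection onto the first m principal directions u1, . . . , um of the centered data, as in the main text; the data are made up): the mean squared reconstruction error should equal the sum of the variances σk² along the discarded directions.

    # Sketch: PCA reconstruction error equals the sum of the discarded variances.
    import numpy as np

    rng = np.random.default_rng(4)
    N, n, m = 500, 5, 2
    X = rng.normal(size=(N, n)) @ rng.normal(size=(n, n))   # some correlated data points
    Xc = X - X.mean(axis=0)                                  # centered data points x̄_i

    C = Xc.T @ Xc / N                         # covariance matrix (divisor N, as in the derivation)
    eigvals, U = np.linalg.eigh(C)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, U = eigvals[order], U[:, order]  # u_1, ..., u_n sorted by decreasing variance

    X_rec = Xc @ U[:, :m] @ U[:, :m].T        # keep only the first m principal directions
    mse = np.mean(np.sum((Xc - X_rec) ** 2, axis=1))
    print(mse, eigvals[m:].sum())             # the two numbers coincide (up to rounding)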