0% found this document useful (0 votes)

13 views

Tubingen DL Notes

Deep Learning notes from Uni Tübingen

Uploaded by

Arun Jana

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

13 views

Tubingen DL Notes

Deep Learning notes from Uni Tübingen

Uploaded by

Arun Jana

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 151

University of Tübingen

Deep Learning Lecture Notes

Prof. Dr.-Ing. Andreas Geiger

Winter Term 2020/2021

Abstract
These lecture notes have been written collectively by the class of winter 2020/2021 and curated by the
TAs of the deep learning lecture. If you find an error, please email the TA responsible for the section.

1 Introduction
1.1 Introduction
The introduction contained all organizational matters. The team was introduced and the contents, goal and
organization of the course, the exercises, lecture notes, materials, credits and prerequisites were explained.

1.2 History of Deep Learning

Three waves of development
Deep Learning has been developed in three waves. The term “Deep Learning” didn’t exist whet these ideas
started to grow. During the first wave from 1940 to 1970, deep learning was called “Cybernetics”. This wave
was also called the “Golden Age”, with a lot of great discoveries and following great hopes. At that time, people
were using simple computational models of the brain to imitate biological learning and to learn simple rules.
These rules could successfully classify simple patterns. But people realized, that these simple models were not
powerful enough to solve complex tasks, which led to a decline in research.
During the second wave from 1980 to 2000, Deep Learning has been associated with “connectionism”. In con-
nectionism, it was assumed, that intelligent behavior was realized through a large number of simple units. Also
backpropagation and sequence models have been developed. Despite being very important, these developments
have been overshadowed by deployments in other research fields due to a lack of algorithms, ways of computing
and data. This is why this period is also called the “Dark Age”.
From 2006 until now, Deep Learning was actually called “Deep Learning”. This can be called the “Revolution
Age” where it was demonstrated, that with deeper networks, lager datasets and more compute, Deep Learning
was leading to state-of-the-art results, dominating almost all leader boards across different fields and disciplines.

1943: McCullock and Pitts

McCullock and Pitts developed an early model for neural activation. It is called a linear threshold neuron
because it is a threshold on a linear computation. It looks as follows:
(
+1 if wT x ≥ 0
fw (x) =
−1 otherwise

Here, x is the data and w are the weights. X could also be some features of the data, so we will call it features
from now on. If you multiply there features with the weights and the product is bigger than zero, we assign +1
to that function. If it is smaller than 0, we assign -1. That’s a binary decision, that’s why it is called a linear
threshold neuron. It has been demonstrated that his simple operation is more powerful that AND/OR gates,
which are special cases of this computation. But at the time there was no procedure to effectively learn the
weights. This has changed in 1958:

1958-1962: Rosenblatt’s Perceptron

In 1958, Rosenblatt developed the famous perceptron, which was the first algorithm to train the single lin-
ear threshold neuron. He also developed a hardware implementation of this algorithm. He used the so-called

1
“perceptron algorithm”, which was optimizing of perceptron criterion:
X
L(w) = − w T x n yn
n∈M

This is very different from the gradient based optimization with back propagation used today because the linear
threshold unit is non-differentiable. Therefore, the auxiliary task of the perceptron criterion has been defined.
The optimization of this criterion simply looks at all the results that have been wrongly labeled; so M is the set
of wrongly labeled examples. It looks what the model does for those and what the true label is (+1 or -1) and
it is minimizing a loss based on these incorrectly classified labels such that they are classified correctly in the
next iteration of the algorithm. For this very simple model, this is an optimal thing to do because it converges
to the right solution, if a solution exists, which was proven by Novikoff. It was a big success and there was also
some resonance in the media, as for example a perceptron, that was trained to recognize the difference between
males and females. While this provides a working example of such an algorithm, the perceptron was heavily
overhyped: For instance, Rosenblatt claimed that the perceptron will lead to computers that walk, talk, see,
write, reproduce themselves and are conscious of their own existence. So the hopes and expectations were really
high and couldn’t be fulfilled, which lead to mistrust in this technology.

1969: Minsky and Papert publish book

In 1969, Minsky and Papert published a book called “Perceptrons”, which mathemalically showed several dis-
couraging results for the model. For example, it showed that single-layer perceptrons cannot solve some very
simple problems such as the XOR problem or counting. This led to less interest and funding in this area, as
well as to the rise of symbolic AI research, which dominated the 70s.

1979: Fukushima’s Neocognitron

In 1979, Fukushima proposed the neocognitron, which was a predecessor of models that are similar to convo-
lutional neural network models that are still used today. The neocognitron was inspired by Hubel and Wiesels
experiments in the 1950s. They studied the visual cortex V1 in cats and they found out that there are different
types of neural cells in the V1, which are sensitive to orientation of edges but insensitive to their position. The
so-called simple cells respond primarily to edges while the complex cells imlepement spatial invariance. Hubel
and Wiesel received the Nobel price in 1981.
Fukushima proposed a computational model that mimics these simple and complex cells. It is a multi-layer
processing network to create intelligent behavior. It is composed of simple (S) and complex (C) cells which
implement convolution and pooling. However, there was no backpropagation algorithm at that time, so learning
happened, using reinforcement based learning. But still, the model architecture was an inspiration for modern
CNNs.

1986: Backpropagation Algorithm

In 1986, the backpropagation algorithm was reinvented by Rumelhart, Hinton and Williams. It was known
since 1961, but had its first empirical success in 1986 and remains the main workhorse in deep learning today.
It allows the efficient calculation of gradients in a deep neural network with respect to network weights by very
efficiently updating the weights gradients. It enables application of gradient based learning to deep networks.
This was a major breakthrough.

1997: Long Short-Term Memory

In 1991, Hochreiter demonstrated the problem of vanishing and exploding gradients in his Diploma Thesis. This
led to the development of long-short term memory for sequence modeling. It uses a combination of feedback
loops and forget or keep gates to effectively bridge this gradient flow over very long time horizons. This has
revolutionized natural language processing (NLP). But only from 2015 on, it became very popular.

1998: Convolutional Neural Networks

In 1998, convolutional neural networks (CNN) were proposed, which are similar to the Neocognitron, but are
trained end-to-end using backpropagation. It implements spatial invariance via convolutions and max-pooling
and uses weight sharing to reduce the parameter space. It was demonstrated, that a CNN can lead to very
good results in standard machine learning tasks such as the MNIST digit classification task or Tanh/Softmax
activations. However, the results did not scale up (yet) to the complexity of computer vision problems.

2009-2012: ImageNet and AlexNet

Another major breakthrough was the demonstration of ImageNet and AlexNet. ImgageNet is a huge dataset
and a recognition benchmark called the “Image Net Large Scale Visual Recognition Challenge” (ILSVRC) which
features 10 million annotated images out of 1000 categories. This was the first time that such a huge amount

2
of data became available. AlexNet was the first neural network to win the ILSVRC via GPU training, deep
models, better neural network architectures and huge datasets. This was the sparking point of the deep learning
revolution, where people recognized that deep learning can really change the world.

2012-now: Golden Age of Datasets

From 2012 on, realized that big annotated datasets are a key to solving complex challenges, and so there was a
large number of datasets developed subsequently, such as the KITTI dataset for self-driving in cityscapes. There
are lots of datasets for various tasks, e.g. PASCAL and MS COCO for Recognition, ShapeNet and ScanNet
for 3D Deep Learning, GLUE for Language understanding, Visual Genome for Vision and Language modeling,
VisualQA for Question Answering and MITOS for the recognition of Breast cancer tissue.

2012-now: Synthetic Data

Annotating real data is expensive, which led to the surge of synthetic datasets. Creating 3D assets is also costly,
but even very simple 3D datasets proved tremendously useful for pre-training deep neural models in particular
e.g. in the case of optical flow.

2014: Generalization
This lead to the success story of deep learning in terms of generalization. It could be empirically demonstrated
that deep representations generalize well despite a large number of parameters. A CNN can be pre-trained on
large amounts of data on a generic task (e.g., ImageNet classification) and then only the last layers need to be
fine-tuned (re-trained) on few data of a new task and still perform very well.

2014: Visualization
Zeiler and Fergus published a paper on visualization of that a deep neural network learns. The goal of this
paper was to provide insights into what the network has learned, because the networks are like black boxes.
They visualized image regions that most strongly activate various neurons at different layers of the network.
And they found that higher levels capture more abstract semantic information.

2014: Adversarial Examples

Contrary to all these findings, in 2014 it was also demonstrated that accurate image classifiers can be easily
fooled by imperceptible changes, using what is called the “Adversarial example”:

x + argmin {k∆xk2 : f (x + ∆x) 6= f (x)}

∆x

After applying this change to an image, the network recognizes all images as being classified as “ostrich”.

2014: Domination of Deep Learning

From 2014 on, Deep Learning has developed really fast and deep learning has dominated many research fields.
In 2014, there have been the first successfull deep models for machine translation e.g. Seq2Seq. Deep generative
models (like VAEs and GANs) also started to produce compelling images, e.g. novel images of human faces.
Major breakthroughs can also be observed e.g. in the prediction of molecular properties for generating novel
materials, which have been revolutionized using Graph Neural Networks (GNNs). In summary, there have been
dramatic gains in vision and speech which refers to Moore’s Law of AI.

2015: Deep Reinforcement Learning

In 2015, DeepMind demonstrated that it is possible to learn a policy (a state→action mapping) through random
exploration and reward signals (e.g., game score). This is called reinforcement learning, and this can be enriched
by deep learning. It can play successfully a variety of Atari games without other supervision. However, some
games remain hard, especially games that require reasoning, remembering or thinking long ahead.

2016: WaveNet
In 2016 it was demonstrated that deep generative models of raw audio waveforms like WaveNet can generate
speech which mimics human voice, as well as music.

2016: Style Transfer

Networks trained on datasets such as ImageNet can learn powerful features that allow e.g. to manipulate pho-
tographs by adopting the style of a another image or painting. It uses a deep neural network pre-trained on
ImageNet for disentangling the content from style. You can try it yourself on https://deepart.io/

2016: AlphaGo defeats Lee Sedol

In 2016, DeepMind developed AlphaGo, which combines deep learning with the Monte Carlo tree search. It

3
was the first computer program to defeat a professional Go player. AlphaZero, developed in 2017 even learns
via self-play and masters multiple games.

2017: Mask R-CNN

Mask R-CNN is a deep neural network for joint object detection and instance segmentation which outputs
“structured object”, an entire pixel map and a label map, not only a single number or class label.

2017-2018: Transformers and BERT

Transformers demonstrated that attention can effectively replace recurrence and convolutions in neural net-
works. BERT showed that pre-training of language models on unlabeled text can be very effective. Once these
model are fine-tuned, state-of-the-art results on very challenging tasks such as the GLUE benchmark can be
obtained. On GLUE, algorithms achieve superhuman performance on some language understanding tasks such
as paraphrasing and question answering. However, computers still fail in dialogue. It is easy to make these
systems fail and make them not pass the Turing Test.

2018: Turing Award

In 2018, the “nobel price of computing” has been awarded to the “founding fathers of deep learning”, Yoshua
Bengio, Geoffrey Hinton and Yann LeCun.

2016-2020: 3D Deep Learning

From 2016 on, in the computer vision domain, first models to successfully output 3D representations were de-
veloped. They could effectively predict voxels, point clouds, meshes and implicit representations. Prediction of
3D models became even possible from a single 2D image. The models have been extended to properties such
as geometry, materials, light and objects in motion.

2020: GPT-3
This year, GPT-3 came along, which is the first version of the language model by OpenAI. It is upscaling
existing language models to 175 Billion parameters.It has a text-in / text-out interface and many use cases
like coding, poetry, blogging, news articles and chatbots. There are also controversial discussions. It has been
licensed exclusively to Microsoft on September 22, 2020.

Current Challenges
There remain still some challenges for the next generation, such as un- or self-supervised learning, interactive
learning, accuracy (e.g. for self-driving), robustness and generalization, inductive biases, understanding and
mathematics, memory and compute and, last but not least ethics and legal questions. And it also remains
open, whether “Moore’s Law of AI” will continue.

1.3 Machine Learning Basics

These basics are meant as a recap, but if things are going too fast of if you haven’t seen any machine learn-
ing lecture before, you can have a look into Goodfellow et al.: Deep Learning, Chapter 5 (http://www.
deeplearningbook.org/contents/ml.html).

1.3.1 Learning Problems

Supervised learning
In supervised learning, model parameters are learned using a dataset of data-label pairs {(xi , yi )}N
i=1 . Here x
is the input and y is the output and we have N such pairs. Examples include classification, regression and
structured prediction problems.

Unsupervised learning
Model parameters are learned using a dataset without labels {xi }N
i=1 , so just with inputs. Examples include
clustering, dimensionality reduction and generative models.

Self-supervised learning
Model parameters are learned using a dataset of data-data pairs {(xi , x0i )}N
i=1 . Examples include self-supervised
stereo and optical flow estimation and contrastive learning.

Reinforcement learning
Model parameters are learned using active exploration from sparse rewards instead of a fixed dataset. Examples

4
include deep q learning, gradient policy and actor critique.

In this lecture, we will cover the first three learning problems.

1.3.2 Supervised Learning

Classification / Regression:
In classification, the goal is to predict a discrete class label using a function f with input X :

f :X →N

In regression, we are interested in predicting from an arbitrary input X a continuous variable.

f :X →R

Inputs x ∈ X can be any kind of objects images, text, a sequence of amino acids, . . .
The output y ∈ N/y ∈ R is either a discrete or real number, but only one single number. Examples are the
label of an image as in classification, or regressing a single number for the stock market, a density estimation, . . .

Structured Output Learning:

In structured output learning, both the input and the output are structured objects.

f :X →Y

The inputs x ∈ X can be any kind of objects and the outputs y ∈ Y are complex (structured) objects such as
images, text, parse trees, folds of a protein, computer programs, . . .

Today we will focus on the regression problem. A model f always takes an input x and produces an out-
put y. There are two main tasks, the learning task and the interference task.

Learning: Learning happens through estimating the parameters w from training data {(xi , yi )}N
i=1
Inference: Given w from the learning task, novel predictions are made: y = fw (x)

Classification
An example is the mapping of an input image to the output label “Beach” or “No Beach”, thus performing the
task of classification:
Mapping: fw : RW ×H → {“Beach”, “No Beach”}

Regression
For a regression problem, a stock value prediction with N numbers as input and a single number as output aree
an example:
Mapping: fw : RN → R

Structured Prediction
An example for structured prediction problems is an audio signal as input and a sequence of words, forming a
sentence as output.
Mapping: fw : RN → {1, . . . , L}M

Semantic segmentation is another example for structured prediction, where the input is an image and the
output is a label map.
Mapping: fw : RW ×H → {1, . . . , L}W ×H

A third example of structured prediction is 3D reconstruction, the input is a set of images and the output
3
is a 3D reconstruction. Mapping: fw : RW ×H×N → {0, 1}M

For this mapping, suppose we want to predict 323 voxels, with a binary variable per voxel (occupied/free)
3
232 = 232768 different reconstructions could be predicted by the model. This number is even larger than the
number of atoms in the universe, which is ∼ 2273 .

5
1.3.3 Linear Regression
In linear regression, we want to make a prediction for the model parameters in a linear sense. Formally, this
means: Let X denote a dataset of size N and let (xi , yi ) ∈ X denote its elements (yi ∈ R). The goal is, to
predict y for a previously unseen input x. The input x may be multidimensional.
As an example, we try to fit a line as the ”ground truth” over some noisy observations, which are samples from
the model with added noise. This is what we provide to the learning algorithm. When we execute that model
at a novel location x, we get a sensible response y. This is done by defining an error function.
The error function E(w) measures the displacement along the y dimension between the data points and the
model f (x, w) specified by the parameters w. It aims at minimizing the overall error.

f (x, w) = w> x
N
X 2
E(w) = (f (xi , w) − yi )
i=1
N
X 2
= x>
i w − yi
i=1
2
= kXw − yk2

In linear regression, we have a function represented as a linear model. w> x is the inner product of the weight
vector w and the input features x. The error function is the sum over all data points of the square of the
difference in the y direction between the prediction of the model and the observation y. This can be written
as the squared `2 norm of the matrix X times the weight vector w minus the vector y.

The gradient of the error function with respect to the parameters w is given by:
2
∇w E(w) = ∇w kXw − yk2
>
= ∇w (Xw − y) (Xw − y)
= ∇w w> X> Xw − 2w> X> y + y> y

= 2X> Xw − 2X> y

As E(w) is quadratic and convex in w, its minimizer (wrt. w) is given in closed form. This means that we can
set ∇w E(w) to zero to obtain a closed form solution:
−1
∇w E(w) = 0 ⇒ w = (X> X) X> y

−1
The matrix (X> X) X> is also called Moore-Penrose inverse or pseudoinverse because it allows us, to
compute the inverse of a non-square matrix, which is the minimizer of this least-square problem.

Example: Line Fitting

A linear least squares fit of the model f (x, w) = w0 + w1 x to the data points can be obtained with line fitting.
The error function E(w) wrt. parameter w1 is a square function that produces a parabola. In order to minimize
the error, we must find the lowest point of the parabola.

Example: Polynomial Curve Fitting

Let us choose a polynomial of order M to model dataset X :
M
X
f (x, w) = w j xj = w > x with features x = (1, x1 , x2 , . . . , xM )>
j=0

This function is linear in w, but it is not linear in x.

There are two tasks, the training and the interference. In training, we want to estimate w from dataset X and
during inference we want to predict y for novel x given estimated w. Note that these features can be anything,
including multi-dimensional inputs (e.g., images, audio), radial basis functions, sine/cosine functions, etc. In
this example of polynomial curve fitting we have monomials.

6
For estimating w from X , we define a squared error function, e.g.:
N
X 2
E(w) = (f (xi , w) − yi )
i=1

The goal is, to optimize error function wrt. the parameters w.

The error function is linear in w but not in x. This means, that wrt. w we obtain the same problem as before;
we have a quadratic problem with a closed form solution:
 2
N N N M
X 2
X 2 X X
w > x i − yi = wj xji − yi 

E(w) = (f (xi , w) − yi ) = 
i=1 i=1 i=1 j=0

It can be rewritten in the matrix-vector notation (i.e., as linear regression problem).

2
E(w) = kXw − yk2

with feature matrix X, observation vector y and weight vector w:

.. .. .. .. ..
     
w0
 . . . .  .
w =  ... 
 
X=  1 xi x2i . . . xM y =  yi
   
i   
.. .. .. .. .. wM
. . . . .

Example for Polynomial Curve Fitting Results

Figure 1: Polynomial Curve fitting. Plots of polynomials of various degrees M (red) fitted to the data
(green). We observe underfitting (M = 0/1) and overfitting (M = 9). Choosing the best fitting degree M is a
model selection problem.

Capacity, Overfitting and Underfitting

In summary, the goal of polynomial curve fitting is to perform well on new, previously unseen inputs (test set,
blue in Fig. 1), not only on the training set (green in Fig. 1). This is called generalization and separates

7
machine learning from optimization. In optimization, we are just interested in fitting a model to observations
while in machine learning, we are always interested in a model that generalizes well.
The assumption that is often made in statistical learning theory, is that the training and test data are indepen-
dent and identically (i.i.d.) drawn from the data distribution pdata (x, y). This is important, because this assures
that we can make certain statements about the learning problem. In the case of Fig. 1 the data distribution
has been chosen as follows:

pdata (x) = U(0, 1)

pdata (y|x) = N (sin(2πx), σ)

We have an underlying sine curve with x-locations between 0 and 1. On this basis we sample a y-location by
adding some gaussian noise with the standard deviation σ to the sine curve.

Clarification of the used terminology:

Capacity refers to the complexity of functions which can be represented by the model f . In Fig. 1, for M = 0/1
the capacity is too low, for M = 3 the capacity is about right, and for M = 9 the capacity is too high.
Underfitting refers to models being too simple, so they don’t achieve low error, not even on the training set.
Overfitting refers to a scenario where the training error is very small, but test error (= generalization error)
is large. So the model overfits the training data and doesn’t generalize well to the test set. We want neither an
underfitting nor an overfitting model, but a model that performs well on the test set and thus generalizes well.
As in the example of the generalization error for various polynomial degrees M, the model selection problem is
as follows: we want to select the model with the smallest generalization error.
In general, we split the dataset into a training, a validation and a test set in order to chose the best model.
We choose hyperparameters (e.g., degree of polynomial, learning rate in neural net, ..) using the validation set.
It is important to evaluate only once on test set (true labels are typically not available). When the dataset is
small, use (k-fold) cross validation instead of a fixed split.

1.3.4 Ridge Regression

Ridge Regression is another way of reducing the model complexity, but not by discretely changing the degree
or the order of the polynomial, but by adding a regularization term. The error function of the ridge regression
problem looks as follows:
N
X M
X
2
E(w) = (f (xi , w) − yi ) +λ w2
i=1 j=0
2 2
= kXw − yk2 +λkwk2

The idea is, to discourage large parameters by adding a regularization term with strength λ. Since this problem
−1
is quadratic in w, it also has a closed form solution: w = (X> X+λI) X> y.

Figure 2: Ridge Regression. Plots of polynomial with degree M = 9 fitted to 10 data points using ridge
regression. Left: weak regularization (λ = 10−8 ). Right: strong regularization (right, λ = 103 ).

In Fig. 2 we can see, that for the formerly overfitting model with M = 9 we can obtain a much better fit with

8
a mild regularizer. With a strong regularizer the model remains constantly 0, which refers to the underfitting
situation. If we choose a very small regularizer, parameters or model weights can become very large and the
problem becomes ill-conditioned. If we increase the regularizer, the weights become smaller. Again, we want to
select a model with the smallest generalization error on the validation set.

1.3.5 Estimators, Bias and Variance

Point Estimator:
A point estimator g(·) is a function that maps a dataset X to model parameters ŵ: ŵ = g(X ). We denote ŵ
with a hat to explicitly mark that this is an estimate of the point estimator. An example is the estimator of the
−1
ridge regression model: ŵ = (X> X + λI) X> y. A good estimator is a function that returns a parameter set
close to the true one. We assume that the data X = {(xi , yi )} is drawn from a random process (xi , yi ) ∼ pdata (·),
thus, any function of the data is random and ŵ is a random variable. This leads us to the terminology of bias
and variance and the so-called bias and variance dilemma.

Properties of Point Estimators:

The bias of a point estimator is the expected value of the point estimate over all data sets that can be drawn
from the data distribution minus the true value of the parameters:

Bias(ŵ) = E(ŵ) − w

This is how, in expectation, the point estimator deviates from the true parameters. This expectation is over all
possible datasets X . ŵ is unbiased ⇔ Bias(ŵ) = 0. A good estimator has little bias.

The variance of a point estimator is the variance over ŵ over all datasets X :

Var(ŵ) = E(ŵ2 ) − E(ŵ)2

p
The square root of the variance Var(ŵ) is called “standard error”. A good estimator also has low variance.

Bias-Variance Dilemma:
Statistical learning theory tells us that we can’t have a little bias as well as a low variance. There is a trade-off
that we have to make.

Figure 3: Bias-Variance Dilemma. Ridge regression with weak (λ = 10−8 ) and strong (λ = 10) regulariza-
tion. Green: True model. Black: Plot of model with mean parameters w̄ = E(w). Red: Estimates, with high
variance at the right and low variance at the left.

In Fig. 3 we can see that if we have a weak regularization, we have a large variance but the mean is a good
fit to the ground truth, so the bias is small. With a strong regularization, we obtain less variance in the
predictions, however, the model has a strong bias. The mean of the parameters (the red estimates) deviates
strongly from the ground truth. So there is a bias-variance tradeoff which can be expressed mathematically:
E[(ŵ − w)2 ] = Bias(ŵ)2 + Var(ŵ). But we might not have to deal with this in all situations. E.g. in deep
neural networks the test error decreases with network width. For further information see the blogpost at
https://www.bradyneal.com/bias-variance-tradeoff-textbooks-update.

9
1.3.6 Maximum Likelihood Estimation
We now reinterpret our results by taking a probabilistic viewpoint. Let X = {(xi , yi )}N i=1 be a dataset with
samples drawn i.i.d. from the data distribution pdata . And let the model pmodel (y|x, w) be a parametric family
of probability distributions. Then the conditional maximum likelihood estimator for w is given by

ŵM L = argmax pmodel (y|X, w)

w
N
iid
Y
= argmax pmodel (yi |xi , w)
w
i=1
N
X
= argmax log pmodel (yi |xi , w)
w
i=1
| {z }
Log-Likelihood

Example: If we assume that the model distribution is a Gaussian, where the mean is a linear function, and
the variance is σ: pmodel (y|x, w) = N (y|w> x, σ), we obtain

N
X
ŵM L = argmax log pmodel (yi |xi , w)
w
i=1
N
1 > 2
e− 2σ2 (w xi −yi )
X 1
= argmax log √
w
i=1 2πσ 2
N N
X 1 X 1 2
= argmax − log(2πσ 2 ) − 2
w > x i − yi
w
i=1
2 i=1
2σ
N
X 2
= argmax − w > x i − yi
w
i=1
2
= argmin kXw − yk2
w

We see that choosing pmodel (y|x, w) to be Gaussian causes maximum likelihood to yield exactly the same least
squares estimator derived before:
2
ŵ = argmin kXw − yk2
w

There are various variations possible here. If we were choosing e.g. the model distribution pmodel (y|x, w) as a
Laplace distribution, we would obtain an estimator that minimizes the `1 norm: ŵ = argmin w kXw − yk1 .
Or otherwise assuming a Gaussian distribution over the parameters w and performing a maximum a-posteriori
(MAP) estimation yields to ridge regression:argmax w p(w|y, x) = argmax w p(y|x, w)p(w).
This shows that there is a connection between maximum likelihood estimation and the linear least squares
problems that have been explained earlier. This is important, because maximum likelihood estimators are a
very important tool in statistical learning theory because they are known to be consistent under mild assump-
tions. This means, that as the number of training samples approaches infinity N → ∞, the maximum likelihood
(ML) estimate converges to the true parameters. They are also very efficient: The ML estimate converges most
quickly as N increases These theoretical considerations make ML estimators appealing.

10
2 Computation Graphs
2.1 Logistic Regression
We’ve already got to know the maximum likelihood estimator and more precisely the Conditional Maximum
Likelihood Estimator for w:
N
X
ŵM L = argmax log pmodel (yi |xi , w) (1)
w
i=1

because we’re conditioning on some input x we’re not just trying to model a distribution y but we are trying
to model a conditional distribution y given x. We can see here the Maximum Loglikehood estimator, we have
the log of the product, which is the sum of the logarithm. However, the estimate that we obtain - argmax, is
the same as the Maximum Likelihood Estimator would return as the logarithm is a monotonic function. In
practice we prefer to use loglikelihood estimator both for numerical reasons when implementing these algorithms
because this is resulting in more stable computation, but also for mathematical reasons and for connection to
the concepts in information theory. We’ll mostly use loglikelihood from now, it doesn’t return the same value,
but it returns the same maximum, the same parameters w. We denote the estimate as ŵM L . This hat indicates
that this is an estimate of the true parameters.Now we want to perform a binary classification: yi ∈ {0, 1}.
So the output can take only two possible discrete labels - 0 and 1.The question is - how should we choose
pmodel (y|x, w) in this case? Gaussian model isnt a good choice, because it’s a continuous distribution. Bernoulli
distribution is good for this case, it models binary classification problem:

pmodel (y|x, w) = ŷ y (1 − ŷ)(1−y) (2)

where ŷ is a prediction of some model and y is the true label from our dataset. So ŷ is some prediction that
must depend on the input x and it must also depend on some trainable parameter w, we denote that function
as: ŷ = fw (x).
We are working with the discrete distribution, so we have the next requirement fw (x) ∈ [0, 1]. We can use a
sigmoid function: fw (x) = σ(w> x) where σ is the sigmoid function:
1
σ(x) = (3)
1 + e−x

Figure 4: Sigmoid. An illustration of the Sigmoid function.

We take the linear combination of the weights and then we do the non-linear transformation sigma to it. Sigmoid
takes an unbounded range of the real numbers and transform this domain of real numbers to the domain of
from 0 to 1 interval. That’s why this function is called squashing function. And that’s what we want - we want
the output of this function between 0 and 1 to be a proper probability. That’s also a reason why is this called
a logistic regression, because this is logistic transfer function.

11
Lets put it together:
N
X
ŵM L = argmax log pmodel (yi |xi , w) (4)
w
i=1
N
X h i
(1−yi )
= argmax log ŷiyi (1 − ŷi ) (5)
w
i=1
N
X
= argmin −yi log ŷi − (1 − yi ) log(1 − ŷi ) (6)
w | {z }
i=1
Binary Cross Entropy Loss L(ŷi ,yi )

Here for the model distribution we plug in the Bernoulli distribution, where

ŷi = σ(w> xi )

. Because we have a logarithm, we can transform it, we split the product with a sum. And then instead of
maximizing we can minimize the last expression by writing a minus in front. In machine learning we often don’t
maximize likelihoods, but we rather minimize loss functions and that’s why we are writing it this way here
and this this term in the sum has a specific name in the machine learning community, it is called a binary
cross-entropy loss. It’s a loss between the predicted y and the true y and as the name loss indicates a high
loss is bad and a low loss is desirable, so we want to minimize the loss, we want to compute the overall minimum
over the data set of all these individual losses. In machine learning we use often the more general term loss
rather than what we have used before in linear regression the error function and the reason is that this loss
term is more general, it’s not necessarily an error between observations,it could also be for instance an inductive
bias that we want to encode. Now this formula has the following interpretation, we minimize the dissimilarity
between the empirical data distribution pdata (defined by the training set) and the model distribution pmodel .
We want to find a parameter for our model, such that the prediction of our model is most similar to our data.
Lets look at the Binary cross entropy loss:

L(ŷi , yi ) = −yi log ŷi − (1 − yi ) log(1 − ŷi ) (7)

(
− log ŷi if yi = 1
= (8)
− log(1 − ŷi ) if yi = 0

Figure 5: Curves. An illustration of the those two curves.

We want to minimize those curves. For yi = 1 the loss L is minimized if ŷi = 1. For yi = 0 the loss L is
minimized if ŷi = 0.Thus, L is minimal if ŷi = yi . This loss function can be extended to > 2 classes.

2.1.1 1D example
Lets look at this visually.
We have on the x-axis our features, this is the input 1d and we want to classify between positive and negative.
So we want to tell for a particular x value is it in the positive class or in a negative class. We have set of positive
samples in green and negative in red. We want to fit to dataset X logistic regression model fw (x) = σ(w0 +w1 x)

12
Figure 6: 1D example. An illustration of the simple 1D example.

Figure 7: Model fitting. An illustration of fitting the log regression model.

Figure 8: Probabilistic interpretation. Probabilities of classifier fw (xi ) for positive samples (yi = 1)

How do we interpret this in terms of probabilities and the loss function?

The probabilities of the classifier for the positive samples are shown with green bars here where each of these
green bars is basically extending all the way up to the curve because this is the probability that the model
assigns to these positive values. What we can see is already the criterion that we want to optimize, for all data
points we want to maximize these probabilities being correctly assigned.
And these are the probabilities for the negative points Now we can put all of these probabilities together and

Figure 9: Probability of the negative class. Probabilities of classifier fw (xi ) for negative samples (yi = 0)

we”ll just put them to the bottom, because it makes more sense.

Figure 10: Probability of the both classes. Probabilities of classifier for the both classes

We want to maximize those probabilities or equivalently we want to minimize the negative logarithm.
So then we minimize the mean or the sum over all these negative log probabilities.

2.1.2 Optimizing w.
In contrast to linear regression, the loss L(ŷi , yi ) is not quadratic in w. w appears inside the sigmoid which
is a nonlinear function and then we take the logarithm of this nonlinear function, which is also a non-linear

13
Figure 11: Log loss. Log loss of classifier for both classes

function, so this is highly non-linear transformation of linear model and we don’t have an easy expression as we
had it in the quadratic case for the linear regression task. If there is no closed form analytic solution, we need
to apply some iterative optimization technique which is based on gradients. Luckily we can compute gradients
and so we can use a gradient based optimizer, that slowly follows the gradient to a local minimum starting
from some initialization point.Despite this not being a quadratic function we, it’s still a convex function.It can
be shown that this problem is a convex problem, so we know that we don’t fall into a local optimum. If we
apply a gradient based optimizer, we end up with the global optimum.But we can’t reach it in a single step or
analytically, we need to apply an iterative gradient-based algorithm. And for all the gradient-based optimization
algorithms, where the grading is needed, we need to compute this gradient which is the nabla operator. The
nice thing about this particular objective function is that despite in the in the first step the gradient looks quite
complex, it simplifies a lot when you do the analytical derivation and this is the final form of the gradient of
the binary cross entropy loss of this logistic regression model:
∇w L(ŷi , yi ) = (ŷi − yi )xi (9)
Since we are able to compute this gradient analytically, we can apply it inside a iterative gradient based
optimizer, which tries to go stepwise towards the optimum. The simplest optimizer that we can use and one
that’s actually used nowadays heavily even in deep learning is called gradient descent. We pick the step size η
and tolerance and Initialize w0 . And then we repeat until kvk < . We iterate the procedure:
N
X
v = ∇w L(ŷ, y) = ∇w L(ŷi , yi ) (10)
i=1
wt+1 = wt − ηv (11)
We optimize this until this gradient becomes small and converges to an extreme value of this function.

Figure 12: Gradient descent. Example of the converging variants of gradient descent.

But there’s several variants that work a little bit better, one is line search, that’s the green curve, where you’re
going into the direction of the gradient,but you search for the smallest value. You can do even better for
some problems by using the conjugate gradient method (red), where you’re not going into the direction of the
gradient, but into some conjugate direction, that brings you more directly to the optimal value.

14
2.1.3 Connections to the Information Theory
Maximizing the Log-Likelihood is equivalent to minimizing Cross Entropy or KL Divergence:
XN
ŵM L = argmax log pmodel (yi |xi , w) (12)
w
i=1
| {z }
Log-Likelihood

= argmax Epdata [log pmodel (y|x, w)] (13)

w
= argmin −Epdata [log pmodel (y|x, w)] (14)
w | {z }
Cross Entropy H(pdata ,pmodel )

= argmin Epdata [log pdata (y|x) − log pmodel (y|x, w)] (15)
w
= argmin DKL (pdata kpmodel ) (16)
w | {z }
KL Divergence

We can rewrite log likelihood using an expectation, because what if we would write 1 over N, which would be the
mean, then we would look at the empirical estimate over the data distribution,but one over N is a constant with
respect to w so it gets absorbed into the argmax operator. But equation having a one over N in the beginning
would exactly correspond to the definition of the expectation operator, which is basically expectation over the
probability of the data points. So each of the data points is a sampling based estimate of the log pmodel . We
can write argmax in terms of argmin, if we replace plus with a minus. Now what we can do also is we can take
this term and add the log pdata (y|x).We can do that because the data probability in this data distribution does
not depend on w, so it’s changing the value of this function, but it’s constant with respect to w.It means that
we have the cross-entropy minus the entropy and this in information theory terms is called the KL divergence.
It’s one of many divergences, but it’s a measure of the similarity of two distributions.This is a nice intuition
here, by computing the maximum likelihood estimate of the parameters we’re trying to minimize the distance
between two distributions,more precisely between the data distribution given by the data set - the empirical
data distribution and the distribution of our model given its parameters w.

2.2 Computation graphs.

In the second unit of this lecture we will introduce computation graphs, which is a fundamental concept to
understand larger changes of computations and also the computation of gradients in this large chains of compu-
tations. Unfortunately it’s not true that these gradients are so simple to compute in general.For more complex
models that we’re interested in this lecture, in particular for deep neural networks, where you simply can’t
derive the gradients with respect to any of these gigantic number of parameters just using pen and paper. So
how can we basically efficiently compute the gradients in the general?

2.2.1 Key idea.

The key idea of computation graphs is to decompose complex computations into sequences of very simple,
more atomic assignments. We call this sequence of assignments a computation graph or source code. The
forward pass takes a training point (x, y) as input and computes a loss, e.g.:

L = − log pmodel (y|x, w)

As we will see, gradients ∇w L can be computed using a backward pass. Both, the forward pass and the
backward pass are efficient due to the use of dynamic programming, i.e., storing and reusing intermediate re-
sults.This decomposition and reuse of computation is key to the success of the backpropagation algorithm,
the primary workhorse of deep learning. It would be unthinkable to optimize deep networks with millions or
billions of parameters without this decomposition and reuse of computation. That’s why the backpropaga-
tion algorithm which adjusts these millions and billions of parameters in our deep models is still the primary
workhorse for deep learning today.

2.2.2 Computation graph.

A computation graph has three kinds of nodes: input nodes in green, parameter notes in orange and compute
nodes in red. Also the loss function is a compute node in our setting. The input nodes don’t have parameters
they are the input of our learning problem, they are basically the dataset,this is where the data set goes into
the computation graph. The parameter nodes is where the parameters of the model are stored and this is what
we’re interested in when backpropagating gradients. We need to backpropagate gradients for all nodes, but

15
what we’re really interested in the end is the gradients for these parameters, because this is where the updates
are applied to. Finally we have the compute nodes which could be aver the final node (the loss function) or
intermediate compute nodes that take inputs, for instance input nodes or previous compute nodes or parameters
and produce a result.

Figure 13: Computation graph. Example of the computation graph for linear regression.

We’re looking at a very specific example of linear regression.This is the source code here:

(1) u = w1 x
(2) ŷ = w0 + u
(3) z = ŷ − y
(4) L = z2

This is how we write the linear regression problem, which we could also write in a single line, we write it in as
a sequence of atomic operations. Now there is multiple levels of granularities that we can use. This is a very
fine level of granularity, but we can use more coarse-grained levels of granularity as well and what we want to
choose depends on how big we want to make these atomic units, which level of atomic units we still can handle.
So in this case here we could combine (1) and (2) , then we would change this computation graph.

Figure 14: Computation graph with another level of granularity. Example of the computation graph
with another level of granularity for linear regression.

Steps (1) and (2) have collapsed into one step, which directly computes this affine transformation.
We can also for instance collapse the loss, we can compute not these two things here independently but we don’t
want to compute them jointly:
So now we have only two steps:

(1) ŷ = w0 + w1 x
(2) L = (ŷ − y)2

Here is the example for the logistic regression:

(1) u = w0 + w1 x
(2) ŷ = σ(u)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ)

16
Figure 15: Computation graph with another level of granularity. Example of the computation graph
with another level of granularity for linear regression.

Figure 16: Computation graph of logistic regression. Example of the computation graph for logistic
regression.

We could also write this in terms of vectors w and x:

(1) u = w> x
(2) ŷ = σ(u)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ)

Figure 17: Computation graph with vector representation. Example of the computation graph with
vector representation for logistic regression.

What we can also do is we can stack two of these operations behind each other:

(1) h = σ(W1> x)
(2) ŷ = σ(w2> h)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ)

17
Figure 18: Multi-Layer Perceptron. Example of the computation graph for Multi-Layer Perceptron.

2.3 Backpropagation.
Our goal is to find gradients of negative log likelihood:
N
X
∇w − log pmodel (yi |xi , w) (17)
| {z }
i=1
L(yi ,xi ,w)

or more generally of a loss function

N
X N
X
∇w L(y, X, w) = ∇w L(yi , xi , w) = ∇w L(yi , xi , w) (18)
i=1 i=1

given a dataset X = {(xi , yi )}N

i=1 with N elements. In the following, we consider the computation of gradients
wrt. a single data point: ∇w L(yi , xi , w). The gradient with respect to the entire dataset X is obtained by
summing up all individual gradients.

2.3.1 Chain rule.

In order to understand the backpropagation algorithm there’s basically just one rule that we have to understand
and that’s the chain rule:
d df dg
f (g(x)) = (19)
dx dg dx
We’re also going to need the so-called multivariate chain rule:4
M
d X ∂f dgi
f (g1 (x), . . . , gM (x)) = (20)
dx i=1
∂gi dx

The multivariate chain rule is concerned with compositions of functions, where in the argument of a function
we have a set of functions and each of these functions here depends on the same variable x.

2.3.2 Backpropagation algorithm.

Let’s look at the most simple form of backpropagation algorithm, one of the most basic examples, which is
basically just the chain rule in a slightly different form.For now we’re going to not make any distinction between
node types.
We’re interested in the gradient of this loss with respect to the loss itself or the variable y or the variable x.
Forward Pass:

(1) y = y(x)
(2) L = L(y)

We’re going to after running this forward pass, actually calculating the value of L for a particular input x, we
are going to run a so called backward pass, that’s where the name backpropagation is coming from. Backward

18
Figure 19: Simple computation graph.

Pass:

∂L ∂L ∂L ∂L
(2) = =
∂y ∂L ∂y ∂y
∂L ∂L ∂y
(1) =
∂x ∂y ∂x
We will indicate forward pass with the black arrows and backward pass with the blue. We backpropagate gradi-
ents from the output node, which is always L in our case, to each individual node, such that at each individual
node we can read off the gradients with respect to each of these nodes. We’re interested in gradients of L with
respect to any variable in this computation graph and we’re going to use two different colors here for indicating
two different types of quantities. Red color for backpropagated gradients and blue color for local gradients, that
are locally computed based on this assignment formulas (1) and (2).

Figure 20: Simple computation graph with backprop.

Figure 21: Simple computation graph with backprop, step 2.

The backward pass starts at the output node L and it computes gradients backwards.The first thing it does it
computes the gradient of L with respect to itself which is obviously 1. In order to compute the gradient of L
with respect to y we need to apply the chain rule. We need to compute the gradient of L with respect to the
gradient of L times the gradient of L with respect to the gradient of y. Now we can do one step further,now

19
we’re calculating the first expression the gradient with respect to the first variable, which we have in the forward
pass computed first. Now we compute the gradient of L with respect to x, this is exactly what we wanted to
compute. Again the chain rule, it’s the gradient of L with respect to y times the gradient of y with respect to
x.What we’re ultimately interested in is to backpropagate gradients of course to the weights of a parametric
function such as logistic regressor or a neural network. For all the values at a particular iteration we know the
values , so we will we’ll always insert the actual gradients and not the symbolic gradients.

Let’s look at a slightly more difficult example.This is an example where we have a fan out bigger than one
situation.Fan out means from a particular node there are multiple outgoing connections. In case y has multiple
connections.

Figure 22: Backpropagation: Fan-Out.

Forward pass looks as follows:

(1) y = y(x)
(2) u = u(y)
(2) v = v(y)
(3) L = L(u, v)

Lets compute the backward pass now:

∂L ∂L ∂L ∂L
(3) = =
∂u ∂L ∂u ∂u
∂L ∂L ∂L ∂L
(3) = =
∂v ∂L ∂v ∂v
∂L ∂L ∂u ∂L ∂v
(2) = +
∂y ∂u ∂y ∂v ∂y
∂L ∂L ∂y
(1) =
∂x ∂y ∂x
It’s important to remember whenever we have such a fan out situation, whenever one variable sends to multiple
variables we need to in the back propagation path is to sum up all the gradients and this is what has happened
here.

2.3.3 Implementation.
Now we are basically ready to implement the backpropagation algorithm. A convenient way to do so is to
consider each variable or node as an object that has certain attributes, for instance the value (x.value) and the
gradient (x.grad). Values are computed in the forward pass. Here is the sequence of assignments in python:

x.value = Input
y.value = y(x.value)
u.value = u(y.value)
v.value = v(y.value)
L.value = L(u.value, v.value)

Tha value of x is an Input, value of y is function y evaluated at x value and then u value is the function u
evaluated at y value and then v is also evaluated at y value. And then finally we have a Loss function.

20
Similarly we can look at the backward pass. First of all we would set the gradients to zero, because we’ll gonna
additively update.
x.grad = y.grad = u.grad = v.grad = 0
L.grad = 1
u.grad += L.grad ∗ (∂L/∂u)(u.value, v.value)
v.grad += L.grad ∗ (∂L/∂v)(u.value, v.value)
y.grad += u.grad ∗ (∂u/∂y)(y.value)
y.grad += v.grad ∗ (∂v/∂y)(y.value)
x.grad += y.grad ∗ (∂y/∂x)(x.value)
In order to evaluate gradient we need the values that have been computed in the forward pass. The gradient
itself is a function.
Let’s look at a slightly more real example in the sense of machine learning. This is a logistic regression example.
BCE denotes Binary Cross Entropy Loss.
Forward Pass:

(1) u = w0 + w1 x
(2) ŷ = σ(u)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ)
| {z }
BCE(ŷ,y)

Figure 23: Backpropagation: Log regression

We want to calculate the gradients, but we’re not interested in the gradients with respect to any variable
in this graph, for instance we’re not interested in the gradients with respect to these green input variables
here, we could calculate them and sometimes we really do, like in the case of style transfer.That’s the beauty
about the back propagation algorithm that it can compute the gradients with respect to any variable can even
compute the second derivative by executing it twice. But in this case here when we’re interested in training
the parameters of the logistic regression model. We’re really just interested in back propagating the gradients
from L to w1 and w0 so that we obtain the gradients of L with respect to w1 and w0 . Which are then used
in the gradient based optimization in an iteration loop in order to update the gradient the parameters of the
model.Let’s apply the back propagation algorithm to this example:
∂L ∂L ∂L ŷ − y
(3) = =
∂ ŷ ∂L ∂ ŷ ŷ(1 − ŷ)
∂L ∂L ∂ ŷ ∂L
(2) = = σ(u)(1 − σ(u))
∂u ∂ ŷ ∂u ∂ ŷ
∂L ∂L ∂u ∂L
(1) = =
∂w0 ∂u ∂w0 ∂u
∂L ∂L ∂u ∂L
(1) = = x
∂w1 ∂u ∂w1 ∂u

2.3.4 Summary
We can write mathematical expressions as a computation graph, that decomposes complex expressions into
much simpler computations that are tractable and where we can apply efficient dynamic programming to solve

21
for both - the values in the forward pass and the gradients in the backward pass.
The values are efficiently computed in the forward pass and the gradients are computed in the backward pass.
Multiple incoming gradients are summed up (multivariate chain rule).
Modularity: Each node must only ”know” how to compute gradients wrt. its own arguments.
One fw/bw pass per data point:
XN
∇w L(y, X, w) = ∇w L(yi , xi , w) (21)
| {z }
i=1
Backpropagation

Figure 24: Backpropagation: forward-backward

2.4 Educational Framewotk.

Now we’re going to introduce the educational framework, which is 160 lines of Python-NumPy code that imple-
ment a deep learning framework. It has been established in the context of a class taught by David McAllester,
one of the pioneers of artificial intelligence at TTI Chicago. This educational framework allows us to under-
stand the inner workings of a deep learning framework in depth. When listening to a lecture or a talk it is
often easy to miss the details that are required to get a good understanding of how the discussed concepts
actually work and only by implementing these concepts oneself one gets a good understanding of what has
actually been missed.The other reason why we’re using this educational framework is that it’s very compact
it allows us to understand each single line, it is a little bit less than 150 lines of Python-NumPy code. Let’s
look at this educational framework a little bit in more detail. It’s using python, which means that variables are
bound to objects. So for instance we have variables that describe the input or the labels or that are parents
simply of particular compute nodes in that computation graph or we have values, which are attributes of these
variables. The different types of nodes in the computation graph are implemented as classes. We have input
classes parameter classes and computational node classes. So for instance from the compute node we can inherit
a class that’s called sigmoid and that implements the functionality of the sigmoid non-linearity within a logistic
regression model or deep neural network.

22
c l a s s Input :
def init ( self ):
pass

d e f addgrad ( s e l f , d e l t a ) :
pass

c l a s s Parameter :
def i n i t ( s e l f , value ) :
s e l f . v a l u e = DT( v a l u e )
Parameters . append ( s e l f )

d e f addgrad ( s e l f , d e l t a ) :
s e l f . grad += np . sum ( d e l t a , a x i s = 0 )

d e f UpdateParameters ( s e l f ) :
s e l f . v a l u e −= l e a r n i n g r a t e ∗ s e l f . grad

c l a s s CompNode :
d e f addgrad ( s e l f , d e l t a ) :
s e l f . grad += d e l t a
Here we can see the abstract definition of the base classes. The input class has an initialization function and
a function for adding the gradients, which are empty, because we are not interested in adding gradients to the
input variables and also because input variables don’t depend on any other variables. The parameter class
has a initialization function that copies and stores the value of that parameter in the class itself and appends
these parameters to a parameter vector, because later on we’ll need to sum up these parameters in order to
get the computation result. It also has a function for adding up gradients, this function simply sums up all the
gradients along the data dimension. This model computes the values in the forward pass and the gradients in
the backward pass for all the data points simultaneously and this is for efficiency reasons, because it allows to
utilize efficient matrix operations in python. The Parameter class also has an UpdateParameter function that
takes a step into the gradient direction, this is what is utilized during gradient descent. The compute node
function also has a function for adding gradients, because we also need to update the gradients of the compute
nodes themselves. In order to execute the computation graph we need to define a forward and a backward
function.
d e f Forward ( ) :
f o r c i n CompNodes : c . f o r w a r d ( )

d e f Backward ( l o s s ) :
f o r c i n CompNodes + Parameters :
c . grad = np . z e r o s ( c . v a l u e . shape , dtype = DT)
l o s s . grad = np . o n e s ( l o s s . v a l u e . shape ) / l e n ( l o s s . v a l u e )
f o r c i n CompNodes [ : : − 1 ] :
c . backward ( )

d e f UpdateParameters ( ) :
f o r p i n Parameters : p . UpdateParameters ( )
Forward function takes the inputs x and y and the current state of the parameters w and goes left to right
through the computation graph in order to compute all the intermediate values all the way until the head node
L. We have stored all these computation nodes in a list CompNodes and we have stored them in a way such
that they are stored left to right, which means that when we arrive at any particular node in the sorted list,
then we know that all the previous nodes have already been computed, so we can utilize their values. Thus in
the forward pass of the backpropagation algorithm we simply go through the sorted list of computation nodes
and for each computation node apply the forward function that is implemented inside that computation node.
For the backward pass we first set all the gradients of all the computation nodes and all the parameters to zero
and then we go backwards. We go backwards starting from L and propagate the gradients backward iteratively
calling the backward function of each compute node in order to send gradient updates to its parents. Then
finally we have an UpdateParameters function that loops through a list of parameters and calls the UpdatePa-
rameter function for each parameter.
Remark: Forward() and Backward() functions compute the forward and backward passes respectively over the

23
entire data set and the reason for this is that matrix operations are very cheap, because they are implemented
very efficiently in python, while if we would go over the data set with loops we would be much slower. This
vectorization is very important to exploit, furthermore if we would have GPUs hardware available, then we
could even parallelize this computation, because the forward pass of each individual data point is independent
of each other data point and the backward pass of each individual data point is independent of the backward
pass of our data points.

Lets look at the concrete example of the computation node of sigmoid function:
1
σ(x) =
1 + e−x
σ 0 (x) = σ(x)(1 − σ(x))

c l a s s Sigmoid (CompNode ) :
def init ( self ,x ):
CompNodes . append ( s e l f )
self .x = x

def forward ( s e l f ) :
bounded = np . maximum( −10 ,np . minimum ( 1 0 , s e l f . x . v a l u e ) )
s e l f . v a l u e = 1 / ( 1 + np . exp(−bounded ) )

d e f backward ( s e l f ) :
s e l f . x . addgrad ( s e l f . grad ∗ s e l f . v a l u e ∗ (1− s e l f . v a l u e ) )

Here we can see the python definition of the sigmoid class, which inherits from the class computation node.
It has three functions: an initialization function, a forward function and a backward propagation function. At
initialization time we simply add the node itself to the list of computation nodes and we store the parent of that
node in that class itself. self.x = x means take the parent which is an input that could be another computation
node let’s say an affine computation node that does some computation and then inputs to the sigmoid function.
In the forward function we calculate the sigmoid expression. So first we compute a bounded value of the value
itself in order to avoid numerical problems and then we implement the sigmoid function. For the backward
pass we implement derivative of sigmoid function, but what we do actually is we implement that function and
multiply that function with the back propagated gradient, the gradient at the node itself and pass this as a
message further on to the parent.
Remark: It’s important to note that in this backward pass the gradient is sent to the parent node self.x.
Let’s execute a concrete minimal example.
import e d f

# data loading
Execution Example: edf . clear compgraph ()
• Load data X and labels y x = e d f . Input ( )
y = e d f . Input ( )
• Initialize parameters w0 x . v a l u e = Load ( data )
y . v a l u e = Load ( l a b e l s )
• Define computation graph
• For all iterations do # initialization of parameters
params 1 = e d f . AffineParams ( nInputs , nHiddens )
– Forward Pass params 2 = e d f . AffineParams ( nHiddens , n L a b e l s )
L(ŷi = fw (xi ), yi )
# definition of computation graph
– Backward Pass h = e d f . Sigmoid ( e d f . A f f i n e ( params 1 , x ) )
∇w L(ŷi , yi ) p = e d f . Softmax ( e d f . A f f i n e ( params 2 , h ) )
L = e d f . Cross EntropyLoss ( p , y )
– Gradient Update
t+1
wP = wt − # gradient descent
N
η i=1 ∇w L(ŷi , yi ) f o r i in range ( i t e r a t i o n s ) :
e d f . Forward ( )
e d f . Backward (L)
e d f . UpdateParameters ( )

24
3 Deep Neural Networks
3.1 Backpropagation with Tensors
So far, we have only discussed how the backpropagation algorithm works on functions of scalars, that is, functions
like y = σ(w1 x + w0 ). But what if we want to optimize parameters in a function like y = σ(Ax + b), where
we have to deal with matrices and vectors? Technically, this function can be completely broken down such that
we have a computational graph consisting only of scalar operations. However, it is often much simpler to deal
with vectors, matrices and tensors (n-dimensional matrices) directly. If we implement a computational graph
composed of tensor-valued nodes, each node still has the attributes value and grad. The value-attribute of
a node contains an array with the components of the tensor. Because the loss function L is still a scalar, all
that is needed for backpropagation are the partial derivatives of the loss with respect to each component of a
tensor. The grad-attribute thus simply contains the gradient, so b.grad stores ∇b L. Note that this means
that A.value and A.grad have the same shape (similarly for all other nodes).

3.1.1 Backpropagation on Loops

How do the values and gradients get computed then? Let us take a look at the following example:

y = σ(|{z}
Ax +b)
=u

Here, the intermediate variable u has been indicated. As already stated, the computational graph might also
consist of scalar operations only and that is certainly one way to implement the computation of the value and
grad attributes. The forward pass would look like this:

for i u.value[i] = 0
for i,j u.value[i] += A.value[i, j] ∗ x.value[j]
for i y.value[i] = σ(u.value[i] + b.value[i])

In the backward pass, first the gradients of the loss with respect to y (so ∇y L) are computed and stored in
y.grad. Then, the gradients of u and b can be computed:

for i u.grad[i] += y.grad[i] ∗ σ 0 (u.value[i] + b.value[i])

for i b.grad[i] += y.grad[i] ∗ σ 0 (u.value[i] + b.value[i])

Finally, the gradients of x and A are evaluated:

for i,j A.grad[i, j] += u.grad[i] ∗ x.value[j]

for i,j x.grad[j] += u.grad[i] ∗ A.value[i, j]

Here, the back-propagated gradients are shown in red and the locally computed gradients in blue.
In deep learning, this kind of implementation remains possible for higher order tensors. If the forward pass
of a computational graph is defined as follows:

for h,i,j,k U.value[h, i, j] += A.value[h, i, k] ∗ B.value[h, j, k]

for h,i,j Y.value[h, i, j] = σ(U.value[h, i, j]),

then the backward pass can also be implemented in terms of loops over indices:

h,i,j U.grad += Y.grad[h, i, j] ∗ σ 0 (U.value[h, i, j])

h,i,j,k A.grad += U.grad[h, i, j] ∗ B.value[h, j, k]
h,i,j,k B.grad += U.grad[h, i, j] ∗ A.value[h, i, k]

In practice, this is usually not how the training of a deep network is implemented. Instead, a method called
minibatching is used.

3.1.2 Minibatching
In addition to too many indices being messy, writing loops manually can be very computationally inefficient. The
languages that are mainly used for writing training loops for neural networks, such as Python or MATLAB, are
not designed for efficient loops over arrays and can be very slow (mostly for type checking reasons). A technique
called vectorization is often advantageous: Instead of looping over an array and applying an operation to each

25
scalar element, apply the operation to a whole vector (or matrix) at once. There are special libraries that provide
this functionality (such as NumPy) and they can speed up the computation dramatically. Some libraries also
add support for computation on a GPU. The goal is, that the bulk of the computation time should go to the
actual calculation of floating-point numbers and not to things such as type checking.
In our context this means not taking a single observation x for an optimization step (forward pass, backward
pass, parameter update), but a minibatch of N observations, stacked together in a matrix X. The equation
from above then becomes:
|{z} +B)
Y = σ(XA
=U
N ×D
Here, each row in X ∈ R is one observation x ∈ RD and the bias vector b ∈ RM is broadcast to RN ×M .
All values that depend on the input x now also have a batch index b:

for b,i U.value[b, i] = 0

for b,i,j U.value[b, i] += X.value[b, j] ∗ A.value[j, i]
for b,i Y.value[b, i] = σ(U.value[b, i] + B.value[i])

For the actual parameter update step, the computed gradients are averaged over one batch.
In a real implementation, vectorization would be used:
def forward ( self ):
self . value = np . matmul ( self . x . value , self . w . A . value ) + self . w . b . value

def backward ( self ):

self . x . addgrad ( np . matmul ( self . grad , self . w . A . value . transpose ()))
self . w . b . addgrad ( self . grad )
self . w . A . addgrad ( self . x . value [: ,: , np . newaxis ] * self . grad [: , np . newaxis ,:])

This is easier to read than for-loops with many indices, but might be harder to implement for NumPy-novices.

3.2 The XOR Problem

Recall now the logistic regression model from earlier:
1
ŷ = σ(w> x) with σ(x) =
1 + e−x
Which problems could be solved with this classifier? Logistic regression defines a linear decision boundary:

w> x + w0 = 0,

(see Fig. 25). Each point x is classified according to the side of this linear boundary that it lies on:
• Decide for class 1 ⇔ w> x > −w0
• Decide for class 0 ⇔ w> x < −w0

1.0
Decision Boundary

Sigmoid

0.8

0.6

Class 0 Class 1
(x)

0.5

0.4

0.2

0.0
10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5 10.0
x

Figure 25: Decision boundary of a logistic regression model

Because the decision boundary is linear, logistic regression can only solve classification problems with linearly
separable datasets. Some simple examples include the OR and AND functions:

26
x1 x2 OR(x1 ,x2 ) x1 x2 AND(x1 ,x2 )
0 0 0 0 0 0
0 1 1 0 1 0
1 0 1 1 0 0
1 1 1 1 1 1

For these two functions (datasets) optimal parameters can easily be found:

OR AND

x1 x1
1 1 > |{z}
0.5 1 1 > |{z}
1.5
| {z } x2 | {z } x2
−w 0 −w 0
w> w>
| {z } | {z }
x x

From the following figures, it can be seen that these decision boundaries perfectly separate all four data
points in both cases:

OR AND
Class 0 Class 0
Class 1 Class 1
1 1

0 1 0 1

However, there are many datasets that are not linearly separable. One very simple example is the XOR-
function:
x1 x2 XOR(x1 ,x2 )
0 0 0
0 1 1
1 0 1
1 1 0
The dataset can be seen in Fig. 26. Visually, it is obvious that there exists no linear decision boundary that
cleanly separates the two classes. Formally, this can be proven using convex sets.

Class 0
Class 1
1

0 1
Figure 26: The XOR Dataset

Definition 1 (Convex Set) A set S is called convex, if for any two points x, y ∈ S the line segment connecting
the two points also completely lies in S:

∀x, y ∈ S ∀θ ∈ [0, 1] : θx + (1 − θ)y ∈ S

The linear decision boundary divides the plane into two half-spaces (the decision regions). These half-spaces
are convex. If a feasible hypothesis, i.e. a correct decision boundary, were to exist, the two red points must be
in the red region, while the green points must lie in the green region. Because the decision regions are convex,
the connecting line of the two green points must completely lie in the green decision region and the connecting
line of the two red points must completely lie in the red decision region. However, as can be seen in Fig. 26,

27
the point (0.5, 0.5) lies on both of these connecting lines (they intersect here). Thus it must lie in both decision
regions, which is not possible. Hence, no solution can exist.
There are, of course, many problems that cannot be solved with a linear classifier. The XOR-Problem
is particularly well known because of its simplicity. It is partially responsible for the decline of interest in
perceptrons and neural networks in the 1970s.
The trick we need to use to solve the XOR-Problem (or other non-linear problems) using a linear classifier
such as logistic regression, is to employ basis functions. Analogously to how polynomial basis functions allow a
linear regression model to fit nonlinear datasets, the inputs of a classification problem can be transformed using
basis functions in such a way that the dataset becomes linearly separable. The decision boundary no longer
splits the input space of datapoints x in half, but the feature space of feature vectors ψ(x). One such feature
mapping is defined in Tab. 1.

x1 x2 ψ1 (x) ψ2 (x) ψ3 (x) XOR

0 0 0 0 0 0
0 1 0 1 0 1
1 0 1 0 0 1
1 1 1 1 1 0

Table 1: Feature mappings for the XOR problem

The decision boundary can then be defined as follows:

 
x1
w>  x2  > −w0
x1 x2
| {z }
ψ(x)

The feature space and the new decision boundary are visualized in Fig. 27.

Class 0
Class 1

0 1

Figure 27: Feature space and decision boundary for the XOR problem

A different set of features that also transform the input into a linearly separable form are ψ1 (x) = OR(x1 , x2 )
and ψ2 (x) = NAND(x1 , x2 ). These features can then be combined using XOR(x1 , x2 ) = AND(ψ1 (x), ψ2 (x)).
So even though all of the functions OR, AND and NAND are linearly classifiable, they can compute XOR when
composed together. The final computational graph for XOR can then be written as follows:
>
h1 = σ(wOR x + wOR )
>
h2 = σ(wN AN D x + wN AN D )
>
ŷ = σ(wAN D h + wAN D )

Instead of going directly from input to output, the input gets transformed nonlinearly into the feature h. Here,
h is called a hidden layer. The equations can also be written more compactly:
 
 >

 wOR wOR
h = σ x+

>
 wN wN AN D 

AN D
| {z } | {z }
W w
>
ŷ = σ(wAN D h + wAN D )

A visual depiction of this graph is shown in Fig. 28.

Because we now have two layers (one hidden and one output), this is called a Multi-Layer Perceptron. The
real power of deep learning is that these intermediate features can be learned automatically using backpropa-
gation. This is called representation learning.

28
Figure 28: Solving the XOR Problem with a hidden layer

3.3 Multi-Layer Perceptrons

Multi-Layer Perceptrons (MLPs) are feedforward neural networks. This means there are no feedback signals
flowing from later to earlier layers. MLPs compose several non-linear functions f (x) = ŷ(h3 (h2 (h1 (x)))) where
hi (·) are called hidden layers and ŷ(·) is the output layer. The data with which a MLP is trained does not
specify the representations learned by the hidden layers, only the behaviour of the output layer. Each layer i
in a MLP comprises multiple neurons j which are implemented as affine transformations (a> x + b) followed by
non-linear activation functions (g):
hij = g(a>
ij hi−1 + bij )

Network Depth = #Computation Layers = 4

Layer Width = #Neurons in Layer

Input Layer Hidden Layer 1 Hidden Layer 2 Hidden Layer 3 Output Layer

Figure 29: A multi layer perceptron with 4 layers

Linear Regressor / Classiﬁer

Input Layer Hidden Layer 1 Hidden Layer 2 Hidden Layer 3 Output Layer

Class 0 Class 0
Class 1 Class 1

Transformation
by hidden layers

Figure 30: Hidden layers transform the input into better features

Each neuron in a MLP is fully connected to all neurons in the previous layer. The total number of layers is
also called the depth of the model, this is where the name deep learning comes from. Technically, the neurons
in a MLP are not perceptrons. Perceptrons use a linear threshold activation function which cannot be trained
using backpropagation and is instead trained using the perceptron algorithm. MLPs use (mostly) differentiable

29
activation functions and are trained using backpropagation. In Fig. 29, a complete MLP is shown. The only
function of the hidden layers is to transform the input vectors into features that can be processed by the simple
linear regression or logistic regression output layer, as was done manually in the last section. This is illustrated
in Fig. 30. Some different activation functions that can be used are shown in Fig. 31.

Figure 31: Activation functions that provide the needed nonlinearity

Artificial neural networks are loosely inspired by biological neural networks. Neurons in the brain are also
structured in layers where one neuron has many inputs and computes one output. Even the sigmoid activation
function can be found in biological neurons. However, brains are very different from Multi-Layer Perceptrons
and the goal of deep learning is not to model the brain, but to build models that achieve good statistical
generalization in many different settings.
MLPs can be trained using the backpropagation algorithm and (stochastic) gradient descent as follows:
1. Initialize weights w, pick learning rate η and minibatch size |Xbatch |
2. Draw (random) minibatch Xbatch ⊆ X
3. For all elements (x, y) ∈ Xbatch of minibatch (in parallel) do:
(a) Forward propagate x through network to calculate h1 , h2 , . . . , ŷ
(b) Backpropagate gradients through network to obtain ∇w L(ŷ, y)
1
4. Update gradients: wt+1 = wt − η |Xbatch
P
| (x,y)∈Xbatch ∇w L(ŷ, y)

5. If validation error decreases, go to step 2, otherwise stop

Typically |Xbatch | < |X |, because large datasets often do not fit into GPU memory.
The number of hidden layers and neurons per hidden layer are hyperparameters that can be selected to
adjust the complexity of the model. In a two-layer MLP, the number of neurons in the hidden layer is simply
the dimensionality of the feature space. The higher this number is, the more complex the function that the
network represents can be. This is illustrated in the following figure1 :

2 Hidden Neurons 5 Hidden Neurons 15 Hidden Neurons

1 https://cs.stanford.edu/people/karpathy/CNNjs/demo/classify2d.html

30
Finally, let us discuss what would happen if there are no non-linear activation functions. This following
two-layer MLP

h = g(A1 x + b1 )
y = g(A2 h + b2 )

can be written as
y = g(A2 g(A1 x + b1 ) + b2 ).
If we used a linear activation function, such as g(x) = x, we can rewrite this equation:

y = A2 (A1 x + b1 ) + b2 = A2 A1 x + A2 b1 + b2 = Ax + b

Thus, with linear activations, a multi-layer network can only express linear functions.

3.4 Universal Approximation

If we use non-linear activation functions, what functions can we actually represent with a multi-layer perceptron?
One important result tells us that we can approximate any continuous function arbitrarily well on a given
bounded domain with only a 2-layer feedforward neural network.

Theorem 1 (Universal Approximation Theorem) Let σ be any continuous discriminatory function. Then
finite sums of the form
N
X
G(x) = αj σ(a>
j x + bj )
j=1

are dense in the space of continuous functions C(In ) on the n-dimensional unit cube In . In other words, given
any f ∈ C(In ) and > 0, there is a sum, G(x) for which

|G(x) − f (x)| < for all x ∈ In

Remark: This theorem has been proven for various activation functions (e.g., Sigmoid, ReLU).
If we restrict ourselves to the space of binary functions f : {0, 1}D → {0, 1}, it is not difficult to see why this
result holds. Here, any function f is really just a table, such as the following:

x1 x2 x3 y
.. .. .. ..
. . . .
0 1 0 0
0 1 1 1
1 0 0 0
.. .. .. ..
. . . .
Figure 32: Linear threshold network

If we now take a linear threshold function as the activation function for each neuron, we can construct a
2-layer MLP with the output X
ŷ = [a> x + bi > 0] .
|i {z }
i
hi

See Fig. 32 for an illustration. Here, each hidden neuron hi recognizes exactly one combination of inputs, see
Fig. 32 and the table above to see how the parameters of one such neuron are implemented. If we now do
this for all 2D combinations of inputs, we can construct a 2-layer MLP that can be equivalent to any binary
function. We used a linear threshold function, which is not continuous. However, we can easily approximate
this activation function using sigmoid functions, see Fig. 33
Thus, every binary function can be approximated arbitrarily well with a 2-layer MLP using 2D number of
hidden neurons. This is an important theoretical result, but an exponential number of neurons means that
memory and computation time will also increase exponentially with the size of the input. Another problem
with the approach from above is that the network only memorizes the input-output pairs and thus will not be
able to generalize at all. By increasing the number of layers, a deep network needs far fewer parameters to
learn complex functions. This is because a deep architecture introduces an inductive bias: it assumes that the

31
1.0
(x)
(2x)
(5x)
0.8 (50x)

0.6

0.4

0.2

0.0
10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5 10.0
x
Figure 33: Sigmoid functions can approximate the Heaviside function

complex function learned is a composite of simpler functions. This leads to more compact models and better
generalization performance. As an example, the parity function
( P
1 if i xi is odd
f (x1 , . . . , xD ) =
0 otherwise

requires an exponentially large shallow network but can be computed using a deep network whose size is linear
in the number of inputs D.
If the neurons of a MLP are activated with the absolute value rectification function, there is a intuitive
geometric analog to how the layers of the network transform the input. The weights and biases of a layer define
a hyperplane in the feature space which defines a “mirror”. This means that the inductive bias of such networks
is the assumption, that complex functions arise as mirrored images of simpler patterns. This is illustrated in
Fig. 34. This image also intuitively conveys the exponential advantage of more layers over more parameters.

Figure 34: Space folding intuition for deep networks

It has been shown multiple times that deeper networks generalize better and that introducing more layers
is often more effective than introducing more parameters, see Fig. 35. This means, that the inductive bias of
compositionality is a very useful prior over the space of hypothesis functions that a model can learn.

32
Figure 35: Deeper networks often perform better than shallow networks when using the same number of
parameters

33
4 Deep Neural Networks II
4.1 Output and Loss Functions

Input Layer Hidden Layer 1 Hidden Layer 2 Output Layer Loss Function Target

Figure 36: A simple neural network containing a one dimensional output layer and loss function.

Besides the input and hidden layers, an artificial neural network has an output layer containing the prediction
ŷ and a loss function L that compares the output with a given target y (see Fig. 36). The choice of the output
layer depends on the task (discrete or continuous predictions, classification or regression problem).

4.1.1 Loss Function

1.2
KL Divergence Large 1.2
KL Divergence Small
pdata pdata
pmodel pmodel
1.0 1.0

0.8 0.8
p(y|x)

p(y|x)

0.6 0.6

0.4 0.4

0.2 0.2

0.0 0.0
0 1 2 3 4 5 0 1 2 3 4 5
y y
(a) Large KL-divergence leads to a large loss value. (b) Small KL-divergence leads to a small loss value.

Figure 37: Quantifying the divergence of distributions.

The loss function of a neural network quantifies the divergence of model output (=prediction) and the target
value. It evaluates the quality of a prediction by boiling it down to a single or few numbers. Predictions that are
similar to the desired target value should get a small loss value, while strongly diverging predictions should be
assigned a large loss value. In this sense you can think of the loss function as a measure of distance or cost being
paid for a prediction. In this way the loss function enforces similarity between predictions and target values.
Note that the values don’t necessarily have to be single points, but might also be parameters of a probability
distribution as visualized in Fig. 37. In the second case, the Kullback-Leibler divergence (KL divergence) serves
as loss as it measures to divergence of to probability distributions.
But how to design a good loss function? Basically, a loss function can be any differentiable function that we
wish to optimize. But instead of designing a loss function by hand, it is often preferable to derive the cost
function from the maximum likelihood principle. To do so, consider the output of the neural network as
parameters of a distribution over outputs yi . The maximum likelihood principle is used to find the optimal
values for the parameters by maximizing the likelihood function derived from the training data. So using this
approach, we try to find model parameters w that maximize the likelihood of the model for a given training

34
dataset X.

ŵM L = argmax pmodel (y|X, w)

w
N
iid
Y
= argmax pmodel (yi |xi , w)
w
i=1
N
X
= argmax log pmodel (yi |xi , w)
w
i=1
| {z }
Log-Likelihood

For example,
let a neural network fw (x) predicts mean µ of Gaussian distribution over y, that is p(y|x, w) =
1 (y−fw (x))2
√
2πσ 2
exp − 2σ 2 . The goal is, to maximize the probability of the target y under this distribution which
would result in the mean being shifted toward y. This setting is also visualized in Fig. 38.
As the loss function depends on the specific task at hand, the following two subsections will show how to derive
the loss function for different regression and classification problems. We will also discuss the design of the
output layer as both considerations are mutually dependent.

0.40

0.35

0.30

0.25

p(y)
0.20

0.15
Target
0.10

0.05
Mean
0.00
6 4 2 0 2 4 6
y

Figure 38: Neural network predicting model parameters µ, σ of a Gaussian distribution.

4.1.2 Regression Problems

The Gaussian distribution:

0.40
0.35
0.30
0.25
p(y)

0.20
0.15
0.10
0.05
0.00
6 4 2 0 2 4 6
y

Figure 39: The Gaussian Distribution

The Gaussian Distribution (see Fig. 39 is parameterized by a mean µ and a standard deviation σ:
!
2
1 (y − µ)
p(y) = √ exp −
2πσ 2 2σ 2

Its characteristics are thin tails: p(y) → 0 quickly as y → ∞. This means outliers are strongly penalized, wrong
datapoints can impact the distribution significantly.

The L2 Loss
The formula of the L2 Loss is similar to the Maximum Likelihood in the first lecture, but now the parameter
µ is the prediction of a neural network. We assume the model distribution is a Gaussian distribution where
the Multi-Layer-Perceptron predicts the mean of that distribution, and σ is constant. When this is plugged
into ŵM L (the Maximum Likelihood objective), the first expression can be removed, as it is constant with

35
respect to w. By reformulating the last expression by removing the minus, we are now minimizing instead of
maximizing, so we have arrived at our loss function (see equation below).

1 (y−fw (x))2
Let pmodel (y|x, w) = √2πσ 2
exp − 2σ 2 be a Gaussian distribution. We obtain:
N
X
ŵM L = argmax log pmodel (yi |xi , w)
w
i=1
N N
X 1 X 1
2 2
= argmax − log(2πσ ) − 2
(fw (xi ) − yi )
w
i=1
2 i=1
2σ
N
X 2
= argmax − (fw (xi ) − yi )
w
i=1
N
X 2
= argmin (fw (xi ) − yi )
w | {z }
i=1
L2 Loss

This loss function is called the squared loss or L2 Loss. As this loss is strongly affected by outliers, often a
different distribution is used - the Laplace Distribution.

The Laplace distribution:

0.5

0.4

0.3
p(y)

0.2

0.1

0.0
6 4 2 0 2 4 6
y

Figure 40: The Laplacian Distribution

The Laplace Distribution (see Fig. 40) has a similar form to the Gaussian distribution, except now in
exponential expression the absolute difference between y and µ is used:

1 |y − µ|
p(y) = exp −
2b b
1
The normalization constant 2b has changed slightly as well due to this. In the formula, µ denotes the location
of the distribution, while b represents the scale, similar to the standard deviation in the Gaussian distribution
determining the width of the distribution.
This distribution has heavier tails than the Gaussian: p(y) → 0 more slowly as y → ∞. This means more
probability mass is at the tails, so outliers are penalized less strongly, which often makes it the preferred choice
for regression problems in practice.

When deriving the loss here we do the same calculations as before, but now using the Laplacian Distri-
bution. Here, the location parameter µ is predicted by the Feed-Forward Neural Network, with b being an
arbitrary constant scale parameter.

1
Let pmodel (y|x, w) = 2b exp − |y−fbw (x)| be a Laplace distribution. We obtain:

36
N
X
ŵM L = argmax log pmodel (yi |xi , w)
w
i=1
N N
X X 1
= argmax − log(2b) − |fw (xi ) − yi |
w
i=1 i=1
b
N
X
= argmax − |fw (xi ) − yi |
w
i=1
N
X
= argmin |fw (xi ) − yi |
w | {z }
i=1
L1 Loss

Refactoring the equation in a similar way to the Gaussian, we arrive at a similar loss function too (see above).
The difference is, that the square has been replaced with the absolute value, meaning that the absolute loss
(called the L1 Loss)is being minimized. This loss is more robust than the L2 loss.

It is also possible to predict more than one parameter. Consider the Laplace Distribution again, but
this time predicting both the location parameter µ and the scale b with a neural network. Technically these
parameters are predicted by different neural networks fw (xi ) and gw (x), but these are typicaly the same except
for different output layer for each network.

Let pmodel (y|x, w) = 2 gw1(x) exp − |y−fw (x)|
gw (x) be a Laplace distribution. We obtain:
N
X
ŵM L = argmax log pmodel (yi |xi , w)
w
i=1
N N
X X 1
= argmax − log(2 gw (x)) − |fw (xi ) − yi |
w
i=1 i=1
gw (x)

The resulting expression is not as simple anymore, as both expressions depend on w and thus cant be removed.
This also results in a more complicated loss function. This loss function is again derived from the maximum
likelihood principle by assuming a certain distribution over the data. So if there exists knowledge of the
distribution of the data, a loss function that fits can be derived that would be hard to specify otherwise.
Predicting both parameters allows for estimating the aleatoric uncertainty (observation noise) with the neutral
network itself. This can be helpful for example in situations where an observation is not clear (for example an
image where part of the lens was covered, that is too dark, etc.), the uncertainty about the prediction can be
predicted too (larger or smaller b). The parameters can then be adjusted based on how certain the model is
for a given prediction. Another example would be predicting the category/pixel depth of an image, where it
is hard to make predictions at boundaries (foreground or background), resulting in higher uncertainty in these
regions (see Fig. 41).

Figure 41: Example for predicting the aleatoric uncertainty

Mixture Density Networks All of the distributions considered so far were unimodal. In some cases like the
depth prediction above, we may not know exactly if a pixel belongs to foreground or background. This is hard
to model with a Laplacian or Gaussian distribution with a single peak. To represent multi-modal distributions,

37
we can also model mixture densities using a mixture model:
M
!
(m)
X 1 |y − fw (x)|
pmodel (y|x, w) = πm (m)
exp − (m)
m=1 2 gw (x) gw (x)

0.8

0.7

0.6

0.5

p(y)
0.4

0.3

0.2

0.1

0.0
6 4 2 0 2 4 6
y

Figure 42: Laplacian Mixture Model

Fig. 42 above is a mixture model of two components using a simple Laplacian distribution that is summed over
the number of components, resulting in a prediction for the location µm and scale bm for all the m modes (in
this case two). Here πm ∈ [0, 1] denotes the weight for each component,
P as all componentes are weighted relative
to each other. A constraint for this parameter is further that m πm = 1.
This model is called a mixture density network, with a mixture of Laplace distributions in this case.

The Output Layer for Regression Problems

It has been previously discussed, that the hidden layers of a neural network represent a simpl eaffine transfor-
mation plus an activation function (e.g. the Sigmoid).
The output layer is also a combination of its inputs, so we also have an affine transformation. This time we
do not have an activation function though, as for most outputs (e.g., µ ∈ R), they are already going into
the correct space. In this case the output is just a linear layer. For some outputs (e.g., b ∈ R+ ), we need a
squashing function such as ReLU or softplus on top of the linear layer, to squash the otuput layer to R+ .

Summary: Loss Functions for Regression Problems

In summary, when assuming a Gaussian or Laplacian model distribution, the loss function corresponds to the
L2/L1 loss. It is also possible to predict uncertainty (variance/scale) or multiple modes using mixture density
networks (MDN).

4.1.3 Classification Problems

Compared to regression problems, in classification a discrete outcome is predicted. This outcome can be binary
or multi-class. The MNIST dataset for example, one of the most popular datasets in machine learning, contains
handwritten digits, so ten different output classes. Each sample consists of one image of a single digit with
28x28 pixels each. In total the dataset contains 60k training samples with labels and 10k test samples.
The Curse of Dimensionality: For images with 28x28 pixels, there exist 102 36 different possible binary
images, when using gray-scale such as in MNIST, there are even 2567 84 different combinations. This means
the image space is etremely large and impossible to enumerate. So how is this classification task with just 60k
labeled images even possible then?
The answer is, that the images are concentrated on a low-dimensional manifold of this high-dimensional space.
Only a tiny fraction of the possible images actually look like hand-written digits, so not the whole space has to
be searched.

Bernoulli distribution: The Bernoulli distribution is a distribution over 2 classes, so it is only appli-
cable for binary classification problems (e.g. classifying cats vs. dogs). It can be written as:

p(y) = µy (1 − µ)(1−y)

Here, the parameter µ denotes the probability for y = 1. As before, the probabilities of both classes must sum
to 1.
Similar as before, we can assume our model distribution as the Bernoulli distribution now and put this in the
Maximum-Likelihood estimator to derive the loss function. Doing this we again obtain the binary cross-
entropy (BCE) loss function (see below). Let pmodel (y|x, w) = fw (x)y (1 − fw (x))(1−y) be a Bernoulli

38
0.7

0.6

0.5

0.4

p(y)
0.3

0.2

0.1

0.0
0 1
y

Figure 43: Bernoulli Distribution

distribution. We obtain:
N
X
ŵM L = argmax log pmodel (yi |xi , w)
w
i=1
N
X h i
(1−yi )
= argmax log fw (xi )yi (1 − fw (xi ))
w
i=1
N
X
= argmin −yi log fw (xi ) − (1 − yi ) log(1 − fw (xi ))
w | {z }
i=1
BCE Loss

In other words, maximizing the Log-Likelihood with the Bernoulli distribution as the model distribution is
equivalent to minimizing the BCE loss. The last layer of fw (x) can be a sigmoid function (or any other
squeezing function) such that fw (x)y ∈ [0, 1] to get probabilities. In this case, unlike for regression problems,
this is required as we have seperate classes.

Categorical distribution:
So how can this approach be scaled up to multiple classes, such as in MNIST?

0.5

0.4

0.3
p(y)

0.2

0.1

0.0
1 2 3 4
y

Figure 44: Categorical Distribution

In the case of multiple classes, we again use the Maximum-Likelihood principle, but this time with the
Categorical distribution:

p(y = c) = µc

Here, the probability of y taking any of the classes c is µc . Again, all probabilities have to sum to 1.
Discrete distributions such as this one by definition accomodate multiple modes.

Alternative notation:
C
Y
p(y) = µyc c
c=1

Consider the distribution not over one-dimensional categorical labels, but over a vector y, which is a one-hot
vector with yc ∈ {0, 1}, where the length of y is the number of classes. Each element of the vector is either
1 or 0, but the sum over the whole vector is 1, so only one element can be 1 (the true class). For example
y = (0, . . . , 0, 1, 0, . . . , 0)> .

39
With this definition the probability can be rewritten as a product, where we have µyc c . Only where yc = 1 the
term is not 1, but µc .
In the table below we see an example for this representation:
class y y
1 (1, 0, 0, 0)>
2 (0, 1, 0, 0)>
3 (0, 0, 1, 0)>
4 (0, 0, 0, 1)>
In this example fouranimals are representing the diferent classes. Each class is represented by a one-hot vector
y with binary elements yc ∈ {0, 1}, with an index c where yc = 1 determines the correct class and with yk = 0
for k 6= c.
This vector can also be interpreted as a discrete distribution, with all the probability mass at the true class.

The Categorical Distribution / CE Loss

Now we use the Categorical distribution as model distribution in the Maximum-Likelihood formulation:
QC (c)
Let pmodel (y|x, w) = c=1 fw (x)yc be a Categorical distribution. We obtain:
N
X
ŵM L = argmax log pmodel (yi |xi , w)
w
i=1
N
X C
Y
(c)
= argmax log fw (xi )yi,c
w
i=1 c=1
N X
X C
(c)
= argmin −yi,c log fw (xi )
w
i=1 c=1
| {z }
CE Loss

Again, we can reformulate to turn the maximization problem into a minimization problem and arrive at the so
called Cross-Entropy loss, which goes across all classes. We compare the target for class c (the entry in the
one-hot vector) with the prediction for that c. This means we need a prediction for every c now. In other
words, we minimize the cross-entropy (CE) loss.
The target y = (0, . . . , 0, 1, 0, . . . , 0)> is a one-hot vector with yc its c’th element.

The Softmax
(c)
How can we ensure that fw (x) predicts a valid Categorical (discrete) distribution?
(c) PC (c)
The requirement for that is, that (1) each element fw (x) ∈ [0, 1] and (2) c=1 fw (x) = 1, so the distribution
has to sum to 1.
Using the element-wise Sigmoid for example, we would ensure (1), but not (2). This problem can be solved by
defining a so called softmax function on top of the affine predictions (scores) for each class, which guarantees
(1) and (2):
!
exp(x1 ) exp(xC )
softmax(x) = PC , . . . , PC
k=1 exp(xk ) k=1 exp(xk )

So the exponential of each individual element is divided by the sum of the exponentials of each individual
element.
Let s denote the network output after the last affine layer (=scores). Then:
C
(c) exp(sc ) (c)
X
fw (x) = PC ⇒ log fw (x) = sc − log exp(sk )
k=1 exp(sk ) k=1

So for each class we have the output of the neural network as input to the softmax. Taking the logarithm of
the softmax we arrive at the Log Softmax function above. We already see that sc is a direct contribution to
the loss function, i.e., so it does not saturate.

The Log Softmax:

40
Intuition: Assume c is the correct class. Our goal is to maximize the log softmax:
C
X
(c)
log fw (x) = sc − log exp(sk )
k=1

The first term here encourages the score sc for the correct class c to P increase. The second term encourages
C
that all scores in s jointly decrease. It can be approximated by log k=1 exp(sk ) ≈ maxk sk as exp(sk ) is
insignificant for all sk < maxk sk . Therefore, the loss always strongly penalizes the most active incorrect
prediction. If it is the correct prediction (i.e., sc = maxk sk ), it is not penalized because both terms roughly
cancel each out out, but it it’s the wrong term then it is penalized.

Below we have an example for this:

4
50
3
2 40

exp(sc)
1 30
sc

0 20
1 10
2 0
1 2 3 4 1 2 3 4
c c

Figure 45: Scores sc Figure 46: Exponential scores exp(sc )

On the left we see four classes and their scores predicted by the neural network. On the right we see the
exponentials of these scores. All classes that are not top score P
classes are almost insignificant, when taking
C
their exponentials. For example, the second term becomes: log k=1 exp(sk ) = 4.06 ≈ s3 = maxk sk , so the
value is very similar to the maximum.
(c) PC
Assuming c = 2 is the correct class, we obtain: log fw (x) = sc − log k=1 exp(sk ) = 1 − 4.06 ≈ −3. For c = 3
(c) PC
we obtain: log fw (x) = sc − log k=1 exp(sk ) = 4 − 4.06 ≈ 0 So we get a much larger value, if the correct
class corresponds to the class where we have assigned the highest score with our neural network, and a much
lower value if we didn’t assign the correct class.

The Relation between Softmax and Sigmoid:

When requiring the constraint that the used distribution sums to one is effectively removing one degree of
freedom. Only C − 1 parameters are necessary, because the last one can be calculated as one minus the sum
over the first ones.
Example: Consider C = 2 and fix one degree of freedom (x2 = 0):

exp(x1 ) exp(x2 )
softmax(x) = ,
exp(x1 ) + exp(x2 ) exp(x1 ) + exp(x2 )

exp(x1 ) 1
= ,
exp(x1 ) + 1 exp(x1 ) + 1

1 1
= ,1 −
1 + exp(−x1 ) 1 + exp(−x1 )
= (σ(x1 ), 1 − σ(x1 ))

As seen above, the softmax can be rewritten, such that the first expression is the expression of the Sigmoid
function. This means that the Softmax is effectively a multiclass generalization of the sigmoid function.
In practice, the overparameterized version where all scores are predicted is often used, as it is simpler to
implement and doesn’t make a big difference.

The name Softmax is somewhat confusing, with ”soft argmax” being a more precise name, as it is a
continuous and differentiable version of the argmax function in one-hot representation:
!
exp(s1 ) exp(sC )
softmax(s) = PC , . . . , PC
k=1 exp(sk ) k=1 exp(sk )

41
Example with 4 classes:
4 1.0 1.0
50
3 0.8 0.8
2 40
0.6 0.6

softmax

argmax
exp(sc)
1 30
sc

0 20 0.4 0.4
1 10 0.2 0.2
2 0 0.0 0.0
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
c c c c
Here we first see the scores on the left, then the exponentials of these scores, then the softmax and then the
argmax or one-hot encoding for category 3. As we can see, the softmax and the argmax are almost the same,
so the softmax function can be regarded as an approximation of the argmax.

We have seen that the softmax responds to differences between inputs. It is also invariant to adding
the same scalar to all it’s inputs:
softmax(x) = softmax(x + c)
We can therefore derive a numerically more stable variant:

softmax(x) = softmax(x − max xk )

k=1..L

For all inputs, we subtract the maximum of all the inputs/scores of the softmax. This allows for accurate
computation even with limited fixed precision, even when the x become large. It illustrates again, that the
softmax depends only on the differences between individual scores and not on some global term that we add or
subtract.

Cross Entropy Loss with Softmax Example:

The Cross Entropy Loss for a single training sample (x, y) ∈ X is:
C
X
(c)
CE Loss: −yc log fw (x)
c=1

Example: Suppose we have 4 classes, so C = 4 and 4 training samples x with labels y represented in one-hot
encoding:
Input x Label y Predicted scores s softmax(s) CE Loss
(1, 0, 0, 0)> (+3, +1, −1, −1)> (0.85, 0.12, 0.02, 0.02)> 0.16
(0, 1, 0, 0)> (+3, +3, +1, +0)> (0.46, 0.46, 0.06, 0.02)> 0.78
(0, 0, 1, 0)> (+1, +1, +1, +1)> (0.25, 0.25, 0.25, 0.25)> 1.38
(0, 0, 0, 1)> (+3, +2, +3, −1)> (0.42, 0.16, 0.42, 0.01)> 4.87
Suppose we have a model making predictions (third column): For the first input it does a good job at predicting
the correct class, for the second one it is uncertain between 1 or 2, for the third it is uncertain about all possible
classes and for the last one the model actually predicts the wrong class.
After computing the softmax of these scores, we can then compute the Cross-Entropy loss. We can see, that
the CE-loss is relatively small for the first example, as the the model assigned the correct class with high
probability. For the second example, the CE-loss decreases, as the uncertainty increases. In the third it’s
increasing even more, as the uncertainty is also bigger. In the final example the model predicts the wrong class,
resulting in a very high CE-loss. If this would be a minibatch in our stochastic gradient descent optimization,
then sample 4 would contribute most strongly to the loss function.

Summary: Output Layer and Loss Function for Classification

For the output layer for two classes we can either predict a single value and use the Sigmoid function, or use
the overparameterized representation and predict two values, applying the softmax afterwards to normalize the
distribution. For more than two classes we use the general cross-entropy loss.

4.2 Activation Functions

After having focused on the output layer and loss function, we will now move on to the hidden layers. The
parameters for the hidden layers are the number of hidden layers and the number of nodes in each of these
hidden layers, but also the activation function used for the hidden layer.
A hidden layer can be defined as hi = g(Ai hi−1 + bi ) with a non-linear activation function g(·) and weights
Ai , bi . The activation function is frequently applied element-wise to its input, but not always. The activation

42
functions must be non-linear to learn non-linear mappings. Some activation functions are not differentiable
everywhere, while still suitable for training.

The Sigmoid Activation Function

The first activation function we consider is the Sigmoid (see Fig. 47), which we have already seen in the context
of logistic regression. It is defined as
1
g(x) =
1 + exp(−x)

1.0 Sigmoid

0.8

0.6
g(x)

0.4

0.2

0.0
10 5 0 5 10
x

Figure 47: Sigmoid Activation Function

The Sigmoid maps the input to the range [0, 1], which can be interpreted as a probability or analogously to the
saturated “firing rate” of neurons in the brain.
It does come with some problems however: The saturation on the higher and lower ends “kills” the gradients,
which leads to problems in backpropagation. The second problem is that the output is not zero-centered, but
between 0 and 1, thus introducing a bias from the first layer on (positivity bias).

Problem 1: Saturation killing Gradients:

Consider the activation function in the context of its computation graph (see Fig. 48) with input x, activation

Figure 48: Sigmoid Problem 1

function g and output y.

When backpropagating we want to calculate the gradients of the loss function with respect to all nodes, so we
are backpropagating gradients through the graph. If the input x to g is very small (< −10) or very large (> 10)
we talk about the activation function being saturated. This means the gradient is almost zero: g 0 (x) ≈ 0.
If this happens, the backpropagated gradient will stop, as the products will be close to 0.

Problem 2: The Sigmoid is not zero-centered:

43
Allowed
Gradient
Update
Directions

Allowed
Gradient Update
Path
Update
Directions
Optimal Update

Figure 49: Sigmoid Problem 2 Figure 50: Sigmoid Problem 2: Inefficient Updates

Consider the sigmoid function, which is operating on a linear layer:

1 X
g(x) = x= a i xi + b
1 + exp(−x) i

We want to compute the loss function to the parameters a (see Fig. 49). If we have multiple hidden layers
with a Sigmoid activation function for each of them, we know that the output of each hidden layer is always
positive. Thus, the input to the next layer is also always positive. Furthermore, the gradient of the Sigmoid is
also always positive. This means, that the gradient wrt. the parameters ai is given by:
∂L ∂L ∂g ∂L ∂g ∂x ∂L ∂g
= = = xi
∂ai ∂g ∂ai ∂g ∂x ∂ai ∂g ∂x

As we know, all xi are positive, and the gradient of the sigmoid is also always positive, the blue terms are all
positive. Therefore, sgn(∂L/∂ai ) = sgn(∂L/∂g), so all gradients have the same sign (+ or -).
The problem with this is, that it restricts the space in which gradient updates can happen and leads to
effectively very inefficient optimization (see. Fig. 50). This problem becomes even bigger in higher-dimensional
space, as the subset of possible directions to move into is even smaller relative to the full space. This problem
can be somewhat alleviated by using minibatches.

The Tanh Activation Function:

1.00 Tanh
0.75
0.50
0.25
0.00
g(x)

0.25
0.50
0.75
1.00
10 5 0 5 10
x

Figure 51: The Tanh Activation Function

To adress the problems of the Sigmoid function, the Tanh function has been proposed (see Fig. 51):
2
g(x) = −1
1 + exp(−2x)

It is generally pretty similar to the Sigmoid function, but the output range is fundamentally different. It maps
the input to range [−1, 1]. This is an anti-symmetric mapping from the input domain to the output. The
advantage of this is, that it makes the function zero-centered, so we have negative and positive output values.
This alleviates problem 2 of the Sigmoid, but is also a saturating activation function, therefore also killing
gradients.

The Rectified Linear Unit (ReLU) Activation Function:

44
10 ReLU

g(x)
4

0
10 5 0 5 10
x

Figure 52: The ReLU Activation Function

Another activation function that has been propsed is the Rectified Linear Unit (see Fig. 52), which is the most
commonly used activation function in practice. It is defined as:

g(x) = max(0, x)

It is called Rectified Linear Unit as we have a linear component, but all the values x > 0 are capped, so it does
not saturate. It is not differentiable at 0, but that does not matter for training as we are not going to query
that function exactly at 0.
This function typically leads to much faster convergence than the Sigmoid or Tanh activation functions, and is
also computationally very efficient.
It is also not zero-centered however. Also, there is no learning for x < 0 ⇒, so if the input becomes smaller
than 0, then these ReLus become so called ”dead ReLUs”. It often happens in practice, that some neurons are
dead and don’t learn anymore.

Figure 53: ReLU Problem: Dead ReLUs

As shown in Fig. 53, whenever we have an input value x < 0 this activation function is 0, so the downstream
gradients are also going to be zero, so there is no learning. For this reason, we often initialize these ReLUs
with a positive bias (b > 0).

The Leaky ReLU Activation Function:

One way to prevent dead ReLUs is to use the Leaky ReLU. It is defined as:

10 Leaky ReLU

6
g(x)

0
10 5 0 5 10
x

Figure 54: The Leaky Relu Activation Function

g(x) = max(0.01x, x)

It is generally very similar to the normal ReLU, except that now there is a slope for all values x < 0 as well.
The advantage of this activation function is that it does not saturate as well, so the gradients won’t die. It is

45
also closer to zero-centered outputs, leads to fast convergence and is computationally efficient.
The Parametric ReLU Activation Function;
There are more alternatives to the Leaky ReLU, for instance the Parametric ReLU as a generalization of the
Leaky ReLU, where the factor α is itself a learnable parameter:

g(x) = max(αx, x)

It shares the advantages of the Leaky ReLU activation function.

The Exponential Linear Units (ELU) Activation Function:

10 ELU

6
g(x)

0
10 5 0 5 10
x

Figure 55: The Exponential Linear Units Activation Function

Another activation function that has been proposed is the Exponential Linear Units function (see Fig. 55).
It is very similar to the Leaky ReLU again, with the difference that it is also differentiable at 0 and has a
saturation for small x, which adds robustness to noise for some problems. Otherwise it shares the benefits with
the leaky ReLU. Its default α = 1.

The Maxout Activation Function:

Rectiﬁer Absolute Quadratic

Figure 56: The Maxout Activation Function

A generalization of the ReLU activation function is the Maxout activation function as proposed by Goodfellow
et al.:

g(x) = max(a> >

1 x + b1 , a2 x + b2 )

The Maxout function is using multiple affine predictions as an input (for example two in the case above, a>
1
and a>
2 . It thus increases number of parameters per function, which can be a disadvantage.

Summary Activation Functions:

There is no one-size-fits-all activation function, the choice highly depends on the problem. The activation
functions discussed above only represent the most common ones, but there exist many more. The best
activation function/model is often found using trial-and-error in practice. It is important to ensure a good ”gra-
dient flow” during optimization, so that the gradient flows backwards to all the parameters that we want to learn.

As a rule of thumb it is advisable to use ReLU with a sufficiently small learning rate by default, but
Leaky ReLU, Maxout or ELU can be tried out for some small potential additional gain. In general, Tanh
should be preferred over Sigmoid, and is often used in recurrent models.

46
(a) Original data (b) Zero-centered (c) Normalized

(d) Decorrelated (e) Whitened

Figure 57: The effect of different data preprocessing techniques.

Implementation of Activation Functions:

When implementing the backward pass of an activation, output or loss function it is important to ensure, that
all the gradients are imlemented correctly. The default way to make sure the gradients are correct is to plot
them and to verify them via computing numerical difference using Newton’s difference quotient:

∂f (x) f (x + h) − f (x)
= lim
∂x h→0 h

Or the Symmetric difference quotient:

∂f (x) f (x + h) − f (x − h)
= lim
∂x h→0 2h

For a particular h we can evaluate this expression and compare it to the analytic gradient that we have
implemented and verify it.
But how to choose h? For h = 0 the expression is undefined, so it has to be chosen bigger than 0 but small
enough. When choosing h too small rounding errors may occur due to the finite precision of the data types. If
its chosen too large we may
√ get approximation errors because the approximation for the secant is wrong.
A good choice is usually 3 with the machine precision. Examples are = 6 × 10−8 for single precision (32
bit) and = 1 × 10−16 for double precision (64 bit). (Example ommitted due to space constraints).

4.2.1 Data Preprocessing

Data preprocessing is an integral part of the machine learning pipeline that aims to standardize the input data
such that it is ”well distributed” in the input space. This terms also refers to considerations about how to handle
missing or inconsistent data and different data formats. Those aspects will not be discussed at this point. As we
have seen before neural networks can be sensitive to differences in the magnitude of input features. Furthermore,
preprocessing steps like centering serves numerical stability as it is the case for the softmax function.
The most common preprocessing techniques are zero-centering and normalization. Zero-centering refers to
subtracting the feature-wise mean from all data points:
N
1 X
xi,j ← xi,j − µj with µj = xi,j
N i=1

The result of zero-centering is visualized in Fig. 57b. The original data is shifted such that the mean is 0
for every feature. While being one of the most commonly applied preprocessing steps, note that the exact
implementation of centering can vary slightly in practice: For AlexNet the overall mean image (W × H × 3
numbers) is subtracted. VGGNet and ResNet compute a per-channel mean (mean along each channel: 3
numbers) that is subtracted from the training data for preprocessing. In ResNet, data points are additionally

47
(a) Small Random Values

(b) Large Random Values

Figure 58: The effect of different initialization choices. Tahn was used as activation function for all examples.

normalized using a per-channel standard deviation. Normalization refers to transforming the data to have
unit variance. This can be achieved by dividing the data points by the feature-wise standard variance:
N
1 X
xi,j ← xi,j /σj with σj2 = (xi,j − µj )2
N i=1

The effect is depicted in Fig. 57c. Two further techniques that require to compute the eigenvalues of the
covariance matrix are decorrelation (Fig. 57d) and whitening (Fig. 57e). Decorrelation refers to multiplying
with the eigenvectors of the covariance matrix. Dividing by the square root of the eigenvalues of the covariance
matrix is called whitening. Both, decorrelation and whitening, are less commonly used.

4.2.2 Weight Initialization

The typical training process for neural networks comprises weight initialization as first step to start the training.
This already suggests that the initialization plays a crucial role in the models performance. Indeed, there are
several strategies, some of which will be presented in the following. Efficient parameter initialization remains
an active field of research up until today.

Constant initialization Naively, initializing all weights with a constant value (e.g. 0) seems to be an easily
and practicable idea. Unfortunately, any constant initialization scheme will perform very poorly as it hinders
learning drastically. In the forward pass all weights will have the same influence and thus receive the same
correction signal in the backward pass. This will be the case throughout training and causes all neurons to
learn the same, thus significantly reducing the expressiveness of the neural network.

Random initialization An alternative approach is to initialize the weights at random. However, even with
random numbers it is important to choose them appropriately.
Small Random Numbers: In order to achieve initial weights not equal but close to zero, weights could be
drawn from a Gaussian with small standard deviation (e.g. σ = 0.01). Unfortunately, this leads to the problem
of vanishing gradients, which refers to an exponential decrease of the activation with deeper layers (see Fig. 58a).
Since in backpropagation the error signal is multiplied with the respective activation, the signal that will be
passed on to subsequent layers will get small at an exponential scale and deeper layer will no longer receive

48
meaningful gradients. This causes learning to be very slow or even diverge. Using the chain rule, this can be
∂L ∂g ∂L ∂g
seen by setting the activation xi to 0, which also causes the gradient to be 0: ∂a i
= ∂L
∂g ∂x xi = ∂g ∂x 0 = 0
Large Random Numbers: Reacting to the problem of vanishing gradients the next obvious idea to try would
be to initialize the weights with large random numbers, i.e. draw them independently from a Gaussian with large
standard deviation (e.g. σ = 0.2). This however can cause exploding gradients as the backpropagated gradients
will be multiplied with large values in every layer and can cause the network to oscillate. When using saturating
activation functions large random weights cause all activation functions to saturate such that no meaningful
gradient can be backpropagated which again hinders learning (see Fig. 58b). Looking at the chain rule, this
∂g ∂L ∂g
becomes by setting the local gradient ∂x to 0 (saturated activation function): ∂ai
= ∂L ∂L
∂g ∂x xi = ∂g 0 xi = 0

Xavier initialization A more sophisticated initialization strategy is Xavier initialization which was proposed
by Glorot et al. [6]. The main idea is to adapt the magnitude of the initial weights to the input size of
the respective layer. The authors propose to draw weights independently from a Gaussian distribution with
σ 2 = 1/Din , where Din denotes the dimension of the input to the respective layer. Note that this may vary
across layers
√ as their input size varies. A result of this initialization is shown in figure Fig. 58c. The choice of
σ = 1/ Din can be motivated theoretically. Let us consider y = g(w> x) and assume that all xi and wi are
independent and identically (i.i.d.) distributed with zero mean. Let further g 0 (0) = 1. Then:

Var(y) ≈ Var(w> x) = Din Var(xi wi )

= Din (E[x2i wi2 ] − E[xi wi ]2 ) plug in formula
= Din (E[x2i ] E[wi2 ] 2 2
− E[xi ] E[wi ] ) xi , wi independent
= Din E[x2i ] E[wi2 ] zero mean
= Din Var(xi ) Var(wi )

Thus:
Var(wi ) = 1/Din ⇒ Var(y) = Var(xi )
It is important to note that the Xavier initialization assumes zero centered activation function. This is partic-
uarly visible when comparing the resulting activations when using tanh versus ReLU as activation function. For
tanh the activations distribution appears to be well scaled across all layers. For ReLU and variants, collapsing
activations are observable for deeper layers. Thus Xavier initialization should be used for zero-centered action
function. For ReLU activation functions, a slightly adapted initialization strategy was proposed by He et al. [7].

He initialization As ReLU is restricted to the positive real line, He et al. [7] adapt the Xavier initialization
by doubling the variance. Weights are now drawn from a Gaussian with σ 2 = 2/Din . For ReLU activation
functions this leads to a well scaled activation distribution across all layers.

5 Regularization
5.1 Problem Statement
The primary goal of machine learning techniques (for e.g. polynomial curve fitting, image classification using
MLP’s) is to learn a model from the training data which achieves low generalization error i.e. learn a model
which performs well on new previously unseen samples (test set: assumed to be drawn i.i.d. - independently
and identically from the true data distribution) and not just on the training set.
Now that we have formalized what we want our training model to do, the important question still remains what
should the capacity/ complexity of our training model be to achieve such a low generalization error ? The best
way to answer this is to look at the bias-variance trade-off curve plotted against model complexity and to think
of generalization error in terms of bias/variance. (Fig. 59)

If we choose to train a model with low capacity (lower number of parameters/ weights, low variance) then we
risk underfitting (Fig. 60) i.e. the model is too simple, has high bias and can not even fit our training data
properly let alone perform well on the test set.

Takeaway: Low variance + High bias = High generalization error

On the other hand if we chose a model with very high capacity (higher number of parameters/ weights, high
variance) then we risk overfitting (Fig. 61) i.e. the model has high variance and has fit too well to our training

49
Figure 59: Bias-Variance Trade-Off Curve
1.5
M=1 Ground Truth
Noisy Observations
1.0 Polynomial Fit
Test Set
0.5

0.0
y

0.5

1.0

1.5
0.0 0.2 0.4 0.6 0.8 1.0
x

Figure 60: Example of Underfitting Generalization Error is High

data because of which it cannot generalize well to new unseen samples leading to high test error (=generalization
error).
Takeaway: High variance + Low bias = High generalization error

From the curve(Fig. 59) we can see that the optimal model complexity for which our model would obtain the
lowest total error is also when we have both low bias and low variance. Since at this optimal model capacity
we have both low bias and low variance we can also be confident about obtaining low generalization error.

Takeaway: Low variance + Low bias = Low generalization error (desired)

For example: In the case of fitting polynomial functions we see in Fig. 62 that when the degree of the polynomial
is 3 (optimal model complexity), both the variance and bias are low (because training set error is low) along
with the lowest generalization error (test set error).

So how do we find this sweet-spot (optimum model complexity) empirically for which we have both low bias/-
variance and low generalization error ? This is exactly where the technique of regularization figures in.

5.2 Intuition
We have two logical ways to empirically find the optimal model capacity. One way would be to start off with a
low capacity model (high bias, high training error, underfitting regime) and iteratively increase its complexity
(increase variance / reduce bias) till we obtain an optimal model which has the lowest generalization error and
low training error. The other way would be the exact opposite, i.e. start off with a high capacity model (high

50
1.5
M=9 Ground Truth
Noisy Observations
1.0 Polynomial Fit
Test Set
0.5

0.0

y
0.5

1.0

1.5
0.0 0.2 0.4 0.6 0.8 1.0
x

Figure 61: Example of Overfitting Training Error is Low but Generalization Error is High

1.5
M=3 Ground Truth
Noisy Observations
1.0 Polynomial Fit
Test Set
0.5

0.0
y

0.5

1.0

1.5
0.0 0.2 0.4 0.6 0.8 1.0
x

Figure 62: Example of Good Fit Generalization Error and Training Error are Low

variance, low training error, overfitting) and iteratively reduce its capacity (reduce variance / increase bias) to
obtain an optimal model with lowest generalization error.
In practise, the second option is highly preferred purely for the fact that if we start with a low capacity model
i.e. a model which underfits our data, we can never be sure about how much capacity needs to be added such
that both training error and test error are low. Whereas in the second option we have one less variable to worry
about i.e. the training error, as when we have a model which overfits our training data we can be very confident
that reducing our model capacity by a bit will not affect the training error drastically.
Now that we have chosen the second option i.e. starting with a high capacity model, we have two ways to reduce
the model capacity viz. either reduce the variance OR increase the bias of the model.
Important Note: As can be seen in the bias-variance trade-off curve(Fig. ??), increasing bias would reduce
variance and vice-versa. Our goal is to find the sweet spot where both variance and bias are low as not only
would it lead to low total error but also low generalization error.
Again in practise, increasing the bias is highly preferred over reducing the variance of a model since there is no
way to determine which and how many parameters/ weights need to be removed to decrease model capacity to
the desired optimal capacity. Whereas there are many simple methods to increase the bias of a model. This is
exactly what the concept of regularization is all about.(Fig. 63)

5.3 Concept
Regularization is a technique that trades increased bias for reduced variance of a model. When we apply
regularization to a high capacity model which is overfitting the data (high variance, very low bias), the idea
is that increasing the bias from a very low value to a slightly larger value would decrease the variance of the
model as per the bias-variance trade-off i.e. reduce the model capacity without increasing its bias drastically.

51
1.5
M = 9, = 10 8 Ground Truth
Noisy Observations
1.0 Polynomial Fit
Test Set
0.5

0.0

y
0.5

1.0

1.5
0.0 0.2 0.4 0.6 0.8 1.0
x

Figure 63: A High Capacity Model after Regularization

After applying regularization to an overfitting model, we hope to end up with an optimal model which has
both low bias/ variance (=low generalization error) and low total error as is desired. In other words the goal of
regularization is to minimize the generalization error when using large models (high complexity models).
Another good way to understand regularization is to visualize what it does in function space, as can be seen in
the Fig. 64.

Solution

Regularizer

Data
Function Space

Figure 64: Visualizing Regularization in Function Space

1. In the figure, the set Function Space denotes the family of all possible functions that our model can
represent.
2. The intersection between the Function Space and Data denotes the subset of functions within the function
space which are constrained by the Data, i.e. the family of functions belonging to the function space of
our model that maximize the likelihood (minimize negative log-likelihood loss) of the given data.
(Note: Since we do not know what the true generative distribution of our data is, it need not completely
overlap with the function space. Real-life data can be arbitrarily complex and it is very likely that it may
not be able to be fully represented by even the most complex/ deep models. Our goal is to always find a
good functional approximation which fits the data well.)
3. The Regularizer is an additional soft constraint on the function space and forces the model to represent
functions which are close to the constraint boundary.
4. The solution that we want i.e. the optimal model lies within the intersection of the Function Space, Data
and Regularizer.
In summary, both the Data and Regularizer should be viewed as constraints on the Function Space. The goal
of introducing the regularizer is to encourage the model to learn a function which definitely fits the data well
but most importantly is also not too complex.

5.4 Types of Regularization

There are many techniques to perform regularization, i.e. introduce bias into the model so that its variance can
be reduced. Below are some of the most popular regularization techniques.

52
5.5 Parameter Penalties
Let X = (X, y) denote the dataset and w the model parameters. (Note: In the case of a Multi-Layer Perceptron
or a Deep Neural Network w is the vector which contains all the flattened weight matrices of each layer stacked
together into a single column vector) We can limit the end model capacity by adding a parameter norm
penalty R to the loss L
L̃(X , w) = L(X , w) + α R(w) (22)
| {z } | {z } | {z }
Total Loss Original Loss Regularizer

where α ∈ [0, ∞) controls the strength of the regularizer.

In the above loss, α is a hyper-parameter and has to be found empirically by performing cross-validation and
then choosing a value which yields best generalization performance.
Important Note: The above loss is called the ”parameter” penalty loss as the Regularizer R in the loss does
not depend on the dataset X and penalizes only the parameter vector w i.e. the parameters of the model.

Some important points about the Parameter Penalty loss:

• R quantifies the size of the parameters / model capacity, as by penalizing the parameters the regularizer
forces only a few parameters to be active effectively reducing the original model capacity.
• Minimizing L̃ will decrease both L and R. What this means is that we want to end up with a model
which not only fits the data well (because L is minimized) but is also not too complex ((because R is
minimized).

• Typically, R is applied only to the weights (not the bias) of the affine layers of the model. The two
important reasons for doing this are
– The bias term decides how far away the output of the affine transform layer should be from the origin.
Adding the bias term to the regularizer would force the bias of each layer to be close to 0, effectively
constraining the meaningful solution space and thus hampering the representational capacity of the
model.
– Since we typically only have one bias term per layer, the number of bias terms is minuscule when
compared to the number of weight parameters, therefore not constraining the bias would make no
significant difference to our optimization objective.

• Often, R drives weights closer to the origin (in absence of prior knowledge). Since typically no prior
knowledge about the problem is available, constraining the weights to be close to 0 is the best possible
practice. This is because smaller weights lead to less complicated solutions. Thus the regularizer R
encourages the model to be as simple as possible.

5.5.1 Intuition behind Parameter Penalties

Why do we want the weights/inputs to be small?
• Suppose x1 and x2 are nearly identical.
The following two networks make nearly the same predictions:

Figure 65: Example to explain intuition behind penalizing weights to be small.

• But the second network might predict wrongly if the test distribution is slightly different (x1 and x2
match less closely) ⇒ Worse generalization. The reasoning behind this statement is that even though

53
both networks give the same predictions the second network does this by using a complex model i.e. large
weights, whereas the first network does the same using a less complex model i.e. small weights. Now if
both these models were to see a sample which would be slightly different than the test distribution i.e.
if the input varies by a bit, the difference in activations between the current sample and a sample from
the test set would be extremely high for the second network given the large weights, possibly leading to
wrong predictions. Whereas for the first network the difference in activations between the current sample
and any sample from the test set would be lower given the small weights, very likely leading to a correct
prediction. Thus, small weights also add some kind of robustness to small changes in the input to the
first network consequently leading to better generalization performance.

Objective Function

Optimum

Regularizer

Figure 66: Visualizing parameter penalties in parameter space.

What does parameter penalization look like in parameter space? In the Fig. 66, the green contour plot
is that of the objective function and the red contour plot is that of the regularizer. If there were no regularizer
applied, ideally one would expect the Optimum point to be at the centre of the green contour plot, given how we
would be minimizing just the original loss. However, when we minimize the loss L̃(X , w) in (22), we additionally
| {z }
Total Loss
also minimize the regularizer. Given this joint loss objective, the optimum now shifts to the current location
so as to minimize the regularizer loss. As can be seen in Fig. 66, adding the regularizer forces the value of w1
to increase by a bit from the previous optimum at the centre of the green plot and the value of w2 to decrease
significantly to the current location, so as to minimize the regularizer loss. Thus, these two competing losses
i.e. the original loss and the regularizer force the optimum to shift to somewhere in-between the two plots.

5.5.2 L2 Regularization
Also known as Weight decay (=ridge regression, in the context of linear regression) uses an L2 norm penalty
2
R(w) = 12 kwk2 : (All variables below are the same as defined in (22))

L̃(X , w) = L(X , w) + α R(w)

α
= L(X , w) + w> w....(replacing R(w) with the L2 norm)
2

The parameter updates during gradient descent are given by:

wt+1 = wt − η ∇w L̃(X , wt )
= wt − η ∇w L(X , wt ) + α wt

= (1 − η α) wt − η ∇w L(X , wt )....(re-arranging terms)

54
,where
wt+1 are the updated weights obtained after performing a single gradient descent step on wt and,
η is the step-size i.e. learning rate
Important Note: As can be observed from the final gradient descent update step, there is a slight difference
here as compared to the normal gradient descent step done without L2 regularization. The first term in the
L2 parameter update step is (1 − η α)wt as opposed to just wt used in the unregularized parameter updates.
Now, since both 0 < η, α < 1, their product η α < 1, therefore we know that (1 − η α) is a number which is
slightly lesser than 1. Since we are multiplying the weight vector wt with a quantity lesser than 1 before each
parameter update, we are effectively decaying the weights of the model at each training iteration before the
gradient update while performing L2 regularization.
We have now seen what happens during a single gradient update step while using an L2 regularizer.

What happens while using an L2 regularizer over the the entire course of training?

Let w∗ = argmin w L(X , w) denote the solution to the unregularized objective and consider a
quadratic approximation L̂ of the unregularized loss L around w∗ . The quadratic approximation is a
multi-variate Taylor series expansion around the optimum w∗
1
L̂(X , w) = L(X , w∗ ) + g> (w − w∗ )Linear Component + (w − w∗ )> H(w − w∗ )Quadratic Contribution
2
1
= L(X , w∗ ) + (w − w∗ )> H(w − w∗ )
2
with gradient vector g = 0(since the gradient at the optimum w∗ is 0) and semi-positive Hessian matrix H
(second-order derivatives).
When including the regularization term, this approximation becomes:
1 α
L̂(X , w) = L(X , w∗ ) + (w − w∗ )> H(w − w∗ ) + w> w
2 2

The minimum w̃ of the regularized objective L̂(X , w) is attained at ∇w L̂(X , w) = 0:

∇w L̂(X , w̃) = H(w̃ − w∗ ) + αw̃ = 0....(L(X , w∗ ) does not depend on w, thus its derivative wrt. w is 0)
(H + αI)w̃ = Hw∗
w̃ = (H + αI)−1 Hw∗

Thus, as α approaches 0, the regularized solution w̃ approaches w∗ . Note: This is because when α = 0,
w̃ = (H)−1 Hw∗ = Iw∗ This statement also intuitively makes a lot of sense, because as the influence of the
regularizer approaches 0, the current optimum in Fig. 66 would start shifting back to the centre of the green
contour plot i.e. the optimum of the unregularized objective.

What happens if α grows instead?

Consider the eigen-decomposition H = QΛQ> of the symmetric Hessian matrix into
a diagonal matrix of eigenvalues Λ and an orthonormal basis of eigenvectors Q
Note: The Hessian matrix is a symmetric matrix with real values, because according to Schwarz’s theorem
the partial derivatives of each element in the matrix can be swapped, making the matrix symmetric. Since the
matrix is symmetric and real, we can apply spectral decomposition.

w̃ = (H + αI)−1 Hw∗
−1
= QΛQ> + αI QΛQ> w∗
−1
= Q (Λ + αI) Q> QΛQ> w∗ ......(substitute I = QQ> in above equation)

−1
= Q (Λ + αI) ΛQ> w∗ .....(Apply (ABC)−1 = C−1 B−1 A−1 in above equation, also Q−1 = Q> )

Since Λ is a diagonal matrix containing the eigenvalues of H, we know that (Λ + αI)−1 is also a diagonal matrix
with its diagonal components being λi1+α where λi is the i-th diagonal entry of Λ, i.e. the i-th eigenvalue of
the matrix H.
With this in mind lets read the last equation from the R.H.S,
Q> w∗ signifies a change of basis of the vector w∗ onto the eigen-basis of H, i.e. every component of w∗ is now
aligned with the respective eigenvectors of H.
Applying (Λ + αI)−1 Λ to Q> w∗ would thus mean that the component of Q> w∗ that is aligned with the i-th
eigenvector of H is rescaled by a factor of λiλ+α
i
.

55
Now multiplying this scaled vector by Q simply signifies a change of basis from the eigenbasis of H back to the
original basis.
Thus, in summary what the additional term of L2 regularization does is that it takes every component of w∗
that is aligned with the i-th eigenvector of H and rescales it by a factor of λiλ+α
i
. From the previous term it is
easy to see that Regularization affects directions with small eigenvalues λi α the most.

Visual Representation of L2 Regularization

Unregularized Objective

L2 Regularizer

Figure 67: Visualizing L2 regularization in parameter space.

• Contours of unregularized objective L(X , w) and L2 regularizer R(w)

• At w̃, the competing objectives reach an equilibrium (solution to regularized loss)
• Along w1 , eigenvalue of H is small (low curvature) ⇒ strong effect of regularizer. Value changes much
more in the direction of w1 to reach w̃
• Along w2 , eigenvalue of H is large (high curvature) ⇒ small effect of regularizer. Value changes very
less in the direction of w2 to reach w̃, as even small changes would incur large penalties.

5.5.3 L1 Regularization
The L1 regularization is another parameter penalization method to increase bias in a model. Unlike L2 regu-
larization, the L1 regularizer R(w) = kwk1 , tries to minimize the L1 norm of the weight vector w instead of
minimizing its L2 norm. Since the regularization objectives for both regularizers are different, their contour
plots are different as well as can be seen in Fig. 68 and Fig. 67.

Unregularized Objective

L1 Regularizer

Figure 68: Visualizing L1 regularization in parameter space.

• Contours of unregularized objective L(X , w) and L1 regularizer R(w)

• At w̃, the competing objectives reach an equilibrium (solution to regularized loss)
• L1 regularized loss function: L̃(X , w) = L(X , w) + αkwk1

56
• L1 Regularization results in a solution which is more sparse (compared to L2 )
Important Note: If we observe the contour plots of both the L1 regularizer (Fig. 68) and the L2
regularizer(Fig. 67) closely we can see that the optimum of the regularized objective w̃ in the L1 case is
closer to the origin along the direction of w1 than it is in the case of L2. What this signifies is that the L1
regularizer tries to squeeze and set most of the parameters as close to 0 as possible (w1 in the L1 example)
and allots some amount of flexibility and freedom to the remaining parameters (w2 in this example, which
in-fact increased by a bit to reach w̃ in the L1 case). Since the output weight vector after applying L1
regularization would consist of mostly 0’s, it is said to encourage ”sparsity” of solutions.

5.5.4 L2 vs. L1 Regularization

Example: Assume 3 input features: x = (1, 2, 1)>
The following two linear classifiers fw (x) = σ(w> x) yield the same result/loss:

• w1 = (0, 0.75, 0)> ⇒ ignores 2 features

• w2 = (0.25, 0.5, 0.25)> ⇒ takes all features into account
But the L1 and L2 regularizer prefer different solutions!
L2 Regularization: L1 Regularization:

• kw1 k2 = 0 + 0.752 + 0 = 0.5625 • kw1 k1 = 0 + 0.75 + 0 = 0.75

• kw2 k2 = 0.252 + 0.52 + 0.252 = 0.375 • kw2 k1 = 0.25 + 0.5 + 0.25 = 1
As we can see in the above example, the L2 regularizer prefers the model w2 where every feature is taken
into account during activation computation, whereas the L1 regularizer prefers the model w1 which has sparse
solutions, i.e. the model which gives importance to only one feature and ignores the other two.

A good way to visualize what the L1 and L2 regularizers do is to look at Fig. 69 and Fig. 70. In both these
figures, we can think of the high level features in the orange box as being the output of the penultimate layer
of our model. The model is now supposed to make a classification decision based on these high level features.
The figures illustrate what solutions such a model trained with either an L2 or an L1 regularizer would prefer.

Figure 69: Effect of L2 regularizer in high level feature space. [9] The L2 regularizer prefers a model
which takes all the input features into account, i.e. every feature will have a corresponding weight

57
Figure 70: Effect of L1 regularizer in high level feature space. [9] Since the L1 regularizer prefers sparse
solutions, it will prefer a model which pays attention to only a few key features, i.e. the weights for the other
features are set to be close to 0 and thus ignored.

5.5.5 Interpretation of parameter penalties as MAP inference

L2 regularization can be interpreted as Bayesian maximum-a-posteriori (MAP) estimation of the network
parameters w with a Gaussian prior applied to w:

w̃ = argmax p(w|y, X)...Maximize apriori i.e. distribution of parameters given data

w
= argmax p(y|X, w) p(w)...After applying Bayes rule
w
= argmax log p(y|X, w) + log p(w)...argmax does not change as log is a monotonous function
w
= argmax log p(y|X, w) + log N (w|0, α−1 I)...Choose such a gaussian distribution with 0 mean as prior
w
α >
= argmin − log p(y|X, w) + w w...The choice of prior turns out to be equivalent to L2 penalty
w 2

With computation similar to the one done above, the L1 regularizer can also be interpreted as the MAP
inference of w with a Laplace distribution prior.

5.5.6 How does the Computation Graph of a Parameter Penalty Regularizer look like ?

Figure 71: Computation Graph for a Parameter Penalty

5.6 Early Stopping

Important Note: An iteration means that we have iterated through a single mini-batch. An epoch means
that we have iterated through all the mini-batches i.e. iterated through our complete training dataset one time.
In Fig. 72,
• While training error decreases over time, validation error starts increasing again

58
Figure 72: Typical Loss Curve

• The graph shows that training for long amounts of time does not always help in reducing validation error,
for e.g. in this case, the validation error actually steadily increases with the number of epochs, even though
the training error decreases slowly.
• Thus, the most logical approach is to: train for some time and return parameters with lowest val-
idation error i.e. store model parameters every fixed number of epochs and at the end rollback to the
saved parameters which give the lowest validation error.
• There is a small additional cost in terms of training resources and time, as to do early stopping the
validation set needs to be forward propagated every fixed number of epochs so that we can keep track of
the validation error.

5.6.1 Early Stopping vs. Parameter Penalties

Unregularized Objective

L2 Regularizer

Figure 73: Early Stopping (Left) vs. Parameter Penalties (Right)

Early stopping:
L2 Regularization:
• Dashed: Trajectory taken by SGD
(Starts from near the origin as we ini- • Regularize objective with L2 penalty
tialize our weights to be close to 0)
• Penalty forces minimum of regularized
• Trajectory stops at w̃ before reaching loss w̃ closer to origin
the minimum w∗

• Under some assumptions, both early stopping and parameter penalties can be considered equivalent, as
can also be understood intuitively by looking the similar solutions that both approaches obtain in Fig. 73

5.6.2 Summary of Early Stopping

• Most commonly used form of regularization in deep learning
• Effective, simple and computationally efficient form of regularization
• Training time can be viewed as hyperparameter ⇒ model selection problem

59
• Efficient as a single training run tests all hyperparameters (unlike weight decay)
• Only cost: periodically evaluate validation error on validation set
• To reduce this cost the Validation set can be made smaller than the training data (which is usually the
case) OR/ AND the evaluation of the validation set can be done less frequently

Remark: If little training data is available, one can perform a second training phase where the model is
retrained from scratch on all training data using the same number of training iterations determined by the early
stopping procedure. It is important to note that the early stopping procedure was carried out on a training
split (split the little training data into training/ validaton set as is done normally). We then use the number of
training iterations obtained from this procedure to then train a model on the whole dataset (no splitting into
sets unlike earlier).

5.7 Ensemble Methods

Idea:
• Train several models separately for the same task
• At inference time: average results
• Thus, often also called “model averaging”

Intuition:
• Different models make different errors on the test set
• By averaging we obtain a more robust estimate without a better model!
• Works best if models are maximally uncorrelated
• Winning entries of challenges are often ensembles (e.g., Netflix challenge), as empirically speaking it is
very likely that using ensemble methods gives a 1-2% performance improvement in most tasks
• Drawback: requires evaluation of multiple models at inference time

5.7.1 Why are Ensembles a good idea ?

Consider K regression models, each of which has an error of k ∼ N (0, Σ) with variances E[2k ] = v and
covariances E[k l ] = c. The expected square error of the ensemble predictor (with each model having
the same weight) is given as:
 !2    
1 X 1 X
2k +
X
E k  = 2 E  k l  ....Expectation of square of sum has been broken into
K K
k k l6=k

Expectation over sum of squares and sum of cross terms

 
1 X 2 X X
= 2 E k + E [k l ] ......Linearity of Expectations
K
k k l6=k
1
= 2 (Kv + K(K − 1)c) ....Plug in defined terms v and c
K
1 K −1
= v+ c .....Ensemble Error
K K

• If errors are correlated (c = v), the ensemble error becomes v ⇒ no gain (Since v is the variance of the
error of a single model, there is no performance that has been gained by ensembling)
• If errors are uncorrelated (c = 0), the ensemble error reduces to K 1
v (We have gained performance by
1
reducing the variance of the errors of each individual model in the ensemble from v to K v when the errors
are uncorrelated)

Thus:
• Ensemble maximally effective if errors maximally uncorrelated i.e. when c = 0

60
5.7.2 Different types of Ensemble Methods
• Initialization: Train networks starting from different random initialization on same dataset or using
different minibatches (via stochastic gradient descent). This often already introduces some independence.

• Model: Use different models, architectures, losses or hyperparameters

• Bagging: Train networks on different random draws (with replacement) from the original dataset. Thus,
each dataset likely misses some of the examples from the original dataset and contains some duplicates.

Example of Bagging

Figure 74: Using bagging as an ensemble method

• First model learns to detect top “loop”, second model detects bottom “loop”
• The intuition behind using these models as part of an ensemble is that, each model learns to perform a
simple task which is independent of each other (detecting top loop for the model trained on first resampled
dataset and detecting the bottom loop for the model trained on the second resampled dataset. We can
thus, average these models which learn to perform simple tasks into an ensemble which gives better
performance by averaging over the predictions of each individual model.

5.8 Dropout

Figure 75: Dropout [11]

Idea:

61
• During training, set neurons to zero with probability µ (typically µ = 0.5)
Important Note: Removing a neuron or setting it to zero means that we remove all the incoming and
outgoing connections of the neuron from the model architecture.
Also this procedure of dropout is applied to all neurons i.e. every neuron in the network will be dropped
out with probability µ.

• Each binary mask (after performing dropout we obtain a binary mask i.e. is a given neuron in the network
active or not) is one model. This is because every different binary mask would give rise to a different
model architecture.
• This probability-based dropout mask is applied in every iteration. Since in every iteration we flip a biased/
unbiased coin for every neuron to determine whether it will participate in training, the binary mask of
the complete network changes randomly with every training iteration
• Creates ensemble “on the fly” from a single network with shared parameters. While performing dropout
we get random binary masks in every iteration i.e. random models in every iteration. Thus, dropout is a
technique which allows us to take advantage of ensemble methods i.e. to average predictions of different
models (generated in each iteration) from just a single network without any of the computational overhead
introduced in traditional ensemble methods like creating/ training multiple models from scratch, storing
parameters of each model, etc.
Important Note: The ensemble created by dropout is exponentially large and grows with the number
of training iterations. This is because the probability of getting the same binary mask i.e. same model
in any future iteration is extremely low. Thus, dropout effectively adds a new model to the ensemble in
every iteration.

5.8.1 Why is Dropout a good idea ?

Figure 76: Forward Pass with Dropout [9]

• Forces the network to learn a redundant representation ⇒ regularization.

This is because the model understands that any neuron can be dropped randomly in a training iteration,
therefore it learns to not rely heavily on the output of certain neurons/features for making its predictions
given how these neurons could easily be dropped out in future iterations. This is exactly why dropout
forces the model to represent information in a more distributed or redundant way, so that this randomness
of neurons dropping out does not affect its final prediction drastically and thus also acts as a form of
regularization.

• Reduces effective model capacity ⇒ requires larger models, longer training

Since dropout forces the model to learn redundant representations, it reduces the variance of the model i.e.
reduces the capacity of the model. Thus, one downside of dropout is that because it reduces the capacity
of a model we need to use bigger networks and train longer to achieve good training performance. However
it is worth remembering, that we make this sacrifice to achieve very good generalization performance.

• Prevents co-adaptation of features (units can’t learn to undo output of others)

In classical deep learning, neurons can learn to undo the effect of other nearby neurons. For e.g. consider a
scenario where one neuron learns to have a very high activation and another nearby neuron learns to have
a very low activation. When the outputs of both these neurons are combined they effectively cancel out

62
the effect of each other. This phenomenon is called co-adaptation and is undesirable as in such a scenario
the network does not learn any useful representation, given how it is busy cancelling out its own neurons.
Whereas when using dropout since any neuron can be dropped out randomly, the network cannot learn to
change the output of nearby neurons by relying on the output of a certain neuron. Thus dropout prevents
co-adaptation of features from occurring.

• Requires only one forward pass at inference time (Explanation in Section 5.8.2)

5.8.2 Dropout at Inference Time

• Dropout makes the output random. Formally, we have:

ŷz = fw (x, z)

Important Note: The output of the neural network ŷ now not only depends on the weight vector w but
also on the binary mask z generated by dropout. Since the binary mask is random, the model architecture
is random and thus the output with dropout is also random.
Here, z is a binary mask with one element per unit drawn i.i.d. from a Bernoulli p(zi ) = µ1−zi (1 − µ)zi
where zi = 0 if neuron i is removed from the network
• At inference time, we want to calculate the ensemble prediction:
X
ŷ = fw (x) = Ez [fw (x, z)] = p(z)fw (x, z)
z

Important Note: Every binary mask generated by dropout corresponds to a new model. To take benefit
of the idea proposed in ensemble methods(Section 5.7), we then take the expectation of our individual
model predictions over the generated ensemble i.e. the set of all models (binary masks) that have been
generated by dropout.
• The number of ensembles generated by dropout is exponential in the number of neurons. That is, if we
have M neurons in our original model, we have 2M possible ensembles i.e. 2M possible binary masks.
Important Note: We have to sum over all z (i.e. over all generated binary masks) to calculate the
Ensemble Prediction using dropout.
Thus to obtain a single ensemble prediction we have to sum over 2M possible terms (since we have 2M
possible binary masks), leaving the calculation intractable. (As the number of terms in the summation
will rise exponentially with the number of neurons)

How do we solve the intractability of calculating an Ensemble Prediction while using dropout ?

Let us consider a simple linear model:

fw (x) = w1 x1 + w2 x2
fw (x, z) = z1 w1 x1 + z2 w2 x2

Note: fw (x, z) is the dropout equivalent of making a prediction, where z1 and z2 are the binary masks of a
neuron that can either be 0 (neuron has been dropped out) or 1 (neuron will remain active). Assuming µ = 0.5,
during training we optimize the expectation over the ensemble:
1 1 1 1
Ez [fw (x, z)] = (0 + 0) + (w1 x1 + 0) + (0 + w2 x2 ) + (w1 x1 + w2 x2 )
4 4 4 4
1 1
= (w1 x1 + w2 x2 ) = fw (x)
2 2

Note: We have 4 terms in the summation as we have 4 possible combinations of z1 and z2 . Each of the
terms are weighted equally ( 14 ), since the probability of getting a given combination out of the 4 possible is
p(z1 )p(z2 ) = 0.5 ∗ 0.5 = 0.25.
Important Note: In the above equation, we see that the Expected Prediction of the trained model is half
(µ) times the output of the original network. This implies that the weights learnt by the model trained using
dropout must be double (1/µ) times the weights of the original model, as only then would both the models give

63
the same prediction.
It is also important to note that dropout is only used during training. Therefore while performing inference we
want our model trained using dropout to behave the same way as the original model.
Thus, at test time, we must multiply the trained weights (which are 1/µ times the weights of the original
model) by the dropout probability µ.
Remark: This weight scaling inference is only an approximation for non-linear models. However, this approx-
imation still works really well empirically. (As can bee seen in Fig. 78)

5.8.3 Visualizing Effect of Dropout

Figure 77: Comparing the outputs of an autoencoder trained without (left) and with dropout

• Features of an Autoencoder on MNIST with a single hidden layer of 256 ReLUs

• Left: Without dropout ⇒ more co-adaptation, as many units respond to cancel out the effect leading to
noisy representation of the image
• Right: With dropout ⇒ less co-adaptation, as every unit learns to recognize information that appear in
different places in the input image leading to more meaningful features and thus better generalization

Figure 78: Dropout helps reduce generalization error

5.9 Data Augmentation

Motivation:
• Deep neural networks must be invariant to a wide variety of input variations. We want our model to
be invariant (not affected by) to input variations, as we want it to classify all inputs irrespective of their
variations correctly. For e.g. we would want our model to correctly classify both an otter lying on its back
and an otter sitting on a rock as an otter, despite the variations in these 2 images.

64
Figure 79: Variations of images from a single class

• Often in real-life data there exist large intra-class variations in terms of pose, appearance, lighting,
etc. This makes image classification in particular an extremely hard task.

How do we tackle intra-class variations and improve generalization ?

• Best way towards better generalization is
to train on more data. However, data in
practice often limited.

• Goal of data augmentation: create “fake”

data from the existing data (on the fly)
and add it to the training set. This pro-
cess is ”on the fly” as we do not store the
augmented images. The augmented images
are generated randomly (by applying ran-
dom transformations) from each image in
our batch at the beginning of every train-
ing iteration.
• New data must preserve semantics i.e.
the augmented data of a class should not
change the semantic meaning of the class
category. For e.g. augmenting the images of
a class of dogs to look like cats (with label
as dogs) is not a good idea.
• Even simple operations like translation
or adding per-pixel noise often already
greatly improve generalization
• https://github.com/aleju/imgaug is a pop-
ular library on GitHub where implementa-
tions of different image augmentation tech-
niques can be found.

5.9.1 Geometric Transformations

Image Cropping:

• Randomly crop and re-scale images to original image size

• Do not crop regions from the image which are too small, as then semantics of the image category could
be lost

65
iaa.Crop(px=(1,64))

Image Cropping and Padding:

• Similar to Image Cropping the only difference being that the cropped images are not resized, but are
instead padded with either a constant color or expanded image border colors etc. to reach the original
image size.

iaa.CropAndPad(percent=(-0.2, 0.2),pad mode=iaa.ia.ALL,pad cval=(0, 255))

Horizontal Image Flipping:

• Flip the given images horizontally based on the specified probability
• Not always good to use this augmentation naively, as it possible that semantic meaning of an image
changes after flipping. For e.g. in street scene semantic segmentation even though horizontal flipping
might be a good idea to generate more realistic data, the notion of traffic lanes would change from right
sided traffic to left sided traffic and vice-versa leading to undesirable results
• Depending on the task images can also be flipped upside-down

iaa.Fliplr(0.5)

Affine Transformation:
• Apply a single affine/ linear transformation to the 2D image space i.e. scale, rotate, shear an image etc.
• Blank spaces left after the transformation can be filled by using different strategies like using constant
color, expand image colors, replicate colors, etc.

iaa.Affine()

66
Piecewise Affine Transformation:
• Similar to Affine transformation albeit now a single transformation is not applied to the whole image
• The image is perceived as an underlying grid. A piece-wise affine transformation is applied independently
to each grid cell. This causes different regions in the image to be distorted differently

iaa.PiecewiseAffine(scale=(0.01, 0.1))

Perspective Transformation:
• Similar to Affine transformation albeit a perspective transform has more degrees of freedom
• The perspective effect causes certain regions of the image to be squeezed and other regions to be enlarged

iaa.PerspectiveTransform(scale=(0, 0.4))

5.9.2 Local Filters

Gaussian Blur:
• Applies the gaussian blur filter onto the given image with the σ of the gaussian distribution being chosen
randomly from an interval specified by the user
• Helps the model to recognize objects at different resolutions
• Also the process of capturing data using a camera induces a small amount of blur. Thus applying gaussian
blur also helps the model to be robust to such camera-induced variations in captured images

iaa.GaussianBlur(sigma=(0.0, 10.0))

Image Sharpening:
• Does the opposite of introducing blur i.e. introduces sharpness into input images

67
iaa.Sharpen(alpha=(0, 0.5), lightness=(0.75, 1.25))

Emboss Effect:
• Another effect which introduces pixel-wise light/shadow and could improve generalization performance

iaa.Emboss(alpha=(0, 1.0), strength=(0, 2.0))

Edge Detection:
• Uses the edge detected version of a given image
• Is important to verify whether the edge detected version are still reasonable and do not change / make it
impossible to decipher semantics of the class

iaa.EdgeDetect(alpha=(0, 1.0))

5.9.3 Adding Noise

A popular data augmentation technique which typically involves introducing a more structured random
per-pixel noise.

Why is studying noise important ?

• Deep Networks are highly sensitive to noise in images. Thus, it is very likely that adding even a small
amount of noise (just to the test data) would lead to drastic reduction in performance, even though to
the human eye the ”noisy” test dataset would look almost similar to the original test dataset

Important Note: In Fig. 80 we can see that when a deep network was trained on a dataset which
added uniform noise to both the train and test dataset, it managed to achieve super-human performance.
However, when the network was trained on the same dataset albeit this time with salt-and-pepper noise
(a noise where pixels of the image are set to white or black based on a specified probability), the network
managed to obtain only chance-level performance i.e. as good as random-guessing. This result is especially
surprising given how both datasets (one with uniform noise and the other with salt-and-pepper noise look
like they have similar noise distributions) and yet the difference in performance of the network between
both these visually-similar datasets is drastic. This validates the point that networks are highly susceptible
to noise.

68
• Noise is present everywhere while collecting data, for instance noise induced by camera sensors while
capturing data, noise induced by the scene illumination conditions, etc.

Figure 80: How is classification performance affected by different kinds of noise ? [5]

Following are some of the most popular additive-noise augmentation techniques to improve generalization
performance:

Gaussian Noise:

• Add a fixed Gaussian noise to each pixel in an image. The additive noise is chosen randomly for different
images

iaa.AdditiveGaussianNoise(loc=0, scale=(0.0, 0.2*255)

Salt and Pepper Noise:

• Each pixel in an image is turned into either black or white depending on a specified probability

iaa.SaltAndPepper(0.1)

Dropout Noise:
• Unlike salt-and-pepper noise which does not retain color, in dropout noise each pixel is only turned to
black based on a specified dropout probability

69
iaa.Dropout((0.01, 0.5))

Cutout Noise:

• A more structured form of noise which involves cutting-out i.e. setting pixels to a specific color of regions
in an image.

iaa.Cutout(nb iterations=(1, 5), size=0.2, squared=False)

Can noise be added only to inputs ?

Figure 81: Adding noise to hidden layers

• Noise can also be applied to the hidden units, not only to the input

• Prominent example of applying noise to the hidden inputs is : Dropout. Another example would be
adding random noise individually to each hidden unit
• The advantage of adding noise to the hidden units and not just the input is that we are adding robustness
not just to input-level features but also at intermediate levels where higher level concepts and higher level
representations of the neural network are established
• As seen in Section 5.8, adding noise to the hidden units of a network also works really well empirically to
gain better generalization performance

70
5.9.4 Color Transformations
Why are Color Transformations important ?
• Color transformations have played a key-role in the success of most neural networks that work well on the
Imagenet dataset
• Cameras produce different color spectrum’s based on the type of sensors used and the kind of white-
balancing performed. A more important reason behind this is the change in lighting conditions while
capturing data, for e.g. the colors captured by a camera would change dramatically for the same scene
depending on the time of the day i.e. sunset would have warmer colors, mid-day would have brighter colors,
etc. We thus want our models to be invariant to such variations so that our generalization performance
increases. This is exactly why color transformations are an essential form of data augmentation
Following are some of the most popular color transformations techniques to improve generalization performance:

Contrast:
• Change the contrast of images to either become faint or have stronger contrast.

iaa.LinearContrast((0.1, 2.0), per channel=0)

Brightness:
• Change the brightness of the whole images to make them either darker or brighter

iaa.Multiply((0.5, 1.5), per channel=0)

Brightness per Channel:

• Instead of changing brightness of the whole image, change brightness for each channel instead

iaa.Multiply((0.5, 1.5), per channel=0.5)

Local Brightness:
• Brightness are applied locally per channel, the locations of which are determined by a specified frequency
noise

71
iaa.FrequencyNoiseAlpha(exponent=(-4, 0), first=iaa.Multiply((0.5, 1.5), per channel=True))

Hue and Saturation:

• Change hue and saturation of images

iaa.AddToHueAndSaturation((-50, 50))

Color Inversion:
• Even though color inverted images are not a realistic choice as they look nothing like natural images,
depending on the task they could still be a good strategy to improve generalization performance

iaa.Invert(0.5, per channel=0.75)

Grayscale:
• Convert the images into a spectrum between grayscale and colored images, the strength of which is
detrmined by the specified α value

iaa.Grayscale(alpha=(0.0, 1.0))

5.9.5 Weathers
Even though weather effects might sound complex they are still very easy to compute and can be done
on-the-fly without requiring any sophisticated graphics engine.

Snow:

72
• Introduces a snow-like effect on the images

iaa.FastSnowyLandscape(lightness threshold=(100, 255),lightness multiplier=(1.0, 4.0))

Clouds:
• Introduces an effect on the images which look like clouds

iaa.Clouds()

Fogs:
• Introduces a fog effect on the images

iaa.Fog()

5.9.6 Random Combinations

In practice, all of the aforementioned augmentation techniques are combined and then applied randomly on-
the-fly to different images drawn randomly from within a mini-batch.

5.9.7 Output Transformations

In some cases it is possible that after applying a transformation on the input, we have to apply transformations
on the corresponding output/ class as well.

In Certain Classification Tasks:

• For some classification tasks, e.g., handwritten letter recognition, be careful to not apply transfor-
mations that would change the output class
• Example 1: Horizontal flips changes the interpretation of the letter ’d’:

73
Figure 82: Random Data Augmentation Combinations applied to a single input image

Horizontal
Flip

Figure 83: Random Data Augmentation Combinations applied to a single input image

• Example 2: 180◦ rotations changes the interpretation of the number ’6’:

• Remark: For general object recognition, flips and rotations can often be useful!

In other tasks with structured and more complex output

• For dense prediction tasks (depth/instance/keypoints), also transform targets. For e.g. in tasks like
instance segmentation, depth prediction or stereo depth estimation, any affine/ warping/ translation
transformations done to the input would affect the ground-truth predictions (whereas per-pixel noise
would make no difference). Thus, in such dense prediction tasks, the output need to be transformed
accordingly so that we obtain the valid ground-truth for the transformed input image

5.9.8 Important Remarks about Data Augmentation

• When comparing two networks, make sure you use the same augmentation. This is because data
augmentation is an extremely powerful strategy and could significantly improve performancce. If the
same augmentation strategy is not used in both networks, one could be misled into thinking that their
idea was responsible was better performance, whereas in reality it was just a better data augmentation
strategy
• Consider data augmentation as a part of your network design

• It is important to specify the right distributions (often done empirically). Not all transformations would
be applicable or beneficial to improving performance in a given task. Thus the kind of transformations
that would help in improving performance and their corresponding strength (hyperparameters) need to
be chosen empirically
• Can also be combined with ensemble idea:

74
180°
Rotation

Figure 84: Random Data Augmentation Combinations applied to a single input image

Figure 85: Random Data Augmentation Combinations applied to a single input image

– At training time, sample random crops/scales and train one model

– At inference time, average predictions for a fixed set of crops of the test image
• AutoAugment [4] uses reinforcement learning to find strategies automatically:

Figure 86: Reinforcement Learning to find best Data Augmentation Strategies

6 Optimization
We have previously already discussed the most basic form of optimization stochastic gradient descent and even
simpler gradient descent. In the following we will learn about some of the variants of these. As well as some
strategies on how to build up a deep learning project and on how to debug our model if there is some bug or
expected behaviour.

75
6.1 Optimization Challenges
First, let’s revisit gradient descent and in the following look at the most frequent challenges that we encounter
while optimizing a deep learning model.

6.1.1 Gradient Descent

The standard formulation for gradient descent, that we’ve already seen before. The following algorithm describes
the progress of optimization over time:
Let w be the parameters that we want to optimize, ∇w L(w) the gradients of an arbitrary loss function, η the
learning rate and t the timestep/iteration.

w0 = winit
wt+1 = wt − η ∇w L(wt )

First we initialize w0 at timestep 0 to some initial value, then in every timestep we update wt+1 by adding a
fraction of the negative gradients to the weights wt of the current timestep.

Figure 87: Gradient Descent A 1D example of a few gradient descent update steps.

In Fig. 87 we can see a 1D example loss function L(w) wrt. to the network parameters w. The loss is a non-
convex function, which is the case for most neural network losses. Therefore there are multiple local minima
and one global minima, that we want to reach in the optimal case. Although we can only find one of those
through optimization. The good news is that many local minima in deep neural networks are good ones. Here
we can nicely see why we need to add the negative gradient to progress further. The gradient at w0 is negative,
which means that if add it directly to the weight w, the weight (x-axis) would get smaller, resulting in a new
w1 that is to the left of w0 , thus worse. By adding the negative gradient, in this case a positive number, to w0
we end up at the w1 as shown in the figure, thus closer to the local minima.

6.1.2 Learning rate

In the case of a convex loss function as seen in Fig. 88, we can see one of the challenges that we face in deep
learning - choosing the right learning rate η. A learning rate too low leads to a smaller step size, which then

Figure 88: Divergence A 1D example of a few gradient descent update steps that diverge.

results in too many steps that are needed to converge in the minimum. Whereas a learning rate too high might
lead the network to divergence. Fig. 88 shows this divergence behaviour, where the step size is too large such
that we jump over the minimum, to a possibly steeper part of the loss function (higher gradient). This might
repeat itself, leading to divergence.

76
6.1.3 Cliff
Another challenge regarding the gradients is a cliff (Fig. 89), where the updated weight wt jumps over the
minimum onto a steep cliff (very high gradient), which then catapults the next updated parameters wt+1
further away from the minimum. A common heuristic to counteract such effects is to clip the gradient to a a
priori selected range, which introduces another hyperparameter we need to search.

Figure 89: Cliff A 1D example of a few gradient descent update steps that jumps off a cliff.

6.1.4 Saddle point

Figure 90: Saddle point An example for a saddle point of a simple loss function.

The saddle point (see Fig. 90) of loss functions is a tricky part, because the gradient ∇w L(w) = 0, although we
are not at a minimum - not even a local minimum. In reality, this is mostly not a problem, even though there
are many saddle points in DL, because the chance to exactly hit a saddle point is very low. For every parameter
direction the gradient needs to be exactly zero, which is unlikely when working with millions of parameters.

6.1.5 Plateau
A region similar to the saddle point is the plateau (Fig. 91), where the gradients are close to zero (e.g. saturated
sigmoid activation function, dead ReLUs).

Figure 91: Plateau An example for a plateau of a loss function.

6.1.6 Ravine
A ravine (Fig. 92) is a very narrow valley with a small gradient along the slope of the valley. The ravine in
the loss function leads to the minimum. Due to the narrow valley it is very hard easy to diverge, similar to
Section 6.1.2 and Section 6.1.3. Furthermore due to the small gradient it is hard to follow the narrow path to
the minimum.

77
Figure 92: Ravine An example for a ravine of a loss function.

6.2 Optimization Algorithms

In this section we will take a look at some optimization algorithms that are more advanced than basic (stochastic)
gradient descent and try to overcome some of the issue that we discussed in the previous Section 6.1.

6.2.1 Gradient Descent Algorithm

Let us first look at the basic gradient descent algorithm:
1. Initialize weights w0 and pick learning rate η
2. For all data points i ∈ {1, . . . , N } do:
(a) Forward propagate xi through network to calculate prediction ŷi
(b) Backpropagate to obtain gradient ∇w Li (wt ) ≡ ∇w L(ŷi , yi , wt )
3. Update gradients: wt+1 = wt − η N1 i ∇w Li (wt )
P

4. If validation error decreases, go to step 2, otherwise stop

Be aware that we update the parameters by taking the average gradient over all training samples. This creates a
problem when we use a lot of parameters and training points (e.g. 1 million or more), because the computation
of one forward pass for every point is extremely expensive. Furthermore to compute it we need to hold all of
our data in memory, which does often not fit.

6.2.2 Stochastic Gradient Descent

Luckily there is a solution for that problem - Stochastic Gradient Descent. This algorithm solves the
problem of slow computation and large memory usage and thus is the basis for the subsequent algorithms.
Stochastic gradient descent uses the fact that the total loss over the entire training set can be expressed as an
expectation:
1 X
Li (wt ) = Ei∼U {1,N } Li (wt )

N i
where a sample is drawn uniformly from the entire training set. This means that we do not need the every
training set to estimate the total loss.
We can approximate this expectation by a smaller subset - a minibatch B N (e.g. 8, 16,32,64,128 or as
large as the (GPU) memory allows) of the data:
1 X
Ei∼U {1,N } Li (wt ) ≈ Lb (wt )

B
b

Thus the total loss can also be approximated by the following formula:
1 X 1 X
∇w Li (wt ) = Ei∼U {1,N } ∇w Li (wt ) ≈ ∇w Lb (wt )

N i B
b

This represents a noisy approximation of the total loss, because training on a batch estimates the total loss,
but adds a little noise as the smaller batch sizes lead to a larger variance in the gradients. These batches can
either be chosen randomly or by partitioning the dataset. Either way they should be as independent as possible,
therefore make sure to shuffle the training set. We also introduce some terminology here to make sure we talk
about the same concept.

78
• Iteration = a single gradient update based on a single minibatch wt → wt+1
• Epoch = complete pass through the training set (= N
B iterations)
The algorithm for stochastic gradient descent now looks like this:
1. Initialize weights w0 , pick learning rate η and minibatch size |Xbatch |
2. Draw random (shuffled) minibatch {(x1 , y1 ), . . . , (xB , yB )} ⊆ X (with B N )
3. For all minibatch elements b ∈ {1, . . . , B} do:
(a) Forward propagate xb through network to calculate prediction ŷb
(b) Backpropagate to obtain batch element gradient ∇w Lb (wt ) ≡ ∇w L(ŷb , yb , wt )
4. Update gradients: wt+1 = wt − η B1 b ∇w Lb (wt )
P

5. If validation error decreases, go to step 2, otherwise stop

In the gradient update step we see that we changed the code to average over a minibatch and not the full
dataset anymore.

Loss
1.5 Initialization
1.0 1.300 1.500
1.100 Minimum
0.5 0.300
0.0
w2

0.010
0.100
0.5 0.500
0.700 0.900
1.0 1.100
1.500 1.300
1.5
2.0
10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5
w1

Figure 93: SGD loss example An example loss function for 2 parameters shown as contour plot.

Let’s look at an example (Fig. 93) for a 2D parabola loss curve described by L(w) = (0.1 w1 )2 +w22 and shown as
a contour plot, where each ellipse represents a loss value. As we can see by looking at the loss the gradient slopes
upward more quickly in the w2 dimension and more slowly in the w1 dimension (factors 1 and 0.1 respectively).
The resulting gradient can be formulated by:
>
∇w L(w) = (0.02 w1 2w2 ) + N (0, 0.03)
∂L ∂L
with ∂w 1
= 0.02 w1 and ∂w 2
= 2w2 . To simulate the process of stochastic gradient descent of minibatches, we
have added Gaussian noise to the gradient as well.
Although SGD helps with gradient descent’s efficiency, the problem of how to choose the learning rate still
exists:
In Fig. 94 we can observe the effect a learning rate that is chosen too low (very slow convergence - top) or too
high (divergence - bottom) has on the convergence of the model. In the case of a learning rate that is too high,
every update overshoots the previous w2 which leads to oscillation and divergence, although it gets closer to
the optimum in the w1 dimension. We can also see the effect of the noise as the points along w1 are not equally
distributed.
When choosing a better learning rate, the effect is obvious (Fig. 95):
The update path is still oscillating and slow, but does not diverge anymore.

6.2.3 Finding the right Learning Rate

The question arises on how to find the right learning rate. One possible solution is the line search:
1. Compute minibatch gradient: ∇w LB (wt ) ≡ B1 b ∇w Lb (wt )
P

2. Find optimal step size: η ∗ = argmin η LB (wt − η∇w LB (wt ))

3. Update weights: wt+1 = wt − η ∗ ∇w LB (wt )
In which we try to find the optimal learning rate for each update independently such that we get the best
update for the parameters. But this creates another optimization problem on its own, because we need to solve
a very large system at each step. Thus making it impractical for deep learning and not useful in practice.

79
SGD ( = 0.10)
1.5 w1, . . , w100
1.0 w0
0.5
0.0
w2

0.5
1.0
1.5
2.0
10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5
w1
SGD ( = 1.01)
7.5 w1, . . , w100
5.0 w0
2.5
0.0
w2

2.5
5.0
7.5
10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5
w1

Figure 94: SGD bad learning rates Different learning rates on the previous toy example. Top: very low
learning rate. Bottom: too high learning rate.

SGD ( = 0.99)
1.5 w1, . . , w100
1.0 w0
0.5
0.0
w2

0.5
1.0
1.5
2.0
10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5
w1

Figure 95: SGD good learning rate Good learning rate on the previous toy example.

6.2.4 Convergence of SGD

A general problem that SGD has over GD is that due to the stochasticity a fixed learning rate η will never
lead to convergence, because of the noise that is added by using minibatches. This is nicely shown by Fig. 96
where the weights have been initialized at the optimum: w0 = 0. Stochastic gradient descent will always step
over the optimum with a fixed learning rate, even when starting at the optimum. Let’s look at this more in
depth in the following.

In general, a series is the sum of terms of an infinite sequence of numbers (a1 , a2 , . . . )

n
X
sn = ak n→∞
k=1

A series is convergent if there exists a number s∗ such that for every arbitrarily small positive number , there
exists an integer N such that for all n ≥ N :
|sn − s∗ | < (23)
Which means that if we go for long enough we will be arbitrarily close to the optimal solution.

80
SGD ( = 0.98)
1.5 w1, . . , w100
1.0 w0
0.5
0.0
w2

0.5
1.0
1.5
2.0
10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5
w1

Figure 96: SGD not converging Good learning rate on the previous toy example shows non-convergence at
optimum.

In the case of the SGD update step wt+1 = wt − η ∇w LB (wt ), this results in:

w0 = winit
w1 = w0 − η ∇w L0
w2 = w1 − η ∇w L1 = w0 − η ∇w L0 − η ∇w L1
w3 = w2 − η ∇w L2 = w0 − η ∇w L0 −η ∇w L1 −η ∇w L2
| {z } | {z } | {z }
=a1 =a2 =a3

with elements a1 , a2 and a3 being elements of a series. By applying the convergence criterion (23) we can
conclude that optimization converges if there exists a vector w∗ such that for every arbitrarily small positive
number , there exists an integer T such that for all t ≥ T :

kwt − w∗ k <

The convergence theorem is given by [10]:

Let (η1 , η2 , . . . ) be a sequence of positive step sizes with

∞
X ∞
X
ηt = ∞ and ηt2 < ∞
t=1 t=1

and let gt be an unbiased estimate of the gradient ∇w L(wt ), i.e., E[gt ] = ∇w L(wt ).
Then, the series
wt+1 = wt − ηt gt t→∞
converges to a local minimum of L(w).

This means on the one hand that we can guarantee that SGD converges to a local minimum when us-
ing a decaying learning rate (like ηt = ηt ), on the other hand we cannot guarantee that it converges to a global
minimum for a non-convex loss function. Although we can guarantee that it converges to the global minimum
for a convex loss function, because the local and global minima are the same.

In conclusion, the problems of SGD are that the contribution of gradients to the update are scaled
equally across all dimensions. SGD requires a conservative learning rate to avoid divergence, which then slows
down the process. In general, finding a good learning rate is difficult.

6.2.5 SGD with Momentum

To help improve the update process of SGD, we introduce SGD with Momentum [12]. By applying a exponential
moving average (Section 6.2.6) of gradients to the weight update, we intend to dampen the oscillation and
accelerate the progress towards the optimum. The resulting improved path of weight updates is seen in Fig. 97.

The velocity term m in the update equations

mt+1 = β1 mt − η ∇w LB (wt )
wt+1 = wt + mt+1

81
SGD + Momentum ( = 0.99, 1 = 0.9)

1.5 w1, . . , w100

1.0 w0
0.5
0.0
w2

0.5
1.0
1.5
2.0
10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5
w1

Figure 97: SGD with Momentum SGD with Momentum learning on the previous toy example.

is responsible for the improvement, by keeping a little bit (β1 ; typically β1 = 0.9) of the motion from the
previous iteration mt in the current update mt+1 . By setting β1 = 0, we can simulate the standard stochastic
gradient descent. Although momentum is introduced as stated above, a better parameterization is the following
linear combination

mt+1 = β1 mt + (1 − β1 ) ∇w LB (wt )
wt+1 = wt − η mt+1

as it decouples the momentum β1 and learning rate η hyperparameters. We can now independently change the
momentum and the learning rate and inspect their behaviour.

6.2.6 Exponential Moving Average

What both of the previous equations for m implement is effectively an exponential moving average of the
gradient. Let’s take a look at the following, to see why it is a exponential moving average.

Let us abbreviate the gradient at iteration t with gt ≡ ∇w LB (wt ). We have:

mt+1 = β1 mt + (1 − β1 ) gt (with m0 = 0)
m1 = β1 m0 + (1 − β1 ) g0 = (1 − β1 ) g0
m2 = β1 m1 + (1 − β1 ) g1
= β1 (1 − β1 ) g0 + (1 − β1 ) g1
m = β1 m2 + (1 − β1 ) g2
3

= β12 (1 − β1 ) g0 + β1 (1 − β1 ) g1 + (1 − β1 ) g2

We see that the weight decays exponentially:

t−1
X
mt = (1 − β1 ) β1t−i−1 gi
i=0

This shows that the contribution of gradients in the past is much smaller than the contribution of gradients
that are closer to timestep t, as the −i in the exponent of β1t−i−1 marks. Although the contribution gets
smaller and smaller for previous iterations, it’s always there.

Depending on how β1 is set, we either track the gradient more quickly (see Fig. 98 left) or slowly (
Fig. 98 right) and sometimes not even reach the original value of the gradient.
We see that the average effectively dampens the behaviour of the function.

82
1 = 0.80 1 = 0.95
5 gt 5 gt
mt mt
4 4

3 3

2 2

1 1

0 0
0 25 50 75 100 125 150 175 200 0 25 50 75 100 125 150 175 200
t t

Figure 98: Exponential Moving Average An example that shows the dampening behaviour of EMA on the
momentum.

SGD + Nesterov Momentum ( = 0.99, 1 = 0.9)

1.5 w1, . . , w100

1.0 w0
0.5
0.0
w2

0.5
1.0
1.5
2.0
10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5
w1

Figure 99: SGD with Nesterov Momentum An example that shows the improved dampening behaviour of
SGD with Nesterov Momentum.

6.2.7 SGD with Nesterov Momentum

Another approach of adding momentum to SGD is with the Nesterov Momentum that builds upon the standard
momentum, by looking one step ahead to calculate the gradient wrt. predicted parameters ŵt+1 :

ŵt+1 = wt + β1 mt
mt+1 = β1 mt − η ∇w LB (ŵt+1 )
wt+1 = wt + mt+1

Again, this expression can be rewritten such that the hyperparameters are decoupled:

ŵt+1 = wt − η β1 mt
mt+1 = β1 mt + (1 − β1 ) ∇w LB (ŵt+1 )
wt+1 = wt − η mt+1

The predicted parameters ŵt+1 estimate where we might end up by updating with the previous velocity mt ,
but without the current gradient. This alternative to the standard momentum increases the responsiveness of
momentum, because we take a look into the future, thus it leads to faster dampening as shown in Fig. 99.

6.2.8 RMSProp
An approach with a the same motivation as momentum is RMSprop. We want to have an even distribution of
gradients on each weight dimension, in contrast to the standard SGD (see Fig. 95) with an very uneven gradient
distribution. The idea of RMSprop is to divide the learning rate by a moving average of squared gradients,
which means that we change the learning rate per parameter, i.e. w1 and w2 in Fig. 95.
The moving average of squared gradients or the running variance v is used in this approach to adjust the

83
RMSprop ( = 0.10, 2 = 1.0, = 1e 08)
2.0 w1, . . , w100
1.5 w0
1.0
0.5
w2

0.0
0.5
1.0
1.5
2.0
10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5
w1

Figure 100: RMSprop An example that shows the update behaviour of RMSprop.

per-weight step size (e.g. division in w2 direction will be larger, division in w1 direction will be small).

vt+1 = β2 vt + (1 − β2 ) ∇w LB (wt ) ∇w LB (wt )

(24)
t
∇w LB (w )
wt+1 = wt − η √ (25)
vt+1 +
By only considering the squared gradient and not the mean, we end up with an uncentered variance of the
gradient v. To make sure that we only apply it per parameter, all operations are elementwise. The effect of this
per-weight scaling can be seen in Fig. 100, which converges much faster to the optimum. However we see that
in the first few iterations the updates show a jumping behaviour. That is due to a bias towards zero, because
we divide by a very small number v at the beginning due to initialization of v0 = 0. This problem is solved
with the next algorithm.

6.2.9 Adam
Adam is the most used and de facto default optimizer, due to its robustness, as it combines all previous mentioned
ideas of Section 6.2.5 and Section 6.2.8. It is also possible to combine Adam with Nesterov’s Momentum,
although not many people use it. The benefits can be seen in Fig. 101 in contrast to Fig. 95.

Adam ( = 0.30, 1 = 0.9, 2 = 1.0, = 1e 08)

1.5 w1, . . , w100
1.0 w0
0.5
0.0
w2

0.5
1.0
1.5
2.0
10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5
w1

Figure 101: Adam An example that shows the update behaviour of Adam.

Given the update equation we can clearly see the influences of Momentum and RMSprop as both the first-
moment velocity term m and the second-moment variance term v are used.

mt+1 = β1 mt + (1 − β1 ) ∇w LB (wt )
vt+1 = β2 vt + (1 − β2 ) ∇w LB (wt ) ∇w LB (wt )

mt+1 vt+1
m̂t+1 = v̂t+1 =
1 − β1t+1 1 − β2t+1
m̂t+1
wt+1 = wt − η √
v̂t+1 +
In addition, Adam uses a bias correction, namely m̂t+1 and v̂t+1 . But why does this help remove the bias?

84
6.2.10 Bias Correction
Let gt = ∇w L(wt ) denote the gradient of the stochastic objective L(wt ).
Let further m0 = 0. Then, the update mt+1 = β1 mt + (1 − β1 ) gt can be written as:
t−1
X
t
m = (1 − β1 ) β1t−i−1 gi
i=0

Thus, the expectation over mt is given by:

" t−1
#
X
t
E[m ] = E (1 − β1 ) β1t−i−1 gi
i=0

Because the expectation is a linear operation we can push it through into the sum. We can approximate the
expectation of previous time steps by E[gt ] and pull it out of the sum again. Also we switch the sum afterwards
from counting downwards to counting upwards:
t−1
X
≈ E [gt ] · (1 − β1 ) β1i
i=0

The sum is now a mathematical series for which a simple solution exists:

= E [gt ] · (1 − β1t )

Here we can see that the term (1 − β1t ) is the factor under the fraction in the bias correction term m̂.

6.2.11 Second-order Methods

There are also some second-order methods that we’ve not looked into, like Newton, Gauss-Newton etc. These
second-order methods are faster as they exploit the curvature of the loss function, however they are not applicable
to mini-batches as the Hessian matrix estimates are too inaccurate if computed from small batches. Also they
are not tractable as they require the inversion of a very large matrix, thus they are typically not applied in deep
models.

6.3 Optimization Strategies

6.3.1 Learning Rate Schedules
We have already seen that the learning rate should decrease with time, although not too fast as we might get
stuck in a region that is far from a good one, i.e. a local minimum. This is where learning rate schedules can
help. The following are some learning rate schedules that we can make use of:
• Fixed learning rate (not a good idea: too slow in the beginning and fast in the end)

• Inverse proportional decay: ηt = η/t (Robbins and Monro)

• Exponential decay: ηt = ηαt
• Step decay: η ← αη (every K iterations/epochs, common in practice: α = 0.5)

6.3.2 Monitoring the Training Process

Maybe just as important as choosing the right learning rate schedule is to recognize common training patterns
that can indicate to us how the model is currently training. In the following we see multiple learning loss plot
patterns that indicate some learning behaviour:

6.3.3 Hyperparameter Search

It is also important to choose the hyperparameters right, because they are parameters that are not optimized
by gradient descent, e.g. parameters of the network are not hyperparameters. In general, it is good to have
less hyperparameters to search the space of possible values efficiently as hyperparameter search is a difficult
problem. Thus often still based on human intuition.
The most common methods of finding hyperparameters are:

85
Figure 102: Underfitting Model does not Figure 103: Not converged Model requires
have enough capacity to decrease losses. The more iterations to converge. We have to run
typical steep decline at the start is missing. training for longer time.

Figure 104: Overfitting Training loss de-

Figure 105: Good example Example of
creases, but validation loss increases. We
train and validation curves that show a good
can use regularization (e.g. early stopping)
fit.
or change the model capacity to avoid.

Figure 106: Noisy validation curves The

Figure 107: Validation set easier It might
validation set might be too small such that
also happen that the validation set is easier
we cannot observe something meaningful.
to predict than the training set.
Increase the validation set size.

6.3.4 How to Start

When starting a new DL project and we want to make sure that our model is correct, we can go through the
following steps:

1. Start with single training sample and use a small network

• First verify that the output is correct

• Then overfit, accuracy should be 100%, fast training/debug cycles

86
Grid search
1

Second Parameter
0.8

• Manuel Search 0.6

It is still the most common. E.g. look 0.4

in the neighborhood of your current hy- 0.2

perparameters. 0
0 0.2 0.4 0.6 0.8 1
• Grid Search First Parameter
Define ranges in which you want to sys- Random search
tematically evaluate hyperparameters 1

Second Parameter
(use human intuition to define ranges). 0.8
Works well, but is very expense, be- 0.6
cause we need to run a lots of full train- 0.4
ings.
0.2

• Random Search Like grid search, 0

0 0.2 0.4 0.6 0.8 1
but hyperparameters selected based on
First Parameter
random draws.
Figure 108: Grid and Random search
An exmaple for hyperparameter search using
grid and random search.

• Choose a good learning rate (0.1, 0.01, 0.001, ..)

2. Increase to 10 training samples

• Again, verify that the output is correct
• Measure time for one iteration (< 1s) ⇒ identify bottlenecks (e.g., data loading)
• Overfit to 10 samples, accuracy should be near 100%

3. Increase to 100, 1000, 10000 samples and increase network size

• Plot train and validation error ⇒ now you should start to see generalization
• Important: Make only one change at a time to identify causes

6.3.5 Improving Gradient Flow

One of the most important requirements of a deep neural network is that the gradient flow has to work
correctly. If we don’t have a proper gradient flow, there might be some parameters that are not or
incorrectly updated. To make sure that this is not the case, we can visualize the gradient flow. There are mul-
tiple possible solutions to improve the gradient flow: Initialization, Batch Normalization and Residual Networks.

Initialization
The initialization of the network’s weights is very important, because depending on the initialization the
activation of layers might get worse layer after layer. Therefore we should consider using Xavier or He
initialization to ensure that the activation distribution is a constant Gaussian across all layers.

Batch Normalization
The Batch Normalization normalizes each channel individually by mean and variance over the batch as
shown in the follow equations by the indices b and c.
B
1 X
µc = xb,c
B
b=1
B
1 X
σc2 = (xb,c − µc )2
B
b=1
xb,c − µc
x̂b,c = p
σc2 +
yb,c = γc x̂b,c + βc

87
Similar to Momentum and RMSprop, we want to make sure that any bias is removed before applying the
subsequent operations. Because we have removed any bias, we want to add a learnable bias βc as well as a
learnable scale parameter γc after that. While training the batch normalization layers, that are placed before the
activation function, saves a running average of mean and variance that will be applied at test time to estimate
the mean and variance during training.
Of course there are other forms of normalization that we can add to our DNN:

Figure 109: Normalization Variations An example of different normalization variations.

Residual Networks
Especiall deep networks have the problem that gradients are propagated very slowly and once they arrive at
early layers they are very small (vanishing gradient-problem). The observation is that deeper networks often
perform worse than more shallow ones, which is counter-intuitive, because if we have a shallow network and
add identity layers/transformations, we get the same performance as the shallow network. So the deep model
should be in theory at least as good in terms of accuracy as the shallow model, which is not the case. Therefore
residual networks have been proposed to learn the residual mappings of a few layers i.e. 2, by introducing a
skip connection across these layers (see Fig. 110), which helps the gradient flow. These layers now only have to
learn a delta of the input towards their output.

Figure 110: Residual Network An example of a residual network unit.

6.3.6 Training Schedules

Lastly, there are some common training schedules that are useful to know when dealing with DL.

Pretraining
Pretraining is most useful, when we don’t have a large enough dataset to train a full network. We first pretrain
our backbone of the DNN (e.g. convolution layers) on another task for which a large dataset with labels is
available (e.g. ImageNet). Then we finetune the last layers on our target task/dataset, which should yield a
significant improvement.

Self Supervision
Similar to pretraining, we want to pretrain the backbone on a task for which supervision is generated from the
data itself (e.g. denoising, inpainting, contrastive learning).

Curriculum Learning
Curriculum Learning is the process of starting training on a easy dataset and then successively increasing the
difficulty for the network. We want the network to work its way step-by-step to the original full dataset, such
that very difficult samples don’t take the training off course.

88
6.4 Debugging Strategies
In deep learning many things can go wrong, therefore we will discuss some common failure/errors cases and
how to solve them in the following section about debugging strategies.

6.4.1 Emergency First Response

Sometimes it is hard to find the problems for unexpected outputs, especially when training with a large dataset,
where one training might take days or weeks. For these cases there are the following points as a emergency first
response to cover the most common problems that could occur. As a general advise, it is best to try one step
at a time to see what each change results in.

• Start with a simple model that is known to work for this type of data (for example, VGG
for images). Use a standard loss if possible. Make sure that we have something working to build
up on.
• Turn off all bells and whistles, e.g. regularization and data augmentation. We want to make
sure that we simplify complex systems to make it easier to debug. Also reduce the size of the dataset to
increase debug cycles.
• If finetuning a model, double check the preprocessing,
for it should be the same as the original model’s training.
• Verify that the input data is correct. It is very easy to get the input wrong, try to visualize to
double-check if it is correct.
• Start with a really small dataset (1–10 samples).
Overfit on it and gradually add more data.
• Start gradually adding back all the pieces that were omitted: augmentation/regularization,
custom loss functions, try more complex models.

6.4.2 Dataset Issues

Check your input data
Check if the input data you are feeding the network makes sense. Make sure to not mix up width and height
of an image. Or that the image is actually filled with data, not all zeros. So print/display a couple of batches
of input and target output and make sure they are OK.

Try random input

Try passing random numbers instead of actual data and see if the error behaves the same way. If it does, it’s a
sure sign that your net is turning data into garbage at some point. Try debugging layer by layer /op by op/
and see where things go wrong.

Check the data loader

Your data might be fine but the code that passes the input to the net (i.e. the data loader) might be broken.
Print the input of the first layer before any operations and check it.

Make sure input is connected to output

Check if a few input samples have the correct labels. Also make sure shuffling input samples works the same
way for output labels.

Verify noise in the dataset

I can happen that your dataset has a lot of bad labels, such that the network cannot learn. Check a bunch of
input samples manually and see if labels seem off.

Shuffle the dataset

If your dataset hasn’t been shuffled and has a particular order to it (ordered by label) this could negatively
impact the learning, as nearby samples in the same minibatch might be highly correlated. Shuffle your dataset
to avoid this. Make sure you are shuffling input and labels together.

Reduce class imbalance

Are there a 1000 class A images for every class B image? Then you might need to balance your loss function if
both classes are equally important or try other class imbalance approaches.

89
Verify number of training examples
If you are training a net from scratch (i.e. not finetuning), you probably need lots of data. For image
classification, people say you need a 1000 images per class or more. If this is not possible, try to use more data
augmentation or pretraining to reduce the number of needed samples.

Make sure your batches don’t contain a single label

This can happen in a sorted dataset (i.e. the first 10k samples contain the same class). Easily fixable by
shuffling the dataset.

Use a standard dataset

When testing new network architecture or writing a new piece of code, use the standard datasets first, instead
of your own data. This is because there are many reference results for these datasets and they are proved to
be ‘solvable’. There will be no issues of label noise, train/test distribution difference etc.

6.4.3 Data Normalization / Augmentation Issues

Standardize the features.
Make sure to standardize your input to have zero mean and unit variance.

Check for too much data augmentation

Augmentation has a regularizing effect. Too much of this combined with other forms of regularization (weight
L2, dropout, etc.) can cause the network to underfit.

Check the preprocessing of your pretrained model

If you are using a pretrained model, make sure you are using the same normalization and preprocessing as the
model was when training and testing. For example, should an image pixel be in the range [0, 1] or [0, 255]?

Check the preprocessing for train/validation/test set

Any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then
applied to the validation/test data. E.g. computing the mean and subtracting it from every image across the
entire dataset and then splitting the data into train/val/test splits would be a mistake.

6.4.4 Implementation Issues

Try solving a simpler version of the problem
This will help with finding where the issue is. For example, if the target output is an object class and
coordinates, try limiting the prediction to object class only.

Make sure your training procedure is correct

It is easy to forget toggling train/validation mode. Make sure you are using the right inputs. It is also easy to
forget setting the gradients to zero before backpropagation.

Check your loss function

If you implemented your own loss function, check it for bugs and add unit tests. Often, wrong losses hurt the
performance of the network in a subtle way.

Verify loss input

If you are using a loss function provided by your framework, make sure you are passing to it what it expects.
For example, in PyTorch I would mix up the NLLLoss and CrossEntropyLoss as the former requires a softmax
input and the latter doesn’t.

Adjust loss weights

If your loss is composed of several smaller loss functions, make sure their magnitude relative to each is correct.
This might involve testing different loss weights.

Monitor other metrics

Sometimes the loss is not the best predictor of whether your network is training properly. If you can, use other
metrics like accuracy. In general, it is desirable to use as little loss functions as possible to reduce the number
of hyperparameters.

90
Test any custom layers
Did you implement any of the layers in the network yourself? Check and double-check to make sure they are
working as intended. Make sure the output has the right format (e.g., did you pass a softmax to a loss that
expects raw logits?)

Check for “frozen” layers or variables

Check if you unintentionally disabled gradient updates for some layers/variables that should be learnable.

Increase network size

Maybe the expressive power of your network is not enough to capture the target function. If you observe
underfitting, try adding more layers or more hidden units in fully connected layers.

Check for hidden dimension errors

If your input looks like (k, H, W) = (64, 64, 64) it’s easy to miss errors related to wrong dimensions. Use weird
numbers for input dimensions (for example, different prime numbers for each dimension) and check how they
propagate through the network and that all dimensions of intermediate/hidden layers are correct.

Explore gradient checking

If you implemented Gradient Descent by hand, gradient checking makes sure that your backpropagation works
like it should.

6.4.5 Training Issues

Solve for a really small dataset
Overfit a small subset of the data and make sure it works. For example, train with just 1 or 2 examples and
see if your network can learn to differentiate these. Move on to more samples per class.

Check weights initialization

If unsure, use Xavier or He initialization. Also, your initialization might be leading you to a bad local minimum,
so try a different initialization and see if it helps.

Change your hyperparameters

Maybe you using a particularly bad set of hyperparameters. Try a grid search.

Reduce regularization
Too much regularization can cause the network to underfit badly. Reduce regularization such as dropout, batch
norm, weight/bias L2 regularization, etc.

Give it time
Maybe your network needs more time to train before it starts making meaningful predictions. If your loss is
steadily decreasing, let it train some more.

Switch from Train to Test mode

Some frameworks have layers like Batch Norm, Dropout, and other layers behave differently during training
and testing. Switching to the appropriate mode might help your network to predict properly.

Visualize the training process

• Monitor the activations, weights, and updates of each layer. Make sure their magnitudes match. For
example, the magnitude of the updates to the parameters (weights and biases) should be 1-e3.
• Consider a visualization library like Tensorboard and Crayon. In a pinch, you can also print weights/bi-
ases/activations.
• Be on the lookout for layer activations with a mean much larger than 0. Try Batch Norm, ELUs or other
forms of activations.
• Weight histograms should have an approximately Gaussian (normal) distribution, after some time. For
biases, these histograms will generally start at 0, and will usually end up being approximately Gaussian.
Keep an eye out for parameters that are diverging or biases that become very large.

Try a different optimizer

Your choice of optimizer shouldn’t prevent your network from training unless you have selected particularly

91
bad hyperparameters. However, the proper optimizer for a task can be helpful in getting the most training in
the shortest amount of time.

Exploding / Vanishing gradients

Check layer updates, as very large values can indicate exploding gradients. Gradient clipping may help. Check
layer activations. A good standard deviation for the activations is on the order of 0.5 to 2.0. Significantly
outside of this range may indicate vanishing or exploding activations.

Increase/Decrease Learning Rate

A low learning rate will cause your model to converge very slowly. A high learning rate will quickly decrease
the loss in the beginning but might have a hard time finding a good solution. Play around with your current
learning rate by multiplying it by 0.1 or 10.

Overcoming NaNs
• Decrease the learning rate, especially if you are getting NaNs in the first 100 iterations.

• NaNs can arise from division by zero or natural log of zero or negative number.
• Try evaluating your network layer by layer and see where the NaNs appear.

7 Convolutional Neural Networks

The deep learning revolution started with a significant performance increase on the ImageNet competition by
AlexNet in 2012 [8]. AlexNets’ success is due to using many layers, using convolutional layers, and a decrease
of features and the depth of the network. In the following section, techniques relating to convolutional neural
nets (CNNs) are introduced.

7.1 Prerequisites
In the following, we use Einstein Notation. Capital letters denote tensor slices. One element of a matrix A
is denoted by A[i, j]. The i’th row of matrix A is accessed with A[i, J] and the j’th column by A[I, j]. A full
matrix uses both capital letters such that A[I, J] denotes A. The same holds for higher-order tensors, H[i, j, k]
denotes one element of the tensor H. Repeated capital letters in a product denote summation over those letters.
The Einstein Notation of a matrix product y = Ax is given by
X
y = Ax ≡ y[i] = A[i, j]x[j]
j

≡ y[i] = A[i, J]x[J]

Capital J indicates the summation over the J’th dimension. The analog case for y = xT A can be written as a
summation over the I’th dimension
X
y = x> A ≡ y[j] = A[i, j]x[i]
i
≡ y[j] = A[I, j]x[I]

7.2 Convolution Layers

In contrast to multi-layer perceptrons (MLP), a CNN layer adds a spatial extension. W and H denote the
width and height of an image. C denotes the number of feature channels. In a colored picture, there are three
feature channels corresponding to Red, Blue, and Green. b. B represents the batch size. Figure 111a shows
the spatial dimensions of two succeeding feature maps Hi , Hi+1 , with input channels Cin and output channels
Cout . Using a gray-scale image as input results in Cin = 1. Consequently a colored image yields Cin = 3.
In a standard MLP (Figure 111b), the whole input is connected to a single feature in the output layer, yielding
a fully connected layer. The number of weights for a fully connected layer is calculated by

#Weights = W × H × Cout × (W × H × Cin + 1)

In contrast, the convolutional layer has fewer connections (Figure 111c). The weight matrix A defines a filter of
a certain size. A feature in the output layer is computed by multiplying a subset of the input neurons with the
weight matrix A. The subset of neurons that influence an output field is in the corresponding input field’s local

92
neighborhood. The same filter operation is applied at every spatial location (weight sharing). A convolution
filter has the size K × K. The number of weights for a convolutional layer is calculated by

#Weights = Cout × (K × K × Cin + 1)

Usually, multiple convolution kernels are convolved with the input, each producing an output channel.

Filter Kernel

(a) (b) (c)

Figure 111: (a) spatial dimensions of two succeeding feature maps (b) a fully connected layer (c) convolutional
layer. Only three input, one output channel and a single convolution are shown for clarity.

7.2.1 Convolution layer

The mathematical formulation for a convolutional layer is given by
 

Hi [b, x, y, cout ] = g Ai [∆X, ∆Y, Cin , cout ] Hi−1 [b, x + ∆X, y + ∆Y, Cin ] + bi [cout ]
 
| {z } | {z }| {z } | {z }
Current Layer Weights Prev. Layer Bias

where b is the batch index, x, y are the spatial locations and cin , cout the feature channels. The field x, y in
hidden layer Hi+1 is computed by applying the activation function g to the product of a subset of A with a
subset of Hi added with the output channels bias term. The subset of A is a square, selected by ∆X, ∆Y ,
of a certain input channel Cin corresponding to a certain output channel cout . The subset of Hi is a square
spanned by the lengths ∆X, ∆Y , fixed around coordinates x, y of the same channel Cin . This operation is done
for every batch b. A simple example using a filter size of two can be seen in Figure 112, where the resulting
field in output map Hi is computed by point-wise multiplying the fields in Ai with the selected fields in Hi−1
summed up and added to a bias term.

1 0 2 1 0 2
2 0 +1 5 2 0 +1 5 1
-2 0 0 -2 0 0
-1 1 -1 1
0 2 -1 0 2 -1

(a) 2 · 1 + (−1) · (−2) + 1 = 5 (b) 0 + 1 = 1

1 0 2 1 0 2
2 0 +1 5 1 2 0 +1 5 1
-2 0 0 -2 0 0
-1 1 -1 -1 1 -1 -2
0 2 -1 0 2 -1

(c) 2 · (−2) + 1 · 2 + 1 = −1 (d) (−1) · 2 + 1 · (−1) + 1 = 2

Figure 112: A convolution of input matrix Ai with input map Hi−1 resulting in the output map Hi .

Technically, CNNs implement correlation and not convolution, which would be the case if the kernel was flipped
in the above operations. This misnomer does not matter as a flipped kernel does not decrease the expressiveness
of the CNN.

93
7.2.2 Convolution Operator
The star operator (∗) denotes the convolution operator. The formula of convolving feature map H with filter
kernel A is given by X
[A ∗ H](x) = A(∆x) H(x + ∆x)
∆x∈Z2

where A convolved with H and is evaluated at a particular position x. Moreover the convolution is translation
equivariant. An operation f (·) is invariant to a transformation Tθ if the functions output is the same for any
transformed input Tθ [H]
f (H) = f (Tθ [H])
Whereas an operation f (·) is equivariant if its output transforms as its input for some specific transformation
type
Tθ [f ](H) = f (Tθ [H])
CNNs are translation equivariant but not translation invariant. The CNN is equivariant because shifting the
input results in a shifted output. Put in another way, by transforming the input, the feature maps are also
transformed. If e.g., an image is shifted by a certain amount of pixels, the output feature maps are also shifted
in the same way. The proof for the convolution operations translation equivariance is given by
X
[A ∗ Tt [H]](x) = A(∆x) Tt [H](x + ∆x) definition of convolution
∆x∈Z2
X
= A(∆x) H(x + ∆x − t) expanding translation operator
∆x∈Z2
X
= A(∆x) H((x − t) + ∆x) rearranging
∆x∈Z2
= [A ∗ H](x − t) = Tt [A ∗ H](x) definition of convolution

.
The ConvLayer can be implemented as computation graph. Figure 113 shows the convolutional operations
dependencies between input tensor (h11 , h12 , h13 ), the output tensor (h21 , h22 ) and the kernel (w1 , w2 ), that is
swiped along the input. The gradients get accumulated across locations due to the kernels weight sharing.

Figure 113: Computation graph of a convolutional layer. The arrows show dependencies in the summation.
The weight w1 appears in the summation for h21 and h22 . Therefore w1 receives more gradients/information in
back-propagation than it would have in a fully connected layer.

7.2.3 Padding
Applying a convolution kernel to all fields of the input map (Figure 114a) decreases the succeeding feature maps’
size (Figure 114b). Convolutions can only be executed if the kernel lies entirely within the input domain. This
decreasing in size undesirable as it couples architecture and input size. Padding adds a boundary of appropriate
size with zeros around the input tensor (Figure 114d). Consequently solving the questions which values should
be used for the bordering fields (Figure 114c). The convolution with a padded tensor H 0 [·] is given by
 
0
Hi [b, x, y, cout ] = g Ai [∆X, ∆Y, Cin , cout ] Hi−1 [b, x + ∆X, y + ∆Y, Cin ] + bi [cout ]
 
| {z } | {z }| {z } | {z }
Current Layer Weights Prev. Layer Bias

. If the input is padded, then ∆x and ∆y are non-negative.

94
(a) (b) (c) (d)

Figure 114: (a) A convolution (b) the resulting receptive field. (c) Fields with question-marks indiciate that it
is unclear what values the kernel should be convolved with. (d) Padding: the input tensor is extended to the
appropriate size with zeros (blue) in the rightmost image.

7.3 Downsampling
Convolution operations can sustain the resolution of the input; therefore extra downsampling operations are used
to reduce the resolution of the spatial input. Downsampling is necessary when the network’s input dimension
is larger than the output dimensions. An example is CNNs that reduce images to few labels. Furthermore,
reducing the spatial resolution increases a deep in the network located neurons receptive field.

7.3.1 Pooling
Reducing the spatial dimensions can be achieved by a pooling operation. Pooling requires no parameters and
is typically one of a max, min, or mean operation. A pooling layer is defined as

Hi [b, x, y, c] = max Hi−1 [b, s · x + ∆x, s · y + ∆y, c]

| {z } ∆x,∆y | {z }
Current Layer Prev. Layer
where s is the stride, the steps a kernel takes. Typically a stride s = 2 and kernel size 2×2 are used, consequently
reducing the spatial dimensions by a factor of two (Figure 115a and 115b). Pooling is applied to each channel
separately, therefore retaining the number of channels from input to output. Furthermore, max-pooling provides
invariance to small translations of the input. In practice, max pooling is often replaced with strided convolution
(ResNet).

1 1 2 4
5 6 7 8 6 8
3 2 1 0 3 4
1 2 3 4

(a) (b) (c) y = 3sinx

Figure 115: (a,b) A pooling operation with stride s = 2 for Hi ’s first two pixels. (c) Two consecutive feature
maps Hi−1 and Hi after applying a max-pooling operation to Hi−1 .

7.3.2 Strided convolution

A convolutional filter is moved, similarly to pooling, over the feature map Hi with a stride s. The operation is
given by the equation
 

Hi [b, x, y, cout ] = g Ai [∆X, ∆Y, Cin , cout ] Hi−1 [b, s · x + ∆X, s · y + ∆Y, Cin ] + bi [cout ]
 
| {z } | {z }| {z } | {z }
Current Layer Weights Prev. Layer Bias

7.3.3 Receptive field and arithmetics

A neurons’ receptive field is defined as all pixels in the input X that influence it, as shown in Figure 116b.
Given an input image by width and height Win × Hin padded by P zeros and a kernel K × K moving with a

95
stride of s, we can calculate the output tensors spatial dimensions with

Win + 2P − K Hin + 2P − K
+1 × +1
s s
| {z } | {z }
Wout Hout

The equation to calculate Wout is similar to Ho ut. In the case of Wout , we apply twice the padding (both left
and right). The kernel length is subtracted because the total width minus the kernel length is the maximal
index the kernel can be positioned. The term Win + 2P − K is divided by the step size and rounded to the
bottom. An example with parameters set can be seen in Figure 116a.

(a) (b)

Figure 116: (a) Strided convolution for an 5 × 5 image, a 3 × 3 kernel, 1 pixel zero padding and a stride of 2
resulting in a 2 × 2 output image. (b) The receptive field for the feature map H2

7.3.4 Fully Connected Layers

In many CNN architectures, there exist fully connected (FC) layers. However, the most memory intensive
part for, e.g., a VGG architecture are those FC layers. In order to apply a FC layer after a CNN, the spatial
dimensions have to shrink. An example would be shrinking from 7 × 7 to 1 × 1 in Figure 117a. The tensor is
reshaped, so that channel dimensions C absorb the width and the height dimensions, see Figure 117b. We can
express a FC layer as
 

Hi [b, cout ] = g Ai [cout , Cin ] Hi−1 [b, Cin ] + bi [cout ]

 
| {z } | {z }| {z } | {z }
Current Layer Weights Prev. Layer Bias

(a) (b)

Figure 117: (a) VGG16 (b) The reshape operation, transforming Hi−1 [B, X, Y, C] into Hi [B, C]

7.4 Upsampling
If pixel-level outputs are desired, it is necessary to upsample the features again. Downsampling is still required
to provide good features with large receptive fields in the intermediate layers. Upsampling yields outputs at
the same resolution as the input.

96
7.4.1 Nearest neighbor
Each channel is scaled using nearest-neighbor interpolation, as shown in Figure 118a.

7.4.2 Bilinear
Each channel is scaled using bilinear neighbor interpolation.

7.4.3 Bed of Nails

Each element is inserted at a sparse location, and the in-between values are set to zero as shown in Figure
118b. The next layer often applies a convolution to fill up the values. Bed of nails sometimes produces ”nails”
artifacts.

7.4.4 Max-Unpooling
For unpooling, the indices of the maximum elements in the earlier corresponding pooling operation are stored.
When applying the unpooling operation, the values are set at the locations from the previous pooling operation.
As a result, corresponding pairs of downsampling and upsampling layers are required. An example can be seen
in Figure 118c. The max-unpooling approach has been used in SegNet.

6 6 8 8 6 0 8 0 1 1 2 4 0 0 0 0
6 8 6 6 8 8 6 8 0 0 0 0 5 6 7 8 Pooling 6 8 4 3 Unpooling 0 4 0 3
3 4 3 3 4 4 3 4 3 0 4 0 3 2 1 0 3 4 1 2 1 0 0 0
3 3 4 4 0 0 0 0 1 2 3 4 0 0 0 2

(a) (b) (c)

Figure 118: (a) nearest neighbor upsampling, (b) example bed of nails upsampling shows, (c) max-unpooling
operation with the corresponding downsampling counterpart.

7.4.5 Dilated Convolution

The goal of dilated convolution is to increase the receptive field without increasing parameters or reducing
spatial dimensions. Consequently, no pooling is required to increase the receptive field, and predictions can be
produced at the same resolution as the inputs. The dilation factor d leads to a more ”distributed” sampling of
the input. A layer that implements dilated convolution is expressed with the formula
 

Hi [b, x, y, cout ] = g Ai [∆X, ∆Y, Cin , cout ] Hi−1 [b, x + d · ∆X, y + d · ∆Y , Cin ] + bi [cout ]
 
| {z } | {z }| {z } | {z }
Current Layer Weights Prev. Layer Bias

Dilated convolutions are used in, e.g., semantic segmentation, depth, optical flow.

(a) (b)

Figure 119: (a) Dilated convolution applied with a dilation factor d = 2. (b) The receptive fields exponential
expansion without loss of resolution or coverage.

7.5 Architectures
In the following, we look at standard CNN architectures and paradigms they introduced.

97
7.5.1 LeNet-5
In 1998 the LeNet-5 was introduced using two convolution layers (5 × 5), two pooling layers (2 × 2) followed by
two fully connected layers. In 1998 LeNet-5 achieved state-of-the-art accuracy on MNIST (before ImageNet).
The architecture is shown in figure 120a.

7.5.2 AlexNet
The deep learning revolution started in 2012 with the introduction of the AlexNet architecture. AlexNet consists
of eight successive layers. It used ReLUs, dropout, data augmentation and was trained on 2 GTX 580 GPUs.
The number of feature channels increases with depth, resulting in decreasing spatial resolution. AlexNet showed
that CNNs work well in practice. The architecture is displayed in figure 120b.

7.5.3 VGG
In 2015 the VGG architecture was introduced. The novelty was using 3 × 3 convolutions everywhere but
maintaining the same expressiveness with fewer parameters. The paper backed the paradigm that using small-
sized kernels is better. A second variant has 19 instead of only 16 layers. The architecture is shown in figure
117a.

7.5.4 Inception
In 2015 the inception architecture was introduced. Inception consists of 22 layers. The modules utilize conv/pool
operations with varying filter size. Backpropagating an error through 22 layers turns out to be very hard. To
mitigate a vanishing gradient, multiple intermediate classification heads improve gradient flow. The use of 1 × 1
convolutions reduces the number of features and leads to higher efficiency.

(a) y = x (b) y = x (c) y = x

Figure 120: (a) LeNet-5 (b) AlexNet (c) Inception Network.

7.5.5 ResNet
ResNet was introduced in 2016. Residual connections allow for training deeper networks (up to 152 layers).
The network uses a very simple and regular structure with 3 × 3 convolutions and strided convolutions for
downsampling. ResNet and ResNet-like architectures are dominant today in computer vision.

7.5.6 U-Net
U-Net was introduced in 2015 and produced image-level segmentation by combining max-pooling as downscaling
and up-convolution as upscaling operation. U-Net is the defacto standard for many tasks with image output
(e.g., depth, segmentation).
1 64 64

128 64 64 2

input output
image segmentation
392 x 392

390 x 390

388 x 388
388 x 388

map
570 x 570
568 x 568
572 x 572

128 128
256 128
200²

198²
196²
280²
284²
282²

256 256 512 256

conv 3x3, ReLU

104²
140²

138²

136²

102²

100²

copy and crop

512 512 1024 512
max pool 2x2
56²
68²
3x3 conv, 128, /2

3x3 conv, 256, /2

3x3 conv, 512, /2

7x7 conv, 64, /2

3x3 conv, 128

3x3 conv, 256

3x3 conv, 128

3x3 conv, 256

3x3 conv, 512

3x3 conv, 128

3x3 conv, 512

3x3 conv, 64

32² 64²
66²

54²

52²
avg pool
pool, /2

fc 1000

1024 up-conv 2x2

conv 1x1
30²

28²

(a) (b)

Figure 121: (a) ResNet (b) U-Net

98
7.6 Visualization
Visualization provides researchers with methods to open the black box of neural network classifiers. In the
following section, a few visualization techniques are presented.

7.6.1 First Layers

For the first layers of a CNN, the convolutional filters can be directly displayed as shown in figure 122a. The
first layers learn Gabor filters or simple gradients. Gabor filters for example extract spatial orientations. The
interpretation of deeper layers is more difficult.

ResNet-18: ResNet-101: DenseNet-121:

64 x 3 x 7 x 7 64 x 3 x 7 x 7 64 x 3 x 7 x 7
AlexNet:
64 x 3 x 11 x 11

(a) (b)

Figure 122: The right image shows weight matrix visualizations of a linear classifier. The left image shows the
first layers learned filters of different deep learning architectures.

7.6.2 Last Layer

Nearest Neighbors
The last layer before the classification layer encodes an image representation for each image class. An image
feature space is populated by recording the previous layers activation for all validation images, For one image,
the k-nearest neighbors can be determined and displayed, showing semantic relationships between the visual
embeddings. An example is displayed in figure 123a.

Dimensionality Reduction
Using dimensionality reduction techniques such as t-SNE or PCA, similar items can be identified or grouped.
Figure 123b shows the FC7 architectures visual embedding of the MNIST dataset where similar numbers
cluster in the same regions.

Maximally Activating Patches

Many images are passed through a network. The patches which maximally activate a specific neuron of interest
are recorded. This way, we can determine what a neuron attends to.

(a) (b)

Figure 123: (a) Test images the k-nearest neighbors utilizing the visual embedding of the layer preceding the fully
connected classification layer. (b) A lower dimensional embedding of the high dimensional visual embedding
for the MNIST dataset.

7.6.3 Saliency
Saliency determines the relevant parts of an image for a particular classification decision and can uncover
unwanted biases in data or the model.

99
Saliency via masking
An image of a particular class is passed through the network multiple times. Each time a different image region
is masked, and the classification response is recorded. A saliency map can be constructed from these responses.
A pixel in the saliency map defines how much the classifier is sensitive to an occlusion at that pixel wrt. to the
class. An example is shown in figure 124a.

Saliency via backpropagation

Saliency via backpropagation computes the gradient of the class score for image pixels. The saliency map is
generated by displaying the magnitude of the gradients. This shows how changing pixels of the input affects
class scores the most. Saliency via backpropagation is cheaper than saliency via masking and is used to uncover
biases in the data or the model.

(a) (b)

Figure 124: (a) Saliency via masking w.r.t to the classification of an elephant. The saliency map reveals that
the elephant’s head was most important for the classification decision. (b) A saliency map generated wrt. to
gradients.

7.6.4 Deconvolution
The idea of deconvolution is to reconstruct the CNN in reverse order starting from one neuron. Figure 125a
shows the deconvolution of two layers. The first layers display simple edge detectors, whereas the second layer
shows contours or image parts up to the last layers that display certain objects.

7.6.5 Deep dream

Amplify activation of a chosen layer wrt. to a given image. The backpropagated gradient is set to the neuron’s
activation itself, leading to increased activation of neurons that are already activated. The results can be seen
in figure 125b.

(a) (b)

Figure 125: (a) Deconvolution (b) Deep Dream.

8 Sequence Models
In real-life, our input and output might not always fit into a tensor of a fixed size. It might expand or shrink
continuously across time. Speech, for example, might consist of a single word or a whole sentence. However,
when we start speaking the model has no way of knowing whether our sentence might be 3 words or 30 words
long. Sequence models have been developed to handle such input and output. The most popular type of
sequencing models are recurrent networks which will be explained in section 8.1. Next, we will discuss different
applications of such models to demonstrate their flexibility before moving on to discuss an improved version
of the classical recurrent networks, i.e. gated recurrent networks. In the last part of this unit, auto-regressive
models will be discussed.

100
A little remark on how computation graphs will be displayed in this section. For better clarity, the weight
input nodes will be consider implicit for all hidden and output layer nodes. Furthermore, output nodes will be
coloured in blue from now on.

8.1 Recurrent Networks (RNNs)

The main idea of RNNs is so-called feedback connections which feed information from one node back to
itself. Thus, the hidden state h is updated based on the current input and the previous hidden state using
the same parameters at each time step. This update is indicated by the arrow looping back to the compute
node h in the central computation graph of Fig. 126. This procedure can be repeated t times. To keep track of
how many recurrences already took place and to see the forward pass more clearly, one can ”unroll” the RNN
over time (cf. Fig. 126, left).
Although the input feature maps might change over time, the computations in h1 , h2 , h3 , . . . always stay the
same. Assuming the RNN is meant to translate from a language A into a language B, the inputs x1 , x2 , x3 , . . .
as characters or words of language A which are translated into the characters or words ŷ1 , ŷ2 , ŷ3 , . . . of language
B.
RNNs allow processing of sequences of variable length and have – at least in theory – an infinite memory
since h is a function of all previous inputs, i.e., has long-term dependencies.
Feedforward Recurrent Neural Network (RNN) Recurrent Neural Network (RNN)
Neural Network with feedback connection unrolled over time (index = time t)

...

Figure 126: Feed-forward neural network vs. Recurrent neural network computation graphs.

In RNNs, computation nodes like h are called cells. RNN cells like the one in Fig. 127 receive input from the
previous time-step ht−1 and xt and output the tensor ht and ŷt . After concatenating the two input tensors,
some transformations are applied to the resulting tensor to compute ht . However, the actual translation ŷ of
our input needs one final affine transformation of the previously computed ht .

Output

Concatenate Read
RNN Cell

Memory Write to Memory Memory

Input

Figure 127: A basic RNN cell.

Mathematically, these steps are described by

ht−1 = fh (ht−1 , xt )
(26)
ŷt = fy (ht )
This general formula does not specify the form of the output mappings. Furthermore, neither fh nor fy
change over time, unlike in layers of feed-forward networks.
The single layer RNN consisting of a cell like the one in Fig. 127, makes this more concrete:

ht−1 = tanh(Ah ht−1 + Ax xt + b)

(27)
ŷt = Ay ht

101
This hidden state of this single layer (vanilla) RNN ht is the linear combination of the input xt and the
previous hidden state ht−1 . The output is the linear prediction based on the current hidden state ht . The
tanh(·) activation function used here, is the standard for most RNNs. It both takes and returns data in the
range [−1, 1] and comes with all the benefits (e.g., zero-centred) and disadvantage (e.g., saturation and vanishing
gradient) (cf. section 4.2.2. on Tanh function). As mentioned before, the weight parameters Ah , Ax , Ay , and b
are constant over time, although the number of time steps might vary due to differences in the sequence length.
Formulated in Einstein-Notation the RNN equation for a single layer RNN will look like this:

Ht [b, cout ] = tanh(Ah [cout , Cin ]Ht−1 [b, Cin ] + Ax [cout , Cin ]Xt + b[cout ])

Ht−1 [b, Cin ]
= tanh Ah [cout , Cin ] + b[cout ] (28)
Xt [b, Cin ]
Ŷt [b, cout ] = Ay [cout , Cin ]Ht [b, Cin ]

Ht−1 [b, Cin ]
By stacking the input tensors are concatenated to . The concatenation operation of the two
Xt [b, Cin ]
inputs is represented by the c node in Fig. 127.
Since, RNNs allow for processing of inputs and outputs of variable length, they can be categorised into four
types based on their input-output mappings:

• One to Many, e.g., image captioning (image to sentence)

• Many to one, e.g., action recognition (video to action)
• Many to Many, e.g., machine translation (sentence to sentence)
• Many to Many, e.g., object tracking (video to object location per frame)

The for types are visualised in Fig. 128.

To allow the model to determine the output’s length, we can predict a stop symbol. For languages, this could
be a full stop at the end of a sentence.
One to One One to Many Many to One Many to Many Many to Many

Figure 128: Types of input and output mappings.

Backpropagation. RNNs return an output at multiple times-steps with the number of time-steps not being
fixed. At the same time, each output generates a loss. Thus, each of the losses at a time-step need to be taken
into account when back-propagating, i.e. gradients are back-propagate through time (cf. ).

Loss

Output

Hidden
Layer

Input

Figure 129: Backpropagation through time in an single layer RNN.

The gradient derived from each loss at a time-step is back-propagated to the previous time-step. There, it
is added to the gradient of that time-step and so on. This process is repeated until the initial time-step is
reached. Here, the weight parameters are updated. All of the hidden RNN cells share their parameters; hence,
the gradients are accumulated. However, gradients are getting intractable very quickly – in terms of memory
capacity – for more extensive sequences such as Wikipedia because all previously calculated time-steps had to
be stored along the way.

102
A practical solution is truncated backpropagation through time. While hidden states continue to be
carried forward forever, backpropagation is stopped before reaching the first time-step. In this way, only a
limited number of time-steps is back-propagated. For example, given t = 9 time-steps, backpropagation could
be split into three chunks, i.e., time-steps 1 to 3, 4 to 6, and 7 to 9. This truncation eliminates long-term
dependencies as their gradients are no longer dependent on each other (cf. ).

Loss Loss Loss

Output Output Output

Hidden Hidden Hidden

Layer Layer Layer

Input Input Input

Figure 130: Truncated backpropagation through time in an single layer RNN.

Unfortunately, it will also cause the RNN to lose its memory. Thus, the truncation can not simply be set to a
random value. Instead, it has to be chosen such that necessary context is preserved. In general, the truncation
ranges between 50 - 60.2
Although it is not uncommon to find RNNs consisting of only a single layer, they can be built with multiple
layers. RNNs are often kept very shallow in practice, i.e., only a couple of layers deep. Multi-layer RNNs are
constructed by adding a second cell/hidden layer after the first one, like in typical feed-forward architectures.
As both of the hidden layers in the RNN feedback to themselves, a computation graph would look like the
diagram in Fig. 131 if unrolled.
Alternatively, the cell itself could be made deeper. In other words, instead of having a single affine transformation
and activationc)function fh inside a cell, you can stack multiple layers inside of it. Such an approach is often
combined with residual connections in the vertical direction, i.e., between the layers.

Output

Hidden
Layer 2

d)
Hidden
Layer 1

Input

Figure 131: Schema of a multi-layer RNNs with 2 cells/layers.

8.2 Recurrent Network Applications

One task that can be addressed using RNNs is multiple object recognition. If you are given a house number
and want to find the right house in a street, you need to recognize the correct sequence of digits when it appears.
Given an image of a house number, an RNN would also have to recognize the sequence of numbers in the picture
to make the correct classification. At each time-step, it would get a glimpse at an image region and predict
a saccade. A saccade is the location where the model should shift its attention to in the next time-step. An
example of how an RNN would operate on such a sequential task is presented in Fig. 132a).
RNNs can also be used for recurrent instance segmentation. First, an image gets transformed into a feature
map by a fully-convolutional network. Then, an RNN produces one segment of the image at each time-
step. To know which parts of an object have not yet been segmented, it relies on the segmented parts’ union
fed back into the model. In Fig. 132b) the colours indicate which part of an object was segmented at which
time-step.
RNNs can also carry out object tracking of one or multiple objects in a video. Therefore, it updates each
object’s hidden state for each video frame (cf. Fig. 132c)). The updated information could consist of the
location, the shape, the size, or the velocity of the object.
An interesting ability of RNNs is that they can solve tasks that are not sequential by nature. For instance, we
can use them for image generation. An example is given in Fig. 133a) where a red rectangle indicates the
attended region where the model can draw.
Instead of creating a single digit or sequences of digits, an RNN can also learn how to generate/complete images
based on partially occluded inputs as demonstrated in Fig. 133Fig: 90 b). The occluded input is shown on
2 However, it is task-dependent. In our language example, the truncation might differ, depending on whether the sequences are

characters, words, or sentences.

103
a) b)

Figure 132: (a) Sequence recognition in multiple object recognition, (b) recurrent instance segmentation, (c)
object tracking in video frames (right).

a) b)

Figure 133: Examples of sequential image generation. a) Generation of handwritten digits; b) Generation of
images based on partially occluded inputs.

the left and the original on the right. All pictures in between are completions generated by the model. This
prediction is based on pixel intensities in the non-occluded part of the input. Some of the resulting images are
more plausible than others.
The wide range of application possibilities also includes image annotation. The Polygon-RNN is an in-
teractive object annotation tool that iteratively annotates the outline of a 2D object instance (e.g., a car) in
an image with polygons (cf. Fig. 134a)). The annotator can manually correct the suggested outlines. Such
tools are beneficial when creating an annotated data set to train an object detection model as they significantly
reduce the amount of time spend on annotation.

a) b)

Figure 134: Examples of object annotation (left) and road layout modeling (right)

Similar to the inference of object outlines in the image annotation, RNNs can model or generate road layouts.
The resulting spatial graphs are based on an image input. An example of road layout generation and aerial
road parsing can be seen in (cf. Fig. 134b)).
With the help of RNNs, we are even able to transform an image into a sentence that describes the picture content
(cf. Fig. 135Fig: 92). For this task, an RNN is trained to sequentially generate words based on a condensed
feature representation of the input image. A CNN previously computed this representation. In Fig. 135 the
attention shifts from one region in the image to another one. The Figure also demonstrates how the attended
regions correspond to the words that are generated.
Although this is quite impressive and it works (cf. Fig. 134c)), there is a lot that can go wrong (cf. Fig. 134d)).
For instance, taking the second picture to the left in the first row of Fig. 134d) indicates that the clock-like
shape identified by the model, is in fact, printing on the woman’s hoodie. Thus, the errors made by such a

104
a) b)

c) d)

Figure 135: Examples of image captioning. a) general process flow; b) Shifted attention of the RNN (top: soft
attention, bottom: hard attention); c) Successful image captioning examples; d) Unsuccessful image captioning
examples.

model can be partially attributed to the regions with high attention. Analyzing these patterns of attention
allow improvements in the model’s performance.
A more recently proposed task for the usage of RNNs is visual question answering. The model receives an
image and a question (and multiple possible answers) as input and predicts the correct answer. For instance,
given the image of a Teddy Bear and the question ”What kind of stuffed animal is shown?” the model should
answer: ”Teddy Bear”.

Figure 136: Google’s neural machine translation system: An Encoder-Decoder model for machine translation.

One of the more common non-image related fields is neural machine translation. When translating from
one language to another, many to many mapping is required as it is not possible to translate sentences word
by word. For example, a German metaphor like ”Ìch verstehe nur Bahnhof” would be translated to ”I only
understand train station.” However, this is not what the model should return. A more appropriate translation
would be something like ”I don’t understand what you just said.”. For this reason, translation models use an
encoder that builds a meaning representation of the original sentence and a decoder to construct a sentence
with the corresponding meaning in the target language (cf. Fig. 136).
RNNs for language generation. In his blog post from 2015 Andrej Karpathy demonstrated the effectiveness
of simple character-level language models to generate natural language character-by-character. To be able
to generate the word “hello”, an alphabet containing four characters: “h”, “e”, “l” and “o” is needed. Each of
the character is represented by a 1 hot vector, e.g., the letter “h” is represented by the vector (1, 0, 0, 0))T . If
fed with the “h”-vector, the model predicts a distribution over the next character via a Softmax function.
The character drawn from the distribution is then in turn fed as an input to the RNN at the next time-step.
However, this little toy example would have to be extended to work on more than just the simple word “hello”.
Thus, a three layer RNN with 512 hidden nodes was trained on William Shakespeare’s works, which consisted
of 4.4 million characters in total. Although the meaning and grammar of the produced sentences are debatable,

105
this simple model makes surprisingly few orthographic errors, as can be seen from Fig. 137a). It can also
understand that there are some underlying structures in natural language such as spacing, punctuation, main
and sub-clauses, and dialogue structure without being told that such things exist.
A closer look at the training process sheds light on what the model learns and at which stage it learns what.
It starts by detecting that space separates words. At the next stage, the model starts to get the hang of some
principles of sentence structuring. It also caught on to some words like “I”, “here”,“in”, “at”, “on”. Then, the
model constantly improves and fine-tunes until it produces actual words and non-sense sentences.

a) b)
PANDARUS
Alas, I think he shall be come approached and the da
When little srain would be attain'd into being never fed
And who is but a chain and subjects of his death
I should not sleep

Second Senator
They are away this miseries, produced upon my soul
Breaking and strongly should be buried, when I peris
The earth and thoughts of many states

DUKE VINCENTIO
Well, your wit is in the care of side and that

Second Lord
They would be ruled after this chamber, an
my fair nues begun out of the fact, to be conveyed
Whose noble souls I'll have the heart of the wars.

c) d)
Rudi Levette Berice Lussa Hany Mareanne Chrestina Carissy
Marylen Hammine Janye Marlise Jacacrie Hendred Romand
Charienna Nenotto Ette Dorane Wallen Marly Darine Salina Elvyn
Ersia Maralena Minoria Ellia Charmin Antley Nerille Chelon
Walmor Evena Jeryly Stachon Charisa Allisa Anatha Cathanie
Geetra Alexie Jerin Cassen Herbett Cossie Velen Daurenge
Robester Shermond Terisa Licia Roselen Ferine Jayn Lusine
Charyanne Sales Sanny Resa Wallon Martine Merus Jelen Candica
Wallin Tel Rachene Tarine Ozila Ketia Shanne Arnande Karella
Roselina Alessia Chasty Deland Berther Geamar Jackein
Mellisand Sagdy Nenc Lessie Rasemy Guen Gavi Milea Anneda
Margoris Janin Rodelin Zeanna Elyne Janah Ferzina Susta Pey
Castina

Figure 137: Examples of text generation with character-level language model. a) Output from model trained
on Shakespeare text; b) Output from model trained on Latex source code file for mathematics book; c) Output
from model trained on programming source code; d) Output from model trained on baby names

Moving to a more practical application of such a model, it is also possible to train a model to create output in
latex format, source code, or baby names (Fig. 137b) - d)).
When looking at what the single neurons in such a model have learned, one can find neurons that detect a
line’s
:
:
:
.
:
end or the
.
beginning
d
.
,
,
,
y
h
,
and end of a quote. In the source code example, there are neurons trained to the
structure if-statements. However, there are way more cells whose function is not so easily explained. Overall
the behavior of only ∼ 5% of the hidden neurons is logical and can be interpreted by a human.

8.3 Gated Recurrent Networks

RNNs can be quite difficult to train. To improve trainability, almost all modern RNNs use gates – hence the
name gated recurrent networks. Most of the applications mentioned in the previous section use such gated
RNNs.
But why is it so difficult to train a vanilla RNN? Assuming a 1D hidden state ht ∈ R with ht = tanh(ah ht−1 +
ax xt + b) and its gradient ∂h∂ht−1
t
= tanh0 ah with tanh0 ah = ∂tanh(x)
x a longer chain of time-steps would result in

∂ht ∂ht ∂ht−1 ∂ht−k+1 t
= + ... = Π tanh0i akh (29)
∂ht−k ∂ht−1 ∂ht−2 ∂ht−k i=t−k+1

during backpropagation where k refers to the number of earlier time-steps. Thus, the gradient vanishes as soon
as tanh(·) saturates. Hence, the initialization of the network needs to be chosen carefully to avoid saturation.
Unfortunately, the gradient might still not behave as it should. Assuming the weights have been initialised
properly, it is possible that the activation functions do not saturate at all. The hidden state ht = tanh(ah ht−1 +
ax xt + b) would be ≈ ah ht−1 + ax xt + b ∈ [−1, 1] . If the gradient for the hidden state ht is calculated, i.e. ∂h∂ht−1
t
,
it would be approximately ah . A backpropagation over k previous hidden states would then return a gradient

106
∂ht ∂ht ∂ht−1 ∂ht−k+1
= ... = akh (30)
∂ht−k ∂ht−1 ∂ht−2 ∂ht−k
where the weights are multiplied to the power of k previous time-steps. This becomes a problem if ah > 1.
In this case the gradients will explode, i.e. become very large and lead to divergence. Consider, for example,
ah = 1.1 and k = 100. The resulting gradient would be ∂h∂ht−k
t
= akh = 13781.
Gradient clipping is often applied to prevent exploding gradients in RNNs. First, think about what happens in
case ah < 1. Given ah = 0.9 and k = 100, the gradient will become really really small, ∂h∂ht−k
t
= akh = 0.0000266.
Such vanishing gradients will prevent the model from learning at earlier time-steps. This issue can not be solved
by simply clipping the gradient. Instead, the architecture of the model needs to be changed. Unfortunately,
introducing residual connections will not work as the parameters are shared across time. On top of that, the
input and desired output at each time-step are different.
The same difficulties will arise when extending the RNN to vector- or tensor-valued

ht = tanh(Ah ht−1 + Ax xt + b) ≈ Ah ht−1 + Ax xt + b

∂ht ∂ht (31)
≈ Ah ⇒ ≈ Akh
∂ht−1 ∂ht−k

Let Ah = QΛQ−1 be the eigendecomposition of the square matrix Ah . Thus, the gradient is Akh =
(QΛQ−1 )k = QΛk Q−1 with a diagonal eigenvalue matrix Λ. Similar to the 1D hidden states before, the
model will encounter exploding gradients for components with eigenvalues > 1 and vanishing gradient for
components with eigenvalue < 1. Again the weight matrix Ah is shared across time.
Now, gradient clipping can be used to dampen the effects of exploding gradients. Gradient clipping refers to a
simple heuristic which clips the gradient to a fixed value τ , before applying the gradient update during SGD.
If the norm of the gradient vector, i.e. ||A.grad||2 , is smaller then the threshold τ , nothing happens. Should
the norm exceed τ , the gradient vector will be divided by its norm and the result will be multiplied with τ . In
other word, first the norm is ”reset” to 1 before it is set to τ , i.e. the maximum gradient value. Thus,

(
A.grad if ||A.grad||2 ≤ τ
A.grad = A.grad (32)
τ ||A.grad||2
otherwise

The maximal gradient magnitude τ is a hyperparameter of the model and often ranges between [1, 10].
Avoiding vanishing gradients is more complicated than that. Here, gates come into play. There are several
types of gates. The first, most complex, and most influential one, the Long- Short-Term Memory (LSTM),
was introduced back in 1997 by Hochreiter. There are two more recent and simpler types of gated RRNs, namely
the Gated Recurrent Unit (GRU, by Cho 2014) and the Update Gated Recurrent Neural Network
(UGRNN, by Collins 2017).
All three gate architectures use their gates to filter information. However, the number of gates and the way
they work is different (σ in the cell architectures in Fig. 138).

a) b) c)
LSTM Cell

UGRNN Cell
GRU Cell

Figure 138: Three types of gated RNNs: a) LSTM, b) GRU, c) UGRNN.

8.3.1 Update Gate Recurrent Neural Network (UGRNN)

Starting with simplest gated architecture, the UGRNN, Fig. 138c) shows that it only uses a single gate to
control the gradient flow. Like the vanilla RNN cell, a UGRNN cell receives input from the previous hidden
state Ht−1 and the current data input Xt . These two are again concatenated to a single matrix at c.
From there, the concatenated tensor is passed to two linear transformations. One is the update gate Ut which
determines if the hidden state H is updated or not. The other is the next target state St that is added to
Ht−1 with element wise weights Ut .

107
At the first gate a Sigmoid function is computed. In addition to the concatenated input matrix, it uses two
weight matrices Auh and Axx – the former is applied to the previous hidden state Ht−1 , the later to data input
Xt – and a bias bu all of which are shared across time-steps. Thus,

Ut [b, cout ] = σ(Auh [cout , Cin ]Ht−1 [b, Cin ] + Aux [cout , Cin ]Xt [b, Cin ] + bu [cout ]) (33)

If the output of the Sigmoid function is 1, then the information from the previous time-step is passed on and
used to update the hidden state. If it is 0, the information accumulated in the previous time-step(s) is discarded.
The concatenated input matrix is also passed into the tanh activation function at the target gate. Like the
Sigmoid function, it also receives two weight matrices As h and As x for Ht−1 and Xt respectively and a bias bs .
Hence,

St [b, cout ] = tanh(Ash [cout , Cin ]Ht−1 [b, Cin ] + Asx [cout , Cin ]Xt [b, Cin ] + bs [cout ]) (34)

One last calculation is necessary to arrive at the output of the current state Ht . Using the output of the update
gate Ut and the target gate St , the following computation takes place

Ht [b, cout ] = Ut [b, Cout ]Ht−1 [b, Cout ] + (1 − Ut [b, Cout ])St [b, Cout ] (35)

The first term of the equation is an element-wise multiplication of the Sigmoid function’s output and the previous
hidden state. It controls how much information of the previous time step should be part of the current hidden
state output. In places where the Sigmoid is 0, the previous state’s information is erased and kept otherwise.
The second term decides how much information from the current state should be passed into the output. By
element-wise subtracting the Sigmoid function’s output from 1 before element-wise multiplying the result with
the target gate, the output information is kept where the memory is erased and deleted where the memory is
preserved. In other words, Ut is a linear weighting between the previous state and the new target state.
In the schematic cell, this equation is indicated by the four operations in circles that follow after calculating
the Sigmoid and Tanh function. The element-wise multiplication is indicated by . This symbol denotes the
Hardamad product. The Hardamad product means that the first element in matrix Ut is multiplied with
the first element in matrix Ht−1 , then the second element in matrix Ut with the second element in matrix Ht−1
and so on. We can drop the Einstein notation of the Equation above as it does not have to be stated explicitly.
However, one can also make it explicit by using standard vector notation and writing the function as

Ht [b, cout ] = Ut [b, Cout ] Ht−1 [b, Cout ] + (1 − Ut [b, Cout ]) St [b, Cout ] (36)

After understanding how a gate works, the question of what does one gain from them remains. Let’s compare
the backpropagation in a regular RNN with that of a gated RNN such as the UGRNN (cf. Fig. 139).

a) b)
UGRNN Cell
RNN Cell

Figure 139: Backpropagation in a regular RNN cell (left) and a gated UGRNN cell (right).

Expressed mathematically, the backpropagation of the two RNN types are:

ut = σ(Auh ht−1 + Aux xt + bu )

st = σ(Ash ht−1 + Asx xt + bs )
ht = tanh(Ah ht−1 + Ax xt + b) ht = ut ht−1 + (1 − ut ) st (37)
∂ht
h0t = tanh0 Ah ≈ Ah with h0t = h0t = u0t ht−1 + ut + (1 − u0t ) st + . . .
∂ht−1
c)

108
with the backpropagation of a regular RNN on the left and a UGRNN on the left. As seen previously, when
discussing vanishing and exploding gradients in RNNs, the gradients can be approximated by the weight matrix
Ah which gets multiplied with itself according to the number of time-steps. In a gated RNN cell, as the UGRNN
referred to here, this is no longer the case. Instead, it can maintain gradient flow despite small Ah by setting
its gate to u ≈ 1. Put differently; the cell can determine how much of the gradient is passed backward (similar
to the forward pass). It can learn to ignore some of the information of a state vector for a very long time and
thus pass it backward for a very long time, too. In the equation this can be seen from ut . No matter how small
any of the other terms in h0t get, we always multiply the backward propagated gradient with a value close to 1,
i.e., the already computed gradient will just be passed through.
Thus, an UGRNN is able to keep the state of a variable over a long time horizon (u ≈ 1). Going back
to the character-level language models, such a model can, for instance, keep track of being inside a quote/if-
statement/etc. or not.

8.3.2 Gated Recurrent Unit (GRU)

Turning to Gated Recurrent Unit (GRU) as the one in Fig. 138b), it becomes apparent that the GRU cell relies
on two gates. Although the right part of the cell that computes the update gate and the target state has the
same behavior as in the UGRNN cell, the GRU cell’s input must pass through a so-called reset get. The reset
gate decides which parts of the previous state are used to compute the current state.
Put into an equation, a GRU cell computes a new current state as follows:

rt = σ(Wrh ht−1 + Wrx xt + br )

ut = σ(Wuh ht−1 + Wux xt + bu )
(38)
st = tanh(Wsh (rt ht−1 ) + Wsx xt + bs )
ht = (1 − ut ) ht−1 + ut st

8.3.3 Long Short-Term Memory (LSTM)

The Long Short-Term Memory makes the current hidden state’s processing even more complicated by passing
along an additional cell state c and using three gates. The first gate, the forget gate, determines which
information is erased from the cell state. The second gate, the input gate, decides which values of a cell state
should be updated. The third gate, the output gate, is in charge of picking the cell state elements that are
revealed at a time t.
More formally, the operations in a LSTM cell can be formulated as

ft = σ(Wf h ht−1 + Wf x xt + bf )
it = σ(Wih ht−1 + Wix xt + bi )
ot = σ(Woh ht−1 + Wox xt + bo )
(39)
st = tanh(Wsh ht−1 + Wsx xt + bs )
ct = ft ct−1 + it st
ht = ot tanh(ct )

8.3.4 UGRNN vs. GRU vs. LSTM

A comparison of the three types:

UGRNN GRU LSTM

• 1 gate • 2 gates • 3 gates
• expose entire state • expose entire state • control exposure

• single update gate • single update gate • input/forget gates

• few parameters • medium parameter • many parameters

In their systematic study, Collins et al. 2017 [3] find that GRUs are the most learnable of the three types of
gated RNNs when it comes to shallow architectures. It is followed by the UGRNN. This finding supports the
observations that the simpler and more recently developed gated RNNs perform as well as (or even better) as
the more complex ones. Thus, GRU and UGRNN can be thought of as improved and simplified versions of the
original LSTM cell.

109
8.4 Auto-regressive Models
RNNs have long been believed to be the only model capable of processing sequential input. Recently, a class
of feed-forward neural networks has be found that are at least on par with RNNs. Such models are called
Auto-regressive Models. A k’th order auto-regressive model predicts the next variable xt in a time series
based on the k previous variables xt−1 , xt−2 , ·. In the example in Fig. 140, one can see that x3 , x4 and x5 are
connected to the previous two inputs, i.e. x5 on x3 and x4 , x4 on x2 and x1 , and x3 on x2 and x1 . Thus, k
would be 2.

Figure 140: Example of input dependencies of a a 2nd order autoregressive model

Like RNNs, parameters are shared across time, i.e. same function f (·) at each t. What makes them different
from RNNs is that they make a strong conditional independence assumption, e.g. x5 is independent of x2
given x3 and x4 .
The simple concept of autoregressive model can be extended to situations with varying or different input
and output lengths. For instance, as illustrated in this single-layer model in Fig. 141. Here an output ŷt is
dependent on the k-last inputs, i.e. {xi |t − k ≤ i ≤ t}, and independent of everything that came before, i.e.
{xi |i < t − k}. Thus, ŷt = f (xt , xt−1 , . . . , xt−k ).

Figure 141: Example of a single-layer auto-regressive model

From this most basic form of an auto-regressive model, it becomes evident that the past information is not
summarised in a hidden state h. Thus, such a model does not have infinite memory. However, it is easier to
train because there is no need for backpropagation through time.
Like all other neural networks, auto-regressive models can be extended to deep models with multiple layers
by adding some hidden layers between the input and output layers (cf. Fig. 142). Thus,

ht = f1 (xt , xt−1 , . . . , xt−k )

(40)
ŷt = f2 (ht , ht−1 , . . . , ht−k )
where each hidden layer ht takes into account k previous inputs xt , xt−1 , . . . , xt−k and each output at the
current time ŷt k previous ht , ht−1 , . . . , ht−k hidden layer outputs.
Such a model would effectively perform multiple causal temporal convolutions. The weights are shared
across time, which means we are swiping a kernel filter over the sequential inputs to arrive at the outputs.
These convolutions are causal because they depend only on past and present inputs but not on future inputs.3
Since autoregressive models are similar to any other deep neural network (e.g. ConvNets), residual connections
and dilated convolutions can be integrated into its architecture.
The first successful model of that type is WaveNet by Oord et al. 2016 [14] (cf. Fig. 143 a)). WaveNet is a
generative model for raw audio wave-forms which generates realistic speech. It consists of four hidden layers.
3 Theoretically, it is possible in both autoregressive models and RNNS to also include future information into the computation

process. Those models would then be bi-directional.

110
Figure 142: Example of a multi-layer auto-regressive model

At each hidden layer, we consider a larger amount of past input information – similar to a growing receptive
field in ConvNets – due to the usage of dilated convolutions. The model architecture also contains residual
and skip connections.
The success of WaveNet was surprising as the generation of audio signals is considered hard as it needs 16k
samples per second and structure at multiple time scales. RNNs usually avoid raw audio wave-forms for
this reason. However, WaveNet was able to do this and outperform traditional LSTMs on speech synthesis, as
demonstrated for North American English and Mandarin Chinese by the authors.

a) b)
Output
Dilation = 8

Hidden Layer
Dilation = 4

Hidden Layer
Dilation = 2

Hidden Layer
Dilation = 1

Input

Figure 143: Example of auto-regressive models. a) WaveNet; b) TCNs.

This idea has been developed into a simpler type of autoregressive models called Temporal Convolution
Networks (TCNs) by Bai et al. 2018 [1] (cf. Fig. 143 b)). These networks use zero-padding to handle
sequences of arbitrary length. Aside from the residual layers, the rest of the deep multi-layer architecture
c)
is kept simple. Like the WaveNet, it also uses dilated convolutions to increase the receptive field size, also called
”context”. TCNs have been shown to perform better than LSTMs, GRUs, and RNNs on several different
sequence tasks.
How come that these simpler models that do not have a memory like an RNN, are working so well? This
blogpost concludes that – at least in theory – a stable RNN can be approximated by a feed-forward network for
both inference as well as training. Based on current literature, it states that the strength of RNNs, namely their
”infinite memory”, has no chance to unfold its power in practice. Furthermore, the unlimited context offered
by RNNs is not strictly necessary in some tasks (e.g., language modeling). Models that truncate at a sequence
length of 13 or 25 are often competitive with infinite memory models.

9 Natural Language Processing

9.1 Language Models
A language model models the probability distribution over a sequence of discrete tokens x = (x1 , . . . , xT ).
Each of these tokens can take a value from a vocabulary V (xt ∈ V) and it can be for example a word, a
character or a byte, depending on the model. The last token of such a sequence is a special <EOS> token to

111
indicate the end of a sentence. Therefor: xT = <EOS>. This means that whenever we predict the next word or
token, we can either predict a word form the vocabulary or we can predict the end of sentence token.
T
Y
p(x) = p(x1 , . . . , xT ) = p(xt |x1 , . . . , xt−1 )
t=1
= p(x1 ) p(x2 |x1 ) p(x3 |x1 , x2 ) . . .

We can model a probability distribution over a sequence x of discrete tokens (x1 , . . . , xT ). The joint distribution
over this sequence decomposes into a product of t conditional distributions p of xt given all of the x with an
index smaller than xt which is a consequence of the product rule or chain rule of probability. This is not an
approximation. We simply can rewrite or decompose any joint distribution in such a form. By decomposing it
that way each token in a sequence only depends on the previous tokens in that sequence by iteratively applying
the product rule.
We have already seen in the lecture about recurrent networks and feed forward models that there are models
that operate on the character level but there’s also language models that operate on the word level as we will
see in this lecture. The difficulty of operating on the word level is that the vocabularies are much larger. But
it is also easier to operate on the word level since it is easier to model long-term dependencies while at the
character level long-term dependencies are hard. At character level vocabularies are typically small, like 10, 20
or 30 characters, while at the word level typical vocabularies are of size 10.000, 20.000 or 30.000, depending on
the language. So therefor we have to predict distributions over a very large state space.

Word Language Model Example:

p(The dog ran away <EOS>) = p(The) p(dog|The) p(ran|The dog)

p(away|The dog ran) p(<EOS>|The dog ran away)

As we can see in the example above, the sentence ’The dog ran away <EOS>’ decomposes based on the product
rule into the conditional distributions p of ’The’, times p of ’dog’ given ’The’, times p of ’ran’ given ’The’ and
’dog’ and so forth. We can see that language models in general but also in particular the given word language
model are autoregressive models that predict the next token given all the previous tokens in this sentence or
sequence. If a model is good then it has a high probability of predicting likely next words.

9.1.1 Applications
Language Recognition:
One of the most straightforward applications of language models is language recognition. Assume we have
trained two language models p and p0 which assign probabilities to sentences.

p(x) = p(x1 , . . . , xT ) p0 (x) = p0 (x1 , . . . , xT )

Since they are different models they might assign a different probability to the same sentence. Lets assume that
p has been trained on a large text corpora of English sentences and p0 on French sentences.
We can then determine which sentence a language is from by simply classifying according to:
(
English if p(x) > p0 (x)
Language(x) =
French otherwise

Generative Model:
We can also use a language model to sample new sequences. Assume a language model over sentences x:

p(x) = p(x1 , . . . , xT )

And p(x) has already been trained. By using the decomposition into conditional distributions

p(x1 , . . . , xT ) = p(x1 ) p(x2 |x1 ) p(x3 |x1 , x2 ) . . .

we can efficiently sample new sentences from the model distribution. Sow e start by sampling x1 , the first
word, and then we sample the second word conditioned on the first word, then we sample the third world
conditioned on the first two words and so forth. By doing this we can sample from this model in linear time
which is a sampling of words based on all the previous words.

Bayesian Inference: (Machine Translation)

We can also use these language models to do bayesian inference and in this case we have an example for machine

112
translation. Assume we have a language model trained p of x and lets consider this as a prior over possible
sentences x in the form of a language model:

p(x) = p(x1 , . . . , xT )

Now assume a likelihood or proposal mechanism that tells us how likely sentence x translates to sentence y:

p(y|x) = p(y1 , . . . , yT 0 |x1 , . . . , xT )

Which tells us according to this model how likely this translation is. This is how machine translation have
worked in the past. These were rule based and very complicated systems that propose sentences and assign
likelihoods to the translated sentences given the input sentence. It therefor assigns a probability to y given x.
The original sentence x and the translated sentence y might not have the same length. This is indicated by
the last indice of x being T and for y being T 0 . If we have p(x) and p(y|x) we can use Bayes rule to infer the
posterior over translated sentences:
p(y|x) p(x)
p(x|y) =
p(y)
We multiply these two, p(x) and p(y|x), together to get the posterior probability of x in the target language
given the sentence in the source language.
Modern machine translations don’t work like this. In this example we are using a generative model to infer
a discriminative decision rule. In modern machine translation systems the probability of x given y is directly
modeled by conditioning on the source sentence and modeling a distribution for the output sentence. We will
be looking at some examples of these type of models in the last units.

9.1.2 Training
So far we have considered just generic distributions but now to make it more precise we use pmodel (x|w) to
distinguish the model from the data distribution and to indicate that the model has some parameter w that we
want to train. So now let X = {xi }N i i
i=1 denote a training set with sentences xi = {x1 , . . . , xTi }.
We train the unconditional language model pmodel (xi |w) via maximum likelihood:
N
Y
ŵM L = argmax pmodel (xi |w)
w
i=1
N
X
= argmax log pmodel (xi |w)
w
i=1
= argmin −Epdata [log pmodel (x|w)]
w

This is a standard formulation that we have seen before where the maximum likelihood prediction is the argmax
over the parameters of the product of the model distribution for the entire dataset, where N is the size of the
dataset. By applying the logarithm, which is a monotonic function, we can turn the product into a sum. This
on the other hand corresponds to a minimization problem of the negative expectation of pdata of the logarithm
of pmodel x given w. So we minimize the cross entropy between the data and the model distribution. If we
manage to minimize this then the model is as similar as possible to the data distribution.

9.1.3 Evaluation
We want to be able to measure the performance of such a trained language model. Character language
models typically measure performance in bits per character.

In order to do that we first need to introduce some basic quantities from information theory:
Shannon Information:
Given a character sequence x of length T with probability p(x), the so called surprise, or shannon information,
normalized by the sequence length T is the normalizedT negative log-likelihood of x:
1 1X
I(x) = − log2 p(x) = − log2 p(xt |x1 , . . . , xt−1 ) [bits]
T T t=1

The − T1 is a normalizing constant to normalize with respect to the sequence length and thereby getting inde-
pendent of the sequence length. Since p(x) can be decomposed into this sequence of conditionals we then have

113
the sum of t from 1 to t over this decomposition. The unit of this is bits since we are using the basis of 2 for
the logarithm.
Intuitively if we consider a distribution over x. If we observe an x and the probability for that x is low under
that distribution then the negative logarithm of that is large, therefor the surprise is large. If we are observing
something that is unlikely then the information or the surprise is large. Conversely if we observe something
that has a high probability of occurring then the surprise is low. We are not surprised to see that because it is
likely to occur under our model.

Cross Entropy:
The expected surprise of the model under the data distribution (for sequences of length T ) is thus given by
the (normalized) cross entropy:

1
H(pdata , pmodel ) = Epdata − log2 pmodel (x)
T

Here we took the expression from the shannon information and took the expectation over the data distribution,
we average over the entire dataset. In other words this is the expected surprise given our trained model pmodel .
A model is better if the expected surprise is low. So if the model models the data distribution well then the
expected surprise is low and better models have a smaller number of bits.

Let us now consider sequences of arbitrary length by taking the limit T → ∞.

By the Shannon-McMillan-Breiman theorem, the cross-entropy simplifies further

1
H(pdata , pmodel ) = Epdata − log2 pmodel (x)
T
1
≈ − log2 pmodel (x) [bits]
T

as each sequence occurs in proportion to its probability anyways if we consider long enough sequences (think
of an infinite text x generated from the data distribution). Because this model distribution factorizes into
conditional probabilities where a word depends on all the previous words. So one can think of this with the
logarithm as a large sum. So we have to sum over the entire dataset of for example sentences and then we have
to sum over each sentence. Now if we let the sequence or sentence length go to infinity then the expectation
does not matter anymore because each sequence occurs in proportion to its probability anyways. So each of
these terms that are added together anyways occur in proportion to their probability. So for long sequence
lengths this is a reasonable approximation to make. Therefor in practice this is typically computed when cross
entropy in bits is reported, which is the measure for character level language models. Additionally in practice,
H(pdata , pmodel ) is evaluated on a test or validation sequence x. It is not evaluated on the training sequence
because that might just measure overfitting, but we are measuring generalization performance by evaluating
on a test or validation sequence x.

Example 1:
Consider a vocabulary V = {A, B} and sequences of length T = 10.
QT
Assume p(x) = t=1 p(xt ). So the probability distribution over sequences of length t factorizes completely into
the probability of each individual token in a sequence. In other words in this simple example the probability of
the next word is completely independent of the previous words. Let’s further assume that the probability for
each of the symbols in the vocabulary is fifty percent: p(xt ) = 21 for both data and model distribution.
Then
10
1 1
H(pdata , pmodel ) = − log2 = log2 2 = 1 bit
10 2
The amount of information needed to predict the next character is 1 bit with this simple (unigram) model as
the next character is either A or B with equal probability. In other words, we can’t find a better encoding of
this language than using 1 bit per character.
Remark: A uniform distribution always maximizes the entropy (⇒ upper bound for |V|).

Example 2:
Consider a vocabulary V = {A, B} and sequences of length T = 10.
QT
Assume p(x) = t=1 p(xt ) with p(xt = A) = 1 and p(xt = B) = 0
for both data and model distribution.
Then

114
1
H(pdata , pmodel ) = −log2 110 = log2 1 = 0 bits
10
In this case, the amount of information needed to predict the next character is 0 bits as the next character
is always A. In other words, we don’t need any capacity to transmit this language through some channel, it
contains no information.
Remark: 0 bits is the minimal value for the entropy or cross-entropy (⇒ lower bound).

Example 3:
Consider a vocabulary V = {A, B} and sequences of length T = 10.
QT
Assume p(x) = t=1 p(xt ) with p(xt = A) = 0.1 and p(xt = B) = 0.9
for both data and model distribution.
Then
1 9
1 1 9
H(pdata , pmodel ) = − log2 = 0.47 bits
10 10 10
We need 0.47 bits now as we sometimes observe A, but most often B. Thus, the information conveyed in this
language is larger than 0 bits (lower bound) and smaller than 1 bit (upper bound).

Example 4:
Consider a vocabulary V = {A, B} and sequences of length T = 10.
QT
Assume p(x) = t=1 p(xt ) with p(xt = A) = 0.1 and p(xt = B) = 0.9
for the model distribution and p(xt = A) = 1 for the data distribution.
Then
10
1 1
H(pdata , pmodel ) = − log2 = log2 10 = 3.32 bits
10 10
Because in the model we have a probability of 0.1 of observing A but in the data we only observe A.
We need more than 1 bit now as the model fits the data badly. We need 0.47 bits to encode any possible
outcome of pdata using the code optimized for pdata and 2.85 bits to encode any possible outcome of pdata using
the code optimized for pmodel :

H(pdata , pmodel ) = H(pdata ) + DKL (pdata ||pmodel )

| {z } | {z }
≥0 ≥0

In other words, and this is universally correct, the cross entropy of the model distribution with respect to the
data distribution is th entropy of the data distribution plus the KL divergence between the data and the model
distribution. Because this KL divergence must always be bigger or equal to zero we know that the entropy of
pdata is a lower bound on the cross entropy of pmodel with respect to pdata . So the cross entropy that we obtain,
because our model is always imperfect, must always be bigger or equal to the entropy of the data distribution.

Evaluating Word Language Models

It would be natural to measure word language models in bits per word.
However, word language models are traditionally measured in perplexity. Which is a measure that is very
related to bits per weight to the cross entropy:

Perplexity(pdata , pmodel ) = 2H(pdata ,pmodel )

1
≈ 2− T log2 pmodel (x)

1
= pmodel (x)− T
T
!− T1
Y
= pmodel (xt |x1 , . . . , xt−1 )
t=1

Thus, perplexity can be interpreted as the inverse probability of the test set, normalized by the sequence
length T which acts as a geometric mean.
Again, Perplexity(pdata , pmodel ) is evaluated on a test or validation sequence x.

Example 1:
Consider a vocabulary V = {A, B, C} and sequences of length T = 10.
QT
Again we assume unigrams: p(x) = t=1 p(xt ) with p(xt ) = 31 for both data and model distribution.
Then

115
1
10 !− 10
1
Perplexity(pdata , pmodel ) = =3
3

We see that the perplexity models the number of possible next tokens to choose from (i.e., here the model is
maximally confused which of the 3 tokens A, B or C to pick). Because they all occur independently with equal
probability of one third.
Thus, perplexity is often also called the average weighted branching factor. Because we have three
branching opportunities here that are all equally likely, so the model has a average branching factor of three.
Remark: A uniform distribution maximizes the perplexity (⇒ upper bound for |V|).

Example 2:
Consider a vocabulary V = {A, B, C} and sequences of length T = 10.
QT
Assume p(x) = t=1 p(xt ) with p(xt = A) = 1 and p(xt ∈ {B, C}) = 0,
for both data and model distribution.
Then
− 1
Perplexity(pdata , pmodel ) = 110 10 = 1
We see that the perplexity reduces in this case as the next choice is certain. The average branching factor is
one, we know we have to choose A. The model is not surprised to see the test set as it is able to predict the
test set exactly (all A’s).
Remark: 1 is the minimal value for the perplexity measure (⇒ lower bound). However, this is only achievable
for languages that contain only a single token.

Example 3:
Consider a vocabulary V = {A, B, C} and sequences of length T = 10.
QT
Assume p(x) = t=1 p(xt ) with p(xt = A) = 0.1, p(xt = B) = 0.9, and p(xt = C) = 0 for both data and model
distribution.
Then
1
1 9 !− 10
1 9
Perplexity(pdata , pmodel ) = = 1.38
10 10
In this case, the perplexity is slightly larger than 1 as the model is quite certain to predict B as the next
character, but sometimes it should predict A.

Example 4:
Consider a vocabulary V = {A, B, C} and sequences of length T = 10.
QT
Assume p(x) = t=1 p(xt ) with p(xt = A) = 0.1, p(xt = B) = 0.9, and p(xt = C) = 0 for the model distribution
and p(xt = A) = 1 for the data distribution.
Then
1
10 !− 10
1
Perplexity(pdata , pmodel ) = = 10
10
In this case the perplexity is larger than 3 as the model fits the data badly.

9.1.4 Summary
• For character language models, current performance is roughly 1 bit per character
• For word language models, perplexities of about 60 were typical until 2017
• According to Quora, there are 4.79 letters per word (excluding spaces)
• Assuming 1 bit per character, we have a perplexity of 25.79 = 55.3
• State-of-the-art models (GPT-2, Megatron-LM) yield perplexities of 10 − 20
• Be careful: Metrics not comparable across vocabularies or datasets

Additional Resources:
https://thegradient.pub/understanding-evaluation-metrics-for-language-models/
https://towardsdatascience.com/the-relationship-between-perplexity-and-entropy-in-nlp-f81888775ccc

116
9.2 Traditional Language Models
In this section we’re going to be talking about n-grams which have been the default language model for several
decades. A quick recap about language models:
T
Y
p(x) = p(x1 , . . . , xT ) = p(xt |x1 , . . . , xt−1 )
t=1
= p(x1 ) p(x2 |x1 ) p(x3 |x1 , x2 ) . . .

• Probability distribution over a sequence of discrete tokens x = (x1 , . . . , xT )

where each token can take a value from a vocabulary V (xt ∈ V)
• Decomposes into a sequence of conditional distributions
• The history (conditioning variables/tokens) is called context in NLP

• We can represent each conditional distribution with a probability table and

9.2.1 N-gram Models

• An n-gram is a sequence of n tokens (e.g. 2-grams or bigrams: “The dog”, “dog ran”, “ran away”)
• n-gram models approximate the history context by the last n − 1 tokens, they shorten the context.
So we are cutting off the first words and only keeping the last n minus one words in the history context
to predict the current word
• In other words, they make a Markov assumption (the model is memoryless), which makes this much
more tractable, in particular if we consider small ns
• This idea is similar to the autoregressive models that we have introduced, except that each conditional
p(xt |xt−n+1 , . . . , xt−1 ) is represented by a probability table and not a neural network
• Early language models considered bigrams (n = 2) and trigrams (n = 3)

Word Language Model Examples:

• bigram: (n = 2, or in other words: history length = 1)

p(x1 , x2 , x3 , x4 ) = p(x1 ) p(x2 |x1 ) p(x3 |x2 ) p(x4 |x3 )

p(The dog ran away) = p(The) p(dog|The) p(ran|dog) p(away|ran)

• trigram: (n = 3, history length = 2)

p(x1 , x2 , x3 , x4 ) = p(x1 , x2 ) p(x3 |x1 , x2 ) p(x4 |x2 , x3 )

p(The dog ran away) = p(The dot) p(ran|The dog) p(away|dog ran)

9.2.2 Training of n-gram Models

The conditional probability can be written as:

p(xt−n+1 , . . . , xt ) p(xt−n+1 , . . . , xt )
p(xt |xt−n+1 , . . . , xt−1 ) = =P
p(xt−n+1 , . . . , xt−1 ) xt p(xt−n+1 , . . . , xt )

For a bigram model this would yield:

p(xt−1 , xt ) p(xt−1 , xt )
p(xt |xt−1 ) = =P
p(xt−1 ) xt p(xt−1 , xt )

117
We see that we simply need to count the number of n-grams and (n-1)-grams in the training set to populate the
probability table of the n-gram model. We simply need to count um how often is xt−1 followed by xt divided
by the number of occurrences of xt−1 .
Smoothing: For large n, the n-gram probabilities are often zero as they haven’t been observed in the training
set but they might occur in the test set which means that we are overfitting. A simple heuristic is to add one
to all n-gram counts.

9.2.3 Sampling from n-gram Models

• As n-gram models are autoregressive models, sampling is easy
• We just need to iteratively draw tokens from p(xt |xt−n+1 , . . . , xt−1 )
until we reach the end of sentence symbol <EOS>
Example of a random sentence drawn from a Jane Austen trigram model:
“You are uniformly charming!” cried he, with a smile of associating and now
and then I bowed and they perceived a chaise and four to wish for.”
As as one can read, this might not make a lot of sense but it’s still natural language as it could potentially occur.

Samples from a Shakespeare language model:

–To him swallowed confess hear both. Which. Of save on trail for are ay device and
1gram rote life have
–Hill he late speaks; or! a more to leg less f rst you enter
–Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live
2gram king. Follow.
–What means, sir. I confess she? then all sorts, he is trim, captain.
–Fly, and will rid me these news of price. Therefore the sadness of parting, as they say,
3gram ’tis done.
–This shall forbid it should be branded, if renown made it empty.
–King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A
4gram great banquet serv’d in;
–It cannot be but so.

Figure 144: Samples from a Shakespeare language model. https://web.stanford.edu/∼jurafsky/slp3/

One can see that as we increase the context size from a unigram, to a bigram, to a trigram, to a 4-gram, that
the text becomes much more realistic.

9.2.4 Summary
• n-gram models are simple models that make a Markov assumption
to model a distribution over sequences via probability tables
• However, they have limited history context, they cannot model long-term dependencies, and param-
eters grow exponentially
• Smoothing heuristics are required to deal with the resulting sparsity
• They can’t directly model conditional distributions (e.g., for translating sentences)
• In contrast to neural language models, they can be considered as local non-parametric predictors (thus
suffering the curse of dimensionality)
• Tokens are encoded as discrete items
– Large vocabularies are typically reduced to a shortlist (removing infrequent words)
√
– Any two words have the same distance ( 2 in one-hot vector space)
– Thus, n-gram models can’t share information between related (=close) words

9.3 Neural Language Models

The motivation for this unit is the fact that we cannot represent semantically related words with n-gram models,
which would be beneficial for a machine learning model. Neural language models now can utilize neural networks
for predicting the next word.

118
9.3.1 Local Word Representations
• n-gram models, that effectively are just conditional probability tables, suffer from the curse of dimen-
sionality
• Modeling the joint distribution of n = 10 consecutive words with a vocabulary
of size |V| = 10000 results in intractable |V|n = 1040 parameters, even if we limit the history context to
10.
• However, and this is where the idea for this type of neural language models came from, when modeling
continuous variables, we obtain generalization more easily than in the discrete case, because the function
to be learned is expected to behave locally smoothly
• This has been observed in other contexts but they haven’t been used for NLP and its usage here has made
a tremendous impact on the performance of NLP algorithms
• Discrete word representations assume the same distance between words:

√
(1, 0, 0, . . . , 0)> − (0, 1, 0, . . . , 0)> = 2
| {z } | {z }
w1 w2 2

• Remark: In this unit we use w to denote a one-hot encoding of a word

9.3.2 Word Representations

However, some words are more similar than others. Consider the sentences:
• “The cat is walking in the bedroom”
• “The dog was running in a room”
The sentences at the one-hot-vector level are very different but semantically they are quite related and having
seen the first sentence (in the training set) should help to generalize making the second sentence very likely as
• “cat” and “dog”,
• “walking” and “running”,
• “the” and “a”, . . .
have similar semantic roles. But this cannot be captured in purely discrete models.

9.3.3 Distributed Word Representations

Therefor the idea is to use a different representation, a so-called distributed representation:
• Distributed representations map one-hot vectors {0, 1}|V| of high dimensions
(e.g., |V| = 10000) to word embeddings RM of lower dimensions (e.g., M = 30)
• Thus, they distribute the representation along all dimensions (of the embedding vector), so we now have a
real valued vector where there is a different real number for each of these M dimensions, so the semantics
of that word is now distributed in that representation and we are able to model similarity between
words
• They distribute probability mass where it matters rather than uniformly in all directions around each
training point
• They therefore allow generalization across sentences
• Word embeddings are often realized through a simple linear mapping (it’s basically just a matrix
multiplication where the one hot encoder encoding goes in and out comes this lower dimensional vector
that is a matrix) which can then be further processed by a feedforward or recurrent neural network
Remark: Unlike in this illustration, we do not attach actual labels to words. The depiction in Fig. 145 is only
about semantic relatedness. For example student, colleges and schoolhouse are all related to academic, and
academic is related to papers, so student, colleges and schoolhouse are more likely to be followed by papers,
than by bill.

Example: Assume a training set with the following 3 sentences:

119
Figure 145: Distributed Word representations. Depiction of distributed word representations.

Local Representation Distributed Representation

walking
1 walking 1

cat cat
1 dog
dog

0 0
1 1

Figure 146: Local vs Distributed Word representations.

• “The cat is walking”

• “The dog is walking”
• “The cat is sitting”

A distributed representation learns to embed “cat” and “dog” nearby. So the distance between those two words
should be low.
Given that p(The cat is sitting) is high, and “cat” and “dog” are related

• “The dog is sitting”

is also likely despite not being part of the training set.

Through this similarity we can model sentences to be more likely that are actually not part of the training set.
An n-gram model can’t generalize in this setting.
As we can see in Fig. 146, we have a local representation on the left of walking, cat and √ dog. This is a one
hot encoding so they are on the axis and the distance between all of these is the same: 2. The goal is to
transform this into a lower dimensional representation, a distributed representation. In this particular case we
go from three-dimensional to two-dimensional, where now cat and dog are closer to each other than they are
to walking, as can be seen in Fig. 146 on the right.

Once we have learned these word embeddings, we can also visualize these word embeddings. An example of
a two-dimensional t-SNE visualization of a word embedding can be seen on Fig. 147. The shown regions are
zoomed in because the vocabulary is much larger. When zooming into particular regions we see that in the
region on the left we see countries and in the region on the right we observe dates or years. So the model has
learned to relate the semantic entities to each other by learning them from text.
But we always have to be careful with 2D visualizations - Geoffrey Hinton once said “In a 30-dimensional
grocery store, anchovies can be next to fish and next to pizza toppings.” This illustrates how difficult it is to
imagine and how easy it is to be misled by high dimensional spaces. In high dimensions everything can be next
to everything. Even if we visualize word embeddings we can really just do it in 2D, everything else is beyond
the scope of our imagination.

120
Figure 147: Word embeddings. Two-dimensional t-SNE visualization of a word embedding model.

Figure 148: Neural Probabilistic Language Model

9.3.4 Neural Probabilistic Language Model

The neural probabilistic language model [2] was the first language model that used a distributed representation
which have been used before in connectionism (Hinton, Elman) and symbolic reasoning (Paccanaro), but not
for language, which was a breakthrough for NLP.

Key Ideas:
• Associate with each word in the vocabulary a distributed
word feature vector (a real-valued vector in RM )
• Express the joint probability function of word sequences in terms
of the feature vectors of these words in the sequence, and
• Learn simultaneously the word feature vectors and the parameters
of that probability function
So we have an end-to-end training process. This was the first end-to-end model that simultaneously learned
the word feature vectors and the parameters of the probability function.

121
Feedforward Model:
Formal description of the Feedforward Model:
• Input: Sequence of words
eyidx(wt )
• Output: Prob. of next word P (wt |wt−1 , . . . , wt−n+1 ) = P yi
ie
• 3 Layers: y = b + Wx + U tanh(d + Hx)
x = (C wt−n+1 , . . . , C wt−1 )
1. Fully connected ⇒ embedding
2. Fully connected + tanh
3. Fully connected + softmax where C ∈ RM ×|V| . Thus the model
• Input to 2nd layer: vector of concate- • .. scales linearly with |V|
nated word embeddings
• .. scales linearly with n
• Optional: direct connections
The model is shown in Fig. 148. The model has some inputs at the bottom of the figure which are the indices of
the words, where the words are one-hot-vectors. This builds a sequence of words which then is transformed by
the following matrix C from a one-hot-vector to a word embedding. So for each word we get a high-dimensional
word embedding and then all of these word embeddings get concatenated into a vector. Which builds the first
layer, which is then fully connected and followed by a tanh as a hidden second layer. Next this is fully connected
and followed by a softmax as the third layer. This softmax is then outputting the probability of the next word
wt in the sequence given all the previous words wt−n+1 until wt−1 .
There also exists a small variation of this model which is visualized with the dotted lines in Fig. 148. These
lines show direct connections from the first layer to the last layer
In the formal description of this feedforward model the probability for the next word wt is given by a softmax.
yidx(wt ) in the softmax is used to get the index of the one-hot-vector which is then normalized by all the others.
The y’s then are linear predictions of the direct connections followed by a tanh-nonlinearity from x. x is a
concatenation of all the one-hot-vectors for each previous word multiplied with the embedding matrix C. Now
C is in RM ×|V| . We can see that the model scales linearly with the vocabulary size because if we increase
the vocabulary then C also has to increase. But different to before now the model also scales linearly with n.
Before we had |V|n and but now the model scales linearly with n because we are increasing the length of the
concatenated vector that we input to the second layer and so the number of parameters in that second layer
increases linearly, not exponentially.
This is actually a common trick that’s used when working with neural networks that instead of representing some
conditioning explicitly, the conditioning is an input to a neural network and therefore, in terms of complexity, the
model becomes more tractable. This is also something that’s for instance used in the context of reinforcement
learning. This way there is a lot more sharing of information and parameters that takes place.

• Neural probabilistic language models lead to significantly better results in comparison with the best of
the n-grams, with a test perplexity difference of about 24% on Brown and about 8% on AP News (which
are both datasets), when taking the MLP versus the n-gram that worked best on the validation set.
• The results also suggests that the neural network was able to take advantage of more context (on Brown,
going from 2 words of context to 4 words brought improvements to the neural network, not to the n-grams).
This is because of the similarities that are learned and the generalization that is implied by it.

• Also showed that the hidden units are useful (MLP3 vs MLP1 and MLP4 vs MLP2), and that mixing the
output probabilities of the neural network with the interpolated trigram always helps to reduce perplexity.

9.3.5 Word2Vec / Skip-Grams

There are a number of papers called Word to Vec - efficient estimation of word representation vector space.
• Fitting language models is hard (large |V| ⇒ large softmax, memory intensive and computation inten-
sive)

• Skip-grams predict a word in their surrounding context, which is kind of an unsupervised training task
• Instead of predicting a distribution over words,
switch to a binary prediction problem
• The model is given pairs of words and needs to distinguish if the words occur next to each other in the
training corpus or they are sampled randomly

122
• Logistic regression on inner product of word embeddings
• Can be trained very efficiently with lots of data
Using this we can some word vector arithmetics:
Word Vector Arithmetics

Figure 149: Word Vector Arithmetics

In Fig. 149 we can see some expression on the left and the nearest most likely token on the right. For example,
we can take the word embedding vector of Paris and subtract the word embedding of France and add Italy and
we get Rome as an outcome. This also shows that the semantics that this word embeddings have learned is
quite meaningful.

9.4 Neural Machine Translation

Neural machine translation is one of the landmark applications of neural language models.

9.4.1 Sequence to Sequence Learning

Sutskever et al. [13] proposed a very simple end-to-end trainable model for machine translation that shocked the
community because it was the first paper to produce results that outperformed all of the previously developed
hand-engineered rule-based systems for machine translation. The key behind this was a very well engineered
but simple model in combination with a lot of compute and data.
Sequence to Sequence Learning Model
• Two 4-Layer LSTMs for encoding/decoding the source/target sentence

• Encoding operates in reverse order of the input sentence to introduce short-term dependencies.
Because we end with the first word and then the decoder starts to decode with the first word in the
target language. By reversing the input sentence you get shorter dependencies because the first word in
the input sentence is likely related to earlier words in the output sentence.

• Otherwise this is a simple LSTM that processes the reverse ordered words, so that the words are embedded
in a distributed word representation. Also there is a hidden state that is carried forward through this
recurrent neural network and then there’s an intermediate representation:
• Intermediate representation produced by the encoder is called thought vector. This is the representation
that stores the meaning of that sentence that shall be translated. So once we reach the end of sentence
symbol we have the thought vector as the hidden representation. Which is a global representation of
the input sentence and then that fourth vector is decoded. The reason why we have this interface of the

LSTM Encoder LSTM Decoder

Thought
Vector

Figure 150: Sequence to sequence modelling

123
forward vector is that the input sentence and the output sentence may have different length, also the
word order may be very different in different languages depending on the grammar. So we create a global
representation of the input sentence default vector that is passed to decoder.
• Encoding using 1000 dim. word embeddings, decoding via beam search
• First end-to-end system that outperforms rule-based models and also the first system that demonstrated
that this is really possible at large scale which led to deployment.

9.4.2 Decoding
Let w1 , . . . , wT denote the target sentence and let v denote the thought vector. Sampling a translation from
the LSTM decoder is simple, because we have the autoregressive nature that we can just sample a new word
from the distribution that is predicted by the LSTM, given the thought vector and the previous words. Where
the previous words are summarized in this hidden state in the case of an LSTM:
wt ∼ p(wt |v, w1 , . . . , wt−1 )
But this is not what we want to do in translation, in that case we like to compute the most probable
translation:
w1 , . . . , wT = argmax p(w1 , . . . , wT |v)
w1 ,...,wT
This is costly because there is a large number of possibilities. So searching that exhaustively is intractable. But
often a greedy algorithm often works well in practice:
wt = argmax p(wt |v, w1 , . . . , wt−1 )
wt

So we only take the argmax over wt instead of over the sequence w1 , . . . , wT . Also the probability is calculated
for wt given the thought vector v and w1 , . . . , wT .

9.4.3 Beam Search

Failure of Greedy Algorithm:
• p(Apples are good) > p(Those apples are good), because “apples are good” occurs more frequently in the
training corpora.
• However there are more sentences in the training corpora that start with “those” compared to “apples”:
p(Those) > p(Apples)
• So if we do greedy decoding then we would start with “those” even though the probability of “apples are
good” is bigger than the probability of “those apples are good”.
This is where beam search comes in. It is better than a purely greedy algorithm but it’s still only an
approximation to the true solution that searches over the entire search space.

Idea of Beam Search:

• At each time step, maintain a list of the K best words and hidden vectors
• This can be used to produce a list of K best decodings for the following word, which can then be compared
to select the most likely one

Figure 151: Illustration of beam search. https://d2l.ai/chapter recurrent-modern/beam-search.html

124
In Fig. 151 an illustration of beam search can be seen. We start with an empty sequence and then we have
A, B, C, D, E as candidates and A and C are most likely so with a beam of size two we we keep track of A
and C.Next for A B is most likely and for C E is most likely. Then we continue with AB and CE. We always
maintain this short list of two hypotheses. If in the first step AB and AC would both be more likely than CE
then we would of course continue with those two possibilities.

9.4.4 The Transformer

In 2017 there was another big step and it was the transformer paper [15] that completely changed the architecture
of previous neural machine translation models into a purely attention-based architecture.
• Attention based model which doesn’t rely on recurrence or convolution.
• The advantage of an attention-based model over an RNN for instance is that there is no sequential
dependencies. One doesn’t need to wait for the previous words to be computed until you can compute the
probability for the current word, all the tokens can be processed in parallel. This is advantageous because
nowadays all the computational gains are through parallelism.
• Leads to significant speed-ups when using modern GPU clusters

• Self-attention relates all tokens in a layer with each other

• Thus can more easily capture long-distance dependencies compared to an RNN
• Transformer-like architectures have now replaced RNNs in NLP applications,
Defacto standard for all state-of-the-art models (e.g., on SuperGLUE benchmark)

Architecture of the Transformer

• Each layer in the Transformer has shape L[T, J] where t ranges over the position in the input sequence
and j ranges over features at that position. L is basically a matrix where for each position for each word
we have a vector, a word embedding.
• When processing sentences of words, T is the sentence length

• This is the same shape as in an RNN – a sequence of vectors L[t, J]

• However, unlike in RNNs, in the Transformer we can compute the layer L`+1 [T, J] from L` [T, J] in parallel
• In this respect, the transformer is more similar to a CNN than to an RNN

Self-Attention
• The fundamental innovation of the Transformer is the self-attention layer
• For each position t in the sequence we compute an attention over the other positions in the sequence
• The transformer uses multiple heads (because multiple heads are empirically better), i.e., it computes
the attention operation multiple times (K = 8 in the original implementation)
• Self-attention then constructs a tensor A[k, t1 , t2 ] – the strength of the attention weight from t1 to t2 for
head k. So the attention that t1 pays to t2 in a particular layer of the transformer for a particular head k.
• In the paper, an embedding dimension of DJ = 512 is chosen per token

• Using K = 8 heads, this results in a dimension of DQ = DK = DV = 64 for the query, key and value
embeddings that are used for each token (can be different). So they do computations at 64 dimensions
but they do it eight times and then they concatenate them again to get 512 dimensions.
Multi-Headed Self-Attention:
In the following we see the equations for the (multi-headed) self-attention-layer, the most fundamental layer of
The Transformer. Note that there are more layers which are not described here.
For each head k and word position t, we compute a key, query and value vector. The queries Q and the keys

125
Figure 152: The Transformer. http://jalammar.github.io/illustrated-transformer/

K are used to compute the self-attention matrix A for head k. A is then multiplied with the values V to yield
embedding vectors H that are concatenated.
Q
Q`+1 [k, t, i] = Wl+1 [k, i, J] L` [t, J] (41)
K
K`+1 [k, t, i] = Wl+1 [k, i, J] L` [t, J] (42)
V
V`+1 [k, t, i] = Wl+1 [k, i, J] L` [t, J] (43)
" #
1
A`+1 [k, t1 , t2 ] = softmax p Q`+1 [k, t1 , I] K`+1 [k, t2 , I] (44)
t2 DQ
H`+1 [k, t, i] = A`+1 [k, t, T ] V`+1 [k, T, i] (45)
L
L`+1 [t, j] = W`+1 [j, I] (H`+1 [1, t, I], . . . , H`+1 [K, t, I]) (46)

(1)-(3): So we start at the bottom where we have just the words represented as one-hot-vectors or word
embeddings of one-hot-vectors. Then we multiply this first layer with a matrix W to get a vector. This
multiplication runs over i and yields a 64 dimensional vector for each head and each word. This is done three
times, once for the query vector, once for the key vector and once for the value vector.

(4): For computing the attention matrix A, we first take the dot product of the query and the key vector. The
divison by the square root of DQ serves as a normalization factor. All of this is then put into a softmax over t2 . As
can be seen in the equation, Q goes over t1 and K goes over t2 and so we get this matrix A were t1 attends to t2 .

(5): Next we multiply this attention matrix A to the value vector V , where we now sum over the T dimension
of this vector V and the t2 dimension of matrix A.

(6): So now we have these 64 dimensional vectors H that we now concatenate over the different heads and
then we compute a matrix product with W L in oder to get the output of that layer.

In summary we compute queries, we compute keys and we compute values for each head and for each time
step, or word, and then from the queries and keys we compute attention and that attention is multiplied with
the values so we take the values where the model attends to and these are then concatenated and fed into the
next layer. Fig. 152 shows an illustration of the transformer that includes the here described self-attention layer
which is the core of the transformer.
If one wants to use the transformer for translation you will have a sequence of input words that are going to
be encoded which are depicted on the bottom left of Fig. 152, underneath the encoder. Then information from
this encoder is given to the decoder. In particular the key and the queries are given to the decoder and they are
combined with the values of each word in the sequence encoded through this encoder that’s part of the decoder.
The output sentence is successfully established through the decoder by combining keys and queries from the
encoder with values from the decoder and the predicting the next word.

126
10 Graph Convolution Networks
10.1 Machine Learning on Graphs
This section gives a motivation why GCNs have a better performance compared to other network architectures
and illustrates some areas of application of GCNs.

10.1.1 Motivation
Multi layer perceptrons (MLPs) are very flexible function approximators. Theoretically a MLP with only one
layer can already act as a universal function approximator, given that this layer can grow infinitely wide.
However, MLPs do not scale well. If the input to an MLP is large the amount of model parameters grows large
as well. Furthermore, the more parameters a neural network has, the more it tends to overfit: Its generalization
capability decreases. For structured signal grids like 2D images or 1D time series a CNN addresses this issue and
allows good generalization due to its convolution operation over the the regular grid of signals, which reduces
the number of signals. Unfortunately a lot of signals cannot be described in such a structured way such as
molecules or natural language. These signals may be better described with the help of graphs. In order to
exploit graph structured data, a model class is necessary that scales better than an MLP when receiving large
input data without loosing its predictive power. Further on this model has to be more flexible than a CNN in
order to exploit the local connectivity structure of any graph - not only structured ones - as prior information.

Figure 153: Comparison of graph structures. An illustration of a regular structured graph (left) and an
unstructured graph (right). The red dots marks the point of interest that is calculated by exploiting the graph
structure and the neighboring nodes (green).

Graphs are descriptors of a signal structure, where the signals are described as nodes (points) and the similarity
between signals with edges (lines between signals). Fig. 153 shows a comparison between a structured graph
(left) and an unstructured graph (right). On both graphs a convolution operation is applied in order to calculate
the red dot by exploiting the graph structure and the neighboring nodes (green). On the structured graph on
the left a convolution filter is applied, that computes a value for the red dot by calculating the dot product
between the elements in green and red. The equivalent operation is done on the right by using a convolution
that exploits the locality in the graph. These convolutions on unstructured graphs are polynomials conditioned
on the graph structure with the graph structure being encoded in a matrix derived from the graph.

10.1.2 Applications
Since different types of data can be represented in form of a graph, GCNs can be applied in several tasks in
multiple domains. Some of these applications are scene graph generation, multi-object tracking, authorship
attribution, recommendation systems, learning molecular fingerprints, protein interface prediction, interaction
networks, learnable physics engines for control, decentralized control of robot swarms.

10.2 Graphs
GCNs operate on graphs. Therefore some basic knowledge of graphs and operations on graphs is presented in
this section.
A graph can be represented as a triplet G = (V, E, W) with vertices V, edges E and weights W. Vertices or
nodes are defined as a set of N numerical labels V = {1, . . . , N }. Edges are ordered pairs of these numerical
labels (i, j), where (i, j) ∈ E is interpreted as ”node i is influenced by node j”. Weights wij ∈ R are numbers
associated to edges (i, j) that determine the strength of the influence that node j has on node i. Depending on
these basic properties, a graph can be classified as a directed graph or as a symmetric graph.

127
In the case of a directed graph Fig. 154 the edge (i, j) differs from edge (j, i). Thus, a connection between
between two nodes can be one-way only, meaning that (i, j) ∈ E and (j, i) ∈ / E. Furthermore, if a connection
between two nodes is bidirectional {(i, j), (j, i)} ⊆ E, their weights can be different wij 6= wji , e.g. the connections
between nodes 3 and 5 in Fig. 154 could have different weights.
The undirected or symmetric graph is a directed graph, with the special property that its edge set and its
weights are symmetric. Therefore, if the edge set E contains (i, j) it implies that (j, i) ∈ E as well. This is
illustrated in Fig. 155 with the red line with arrows on both sides. For the weights of these connections (i, j)
that means that wij = wji for all (i, j) ∈ E.
A special case in directed and symmetric graphs is the unweighted graph, where all existing connections
(i, j) ∈ E have weights wij = 1. Since the weights of a graph convey valuable information as a prior for GCNs,
most graphs that are encountered in the context of deep learning are weighted.

Figure 154: Directed graph. Nodes are illus- Figure 155: Symmetric graph. The illustration
trated by numerical labelled purple dots, edges are is similar to Fig. 154. Due to the symmetry of the
denoted by arrow lines and weights with a descrip- graph the edges are directed in both directions and
tion in the form wij . there is exactly one weight per edge.

Graph Matrix Representation. The visual graph with its connections can be mathematically represented
with the help of different matrices. A common representation of the edges of a graph G is the adjacency
matrix A. In this sparse N × N matrix, each row and each column represent a vertex and each entry Aij
contains the weight wij of all connections i, j ∈ E. Furthermore if the graph is symmetric its holds that A = A>
as seen in Fig. 156.

   
0 1 1 0 0 2 0 0 0 0
1 0 1 1 0 0 3 0 0 0
   
1
A= 1 0 0 1 0
D= 0 3 0 0
0 1 0 0 1 0 0 0 2 0
0 0 1 1 0 0 0 0 0 2

Figure 156: Unweighted graph and corresponding matrices. The unweighted graph is transformed into
respective matrices A and D. Colors mark the corresponding location of the edges in the matrices

Another representation is the degree matrix D that contains the degree of a vertex on its diagonal axis (see
Fig. 156). The degree di of node i is the sum of weights of its incident edges:
X
di = wij , (47)
j∈N (i)

where N (i) - the neighborhood - is the set of nodes that influence node i:

N (i) = {j|(i, j) ∈ E}. (48)

The diagonal Dii therefore contains the degree d of vertex i: Dii = di . D can be expressed in terms of adjacency
matrix as D = diag(A1)4
The Laplacian matrix L combines matrices D and A as follows:

L = D − A. (49)

It therefore contains D on its diagonal and A negated on its off-diagonals.

P Written explicitly in terms of graph
weights the entries consist of Lij = −Aij = −wij and Lii = di = j∈N (i) wij . One can think of this matrix as
a Laplacian in regular domain. It therefore measures the smoothness of a graph.
4 diag converts the vector A1N into a N × N square matrix with the vector on its diagonal.

128
Both the Laplacian and the adjacency matrix can be normalized to achieve a more homogeneous representation
of a graph. That is especially helpful for asymmetric graphs, where some nodes have a lot of neighbors and/or
a high degree and some nodes have only a few neighbors and/or a low degree. The normalized adjacency
matrix expresses weights relative to node degrees:
1 1 wij
Ā = D− 2 AD− 2 ⇒ Āij = p . (50)
di dj

The normalized Laplacian matrix is similarly defined as:

1 1 1 1
L̄ = D− 2 LD− 2 = D− 2 (D − A)D− 2 = I − Ā. (51)

The Laplacian matrix, the adjacency matrix and the normalized forms of both are in the following represented
by the Graph Shift Operators S. That is, because for the theoretical analysis of GCNs the specific graph
matrix representation is irrelevant. Nevertheless during deployment the specific representation matters and
leads to different results. It holds that if G is symmetric S = S> .

Graph Signal Diffusion. Given the mathematical representation of a graph G as S that captures the struc-
ture of this graph, a graph signal can be defined as a vector x ∈ RN that assigns a value xi ∈ R to every node
i. In that way S encodes the expected proximity or similarity between components of x. A multiplication Sx
yields to a diffused signal y over G:
y = Sx (52)
P
If S is the adjacency matrix that would yield to yi = j wij xj . The operation mixes the neighboring values of
xi and diffuses its signal along the edges over G. Since one application of S moves the signal of x only one edge
at a time, a diffusion sequence over k steps can be defined as

xk+1 = S xk with x0 = x (53)

or equivalently in form of a power sequence

xk = Sk x. (54)
(53) should be implemented in practice since leads to faster processing. For the following description of Graph
Convolution Filters (54) is used.

10.3 Graph Convolution

Graph convolution filters are the basic building block of a GCN. Given S and filter coefficients hk , a graph
convolution is a polynomial on S
X∞
H(S) = hk Sk (55)
k=0
P∞
The result of applying the filter H(S) to the signal x is the signal y = H(S) x = k=0 hk Sk x. As a short
hand notation this is written as y = h ?S x, where ?S denotes the graph convolution operation of filter
h = {hk }∞k=0 on signal x and graph shift operator S. The filter coefficient hk determines the importance of a
particular diffusion state k = 0, ∞ that is multiplied by x. Since in practice only a a finite amount of diffusion
states K − 1 is observed, it yields that
K−1
X
y = h ?S x = h0 S0 x + h1 S1 x + · · · + hK−1 SK−1 x = hk Sk x. (56)
k=0

The convolution therefore successively aggregates information from local to global neighborhoods. This is done
by a linear combination of the elements of the diffusion sequence Sk x weighted by the filter coefficients h.
Fig. 157 shows an exemplary application of a graph convolution.

Time Convolutions as Graph Convolutions. Time can be represented in form of a 1D line graph of
infinite length, where an input signal is shifted infinite many times. A resulting graph shift operator could
therefore be an infinitely big adjacency matrix A with ones only one left the diagonal. Now, given an input
signal x andPS = A, time convolution can be thought of as a polynomial on the adjacency matrix of the line
∞
graph: y = k=0 hk Sk x.

129
Figure 157: Graph convolutions as diffusion operators. A graph convolution can be visualized in a block
diagram. Iteratively a shift of S, a weighting of Sk x with hk and a summation of the weighted terms is applied.

Graph Fourier Transforms. Graph Fourier transforms (GFT) are equivalent to standard Fourier transforms
in the regular domain but on graphs. Given the eigendecomposition of the graph shift operator S = VΛV∗ ,
the GFT of a graph signal is given by
x̃ = V∗ x. (57)
The GFT is thus a projection onto the eigenspace of S. The GFT is a helpful tool for analyzing graph
information
P∞ processing systems, since in the GFT domain, graph convolutions are only pointwise operations
ỹi = k=0 hk λki x̃i . Further information on GFTs can be found at https://GCN.seas.upenn.edu/.

10.4 Graph Convolution Networks

Learning Graph Filters. After knowing what graph convolutions are, an algorithm for learning the filter
coefficients h of this graph convolution operation can be defined. This is equivalent to training a GCN with a
single layer only. Let D = {(xi , yi , Si )} denote a dataset of input signals xi , output signals yi and graph shift
operators Si of a graph. It is important to note that a dataset can contain various graph shift operators Si .
Thus, a single graph filter can be trained on multiple graphs in order to become more robust during deployment
even for unseen graph representations. The graph convolution (56) is denoted by fh (x, S). A prediction of an
output signal ŷ can be calculated using the following equation:
K−1
X
ŷ = fh (x, S) = hk Sk x. (58)
k=0

During the training a loss L is minimized between the predicted output signal ŷ = fh (x, S) and the original
output signal y in order to find the optimal values for the filter coefficients h:
X
h∗ = argmin L (fh (x, S), y) (59)
h
(x,y,S)∈D

If the inference problem at hand has a distinct input and output dimension, a readout layer is needed. Let
N denote the number of vertices in our graph, thus the input dimension of xN and M the output dimension of
ŷM with N 6= M . The readout layer is defined as:
K−1
X
ŷ = fh (x, S) = R hk Sk x (60)
k=0

with the readout matrix R ∈ RM ×N that matches the input and output dimension. Typically, R is not learned
but a design choice of the programmer. R could be such that it reads out only the value of node i by using
a unit vector at location i R = e>i . Another exemplary readout matrix could be the summation over all N
elements of all the graph R = 1> . This is useful for classification purposes.

Graph Perceptron. The graph filters as defined above have limited expressive power as they can only learn
linear mappings. In order to achieve a higher expressive power, the graph filters are combined with point-
wise non-linearities g(·), such as sigmoid, tanh or ReLU activation functions. This function is named graph
perceptron, since it introduces the same features for GCN as the perceptron for general neural networks. The
graph perceptron can be expressed as:
K−1
!
X
k
fh (x, S) = g hk S x . (61)
k=0

130
Because of the introduced non-linearity, the graph perceptron is able to express a larger function class.

Graph Convolution Networks. A GCN can now be defined by stacking multiple layers of graph perceptrons
on top of each other. Similar to a MLP, the GCN is recursively composed
K−1
!
X
k
x` = g h`k S x`−1 , (62)
k=0

where it is assumed that the input to the first layer is set to the input signal x0 = x. Note that compared to
(61) x and hk have index ` now to denote the layer. A short hand notation for the recursive application of the
graph perceptron over L layers is:
ŷ = fH (x, S) = xL , (63)
where H denotes a set of L vectors of trainable filter coefficients H = {h1 , . . . , hL }. The recursive composition
of a GCN is illustrated in Fig. 161. Empirical evidence suggests that, similar to CNNs deeper GCNs have a
better prediction performance due to more layers of non-linearities.

Learning Graph Convolution Networks. The filter parameters of a GCN are learned in same way as in
the single-layer-case but with the filter coefficients of all layers H as the optimization objective:
X
H∗ = argmin L (fH (x, S), y) . (64)
H
(x,y,S)∈D

Comparing a Multi-Layer Perceptron and a Graph Convolution Network. A comparison between

a MLP and a GCN shows that the GCN can be regarded as a special case of a MLP. This becomes clear when
looking at the trainable parameters of both network architectures. In (62) the learnable parameter in each layer
is the vector h` , whereas the MLP with x` = g (W` x`−1 ) has a weight matrix W` . Under the assumption that
S is constant in the dataset, it is obvious that the MLP is more flexible during training due to having more
parameters and thus attains lower training cost:
X X
min L (fW (x), y) ≤ min L (fH (x, S), y) (65)
W H
(x,y)∈D (x,y)∈D

Although GCNs are a special case of MLPs, their generalization capabilities to unseen input signals are better.
That is because GCNs are able to exploit a priori information about the symmetries of a graph that are carried
with the graph shift operator S. GCNs learn the structure of a graph as seen in Fig. 158. Furthermore, unlike
MLPs, GCNs can be trained on different graphs by passing different graph shift operators to them.

Figure 158: Comparing the generalization of MLPs and GCNs. The three images depict an output
signal on a graph, where the color indicates the signal strength. An MLP and a GCN receive the same input
and output pairs during training and produce a similar output signal (left). During inference time an unseen
signal is passed into the model and the MLP predicts an output signal similar to the predictions during training
(center). The GCN in contrast generalizes better since it knows the underlying structure of the graph. Therefore
it is able to predict an output signal that has a structure similar to the structures predicted during training but
on another part of the graph (right). This is due to the translation equivariance of such a network.

Multiple-Input-Multiple-Output GCNs. The definition of a standard GCN in (62) is limited to an input

vector x. Thus, for each vertex of the graph, only a single scalar value can be used as an input and as an
intermediate representation. In a Multiple-Input-Multiple-Output GCN (MIMO GCN) this restriction is eased
by introducing filter banks:
Filter banks take a graph signal and output a graph signal matrix Z = (z1 , . . . , zG ), where Z represents G
features per node. This is achieved by transforming a graph signal with G convolution filters separately (see
Fig. 159).

131
Figure 159: Filter banks. The block diagram on the left receives a signal x of the graph and applies this signal
separately with 3 different graph filters (G = 3). This produces a graph signal matrix Z = (z1 , z2 , z3 with each
vecotr containing one feature per node (right).

If now the input to a MIMO GCN is multidimensional with F features per node, the feature vector xf is
processed through G filters weighted by coefficients hfk g :
K−1
X
uf g = hfk g Sk xf (66)
k=0

The MIMO Graph Filter, thus generates an output with F × G features per node. Applying filter banks
over multiple layers would therefore lead to an unwanted exponential growth of features and parameters. For
this reason, in each layer the output of each graph filter G is summed up. This reduces the number of features
to G (see Fig. 160:
XF XF K−1
X fg
zg = uf g = hk Sk xf (67)
f =1 f =1 k=0

In matrix notation the MIMO graph filter can simply be expressed as

F K−1
X X K−1
X
zg = hfk g Sk xf ⇔ Z= Sk X Hk , (68)
f =1 k=0 k=0

where  11
h1G

K−1 hk ··· k
F  .. ..  .
X
k
z1 G 1

··· z = S
|{z} x ··· x  . . 
k=0 hF 1 hF G
| {z } | {z }
= Z ∈ RN ×G
∈R N ×N
= X ∈ RN ×F ···
| k {z k
}
= Hk ∈ RF ×G

A MIMO GCN is now defined by stacking multiple MIMO Graph Filters on top of each other for L layers.
This yields to the following equation:
K−1
!
X
k
X` = g S X`−1 H`k (69)
k=0

with X0 = X being the input signal matrix to the first layer. (69) is expressed as a short-hand notation as:

Ŷ = fH (X, S) = XL (70)

with parameters H = {H1 , . . . , HL }, where H` = {H`0 , . . . , H`,K−1 }.

Compared to the standard GCN the MIMO GCN is handling matrices as an input and output (see Fig. 162).

11 Autoencoders
11.1 Latent Variable Models
Latent variable models capture the structure of the data space using latent variables that can be modelled
using unsupervised learning without the need for data-label pairs. Formally, a latent variable model relates the
observation space x ∈ RD and the latent space z ∈ RQ with N << Q. The mapping can be fw : x → z referred
as encoder or gw : z → x referred as decoder. Latent variable models always have a decoder but may or may not
have an encoder. Furthermore, the mapping can either be linear or non-linear and deterministic or probabilistic.
Table 2 cites one example for each of mapping types. Latent variable models capture the underlying structure
and semantics of the data manifold usually in a lower dimensional space.

132
Figure 160: Filter Banks with multiple inputs. Each of the F feature vectors xf is applied to G graph
filters and produces the outputs uf g that are then summed up to z g .

Figure 161: Block diagram of a GCN. The in- Figure 162: Block diagram of a MIMO GCN.
put vector x0 is passed into the first graph per- In contrast to the GCN, the MIMO GCN receives
ceptron in the first layer ` = 1. The output x1 is an input matrix X and produces with the parame-
equally passed into layer ` = ` + 1 until the final ter coefficient matrix H the output matrix Z con-
output x3 is is produced. taining the values from the filter bank.

11.1.1 Generative Latent Variable Models

Broadly, generative models tries to capture the underlying data distribution p(x) from which the data points
{xi }N
i=1 are supposedly sampled from. They are primarily used to sample random instances from the modelled
distribution. Generative latent variable models capture the structure and semantics of the data using latent
variables. Generally, they employ a Bayesian model and the data distribution p(x) is defined as:
Z
p(x) = p(z)p(x|z) = Ez∼p(z) [p(x|z)],
z

where p(z) is the prior probability over the latent variable z ∈ RQ and p(x|z) represents the likelihood of x
given z. Given a dataset X , generative latent variable models aim to maximize p(x) where x ∈ X by learning
a prior p(z) and the likelihood p(x|z). Each ata point x is associated to a unique latent variable z. Fig. 163
shows graphical representation of generative latent variable model.

133
Deterministic Probabilistic
Linear Principle Component Analysis(PCA) Probablistic PCA
Non-Linear with Encoder Autoencoder Variational Autoencoder
Non-Linear without Encoder Generative Adversarial Networks

Table 2: Taxonomy of latent variable models

Figure 163: Plate notation of generative latent variable models. The model parameters w are fixed for
all the N data points. However, each input is associated with one distinct latent variable

11.2 Principal Component Analysis

Standard principal component analysis(PCA) is a deterministic and linear latent variable model that is primarily
used in data analysis and as a dimensionality reduction technique. PCA learns a bidirectional linear mapping
between the input space x and the latent space z. Given a dataset X = (x1 , . . . , xN )T ∈ RN ×D , PCA assumes
a linear mapping between data points X and the corresponding latent variables Z = (z1 , . . . , zN ) ∈ RN ×Q as
follows:
Q
X
x̂i = x + zij vj
j=1

PN
where x̂i is the reconstruction/prediction of the data point xi and x = N1 i=1 xi is data mean, V =
(v1 , . . . , vQ ) is an orthonormal basis of the latent space. Essentially, PCA constitutes a decoder given
by x = x + Vz (obtained by rewriting the above equation in matrix form) and an encoder z = VT (x − x)
obtained by inverting the above linear mapping. The goal of PCA is to minimize the L2 reconstruction loss wrt
latent variables Z and the orthogonal basis V. We formalize the reconstruction loss below:

N
X N
X Q
X
L(Z, V) = kx̂i − xi k2 = kx + zij vj −xi k2
i=1 i=1 j=1
| {z }
x̂i
N X
X Q
= k zij vj + x − xi k2
i=1 j=1
 
N
X XQ Q
X
2
=  zij +2 zij vjT (x − xi ) + kx − xi k2 
i=1 j=1 j=1

The optimization objective of PCA is given by

(Z∗ , V∗ ) = argmin L(Z, V),

Z,V

11.2.1 Solving PCA: Reconstruction Error Minimization

One can obtain the minimizers (Z∗ , V∗ ) of the reconstruction loss in closed-form. We first solve for the latent
variables Z by setting the partials of the reconstruction loss wrt Z to zero:

134
∂L(Z, V) !
= 2zij + 2vjT (x − xi ) = 0
∂zih
∗
=⇒ zij = −vjT (x − xi )

We plugin the solution for Z = Z∗ into the reconstruction loss L(Z, V) which can be simplified and rewritten
as
 
XN XQ
L(Z∗ , V) = − ∗2
zij + kx − xi k2 
i=1 j=1
Q
X N
X
=− vjT Svj + kx − xi k2
j=1 i=1

PN
where S = i=1 (x − xi )(x − xi )T is the scatter matrix of the dataset X.
We then proceed to find the orthogonal basis V using the simplified form. Since, V is constrained to be the
orthonormal basis, the solution V∗ that minimizes L(Z∗ , V) is subjected to an equality constraint vjT vj = 1 for
all j ∈ 1, . . . , Q. We employ Lagrange multiplier to solve for V and the Lagrange expression for the same is
given by:
Q
X N
X Q
X
L(Z∗ , V, λ) = − vjT Svj + kx − xi k2 + λj (vjT vj − 1)
j=1 i=1 j=1

We find V∗ by setting the partials of the above Lagrange expression wrt V to zero

∂L(Z∗ , V, λ) !
= −2Svj + 2λj vj = 0
∂vj
=⇒ Svj = λj vj

The possible solutions for (λ, V) are the eigen values and vectors of the scatter matrix S ∈ RD×D . There can
exist upto D eigen vectors. For latent space with dimensions Q < D we choose the eigen vectors with top Q
largest eigen values. This
PQ ensure the P reconstruction loss is minimized as the loss is proportional to the sum of
Q
pruned eigen values. j=1 vjT Svj = j=1 λj .

11.2.2 Solving PCA: Latent Variance Maximization

: An alternative motivation for PCA is maximizing the variance in the latent space. Formally, the variance
of latent variables can be written as,

V ar(z) = E (vT (x − x) − E vT (x − x) )2

= E (vT (x − x)(x − x)T v

∝ vT Sv

which shows maximizing the variance is proportional to maximizing the term vT Sv. Thus, one can solve the
below optimization objective wrt V for variance maximization of the latent variables where V is constrained to
be a orthonormal basis like before.
Q
X Q
X
(V∗ , λ∗ ) = argmax vjT Svj + λj (vjT vj − 1)
V,λ j=1 j=1

The above objective can be maximized by the Q largest eigen values and their corresponding eigenvectors.

11.2.3 Applications
Despite being a linear model, PCA yields good reconstructions with very low dimensional latent spaces. Besides
dimensionality reduction, PCA can also be used to explore the latent properties of the datasets. Fig. 164
showcases two applications of PCA on image manifold.

135
(a)

(b)

Figure 164: Applications of PCA to Image ManiFold a) Application of principal component analysis on
MNIST digit dataset. PCA is applied on the images with digit 3. The reconstruction becomes better as the
number of basis are increased. b) shows reconstruction of face images using only 3 eigen components achieving
significant dimenstionality reduction.

11.3 Autoencoder
Autoencoders generally constitutes a deterministic non-linear encoder fw and a decoder gw . They are usually
learnt to predict the input x ∈ RD as output x̂ ∈ RD . The encoder fw : x → z maps the input to a latent
code z ∈ RQ while decoder gw : z → x tries to reconstructs the input back from the latent code z. Hence,
autoencoders are trained to minimize the reconstruction loss such as squared error, kx̂ − xk2 . Fig. 165 presents
a schematic illustration of auto encoders.

Figure 165: Schema of an autoencoder. Autoencoder takes the input x and encodes into the latent repre-
sentation z using the encoder network fw (x). The latent code z is then reconstructed back using the decoder
network gw (z) which outputs the prediction x̂. To learn the model parameters w squared reconstruction error
is used.

The non-linear mappings fw and gw are parameterized by neural networks with weights w. One can choose
an appropriate class of the neural networks(Convolutional Neural Networks, Multi Layer Perceptron, etc.) based
on the nature of the input data.

11.3.1 PCA:Special Case of Autoencoders

When the mappings fw and gw are assumed to be linear using identity activation functions, autoencoders
interestingly reduce to standard PCA. The encoder is of form fw (x) = Ax + a which computes the latent
variable z and the decoder is given by gw (z) = Bz + b which outputs the reconstruction x̂ of input x. Since,

136
the goal of autoencoders is to minimize the reconstruction error the optimization objective can be written as:
N
X
w∗ = argmin kx̂i − xi k2
w
i=1
N
X
= argmin kgw (fw (xi )) − xi k2
w
i=1
N
X
= argmin k(B(Axi + a) + b) − xi k2
w
i=1
N
X
= argmin k(Cxi + c) −xi k2
w | {z }
i=1
x̂i

The optimal weights w∗ can be obtained by PCA as we have seen before.

Autoencoders can learn non-linear mappings that makes them powerful and often have small reconstruction
errors compared to PCA. In Fig. 166 we juxtapose the reconstructions of PCA and non-linear autoencoders
which suggests that autoencoders easily achieve very small reconstruction errors.

(a) Standard PCA (b) Linear autoencoder

Figure 166: Comparison of PCA and autoencoder on cosine manifolda) PCA applied on cosine data
with 1 dimensional latent space. b) Linear autoencoder with same latent space as PCA. It is interesting to
see that after sufficient number of iterations the reconstructions of PCA and linear autoencoder are almost
indistinguishable. c) Non-linear autoencoder with 1 dimensional latent space. With the non-linear functions
autoencoder produces much better reconstructions than standard PCA. d) With increase in the latent space
the reconstruction completely overlaps with the data points.

11.3.2 Denoising Autoencoder

Denoising autoencoders take noisy input data and learns to reconstruct the original undistorted input. In this
way the model learns to stable high level representations and increases its robustness to input corruptions.

137
11.4 Variational Auto Encoders(VAE)
11.4.1 Intractability of Learning Generative Latent Variable Models
Generative latent variable models disccussed in Section11.1.1 essentially captures the underlying data distri-
bution p(x) by considering a simple Bayesian model with prior probability p(z) over the latent space and
conditional probability p(x|z) over the input space given z. However, learning optimal parameterized distri-
butions is often intractable. Consider pw (x) to be a distribution parameterized by some weights w. The goal
of generative latent variable models is to maximize p(x) for the data points x ∈ X in our dataset. To find the
optimal weights w∗ that maximizes p(x) one can use the familiar minimization objective stated below:

w∗ = argmin Ex∼X [− log p(x)]

w

= argmin Ex∼X − log Ez∼pw (z) [pw (x|z)]
w
N
1 X
= argmin − log Ez∼pw (z) [pw (xi |z)]
w N i=1

The objective requires computing expectation over the latent variable z with distribution p(z). One is forced to
draw huge number of samples from p(z) to approximate the expectation and this number exponentially grows
with increase in latent dimensions making computation of the objective intractable.

11.4.2 Variational Auto encoders

Variational Auto encoders tackles the above intractability issue by approximating the posterior distribution
pw (z|x) with a so called recognition model qw (z|x). Using this approximation one can seek a tractable
lower bound for the marginal distribution pw (x) and optimize the lower bound to in turn maximize the
marginal distribution. We discuss the lower bound in detail in the next section.
To provide an intuitive explanation for recognition model qw (z|x) alleviating the intractability problem, consider
an observation space to be sound waves and latent space to be the word sequences. We wish to maximize the
likelihood of the sound waves in our dataset. Given a song(sound waves) from our dataset it might be highly
difficult to infer the lyrics(word sequences) from a vast space of words sequences. However, it becomes easier to
realize the song when someone can provide us some approximate of what the lyrics could be for the given song.
Similarly, the recognition model qw (z|x) provides an approximation of the true posterior p(z|x). This helps
in restricting the search space from which we the sample latent variable z compared to the prior probability
pw (z). Fig. 167 pictorially showcases an example where drawing three samples from prior probability pw (z)
assigns near-zero probability to the datapoint xi while with an approximate recognition model qw (z|x) realized
significant probability to the input.

Figure 167: Illustration of intractability with prior probability p(z). We wish to estimate the probability of
p(xi ) by drawing three samples from p(z) For all the three samples p(xi |z) is almost zero leading to draw more
samples from p(z). However, using the recognition model q(z|xi ) the search space for z is reduced and thereby
leading to significant estimate of p(xi ) with just three samples.

138
11.4.3 The Evidence Lower Bound
The log likelihood pw (x) can be rewritten as follows using the recognition model qw (z|x). We drop the parameter
w in below equations as the bound holds in general.

p(x)p(z|x)
log p(x) = Ez∼q(z|x) log
p(z|x)

p(x, z) q(z|x)
= Ez∼q(z|x) log + Ez∼q(z|x) log
q(z|x) p(z|x)

p(x, z)
= Ez∼q(z|x) log + KL(q(z|x), p(z|x))
q(z|x) | {z }
≥0

Note that the above reformulation maintains the equality. Furthermore, the KL divergence term in the above
equation is intractable as it relies on the true posterior p(z|x). However, KL divergence is non-negative
distance measure. Thus, we eliminate the intractable KL divergence and obtain a lower bound for the log
likelihood defined by:

p(x, z) q(z|x)
log p(x) ≥ Ez∼q(z|x) log =⇒ − log p(x) ≤ Ez∼q(z|x) log
q(z|x) p(x, z)
ELBO

One can interpret the above bound as enforcing KL measure to inhibit the divergence of the approximate
recognition model term from the prior distribution p(x) while the other term measures the reconstruction
error wrt to the input x.

11.4.4 Learning Objective

Given a dataset X = (x1 , . . . , xN ) and w the model parameters, VAE minimizes the negative log likelihood
objective as follows:

w∗ = argmin Ex∼X [− log pw (x)]

w
N
X
= argmin [− log pw (xi )]
w
i=1
N
X qw (z|x)
≈ argmin Ez∼qw (z|xi ) log ELBO approximation
w
i=1
pw (xi , z)
N
X
= argmin KL(qw (z|xi ), p(z)) + Ez∼qw (z|xi ) [− log pw (xi |z)] By expansion
w | {z }
i=1
| {z }
Approx Posterior = Prior Reconstruction Term

11.4.5 Neural Network Parameterization

Essentially VAEs comprises of a recognition model qw (z|x) that approximates the true posterior p(z|x) and a
likelihood model pw (x|z) To ensure tractability of the KL divergence term in the learning objective, the recog-
nition model is chosen to be a multi-variate Gaussian distribution parameterized by the neural network.
However, the prior of the latent space is generally as standard Gaussian distribution. Formally, the Gaussian
recognition model can be expressed as:

1 1 1 T −1
qw (z|x) = exp − (z − µw (x)) Σ w (x) (z − µw (x))
(2π)Q/2 |Σw (x)|1/2 2

139
Figure 168: In this example the random variable z on the left takes a standard Gaussian distribution. However,
by finding an approximate function the random variable can evoke a different distribution. Here when g(z) =
z/10 + z/kzk the normally distributed random variable can generate the distribution on the right using g(z).
This is how VAEs are expressive despite simple prior probability distribution.

where Σw (x) and µw (x) are parameterized by neural network with weights w. Typically, Σw (x) = diag(σ 2w (x))
a diagonal matrix and shares the same backbone with µw (x). The reason for restricting the recognition model
and the prior to be Gaussian distributions such that KL divergence term in the learning objective is tractable
and can be computed directly using an analytical solution.
One final problem with the current framework is that the gradients for reconstruction term wrt w needs to
backpropagate through sampling operation that is non-differentiable. This makes training the parameters
with stochastic gradient methods challenging. VAE resolves this problem by invoking an alternate method for
sampling from qw (z|x) which is called as the reparameterization trick.
Reparameterization trick moves the sampling step to an input layer and avoids directly sampling from qw (z|x).
Given an input xi , Σw (xi ) and µw (xi ) can be deterministically computed. Alternative to sampling directly
from the distribution N (µw (xi ), σ w (xi ), ) one can sample a standard Gaussian distribution by ∼ N (0, I) and
construct the latent sample as
K
z = µw (xi ) + σ w (xi )

In this way one can sample from the latent variables by first sampling the error variable and computing z
using deterministic functions. With the above trick the reconstruction term can be written as
h K i
Ez∼qw (z|xi ) [− log pw (xi |z)] = E∼N (0,1) − log pw (xi |z = µw (xi ) + σ w (xi ) )

. A schematic illustration of reparameterized VAE is shown in Fig. 169

11.4.6 Applications
Although, variational autoencoders assume prior distribution of the latent space pw (z) to be a standard normal
distribution, with powerful neural networks they can be quite expressive. VAE can learn a mapping from the
standard normal distribution to any distribution of the latent variables using the first few layers of the decoder
networks. See Fig. 168 for an example. A broad range of recent demonstrations of VAEs are illustrated in
Fig. 170.

12 Generative Adversarial Networks

12.1 Generative Adversarial Networks
12.1.1 Generative Models
Generative models
• are probabilistic, they learn a probability distribution pmodel to represent the true data distribution pdata
• are able to generate samples from pmodel
• can in some cases estimate pmodel explicitly and therefore allow to evaluate the (approximate) likelihood
pmodel (x) of a sample x
• are called implicit if they do not allow for evaluating the likelihood, explicit otherwise

140
(a) VAE regular

(b) VAE reparameterized

.
Figure 169: In a VAE we apply KL divergence measure on the recognition model and reconstruction loss on
the likelihood model. With a regular VAE the latent variable z is sampled directly from the recognition model
qw (z|x). With reparameterization trick the latent variable is obtained by using deterministic estimates σ w , µw
and sampling from N (0, 1)

12.1.2 GANs
Generative Adversarial Networks are currently the most popular generative models. Just like (Variational)
Autoencoders they are non-linear, but GANs are implicit models, do not have an encoder and are trained
differently.
GANs use an adversarial process in which two models are trained simultaneously. Given a prior (e.g. gaussian)
p(z) over latent variables z ∈ RQ as well as observations x ∈ RD the two models can be described as:

• a generator GwG : RQ 7→ RD that gets a sample z ∼ p(z) as input and generates x̂ ∼ pmodel with pmodel
approximating pdata
• a discriminator DwD : RD 7→ [0, 1] that estimates the probability that a sample comes from pdata

The goal of G is to maximize the probability of D making a mistake while the goal of D is to make as little
mistakes as possible. Due to the two models playing against each other, both of them need to improve constantly.
This can be described as a two-player minimax game with a value function V (G, D):

G∗ , V ∗ = argmin argmax V (D, G) (71)

G D
V (D, G) = Ex∼pdata (x) [log D(x)] + Ez∼p(z) [log(1 − D(G(z)))] (72)
| {z } | {z }
recognize true samples reject samples generated by G

V (G, D) describes the expectation of D recognizing samples from pdata and rejecting samples generated by
G. The parameters of D are optimized towards maximising this expectation while the parameters of G are
optimized towards minimizing V (G, D). The optimization of the models is attained via Backpropagation.

Generator and discriminator can be implemented with different types of neural networks (e.g. MLPs,
CNNs, RNNs). After training only the generator is kept in order to sample from pmodel . The discriminator
can be seen as a learned loss function on x̂ that is only necessary for training the generator. Fig. 171 shows an
overview of the functionality of GANs.

12.1.3 Algorithm
In order to train GANs, the discriminator D is updated first. For k iterations an equal amount of real samples
x as well as latent samples z are drawn. The latent samples are used to generate samples x̂. In each iteration
the weights pf the discriminator are updated by stochastic gradient ascent of V (G, D).
Afterwards, the generator is updated. After drawing new latent samples z, the weights of G are updated via

141
stochastic gradient descent. Here, only the second part of V (G, D) is used since the first part of the sum is
independent of the generator.
D is optimized k times (typically with k ∈ {1, . . . , 5}) for each optimization of G in order to maintain it near
its optimal solution, while preventing overfitting if the dataset is finite. For this to work, the generator has to
be changed slowly, i.e. its learning rate must be small. The algorithm stops when the value function converges
to some extreme.

While not converged do

1. For k steps do

1.1 Draw B training samples {x1 , . . . , xB } from pdata (x)

1.2 Draw B latent samples {z1 , . . . , zB } from p(z)
1.3 Update the discriminator D by ascending its stochastic gradient:
B
1 X
∇wD log D(xb ) + log(1 − D(G(zb )))
B
b=1

2. Draw B latent samples {z1 , . . . , zB } from p(z)

3. Update the generator G by descending its stochastic gradient:
B
1 X
∇wG log(1 − D(G(zb )))
B
b=1

12.1.4 The gradient trick

Early in the training, the generated samples x̂ are rather random, so the discriminator can easily distin-
log 1
guish between true and generated samples. This leads to a low gradient log(1 − D(G(z))) = log D(G(z)) .
In contrast, log(D(G(z))) provides stronger gradients and can therefore be maximized instead of minimizing
log(1 − D(G(z))). See Fig. 172 for visualization.

12.1.5 Expressiveness
GANs are very expressive. Fig. 173 shows an one-dimensional example of a gaussian pmodel distribution shifting
its mean and increasing the variance in order to adapt to pdata . It can be seen how with increasingly overlapping
distributions pmodel and pdata the discriminator becomes increasingly uncertain until it returns a probability of
about 0.5 everywhere.

12.1.6 Theoretical results

Proposition 1. Optimal discriminator
For any given generator G, the optimal discriminator D is:

∗ pdata (x)
DG (x) =
pdata (x) + pmodel (x)

Proof. The training criterion for the discriminator D is to maximize (wrt. D):
Z Z
V (G, D) = pdata (x) log(D(x)) dx + p(z) log(1 − D(G(z))) dz (73)
Zx z

= pdata (x) log(D(x)) + pmodel (x) log(1 − D(x)) dx (74)

x
(75)
2
f = a log(y) + b log(1 − y) (a, b) ∈ R \ {0, 0} (76)
a b
⇒ f0 = + · −1 (77)
y (1 − y)

142
Set the derivative to 0 to calculate the maximum:
a b a b
0= − ⇔ = (78)
y (1 − y) y (1 − y)
⇔ a − ay = by ⇔ a = by + ay (79)
a
⇔y= (80)
a+b

This proposition shows that the optimal discriminator will return the probability of an input x being part of
the data distribution pdata .

Theorem 1. Global Optimality

The global minimum of the virtual training criterion
∗ ∗ ∗
V (G, DG ) = Ex∼pdata [log DG (x)] + Ex∼pmodel [log(1 − DG (x))] (81)
pdata (x) pmodel (x)
= Ex∼pdata log + Ex∼pmodel log (82)
pdata (x) + pmodel (x) pdata (x) + pmodel (x)
∗ 1 ∗
is achieved for pmodel = pdata where DG = 2 and V (G, DG ) = − log 4 ≈ −1.386.

Proof. Reformulation in terms of the non-negative Jensen-Shannon divergence yields:

∗
V (G, DG ) = KL(pdata , pdata + pmodel ) + KL(pmodel , pdata + pmodel ) (83)

pdata + pmodel pdata + pmodel
= − log 4 + KL pdata , + KL pmodel , (84)
2 2
= − log 4 + JSD(pdata , pmodel ) (85)
since the Jensen-Shannon divergence JSD(pdata , pmodel ) is 0 if pmodel = pdata and − log 4 is thus the minimal
∗
value that V (G, DG ) can assume.

∗
Here, it is proved that for the value function V (G, DG ) to be minimal even though the discriminator is
optimal, pdata has to be equal to pmodel .

Proposition 2. Convergence
∗
If G and D have enough capacity, and at each update step the discriminator D is allowed to reach D = DG ,
and pmodel is updated to improve
∗ ∗ ∗
V (pmodel , DG ) = Ex∼pdata [log DG (x)] + Ex∼pmodel [log(1 − DG (x))] (86)
Z
∝ sup pmodel (x) log(1 − D(x)) dx (87)
D x

then pmodel converges to pdata .

Proof. The argument of the supremum is convex in pmodel . The supremum doesn’t change convexity,
∗
thus V (pmodel , DG ) is also convex in pmodel with global optimum pmodel = pdata as shown in Theorem 1.

This proposition shows that with G and D large and complex enough and the discriminator being al-
lowed to reach optimality each time it is updated, the generator will achieve to adapt pmodel exactly to pdata .

These theoretical results are based on three assumptions.

• G has enough capacity to represent the data distribution and D has enough capacity to optimally dis-
criminate between samples from pdata and pmodel .To fulfill this assumption the models must be very
large.
• For each update of G the discriminator D is updated until reaching the optimum DG ∗
. In practice,
this would be computationally too expensive and lead to overfitting when using finite datasets. Instead,
Algorithm 12.1.3 is used.
• Optimize the model distribution pmodel . Since the model distribution is represented by parameters of G,
this is not possible.
Due to the assumptions not holding in practice, GANs might not converge to pdata or even oscillate. Nevertheless,
GANs are empirically quite successful.

143
12.1.7 Mode Collapse
A common failure in GANs is mode collapse. If the generator learns to cover a part of pdata very well, the
gradients regarding this part are low while the gradients regarding the rest of pdata are high. This can lead to
the generator repeatedly learning different parts of the data distribution while unlearning the previous parts.
Thus, the generator is always capable of producing high-quality samples but with very low variability.
There exist different strategies for avoiding mode collapse.
• encourage diversity: in minibatch discrimination the discriminator outputs a probability regarding the
whole batch. As a result, a broader range of samples have to be generated closely to pdata
• anticipate counterplay: stabilize the training of the generator via anticipating the response of the discrim-
inator. Fig. 174 shows an example of anticipate counterplay. This strategy requires backpropagating the
generator gradient through the anticipation steps.
• experience replay: minimizes hopping back and forth between modes
• train multiple GANs: approximate the whole data distribution with multiple GANs covering a range of
modes
• optimization objective: use different optimization strategies in order to improve the learning

12.1.8 Advantages and Disadvantages

An advantage of GANs is its flexibility: a wide variety of functions and distributions can be modeled. It is easier
than other generative models since only backpropagation is required for training - the sampling happens before,
when choosing an input - and unless VAEs, there is no approximation of the likelihood required. Nevertheless,
GANs generally produce more realistic samples than VAEs.
The missing explicit representation of pmodel might in contrast be an disadvantage since the sample likelihood
can thus not be evaluated and used e.g. for inference or performance evaluations. Additionally, the discriminator
and generator must be balanced well during training to ensure convergence to pdata and to avoid mode collapse.
It is therefore quite hard to train GANs.

12.2 GAN Developments

12.2.1 Architecture Guidelines for stable Deep Convolution GANs (DCGAN)
In the DCGAN paper, a systematic search for a suitable architecture for Deep Convolutional GANs has been
conducted. The following ideas have been found to work well in order to produce high quality images:
• Replace any pooling layers with strided convolutions (discriminator) and fractional strided convolutions
for upsampling (generator)
• Use batch normalization in both the generator and the discriminator
• Remove fully connected hidden layers for deeper architectures
• Use ReLU activations in the generator except for the output which uses tanh
• Use Leaky ReLU activations in the discriminator for all layers
The resulting model was able to produce images of complex scenes like bedrooms and to produce reasonable
interpolations between scenes. Comparable to the vector arithmetics of Word2Vec ([? ]), DCGAN can perform
arithmetic operations on z vectors in order to create new samples as shown in Fig. 175. The model thus seems
to learn semantically meaningful representations.

12.2.2 Evaluating the performance of GANs

Evaluating the performance of implicit generative models is difficult. Therefore, a metric called Fréchet inception
distance (F ID) is often used to assess the quality of images created by a GAN. The idea behind F ID is to
take a pretrained Inception v3 network, to input real images and to calculate a distribution of these images
by means of deeper, more semantically meaningful features computed inside the Inception network. In the
following, GAN produced images are inputted and their distribution is computed based on the same features.
This leads to multivariate distributions N (µd , Σd ) for the real images and N (µm , Σm ) for the GAN images.
F ID then computes a distance between these to distributions:
2
F ID = kµm − µd k2 + Tr(Σm + Σd − 2(Σm Σd )1/2 )

144
F ID could be proven to increase when manipulating images or using images from different distributions. While
it can detect mode collapse within a class of images, it is not suitable for measuring mode collapse over the
whole data distribution - unless knowing where to find modes and testing each of them separetely.

12.2.3 Gradient Penalties and Convergence

One mayor advance for making GAN training more stable and less dependent on hyperparameters was the
introduction of regularizers such as Gradient Penalties, e.g.
h i
2
V (D, G) = Ex∼pdata (x) log D(x) − λk∇x D(x)k + Ez∼p(z) [log(1 − D(G(z)))]

The regularizer penalizes large gradients of the discriminator with respect to x. This has been shown to lead to
converging instead of oscillating parameters (Fig. 176). When applying gradient penalties to DCGANs, high
quality images can be produced without needing to rely on ideas like the ones mentioned in 12.2.1.

12.2.4 CycleGAN
CycleGANs are used for Image-to-Image translation, i.e. they map images from one domain X to another
domain Y and back. A cycle consistency loss is used to check how well an image x is reconstructed after being
mapped to another dimension and back. See Fig. 177 for visualization. Possible use cases are the creation of
paintings out of photographs or photographs out of line drawings.

12.2.5 Progressive Growing of GANs

An idea to improve the performance of GANs is to start with small layers and low resolutions and to progressively
add bigger layers used for creating images with higher resolution. This is one of the approaches used in state-
of-the-art GANs today.

12.2.6 BigGANs
Really large image sets like ImageNet are difficult to handle for GANs due to their diversity. The idea of
BigGANs is to create class-conditional GANs which receive the class label as an additional input. Instead of
using regularizers which might hamper the performance of GANs, the authors monitored the singular values
of the weight matrices of generator and discriminator. By means of these singular values they could determine
when mode collapse started to happen and stop training at these locations.

12.3 Research at AVG

The goal of the research group is to create intelligent systems that can interact with a complex 3D environment.
One task in order to achieve this goal is to understand the environment in terms of its geometry and appearance.
This means, a 3D shape has to be predicted from images. While the input is well defined - it consists of one
or multiple images - different types of output can be returned. In the past, Voxels, Points or Meshes have been
used. As one can see in Fig. 178 these outputs discretize space and are thus rather poorly detailed.
The idea of the occupancy network paper was to not represent the 3D shape explicitly. Instead, the surface
is considered implicitly as the decision boundary of a non-linear classifier such that the decision boundary
separates the points inside an object from the points outside:

fθ : R3 × X → [0, 1]

The first input is a 3D-location. X inputs a condition, e.g. an image and the neural network then outputs an
occupancy probability.
The representation can then be learned by a VAE in order to create new samples. Moreover, it can be extended
to 4-dimensional space and used e.g. to model conditional surface light fields or Generative Radiance Fields.

References
[1] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and
recurrent networks for sequence modeling. arXiv.org, 2018.
[2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language
model. Journal of Machine Learning Research (JMLR), 3:1137–1155, 2003.

145
[3] Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. Capability and trainability in recurrent neural
networks. Proc. of the International Conf. on Learning Representations (ICLR), pages 1–17, 2017.
[4] Mané Vasudevan Le Cubuk, Zoph. Learning augmentation strategies from data. In Proc. IEEE Conf. on
Computer Vision and Pattern Recognition (CVPR), 2019.

[5] Rauber Schütt Bethge Geirhos, Temme and Wichmann. Generalisation in humans and deep neural net-
works. 2018.
[6] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural net-
works. In Proceedings of the thirteenth international conference on artificial intelligence and statistics,
pages 249–256, 2010.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification. In Proceedings of the IEEE international conference
on computer vision, pages 1026–1034, 2015.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional
neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2012.
[9] Leal-Taixe and Niessner. I2dl.
[10] Herbert Robbins and Sutton Munro. A stochastic approximation method. The Annals of Mathematical
Statistics, 1951.

[11] Krizhevsky Sutskever Srivastava, Hinton and Salakhutdinov. Dropout: a simple way to prevent neural
networks from overfitting. In Journal of Machine Learning Research (JMLR), 2014.
[12] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initializa-
tion and momentum in deep learning. In Proc. of the International Conf. on Machine learning (ICML),
volume 28, pages 1139–1147, 2013.

[13] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In
Advances in Neural Information Processing Systems (NeurIPS), 2014.
[14] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal
Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio.
ISCA Speech Synthesis Workshop, pages 1–15, 2016.

[15] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing
Systems (NeurIPS), pages 5998–6008, 2017.

146
References
[1] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and
recurrent networks for sequence modeling. arXiv.org, 2018.
[2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language
model. Journal of Machine Learning Research (JMLR), 3:1137–1155, 2003.

[3] Jasmine Collins, Jascha Sohl-Dickstein, and David Sussillo. Capability and trainability in recurrent neural
networks. Proc. of the International Conf. on Learning Representations (ICLR), pages 1–17, 2017.
[4] Mané Vasudevan Le Cubuk, Zoph. Learning augmentation strategies from data. In Proc. IEEE Conf. on
Computer Vision and Pattern Recognition (CVPR), 2019.

[9] Leal-Taixe and Niessner. I2dl.

[10] Herbert Robbins and Sutton Munro. A stochastic approximation method. The Annals of Mathematical
Statistics, 1951.
[11] Krizhevsky Sutskever Srivastava, Hinton and Salakhutdinov. Dropout: a simple way to prevent neural
networks from overfitting. In Journal of Machine Learning Research (JMLR), 2014.
[12] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initializa-
tion and momentum in deep learning. In Proc. of the International Conf. on Machine learning (ICML),
volume 28, pages 1139–1147, 2013.

147
(a)

(b)

(c)

Figure 170: Applications of variational autoencoders

Figure 171: Overview of the functionality of a GAN.

148
Figure 172: The gradient trick. While log(1 − D(G(z))) has a low gradient in the beginning, log(D(G(z)))
has a high gradient and is thus more useful for training.

Figure 173: Adapting pmodel to pdata .

149
Figure 174: Anticipating the response of the discriminator. The first row shows the result of a GAN on
a target distribution with multiple modes when using an anticipation strategy. The second row shows how the
GAN jumps between modes if no anticipation strategy is used.

Figure 175: Vector arithmetic on averaged z vectors of samples. z vectors are combined via arithmetic
operations and Gaussian noise is added to produce new samples.

Figure 176: Gradient penalties. (a) the parameters oscillate. (g) and (h) the paramters converge to an
equilibrium.

150
Figure 177: Cycle GAN image-to-image translation.

Figure 178: 3d representations of images. 3D-R2N2 uses Voxels. PSGN uses Points. Pix2Mesh and
AtlasNet use Meshes.

151

XCXCXCXCXCXCXCXC
No ratings yet
XCXCXCXCXCXCXCXC
20 pages
Unit IV
No ratings yet
Unit IV
21 pages
Technical Seminar Index
No ratings yet
Technical Seminar Index
4 pages
unit-3 NNDL
No ratings yet
unit-3 NNDL
22 pages
historicaltrendsindeeplearning-240727084838-a66d3478
No ratings yet
historicaltrendsindeeplearning-240727084838-a66d3478
7 pages
Selection
No ratings yet
Selection
12 pages
ppt1dl
No ratings yet
ppt1dl
50 pages
unit 4 nndl
No ratings yet
unit 4 nndl
37 pages
History
No ratings yet
History
75 pages
Case Study
No ratings yet
Case Study
6 pages
Deep Learning
No ratings yet
Deep Learning
15 pages
Seminar1 Part 3
No ratings yet
Seminar1 Part 3
12 pages
unit 4 nndl
No ratings yet
unit 4 nndl
37 pages
71511
No ratings yet
71511
21 pages
What Are Neural Nets
No ratings yet
What Are Neural Nets
4 pages
AI-and-ML-Workshop-pptx_250131_193538
No ratings yet
AI-and-ML-Workshop-pptx_250131_193538
44 pages
Lecture 1
No ratings yet
Lecture 1
135 pages
DLTest 1 QB
No ratings yet
DLTest 1 QB
13 pages
unit-1 (1)
No ratings yet
unit-1 (1)
55 pages
Intro ML Lecture 1
No ratings yet
Intro ML Lecture 1
9 pages
DL Unit1 HD
No ratings yet
DL Unit1 HD
141 pages
Recent Advances in Deep Learning Based Computer Vision
No ratings yet
Recent Advances in Deep Learning Based Computer Vision
6 pages
Advancements_and_Applications_of_Deep_Learning
No ratings yet
Advancements_and_Applications_of_Deep_Learning
4 pages
Notes Unit 1 ML
No ratings yet
Notes Unit 1 ML
17 pages
Neural Network Ass # 1
No ratings yet
Neural Network Ass # 1
5 pages
DL Slides 1
No ratings yet
DL Slides 1
63 pages
Chapter-2(Deep Learning)
No ratings yet
Chapter-2(Deep Learning)
18 pages
3 Intro to ANN
No ratings yet
3 Intro to ANN
39 pages
How Powerful Is AI - A Deep Learning Literature Review by Alban Tchikladze
No ratings yet
How Powerful Is AI - A Deep Learning Literature Review by Alban Tchikladze
10 pages
Deep Learning Full
No ratings yet
Deep Learning Full
25 pages
FAM_Unit4
No ratings yet
FAM_Unit4
11 pages
DL-UNIT_1
No ratings yet
DL-UNIT_1
12 pages
Unit 1 Complete Notes
No ratings yet
Unit 1 Complete Notes
44 pages
ML Material (2)
No ratings yet
ML Material (2)
40 pages
1.1.1. Introduction To AI and Machine Learning
No ratings yet
1.1.1. Introduction To AI and Machine Learning
34 pages
Unit I - Fundamentals of DL
No ratings yet
Unit I - Fundamentals of DL
41 pages
A Research Survey Report on Deep Learning Concepts
No ratings yet
A Research Survey Report on Deep Learning Concepts
8 pages
History of Neural Networks
No ratings yet
History of Neural Networks
4 pages
History of Neural Networks
No ratings yet
History of Neural Networks
4 pages
Chapter 1 - Introduction To Deep Learning 2023
No ratings yet
Chapter 1 - Introduction To Deep Learning 2023
50 pages
Deep Learning Algorithms and Architectures
No ratings yet
Deep Learning Algorithms and Architectures
26 pages
Historia Ia Ing Graficos Buen Texto
No ratings yet
Historia Ia Ing Graficos Buen Texto
7 pages
Neural Network Concepts and Paradigms
No ratings yet
Neural Network Concepts and Paradigms
96 pages
Unit - 1
No ratings yet
Unit - 1
69 pages
Lesson 1 - History, Definitions and Basic Concepts
No ratings yet
Lesson 1 - History, Definitions and Basic Concepts
6 pages
Intro DL
No ratings yet
Intro DL
48 pages
Deep Learning Midsem Merged Previous Batch
No ratings yet
Deep Learning Midsem Merged Previous Batch
423 pages
Deep_Learning_with_R
No ratings yet
Deep_Learning_with_R
18 pages
Deep Learning
No ratings yet
Deep Learning
37 pages
Machine Learning Glimpse
No ratings yet
Machine Learning Glimpse
37 pages
Unit 2
No ratings yet
Unit 2
25 pages
ANN Unit 3 Answers
No ratings yet
ANN Unit 3 Answers
12 pages
1 - Intro To Machine Learning
100% (1)
1 - Intro To Machine Learning
20 pages
NN DL Unit - III
No ratings yet
NN DL Unit - III
19 pages
5
No ratings yet
5
25 pages
Abhijit Ghatak - Deep Learning With R-Springer (2019)
No ratings yet
Abhijit Ghatak - Deep Learning With R-Springer (2019)
259 pages
ML MOD-1
No ratings yet
ML MOD-1
29 pages
Unit-3 Notes
No ratings yet
Unit-3 Notes
16 pages
(Machine Learning_ Foundations, Methodologies, And Applications) Fengxiang He, Dacheng Tao - Foundations of Deep Learning-Springer (2025)
No ratings yet
(Machine Learning_ Foundations, Methodologies, And Applications) Fengxiang He, Dacheng Tao - Foundations of Deep Learning-Springer (2025)
298 pages
Deep Learning Frameworks
From Everand
Deep Learning Frameworks
Jamal Hopper
No ratings yet
Condition Assessment Models For Sewer Pipelines
No ratings yet
Condition Assessment Models For Sewer Pipelines
121 pages
AI Applications To Communications and Information Technologies - IEEE (2024)
No ratings yet
AI Applications To Communications and Information Technologies - IEEE (2024)
493 pages
Blind Navigation System Using Artificial Intelligence
No ratings yet
Blind Navigation System Using Artificial Intelligence
5 pages
Neural Networks - Lecture 2
No ratings yet
Neural Networks - Lecture 2
28 pages
Detection_of_Non_Technical_Loss_in_Power
No ratings yet
Detection_of_Non_Technical_Loss_in_Power
5 pages
Soft Computing CT QP
No ratings yet
Soft Computing CT QP
2 pages
AI_ML
No ratings yet
AI_ML
23 pages
Predictive Analysis Unit 4
No ratings yet
Predictive Analysis Unit 4
40 pages
Machine Learning Lecture
No ratings yet
Machine Learning Lecture
332 pages
SCT
No ratings yet
SCT
29 pages
Deep Learnig
No ratings yet
Deep Learnig
16 pages
Cs601pc - Machine Learning Unit - 2-1
No ratings yet
Cs601pc - Machine Learning Unit - 2-1
148 pages
Unit I Introduction
No ratings yet
Unit I Introduction
55 pages
Introduction To Artificial Intelligence Technique: Humans."
No ratings yet
Introduction To Artificial Intelligence Technique: Humans."
10 pages
Instant ebooks textbook Advanced Systems for Environmental Monitoring, IoT and the application of Artificial Intelligence 1st Edition Jamal Mabrouki download all chapters
100% (3)
Instant ebooks textbook Advanced Systems for Environmental Monitoring, IoT and the application of Artificial Intelligence 1st Edition Jamal Mabrouki download all chapters
65 pages
Advanced Systems for Environmental Monitoring, IoT and the application of Artificial Intelligence 1st Edition Jamal Mabrouki - Download the ebook now and read anytime, anywhere
100% (3)
Advanced Systems for Environmental Monitoring, IoT and the application of Artificial Intelligence 1st Edition Jamal Mabrouki - Download the ebook now and read anytime, anywhere
65 pages
Deep Learning: Huawei AI Academy Training Materials
No ratings yet
Deep Learning: Huawei AI Academy Training Materials
47 pages
Lecture 11 - Introduction To Artificial Neural Networks (ANN)
No ratings yet
Lecture 11 - Introduction To Artificial Neural Networks (ANN)
35 pages
Module 2
No ratings yet
Module 2
44 pages
UNIT 4 - Perceptron and DL
No ratings yet
UNIT 4 - Perceptron and DL
39 pages
Single Layer & Multilayer Perceptron
No ratings yet
Single Layer & Multilayer Perceptron
14 pages
Employee Attrition in HR using ML techniques
No ratings yet
Employee Attrition in HR using ML techniques
14 pages
Machine Learning
No ratings yet
Machine Learning
40 pages
branches of AI
No ratings yet
branches of AI
7 pages
DL QB With Ans
No ratings yet
DL QB With Ans
38 pages
(Ebook) Ensemble Methods: Foundations and Algorithms by Zhi-Hua Zhou ISBN 9781439830031, 1439830037 All Chapters Instant Download
100% (4)
(Ebook) Ensemble Methods: Foundations and Algorithms by Zhi-Hua Zhou ISBN 9781439830031, 1439830037 All Chapters Instant Download
81 pages
Computer Vision: MR Hew Ka Kian Hew - Ka - Kian@rp - Edu.sg
No ratings yet
Computer Vision: MR Hew Ka Kian Hew - Ka - Kian@rp - Edu.sg
22 pages
P.E.S. College of Engineering, MANDYA, 571401: Identifying The Android Malware Using Machine Learning Algorithm
No ratings yet
P.E.S. College of Engineering, MANDYA, 571401: Identifying The Android Malware Using Machine Learning Algorithm
34 pages
MIN-400B Report (Final Evaluation) - 2
No ratings yet
MIN-400B Report (Final Evaluation) - 2
54 pages
DEEP_LEARNING_UNIT_1[1]
No ratings yet
DEEP_LEARNING_UNIT_1[1]
24 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.