Tubingen DL Notes
Abstract
These lecture notes have been written collectively by the class of winter 2020/2021 and curated by the
TAs of the deep learning lecture. If you find an error, please email the TA responsible for the section.
1 Introduction
1.1 Introduction
The introduction contained all organizational matters. The team was introduced and the contents, goal and
organization of the course, the exercises, lecture notes, materials, credits and prerequisites were explained.
Here, x is the data and w are the weights. x could also be some features of the data, so we will call it features from now on. If you multiply these features with the weights and the product is bigger than zero, we assign +1 to that function; if it is smaller than 0, we assign −1. That is a binary decision, which is why it is called a linear threshold neuron. It has been demonstrated that this simple operation is more powerful than AND/OR gates, which are special cases of this computation. But at the time there was no procedure to effectively learn the weights. This changed in 1958:
In 1958, Rosenblatt proposed the “perceptron algorithm”, which optimizes the perceptron criterion:
L(w) = −Σ_{n∈M} w^⊤ x_n y_n
This is very different from the gradient-based optimization with backpropagation used today, because the linear threshold unit is non-differentiable. Therefore, the auxiliary task of the perceptron criterion has been defined. The optimization of this criterion simply looks at all the examples that have been wrongly labeled; M is the set of wrongly labeled examples. It looks at what the model does for those and what the true label is (+1 or −1), and it minimizes a loss based on these incorrectly classified examples such that they are classified correctly in the next iteration of the algorithm. For this very simple model, this is an optimal thing to do because it converges to the right solution, if a solution exists, which was proven by Novikoff. It was a big success and there was also some resonance in the media, for example a perceptron that was trained to recognize the difference between males and females. While this provides a working example of such an algorithm, the perceptron was heavily overhyped: for instance, Rosenblatt claimed that the perceptron would lead to computers that walk, talk, see, write, reproduce themselves and are conscious of their own existence. The hopes and expectations were very high and could not be fulfilled, which led to mistrust in this technology.
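As a minimal illustration of the procedure described above (the function name, learning rate and stopping rule are my own; labels are assumed to be in {−1, +1}):

import numpy as np

def perceptron_train(X, y, eta=1.0, epochs=100):
    # X: (N, D) inputs, y: (N,) labels in {-1, +1}
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n) <= 0:       # x_n is in M (misclassified)
                w += eta * y_n * x_n       # step that decreases -w^T x_n y_n
                mistakes += 1
        if mistakes == 0:                  # converged; guaranteed if the data is
            break                          # linearly separable (Novikoff)
    return w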
of data became available. In 2012, AlexNet was the first neural network to win the ILSVRC, combining GPU training, deep models, better neural network architectures and huge datasets. This was the spark of the deep learning revolution, where people recognized that deep learning can really change the world.
2014: Generalization
This led to the success story of deep learning in terms of generalization. It could be empirically demonstrated that deep representations generalize well despite a large number of parameters. A CNN can be pre-trained on large amounts of data on a generic task (e.g., ImageNet classification) and then only the last layers need to be fine-tuned (re-trained) on little data from a new task and still perform very well.
2014: Visualization
Zeiler and Fergus published a paper on visualizing what a deep neural network learns. The goal of this paper was to provide insights into what the network has learned, because the networks are like black boxes. They visualized image regions that most strongly activate various neurons at different layers of the network, and they found that higher layers capture more abstract semantic information.
After applying this change to an image, the network recognizes all images as being classified as “ostrich”.
2016: WaveNet
In 2016 it was demonstrated that deep generative models of raw audio waveforms like WaveNet can generate
speech which mimics human voice, as well as music.
2016: AlphaGo
In 2016, AlphaGo was the first computer program to defeat a professional Go player. AlphaZero, developed in 2017, even learns via self-play and masters multiple games.
2020: GPT-3
This year, GPT-3 came along, the third generation of OpenAI's GPT language models. It scales existing language models up to 175 billion parameters. It has a text-in / text-out interface and many use cases such as coding, poetry, blogging, news articles and chatbots. There are also controversial discussions. It has been licensed exclusively to Microsoft on September 22, 2020.
Current Challenges
There still remain some challenges for the next generation, such as un- or self-supervised learning, interactive learning, accuracy (e.g., for self-driving), robustness and generalization, inductive biases, understanding and mathematics, memory and compute, and, last but not least, ethics and legal questions. It also remains open whether "Moore's Law of AI" will continue.
Unsupervised learning
Model parameters are learned using a dataset without labels {x_i}_{i=1}^N, so just with inputs. Examples include clustering, dimensionality reduction and generative models.
Self-supervised learning
Model parameters are learned using a dataset of data-data pairs {(x_i, x_i')}_{i=1}^N. Examples include self-supervised stereo and optical flow estimation, and contrastive learning.
Reinforcement learning
Model parameters are learned using active exploration from sparse rewards instead of a fixed dataset. Examples include deep Q-learning, policy gradients and actor-critic methods.
f : X → N or f : X → R
Inputs x ∈ X can be any kind of objects: images, text, a sequence of amino acids, ...
The output y ∈ N or y ∈ R is either a discrete or a real number, but only one single number. Examples are the label of an image as in classification, or regressing a single number for the stock market, a density estimate, ...
f :X →Y
The inputs x ∈ X can be any kind of objects and the outputs y ∈ Y are complex (structured) objects such as
images, text, parse trees, folds of a protein, computer programs, . . .
Today we will focus on the regression problem. A model f always takes an input x and produces an output y. There are two main tasks, the learning task and the inference task.
Learning: Learning happens through estimating the parameters w from training data {(x_i, y_i)}_{i=1}^N.
Inference: Given w from the learning task, novel predictions are made: y = f_w(x).
Classification
An example is the mapping of an input image to the output label “Beach” or “No Beach”, thus performing the
task of classification:
Mapping: fw : RW ×H → {“Beach”, “No Beach”}
Regression
For a regression problem, a stock value prediction with N numbers as input and a single number as output is an example:
Mapping: fw : RN → R
Structured Prediction
An example for structured prediction problems is an audio signal as input and a sequence of words, forming a
sentence as output.
Mapping: fw : RN → {1, . . . , L}M
Semantic segmentation is another example for structured prediction, where the input is an image and the
output is a label map.
Mapping: fw : RW ×H → {1, . . . , L}W ×H
A third example of structured prediction is 3D reconstruction: the input is a set of images and the output is a 3D reconstruction. Mapping: f_w : R^{W×H×N} → {0, 1}^{M³}
For this mapping, suppose we want to predict 32³ voxels, with a binary variable per voxel (occupied/free). Then 2^{32³} = 2^{32768} different reconstructions could be predicted by the model. This number is even larger than the number of atoms in the universe, which is ∼ 2^{273}.
1.3.3 Linear Regression
In linear regression, we want to make predictions with a model that is linear in its parameters. Formally, this means: let X denote a dataset of size N and let (x_i, y_i) ∈ X denote its elements (y_i ∈ R). The goal is to predict y for a previously unseen input x. The input x may be multidimensional.
As an example, we try to fit a line as the ”ground truth” over some noisy observations, which are samples from
the model with added noise. This is what we provide to the learning algorithm. When we execute that model
at a novel location x, we get a sensible response y. This is done by defining an error function.
The error function E(w) measures the displacement along the y dimension between the data points and the
model f (x, w) specified by the parameters w. It aims at minimizing the overall error.
f(x, w) = w^⊤ x
E(w) = Σ_{i=1}^N (f(x_i, w) − y_i)²
     = Σ_{i=1}^N (x_i^⊤ w − y_i)²
     = ‖Xw − y‖²₂
In linear regression, we have a function represented as a linear model. w> x is the inner product of the weight
vector w and the input features x. The error function is the sum over all data points of the square of the
difference in the y direction between the prediction of the model and the observation y. This can be written
as the squared `2 norm of the matrix X times the weight vector w minus the vector y.
The gradient of the error function with respect to the parameters w is given by:
∇_w E(w) = ∇_w ‖Xw − y‖²₂
         = ∇_w (Xw − y)^⊤ (Xw − y)
         = ∇_w (w^⊤X^⊤Xw − 2w^⊤X^⊤y + y^⊤y)
         = 2X^⊤Xw − 2X^⊤y
As E(w) is quadratic and convex in w, its minimizer (wrt. w) is given in closed form. This means that we can
set ∇w E(w) to zero to obtain a closed form solution:
∇_w E(w) = 0 ⇒ w = (X^⊤X)^{−1} X^⊤ y
The matrix (X^⊤X)^{−1} X^⊤ is also called the Moore-Penrose inverse or pseudoinverse because it allows us to compute the inverse of a non-square matrix; applying it to y yields the minimizer of this least-squares problem.
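A minimal NumPy sketch of this closed-form solution (the toy data below is made up purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # 100 observations, 3 features
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)        # noisy targets

w_hat = np.linalg.pinv(X) @ y                      # w = (X^T X)^{-1} X^T y via the pseudoinverse
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # equivalent, numerically preferable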
For estimating w from X, we define a squared error function, e.g.:
E(w) = Σ_{i=1}^N (f(x_i, w) − y_i)²
Figure 1: Polynomial Curve fitting. Plots of polynomials of various degrees M (red) fitted to the data
(green). We observe underfitting (M = 0/1) and overfitting (M = 9). Choosing the best fitting degree M is a
model selection problem.
This is what distinguishes machine learning from optimization: in optimization, we are just interested in fitting a model to observations, while in machine learning, we are always interested in a model that generalizes well.
The assumption that is often made in statistical learning theory is that the training and test data are independently and identically distributed (i.i.d.), drawn from the data distribution p_data(x, y). This is important, because it assures that we can make certain statements about the learning problem. In the case of Fig. 1 the data distribution has been chosen as follows: we have an underlying sine curve with x-locations between 0 and 1. From this we sample a y-location by adding Gaussian noise with standard deviation σ to the sine curve.
The idea is to discourage large parameters by adding a regularization term with strength λ. Since this problem is quadratic in w, it also has a closed-form solution: w = (X^⊤X + λI)^{−1} X^⊤ y.
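A small sketch of this closed-form ridge solution (function names are my own; the polynomial feature helper mirrors the degree-M curve-fitting setup):

import numpy as np

def poly_features(x, M=9):
    # x: 1D array of inputs; returns the N x (M+1) design matrix [1, x, ..., x^M]
    return np.vander(x, M + 1, increasing=True)

def ridge_fit(X, y, lam):
    # closed-form ridge solution w = (X^T X + lam I)^{-1} X^T y
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)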
Figure 2: Ridge Regression. Plots of a polynomial with degree M = 9 fitted to 10 data points using ridge regression. Left: weak regularization (λ = 10⁻⁸). Right: strong regularization (λ = 10³).
In Fig. 2 we can see, that for the formerly overfitting model with M = 9 we can obtain a much better fit with
a mild regularizer. With a strong regularizer the model remains constantly 0, which refers to the underfitting
situation. If we choose a very small regularizer, parameters or model weights can become very large and the
problem becomes ill-conditioned. If we increase the regularizer, the weights become smaller. Again, we want to
select a model with the smallest generalization error on the validation set.
Bias(ŵ) = E(ŵ) − w
This is how, in expectation, the point estimator deviates from the true parameters. The expectation is over all possible datasets X. ŵ is unbiased ⇔ Bias(ŵ) = 0. A good estimator has little bias.
The variance of a point estimator is the variance of ŵ over all datasets X: Var(ŵ) = E[(ŵ − E[ŵ])²].
Bias-Variance Dilemma:
Statistical learning theory tells us that we cannot have both low bias and low variance; there is a trade-off that we have to make.
Figure 3: Bias-Variance Dilemma. Ridge regression with weak (λ = 10⁻⁸) and strong (λ = 10) regularization. Green: true model. Black: plot of the model with mean parameters w̄ = E(w). Red: estimates, with high variance at the right and low variance at the left.
In Fig. 3 we can see that with weak regularization we have a large variance, but the mean is a good fit to the ground truth, so the bias is small. With strong regularization, we obtain less variance in the predictions; however, the model has a strong bias: the model with the mean parameters deviates strongly from the ground truth. So there is a bias-variance tradeoff which can be expressed mathematically: E[(ŵ − w)²] = Bias(ŵ)² + Var(ŵ). But we might not have to deal with this in all situations; for example, in deep neural networks the test error decreases with network width. For further information see the blog post at https://www.bradyneal.com/bias-variance-tradeoff-textbooks-update.
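A small Monte-Carlo sketch (data, noise level and λ are made up) that estimates the bias and variance of the ridge estimator over many datasets and checks the decomposition numerically:

import numpy as np

rng = np.random.default_rng(0)
N, D, lam, trials = 20, 5, 10.0, 2000
w_true = rng.normal(size=D)

estimates = []
for _ in range(trials):
    X = rng.normal(size=(N, D))                                   # fresh dataset each trial
    y = X @ w_true + 0.5 * rng.normal(size=N)
    estimates.append(np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y))
W = np.array(estimates)

bias2 = np.sum((W.mean(axis=0) - w_true) ** 2)       # squared bias of the estimator
var = np.sum(W.var(axis=0))                          # total variance over datasets
mse = np.mean(np.sum((W - w_true) ** 2, axis=1))     # E[(w_hat - w)^2]
print(bias2 + var, mse)                              # the two numbers should agree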
1.3.6 Maximum Likelihood Estimation
We now reinterpret our results by taking a probabilistic viewpoint. Let X = {(x_i, y_i)}_{i=1}^N be a dataset with samples drawn i.i.d. from the data distribution p_data, and let the model p_model(y|x, w) be a parametric family of probability distributions. Then the conditional maximum likelihood estimator for w is given by
ŵ_ML = argmax_w Σ_{i=1}^N log p_model(y_i | x_i, w)
Example: If we assume that the model distribution is a Gaussian, where the mean is a linear function and the variance is σ: p_model(y|x, w) = N(y | w^⊤x, σ), we obtain
ŵ_ML = argmax_w Σ_{i=1}^N log p_model(y_i | x_i, w)
     = argmax_w Σ_{i=1}^N log [ 1/√(2πσ²) · exp(−(w^⊤x_i − y_i)² / (2σ²)) ]
     = argmax_w Σ_{i=1}^N −½ log(2πσ²) − 1/(2σ²) Σ_{i=1}^N (w^⊤x_i − y_i)²
     = argmax_w −Σ_{i=1}^N (w^⊤x_i − y_i)²
     = argmin_w ‖Xw − y‖²₂
We see that choosing pmodel (y|x, w) to be Gaussian causes maximum likelihood to yield exactly the same least
squares estimator derived before:
ŵ = argmin_w ‖Xw − y‖²₂
There are various variations possible here. If we instead chose the model distribution p_model(y|x, w) to be a Laplace distribution, we would obtain an estimator that minimizes the ℓ1 norm: ŵ = argmin_w ‖Xw − y‖₁. Alternatively, assuming a Gaussian distribution over the parameters w and performing maximum a-posteriori (MAP) estimation yields ridge regression: argmax_w p(w|y, x) = argmax_w p(y|x, w) p(w).
This shows that there is a connection between maximum likelihood estimation and the linear least squares problems explained earlier. Maximum likelihood estimators are an important tool in statistical learning theory because they are known to be consistent under mild assumptions: as the number of training samples approaches infinity (N → ∞), the maximum likelihood (ML) estimate converges to the true parameters. They are also efficient: the ML estimate converges most quickly as N increases. These theoretical considerations make ML estimators appealing.
2 Computation Graphs
2.1 Logistic Regression
We’ve already got to know the maximum likelihood estimator and more precisely the Conditional Maximum
Likelihood Estimator for w:
ŵ_ML = argmax_w Σ_{i=1}^N log p_model(y_i | x_i, w)    (1)
Because we are conditioning on some input x, we are not just trying to model a distribution over y but a conditional distribution of y given x. This is the maximum log-likelihood estimator: we take the log of the product of likelihoods, which becomes a sum of logarithms. The estimate that we obtain, the argmax, is the same as the one the maximum likelihood estimator would return, because the logarithm is a monotonic function. In practice we prefer the log-likelihood, both for numerical reasons when implementing these algorithms (it results in more stable computation) and for mathematical reasons and connections to concepts in information theory. We will mostly use the log-likelihood from now on; it does not return the same value, but it has the same maximum, i.e., the same parameters w. We denote the estimate as ŵ_ML; the hat indicates that this is an estimate of the true parameters. Now we want to perform binary classification: y_i ∈ {0, 1}, so the output can take only two possible discrete labels, 0 and 1. The question is: how should we choose p_model(y|x, w) in this case? A Gaussian model is not a good choice, because it is a continuous distribution. The Bernoulli distribution is a good choice, as it models a binary classification problem:

p_model(y|x, w) = ŷ^y (1 − ŷ)^(1−y)    (2)
where ŷ is the prediction of some model and y is the true label from our dataset. So ŷ is some prediction that must depend on the input x and on some trainable parameters w; we denote that function as ŷ = f_w(x).
We are working with a discrete distribution, so we have the requirement f_w(x) ∈ [0, 1]. We can use a sigmoid function: f_w(x) = σ(w^⊤x), where σ is the sigmoid function:

σ(x) = 1 / (1 + e^{−x})    (3)
We take the linear combination of the weights and inputs and then apply the nonlinear transformation σ to it. The sigmoid takes an unbounded range of real numbers and transforms this domain into the interval from 0 to 1; that is why it is called a squashing function. And that is what we want: the output of this function should lie between 0 and 1 so that it is a proper probability. That is also the reason why this is called logistic regression: σ is the logistic transfer function.
Let's put it together:

ŵ_ML = argmax_w Σ_{i=1}^N log p_model(y_i | x_i, w)    (4)
     = argmax_w Σ_{i=1}^N log [ ŷ_i^{y_i} (1 − ŷ_i)^{(1−y_i)} ]    (5)
     = argmin_w Σ_{i=1}^N −y_i log ŷ_i − (1 − y_i) log(1 − ŷ_i)    (6)

where the summand is the binary cross-entropy loss L(ŷ_i, y_i). Here, for the model distribution we plug in the Bernoulli distribution, where ŷ_i = σ(w^⊤ x_i).
Because we take a logarithm, the product turns into a sum. Then, instead of maximizing, we can minimize the last expression by putting a minus sign in front. In machine learning we often do not maximize likelihoods but rather minimize loss functions, which is why we write it this way here. The term inside the sum has a specific name in the machine learning community: it is called the binary cross-entropy loss. It is a loss between the predicted ŷ and the true y, and as the name loss indicates, a high loss is bad and a low loss is desirable; we want to minimize the sum of all these individual losses over the dataset. In machine learning we often use the more general term loss rather than the term error function that we used before in linear regression, because a loss is more general: it is not necessarily an error between observations, it could for instance also encode an inductive bias. This formula has the following interpretation: we minimize the dissimilarity between the empirical data distribution p_data (defined by the training set) and the model distribution p_model. We want to find parameters for our model such that the predictions of the model are most similar to our data.
Let's look at the binary cross-entropy loss: we want to minimize those curves. For y_i = 1 the loss L is minimized if ŷ_i = 1. For y_i = 0 the loss L is minimized if ŷ_i = 0. Thus, L is minimal if ŷ_i = y_i. This loss function can be extended to more than 2 classes.
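A minimal sketch of this loss in NumPy (the function names and the clipping constant are my own; the clipping only guards against log(0)):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_hat, y, eps=1e-12):
    # L(y_hat, y) = -y log(y_hat) - (1 - y) log(1 - y_hat)
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# the loss is small when the prediction matches the label and large otherwise:
print(bce_loss(0.99, 1), bce_loss(0.01, 1))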
2.1.1 1D example
Let's look at this visually. On the x-axis we have our feature, the 1D input, and we want to classify between positive and negative, i.e., tell for a particular x value whether it is in the positive or the negative class. We have a set of positive samples in green and negative samples in red. We want to fit a logistic regression model f_w(x) = σ(w_0 + w_1 x) to the dataset X.
Figure 6: 1D example. An illustration of the simple 1D example.
Figure 8: Probabilistic interpretation. Probabilities of the classifier f_w(x_i) for positive samples (y_i = 1).
Figure 9: Probability of the negative class. Probabilities of the classifier f_w(x_i) for negative samples (y_i = 0); we plot them at the bottom, because it makes more sense.
Figure 10: Probability of both classes. Probabilities of the classifier for both classes.
We want to maximize those probabilities or, equivalently, minimize the negative logarithm. We then minimize the mean or the sum over all these negative log probabilities.
2.1.2 Optimizing w.
Figure 11: Log loss. Log loss of the classifier for both classes.
In contrast to linear regression, the loss L(ŷ_i, y_i) is not quadratic in w: w appears inside the sigmoid, which is a nonlinear function, and we then take the logarithm of this nonlinear function, which is also nonlinear. So this is a highly nonlinear transformation of a linear model, and we do not have an easy expression as in the quadratic case of the linear regression task. Since there is no closed-form analytic solution, we need to apply some iterative optimization technique based on gradients. Luckily we can compute gradients, so we can use a gradient-based optimizer that slowly follows the gradient to a local minimum starting from some initialization point. Despite not being a quadratic function, it is still a convex function: it can be shown that this problem is convex, so we know that we do not fall into a bad local optimum. If we apply a gradient-based optimizer, we end up at the global optimum. But we cannot reach it in a single step or analytically; we need to apply an iterative gradient-based algorithm. And for all gradient-based optimization algorithms we need to compute this gradient (the nabla operator). The nice thing about this particular objective function is that, although the gradient looks quite complex at first, it simplifies a lot when you do the analytical derivation, and this is the final form of the gradient of the binary cross-entropy loss of this logistic regression model:
∇w L(ŷi , yi ) = (ŷi − yi )xi (9)
Since we are able to compute this gradient analytically, we can apply it inside an iterative gradient-based optimizer, which tries to move stepwise towards the optimum. The simplest optimizer that we can use, and one that is actually still heavily used in deep learning today, is called gradient descent. We pick a step size η and a tolerance ε and initialize w_0. Then we repeat until ‖v‖ < ε:

v = ∇_w L(ŷ, y) = Σ_{i=1}^N ∇_w L(ŷ_i, y_i)    (10)
w^{t+1} = w^t − ηv    (11)

We iterate this until the gradient becomes small and we converge to an extremum of the function.
Figure 12: Gradient descent. Example of the converging variants of gradient descent.
But there are several variants that work a little better. One is line search (the green curve), where you go in the direction of the gradient but search along that direction for the smallest value. For some problems you can do even better by using the conjugate gradient method (red), where you do not go in the direction of the gradient but in a conjugate direction, which brings you more directly to the optimal value.
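A minimal sketch of gradient descent on the binary cross-entropy loss, using the gradient ∇_w L(ŷ_i, y_i) = (ŷ_i − y_i) x_i from Eq. (9) (function names and the toy data are my own):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.5, eps=1e-6, max_iters=10000):
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        y_hat = sigmoid(X @ w)
        v = X.T @ (y_hat - y)          # summed per-sample gradients, Eq. (10)
        if np.linalg.norm(v) < eps:    # stop when the gradient is small
            break
        w = w - eta * v                # update step, Eq. (11)
    return w

# tiny 1D example with a constant bias feature
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(fit_logistic(X, y))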
2.1.3 Connections to Information Theory
Maximizing the Log-Likelihood is equivalent to minimizing Cross Entropy or KL Divergence:
ŵ_ML = argmax_w Σ_{i=1}^N log p_model(y_i | x_i, w)    (12)    (Log-Likelihood)
     = argmin_w E_{p_data}[log p_data(y|x) − log p_model(y|x, w)]    (15)
     = argmin_w D_KL(p_data ‖ p_model)    (16)    (KL Divergence)
We can rewrite the log-likelihood using an expectation. If we wrote a factor 1/N in front, we would get the mean, i.e., an empirical estimate of the expectation over the data distribution; and since 1/N is a constant with respect to w, it gets absorbed by the argmax operator. The sum with a 1/N in front corresponds exactly to the definition of the expectation operator, an expectation over the probability of the data points, where each data point provides a sampling-based estimate of log p_model. We can write the argmax as an argmin by flipping the sign. We can then also add the term log p_data(y|x): the data distribution does not depend on w, so this changes the value of the objective but is constant with respect to w. The result is the cross-entropy minus the entropy, which in information-theoretic terms is the KL divergence. It is one of many divergences; it is a measure of the dissimilarity of two distributions. This gives a nice intuition: by computing the maximum likelihood estimate of the parameters we are trying to minimize the distance between two distributions, more precisely between the empirical data distribution given by the dataset and the distribution of our model given its parameters w.
As we will see, gradients ∇_w L can be computed using a backward pass. Both the forward pass and the backward pass are efficient due to the use of dynamic programming, i.e., storing and reusing intermediate results. This decomposition and reuse of computation is key to the success of the backpropagation algorithm, the primary workhorse of deep learning. It would be unthinkable to optimize deep networks with millions or billions of parameters without this decomposition and reuse of computation. That is why the backpropagation algorithm, which adjusts these millions and billions of parameters in our deep models, is still the primary workhorse of deep learning today.
What we are really interested in, in the end, are the gradients for these parameters, because this is where the updates are applied. Finally, we have the compute nodes, which can be either the final node (the loss function) or intermediate compute nodes that take inputs (for instance input nodes, previous compute nodes or parameters) and produce a result.
Figure 13: Computation graph. Example of the computation graph for linear regression.
We are looking at a very specific example: linear regression. This is the source code:
(1) u = w_1 x
(2) ŷ = w_0 + u
(3) z = ŷ − y
(4) L = z²
This is how we write the linear regression problem: instead of writing it in a single line, we write it as a sequence of atomic operations. There are multiple levels of granularity that we can use. This is a very fine level of granularity, but we can use more coarse-grained levels as well; what we want to choose depends on how big we want to make these atomic units, i.e., which level of atomic units we can still handle.
So in this case we could combine (1) and (2); then we would change this computation graph.
Figure 14: Computation graph with another level of granularity. Example of the computation graph
with another level of granularity for linear regression.
Steps (1) and (2) have collapsed into one step, which directly computes this affine transformation.
We can also, for instance, collapse the loss: instead of computing these two steps independently, we compute them jointly:
So now we have only two steps:
(1) ŷ = w_0 + w_1 x
(2) L = (ŷ − y)²
For logistic regression (Fig. 16), the computation graph consists of the following steps:
(1) u = w_0 + w_1 x
(2) ŷ = σ(u)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ)
Figure 15: Computation graph with another level of granularity. Example of the computation graph
with another level of granularity for linear regression.
Figure 16: Computation graph of logistic regression. Example of the computation graph for logistic
regression.
(1) u = w> x
(2) ŷ = σ(u)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ)
Figure 17: Computation graph with vector representation. Example of the computation graph with
vector representation for logistic regression.
What we can also do is we can stack two of these operations behind each other:
(1) h = σ(W1> x)
(2) ŷ = σ(w2> h)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ)
17
Figure 18: Multi-Layer Perceptron. Example of the computation graph for Multi-Layer Perceptron.
2.3 Backpropagation.
Our goal is to find gradients of negative log likelihood:
∇_w Σ_{i=1}^N −log p_model(y_i | x_i, w)    (17)
where the summand −log p_model(y_i | x_i, w) is denoted L(y_i, x_i, w).
The multivariate chain rule is concerned with compositions of functions, where in the argument of a function
we have a set of functions and each of these functions here depends on the same variable x.
(1) y = y(x)
(2) L = L(y)
After running the forward pass, i.e., actually calculating the value of L for a particular input x, we run a so-called backward pass; this is where the name backpropagation comes from.
Figure 19: Simple computation graph.
Backward Pass:
(2) ∂L/∂y = (∂L/∂L) · (∂L/∂y) = ∂L/∂y
(1) ∂L/∂x = (∂L/∂y) · (∂y/∂x)
We will indicate the forward pass with black arrows and the backward pass with blue arrows. We backpropagate gradients from the output node, which is always L in our case, to each individual node, such that at each node we can read off the gradient of L with respect to that node. We are interested in gradients of L with respect to any variable in this computation graph, and we will use two different colors for indicating two different types of quantities: red for backpropagated gradients and blue for local gradients, which are computed locally based on the assignment formulas (1) and (2).
The backward pass starts at the output node L and computes gradients backwards. The first thing it does is compute the gradient of L with respect to itself, which is obviously 1. In order to compute the gradient of L with respect to y we need to apply the chain rule: it is the gradient of L with respect to L times the gradient of L with respect to y. Now we can go one step further and calculate the gradient with respect to the variable that was computed first in the forward pass: the gradient of L with respect to x, which is exactly what we wanted to compute. Again by the chain rule, it is the gradient of L with respect to y times the gradient of y with respect to x. What we are ultimately interested in is, of course, to backpropagate gradients to the weights of a parametric function such as a logistic regressor or a neural network. At a particular iteration we know all the values, so we will always insert the actual numeric gradients and not the symbolic gradients.
Let's look at a slightly more difficult example: a situation with a fan-out bigger than one. Fan-out means that a particular node has multiple outgoing connections; in this case y has multiple outgoing connections:
(1) y = y(x)
(2) u = u(y)
(2) v = v(y)
(3) L = L(u, v)
2.3.3 Implementation.
Now we are basically ready to implement the backpropagation algorithm. A convenient way to do so is to
consider each variable or node as an object that has certain attributes, for instance the value (x.value) and the
gradient (x.grad). Values are computed in the forward pass. Here is the sequence of assignments in python:
x.value = Input
y.value = y(x.value)
u.value = u(y.value)
v.value = v(y.value)
L.value = L(u.value, v.value)
The value of x is an input, the value of y is the function y evaluated at x's value, the value of u is the function u evaluated at y's value, and v is also evaluated at y's value. Finally, we have the loss function.
Similarly, we can look at the backward pass. First of all we set the gradients to zero, because we are going to update them additively.
x.grad = y.grad = u.grad = v.grad = 0
L.grad = 1
u.grad += L.grad ∗ (∂L/∂u)(u.value, v.value)
v.grad += L.grad ∗ (∂L/∂v)(u.value, v.value)
y.grad += u.grad ∗ (∂u/∂y)(y.value)
y.grad += v.grad ∗ (∂v/∂y)(y.value)
x.grad += y.grad ∗ (∂y/∂x)(x.value)
In order to evaluate the gradients we need the values that have been computed in the forward pass; the gradient itself is a function of these values.
Let’s look at a slightly more real example in the sense of machine learning. This is a logistic regression example.
BCE denotes Binary Cross Entropy Loss.
Forward Pass:
(1) u = w_0 + w_1 x
(2) ŷ = σ(u)
(3) L = −y log ŷ − (1 − y) log(1 − ŷ) = BCE(ŷ, y)
We want to calculate gradients, but we are not interested in the gradients with respect to every variable in this graph. For instance, we are not interested in the gradients with respect to the green input variables here; we could calculate them, and sometimes we really do, as in the case of style transfer. That is the beauty of the backpropagation algorithm: it can compute the gradients with respect to any variable, and can even compute second derivatives by executing it twice. But in this case we are interested in training the parameters of the logistic regression model. We are really just interested in backpropagating the gradients from L to w_1 and w_0, so that we obtain the gradients of L with respect to w_1 and w_0, which are then used in gradient-based optimization in an iteration loop in order to update the parameters of the model. Let's apply the backpropagation algorithm to this example:
(3) ∂L/∂ŷ = (∂L/∂L) · (∂L/∂ŷ) = (ŷ − y) / (ŷ(1 − ŷ))
(2) ∂L/∂u = (∂L/∂ŷ) · (∂ŷ/∂u) = (∂L/∂ŷ) · σ(u)(1 − σ(u))
(1) ∂L/∂w_0 = (∂L/∂u) · (∂u/∂w_0) = ∂L/∂u
(1) ∂L/∂w_1 = (∂L/∂u) · (∂u/∂w_1) = (∂L/∂u) · x
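Note that (∂L/∂ŷ) · σ(u)(1 − σ(u)) simplifies to ŷ − y. A small sketch that checks these backpropagated gradients against finite differences (helper names and numbers are my own):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w0, w1, x, y):
    y_hat = sigmoid(w0 + w1 * x)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def grads(w0, w1, x, y):
    dL_du = sigmoid(w0 + w1 * x) - y     # (dL/dy_hat) * sigma'(u) simplifies to y_hat - y
    return dL_du, dL_du * x              # dL/dw0, dL/dw1

w0, w1, x, y = 0.3, -0.7, 1.5, 1.0
g0, g1 = grads(w0, w1, x, y)
h = 1e-6
num_g0 = (loss(w0 + h, w1, x, y) - loss(w0 - h, w1, x, y)) / (2 * h)
num_g1 = (loss(w0, w1 + h, x, y) - loss(w0, w1 - h, x, y)) / (2 * h)
print(g0 - num_g0, g1 - num_g1)          # both differences should be close to zero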
2.3.4 Summary
We can write mathematical expressions as a computation graph, that decomposes complex expressions into
much simpler computations that are tractable and where we can apply efficient dynamic programming to solve
for both - the values in the forward pass and the gradients in the backward pass.
The values are efficiently computed in the forward pass and the gradients are computed in the backward pass.
Multiple incoming gradients are summed up (multivariate chain rule).
Modularity: Each node must only ”know” how to compute gradients wrt. its own arguments.
One fw/bw pass per data point:
∇_w L(y, X, w) = Σ_{i=1}^N ∇_w L(y_i, x_i, w)    (21)
where each summand ∇_w L(y_i, x_i, w) is computed via backpropagation.
class Input:
    def __init__(self):
        pass
    def addgrad(self, delta):
        pass

class Parameter:
    def __init__(self, value):
        self.value = DT(value)
        Parameters.append(self)
    def addgrad(self, delta):
        self.grad += np.sum(delta, axis=0)
    def UpdateParameters(self):
        self.value -= learning_rate * self.grad

class CompNode:
    def addgrad(self, delta):
        self.grad += delta
Here we can see the abstract definition of the base classes. The Input class has an initialization function and a function for adding gradients, both empty, because we are not interested in adding gradients to the input variables and because input variables do not depend on any other variables. The Parameter class has an initialization function that copies and stores the value of that parameter in the class itself and appends the parameter to a global parameter list, so that all parameters can later be updated. It also has a function for adding up gradients; this function simply sums the incoming gradients along the data dimension. This model computes the values in the forward pass and the gradients in the backward pass for all data points simultaneously, for efficiency reasons, because this allows us to use efficient matrix operations. The Parameter class also has an UpdateParameters function that takes a step in the negative gradient direction; this is what is used during gradient descent. The CompNode class also has a function for adding gradients, because we also need to accumulate the gradients of the compute nodes themselves. In order to execute the computation graph we need to define a Forward and a Backward function.
def Forward():
    for c in CompNodes: c.forward()

def Backward(loss):
    for c in CompNodes + Parameters:
        c.grad = np.zeros(c.value.shape, dtype=DT)
    loss.grad = np.ones(loss.value.shape) / len(loss.value)
    for c in CompNodes[::-1]:
        c.backward()

def UpdateParameters():
    for p in Parameters: p.UpdateParameters()
The Forward function takes the inputs x and y and the current state of the parameters w and goes left to right through the computation graph in order to compute all intermediate values, all the way up to the head node L. We have stored all computation nodes in a list CompNodes, sorted from left to right, which means that when we arrive at any particular node in this sorted list, we know that all previous nodes have already been computed, so we can use their values. Thus in the forward pass of the backpropagation algorithm we simply go through the sorted list of computation nodes and, for each node, apply the forward function implemented inside that node. For the backward pass we first set the gradients of all computation nodes and all parameters to zero and then go backwards: starting from L, we propagate the gradients backwards by iteratively calling the backward function of each compute node, which sends gradient updates to its parents. Finally, we have an UpdateParameters function that loops through the list of parameters and calls the UpdateParameters function of each parameter.
Remark: The Forward() and Backward() functions compute the forward and backward passes respectively over the entire dataset. The reason for this is that matrix operations are very cheap because they are implemented very efficiently, while going over the dataset with Python loops would be much slower. Exploiting this vectorization is very important. Furthermore, if GPU hardware is available, this computation can even be parallelized, because the forward pass of each individual data point is independent of every other data point, and the same holds for the backward pass.
Let's look at a concrete example: the computation node for the sigmoid function:
σ(x) = 1 / (1 + e^{−x})
σ′(x) = σ(x)(1 − σ(x))
class Sigmoid(CompNode):
    def __init__(self, x):
        CompNodes.append(self)
        self.x = x
    def forward(self):
        bounded = np.maximum(-10, np.minimum(10, self.x.value))
        self.value = 1 / (1 + np.exp(-bounded))
    def backward(self):
        self.x.addgrad(self.grad * self.value * (1 - self.value))
Here we can see the Python definition of the Sigmoid class, which inherits from the CompNode class. It has three functions: an initialization function, a forward function and a backward propagation function. At initialization time we simply add the node itself to the list of computation nodes and store the parent of that node in the class itself: self.x = x stores the parent, which is an input that could be another computation node, say an affine computation node that does some computation and then feeds into the sigmoid function. In the forward function we calculate the sigmoid expression: first we compute a bounded version of the input value in order to avoid numerical problems, and then we implement the sigmoid function. For the backward pass we implement the derivative of the sigmoid function, multiply it with the backpropagated gradient at the node itself, and pass this as a message on to the parent.
Remark: It is important to note that in this backward pass the gradient is sent to the parent node self.x.
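Other node types follow the same pattern. As a hypothetical sketch (not the framework's actual implementation), an affine node that multiplies its input with a weight matrix and adds a bias could look roughly like this, assuming its parameter object exposes Parameter attributes A and b and the global lists CompNodes and Parameters from above:

class Affine(CompNode):
    def __init__(self, params, x):
        CompNodes.append(self)
        self.params = params    # assumed to hold params.A (n_in x n_out) and params.b (n_out)
        self.x = x
    def forward(self):
        self.value = np.matmul(self.x.value, self.params.A.value) + self.params.b.value
    def backward(self):
        # send gradients to the input and to both parameters
        self.x.addgrad(np.matmul(self.grad, self.params.A.value.T))
        self.params.A.addgrad(np.matmul(self.x.value[..., None], self.grad[..., None, :]))
        self.params.b.addgrad(self.grad)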
Let's execute a concrete minimal example.

Execution Example:
• Load data X and labels y
• Initialize parameters w_0
• Define computation graph
• For all iterations do:
  – Forward Pass: L(ŷ_i = f_w(x_i), y_i)
  – Backward Pass: ∇_w L(ŷ_i, y_i)
  – Gradient Update: w^{t+1} = w^t − η Σ_{i=1}^N ∇_w L(ŷ_i, y_i)

import edf

# data loading
edf.clear_compgraph()
x = edf.Input()
y = edf.Input()
x.value = Load(data)
y.value = Load(labels)

# initialization of parameters
params_1 = edf.AffineParams(nInputs, nHiddens)
params_2 = edf.AffineParams(nHiddens, nLabels)

# definition of computation graph
h = edf.Sigmoid(edf.Affine(params_1, x))
p = edf.Softmax(edf.Affine(params_2, h))
L = edf.CrossEntropyLoss(p, y)

# gradient descent
for i in range(iterations):
    edf.Forward()
    edf.Backward(L)
    edf.UpdateParameters()
3 Deep Neural Networks
3.1 Backpropagation with Tensors
So far, we have only discussed how the backpropagation algorithm works on functions of scalars, that is, functions
like y = σ(w1 x + w0 ). But what if we want to optimize parameters in a function like y = σ(Ax + b), where
we have to deal with matrices and vectors? Technically, this function can be completely broken down such that
we have a computational graph consisting only of scalar operations. However, it is often much simpler to deal
with vectors, matrices and tensors (n-dimensional matrices) directly. If we implement a computational graph
composed of tensor-valued nodes, each node still has the attributes value and grad. The value-attribute of
a node contains an array with the components of the tensor. Because the loss function L is still a scalar, all
that is needed for backpropagation are the partial derivatives of the loss with respect to each component of a
tensor. The grad-attribute thus simply contains the gradient, so b.grad stores ∇b L. Note that this means
that A.value and A.grad have the same shape (similarly for all other nodes).
y = σ(Ax + b),   with u = Ax
Here, the intermediate variable u has been indicated. As already stated, the computational graph might also
consist of scalar operations only and that is certainly one way to implement the computation of the value and
grad attributes. The forward pass would look like this:
for i u.value[i] = 0
for i,j u.value[i] += A.value[i, j] ∗ x.value[j]
for i y.value[i] = σ(u.value[i] + b.value[i])
In the backward pass, first the gradients of the loss with respect to y (so ∇y L) are computed and stored in
y.grad. Then, the gradients of u and b can be computed:
Here, the back-propagated gradients are shown in red and the locally computed gradients in blue.
In deep learning, this kind of implementation remains possible for higher order tensors. If the forward pass
of a computational graph is defined as follows:
then the backward pass can also be implemented in terms of loops over indices:
In practice, this is usually not how the training of a deep network is implemented. Instead, a method called
minibatching is used.
3.1.2 Minibatching
In addition to too many indices being messy, writing loops manually can be very computationally inefficient. The
languages that are mainly used for writing training loops for neural networks, such as Python or MATLAB, are
not designed for efficient loops over arrays and can be very slow (mostly for type checking reasons). A technique
called vectorization is often advantageous: Instead of looping over an array and applying an operation to each
scalar element, apply the operation to a whole vector (or matrix) at once. There are special libraries that provide
this functionality (such as NumPy) and they can speed up the computation dramatically. Some libraries also
add support for computation on a GPU. The goal is, that the bulk of the computation time should go to the
actual calculation of floating-point numbers and not to things such as type checking.
In our context this means not taking a single observation x for an optimization step (forward pass, backward
pass, parameter update), but a minibatch of N observations, stacked together in a matrix X. The equation
from above then becomes:
Y = σ(XA + B),   with U = XA
Here, each row in X ∈ R^{N×D} is one observation x ∈ R^D and the bias vector b ∈ R^M is broadcast to B ∈ R^{N×M}.
All values that depend on the input x now also have a batch index b:
For the actual parameter update step, the computed gradients are averaged over one batch.
In a real implementation, vectorization would be used:
def forward(self):
    self.value = np.matmul(self.x.value, self.w.A.value) + self.w.b.value
This is easier to read than for-loops with many indices, but might be harder to implement for NumPy-novices.
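As a rough illustration of what such a vectorized minibatch step computes (pure NumPy with assumed shapes; dY stands in for a gradient that would normally come from the layers above — this is not the framework's code):

import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

N, D, M = 64, 10, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))
A = 0.1 * rng.normal(size=(D, M))
b = np.zeros(M)

U = X @ A + b                  # forward pass for the whole minibatch; broadcasting handles B
Y = sigmoid(U)

dY = rng.normal(size=(N, M))   # placeholder for the incoming gradient dL/dY
dU = dY * Y * (1.0 - Y)        # elementwise sigmoid derivative
dA = X.T @ dU                  # parameter gradients, summed over the batch dimension
db = dU.sum(axis=0)
dX = dU @ A.T                  # gradient passed on to the previous layer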
w> x + w0 = 0,
(see Fig. 25). Each point x is classified according to the side of this linear boundary that it lies on:
• Decide for class 1 ⇔ w> x > −w0
• Decide for class 0 ⇔ w> x < −w0
[Figure 25: The sigmoid σ(x) with the decision boundary at σ(x) = 0.5, separating Class 0 (left) from Class 1 (right).]
Because the decision boundary is linear, logistic regression can only solve classification problems with linearly
separable datasets. Some simple examples include the OR and AND functions:
x1 x2 OR(x1 ,x2 ) x1 x2 AND(x1 ,x2 )
0 0 0 0 0 0
0 1 1 0 1 0
1 0 1 1 0 0
1 1 1 1 1 1
For these two functions (datasets) optimal parameters can easily be found:
OR:   w^⊤ x = (1, 1) (x₁, x₂)^⊤ > 0.5 = −w₀
AND:  w^⊤ x = (1, 1) (x₁, x₂)^⊤ > 1.5 = −w₀
From the following figures, it can be seen that these decision boundaries perfectly separate all four data
points in both cases:
[Decision boundaries for OR (left) and AND (right), separating Class 0 from Class 1.]
However, there are many datasets that are not linearly separable. One very simple example is the XOR-
function:
x1 x2 XOR(x1 ,x2 )
0 0 0
0 1 1
1 0 1
1 1 0
The dataset can be seen in Fig. 26. Visually, it is obvious that there exists no linear decision boundary that
cleanly separates the two classes. Formally, this can be proven using convex sets.
Figure 26: The XOR Dataset
Definition 1 (Convex Set) A set S is called convex if for any two points x, y ∈ S the line segment connecting the two points also lies completely in S:
λx + (1 − λ)y ∈ S   for all λ ∈ [0, 1].
The linear decision boundary divides the plane into two half-spaces (the decision regions). These half-spaces
are convex. If a feasible hypothesis, i.e. a correct decision boundary, were to exist, the two red points must be
in the red region, while the green points must lie in the green region. Because the decision regions are convex,
the connecting line of the two green points must completely lie in the green decision region and the connecting
line of the two red points must completely lie in the red decision region. However, as can be seen in Fig. 26,
the point (0.5, 0.5) lies on both of these connecting lines (they intersect here). Thus it must lie in both decision
regions, which is not possible. Hence, no solution can exist.
There are, of course, many problems that cannot be solved with a linear classifier. The XOR-Problem
is particularly well known because of its simplicity. It is partially responsible for the decline of interest in
perceptrons and neural networks in the 1970s.
The trick we need to use to solve the XOR-Problem (or other non-linear problems) using a linear classifier
such as logistic regression, is to employ basis functions. Analogously to how polynomial basis functions allow a
linear regression model to fit nonlinear datasets, the inputs of a classification problem can be transformed using
basis functions in such a way that the dataset becomes linearly separable. The decision boundary no longer
splits the input space of datapoints x in half, but the feature space of feature vectors ψ(x). One such feature
mapping is defined in Tab. 1.
The feature space and the new decision boundary are visualized in Fig. 27.
Figure 27: Feature space and decision boundary for the XOR problem
A different set of features that also transforms the input into a linearly separable form is ψ₁(x) = OR(x₁, x₂) and ψ₂(x) = NAND(x₁, x₂). These features can then be combined using XOR(x₁, x₂) = AND(ψ₁(x), ψ₂(x)). So even though each of the functions OR, AND and NAND is linearly separable, they can compute XOR when composed together. The final computational graph for XOR can then be written as follows:
h₁ = σ(w_OR^⊤ x + w_{OR,0})
h₂ = σ(w_NAND^⊤ x + w_{NAND,0})
ŷ = σ(w_AND^⊤ h + w_{AND,0})
Instead of going directly from input to output, the input gets transformed nonlinearly into the feature h. Here,
h is called a hidden layer. The equations can also be written more compactly:
h = σ(W x + w),   where W stacks w_OR^⊤ and w_NAND^⊤ as rows and w collects the biases w_{OR,0} and w_{NAND,0}
ŷ = σ(w_AND^⊤ h + w_{AND,0})
Figure 28: Solving the XOR Problem with a hidden layer
[Figure 29: A Multi-Layer Perceptron with an input layer, three hidden layers and an output layer.]
Figure 30: Hidden layers transform the input into better features
Each neuron in an MLP is fully connected to all neurons in the previous layer. The total number of layers is also called the depth of the model; this is where the name deep learning comes from. Technically, the neurons in an MLP are not perceptrons: perceptrons use a linear threshold activation function, which cannot be trained using backpropagation and is instead trained using the perceptron algorithm. MLPs use (mostly) differentiable
activation functions and are trained using backpropagation. In Fig. 29, a complete MLP is shown. The only
function of the hidden layers is to transform the input vectors into features that can be processed by the simple
linear regression or logistic regression output layer, as was done manually in the last section. This is illustrated
in Fig. 30. Some different activation functions that can be used are shown in Fig. 31.
Artificial neural networks are loosely inspired by biological neural networks. Neurons in the brain are also
structured in layers where one neuron has many inputs and computes one output. Even the sigmoid activation
function can be found in biological neurons. However, brains are very different from Multi-Layer Perceptrons
and the goal of deep learning is not to model the brain, but to build models that achieve good statistical
generalization in many different settings.
MLPs can be trained using the backpropagation algorithm and (stochastic) gradient descent as follows:
1. Initialize weights w, pick learning rate η and minibatch size |Xbatch |
2. Draw (random) minibatch Xbatch ⊆ X
3. For all elements (x, y) ∈ Xbatch of minibatch (in parallel) do:
(a) Forward propagate x through network to calculate h1 , h2 , . . . , ŷ
(b) Backpropagate gradients through network to obtain ∇w L(ŷ, y)
4. Gradient update: w^{t+1} = w^t − η · (1/|X_batch|) Σ_{(x,y)∈X_batch} ∇_w L(ŷ, y)
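A compact sketch of this procedure for a one-hidden-layer MLP with sigmoid activations and a binary cross-entropy output (all names, shapes and hyperparameters below are illustrative assumptions, not reference code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, y, n_hidden=8, eta=0.1, batch_size=16, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    W1 = 0.5 * rng.normal(size=(D, n_hidden)); b1 = np.zeros(n_hidden)
    w2 = 0.5 * rng.normal(size=n_hidden);      b2 = 0.0
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size)        # draw a random minibatch
        Xb, yb = X[idx], y[idx]
        H = sigmoid(Xb @ W1 + b1)                        # forward pass
        y_hat = sigmoid(H @ w2 + b2)
        d_out = (y_hat - yb) / batch_size                # BCE gradient, averaged over the batch
        dW2 = H.T @ d_out; db2 = d_out.sum()             # backward pass
        dH = np.outer(d_out, w2) * H * (1 - H)
        dW1 = Xb.T @ dH; db1 = dH.sum(axis=0)
        W1 -= eta * dW1; b1 -= eta * db1                 # gradient update
        w2 -= eta * dW2; b2 -= eta * db2
    return W1, b1, w2, b2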
1 https://cs.stanford.edu/people/karpathy/CNNjs/demo/classify2d.html
Finally, let us discuss what would happen if there were no non-linear activation functions. The following two-layer MLP
h = g(A1 x + b1 )
y = g(A2 h + b2 )
can be written as
y = g(A2 g(A1 x + b1 ) + b2 ).
If we used a linear activation function, such as g(x) = x, we can rewrite this equation:
y = A2 (A1 x + b1 ) + b2 = A2 A1 x + A2 b1 + b2 = Ax + b
Thus, with linear activations, a multi-layer network can only express linear functions.
Theorem 1 (Universal Approximation Theorem) Let σ be any continuous discriminatory function. Then
finite sums of the form
G(x) = Σ_{j=1}^N α_j σ(a_j^⊤ x + b_j)
are dense in the space of continuous functions C(I_n) on the n-dimensional unit cube I_n. In other words, given any f ∈ C(I_n) and ε > 0, there is a sum G(x) for which |G(x) − f(x)| < ε for all x ∈ I_n.
Remark: This theorem has been proven for various activation functions (e.g., Sigmoid, ReLU).
If we restrict ourselves to the space of binary functions f : {0, 1}D → {0, 1}, it is not difficult to see why this
result holds. Here, any function f is really just a table, such as the following:
x1 x2 x3 | y
 ⋮  ⋮  ⋮ | ⋮
 0  1  0 | 0
 0  1  1 | 1
 1  0  0 | 0
 ⋮  ⋮  ⋮ | ⋮
Figure 32: Linear threshold network
If we now take a linear threshold function as the activation function for each neuron, we can construct a 2-layer MLP with the output
ŷ = Σ_i [a_i^⊤ x + b_i > 0],
where each indicator term [a_i^⊤ x + b_i > 0] is the activation h_i of hidden neuron i.
See Fig. 32 for an illustration. Here, each hidden neuron h_i recognizes exactly one combination of inputs; see Fig. 32 and the table above for how the parameters of one such neuron are set. If we now do this for all 2^D combinations of inputs, we can construct a 2-layer MLP that is equivalent to any given binary function. We used a linear threshold function, which is not continuous; however, we can easily approximate this activation function using sigmoid functions, see Fig. 33.
Thus, every binary function can be approximated arbitrarily well with a 2-layer MLP using 2^D hidden neurons. This is an important theoretical result, but an exponential number of neurons means that
memory and computation time will also increase exponentially with the size of the input. Another problem
with the approach from above is that the network only memorizes the input-output pairs and thus will not be
able to generalize at all. By increasing the number of layers, a deep network needs far fewer parameters to
learn complex functions. This is because a deep architecture introduces an inductive bias: it assumes that the
[Plot of σ(x), σ(2x), σ(5x) and σ(50x) for x ∈ [−10, 10]: with increasing scale, the sigmoid approaches a step function.]
Figure 33: Sigmoid functions can approximate the Heaviside function
complex function learned is a composite of simpler functions. This leads to more compact models and better
generalization performance. As an example, the parity function
f(x₁, . . . , x_D) = 1 if Σ_i x_i is odd, and 0 otherwise
requires an exponentially large shallow network but can be computed using a deep network whose size is linear
in the number of inputs D.
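To make the compositional argument concrete, here is a small sketch (my own construction) that computes parity as a chain of soft XOR units like the one built in the previous section; the depth grows with D while each stage reuses the same small unit:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xor_unit(a, b, k=20.0):
    # soft XOR built from OR/NAND/AND threshold-like units
    h = sigmoid(k * np.array([a + b - 0.5, -a - b + 1.5]))
    return sigmoid(k * (h[0] + h[1] - 1.5))

def parity(x):
    out = float(x[0])
    for xi in x[1:]:
        out = xor_unit(out, xi)   # one extra stage per input dimension
    return round(float(out))

print(parity([1, 0, 1, 1]))   # 1 (odd number of ones)
print(parity([1, 1, 0, 0]))   # 0 (even number of ones)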
If the neurons of an MLP are activated with the absolute value rectification function, there is an intuitive geometric analog to how the layers of the network transform the input. The weights and biases of a layer define
a hyperplane in the feature space which defines a “mirror”. This means that the inductive bias of such networks
is the assumption, that complex functions arise as mirrored images of simpler patterns. This is illustrated in
Fig. 34. This image also intuitively conveys the exponential advantage of more layers over more parameters.
It has been shown multiple times that deeper networks generalize better and that introducing more layers
is often more effective than introducing more parameters, see Fig. 35. This means, that the inductive bias of
compositionality is a very useful prior over the space of hypothesis functions that a model can learn.
Figure 35: Deeper networks often perform better than shallow networks when using the same number of
parameters
4 Deep Neural Networks II
4.1 Output and Loss Functions
Input Layer Hidden Layer 1 Hidden Layer 2 Output Layer Loss Function Target
Figure 36: A simple neural network containing a one dimensional output layer and loss function.
Besides the input and hidden layers, an artificial neural network has an output layer containing the prediction
ŷ and a loss function L that compares the output with a given target y (see Fig. 36). The choice of the output
layer depends on the task (discrete or continuous predictions, classification or regression problem).
Figure 37: The loss as a divergence between distributions. (a) A large KL-divergence between p_data and p_model leads to a large loss value. (b) A small KL-divergence leads to a small loss value.
The loss function of a neural network quantifies the divergence of model output (=prediction) and the target
value. It evaluates the quality of a prediction by boiling it down to a single or few numbers. Predictions that are
similar to the desired target value should get a small loss value, while strongly diverging predictions should be
assigned a large loss value. In this sense you can think of the loss function as a measure of distance or cost being
paid for a prediction. In this way the loss function enforces similarity between predictions and target values.
Note that the values don't necessarily have to be single points, but might also be parameters of a probability distribution, as visualized in Fig. 37. In the latter case, the Kullback-Leibler divergence (KL divergence) serves as the loss, as it measures the divergence of two probability distributions.
But how to design a good loss function? Basically, a loss function can be any differentiable function that we
wish to optimize. But instead of designing a loss function by hand, it is often preferable to derive the cost
function from the maximum likelihood principle. To do so, consider the output of the neural network as
parameters of a distribution over outputs yi . The maximum likelihood principle is used to find the optimal
values for the parameters by maximizing the likelihood function derived from the training data. So using this
approach, we try to find model parameters w that maximize the likelihood of the model for a given training
dataset X.
For example, let a neural network f_w(x) predict the mean µ of a Gaussian distribution over y, that is p(y|x, w) = 1/√(2πσ²) · exp(−(y − f_w(x))² / (2σ²)). The goal is to maximize the probability of the target y under this distribution, which would result in the mean being shifted toward y. This setting is also visualized in Fig. 38.
As the loss function depends on the specific task at hand, the following two subsections will show how to derive
the loss function for different regression and classification problems. We will also discuss the design of the
output layer as both considerations are mutually dependent.
Figure 38: The predicted distribution p(y) with its mean and the target value marked; maximizing the likelihood shifts the mean toward the target.
Figure 39: The Gaussian distribution p(y).
The Gaussian Distribution (see Fig. 39) is parameterized by a mean $\mu$ and a standard deviation $\sigma$:
$$p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$$
Its characteristic is thin tails: p(y) → 0 quickly as y → ∞. This means outliers are strongly penalized; wrong
datapoints can impact the distribution significantly.
The L2 Loss
The formula of the L2 Loss is similar to the Maximum Likelihood in the first lecture, but now the parameter
µ is the prediction of a neural network. We assume the model distribution is a Gaussian distribution where
the Multi-Layer-Perceptron predicts the mean of that distribution, and σ is constant. When this is plugged
into ŵML (the Maximum Likelihood objective), the first expression can be removed, as it is constant with
respect to w. By removing the minus from the last expression, we are now minimizing instead of maximizing,
so we have arrived at our loss function (see equation below).
Let $p_{model}(y|x, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - f_w(x))^2}{2\sigma^2}\right)$ be a Gaussian distribution. We obtain:
$$\begin{aligned}
\hat{w}_{ML} &= \operatorname*{argmax}_w \sum_{i=1}^N \log p_{model}(y_i \,|\, x_i, w) \\
&= \operatorname*{argmax}_w \; -\sum_{i=1}^N \frac{1}{2}\log(2\pi\sigma^2) \;-\; \sum_{i=1}^N \frac{1}{2\sigma^2}\,(f_w(x_i) - y_i)^2 \\
&= \operatorname*{argmax}_w \; -\sum_{i=1}^N (f_w(x_i) - y_i)^2 \\
&= \operatorname*{argmin}_w \; \underbrace{\sum_{i=1}^N (f_w(x_i) - y_i)^2}_{\text{L2 Loss}}
\end{aligned}$$
This loss function is called the squared loss or L2 Loss. As this loss is strongly affected by outliers, often a
different distribution is used - the Laplace Distribution.
Figure 40: The Laplace distribution p(y).
The Laplace Distribution (see Fig. 40) has a similar form to the Gaussian distribution, except that now the
exponential contains the absolute difference between y and µ:
$$p(y) = \frac{1}{2b} \exp\left(-\frac{|y - \mu|}{b}\right)$$
The normalization constant $\frac{1}{2b}$ has changed slightly as well due to this. In the formula, µ denotes the location
of the distribution, while b represents the scale, similar to the standard deviation in the Gaussian distribution
determining the width of the distribution.
This distribution has heavier tails than the Gaussian: p(y) → 0 more slowly as y → ∞. This means more
probability mass is at the tails, so outliers are penalized less strongly, which often makes it the preferred choice
for regression problems in practice.
When deriving the loss here we do the same calculations as before, but now using the Laplace distribution.
Here, the location parameter µ is predicted by the Feed-Forward Neural Network, with b being an arbitrary
constant scale parameter.
Let $p_{model}(y|x, w) = \frac{1}{2b} \exp\left(-\frac{|y - f_w(x)|}{b}\right)$ be a Laplace distribution. We obtain:
$$\begin{aligned}
\hat{w}_{ML} &= \operatorname*{argmax}_w \sum_{i=1}^N \log p_{model}(y_i \,|\, x_i, w) \\
&= \operatorname*{argmax}_w \; -\sum_{i=1}^N \log(2b) \;-\; \sum_{i=1}^N \frac{1}{b}\,|f_w(x_i) - y_i| \\
&= \operatorname*{argmax}_w \; -\sum_{i=1}^N |f_w(x_i) - y_i| \\
&= \operatorname*{argmin}_w \; \underbrace{\sum_{i=1}^N |f_w(x_i) - y_i|}_{\text{L1 Loss}}
\end{aligned}$$
Refactoring the equation in a similar way as for the Gaussian, we arrive at a similar loss function (see above).
The difference is that the square has been replaced with the absolute value, meaning that the absolute loss
(called the L1 Loss) is being minimized. This loss is more robust to outliers than the L2 loss.
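As a quick sanity check of the two derivations, the following NumPy sketch (the function names l2_loss and l1_loss are our own, not from the lecture) evaluates both losses on a toy batch containing one outlier:

import numpy as np

def l2_loss(pred, target):
    # Sum of squared errors, derived from a Gaussian likelihood with fixed sigma.
    return np.sum((pred - target) ** 2)

def l1_loss(pred, target):
    # Sum of absolute errors, derived from a Laplace likelihood with fixed scale b.
    return np.sum(np.abs(pred - target))

# Toy batch: predictions f_w(x_i) and targets y_i, where the last sample is an outlier.
pred   = np.array([1.0, 2.0, 3.0, 4.0])
target = np.array([1.1, 1.9, 3.2, 9.0])

print(l2_loss(pred, target))  # outlier dominates: ~25 plus small terms
print(l1_loss(pred, target))  # outlier only contributes linearly: ~5 plus small terms

The example illustrates why the L1 loss is considered more robust: the single outlier contributes quadratically to the L2 loss but only linearly to the L1 loss.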
It is also possible to predict more than one parameter. Consider the Laplace Distribution again, but
this time predicting both the location parameter µ and the scale b with a neural network. Technically these
parameters are predicted by different neural networks $f_w(x)$ and $g_w(x)$, but these are typically the same
network except for a different output layer each.
Let $p_{model}(y|x, w) = \frac{1}{2\,g_w(x)} \exp\left(-\frac{|y - f_w(x)|}{g_w(x)}\right)$ be a Laplace distribution. We obtain:
$$\begin{aligned}
\hat{w}_{ML} &= \operatorname*{argmax}_w \sum_{i=1}^N \log p_{model}(y_i \,|\, x_i, w) \\
&= \operatorname*{argmax}_w \; -\sum_{i=1}^N \log\left(2\, g_w(x_i)\right) \;-\; \sum_{i=1}^N \frac{1}{g_w(x_i)}\,|f_w(x_i) - y_i|
\end{aligned}$$
The resulting expression is not as simple anymore, as both terms depend on w and thus can't be removed.
This also results in a more complicated loss function. This loss function is again derived from the maximum
likelihood principle by assuming a certain distribution over the data. So if there exists knowledge of the
distribution of the data, a loss function that fits can be derived that would be hard to specify otherwise.
Predicting both parameters allows for estimating the aleatoric uncertainty (observation noise) with the neural
network itself. This can be helpful, for example, in situations where an observation is not clear (for example an
image where part of the lens was covered, that is too dark, etc.): the uncertainty about the prediction can be
predicted too (larger or smaller b). The parameters can then be adjusted based on how certain the model is
for a given prediction. Another example would be predicting the category/pixel depth of an image, where it
is hard to make predictions at boundaries (foreground or background), resulting in higher uncertainty in these
regions (see Fig. 41).
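A minimal sketch of such a heteroscedastic loss is given below; the function laplace_nll and the toy arrays are illustrative assumptions and not the lecture's reference implementation. In practice the scale head would usually predict log b or pass through a softplus so that b stays positive.

import numpy as np

def laplace_nll(mu, b, y, eps=1e-6):
    # Negative log-likelihood of a Laplace distribution with predicted
    # location mu = f_w(x) and scale b = g_w(x); b is clipped to stay positive.
    b = np.maximum(b, eps)
    return np.sum(np.log(2.0 * b) + np.abs(y - mu) / b)

# Toy outputs of two network heads for a batch of 3 samples.
mu = np.array([0.9, 2.1, 5.0])   # predicted locations
b  = np.array([0.1, 0.1, 2.0])   # predicted scales (larger = more uncertain)
y  = np.array([1.0, 2.0, 4.0])

print(laplace_nll(mu, b, y))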
Mixture Density Networks All of the distributions considered so far were unimodal. In some cases like the
depth prediction above, we may not know exactly if a pixel belongs to foreground or background. This is hard
to model with a Laplacian or Gaussian distribution with a single peak. To represent multi-modal distributions,
we can also model mixture densities using a mixture model:
$$p_{model}(y|x, w) = \sum_{m=1}^M \pi_m \, \frac{1}{2\, g_w^{(m)}(x)} \exp\left(-\frac{|y - f_w^{(m)}(x)|}{g_w^{(m)}(x)}\right)$$
Figure 42: A mixture of two Laplace distributions over y.
Fig. 42 above shows a mixture model of two components using a simple Laplace distribution that is summed over
the number of components, resulting in a prediction for the location µm and scale bm for all the m modes (in
this case two). Here πm ∈ [0, 1] denotes the weight for each component, as all components are weighted relative
to each other. A further constraint on this parameter is that $\sum_m \pi_m = 1$.
This model is called a mixture density network, with a mixture of Laplace distributions in this case.
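The following hedged NumPy sketch evaluates the log-density of such a Laplace mixture for a single y using the log-sum-exp trick; all names (laplace_mixture_logpdf, pi, mu, b) are our own. In a mixture density network, pi would typically come from a softmax head and the scales from a positivity-enforcing head (e.g. exp or softplus).

import numpy as np

def laplace_mixture_logpdf(y, pi, mu, b):
    # log p(y) for a mixture of M Laplace components with weights pi,
    # locations mu and scales b (all arrays of shape (M,)).
    comp_log = -np.log(2.0 * b) - np.abs(y - mu) / b   # log of each component density
    a = np.log(pi) + comp_log
    m = np.max(a)                                      # log-sum-exp for numerical stability
    return m + np.log(np.sum(np.exp(a - m)))

# Two-component mixture as in the figure: peaks around -2 and +2.
pi = np.array([0.5, 0.5])
mu = np.array([-2.0, 2.0])
b  = np.array([0.5, 0.5])

for y in (-2.0, 0.0, 2.0):
    print(y, laplace_mixture_logpdf(y, pi, mu, b))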
Bernoulli distribution: The Bernoulli distribution is a distribution over 2 classes, so it is only applicable
for binary classification problems (e.g. classifying cats vs. dogs). It can be written as:
$$p(y) = \mu^y (1 - \mu)^{(1-y)}$$
Here, the parameter µ denotes the probability for y = 1. As before, the probabilities of both classes must sum
to 1.
Similar to before, we can assume our model distribution to be the Bernoulli distribution and put this into the
Maximum-Likelihood estimator to derive the loss function. Doing this we obtain the binary cross-entropy
(BCE) loss function (see below). Let $p_{model}(y|x, w) = f_w(x)^y (1 - f_w(x))^{(1-y)}$ be a Bernoulli
distribution. We obtain:
$$\begin{aligned}
\hat{w}_{ML} &= \operatorname*{argmax}_w \sum_{i=1}^N \log p_{model}(y_i \,|\, x_i, w) \\
&= \operatorname*{argmax}_w \sum_{i=1}^N \log\left[ f_w(x_i)^{y_i} \left(1 - f_w(x_i)\right)^{(1-y_i)} \right] \\
&= \operatorname*{argmin}_w \; \underbrace{\sum_{i=1}^N -y_i \log f_w(x_i) - (1 - y_i)\log\left(1 - f_w(x_i)\right)}_{\text{BCE Loss}}
\end{aligned}$$
In other words, maximizing the Log-Likelihood with the Bernoulli distribution as the model distribution is
equivalent to minimizing the BCE loss. The last layer of fw(x) can be a sigmoid function (or any other
squashing function) such that fw(x) ∈ [0, 1] can be interpreted as a probability. In this case, unlike for
regression problems, this is required as we have separate classes.
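As an illustrative sketch (assuming raw scores s as network outputs; the helper names are our own), the BCE loss on top of a sigmoid could be computed as follows:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def bce_loss(s, y, eps=1e-7):
    # Binary cross-entropy on raw scores s: the sigmoid squashes the score
    # into [0, 1] so it can be read as p(y = 1 | x); clipping avoids log(0).
    p = np.clip(sigmoid(s), eps, 1.0 - eps)
    return np.sum(-y * np.log(p) - (1.0 - y) * np.log(1.0 - p))

scores = np.array([3.0, -1.0, 0.5])   # raw network outputs
labels = np.array([1.0, 0.0, 1.0])
print(bce_loss(scores, labels))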
Categorical distribution:
So how can this approach be scaled up to multiple classes, such as in MNIST?
Figure: A Categorical distribution p(y) over four classes.
In the case of multiple classes, we again use the Maximum-Likelihood principle, but this time with the
Categorical distribution:
p(y = c) = µc
Here, the probability of y taking any of the classes c is µc . Again, all probabilities have to sum to 1.
Discrete distributions such as this one by definition accommodate multiple modes.
Alternative notation:
$$p(y) = \prod_{c=1}^C \mu_c^{y_c}$$
Consider the distribution not over one-dimensional categorical labels, but over a vector y, which is a one-hot
vector with yc ∈ {0, 1}, where the length of y is the number of classes. Each element of the vector is either
1 or 0, but the sum over the whole vector is 1, so only one element can be 1 (the true class). For example
y = (0, . . . , 0, 1, 0, . . . , 0)>.
With this definition the probability can be rewritten as a product over the terms $\mu_c^{y_c}$. Only where yc = 1 is the
term not equal to 1, but µc.
In the table below we see an example for this representation:

class c | y
1 | (1, 0, 0, 0)>
2 | (0, 1, 0, 0)>
3 | (0, 0, 1, 0)>
4 | (0, 0, 0, 1)>

In this example, four animals represent the different classes. Each class is represented by a one-hot vector
y with binary elements yc ∈ {0, 1}, where the index c with yc = 1 determines the correct class and yk = 0
for k ≠ c.
This vector can also be interpreted as a discrete distribution, with all the probability mass at the true class.
Again, we can reformulate to turn the maximization problem into a minimization problem and arrive at the so-called
Cross-Entropy loss, which goes across all classes:
$$\hat{w}_{ML} = \operatorname*{argmin}_w \; \underbrace{-\sum_{i=1}^N \sum_{c=1}^C y_{i,c} \log f_w^{(c)}(x_i)}_{\text{CE Loss}}$$
We compare the target for class c (the entry yc in the one-hot vector) with the prediction for that c. This means
we need a prediction for every c now. In other words, we minimize the cross-entropy (CE) loss.
The target y = (0, . . . , 0, 1, 0, . . . , 0)> is a one-hot vector with yc its c'th element.
The Softmax
How can we ensure that $f_w^{(c)}(x)$ predicts a valid Categorical (discrete) distribution?
The requirement for that is that (1) each element $f_w^{(c)}(x) \in [0, 1]$ and (2) $\sum_{c=1}^C f_w^{(c)}(x) = 1$, so the distribution
has to sum to 1.
Using the element-wise Sigmoid for example, we would ensure (1), but not (2). This problem can be solved by
defining a so called softmax function on top of the affine predictions (scores) for each class, which guarantees
(1) and (2):
$$\text{softmax}(x) = \left(\frac{\exp(x_1)}{\sum_{k=1}^C \exp(x_k)}, \ldots, \frac{\exp(x_C)}{\sum_{k=1}^C \exp(x_k)}\right)$$
So the exponential of each individual element is divided by the sum of the exponentials over all elements.
Let s denote the network output after the last affine layer (=scores). Then:
$$f_w^{(c)}(x) = \frac{\exp(s_c)}{\sum_{k=1}^C \exp(s_k)} \;\;\Rightarrow\;\; \log f_w^{(c)}(x) = s_c - \log \sum_{k=1}^C \exp(s_k)$$
So for each class we have the output of the neural network as input to the softmax. Taking the logarithm of
the softmax we arrive at the Log Softmax function above. We can already see that sc contributes directly
(linearly) to the loss function, so it does not saturate.
Intuition: Assume c is the correct class. Our goal is to maximize the log softmax:
$$\log f_w^{(c)}(x) = s_c - \log \sum_{k=1}^C \exp(s_k)$$
The first term encourages the score sc for the correct class c to increase. The second term encourages all scores
in s to jointly decrease. It can be approximated by $\log \sum_{k=1}^C \exp(s_k) \approx \max_k s_k$, as $\exp(s_k)$ is
insignificant for all $s_k < \max_k s_k$. Therefore, the loss always strongly penalizes the most active incorrect
prediction. If the most active prediction is the correct one (i.e., $s_c = \max_k s_k$), it is not penalized because both
terms roughly cancel each other out; if it is a wrong prediction, it is penalized.
Figure: Left: the scores sc predicted for four classes. Right: the exponentials exp(sc) of these scores.
On the left we see four classes and their scores predicted by the neural network. On the right we see the
exponentials of these scores. All classes that do not have the top score become almost insignificant when
taking their exponentials. For example, the second term becomes $\log \sum_{k=1}^C \exp(s_k) = 4.06 \approx s_3 = \max_k s_k$,
so the value is very similar to the maximum.
Assuming c = 2 is the correct class, we obtain $\log f_w^{(c)}(x) = s_c - \log \sum_{k=1}^C \exp(s_k) = 1 - 4.06 \approx -3$.
For c = 3 we obtain $\log f_w^{(c)}(x) = s_c - \log \sum_{k=1}^C \exp(s_k) = 4 - 4.06 \approx 0$. So we get a much larger
value if the correct class corresponds to the class to which our neural network assigned the highest score, and a
much lower value if we didn't assign the correct class.
In the binary case, the softmax reduces to the sigmoid. Setting x2 = 0 (which is possible since the softmax
only depends on the differences between its inputs):
$$\begin{aligned}
\text{softmax}(x) &= \left(\frac{\exp(x_1)}{\exp(x_1) + \exp(x_2)},\; \frac{\exp(x_2)}{\exp(x_1) + \exp(x_2)}\right) \\
&= \left(\frac{\exp(x_1)}{\exp(x_1) + 1},\; \frac{1}{\exp(x_1) + 1}\right) \\
&= \left(\frac{1}{1 + \exp(-x_1)},\; 1 - \frac{1}{1 + \exp(-x_1)}\right) \\
&= \left(\sigma(x_1),\; 1 - \sigma(x_1)\right)
\end{aligned}$$
As seen above, the softmax can be rewritten, such that the first expression is the expression of the Sigmoid
function. This means that the Softmax is effectively a multiclass generalization of the sigmoid function.
In practice, the overparameterized version where all scores are predicted is often used, as it is simpler to
implement and doesn’t make a big difference.
The name Softmax is somewhat confusing, with ”soft argmax” being a more precise name, as it is a
continuous and differentiable version of the argmax function in one-hot representation:
$$\text{softmax}(s) = \left(\frac{\exp(s_1)}{\sum_{k=1}^C \exp(s_k)}, \ldots, \frac{\exp(s_C)}{\sum_{k=1}^C \exp(s_k)}\right)$$
Example with 4 classes:
Figure: From left to right: the scores sc, their exponentials exp(sc), the softmax, and the argmax (one-hot encoding) for four classes.
Here we first see the scores on the left, then the exponentials of these scores, then the softmax and then the
argmax or one-hot encoding for category 3. As we can see, the softmax and the argmax are almost the same,
so the softmax function can be regarded as an approximation of the argmax.
We have seen that the softmax responds to differences between inputs. It is also invariant to adding
the same scalar c to all of its inputs:
$$\text{softmax}(x) = \text{softmax}(x + c)$$
We can therefore derive a numerically more stable variant: for all inputs, we subtract the maximum of all the
inputs/scores before applying the softmax, i.e. we compute softmax(x − maxk xk). This allows for accurate
computation even with finite numerical precision, even when the x become large. It illustrates again that the
softmax depends only on the differences between individual scores and not on some global term that we add or
subtract.
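A minimal NumPy sketch of this numerically stable variant (the function name is our own):

import numpy as np

def softmax(s):
    # Numerically stable softmax: subtracting max(s) leaves the result
    # unchanged (shift invariance) but prevents overflow in exp.
    z = s - np.max(s)
    e = np.exp(z)
    return e / np.sum(e)

s = np.array([1000.0, 1001.0, 1002.0])
print(softmax(s))   # ~ [0.09, 0.24, 0.67], no overflow
# A naive implementation would compute np.exp(1002.0), which overflows to inf in float64.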
Example: Suppose we have 4 classes, so C = 4, and 4 training samples x with labels y represented in one-hot
encoding:

Label y | Predicted scores s | softmax(s) | CE Loss
(1, 0, 0, 0)> | (+3, +1, −1, −1)> | (0.85, 0.12, 0.02, 0.02)> | 0.16
(0, 1, 0, 0)> | (+3, +3, +1, +0)> | (0.46, 0.46, 0.06, 0.02)> | 0.78
(0, 0, 1, 0)> | (+1, +1, +1, +1)> | (0.25, 0.25, 0.25, 0.25)> | 1.38
(0, 0, 0, 1)> | (+3, +2, +3, −1)> | (0.42, 0.16, 0.42, 0.01)> | 4.87
Suppose we have a model making the predicted scores shown above: for the first input it does a good job at
predicting the correct class, for the second one it is uncertain between classes 1 and 2, for the third it is
uncertain about all possible classes, and for the last one the model actually predicts the wrong class.
After computing the softmax of these scores, we can then compute the Cross-Entropy loss. We can see that
the CE loss is relatively small for the first example, as the model assigned the correct class a high probability.
For the second example, the CE loss increases, as the uncertainty increases. In the third it increases even more,
as the uncertainty is even bigger. In the final example the model predicts the wrong class, resulting in a very
high CE loss. If this were a minibatch in our stochastic gradient descent optimization, then sample 4 would
contribute most strongly to the loss function.
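The rows of the table can be reproduced with a few lines of NumPy (the helper cross_entropy is our own naming, not part of the lecture code):

import numpy as np

def cross_entropy(scores, label_idx):
    # CE loss = -log softmax(scores)[correct class], computed stably.
    z = scores - np.max(scores)
    log_softmax = z - np.log(np.sum(np.exp(z)))
    return -log_softmax[label_idx]

examples = [
    (np.array([3., 1., -1., -1.]), 0),   # confident and correct
    (np.array([3., 3., 1., 0.]),  1),    # uncertain between classes 1 and 2
    (np.array([1., 1., 1., 1.]),  2),    # maximally uncertain
    (np.array([3., 2., 3., -1.]), 3),    # confidently wrong
]
for s, c in examples:
    print(round(cross_entropy(s, c), 2))   # ~0.16, 0.78, 1.39, 4.87 (matches the table up to rounding)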
Activation functions must be non-linear for the network to learn non-linear mappings. Some activation functions
are not differentiable everywhere, but are still suitable for training.
Figure: The Sigmoid activation function (g(x) plotted over x).
The Sigmoid maps the input to the range [0, 1], which can be interpreted as a probability or analogously to the
saturated “firing rate” of neurons in the brain.
It does come with some problems however: The saturation on the higher and lower ends “kills” the gradients,
which leads to problems in backpropagation. The second problem is that the output is not zero-centered, but
between 0 and 1, thus introducing a bias from the first layer on (positivity bias).
Figure 49: Sigmoid Problem 2 (allowed gradient update directions). Figure 50: Sigmoid Problem 2: Inefficient Updates (the update path within the allowed directions vs. the optimal update).
We want to compute the gradient of the loss function with respect to the parameters ai (see Fig. 49). If we have
multiple hidden layers with a Sigmoid activation function for each of them, we know that the output of each
hidden layer is always positive. Thus, the input to the next layer is also always positive. Furthermore, the
gradient of the Sigmoid is also always positive. This means that the gradient wrt. the parameters ai is given by:
$$\frac{\partial L}{\partial a_i} = \frac{\partial L}{\partial g}\frac{\partial g}{\partial a_i} = \frac{\partial L}{\partial g}\frac{\partial g}{\partial x}\frac{\partial x}{\partial a_i} = \frac{\partial L}{\partial g}\frac{\partial g}{\partial x}\, x_i$$
As all xi are positive and the gradient of the sigmoid is also always positive, the last two factors are always
positive. Therefore, sgn(∂L/∂ai) = sgn(∂L/∂g), so all gradients have the same sign (+ or −).
The problem with this is that it restricts the space in which gradient updates can happen and leads to
effectively very inefficient optimization (see Fig. 50). This problem becomes even bigger in higher-dimensional
space, as the subset of possible directions to move into is even smaller relative to the full space. This problem
can be somewhat alleviated by using minibatches.
Figure 51: The Tanh activation function.
To address the problems of the Sigmoid function, the Tanh function has been proposed (see Fig. 51):
$$g(x) = \frac{2}{1 + \exp(-2x)} - 1$$
It is generally quite similar to the Sigmoid function, but the output range is fundamentally different: it maps
the input to the range [−1, 1]. This is an anti-symmetric mapping from the input domain to the output. The
advantage of this is that it makes the function zero-centered, so we have negative and positive output values.
This alleviates problem 2 of the Sigmoid, but Tanh is also a saturating activation function, therefore also killing
gradients.
Figure 52: The ReLU activation function.
Another activation function that has been proposed is the Rectified Linear Unit (see Fig. 52), which is the most
commonly used activation function in practice. It is defined as:
g(x) = max(0, x)
It is called Rectified Linear Unit as we have a linear component for x > 0, while all values x < 0 are set to 0;
for positive inputs it does not saturate. It is not differentiable at 0, but that does not matter for training as we
are not going to query the function exactly at 0.
This function typically leads to much faster convergence than the Sigmoid or Tanh activation functions, and is
also computationally very efficient.
It is also not zero-centered, however. Furthermore, there is no learning for x < 0, so if the input becomes smaller
than 0, these ReLUs become so-called "dead ReLUs". It often happens in practice that some neurons are
dead and don't learn anymore.
As shown in Fig. 53, whenever we have an input value x < 0 this activation function is 0, so the downstream
gradients are also going to be zero, so there is no learning. For this reason, we often initialize these ReLUs
with a positive bias (b > 0).
Figure: The Leaky ReLU activation function.
g(x) = max(0.01x, x)
It is generally very similar to the normal ReLU, except that now there is a small slope for all values x < 0 as well.
The advantage of this activation function is that it also does not saturate, so the gradients won't die. It is
also closer to zero-centered outputs, leads to fast convergence and is computationally efficient.
The Parametric ReLU Activation Function
There are more alternatives to the Leaky ReLU, for instance the Parametric ReLU as a generalization of the
Leaky ReLU, where the factor α is itself a learnable parameter:
g(x) = max(αx, x)
Figure 55: The ELU activation function.
Another activation function that has been proposed is the Exponential Linear Units function (see Fig. 55),
defined as g(x) = x for x > 0 and g(x) = α(exp(x) − 1) for x ≤ 0. It is very similar to the Leaky ReLU again,
with the difference that it is also differentiable at 0 and saturates for small x, which adds robustness to noise
for some problems. Otherwise it shares the benefits of the leaky ReLU. Its default is α = 1.
A generalization of the ReLU activation function is the Maxout activation function as proposed by Goodfellow
et al.: the Maxout function takes the maximum over multiple affine predictions (for example two, $a_1^\top x$ and
$a_2^\top x$). It thus increases the number of parameters per activation function, which can be a disadvantage.
As a rule of thumb it is advisable to use ReLU with a sufficiently small learning rate by default, but
Leaky ReLU, Maxout or ELU can be tried out for some small potential additional gain. In general, Tanh
should be preferred over Sigmoid, and is often used in recurrent models.
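For illustration, the ReLU family can be written in a few lines of NumPy (a sketch with our own function names, not a recommendation of specific hyperparameters):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope alpha for x < 0 keeps gradients from dying completely.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Differentiable at 0 (for alpha = 1) and saturates smoothly for very negative x.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x), leaky_relu(x), elu(x), sep="\n")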
Figure 57: (a) Original data (b) Zero-centered (c) Normalized.
The gradient can be approximated numerically using finite differences (forward and central differences, respectively):
$$\frac{\partial f(x)}{\partial x} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
$$\frac{\partial f(x)}{\partial x} = \lim_{h \to 0} \frac{f(x + h) - f(x - h)}{2h}$$
For a particular h we can evaluate this expression and compare it to the analytic gradient that we have
implemented in order to verify it.
But how to choose h? For h = 0 the expression is undefined, so it has to be chosen bigger than 0 but small
enough. When choosing h too small, rounding errors may occur due to the finite precision of the data types. If
it is chosen too large we may get approximation errors because the secant approximation becomes inaccurate.
A good choice is usually $h = \sqrt[3]{\epsilon}$ with $\epsilon$ the machine precision. Examples are $\epsilon = 6 \times 10^{-8}$ for single precision (32
bit) and $\epsilon = 1 \times 10^{-16}$ for double precision (64 bit). (Example omitted due to space constraints.)
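A possible gradient-check helper along these lines (names and the test function are illustrative) could look as follows:

import numpy as np

def numerical_gradient(f, x, h=None):
    # Central differences: (f(x+h) - f(x-h)) / (2h), evaluated per coordinate.
    if h is None:
        h = np.finfo(x.dtype).eps ** (1.0 / 3.0)   # ~6e-6 for float64, cube root of machine precision
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h
        grad.flat[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

# Check the analytic gradient of f(x) = sum(x**2), which is 2x.
f = lambda x: np.sum(x ** 2)
x = np.random.randn(5)
print(np.max(np.abs(numerical_gradient(f, x) - 2.0 * x)))   # should be very small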
The result of zero-centering is visualized in Fig. 57b. The original data is shifted such that the mean is 0
for every feature. While being one of the most commonly applied preprocessing steps, note that the exact
implementation of centering can vary slightly in practice: For AlexNet the overall mean image (W × H × 3
numbers) is subtracted. VGGNet and ResNet compute a per-channel mean (mean along each channel: 3
numbers) that is subtracted from the training data for preprocessing. In ResNet, data points are additionally
normalized using a per-channel standard deviation.

Figure 58: The effect of different initialization choices: (a) small random values, (b) large random values, (c) Xavier. Tanh was used as activation function for all examples.

Normalization refers to transforming the data to have unit variance. This can be achieved by dividing the data
points by the feature-wise standard deviation:
$$x_{i,j} \leftarrow x_{i,j} / \sigma_j \quad \text{with} \quad \sigma_j^2 = \frac{1}{N}\sum_{i=1}^N (x_{i,j} - \mu_j)^2$$
The effect is depicted in Fig. 57c. Two further techniques that require computing the eigenvalues of the
covariance matrix are decorrelation (Fig. 57d) and whitening (Fig. 57e). Decorrelation refers to multiplying
with the eigenvectors of the covariance matrix. Additionally dividing by the square root of the eigenvalues of the
covariance matrix is called whitening. Both decorrelation and whitening are less commonly used.
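The preprocessing steps above can be sketched in NumPy as follows (zero_center, normalize and whiten are our own helper names; a small eps guards against division by zero):

import numpy as np

def zero_center(X):
    # Subtract the per-feature mean (columns of X are features).
    return X - X.mean(axis=0)

def normalize(X, eps=1e-8):
    # Zero-center, then divide by the per-feature standard deviation.
    Xc = zero_center(X)
    return Xc / (Xc.std(axis=0) + eps)

def whiten(X, eps=1e-8):
    # Decorrelate via the eigenbasis of the covariance matrix, then
    # divide by the square root of the eigenvalues (whitening).
    Xc = zero_center(X)
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    Xdecorr = Xc @ eigvecs                  # decorrelation
    return Xdecorr / np.sqrt(eigvals + eps)

X = np.random.randn(1000, 3) @ np.array([[2., 1., 0.], [0., 1., 0.], [0., 0., 5.]])
print(np.cov(whiten(X).T).round(2))   # approximately the identity matrix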
Constant initialization Naively, initializing all weights with a constant value (e.g. 0) seems to be an easy
and practicable idea. Unfortunately, any constant initialization scheme will perform very poorly as it hinders
learning drastically. In the forward pass all weights will have the same influence and thus receive the same
correction signal in the backward pass. This will be the case throughout training and causes all neurons to
learn the same function, thus significantly reducing the expressiveness of the neural network.
Random initialization An alternative approach is to initialize the weights at random. However, even with
random numbers it is important to choose them appropriately.
Small Random Numbers: In order to achieve initial weights not equal but close to zero, weights could be
drawn from a Gaussian with small standard deviation (e.g. σ = 0.01). Unfortunately, this leads to the problem
of vanishing gradients, which refers to an exponential decrease of the activation with deeper layers (see Fig. 58a).
Since in backpropagation the error signal is multiplied with the respective activation, the signal that will be
passed on to subsequent layers will get small at an exponential scale and deeper layers will no longer receive
meaningful gradients. This causes learning to be very slow or even diverge. Using the chain rule, this can be
seen by setting the activation xi to 0, which also causes the gradient to be 0:
$$\frac{\partial L}{\partial a_i} = \frac{\partial L}{\partial g}\frac{\partial g}{\partial x}\, x_i = \frac{\partial L}{\partial g}\frac{\partial g}{\partial x} \cdot 0 = 0$$
Large Random Numbers: Reacting to the problem of vanishing gradients the next obvious idea to try would
be to initialize the weights with large random numbers, i.e. draw them independently from a Gaussian with large
standard deviation (e.g. σ = 0.2). This however can cause exploding gradients as the backpropagated gradients
will be multiplied with large values in every layer and can cause the network to oscillate. When using saturating
activation functions large random weights cause all activation functions to saturate such that no meaningful
gradient can be backpropagated, which again hinders learning (see Fig. 58b). Looking at the chain rule, this
becomes apparent by setting the local gradient ∂g/∂x to 0 (saturated activation function):
$$\frac{\partial L}{\partial a_i} = \frac{\partial L}{\partial g}\frac{\partial g}{\partial x}\, x_i = \frac{\partial L}{\partial g} \cdot 0 \cdot x_i = 0$$
Xavier initialization A more sophisticated initialization strategy is Xavier initialization which was proposed
by Glorot et al. [6]. The main idea is to adapt the magnitude of the initial weights to the input size of
the respective layer. The authors propose to draw weights independently from a Gaussian distribution with
σ 2 = 1/Din , where Din denotes the dimension of the input to the respective layer. Note that this may vary
across layers as their input size varies. A result of this initialization is shown in Fig. 58c. The choice of
$\sigma = 1/\sqrt{D_{in}}$ can be motivated theoretically. Let us consider $y = g(w^\top x)$ and assume that all xi and wi are
independent and identically distributed (i.i.d.) with zero mean. Let further g'(0) = 1, so that near the origin
$y \approx w^\top x$. Then:
$$\operatorname{Var}(y) \approx \operatorname{Var}\left(\sum_{i=1}^{D_{in}} w_i x_i\right) = \sum_{i=1}^{D_{in}} \operatorname{Var}(w_i)\operatorname{Var}(x_i) = D_{in}\operatorname{Var}(w_i)\operatorname{Var}(x_i)$$
Thus:
$$\operatorname{Var}(w_i) = 1/D_{in} \;\Rightarrow\; \operatorname{Var}(y) = \operatorname{Var}(x_i)$$
It is important to note that the Xavier initialization assumes a zero-centered activation function. This is
particularly visible when comparing the resulting activations when using tanh versus ReLU as activation function.
For tanh the activation distribution appears to be well scaled across all layers. For ReLU and variants, collapsing
activations are observable for deeper layers. Thus, Xavier initialization should be used for zero-centered activation
functions. For ReLU activation functions, a slightly adapted initialization strategy was proposed by He et al. [7].
He initialization As ReLU is restricted to the positive real line, He et al. [7] adapt the Xavier initialization
by doubling the variance. Weights are now drawn from a Gaussian with σ 2 = 2/Din . For ReLU activation
functions this leads to a well scaled activation distribution across all layers.
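A minimal NumPy sketch of both schemes (function names are our own); the small experiment at the end illustrates, under the stated assumptions, that He-initialized ReLU layers keep the activation scale roughly constant with depth:

import numpy as np

rng = np.random.default_rng(0)

def xavier_init(d_in, d_out):
    # Gaussian with variance 1/d_in: keeps Var(y) = Var(x) for zero-centered activations such as tanh.
    return rng.normal(0.0, np.sqrt(1.0 / d_in), size=(d_in, d_out))

def he_init(d_in, d_out):
    # Gaussian with variance 2/d_in: compensates for ReLU zeroing half of the inputs.
    return rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_in, d_out))

# He initialization keeps ReLU activation statistics roughly constant across layers.
x = rng.normal(size=(512, 256))
for _ in range(10):
    x = np.maximum(0.0, x @ he_init(256, 256))
print(x.std())   # stays on the order of 1 instead of collapsing toward 0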
5 Regularization
5.1 Problem Statement
The primary goal of machine learning techniques (e.g. polynomial curve fitting, image classification using
MLPs) is to learn a model from the training data which achieves low generalization error, i.e. to learn a model
which performs well on new, previously unseen samples (test set: assumed to be drawn i.i.d. - independently
and identically distributed - from the true data distribution) and not just on the training set.
Now that we have formalized what we want our training model to do, the important question still remains what
should the capacity/ complexity of our training model be to achieve such a low generalization error ? The best
way to answer this is to look at the bias-variance trade-off curve plotted against model complexity and to think
of generalization error in terms of bias/variance. (Fig. 59)
If we choose to train a model with low capacity (lower number of parameters/ weights, low variance) then we
risk underfitting (Fig. 60) i.e. the model is too simple, has high bias and can not even fit our training data
properly let alone perform well on the test set.
On the other hand if we chose a model with very high capacity (higher number of parameters/ weights, high
variance) then we risk overfitting (Fig. 61) i.e. the model has high variance and has fit too well to our training
Figure 59: Bias-Variance Trade-Off Curve
Figure 60: Example of Underfitting (M = 1): the polynomial fit is too simple to capture the ground truth underlying the noisy observations and the test set.
data because of which it cannot generalize well to new unseen samples leading to high test error (=generalization
error).
Takeaway: High variance + Low bias = High generalization error
From the curve(Fig. 59) we can see that the optimal model complexity for which our model would obtain the
lowest total error is also when we have both low bias and low variance. Since at this optimal model capacity
we have both low bias and low variance we can also be confident about obtaining low generalization error.
For example: In the case of fitting polynomial functions we see in Fig. 62 that when the degree of the polynomial
is 3 (optimal model complexity), both the variance and bias are low (because training set error is low) along
with the lowest generalization error (test set error).
So how do we find this sweet-spot (optimum model complexity) empirically for which we have both low bias/-
variance and low generalization error ? This is exactly where the technique of regularization figures in.
5.2 Intuition
We have two logical ways to empirically find the optimal model capacity. One way would be to start off with a
low capacity model (high bias, high training error, underfitting regime) and iteratively increase its complexity
(increase variance / reduce bias) till we obtain an optimal model which has the lowest generalization error and
low training error. The other way would be the exact opposite, i.e. start off with a high capacity model (high
Figure 61: Example of Overfitting (M = 9): Training Error is Low but Generalization Error is High.
Figure 62: Example of Good Fit (M = 3): Generalization Error and Training Error are Low.
variance, low training error, overfitting) and iteratively reduce its capacity (reduce variance / increase bias) to
obtain an optimal model with lowest generalization error.
In practice, the second option is highly preferred, purely for the fact that if we start with a low capacity model,
i.e. a model which underfits our data, we can never be sure about how much capacity needs to be added such
that both training error and test error are low. In the second option, by contrast, we have one less variable to worry
about, namely the training error: when we have a model which overfits our training data we can be very confident
that reducing our model capacity by a bit will not affect the training error drastically.
Now that we have chosen the second option, i.e. starting with a high capacity model, we have two ways to reduce
the model capacity, viz. either reduce the variance OR increase the bias of the model.
Important Note: As can be seen in the bias-variance trade-off curve (Fig. 59), increasing bias would reduce
variance and vice-versa. Our goal is to find the sweet spot where both variance and bias are low, as not only
would it lead to low total error but also to low generalization error.
Again in practice, increasing the bias is highly preferred over reducing the variance of a model since there is no
way to determine which and how many parameters/weights need to be removed to decrease model capacity to
the desired optimal capacity, whereas there are many simple methods to increase the bias of a model. This is
exactly what the concept of regularization is all about (Fig. 63).
5.3 Concept
Regularization is a technique that trades increased bias for reduced variance of a model. When we apply
regularization to a high capacity model which is overfitting the data (high variance, very low bias), the idea
is that increasing the bias from a very low value to a slightly larger value would decrease the variance of the
model as per the bias-variance trade-off i.e. reduce the model capacity without increasing its bias drastically.
Figure 63: Regularized polynomial fit (M = 9, regularization weight 10^-8): the regularized high-capacity model fits the data well without overfitting.
After applying regularization to an overfitting model, we hope to end up with an optimal model which has
both low bias/ variance (=low generalization error) and low total error as is desired. In other words the goal of
regularization is to minimize the generalization error when using large models (high complexity models).
Another good way to understand regularization is to visualize what it does in function space, as can be seen in
the Fig. 64.
Figure 64: Regularization in function space: the Function Space, the functions consistent with the Data, the Regularizer constraint, and the resulting Solution.
1. In the figure, the set Function Space denotes the family of all possible functions that our model can
represent.
2. The intersection between the Function Space and Data denotes the subset of functions within the function
space which are constrained by the Data, i.e. the family of functions belonging to the function space of
our model that maximize the likelihood (minimize negative log-likelihood loss) of the given data.
(Note: Since we do not know what the true generative distribution of our data is, it need not completely
overlap with the function space. Real-life data can be arbitrarily complex and it is very likely that it may
not be able to be fully represented by even the most complex/ deep models. Our goal is to always find a
good functional approximation which fits the data well.)
3. The Regularizer is an additional soft constraint on the function space and forces the model to represent
functions which are close to the constraint boundary.
4. The solution that we want i.e. the optimal model lies within the intersection of the Function Space, Data
and Regularizer.
In summary, both the Data and Regularizer should be viewed as constraints on the Function Space. The goal
of introducing the regularizer is to encourage the model to learn a function which definitely fits the data well
but most importantly is also not too complex.
5.5 Parameter Penalties
Let X = (X, y) denote the dataset and w the model parameters. (Note: In the case of a Multi-Layer Perceptron
or a Deep Neural Network, w is the vector which contains all the flattened weight matrices of each layer stacked
together into a single column vector.) We can limit the model capacity by adding a parameter norm
penalty R to the loss L:
$$\underbrace{\tilde{L}(\mathcal{X}, w)}_{\text{Total Loss}} \;=\; \underbrace{L(\mathcal{X}, w)}_{\text{Original Loss}} \;+\; \alpha \underbrace{R(w)}_{\text{Regularizer}} \qquad (22)$$
In the above loss, α is a hyper-parameter and has to be found empirically by performing cross-validation and
then choosing a value which yields best generalization performance.
Important Note: The above loss is called the "parameter" penalty loss as the Regularizer R in the loss does
not depend on the dataset X and penalizes only the parameter vector w, i.e. the parameters of the model.
• R quantifies the size of the parameters / model capacity, as by penalizing the parameters the regularizer
forces only a few parameters to be active effectively reducing the original model capacity.
• Minimizing L̃ will decrease both L and R. What this means is that we want to end up with a model
which not only fits the data well (because L is minimized) but is also not too complex (because R is
minimized).
• Typically, R is applied only to the weights (not the bias) of the affine layers of the model. The two
important reasons for doing this are
– The bias term decides how far away the output of the affine transform layer should be from the origin.
Adding the bias term to the regularizer would force the bias of each layer to be close to 0, effectively
constraining the meaningful solution space and thus hampering the representational capacity of the
model.
– Since we typically only have one bias term per layer, the number of bias terms is minuscule when
compared to the number of weight parameters, therefore not constraining the bias would make no
significant difference to our optimization objective.
• Often, R drives weights closer to the origin (in absence of prior knowledge). Since typically no prior
knowledge about the problem is available, constraining the weights to be close to 0 is the best possible
practice. This is because smaller weights lead to less complicated solutions. Thus the regularizer R
encourages the model to be as simple as possible.
• But the second network might predict wrongly if the test distribution is slightly different (x1 and x2
match less closely) ⇒ Worse generalization. The reasoning behind this statement is that even though
both networks give the same predictions the second network does this by using a complex model i.e. large
weights, whereas the first network does the same using a less complex model i.e. small weights. Now if
both these models were to see a sample which would be slightly different than the test distribution i.e.
if the input varies by a bit, the difference in activations between the current sample and a sample from
the test set would be extremely high for the second network given the large weights, possibly leading to
wrong predictions. Whereas for the first network the difference in activations between the current sample
and any sample from the test set would be lower given the small weights, very likely leading to a correct
prediction. Thus, small weights also add some kind of robustness to small changes in the input to the
first network consequently leading to better generalization performance.
Figure 66: Parameter penalization in parameter space: contours of the objective function and of the regularizer, and the resulting optimum.
What does parameter penalization look like in parameter space? In Fig. 66, the green contour plot
is that of the objective function and the red contour plot is that of the regularizer. If no regularizer were
applied, one would expect the optimum to lie at the centre of the green contour plot, given how we
would be minimizing just the original loss. However, when we minimize the total loss L̃(X, w) in (22), we
additionally also minimize the regularizer. Given this joint loss objective, the optimum shifts so as to also lower
the regularizer term. As can be seen in Fig. 66, adding the regularizer forces the value of w1
to increase by a bit from the previous optimum at the centre of the green plot and the value of w2 to decrease
significantly. Thus, these two competing losses, i.e. the original loss and the regularizer, force the optimum to
shift to somewhere in-between the two plots.
5.5.2 L2 Regularization
Also known as weight decay (= ridge regression in the context of linear regression), L2 regularization uses an
L2 norm penalty $R(w) = \frac{1}{2}\|w\|_2^2$. (All variables below are the same as defined in (22).)
The gradient of the total loss is $\nabla_w \tilde{L}(\mathcal{X}, w) = \nabla_w L(\mathcal{X}, w) + \alpha w$, so a single gradient descent step becomes
$$w_{t+1} = w_t - \eta\left(\nabla_w L(\mathcal{X}, w_t) + \alpha w_t\right) = (1 - \eta\alpha)\, w_t - \eta \nabla_w L(\mathcal{X}, w_t),$$
where wt+1 are the updated weights obtained after performing a single gradient descent step on wt and
η is the step-size, i.e. the learning rate.
Important Note: As can be observed from the final gradient descent update step, there is a slight difference
here compared to the normal gradient descent step done without L2 regularization. The first term in the
L2 parameter update step is (1 − ηα)wt as opposed to just wt used in the unregularized parameter update.
Now, since both 0 < η, α < 1, their product ηα < 1, so (1 − ηα) is a number slightly less than 1. Since we are
multiplying the weight vector wt with a quantity less than 1 before each parameter update, we are effectively
decaying the weights of the model at each training iteration before the gradient update while performing L2
regularization.
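A single regularized update step can be sketched as follows (assuming plain SGD; the function and variable names are our own):

import numpy as np

def sgd_step_weight_decay(w, grad_L, eta=0.1, alpha=0.01):
    # Gradient step on the regularized loss L + (alpha/2)*||w||^2:
    # the weights are first decayed by (1 - eta*alpha), then updated as usual.
    return (1.0 - eta * alpha) * w - eta * grad_L

w = np.array([1.0, -2.0, 3.0])
grad_L = np.array([0.5, 0.0, -0.5])        # gradient of the unregularized loss
print(sgd_step_weight_decay(w, grad_L))    # [0.949, -1.998, 3.047]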
We have now seen what happens during a single gradient update step while using an L2 regularizer.
What happens when using an L2 regularizer over the entire course of training?
Let w∗ = argmin w L(X , w) denote the solution to the unregularized objective and consider a
quadratic approximation L̂ of the unregularized loss L around w∗ . The quadratic approximation is a
multi-variate Taylor series expansion around the optimum w∗
$$\begin{aligned}
\hat{L}(\mathcal{X}, w) &= L(\mathcal{X}, w^*) + \underbrace{g^\top (w - w^*)}_{\text{Linear Component}} + \underbrace{\tfrac{1}{2}(w - w^*)^\top H (w - w^*)}_{\text{Quadratic Contribution}} \\
&= L(\mathcal{X}, w^*) + \tfrac{1}{2}(w - w^*)^\top H (w - w^*)
\end{aligned}$$
with gradient vector g = 0 (since the gradient at the optimum w∗ is 0) and positive semi-definite Hessian matrix H
(second-order derivatives).
When including the regularization term, this approximation becomes:
$$\hat{L}(\mathcal{X}, w) = L(\mathcal{X}, w^*) + \tfrac{1}{2}(w - w^*)^\top H (w - w^*) + \tfrac{\alpha}{2} w^\top w$$
Setting the gradient at the regularized minimizer w̃ to zero (L(X, w∗) does not depend on w, so its derivative wrt. w is 0):
$$\nabla_w \hat{L}(\mathcal{X}, \tilde{w}) = H(\tilde{w} - w^*) + \alpha\tilde{w} = 0
\;\;\Rightarrow\;\; (H + \alpha I)\tilde{w} = H w^*
\;\;\Rightarrow\;\; \tilde{w} = (H + \alpha I)^{-1} H w^*$$
Thus, as α approaches 0, the regularized solution w̃ approaches w∗ . Note: This is because when α = 0,
w̃ = (H)−1 Hw∗ = Iw∗ This statement also intuitively makes a lot of sense, because as the influence of the
regularizer approaches 0, the current optimum in Fig. 66 would start shifting back to the centre of the green
contour plot i.e. the optimum of the unregularized objective.
Inserting the eigendecomposition $H = Q\Lambda Q^\top$:
$$\begin{aligned}
\tilde{w} &= (H + \alpha I)^{-1} H w^* \\
&= \left(Q\Lambda Q^\top + \alpha I\right)^{-1} Q\Lambda Q^\top w^* \\
&= \left(Q(\Lambda + \alpha I)Q^\top\right)^{-1} Q\Lambda Q^\top w^* \qquad \text{(substitute } I = QQ^\top\text{)} \\
&= Q(\Lambda + \alpha I)^{-1}\Lambda Q^\top w^* \qquad \text{(apply } (ABC)^{-1} = C^{-1}B^{-1}A^{-1}\text{, and } Q^{-1} = Q^\top\text{)}
\end{aligned}$$
Since Λ is a diagonal matrix containing the eigenvalues of H, we know that (Λ + αI)−1 is also a diagonal matrix
with its diagonal components being $\frac{1}{\lambda_i + \alpha}$, where λi is the i-th diagonal entry of Λ, i.e. the i-th eigenvalue of
the matrix H.
With this in mind, let us read the last equation from the right-hand side:
Q⊤w∗ signifies a change of basis of the vector w∗ onto the eigenbasis of H, i.e. every component of w∗ is now
aligned with the respective eigenvectors of H.
Applying (Λ + αI)−1Λ to Q⊤w∗ thus means that the component of Q⊤w∗ that is aligned with the i-th
eigenvector of H is rescaled by a factor of $\frac{\lambda_i}{\lambda_i + \alpha}$.
Now multiplying this scaled vector by Q simply signifies a change of basis from the eigenbasis of H back to the
original basis.
Thus, in summary, what L2 regularization does is take every component of w∗ that is aligned with the i-th
eigenvector of H and rescale it by a factor of $\frac{\lambda_i}{\lambda_i + \alpha}$. From this term it is easy to see that regularization
affects directions with small eigenvalues λi ≪ α the most.
Figure 67: The unregularized objective and the L2 regularizer in parameter space.
5.5.3 L1 Regularization
The L1 regularization is another parameter penalization method to increase bias in a model. Unlike L2 regu-
larization, the L1 regularizer R(w) = kwk1 , tries to minimize the L1 norm of the weight vector w instead of
minimizing its L2 norm. Since the regularization objectives for both regularizers are different, their contour
plots are different as well as can be seen in Fig. 68 and Fig. 67.
Figure 68: The unregularized objective and the L1 regularizer in parameter space.
• L1 Regularization results in a solution which is more sparse (compared to L2 )
Important Note: If we observe the contour plots of both the L1 regularizer (Fig. 68) and the L2
regularizer(Fig. 67) closely we can see that the optimum of the regularized objective w̃ in the L1 case is
closer to the origin along the direction of w1 than it is in the case of L2. What this signifies is that the L1
regularizer tries to squeeze and set most of the parameters as close to 0 as possible (w1 in the L1 example)
and allots some amount of flexibility and freedom to the remaining parameters (w2 in this example, which
in-fact increased by a bit to reach w̃ in the L1 case). Since the output weight vector after applying L1
regularization would consist of mostly 0’s, it is said to encourage ”sparsity” of solutions.
A good way to visualize what the L1 and L2 regularizers do is to look at Fig. 69 and Fig. 70. In both these
figures, we can think of the high level features in the orange box as being the output of the penultimate layer
of our model. The model is now supposed to make a classification decision based on these high level features.
The figures illustrate what solutions such a model trained with either an L2 or an L1 regularizer would prefer.
Figure 69: Effect of L2 regularizer in high level feature space. [9] The L2 regularizer prefers a model
which takes all the input features into account, i.e. every feature will have a corresponding weight
Figure 70: Effect of L1 regularizer in high level feature space. [9] Since the L1 regularizer prefers sparse
solutions, it will prefer a model which pays attention to only a few key features, i.e. the weights for the other
features are set to be close to 0 and thus ignored.
With computation similar to the one done above, the L1 regularizer can also be interpreted as the MAP
inference of w with a Laplace distribution prior.
5.5.6 What does the Computation Graph of a Parameter Penalty Regularizer look like?
Important Note: An iteration means that we have iterated through a single mini-batch. An epoch means
that we have iterated through all the mini-batches i.e. iterated through our complete training dataset one time.
In Fig. 72,
• While training error decreases over time, validation error starts increasing again
Figure 72: Typical Loss Curve
• The graph shows that training for long amounts of time does not always help in reducing validation error,
for e.g. in this case, the validation error actually steadily increases with the number of epochs, even though
the training error decreases slowly.
• Thus, the most logical approach is to: train for some time and return parameters with lowest val-
idation error i.e. store model parameters every fixed number of epochs and at the end rollback to the
saved parameters which give the lowest validation error.
• There is a small additional cost in terms of training resources and time, as to do early stopping the
validation set needs to be forward propagated every fixed number of epochs so that we can keep track of
the validation error.
Figure 73: Early stopping (left) and L2 regularization (right) in parameter space, showing the unregularized objective and the L2 regularizer.
Early stopping:
• Dashed: trajectory taken by SGD (starts from near the origin as we initialize our weights to be close to 0)
• Trajectory stops at w̃ before reaching the minimum w∗

L2 Regularization:
• Regularize objective with L2 penalty
• Penalty forces minimum of regularized loss w̃ closer to origin
• Under some assumptions, both early stopping and parameter penalties can be considered equivalent, as
can also be understood intuitively by looking the similar solutions that both approaches obtain in Fig. 73
• Efficient as a single training run tests all hyperparameters (unlike weight decay)
• Only cost: periodically evaluate validation error on validation set
• To reduce this cost the Validation set can be made smaller than the training data (which is usually the
case) OR/ AND the evaluation of the validation set can be done less frequently
Remark: If little training data is available, one can perform a second training phase where the model is
retrained from scratch on all training data using the same number of training iterations determined by the early
stopping procedure. It is important to note that the early stopping procedure was carried out on a training
split (split the little training data into training/validation sets as is done normally). We then use the number of
training iterations obtained from this procedure to train a model on the whole dataset (no splitting into
sets unlike earlier).
Intuition:
• Different models make different errors on the test set
• By averaging we obtain a more robust estimate without a better model!
• Works best if models are maximally uncorrelated
• Winning entries of challenges are often ensembles (e.g., Netflix challenge), as empirically speaking it is
very likely that using ensemble methods gives a 1-2% performance improvement in most tasks
• Drawback: requires evaluation of multiple models at inference time
• Here, v denotes the variance of the error of a single model and c the covariance between the errors of two
models; for an ensemble of K models the expected squared ensemble error is $\frac{1}{K}v + \frac{K-1}{K}c$.
• If errors are fully correlated (c = v), the ensemble error becomes v ⇒ no gain (since v is the variance of the
error of a single model, no performance has been gained by ensembling)
• If errors are uncorrelated (c = 0), the ensemble error reduces to $\frac{1}{K}v$ (we have gained performance by
reducing the error variance from v to $\frac{1}{K}v$ when the errors are uncorrelated)
Thus:
• Ensemble maximally effective if errors maximally uncorrelated i.e. when c = 0
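The two limiting cases can be illustrated with a small NumPy simulation (all numbers and names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
K, N = 10, 100_000
v = 1.0

# Uncorrelated model errors: averaging K models reduces the error variance to v/K.
errors = rng.normal(0.0, np.sqrt(v), size=(N, K))
print(np.var(errors.mean(axis=1)))        # ~ v / K = 0.1

# Fully correlated errors (c = v): all models make the same mistakes, no gain.
shared = rng.normal(0.0, np.sqrt(v), size=(N, 1))
errors_corr = np.repeat(shared, K, axis=1)
print(np.var(errors_corr.mean(axis=1)))   # ~ v = 1.0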
5.7.2 Different types of Ensemble Methods
• Initialization: Train networks starting from different random initialization on same dataset or using
different minibatches (via stochastic gradient descent). This often already introduces some independence.
Example of Bagging
• First model learns to detect the top "loop", second model detects the bottom "loop"
• The intuition behind using these models as part of an ensemble is that each model learns to perform a
simple task which is independent of the other (detecting the top loop for the model trained on the first resampled
dataset and detecting the bottom loop for the model trained on the second resampled dataset). We can
thus average these models, which each learn to perform a simple task, into an ensemble which gives better
performance by averaging over the predictions of each individual model.
5.8 Dropout
Idea:
• During training, set neurons to zero with probability µ (typically µ = 0.5)
Important Note: Removing a neuron or setting it to zero means that we remove all the incoming and
outgoing connections of the neuron from the model architecture.
Also this procedure of dropout is applied to all neurons i.e. every neuron in the network will be dropped
out with probability µ.
• Each binary mask (after performing dropout we obtain a binary mask i.e. is a given neuron in the network
active or not) is one model. This is because every different binary mask would give rise to a different
model architecture.
• This probability-based dropout mask is applied in every iteration. Since in every iteration we flip a biased/
unbiased coin for every neuron to determine whether it will participate in training, the binary mask of
the complete network changes randomly with every training iteration
• Creates ensemble “on the fly” from a single network with shared parameters. While performing dropout
we get random binary masks in every iteration i.e. random models in every iteration. Thus, dropout is a
technique which allows us to take advantage of ensemble methods i.e. to average predictions of different
models (generated in each iteration) from just a single network without any of the computational overhead
introduced in traditional ensemble methods like creating/ training multiple models from scratch, storing
parameters of each model, etc.
Important Note: The ensemble created by dropout is exponentially large and grows with the number
of training iterations. This is because the probability of getting the same binary mask i.e. same model
in any future iteration is extremely low. Thus, dropout effectively adds a new model to the ensemble in
every iteration.
the effect of each other. This phenomenon is called co-adaptation and is undesirable as in such a scenario
the network does not learn any useful representation, given how it is busy cancelling out its own neurons.
Whereas when using dropout since any neuron can be dropped out randomly, the network cannot learn to
change the output of nearby neurons by relying on the output of a certain neuron. Thus dropout prevents
co-adaptation of features from occurring.
• Requires only one forward pass at inference time (Explanation in Section 5.8.2)
ŷz = fw (x, z)
Important Note: The output of the neural network ŷ now not only depends on the weight vector w but
also on the binary mask z generated by dropout. Since the binary mask is random, the model architecture
is random and thus the output with dropout is also random.
Here, z is a binary mask with one element per unit drawn i.i.d. from a Bernoulli distribution $p(z_i) = \mu^{1-z_i}(1-\mu)^{z_i}$,
where zi = 0 if neuron i is removed from the network.
• At inference time, we want to calculate the ensemble prediction:
$$\hat{y} = f_w(x) = \mathbb{E}_z\left[f_w(x, z)\right] = \sum_z p(z)\, f_w(x, z)$$
Important Note: Every binary mask generated by dropout corresponds to a new model. To take benefit
of the idea proposed in ensemble methods(Section 5.7), we then take the expectation of our individual
model predictions over the generated ensemble i.e. the set of all models (binary masks) that have been
generated by dropout.
• The number of possible ensemble members generated by dropout is exponential in the number of neurons. That is, if we
have M neurons in our original model, we have $2^M$ possible binary masks, i.e. $2^M$ possible models.
Important Note: We have to sum over all z (i.e. over all possible binary masks) to calculate the
ensemble prediction using dropout.
Thus, to obtain a single ensemble prediction we have to sum over $2^M$ possible terms (since we have $2^M$
possible binary masks), leaving the calculation intractable, as the number of terms in the summation
rises exponentially with the number of neurons.
How do we solve the intractability of calculating an Ensemble Prediction while using dropout ?
fw (x) = w1 x1 + w2 x2
fw (x, z) = z1 w1 x1 + z2 w2 x2
Note: fw (x, z) is the dropout equivalent of making a prediction, where z1 and z2 are the binary masks of a
neuron that can either be 0 (neuron has been dropped out) or 1 (neuron will remain active). Assuming µ = 0.5,
during training we optimize the expectation over the ensemble:
$$\begin{aligned}
\mathbb{E}_z\left[f_w(x, z)\right] &= \frac{1}{4}(0 + 0) + \frac{1}{4}(w_1 x_1 + 0) + \frac{1}{4}(0 + w_2 x_2) + \frac{1}{4}(w_1 x_1 + w_2 x_2) \\
&= \frac{1}{2}(w_1 x_1 + w_2 x_2) = \frac{1}{2} f_w(x)
\end{aligned}$$
Note: We have 4 terms in the summation as we have 4 possible combinations of z1 and z2 . Each of the
terms are weighted equally ( 14 ), since the probability of getting a given combination out of the 4 possible is
p(z1 )p(z2 ) = 0.5 ∗ 0.5 = 0.25.
Important Note: In the above equation, we see that the expected prediction of the trained model is half the
output of the original network; in general, the factor is the keep probability (1 − µ), which equals 0.5 here since
µ = 0.5. This implies that the weights learnt by the model trained using dropout must be 1/(1 − µ) times the
weights of the original model, as only then would both models give the same prediction.
It is also important to note that dropout is only used during training. Therefore, while performing inference, we
want our model trained using dropout to behave the same way as the original model.
Thus, at test time, we must multiply the trained weights (which are 1/(1 − µ) times the weights of the original
model) by the keep probability (1 − µ); for µ = 0.5 this simply means multiplying by 0.5.
Remark: This weight scaling inference rule is only an approximation for non-linear models. However, this
approximation still works really well empirically, as can be seen in Fig. 78.
Figure 77: Comparing the outputs of an autoencoder trained without (left) and with dropout
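A minimal sketch of dropout at training time and weight scaling at test time (with µ denoting the drop probability, so 1 − µ is the keep probability); the helper names are our own and the mask is applied to activations for simplicity:

import numpy as np

rng = np.random.default_rng(0)

def dropout_train(x, mu=0.5):
    # Sample a binary mask z (z_i = 0 with probability mu) and drop those units.
    z = (rng.random(x.shape) >= mu).astype(x.dtype)
    return x * z

def dropout_test(x, mu=0.5):
    # Weight scaling at test time: multiply activations by the keep
    # probability (1 - mu) to match the expected training-time output.
    return x * (1.0 - mu)

x = np.ones(8)
print(dropout_train(x))   # random binary mask applied, different in every call
print(dropout_test(x))    # deterministic scaling, single forward pass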
Motivation:
• Deep neural networks must be invariant to a wide variety of input variations. We want our model to
be invariant (not affected by) to input variations, as we want it to classify all inputs irrespective of their
variations correctly. For e.g. we would want our model to correctly classify both an otter lying on its back
and an otter sitting on a rock as an otter, despite the variations in these 2 images.
Figure 79: Variations of images from a single class
• Often in real-life data there exist large intra-class variations in terms of pose, appearance, lighting,
etc. This makes image classification in particular an extremely hard task.
iaa.Crop(px=(1,64))
iaa.Fliplr(0.5)
Affine Transformation:
• Apply a single affine/ linear transformation to the 2D image space i.e. scale, rotate, shear an image etc.
• Blank spaces left after the transformation can be filled by using different strategies like using constant
color, expand image colors, replicate colors, etc.
iaa.Affine()
Piecewise Affine Transformation:
• Similar to Affine transformation albeit now a single transformation is not applied to the whole image
• The image is perceived as an underlying grid. A piece-wise affine transformation is applied independently
to each grid cell. This causes different regions in the image to be distorted differently
iaa.PiecewiseAffine(scale=(0.01, 0.1))
Perspective Transformation:
• Similar to Affine transformation albeit a perspective transform has more degrees of freedom
• The perspective effect causes certain regions of the image to be squeezed and other regions to be enlarged
iaa.PerspectiveTransform(scale=(0, 0.4))
Gaussian Blur:
iaa.GaussianBlur(sigma=(0.0, 10.0))
Image Sharpening:
• Does the opposite of introducing blur i.e. introduces sharpness into input images
iaa.Sharpen(alpha=(0, 0.5), lightness=(0.75, 1.25))
Emboss Effect:
• Another effect which introduces pixel-wise light/shadow and could improve generalization performance
Edge Detection:
• Uses the edge detected version of a given image
• It is important to verify that the edge-detected versions are still reasonable and do not change or make it
impossible to decipher the semantics of the class
iaa.EdgeDetect(alpha=(0, 1.0))
Important Note: In Fig. 80 we can see that when a deep network was trained on a dataset where uniform
noise was added to both the train and test images, it managed to achieve super-human performance.
However, when the network was trained on the same dataset, albeit this time with salt-and-pepper noise
(a noise where pixels of the image are set to white or black based on a specified probability), the network
managed to obtain only chance-level performance, i.e. as good as random guessing. This result is especially
surprising given that both datasets (one with uniform noise and the other with salt-and-pepper noise) look
like they have similar noise distributions, and yet the difference in performance of the network between
these two visually similar datasets is drastic. This validates the point that networks are highly susceptible
to noise.
• Noise is present everywhere while collecting data, for instance noise induced by camera sensors while
capturing data, noise induced by the scene illumination conditions, etc.
Figure 80: How is classification performance affected by different kinds of noise? [5]
Following are some of the most popular additive-noise augmentation techniques to improve generalization
performance:
Gaussian Noise:
• Add Gaussian noise to each pixel in an image. The strength of the additive noise is chosen randomly for different images
Salt-and-Pepper Noise:
• Each pixel in an image is turned into either black or white depending on a specified probability
iaa.SaltAndPepper(0.1)
Dropout Noise:
• Unlike salt-and-pepper noise, which sets pixels to white or black, dropout noise only turns pixels to black, based on a specified dropout probability
iaa.Dropout((0.01, 0.5))
Cutout Noise:
• A more structured form of noise which involves cutting out regions of an image, i.e. setting the pixels of entire regions to a specific color.
• Noise can also be applied to the hidden units, not only to the input
• A prominent example of applying noise to the hidden units is Dropout. Another example would be adding random noise individually to each hidden unit
• The advantage of adding noise to the hidden units and not just the input is that we are adding robustness
not just to input-level features but also at intermediate levels where higher level concepts and higher level
representations of the neural network are established
• As seen in Section 5.8, adding noise to the hidden units of a network also works really well empirically to
gain better generalization performance
5.9.4 Color Transformations
Why are Color Transformations important?
• Color transformations have played a key-role in the success of most neural networks that work well on the
Imagenet dataset
• Cameras produce different color spectra based on the type of sensors used and the kind of white-balancing performed. An even more important reason is the change in lighting conditions while capturing data; for example, the colors captured by a camera for the same scene change dramatically depending on the time of day, i.e. sunset has warmer colors, mid-day has brighter colors, etc. We thus want our models to be invariant to such variations so that our generalization performance increases. This is exactly why color transformations are an essential form of data augmentation
Following are some of the most popular color transformations techniques to improve generalization performance:
Contrast:
• Change the contrast of images, making it either weaker or stronger.
Brightness:
• Change the brightness of the whole image to make it either darker or brighter
Local Brightness:
• Brightness changes are applied locally per channel; the affected locations are determined by a specified frequency noise
iaa.FrequencyNoiseAlpha(exponent=(-4, 0), first=iaa.Multiply((0.5, 1.5), per_channel=True))
Hue and Saturation:
• Randomly shift the hue and saturation of the image
iaa.AddToHueAndSaturation((-50, 50))
Color Inversion:
• Even though color inverted images are not a realistic choice as they look nothing like natural images,
depending on the task they could still be a good strategy to improve generalization performance
Grayscale:
• Convert the images into a spectrum between grayscale and colored images, the strength of which is determined by the specified α value
iaa.Grayscale(alpha=(0.0, 1.0))
5.9.5 Weather Effects
Even though weather effects might sound complex they are still very easy to compute and can be done
on-the-fly without requiring any sophisticated graphics engine.
Snow:
• Introduces a snow-like effect on the images
Clouds:
• Introduces an effect on the images which looks like clouds
iaa.Clouds()
Fog:
• Introduces a fog effect on the images
iaa.Fog()
Figure 82: Random Data Augmentation Combinations applied to a single input image
Figure 83: Random Data Augmentation Combinations applied to a single input image
• Remark: For general object recognition, flips and rotations can often be useful!
• For dense prediction tasks (depth/instance/keypoints), also transform the targets. For example, in tasks like instance segmentation, depth prediction or stereo depth estimation, any affine/warping/translation transformation applied to the input also affects the ground truth (whereas per-pixel noise would make no difference). Thus, in such dense prediction tasks, the outputs need to be transformed accordingly so that we obtain the valid ground truth for the transformed input image
• It is important to specify the right distributions (often done empirically). Not all transformations would
be applicable or beneficial to improving performance in a given task. Thus the kind of transformations
that would help in improving performance and their corresponding strength (hyperparameters) need to
be chosen empirically
• Data augmentation can also be combined with the ensemble idea (cf. Fig. 84 and Fig. 85):
Figure 84: Random Data Augmentation Combinations applied to a single input image
Figure 85: Random Data Augmentation Combinations applied to a single input image
6 Optimization
We have already discussed the most basic forms of optimization: stochastic gradient descent and the even simpler gradient descent. In the following we will learn about some variants of these, as well as some strategies on how to set up a deep learning project and how to debug our model if it shows unexpected behaviour.
6.1 Optimization Challenges
First, let’s revisit gradient descent and in the following look at the most frequent challenges that we encounter
while optimizing a deep learning model.
w0 = winit
wt+1 = wt − η ∇w L(wt )
First we initialize w0 at timestep 0 to some initial value, then in every timestep we update wt+1 by adding a
fraction of the negative gradients to the weights wt of the current timestep.
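As a small illustration of this update rule, the following NumPy sketch runs a few gradient descent steps on a hypothetical 1D loss (the loss, learning rate and starting point are chosen arbitrarily):

import numpy as np

def L(w):            # toy 1D loss
    return (w - 2.0) ** 2

def grad_L(w):       # its analytic gradient
    return 2.0 * (w - 2.0)

eta = 0.1            # learning rate
w = -1.0             # w_0 = w_init
for t in range(50):
    w = w - eta * grad_L(w)   # w_{t+1} = w_t - eta * grad L(w_t)

print(w, L(w))       # w approaches the minimum at 2.0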
Figure 87: Gradient Descent A 1D example of a few gradient descent update steps.
In Fig. 87 we can see a 1D example loss function L(w) w.r.t. the network parameters w. The loss is a non-convex function, which is the case for most neural network losses. Therefore there are multiple local minima and one global minimum that we want to reach in the optimal case, although we can only find one of them through optimization. The good news is that many local minima in deep neural networks are good ones. Here we can nicely see why we need to add the negative gradient to progress further. The gradient at w0 is negative, which means that if we added it directly to the weight w, the weight (x-axis) would get smaller, resulting in a new w1 that is to the left of w0, thus worse. By adding the negative gradient, in this case a positive number, to w0 we end up at the w1 shown in the figure, thus closer to the local minimum.
Figure 88: Divergence A 1D example of a few gradient descent update steps that diverge.
A learning rate that is chosen too low results in too many steps being needed to converge to the minimum, whereas a learning rate that is too high might lead the network to diverge. Fig. 88 shows this divergence behaviour, where the step size is too large such that we jump over the minimum to a possibly steeper part of the loss function (higher gradient). This might repeat itself, leading to divergence.
6.1.3 Cliff
Another challenge regarding the gradients is a cliff (Fig. 89), where the updated weight wt jumps over the
minimum onto a steep cliff (very high gradient), which then catapults the next updated parameters wt+1
further away from the minimum. A common heuristic to counteract such effects is to clip the gradient to an a priori selected range, which introduces another hyperparameter we need to search for.
Figure 89: Cliff A 1D example of a few gradient descent update steps that jumps off a cliff.
Figure 90: Saddle point An example for a saddle point of a simple loss function.
6.1.4 Saddle Point
The saddle point (see Fig. 90) of a loss function is tricky, because the gradient ∇w L(w) = 0 even though we are not at a minimum, not even a local one. In reality this is mostly not a problem, even though there are many saddle points in deep learning, because the chance to exactly hit a saddle point is very low: the gradient would need to be exactly zero in every parameter direction, which is unlikely when working with millions of parameters.
6.1.5 Plateau
A region similar to the saddle point is the plateau (Fig. 91), where the gradients are close to zero (e.g. saturated
sigmoid activation function, dead ReLUs).
6.1.6 Ravine
A ravine (Fig. 92) is a very narrow valley with a small gradient along the slope of the valley. The ravine in
the loss function leads to the minimum. Due to the narrow valley it is very easy to diverge, similar to Section 6.1.2 and Section 6.1.3. Furthermore, due to the small gradient it is hard to follow the narrow path to the minimum.
Figure 92: Ravine An example for a ravine of a loss function.
Thus the total loss can also be approximated by the following formula:
$$\frac{1}{N}\sum_{i} \nabla_w L_i(w_t) = \mathbb{E}_{i \sim U\{1,N\}}\left[\nabla_w L_i(w_t)\right] \approx \frac{1}{B}\sum_{b} \nabla_w L_b(w_t)$$
This represents a noisy approximation of the total loss, because training on a batch estimates the total loss,
but adds a little noise as the smaller batch sizes lead to a larger variance in the gradients. These batches can
either be chosen randomly or by partitioning the dataset. Either way they should be as independent as possible,
therefore make sure to shuffle the training set. We also introduce some terminology here to make sure we talk
about the same concept.
• Iteration = a single gradient update based on a single minibatch wt → wt+1
• Epoch = complete pass through the training set (= N/B iterations)
The algorithm for stochastic gradient descent now looks like this:
1. Initialize weights w0 , pick learning rate η and minibatch size |Xbatch |
2. Draw random (shuffled) minibatch {(x1 , y1 ), . . . , (xB , yB )} ⊆ X (with B ≪ N)
3. For all minibatch elements b ∈ {1, . . . , B} do:
(a) Forward propagate xb through network to calculate prediction ŷb
(b) Backpropagate to obtain batch element gradient ∇w Lb (wt ) ≡ ∇w L(ŷb , yb , wt )
4. Update the weights: $w_{t+1} = w_t - \eta \frac{1}{B}\sum_b \nabla_w L_b(w_t)$
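A minimal NumPy sketch of the four steps above, applied to a hypothetical linear regression problem (data, batch size and learning rate are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression data: y = X w* + noise
N, D, B, eta = 1000, 5, 32, 0.05
w_true = rng.normal(size=D)
X = rng.normal(size=(N, D))
y = X @ w_true + 0.01 * rng.normal(size=N)

w = np.zeros(D)                                        # step 1: initialise weights
for epoch in range(20):
    perm = rng.permutation(N)                          # shuffle -> independent batches
    for start in range(0, N, B):
        idx = perm[start:start + B]                    # step 2: draw a minibatch
        y_hat = X[idx] @ w                             # step 3a: forward pass
        grad = X[idx].T @ (y_hat - y[idx]) / len(idx)  # step 3b: batch gradient
        w = w - eta * grad                             # step 4: SGD update

print(np.linalg.norm(w - w_true))                      # should be close to zero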
Figure 93: SGD loss example An example loss function for 2 parameters shown as contour plot.
Let’s look at an example (Fig. 93) for a 2D parabola loss curve described by L(w) = (0.1 w1 )2 +w22 and shown as
a contour plot, where each ellipse represents a loss value. As we can see by looking at the loss the gradient slopes
upward more quickly in the w2 dimension and more slowly in the w1 dimension (factors 1 and 0.1 respectively).
The resulting gradient can be formulated by:
$$\nabla_w L(w) = (0.02\, w_1 \;\; 2\, w_2)^\top + \mathcal{N}(0, 0.03)$$
with $\frac{\partial L}{\partial w_1} = 0.02\, w_1$ and $\frac{\partial L}{\partial w_2} = 2\, w_2$. To simulate the process of stochastic gradient descent of minibatches, we have added Gaussian noise to the gradient as well.
Although SGD helps with gradient descent’s efficiency, the problem of how to choose the learning rate still
exists:
In Fig. 94 we can observe the effect a learning rate that is chosen too low (very slow convergence - top) or too
high (divergence - bottom) has on the convergence of the model. In the case of a learning rate that is too high,
every update overshoots the previous w2 which leads to oscillation and divergence, although it gets closer to
the optimum in the w1 dimension. We can also see the effect of the noise as the points along w1 are not equally
distributed.
When choosing a better learning rate, the effect is obvious (Fig. 95):
The update path is still oscillating and slow, but does not diverge anymore.
Figure 94: SGD bad learning rates Different learning rates on the previous toy example. Top: very low learning rate (η = 0.10). Bottom: too high learning rate (η = 1.01).
Figure 95: SGD good learning rate Good learning rate (η = 0.99) on the previous toy example.
A series is convergent if there exists a number s* such that for every arbitrarily small positive number ε, there exists an integer N such that for all n ≥ N:
$$|s_n - s^*| < \epsilon \qquad (23)$$
This means that if we go on for long enough, we will be arbitrarily close to the optimal solution.
Figure 96: SGD not converging A good learning rate (η = 0.98) on the previous toy example still shows non-convergence at the optimum.
In the case of the SGD update step $w_{t+1} = w_t - \eta \nabla_w L_B(w_t)$, this results in:
$$w_0 = w_{init}$$
$$w_1 = w_0 - \eta \nabla_w L_0$$
$$w_2 = w_1 - \eta \nabla_w L_1 = w_0 - \eta \nabla_w L_0 - \eta \nabla_w L_1$$
$$w_3 = w_2 - \eta \nabla_w L_2 = w_0 \underbrace{-\, \eta \nabla_w L_0}_{=a_1} \underbrace{-\, \eta \nabla_w L_1}_{=a_2} \underbrace{-\, \eta \nabla_w L_2}_{=a_3}$$
with a1, a2 and a3 being elements of a series. By applying the convergence criterion (23) we can conclude that optimization converges if there exists a vector w* such that for every arbitrarily small positive number ε, there exists an integer T such that for all t ≥ T:
$$\|w_t - w^*\| < \epsilon$$
Let gt be an unbiased estimate of the gradient ∇w L(wt), i.e., E[gt] = ∇w L(wt). Then, for a suitably decaying learning rate ηt, the series
$$w_{t+1} = w_t - \eta_t\, g_t$$
converges for t → ∞ to a local minimum of L(w).
This means on the one hand that we can guarantee that SGD converges to a local minimum when using a decaying learning rate (like ηt = η/t); on the other hand, we cannot guarantee that it converges to a global minimum for a non-convex loss function. We can, however, guarantee convergence to the global minimum for a convex loss function, because there the local and global minima coincide.
In conclusion, the problems of SGD are that the contribution of the gradient to the update is scaled equally across all dimensions, and that SGD requires a conservative learning rate to avoid divergence, which then slows down the process. In general, finding a good learning rate is difficult.
6.2.5 Momentum
A popular extension of SGD that addresses this is momentum:
$$m_{t+1} = \beta_1 m_t - \eta \nabla_w L_B(w_t)$$
$$w_{t+1} = w_t + m_{t+1}$$
Figure 97: SGD with Momentum SGD with Momentum (η = 0.99, β1 = 0.9) on the previous toy example.
The momentum term m is responsible for the improvement: it keeps a little bit (β1; typically β1 = 0.9) of the motion from the previous iteration mt in the current update mt+1. By setting β1 = 0, we recover standard stochastic gradient descent. Although momentum is introduced as stated above, a better parameterization is the following linear combination
$$m_{t+1} = \beta_1 m_t + (1 - \beta_1)\, \nabla_w L_B(w_t)$$
$$w_{t+1} = w_t - \eta\, m_{t+1}$$
as it decouples the momentum β1 and the learning rate η. We can now change the momentum and the learning rate independently and inspect their behaviour.
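A minimal sketch of the decoupled momentum update on the toy loss from Fig. 93 (the noise level and starting point are assumptions, chosen to mimic the figure):

import numpy as np

def grad_L(w):
    # gradient of the toy loss L(w) = (0.1*w1)^2 + w2^2, plus minibatch noise
    g = np.array([0.02 * w[0], 2.0 * w[1]])
    return g + np.random.normal(0.0, 0.03, size=2)

eta, beta1 = 0.99, 0.9
w = np.array([-9.0, 1.5])       # initialisation as in the toy example
m = np.zeros(2)                 # momentum / velocity term
for t in range(100):
    g = grad_L(w)
    m = beta1 * m + (1.0 - beta1) * g   # decoupled momentum parameterisation
    w = w - eta * m                     # weight update

print(w)  # w should end up close to the minimum at (0, 0)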
To see how the momentum term behaves, we can expand the exponential moving average of the gradients gt (with m0 = 0):
$$m_{t+1} = \beta_1 m_t + (1 - \beta_1)\, g_t$$
$$m_1 = \beta_1 m_0 + (1 - \beta_1)\, g_0 = (1 - \beta_1)\, g_0$$
$$m_2 = \beta_1 m_1 + (1 - \beta_1)\, g_1 = \beta_1 (1 - \beta_1)\, g_0 + (1 - \beta_1)\, g_1$$
$$m_3 = \beta_1 m_2 + (1 - \beta_1)\, g_2 = \beta_1^2 (1 - \beta_1)\, g_0 + \beta_1 (1 - \beta_1)\, g_1 + (1 - \beta_1)\, g_2$$
This shows that the contribution of gradients far in the past is much smaller than the contribution of gradients closer to timestep t, as the exponent in β1^(t−i−1) indicates. Although the contribution gets smaller and smaller for earlier iterations, it is always there.
Depending on how β1 is set, we either track the gradient more quickly (see Fig. 98 left) or more slowly (Fig. 98 right), sometimes not even reaching the original value of the gradient. We see that the moving average effectively dampens the behaviour of the function.
Figure 98: Exponential Moving Average An example that shows the dampening behaviour of the EMA on the momentum (left: β1 = 0.80, right: β1 = 0.95).
Figure 99: SGD with Nesterov Momentum An example that shows the improved dampening behaviour of
SGD with Nesterov Momentum.
ŵt+1 = wt + β1 mt
mt+1 = β1 mt − η ∇w LB (ŵt+1 )
wt+1 = wt + mt+1
Again, this expression can be rewritten such that the hyperparameters are decoupled:
ŵt+1 = wt − η β1 mt
mt+1 = β1 mt + (1 − β1 ) ∇w LB (ŵt+1 )
wt+1 = wt − η mt+1
The predicted parameters ŵt+1 estimate where we might end up by updating with the previous velocity mt ,
but without the current gradient. This alternative to the standard momentum increases the responsiveness of
momentum, because we take a look into the future, thus it leads to faster dampening as shown in Fig. 99.
6.2.8 RMSProp
An approach with the same motivation as momentum is RMSprop. We want to have an even distribution of gradients on each weight dimension, in contrast to standard SGD (see Fig. 95) with a very uneven gradient
distribution. The idea of RMSprop is to divide the learning rate by a moving average of squared gradients,
which means that we change the learning rate per parameter, i.e. w1 and w2 in Fig. 95.
The moving average of squared gradients or the running variance v is used in this approach to adjust the
Figure 100: RMSprop An example that shows the update behaviour of RMSprop (η = 0.10).
per-weight step size (e.g. the division in the w2 direction will be larger, the division in the w1 direction will be smaller).
6.2.9 Adam
Adam is the most used and de facto default optimizer due to its robustness, as it combines all previously mentioned ideas of Section 6.2.5 and Section 6.2.8. It is also possible to combine Adam with Nesterov momentum, although not many people use it. The benefits can be seen in Fig. 101 in contrast to Fig. 95.
Figure 101: Adam An example that shows the update behaviour of Adam.
Given the update equation we can clearly see the influences of Momentum and RMSprop as both the first-
moment velocity term m and the second-moment variance term v are used.
$$m_{t+1} = \beta_1 m_t + (1 - \beta_1)\, \nabla_w L_B(w_t)$$
$$v_{t+1} = \beta_2 v_t + (1 - \beta_2)\, \nabla_w L_B(w_t) \odot \nabla_w L_B(w_t)$$
$$\hat{m}_{t+1} = \frac{m_{t+1}}{1 - \beta_1^{t+1}} \qquad \hat{v}_{t+1} = \frac{v_{t+1}}{1 - \beta_2^{t+1}}$$
$$w_{t+1} = w_t - \eta\, \frac{\hat{m}_{t+1}}{\sqrt{\hat{v}_{t+1}} + \epsilon}$$
In addition, Adam uses a bias correction, namely m̂t+1 and v̂t+1 . But why does this help remove the bias?
6.2.10 Bias Correction
Let gt = ∇w L(wt ) denote the gradient of the stochastic objective L(wt ).
Let further m0 = 0. Then, the update mt+1 = β1 mt + (1 − β1 ) gt can be written as:
$$m_t = (1 - \beta_1) \sum_{i=0}^{t-1} \beta_1^{t-i-1}\, g_i$$
Because the expectation is a linear operation we can push it through into the sum. We can approximate the
expectation of previous time steps by E[gt ] and pull it out of the sum again. Also we switch the sum afterwards
from counting downwards to counting upwards:
$$\mathbb{E}[m_t] \approx \mathbb{E}[g_t] \cdot (1 - \beta_1) \sum_{i=0}^{t-1} \beta_1^{i}$$
The sum is now a geometric series for which a simple closed-form solution exists:
$$\mathbb{E}[m_t] \approx \mathbb{E}[g_t] \cdot (1 - \beta_1^t)$$
Here we can see that the term (1 − β1^t) is exactly the denominator of the bias correction term m̂.
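Putting the momentum term, the squared-gradient term and the bias correction together, a minimal NumPy sketch of one Adam step could look as follows (the default hyperparameters are the commonly used ones, not taken from the lecture):

import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the (1-based) iteration index."""
    m = beta1 * m + (1 - beta1) * g            # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g        # second moment (squared grads)
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# toy usage on L(w) = (0.1*w1)^2 + w2^2
w, m, v = np.array([-9.0, 1.5]), np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    g = np.array([0.02 * w[0], 2.0 * w[1]])
    w, m, v = adam_step(w, g, m, v, t)
print(w)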
Figure 102: Underfitting Model does not have enough capacity to decrease losses. The typical steep decline at the start is missing.
Figure 103: Not converged Model requires more iterations to converge. We have to run training for a longer time.
Grid search (use human intuition to define ranges): works well, but is very expensive, because we need to run a lot of full trainings.
Initialization
The initialization of the network's weights is very important, because depending on the initialization the activations might degrade layer after layer. Therefore we should consider using Xavier or He initialization to ensure that the activation distribution stays approximately Gaussian across all layers.
Batch Normalization
The Batch Normalization layer normalizes each channel individually by its mean and variance over the batch, as shown in the following equations with batch index b and channel index c.
$$\mu_c = \frac{1}{B}\sum_{b=1}^{B} x_{b,c}$$
$$\sigma_c^2 = \frac{1}{B}\sum_{b=1}^{B} (x_{b,c} - \mu_c)^2$$
$$\hat{x}_{b,c} = \frac{x_{b,c} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}$$
$$y_{b,c} = \gamma_c\, \hat{x}_{b,c} + \beta_c$$
Similar to Momentum and RMSprop, we want to make sure that any bias is removed before applying the subsequent operations. Because the normalization removes any bias (and scale), we add a learnable bias βc as well as a learnable scale parameter γc afterwards. During training, the batch normalization layers, which are placed before the activation function, keep a running average of mean and variance; at test time this running estimate is used in place of the batch statistics.
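A minimal sketch of the training-time batch normalization computation for a [B, C]-shaped input (the channel count and the test data are made up):

import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalise each channel over the batch dimension (x has shape [B, C])."""
    mu = x.mean(axis=0)                      # per-channel mean
    var = x.var(axis=0)                      # per-channel variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta              # learnable scale and shift

x = np.random.randn(32, 8) * 3.0 + 5.0       # batch of 32 samples, 8 channels
y = batch_norm_train(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per channel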
Of course there are other forms of normalization that we can add to our DNN:
Residual Networks
Especially deep networks have the problem that gradients are propagated very slowly and, once they arrive at the early layers, are very small (vanishing gradient problem). The observation is that deeper networks often perform worse than more shallow ones, which is counter-intuitive: if we take a shallow network and add identity layers/transformations, we get the same performance as the shallow network, so in theory the deep model should be at least as good in terms of accuracy as the shallow model, which is not the case in practice. Therefore residual networks have been proposed, which learn the residual mapping of a few layers (e.g., 2) by introducing a skip connection across these layers (see Fig. 110), which helps the gradient flow. These layers now only have to learn a delta from their input to their output.
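A minimal sketch of a residual block with a skip connection, using plain NumPy and fully connected "layers" for simplicity (a real ResNet block uses convolutions and batch normalization):

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(h, W1, W2):
    """Two layers with a skip connection: the block only learns a delta."""
    delta = W2 @ relu(W1 @ h)   # residual mapping F(h)
    return relu(h + delta)      # skip connection: output = h + F(h)

d = 16
h = np.random.randn(d)
W1, W2 = 0.01 * np.random.randn(d, d), 0.01 * np.random.randn(d, d)
print(residual_block(h, W1, W2).shape)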
Pretraining
Pretraining is most useful when we don't have a large enough dataset to train a full network. We first pretrain the backbone of the DNN (e.g. the convolution layers) on another task for which a large labeled dataset is available (e.g. ImageNet). Then we finetune the last layers on our target task/dataset, which should yield a significant improvement.
Self Supervision
Similar to pretraining, we want to pretrain the backbone on a task for which supervision is generated from the
data itself (e.g. denoising, inpainting, contrastive learning).
Curriculum Learning
Curriculum Learning is the process of starting training on an easy dataset and then successively increasing the difficulty for the network. We want the network to work its way step by step towards the original full dataset, such that very difficult samples don't take the training off course.
6.4 Debugging Strategies
In deep learning many things can go wrong. Therefore we will discuss some common failure/error cases and how to solve them in the following section about debugging strategies.
• Start with a simple model that is known to work for this type of data (for example, VGG
for images). Use a standard loss if possible. Make sure that we have something working to build
up on.
• Turn off all bells and whistles, e.g. regularization and data augmentation. We want to make
sure that we simplify complex systems to make it easier to debug. Also reduce the size of the dataset to
increase debug cycles.
• If finetuning a model, double check the preprocessing; it should be the same as for the original model's training.
• Verify that the input data is correct. It is very easy to get the input wrong, try to visualize to
double-check if it is correct.
• Start with a really small dataset (1–10 samples).
Overfit on it and gradually add more data.
• Start gradually adding back all the pieces that were omitted: augmentation/regularization,
custom loss functions, try more complex models.
Verify number of training examples
If you are training a net from scratch (i.e. not finetuning), you probably need lots of data. For image classification, people say you need around 1000 images per class or more. If this is not possible, try to use more data augmentation or pretraining to reduce the number of samples needed.
Test any custom layers
Did you implement any of the layers in the network yourself? Check and double-check to make sure they are
working as intended. Make sure the output has the right format (e.g., did you pass a softmax to a loss that
expects raw logits?)
Reduce regularization
Too much regularization can cause the network to underfit badly. Reduce regularization such as dropout, batch
norm, weight/bias L2 regularization, etc.
Give it time
Maybe your network needs more time to train before it starts making meaningful predictions. If your loss is
steadily decreasing, let it train some more.
• Monitor the activations, weights, and updates of each layer. Make sure their magnitudes match. For example, the magnitude of the updates to the parameters (weights and biases) should be on the order of 1e-3 relative to the parameter magnitudes.
• Consider a visualization library like Tensorboard and Crayon. In a pinch, you can also print weights/bi-
ases/activations.
• Be on the lookout for layer activations with a mean much larger than 0. Try Batch Norm, ELUs or other
forms of activations.
• Weight histograms should have an approximately Gaussian (normal) distribution, after some time. For
biases, these histograms will generally start at 0, and will usually end up being approximately Gaussian.
Keep an eye out for parameters that are diverging or biases that become very large.
The choice of optimizer should not prevent your network from training, unless you have picked particularly bad hyperparameters. However, the proper optimizer for a task can be helpful in getting the most training in the shortest amount of time.
Overcoming NaNs
• Decrease the learning rate, especially if you are getting NaNs in the first 100 iterations.
• NaNs can arise from division by zero or from taking the natural log of zero or of a negative number.
• Try evaluating your network layer by layer and see where the NaNs appear.
7.1 Prerequisites
In the following, we use Einstein Notation. Capital letters denote tensor slices. One element of a matrix A
is denoted by A[i, j]. The i’th row of matrix A is accessed with A[i, J] and the j’th column by A[I, j]. A full
matrix uses both capital letters such that A[I, J] denotes A. The same holds for higher-order tensors, H[i, j, k]
denotes one element of the tensor H. Repeated capital letters in a product denote summation over those letters.
The Einstein Notation of a matrix product y = Ax is given by
$$y = Ax \equiv y[i] = \sum_j A[i, j]\, x[j] \equiv y[i] = A[i, J]\, x[J]$$
Capital J indicates the summation over the J'th dimension. The analog case for y = xᵀA can be written as a summation over the I'th dimension
$$y = x^\top A \equiv y[j] = \sum_i A[i, j]\, x[i] \equiv y[j] = A[I, j]\, x[I]$$
In contrast, the convolutional layer has fewer connections (Figure 111c). The weight matrix A defines a filter of a certain size. A feature in the output layer is computed by multiplying a subset of the input neurons with the weight matrix A. The subset of neurons that influence an output field lies in the corresponding input field's local neighborhood. The same filter operation is applied at every spatial location (weight sharing). A convolution filter has the size K × K, so the number of weights for a convolutional layer is K · K · C_in · C_out (plus C_out bias terms).
Usually, multiple convolution kernels are convolved with the input, each producing an output channel.
Figure 111: (a) spatial dimensions of two succeeding feature maps (b) a fully connected layer (c) convolutional
layer. Only three input, one output channel and a single convolution are shown for clarity.
$$\underbrace{H_i[b, x, y, c_{out}]}_{\text{Current Layer}} = g\Big(\underbrace{A_i[\Delta X, \Delta Y, C_{in}, c_{out}]}_{\text{Weights}}\;\underbrace{H_{i-1}[b, x + \Delta X, y + \Delta Y, C_{in}]}_{\text{Prev. Layer}} + \underbrace{b_i[c_{out}]}_{\text{Bias}}\Big)$$
where b is the batch index, x, y are the spatial locations and c_in, c_out the feature channels. The field x, y in hidden layer Hi is computed by applying the activation function g to the product of a subset of A with a subset of Hi−1, plus the output channel's bias term. The subset of A is a square, selected by ∆X, ∆Y, of a certain input channel Cin corresponding to a certain output channel cout. The subset of Hi−1 is a square spanned by the lengths ∆X, ∆Y, anchored at the coordinates x, y of the same channel Cin. This operation is done for every batch element b. A simple example using a filter size of two can be seen in Figure 112, where the resulting field in the output map Hi is computed by point-wise multiplying the fields in Ai with the selected fields in Hi−1, summing up, and adding a bias term.
Figure 112: A convolution of the kernel Ai with input map Hi−1 resulting in the output map Hi.
Technically, CNNs implement correlation and not convolution, which would be the case if the kernel was flipped
in the above operations. This misnomer does not matter as a flipped kernel does not decrease the expressiveness
of the CNN.
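A naive NumPy sketch of the single-channel operation described above (it slides the kernel over the input and, like CNN libraries, actually computes a correlation):

import numpy as np

def conv2d_single(H_prev, A, b=0.0):
    """Naive single-channel 'convolution' (really correlation, as in CNNs)."""
    K = A.shape[0]
    H_out = np.zeros((H_prev.shape[0] - K + 1, H_prev.shape[1] - K + 1))
    for x in range(H_out.shape[0]):
        for y in range(H_out.shape[1]):
            # weight sharing: the same kernel A is applied at every location
            H_out[x, y] = np.sum(A * H_prev[x:x + K, y:y + K]) + b
    return H_out

H_prev = np.arange(25, dtype=float).reshape(5, 5)
A = np.array([[1.0, 0.0], [0.0, -1.0]])   # a 2x2 kernel
print(conv2d_single(H_prev, A))           # output has spatial size 4x4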
7.2.2 Convolution Operator
The star operator (∗) denotes the convolution operator. The formula of convolving feature map H with filter
kernel A is given by
$$[A * H](x) = \sum_{\Delta x \in \mathbb{Z}^2} A(\Delta x)\, H(x + \Delta x)$$
where A convolved with H is evaluated at a particular position x. Moreover, the convolution is translation equivariant. An operation f(·) is invariant to a transformation Tθ if the function's output is the same for any transformed input Tθ[H]
f (H) = f (Tθ [H])
Whereas an operation f (·) is equivariant if its output transforms as its input for some specific transformation
type
Tθ [f ](H) = f (Tθ [H])
CNNs are translation equivariant but not translation invariant. The CNN is equivariant because shifting the
input results in a shifted output. Put in another way, by transforming the input, the feature maps are also
transformed. If e.g., an image is shifted by a certain amount of pixels, the output feature maps are also shifted
in the same way. The proof for the convolution operations translation equivariance is given by
$$\begin{aligned}
[A * T_t[H]](x) &= \sum_{\Delta x \in \mathbb{Z}^2} A(\Delta x)\, T_t[H](x + \Delta x) && \text{definition of convolution} \\
&= \sum_{\Delta x \in \mathbb{Z}^2} A(\Delta x)\, H(x + \Delta x - t) && \text{expanding translation operator} \\
&= \sum_{\Delta x \in \mathbb{Z}^2} A(\Delta x)\, H((x - t) + \Delta x) && \text{rearranging} \\
&= [A * H](x - t) = T_t[A * H](x) && \text{definition of convolution}
\end{aligned}$$
The ConvLayer can be implemented as a computation graph. Figure 113 shows the convolutional operation's dependencies between the input tensor (h11, h12, h13), the output tensor (h21, h22) and the kernel (w1, w2) that is swept along the input. The gradients get accumulated across locations due to the kernel's weight sharing.
Figure 113: Computation graph of a convolutional layer. The arrows show dependencies in the summation.
The weight w1 appears in the summation for h21 and h22 . Therefore w1 receives more gradients/information in
back-propagation than it would have in a fully connected layer.
7.2.3 Padding
Applying a convolution kernel to all fields of the input map (Figure 114a) decreases the succeeding feature maps' size (Figure 114b), since convolutions can only be executed where the kernel lies entirely within the input domain. This decrease in size is undesirable as it couples architecture and input size. Padding adds a boundary of appropriate size filled with zeros around the input tensor (Figure 114d), which also answers the question of which values should be used for the bordering fields (Figure 114c). The convolution with a padded tensor H'_{i-1} is given by
$$H_i[b, x, y, c_{out}] = g\Big(\underbrace{A_i[\Delta X, \Delta Y, C_{in}, c_{out}]}_{\text{Weights}}\;\underbrace{H'_{i-1}[b, x + \Delta X, y + \Delta Y, C_{in}]}_{\text{Prev. Layer}} + \underbrace{b_i[c_{out}]}_{\text{Bias}}\Big)$$
(a) (b) (c) (d)
Figure 114: (a) A convolution, (b) the resulting receptive field. (c) Fields with question marks indicate that it is unclear which values the kernel should be convolved with. (d) Padding: the input tensor is extended to the appropriate size with zeros (blue) in the rightmost image.
7.3 Downsampling
Convolution operations can preserve the resolution of the input; therefore extra downsampling operations are used to reduce the spatial resolution. Downsampling is necessary when the network's input dimension is larger than the output dimension, as for example in CNNs that reduce images to a few labels. Furthermore, reducing the spatial resolution increases the receptive field of neurons located deep in the network.
7.3.1 Pooling
Reducing the spatial dimensions can be achieved by a pooling operation. Pooling requires no parameters and typically uses a max, min, or mean operation. Max-pooling with stride s can, for example, be written as
$$H_i[b, x, y, c] = \max_{\Delta X, \Delta Y} H_{i-1}[b, s \cdot x + \Delta X, s \cdot y + \Delta Y, c]$$
Figure 115: (a,b) A pooling operation with stride s = 2 for Hi ’s first two pixels. (c) Two consecutive feature
maps Hi−1 and Hi after applying a max-pooling operation to Hi−1 .
Alternatively, downsampling can be built into the convolution itself by striding the kernel with step size s (strided convolution):
$$\underbrace{H_i[b, x, y, c_{out}]}_{\text{Current Layer}} = g\Big(\underbrace{A_i[\Delta X, \Delta Y, C_{in}, c_{out}]}_{\text{Weights}}\;\underbrace{H_{i-1}[b, s \cdot x + \Delta X, s \cdot y + \Delta Y, C_{in}]}_{\text{Prev. Layer}} + \underbrace{b_i[c_{out}]}_{\text{Bias}}\Big)$$
Given an input of spatial size W_in × H_in, padding P, kernel size K and a stride of s, we can calculate the output tensor's spatial dimensions with
$$\underbrace{\left\lfloor \frac{W_{in} + 2P - K}{s} \right\rfloor + 1}_{W_{out}} \;\times\; \underbrace{\left\lfloor \frac{H_{in} + 2P - K}{s} \right\rfloor + 1}_{H_{out}}$$
The equation for W_out works analogously to the one for H_out. The padding P is applied twice (both left and right). The kernel length is subtracted because the total width minus the kernel length is the maximal index at which the kernel can still be positioned. The term W_in + 2P − K is divided by the step size and rounded down. An example with parameters set can be seen in Figure 116a.
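A small helper implementing the output-size formula above (the example values are arbitrary and only illustrate the computation):

import math

def conv_output_size(w_in, h_in, k, p, s):
    """Spatial output size of a (strided) convolution or pooling layer."""
    w_out = math.floor((w_in + 2 * p - k) / s) + 1
    h_out = math.floor((h_in + 2 * p - k) / s) + 1
    return w_out, h_out

# e.g. a 224x224 input, 3x3 kernel, 1 pixel zero padding and stride 2
print(conv_output_size(224, 224, k=3, p=1, s=2))   # -> (112, 112)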
(a) (b)
Figure 116: (a) Strided convolution for a 5 × 5 image, a 3 × 3 kernel, 1 pixel zero padding and a stride of 2, resulting in a 2 × 2 output image. (b) The receptive field for the feature map H2
(a) (b)
Figure 117: (a) VGG16 (b) The reshape operation, transforming Hi−1 [B, X, Y, C] into Hi [B, C]
7.4 Upsampling
If pixel-level outputs are desired, it is necessary to upsample the features again. Downsampling is still required
to provide good features with large receptive fields in the intermediate layers. Upsampling yields outputs at
the same resolution as the input.
7.4.1 Nearest neighbor
Each channel is scaled using nearest-neighbor interpolation, as shown in Figure 118a.
7.4.2 Bilinear
Each channel is scaled using bilinear interpolation.
7.4.4 Max-Unpooling
For unpooling, the indices of the maximum elements in the earlier corresponding pooling operation are stored.
When applying the unpooling operation, the values are set at the locations from the previous pooling operation.
As a result, corresponding pairs of downsampling and upsampling layers are required. An example can be seen
in Figure 118c. The max-unpooling approach has been used in SegNet.
Figure 118: (a) nearest neighbor upsampling, (b) bed of nails upsampling, (c) max-unpooling operation with the corresponding downsampling counterpart.
Dilated convolutions enlarge the receptive field by sampling the input with a dilation factor d:
$$\underbrace{H_i[b, x, y, c_{out}]}_{\text{Current Layer}} = g\Big(\underbrace{A_i[\Delta X, \Delta Y, C_{in}, c_{out}]}_{\text{Weights}}\;\underbrace{H_{i-1}[b, x + d \cdot \Delta X, y + d \cdot \Delta Y, C_{in}]}_{\text{Prev. Layer}} + \underbrace{b_i[c_{out}]}_{\text{Bias}}\Big)$$
Dilated convolutions are used in, e.g., semantic segmentation, depth, optical flow.
(a) (b)
Figure 119: (a) Dilated convolution applied with a dilation factor d = 2. (b) The receptive field's exponential expansion without loss of resolution or coverage.
7.5 Architectures
In the following, we look at standard CNN architectures and paradigms they introduced.
7.5.1 LeNet-5
In 1998 the LeNet-5 was introduced, using two convolution layers (5 × 5) and two pooling layers (2 × 2) followed by two fully connected layers. It achieved state-of-the-art accuracy on MNIST (before ImageNet). The architecture is shown in figure 120a.
7.5.2 AlexNet
The deep learning revolution started in 2012 with the introduction of the AlexNet architecture. AlexNet consists
of eight successive layers. It used ReLUs, dropout, data augmentation and was trained on 2 GTX 580 GPUs.
The number of feature channels increases with depth, while the spatial resolution decreases. AlexNet showed
that CNNs work well in practice. The architecture is displayed in figure 120b.
7.5.3 VGG
In 2015 the VGG architecture was introduced. The novelty was using 3 × 3 convolutions everywhere, which maintains the same expressiveness as larger kernels while using fewer parameters. The paper backed the paradigm that using small-sized kernels is better. A second variant has 19 instead of only 16 layers. The architecture is shown in figure 117a.
7.5.4 Inception
In 2015 the inception architecture was introduced. Inception consists of 22 layers. The modules utilize conv/pool
operations with varying filter size. Backpropagating an error through 22 layers turns out to be very hard. To
mitigate a vanishing gradient, multiple intermediate classification heads improve gradient flow. The use of 1 × 1
convolutions reduces the number of features and leads to higher efficiency.
7.5.5 ResNet
ResNet was introduced in 2016. Residual connections allow for training deeper networks (up to 152 layers).
The network uses a very simple and regular structure with 3 × 3 convolutions and strided convolutions for
downsampling. ResNet and ResNet-like architectures are dominant today in computer vision.
7.5.6 U-Net
U-Net was introduced in 2015 and produces image segmentations by combining max-pooling as the downscaling and up-convolution as the upscaling operation. U-Net is the de facto standard for many tasks with image output (e.g., depth, segmentation).
7.6 Visualization
Visualization provides researchers with methods to open the black box of neural network classifiers. In the
following section, a few visualization techniques are presented.
(a) (b)
Figure 122: The right image shows weight matrix visualizations of a linear classifier. The left image shows the first layer's learned filters of different deep learning architectures.
Dimensionality Reduction
Using dimensionality reduction techniques such as t-SNE or PCA, similar items can be identified or grouped. Figure 123b shows a low-dimensional embedding of the FC7 features for the MNIST dataset, where similar numbers cluster in the same regions.
(a) (b)
Figure 123: (a) Test images and their k-nearest neighbors, computed using the visual embedding of the layer preceding the fully connected classification layer. (b) A lower-dimensional embedding of the high-dimensional visual embedding for the MNIST dataset.
7.6.3 Saliency
Saliency determines the relevant parts of an image for a particular classification decision and can uncover
unwanted biases in data or the model.
Saliency via masking
An image of a particular class is passed through the network multiple times. Each time a different image region
is masked, and the classification response is recorded. A saliency map can be constructed from these responses.
A pixel in the saliency map defines how sensitive the classifier is to an occlusion at that pixel w.r.t. the class. An example is shown in figure 124a.
(a) (b)
Figure 124: (a) Saliency via masking w.r.t. the classification of an elephant. The saliency map reveals that the elephant's head was most important for the classification decision. (b) A saliency map generated w.r.t. gradients.
7.6.4 Deconvolution
The idea of deconvolution is to run the CNN in reverse order starting from one neuron. Figure 125a shows the deconvolution of two layers: the first layer displays simple edge detectors, the second layer shows contours or image parts, and the last layers respond to entire objects.
8 Sequence Models
In real-life, our input and output might not always fit into a tensor of a fixed size. It might expand or shrink
continuously across time. Speech, for example, might consist of a single word or a whole sentence. However,
when we start speaking the model has no way of knowing whether our sentence might be 3 words or 30 words
long. Sequence models have been developed to handle such inputs and outputs. The most popular type of sequence model is the recurrent network, which will be explained in Section 8.1. Next, we will discuss different
applications of such models to demonstrate their flexibility before moving on to discuss an improved version
of the classical recurrent networks, i.e. gated recurrent networks. In the last part of this unit, auto-regressive
models will be discussed.
A little remark on how computation graphs will be displayed in this section: for better clarity, the weight input nodes will be considered implicit for all hidden and output layer nodes. Furthermore, output nodes will be coloured in blue from now on.
Figure 126: Feed-forward neural network vs. Recurrent neural network computation graphs.
In RNNs, computation nodes like h are called cells. RNN cells like the one in Fig. 127 receive the hidden state of the previous time-step ht−1 and the input xt, and output the tensors ht and ŷt. After concatenating the two input tensors, some transformations are applied to the resulting tensor to compute ht. The actual prediction ŷt, however, requires one final affine transformation of the previously computed ht.
Figure 127: An RNN cell: the input and the previous hidden state are concatenated, transformed inside the cell, and read out as the output.
$$h_t = f_h(h_{t-1}, x_t) \qquad \hat{y}_t = f_y(h_t) \qquad (26)$$
This general formula does not specify the form of the output mappings. Furthermore, neither fh nor fy
change over time, unlike in layers of feed-forward networks.
The single layer RNN consisting of a cell like the one in Fig. 127, makes this more concrete:
The hidden state ht of this single-layer (vanilla) RNN is a linear combination of the input xt and the previous hidden state ht−1, passed through a tanh. The output is a linear prediction based on the current hidden state ht. The tanh(·) activation function used here is the standard for most RNNs. It returns data in the range [−1, 1] and comes with all the benefits (e.g., zero-centred) and disadvantages (e.g., saturation and vanishing gradients) discussed before (cf. Section 4.2.2 on the tanh function). As mentioned before, the weight parameters Ah, Ax, Ay, and b are constant over time, although the number of time steps might vary due to differences in sequence length.
Formulated in Einstein-Notation the RNN equation for a single layer RNN will look like this:
Ht [b, cout ] = tanh(Ah [cout , Cin ]Ht−1 [b, Cin ] + Ax [cout , Cin ]Xt + b[cout ])
Ht−1 [b, Cin ]
= tanh Ah [cout , Cin ] + b[cout ] (28)
Xt [b, Cin ]
Ŷt [b, cout ] = Ay [cout , Cin ]Ht [b, Cin ]
Ht−1 [b, Cin ]
By stacking the input tensors are concatenated to . The concatenation operation of the two
Xt [b, Cin ]
inputs is represented by the c node in Fig. 127.
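A minimal NumPy sketch of the single-layer RNN step in Eq. (28), written with explicit weight matrices instead of the concatenated form (sizes and initialisation are made up):

import numpy as np

def rnn_step(h_prev, x_t, A_h, A_x, A_y, b):
    """One step of a single-layer (vanilla) RNN, cf. Eq. (28)."""
    h_t = np.tanh(A_h @ h_prev + A_x @ x_t + b)   # new hidden state
    y_t = A_y @ h_t                               # linear read-out
    return h_t, y_t

C_h, C_x, C_y = 16, 8, 4
rng = np.random.default_rng(0)
A_h = rng.normal(scale=0.1, size=(C_h, C_h))
A_x = rng.normal(scale=0.1, size=(C_h, C_x))
A_y = rng.normal(scale=0.1, size=(C_y, C_h))
b = np.zeros(C_h)

h = np.zeros(C_h)                                 # initial hidden state
for x_t in rng.normal(size=(5, C_x)):             # a sequence of 5 inputs
    h, y = rnn_step(h, x_t, A_h, A_x, A_y, b)     # same weights at every step
print(y.shape)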
Since RNNs allow for processing of inputs and outputs of variable length, they can be categorised into four types based on their input-output mappings:
Backpropagation. RNNs return an output at multiple time-steps, with the number of time-steps not being fixed. At the same time, each output generates a loss. Thus, each of the losses at a time-step needs to be taken into account when back-propagating, i.e. gradients are back-propagated through time.
The gradient derived from each loss at a time-step is back-propagated to the previous time-step. There, it is added to the gradient of that time-step, and so on. This process is repeated until the initial time-step is reached, where the weight parameters are updated. All of the hidden RNN cells share their parameters; hence the gradients are accumulated. However, the gradients become intractable very quickly (in terms of memory capacity) for long sequences such as Wikipedia articles, because all previously calculated time-steps have to be stored along the way.
A practical solution is truncated backpropagation through time. While hidden states continue to be carried forward forever, backpropagation is stopped before reaching the first time-step. In this way, only a limited number of time-steps is back-propagated. For example, given t = 9 time-steps, backpropagation could be split into three chunks, i.e., time-steps 1 to 3, 4 to 6, and 7 to 9. This truncation eliminates long-term dependencies as their gradients are no longer dependent on each other. Unfortunately, it will also cause the RNN to lose its memory. Thus, the truncation length cannot simply be set to a random value. Instead, it has to be chosen such that the necessary context is preserved. In general, the truncation length ranges between 50 and 60 time-steps.
Although it is not uncommon to find RNNs consisting of only a single layer, they can be built with multiple
layers. RNNs are often kept very shallow in practice, i.e., only a couple of layers deep. Multi-layer RNNs are
constructed by adding a second cell/hidden layer after the first one, like in typical feed-forward architectures.
As both of the hidden layers in the RNN feedback to themselves, a computation graph would look like the
diagram in Fig. 131 if unrolled.
Alternatively, the cell itself could be made deeper. In other words, instead of having a single affine transformation and activation function fh inside a cell, you can stack multiple layers inside of it. Such an approach is often combined with residual connections in the vertical direction, i.e., between the layers.
a) b)
c)
Figure 132: (a) Sequence recognition in multiple object recognition, (b) recurrent instance segmentation, (c)
object tracking in video frames (right).
a) b)
Figure 133: Examples of sequential image generation. a) Generation of handwritten digits; b) Generation of
images based on partially occluded inputs.
In Fig. 133b), the partially occluded input is shown on the left and the original image on the right. All pictures in between are completions generated by the model. This prediction is based on the pixel intensities in the non-occluded part of the input. Some of the resulting images are more plausible than others.
The wide range of application possibilities also includes image annotation. The Polygon-RNN is an interactive object annotation tool that iteratively annotates the outline of a 2D object instance (e.g., a car) in an image with polygons (cf. Fig. 134a)). The annotator can manually correct the suggested outlines. Such tools are beneficial when creating an annotated data set to train an object detection model, as they significantly reduce the amount of time spent on annotation.
a) b)
Figure 134: Examples of object annotation (left) and road layout modeling (right)
Similar to the inference of object outlines in the image annotation, RNNs can model or generate road layouts.
The resulting spatial graphs are based on an image input. An example of road layout generation and aerial
road parsing can be seen in (cf. Fig. 134b)).
With the help of RNNs, we are even able to transform an image into a sentence that describes the picture content (cf. Fig. 135). For this task, an RNN is trained to sequentially generate words based on a condensed feature representation of the input image, which was previously computed by a CNN. In Fig. 135 the attention shifts from one region in the image to another. The figure also demonstrates how the attended regions correspond to the words that are generated.
Although this is quite impressive and it works (cf. Fig. 135c)), there is a lot that can go wrong (cf. Fig. 135d)). For instance, the second picture from the left in the first row of Fig. 135d) indicates that the clock-like shape identified by the model is, in fact, printing on the woman's hoodie. Thus, the errors made by such a
a) b)
c) d)
Figure 135: Examples of image captioning. a) general process flow; b) Shifted attention of the RNN (top: soft
attention, bottom: hard attention); c) Successful image captioning examples; d) Unsuccessful image captioning
examples.
model can be partially attributed to the regions with high attention. Analyzing these attention patterns allows improving the model's performance.
A more recently proposed task for the usage of RNNs is visual question answering. The model receives an
image and a question (and multiple possible answers) as input and predicts the correct answer. For instance,
given the image of a Teddy Bear and the question ”What kind of stuffed animal is shown?” the model should
answer: ”Teddy Bear”.
Figure 136: Google’s neural machine translation system: An Encoder-Decoder model for machine translation.
One of the more common non-image related fields is neural machine translation. When translating from one language to another, a many-to-many mapping is required, as it is not possible to translate sentences word by word. For example, a German idiom like "Ich verstehe nur Bahnhof" would be literally translated to "I only understand train station." However, this is not what the model should return; a more appropriate translation would be something like "I don't understand what you just said." For this reason, translation models use an encoder that builds a meaning representation of the original sentence and a decoder that constructs a sentence with the corresponding meaning in the target language (cf. Fig. 136).
RNNs for language generation. In his blog post from 2015, Andrej Karpathy demonstrated the effectiveness of simple character-level language models that generate natural language character by character. To be able to generate the word "hello", an alphabet containing four characters, "h", "e", "l" and "o", is needed. Each character is represented by a one-hot vector, e.g., the letter "h" is represented by the vector (1, 0, 0, 0)ᵀ. If fed with the "h"-vector, the model predicts a distribution over the next character via a Softmax function. The character drawn from the distribution is then in turn fed as input to the RNN at the next time-step. However, this little toy example would have to be extended to work on more than just the simple word "hello". Thus, a three-layer RNN with 512 hidden nodes was trained on William Shakespeare's works, which consist of 4.4 million characters in total. Although the meaning and grammar of the produced sentences are debatable,
this simple model makes surprisingly few orthographic errors, as can be seen from Fig. 137a). It can also
understand that there are some underlying structures in natural language such as spacing, punctuation, main
and sub-clauses, and dialogue structure without being told that such things exist.
A closer look at the training process sheds light on what the model learns and at which stage it learns what.
It starts by detecting that spaces separate words. At the next stage, the model starts to get the hang of some principles of sentence structuring. It also catches on to some words like "I", "here", "in", "at", "on". Then the model constantly improves and fine-tunes itself until it produces actual words, even though the sentences remain nonsensical.
PANDARUS
Alas, I think he shall be come approached and the da
When little srain would be attain'd into being never fed
And who is but a chain and subjects of his death
I should not sleep
Second Senator
They are away this miseries, produced upon my soul
Breaking and strongly should be buried, when I peris
The earth and thoughts of many states
DUKE VINCENTIO
Well, your wit is in the care of side and that
Second Lord
They would be ruled after this chamber, an
my fair nues begun out of the fact, to be conveyed
Whose noble souls I'll have the heart of the wars.
Rudi Levette Berice Lussa Hany Mareanne Chrestina Carissy
Marylen Hammine Janye Marlise Jacacrie Hendred Romand
Charienna Nenotto Ette Dorane Wallen Marly Darine Salina Elvyn
Ersia Maralena Minoria Ellia Charmin Antley Nerille Chelon
Walmor Evena Jeryly Stachon Charisa Allisa Anatha Cathanie
Geetra Alexie Jerin Cassen Herbett Cossie Velen Daurenge
Robester Shermond Terisa Licia Roselen Ferine Jayn Lusine
Charyanne Sales Sanny Resa Wallon Martine Merus Jelen Candica
Wallin Tel Rachene Tarine Ozila Ketia Shanne Arnande Karella
Roselina Alessia Chasty Deland Berther Geamar Jackein
Mellisand Sagdy Nenc Lessie Rasemy Guen Gavi Milea Anneda
Margoris Janin Rodelin Zeanna Elyne Janah Ferzina Susta Pey
Castina
Figure 137: Examples of text generation with character-level language model. a) Output from model trained
on Shakespeare text; b) Output from model trained on Latex source code file for mathematics book; c) Output
from model trained on programming source code; d) Output from model trained on baby names
Moving to a more practical application of such a model, it is also possible to train it to create output in LaTeX format, source code, or baby names (Fig. 137b) - d)).
When looking at what the single neurons in such a model have learned, one can find neurons that detect a line's end, or the beginning and end of a quote. In the source code example, there are neurons trained to detect the structure of if-statements. However, there are many more cells whose function is not so easily explained. Overall, the behavior of only ∼5% of the hidden neurons is logical and can be interpreted by a human.
In a vanilla RNN, the hidden-state gradient takes the form
$$\frac{\partial h_t}{\partial h_{t-k}} = \frac{\partial h_t}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdots \frac{\partial h_{t-k+1}}{\partial h_{t-k}} = \prod_{i=t-k+1}^{t} \tanh'_i\; a_h^k \qquad (29)$$
during backpropagation where k refers to the number of earlier time-steps. Thus, the gradient vanishes as soon
as tanh(·) saturates. Hence, the initialization of the network needs to be chosen carefully to avoid saturation.
Unfortunately, the gradient might still not behave as it should. Assuming the weights have been initialised properly, it is possible that the activation functions do not saturate at all. The hidden state ht = tanh(ah ht−1 + ax xt + b) would then be approximately ah ht−1 + ax xt + b ∈ [−1, 1]. If the gradient for the hidden state ht is calculated, i.e. ∂ht/∂ht−1, it would be approximately ah. A backpropagation over k previous hidden states would then return a gradient
$$\frac{\partial h_t}{\partial h_{t-k}} = \frac{\partial h_t}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial h_{t-2}} \cdots \frac{\partial h_{t-k+1}}{\partial h_{t-k}} = a_h^k \qquad (30)$$
where the weight is raised to the power of the k previous time-steps. This becomes a problem if ah > 1. In this case the gradients will explode, i.e. become very large and lead to divergence. Consider, for example, ah = 1.1 and k = 100. The resulting gradient would be ∂ht/∂ht−k = a_h^k ≈ 13781.
Gradient clipping is often applied to prevent exploding gradients in RNNs. First, think about what happens in case ah < 1. Given ah = 0.9 and k = 100, the gradient becomes very small, ∂ht/∂ht−k = a_h^k ≈ 0.0000266.
Such vanishing gradients will prevent the model from learning at earlier time-steps. This issue can not be solved
by simply clipping the gradient. Instead, the architecture of the model needs to be changed. Unfortunately,
introducing residual connections will not work as the parameters are shared across time. On top of that, the
input and desired output at each time-step are different.
The same difficulties arise when extending the RNN to vector- or tensor-valued hidden states.
Let Ah = QΛQ⁻¹ be the eigendecomposition of the square matrix Ah. Thus, the gradient involves A_h^k = (QΛQ⁻¹)^k = QΛ^k Q⁻¹ with a diagonal eigenvalue matrix Λ. Similar to the 1D hidden states before, the model will encounter exploding gradients for components with eigenvalues > 1 and vanishing gradients for components with eigenvalues < 1. Again, the weight matrix Ah is shared across time.
Now, gradient clipping can be used to dampen the effects of exploding gradients. Gradient clipping refers to a simple heuristic which clips the gradient to a fixed norm τ before applying the gradient update during SGD. If the norm of the gradient vector, i.e. ||A.grad||₂, is smaller than the threshold τ, nothing happens. Should the norm exceed τ, the gradient vector is divided by its norm and the result is multiplied by τ. In other words, the norm is first "reset" to 1 before being set to τ, i.e. the maximum gradient norm. Thus,
$$A.grad = \begin{cases} A.grad & \text{if } \|A.grad\|_2 \leq \tau \\ \tau\, \dfrac{A.grad}{\|A.grad\|_2} & \text{otherwise} \end{cases} \qquad (32)$$
The maximal gradient magnitude τ is a hyperparameter of the model and often ranges between [1, 10].
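A minimal sketch of the clipping rule in Eq. (32) (the threshold and the example gradient are arbitrary):

import numpy as np

def clip_gradient(grad, tau=5.0):
    """Clip the gradient norm to at most tau, cf. Eq. (32)."""
    norm = np.linalg.norm(grad)
    if norm <= tau:
        return grad
    return tau * grad / norm          # rescale so that the new norm equals tau

g = np.array([300.0, -400.0])         # an 'exploding' gradient with norm 500
print(clip_gradient(g, tau=5.0))      # -> [3., -4.], norm 5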
Avoiding vanishing gradients is more complicated than that. Here, gates come into play. There are several types of gates. The first, most complex, and most influential one, the Long Short-Term Memory (LSTM), was introduced back in 1997 by Hochreiter. There are two more recent and simpler types of gated RNNs, namely the Gated Recurrent Unit (GRU, by Cho 2014) and the Update Gate Recurrent Neural Network (UGRNN, by Collins 2017).
All three gate architectures use their gates to filter information. However, the number of gates and the way they work is different (σ in the cell architectures in Fig. 138).
Figure 138: Gated cell architectures: a) LSTM cell, b) UGRNN cell, c) GRU cell.
At the first gate a Sigmoid function is computed. In addition to the concatenated input matrix, it uses two
weight matrices Auh and Aux – the former is applied to the previous hidden state Ht−1 , the latter to the data
input Xt – and a bias bu , all of which are shared across time-steps. Thus,
Ut [b, cout ] = σ(Auh [cout , Cin ]Ht−1 [b, Cin ] + Aux [cout , Cin ]Xt [b, Cin ] + bu [cout ]) (33)
If the output of the Sigmoid function is 1, then the information from the previous time-step is passed on and
used to update the hidden state. If it is 0, the information accumulated in the previous time-step(s) is discarded.
The concatenated input matrix is also passed into the tanh activation function at the target gate. Like the
Sigmoid function, it also receives two weight matrices Ash and Asx for Ht−1 and Xt respectively and a bias bs .
Hence,
St [b, cout ] = tanh(Ash [cout , Cin ]Ht−1 [b, Cin ] + Asx [cout , Cin ]Xt [b, Cin ] + bs [cout ]) (34)
One last calculation is necessary to arrive at the output of the current state Ht . Using the output of the update
gate Ut and the target gate St , the following computation takes place
Ht [b, cout ] = Ut [b, Cout ]Ht−1 [b, Cout ] + (1 − Ut [b, Cout ])St [b, Cout ] (35)
The first term of the equation is an element-wise multiplication of the Sigmoid function’s output and the previous
hidden state. It controls how much information of the previous time step should be part of the current hidden
state output. In places where the Sigmoid is 0, the previous state’s information is erased and kept otherwise.
The second term decides how much information from the current state should be passed into the output. By
element-wise subtracting the Sigmoid function’s output from 1 before element-wise multiplying the result with
the target gate, the output information is kept where the memory is erased and deleted where the memory is
preserved. In other words, Ut is a linear weighting between the previous state and the new target state.
In the schematic cell, this equation is indicated by the four operations in circles that follow after calculating
the Sigmoid and Tanh function. The element-wise multiplication is indicated by ⊙. This symbol denotes the
Hadamard product: the first element in matrix Ut is multiplied with the first element in matrix Ht−1 , then
the second element in matrix Ut with the second element in matrix Ht−1 , and so on. We can drop it in the
Einstein notation of the equation above as it does not have to be stated explicitly. However, one can also make
it explicit by using standard vector notation and writing the function as

Ht [b, cout ] = Ut [b, Cout ] ⊙ Ht−1 [b, Cout ] + (1 − Ut [b, Cout ]) ⊙ St [b, Cout ]    (36)
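To make Eqs. (33)-(35) concrete, the following is a minimal NumPy sketch of a single UGRNN step for one sample; the weight shapes, the sigmoid helper and all variable names are our own illustrative choices, not part of the lecture:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ugrnn_step(h_prev, x, A_uh, A_ux, b_u, A_sh, A_sx, b_s):
    # Update gate, Eq. (33): decides how much of the previous state to keep.
    u = sigmoid(A_uh @ h_prev + A_ux @ x + b_u)
    # Target state, Eq. (34): candidate new hidden state.
    s = np.tanh(A_sh @ h_prev + A_sx @ x + b_s)
    # Eq. (35): element-wise (Hadamard) mixing of old state and target state.
    return u * h_prev + (1.0 - u) * s

C_in, C_out = 4, 3                   # hypothetical input / hidden sizes
rng = np.random.default_rng(0)
params = [rng.normal(size=s) for s in [(C_out, C_out), (C_out, C_in), (C_out,),
                                       (C_out, C_out), (C_out, C_in), (C_out,)]]
h = np.zeros(C_out)
for t in range(5):                   # unroll the cell over 5 time-steps
    h = ugrnn_step(h, rng.normal(size=C_in), *params)
print(h)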
Having seen how a gate works, the question remains what one gains from it. Let's compare
the backpropagation in a regular RNN with that of a gated RNN such as the UGRNN (cf. Fig. 139).
Figure 139: Backpropagation in a regular RNN cell (left) and a gated UGRNN cell (right).
Fig. 139 shows the backpropagation of a regular RNN on the left and a UGRNN on the right. As seen previously, when
discussing vanishing and exploding gradients in RNNs, the gradients can be approximated by the weight matrix
Ah which gets multiplied with itself according to the number of time-steps. In a gated RNN cell, as the UGRNN
referred to here, this is no longer the case. Instead, it can maintain gradient flow despite small Ah by setting
its gate to u ≈ 1. Put differently; the cell can determine how much of the gradient is passed backward (similar
to the forward pass). It can learn to ignore some of the information of a state vector for a very long time and
thus pass it backward for a very long time, too. In the equation this can be seen from ut : no matter how small
any of the other terms in the derivative h′t get, we always multiply the backward propagated gradient with a
value close to 1, i.e., the already computed gradient will just be passed through.
Thus, an UGRNN is able to keep the state of a variable over a long time horizon (u ≈ 1). Going back
to the character-level language models, such a model can, for instance, keep track of being inside a quote/if-
statement/etc. or not.
For completeness, the LSTM cell combines a forget gate ft , an input gate it and an output gate ot with an
additional cell state ct :

ft = σ(Wf h ht−1 + Wf x xt + bf )
it = σ(Wih ht−1 + Wix xt + bi )
ot = σ(Woh ht−1 + Wox xt + bo )
st = tanh(Wsh ht−1 + Wsx xt + bs )                                   (39)
ct = ft ⊙ ct−1 + it ⊙ st
ht = ot ⊙ tanh(ct )
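A minimal NumPy sketch of one LSTM step implementing Eq. (39); shapes and dictionary keys are our own choices, and real implementations usually fuse the four affine maps into a single matrix product:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, x, W, b):
    # W and b hold the parameters of the four affine maps in Eq. (39).
    f = sigmoid(W['fh'] @ h_prev + W['fx'] @ x + b['f'])   # forget gate
    i = sigmoid(W['ih'] @ h_prev + W['ix'] @ x + b['i'])   # input gate
    o = sigmoid(W['oh'] @ h_prev + W['ox'] @ x + b['o'])   # output gate
    s = np.tanh(W['sh'] @ h_prev + W['sx'] @ x + b['s'])   # candidate state
    c = f * c_prev + i * s            # cell state: forget old content, write new
    h = o * np.tanh(c)                # hidden state exposed to the next time-step
    return h, c

D, H = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(H, H if k.endswith('h') else D))
     for k in ['fh', 'fx', 'ih', 'ix', 'oh', 'ox', 'sh', 'sx']}
b = {k: np.zeros(H) for k in 'fios'}
h, c = lstm_step(np.zeros(H), np.zeros(H), rng.normal(size=D), W, b)
print(h, c)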
In their systematic study, Collins et al. 2017 [3] find that GRUs are the most learnable of the three types of
gated RNNs when it comes to shallow architectures, followed by the UGRNN. This finding supports the
observation that the simpler and more recently developed gated RNNs perform as well as (or even better than)
the more complex ones. Thus, GRU and UGRNN can be thought of as improved and simplified versions of the
original LSTM cell.
8.4 Auto-regressive Models
RNNs have long been believed to be the only models capable of processing sequential input. Recently, a class
of feed-forward neural networks has been found that is at least on par with RNNs. Such models are called
Auto-regressive Models. A k-th order auto-regressive model predicts the next variable xt in a time series
based on the k previous variables xt−1 , xt−2 , . . .. In the example in Fig. 140, one can see that x3 , x4 and x5 are
each connected to the previous two inputs, i.e. x5 to x3 and x4 , x4 to x2 and x3 , and x3 to x1 and x2 . Thus, k
would be 2.
Like RNNs, parameters are shared across time, i.e. same function f (·) at each t. What makes them different
from RNNs is that they make a strong conditional independence assumption, e.g. x5 is independent of x2
given x3 and x4 .
The simple concept of an autoregressive model can be extended to situations with varying or different input
and output lengths, for instance the single-layer model illustrated in Fig. 141. Here an output ŷt depends
on the k last inputs, i.e. {xi |t − k ≤ i ≤ t}, and is independent of everything that came before, i.e.
{xi |i < t − k}. Thus, ŷt = f (xt , xt−1 , . . . , xt−k ).
From this most basic form of an auto-regressive model, it becomes evident that the past information is not
summarised in a hidden state h. Thus, such a model does not have infinite memory. However, it is easier to
train because there is no need for backpropagation through time.
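As a small sketch, a k-th order auto-regressive predictor is just an ordinary feed-forward function applied to a sliding window of the last k values; the linear map below and its parameters are our own toy choices:

import numpy as np

def ar_predict(x, w, b, k=2):
    # Predict x_t from the k previous values x_{t-k}, ..., x_{t-1};
    # the parameters (w, b) are shared across all time-steps.
    preds = []
    for t in range(k, len(x)):
        window = x[t - k:t]          # only the last k inputs matter (conditional independence)
        preds.append(w @ window + b)
    return np.array(preds)

x = np.sin(np.linspace(0, 6, 50))
# w = [-1, 2] linearly extrapolates from the last two samples.
print(ar_predict(x, w=np.array([-1.0, 2.0]), b=0.0).shape)   # (48,)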
Like all other neural networks, auto-regressive models can be extended to deep models with multiple layers
by adding some hidden layers between the input and output layers (cf. Fig. 142).
Figure 142: Example of a multi-layer auto-regressive model
At each hidden layer, we consider a larger amount of past input information – similar to a growing receptive
field in ConvNets – due to the usage of dilated convolutions. The model architecture also contains residual
and skip connections.
The success of WaveNet was surprising as the generation of audio signals is considered hard as it needs 16k
samples per second and structure at multiple time scales. RNNs usually avoid raw audio wave-forms for
this reason. However, WaveNet was able to do this and outperform traditional LSTMs on speech synthesis, as
demonstrated for North American English and Mandarin Chinese by the authors.
Figure 143: a) Stack of dilated causal convolutions with dilation 1, 2, 4 and 8 between the input, hidden and
output layers (as used in WaveNet); b), c) the TCN architecture of Bai et al.
This idea has been developed into a simpler type of autoregressive models called Temporal Convolution
Networks (TCNs) by Bai et al. 2018 [1] (cf. Fig. 143 b)). These networks use zero-padding to handle
sequences of arbitrary length. Aside from the residual layers, the rest of the deep multi-layer architecture
is kept simple. Like the WaveNet, it also uses dilated convolutions to increase the receptive field size, also called
”context”. TCNs have been shown to perform better than LSTMs, GRUs, and RNNs on several different
sequence tasks.
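The effect of dilation on the receptive field can be illustrated with a few lines of NumPy; the layer sizes and weights below are our own toy choices and not the actual WaveNet/TCN configuration:

import numpy as np

def dilated_causal_conv(x, w, dilation):
    # Causal convolution with kernel size 2: the output at position t only
    # sees x[t] and x[t - dilation]. Left zero-padding keeps the length fixed.
    x_pad = np.concatenate([np.zeros(dilation), x])
    return w[0] * x_pad[:-dilation] + w[1] * x_pad[dilation:]

x = np.random.randn(32)
h = x
for d in [1, 2, 4, 8]:               # receptive field grows to 1 + 1 + 2 + 4 + 8 = 16 samples
    h = np.tanh(dilated_causal_conv(h, w=np.array([0.5, 0.5]), dilation=d))
print(h.shape)                        # (32,)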
How come these simpler models, which do not have a memory like an RNN, work so well? This
blogpost concludes that – at least in theory – a stable RNN can be approximated by a feed-forward network for
both inference as well as training. Based on current literature, it states that the strength of RNNs, namely their
"infinite memory", has no chance to unfold its power in practice. Furthermore, the unlimited context offered
by RNNs is not strictly necessary in some tasks (e.g., language modeling). Models that truncate at a sequence
length of 13 or 25 are often competitive with infinite memory models.
indicate the end of a sentence; therefore xT = <EOS>. This means that whenever we predict the next word or
token, we can either predict a word from the vocabulary or the end-of-sentence token.
p(x) = p(x1 , . . . , xT ) = ∏_{t=1}^{T} p(xt |x1 , . . . , xt−1 )
     = p(x1 ) p(x2 |x1 ) p(x3 |x1 , x2 ) . . .
We can model a probability distribution over a sequence x of discrete tokens (x1 , . . . , xT ). The joint distribution
over this sequence decomposes into a product of T conditional distributions of xt given all x with an index
smaller than t, which is a consequence of the product rule or chain rule of probability. This is not an
approximation: we can rewrite or decompose any joint distribution in such a form. By iteratively applying the
product rule, each token in a sequence depends only on the previous tokens in that sequence.
We have already seen in the lecture about recurrent networks and feed-forward models that there are models
that operate on the character level, but there are also language models that operate on the word level, as we
will see in this lecture. The difficulty of operating on the word level is that the vocabularies are much larger.
But it is also easier in the sense that long-term dependencies are easier to model at the word level, while at the
character level long-term dependencies are hard. At the character level, vocabularies are typically small, like 10,
20 or 30 characters, while at the word level typical vocabularies are of size 10,000, 20,000 or 30,000, depending on
the language. Therefore we have to predict distributions over a very large state space.
As we can see in the example above, the sentence ’The dog ran away <EOS>’ decomposes based on the product
rule into the conditional distributions p of ’The’, times p of ’dog’ given ’The’, times p of ’ran’ given ’The’ and
’dog’ and so forth. We can see that language models in general but also in particular the given word language
model are autoregressive models that predict the next token given all the previous tokens in this sentence or
sequence. If a model is good then it has a high probability of predicting likely next words.
9.1.1 Applications
Language Recognition:
One of the most straightforward applications of language models is language recognition. Assume we have
trained two language models p and p′ which assign probabilities to sentences.
Since they are different models they might assign different probabilities to the same sentence. Let's assume that
p has been trained on a large text corpus of English sentences and p′ on French sentences.
We can then determine which language a sentence is from by simply classifying according to:
Language(x) = { English   if p(x) > p′(x)
              { French    otherwise
Generative Model:
We can also use a language model to sample new sequences. Assume a language model over sentences x:
p(x) = p(x1 , . . . , xT )
And p(x) has already been trained. By using the decomposition into conditional distributions
we can efficiently sample new sentences from the model distribution. So we start by sampling x1 , the first
word, then we sample the second word conditioned on the first word, then we sample the third word conditioned
on the first two words, and so forth. By doing this we can sample from this model in linear time, where each
word is sampled based on all the previous words.
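A minimal sketch of this ancestral sampling loop; the model below is a toy stand-in function returning a distribution over the vocabulary plus <EOS>, and any trained language model could be plugged in instead:

import numpy as np

def sample_sentence(next_token_probs, vocab, eos="<EOS>", max_len=50, seed=0):
    rng = np.random.default_rng(seed)
    sentence = []
    for _ in range(max_len):
        p = next_token_probs(sentence)          # p(x_t | x_1, ..., x_{t-1})
        token = rng.choice(vocab + [eos], p=p)
        if token == eos:                        # stop once the end-of-sentence token is drawn
            break
        sentence.append(token)
    return sentence

vocab = ["the", "dog"]
# Toy stand-in model: uniform over the two words, 20% chance of ending the sentence.
print(sample_sentence(lambda prefix: np.array([0.4, 0.4, 0.2]), vocab))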
Machine Translation:
Another classical application of language models is machine translation. Assume we have a trained language
model p(x) and let's consider it as a prior over possible sentences x:

p(x) = p(x1 , . . . , xT )

Now assume a likelihood or proposal mechanism that tells us how likely sentence x translates to sentence y:

p(y|x) = p(y1 , . . . , yT ′ |x1 , . . . , xT )

which tells us, according to this model, how likely this translation is. This is how machine translation systems
have worked in the past. These were rule-based and very complicated systems that propose sentences and assign
likelihoods to the translated sentences given the input sentence. The model therefore assigns a probability to y
given x. The original sentence x and the translated sentence y might not have the same length; this is indicated
by the last index of x being T and that of y being T ′ . If we have p(x) and p(y|x) we can use Bayes' rule to infer
the posterior over translated sentences:
p(x|y) = p(y|x) p(x) / p(y)
We multiply these two, p(x) and p(y|x), together to get the posterior probability of x in the target language
given the sentence in the source language.
Modern machine translation systems don't work like this. In this example we are using a generative model to
infer a discriminative decision rule. In modern machine translation systems, the probability of x given y is directly
modeled by conditioning on the source sentence and modeling a distribution over the output sentence. We will
look at some examples of these types of models in the later units.
9.1.2 Training
So far we have considered just generic distributions but now to make it more precise we use pmodel (x|w) to
distinguish the model from the data distribution and to indicate that the model has some parameter w that we
want to train. So now let X = {x^i}_{i=1}^{N} denote a training set with sentences x^i = (x^i_1 , . . . , x^i_{Ti}).
We train the unconditional language model pmodel (x^i |w) via maximum likelihood:

ŵML = argmax_w ∏_{i=1}^{N} pmodel (x^i |w)
    = argmax_w ∑_{i=1}^{N} log pmodel (x^i |w)
    = argmin_w −E_{pdata} [log pmodel (x|w)]
This is a standard formulation that we have seen before, where the maximum likelihood estimate is the argmax
over the parameters of the product of the model distribution over the entire dataset, where N is the size of the
dataset. By applying the logarithm, which is a monotonic function, we can turn the product into a sum. This
in turn corresponds to a minimization problem of the negative expectation under pdata of the logarithm of
pmodel (x|w). So we minimize the cross entropy between the data and the model distribution. If we
manage to minimize this, then the model is as similar as possible to the data distribution.
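In code, the per-sentence negative log-likelihood is simply accumulated over the tokens via the chain-rule factorization. A minimal Python sketch with a toy stand-in for pmodel (the function names are our own):

import numpy as np

def nll(sentence, log_prob_next):
    # Negative log-likelihood of one sentence under the model,
    # summed token by token over the chain-rule factors.
    return -sum(log_prob_next(sentence[:t], sentence[t]) for t in range(len(sentence)))

# Toy stand-in: every token has model probability 0.1 regardless of its context.
loss = nll(["The", "dog", "ran", "away", "<EOS>"], lambda prefix, token: np.log(0.1))
print(loss)   # 5 * -log(0.1) = 11.51...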
9.1.3 Evaluation
We want to be able to measure the performance of such a trained language model. Character language
models typically measure performance in bits per character.
In order to do that we first need to introduce some basic quantities from information theory:
Shannon Information:
Given a character sequence x of length T with probability p(x), the so-called surprise, or Shannon information,
normalized by the sequence length T , is the normalized negative log-likelihood of x:

I(x) = −(1/T) log2 p(x) = −(1/T) ∑_{t=1}^{T} log2 p(xt |x1 , . . . , xt−1 )   [bits]
The factor 1/T is a normalizing constant with respect to the sequence length, making the quantity independent
of the sequence length. Since p(x) can be decomposed into this sequence of conditionals, we obtain the sum over
t from 1 to T of this decomposition. The unit is bits since we use base 2 for the logarithm.
Intuitively, consider a distribution over x. If we observe an x whose probability is low under that distribution,
then its negative logarithm is large, and therefore the surprise is large: if we observe something that is unlikely,
then the information or the surprise is large. Conversely, if we observe something that has a high probability of
occurring, then the surprise is low. We are not surprised to see it because it is likely to occur under our model.
Cross Entropy:
The expected surprise of the model under the data distribution (for sequences of length T ) is thus given by
the (normalized) cross entropy:
H(pdata , pmodel ) = E_{pdata} [ −(1/T) log2 pmodel (x) ]
Here we took the expression from the Shannon information and took the expectation over the data distribution,
i.e. we average over the entire dataset. In other words, this is the expected surprise given our trained model
pmodel . A model is better if the expected surprise is low: if the model models the data distribution well, then
the expected surprise is low, and better models have a smaller number of bits.
In practice, the expectation over pdata is dropped and the cross entropy is evaluated on a single long test
sequence, as each sequence occurs in proportion to its probability anyway if we consider long enough sequences
(think of an infinite text x generated from the data distribution). The model distribution factorizes into
conditional probabilities where a word depends on all the previous words, so with the logarithm one can think
of this as a large sum: we sum over the entire dataset of, for example, sentences, and then we sum over each
sentence. If we let the sequence or sentence length go to infinity, the expectation does not matter anymore
because each of the terms that are added together occurs in proportion to its probability. For long sequence
lengths this is therefore a reasonable approximation to make, and in practice this is typically what is computed
when cross entropy in bits is reported, which is the measure for character-level language models. Additionally,
in practice H(pdata , pmodel ) is evaluated on a test or validation sequence x. It is not evaluated on the training
sequence because that might just measure overfitting; instead we measure generalization performance by
evaluating on a test or validation sequence x.
Example 1:
Consider a vocabulary V = {A, B} and sequences of length T = 10.
Assume p(x) = ∏_{t=1}^{T} p(xt ). So the probability distribution over sequences of length T factorizes completely into
the probability of each individual token in the sequence. In other words, in this simple example the probability of
the next word is completely independent of the previous words. Let's further assume that the probability for
each of the symbols in the vocabulary is fifty percent: p(xt ) = 1/2 for both data and model distribution.
Then

H(pdata , pmodel ) = −(1/10) log2 (1/2)^10 = log2 2 = 1 bit
The amount of information needed to predict the next character is 1 bit with this simple (unigram) model as
the next character is either A or B with equal probability. In other words, we can’t find a better encoding of
this language than using 1 bit per character.
Remark: A uniform distribution always maximizes the entropy (⇒ upper bound for |V|).
Example 2:
Consider a vocabulary V = {A, B} and sequences of length T = 10.
Assume p(x) = ∏_{t=1}^{T} p(xt ) with p(xt = A) = 1 and p(xt = B) = 0
for both data and model distribution.
Then

H(pdata , pmodel ) = −(1/10) log2 1^10 = log2 1 = 0 bits
In this case, the amount of information needed to predict the next character is 0 bits as the next character
is always A. In other words, we don’t need any capacity to transmit this language through some channel, it
contains no information.
Remark: 0 bits is the minimal value for the entropy or cross-entropy (⇒ lower bound).
Example 3:
Consider a vocabulary V = {A, B} and sequences of length T = 10.
Assume p(x) = ∏_{t=1}^{T} p(xt ) with p(xt = A) = 0.1 and p(xt = B) = 0.9
for both data and model distribution.
Then

H(pdata , pmodel ) = −(1/10) log2 ((1/10)^1 (9/10)^9 ) = 0.47 bits
We need 0.47 bits now as we sometimes observe A, but most often B. Thus, the information conveyed in this
language is larger than 0 bits (lower bound) and smaller than 1 bit (upper bound).
Example 4:
Consider a vocabulary V = {A, B} and sequences of length T = 10.
Assume p(x) = ∏_{t=1}^{T} p(xt ) with p(xt = A) = 0.1 and p(xt = B) = 0.9
for the model distribution and p(xt = A) = 1 for the data distribution.
Then

H(pdata , pmodel ) = −(1/10) log2 (1/10)^10 = log2 10 = 3.32 bits
Because in the model we have a probability of only 0.1 of observing A, but in the data we only observe A.
We need more than 1 bit now as the model fits the data badly: under the stated data distribution the next
character is always A, so we need 0 bits to encode any possible outcome of pdata using the code optimized for
pdata , but 3.32 bits to encode it using the code optimized for pmodel . In general,

H(pdata , pmodel ) = H(pdata ) + DKL (pdata || pmodel )

In other words, and this is universally correct, the cross entropy of the model distribution with respect to the
data distribution is the entropy of the data distribution plus the KL divergence between the data and the model
distribution. Because this KL divergence must always be greater than or equal to zero, we know that the entropy
of pdata is a lower bound on the cross entropy of pmodel with respect to pdata . So the cross entropy that we
obtain, because our model is always imperfect, must always be greater than or equal to the entropy of the data
distribution.
Perplexity:
Word-level language models typically measure performance in terms of perplexity, which is 2 raised to the
power of the cross entropy:

Perplexity(pdata , pmodel ) = 2^{H(pdata ,pmodel )} = pmodel (x)^{−1/T} = ( ∏_{t=1}^{T} pmodel (xt |x1 , . . . , xt−1 ) )^{−1/T}
Thus, perplexity can be interpreted as the inverse probability of the test set, normalized by the sequence
length T which acts as a geometric mean.
Again, Perplexity(pdata , pmodel ) is evaluated on a test or validation sequence x.
Example 1:
Consider a vocabulary V = {A, B, C} and sequences of length T = 10.
Again we assume unigrams: p(x) = ∏_{t=1}^{T} p(xt ) with p(xt ) = 1/3 for both data and model distribution.
Then

Perplexity(pdata , pmodel ) = ( (1/3)^10 )^{−1/10} = 3
We see that the perplexity models the number of possible next tokens to choose from (i.e., here the model is
maximally confused about which of the 3 tokens A, B or C to pick), because they all occur independently with
equal probability of one third.
Thus, perplexity is often also called the average weighted branching factor: we have three branching
opportunities here that are all equally likely, so the model has an average branching factor of three.
Remark: A uniform distribution maximizes the perplexity (⇒ upper bound for |V|).
Example 2:
Consider a vocabulary V = {A, B, C} and sequences of length T = 10.
Assume p(x) = ∏_{t=1}^{T} p(xt ) with p(xt = A) = 1 and p(xt ∈ {B, C}) = 0,
for both data and model distribution.
Then

Perplexity(pdata , pmodel ) = ( 1^10 )^{−1/10} = 1
We see that the perplexity reduces in this case as the next choice is certain. The average branching factor is
one, we know we have to choose A. The model is not surprised to see the test set as it is able to predict the
test set exactly (all A’s).
Remark: 1 is the minimal value for the perplexity measure (⇒ lower bound). However, this is only achievable
for languages that contain only a single token.
Example 3:
Consider a vocabulary V = {A, B, C} and sequences of length T = 10.
Assume p(x) = ∏_{t=1}^{T} p(xt ) with p(xt = A) = 0.1, p(xt = B) = 0.9, and p(xt = C) = 0 for both data and model
distribution.
Then

Perplexity(pdata , pmodel ) = ( (1/10)^1 (9/10)^9 )^{−1/10} = 1.38
In this case, the perplexity is slightly larger than 1 as the model is quite certain to predict B as the next
character, but sometimes it should predict A.
Example 4:
Consider a vocabulary V = {A, B, C} and sequences of length T = 10.
Assume p(x) = ∏_{t=1}^{T} p(xt ) with p(xt = A) = 0.1, p(xt = B) = 0.9, and p(xt = C) = 0 for the model distribution
and p(xt = A) = 1 for the data distribution.
Then

Perplexity(pdata , pmodel ) = ( (1/10)^10 )^{−1/10} = 10
In this case the perplexity is larger than 3 as the model fits the data badly.
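The toy unigram examples above can be checked with a few lines of NumPy; the helper computes the cross entropy in bits per token and the corresponding perplexity on a given test sequence:

import numpy as np

def bits_and_perplexity(seq, p_model):
    # Normalized negative log2-likelihood (bits per token) and perplexity = 2^bits.
    bits = -np.mean([np.log2(p_model[c]) for c in seq])
    return bits, 2.0 ** bits

print(bits_and_perplexity("A" + "B" * 9, {"A": 0.1, "B": 0.9}))   # (0.47, 1.38), Example 3
print(bits_and_perplexity("A" * 10, {"A": 0.1, "B": 0.9}))        # (3.32, 10.0), Example 4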
9.1.4 Summary
• For character language models, current performance is roughly 1 bit per character
• For word language models, perplexities of about 60 were typical until 2017
• According to Quora, there are 4.79 letters per word (excluding spaces)
• Assuming 1 bit per character, we have a word-level perplexity of 2^5.79 ≈ 55.3 (4.79 letters plus one space per word)
• State-of-the-art models (GPT-2, Megatron-LM) yield perplexities of 10 − 20
• Be careful: Metrics not comparable across vocabularies or datasets
Additional Resources:
https://thegradient.pub/understanding-evaluation-metrics-for-language-models/
https://towardsdatascience.com/the-relationship-between-perplexity-and-entropy-in-nlp-f81888775ccc
9.2 Traditional Language Models
In this section we’re going to be talking about n-grams which have been the default language model for several
decades. A quick recap about language models:
p(x) = p(x1 , . . . , xT ) = ∏_{t=1}^{T} p(xt |x1 , . . . , xt−1 )
     = p(x1 ) p(x2 |x1 ) p(x3 |x1 , x2 ) . . .
An n-gram model makes a Markov assumption and conditions each token only on the n − 1 previous tokens.
Its conditional probabilities can be expressed in terms of joint n-gram probabilities:

p(xt |xt−n+1 , . . . , xt−1 ) = p(xt−n+1 , . . . , xt ) / p(xt−n+1 , . . . , xt−1 ) = p(xt−n+1 , . . . , xt ) / ∑_{xt} p(xt−n+1 , . . . , xt )

For a bigram model (n = 2) this becomes

p(xt |xt−1 ) = p(xt−1 , xt ) / p(xt−1 ) = p(xt−1 , xt ) / ∑_{xt} p(xt−1 , xt )
We see that we simply need to count the number of n-grams and (n−1)-grams in the training set to populate the
probability table of the n-gram model. For the bigram case, we simply count how often xt−1 is followed by xt
and divide by the number of occurrences of xt−1 .
Smoothing: For large n, the n-gram probabilities are often zero as they haven’t been observed in the training
set but they might occur in the test set which means that we are overfitting. A simple heuristic is to add one
to all n-gram counts.
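A bigram model is exactly such a probability table estimated from counts. A minimal Python sketch with add-one (Laplace) smoothing; the toy corpus and function names are our own:

from collections import Counter

def bigram_prob(corpus, vocab):
    # p(x_t | x_{t-1}) = (count(x_{t-1}, x_t) + 1) / (count(x_{t-1}) + |V|)
    # The +1 in the numerator is the add-one smoothing for unseen bigrams.
    pair_counts = Counter(zip(corpus[:-1], corpus[1:]))
    unigram_counts = Counter(corpus[:-1])
    return lambda prev, cur: (pair_counts[(prev, cur)] + 1) / (unigram_counts[prev] + len(vocab))

corpus = "the dog ran away the dog slept".split()
p = bigram_prob(corpus, vocab=set(corpus))
print(p("the", "dog"))   # (2 + 1) / (2 + 5) = 0.428...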
Example sentences generated from n-gram models of increasing order:
1-gram: "To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have" –
"Hill he late speaks; or! a more to leg less first you enter"
2-gram: "Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow."
– "What means, sir. I confess she? then all sorts, he is trim, captain."
3-gram: "Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, 'tis done." –
"This shall forbid it should be branded, if renown made it empty."
4-gram: "King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet
serv'd in;" – "It cannot be but so."
One can see that as we increase the context size from a unigram, to a bigram, to a trigram, to a 4-gram, that
the text becomes much more realistic.
9.2.4 Summary
• n-gram models are simple models that make a Markov assumption
to model a distribution over sequences via probability tables
• However, they have limited history context, they cannot model long-term dependencies, and param-
eters grow exponentially
• Smoothing heuristics are required to deal with the resulting sparsity
• They can’t directly model conditional distributions (e.g., for translating sentences)
• In contrast to neural language models, they can be considered as local non-parametric predictors (thus
suffering the curse of dimensionality)
• Tokens are encoded as discrete items
– Large vocabularies are typically reduced to a shortlist (removing infrequent words)
– Any two words have the same distance (√2 in one-hot vector space)
– Thus, n-gram models can’t share information between related (=close) words
9.3.1 Local Word Representations
• n-gram models, that effectively are just conditional probability tables, suffer from the curse of dimen-
sionality
• Modeling the joint distribution of n = 10 consecutive words with a vocabulary of size |V| = 10,000 results
in intractable |V|^n = 10^40 parameters, even if we limit the history context to 10 words.
• However, and this is where the idea for this type of neural language models came from, when modeling
continuous variables, we obtain generalization more easily than in the discrete case, because the function
to be learned is expected to behave locally smoothly
• This had been observed in other contexts before, but had not been used for NLP; its usage here has made
a tremendous impact on the performance of NLP algorithms
• Discrete word representations assume the same distance between words:
√
(1, 0, 0, . . . , 0)> − (0, 1, 0, . . . , 0)> = 2
| {z } | {z }
w1 w2 2
119
Figure 145: Distributed Word representations. Depiction of distributed word representations.
A distributed representation learns to embed “cat” and “dog” nearby, so the distance between those two words
should be low. Given that p(The cat is sitting) is high, and “cat” and “dog” are related, the model should also
assign a high probability to p(The dog is sitting).
Once we have learned these word embeddings, we can also visualize these word embeddings. An example of
a two-dimensional t-SNE visualization of a word embedding can be seen on Fig. 147. The shown regions are
zoomed in because the vocabulary is much larger. When zooming into particular regions we see that in the
region on the left we see countries and in the region on the right we observe dates or years. So the model has
learned to relate the semantic entities to each other by learning them from text.
But we always have to be careful with 2D visualizations - Geoffrey Hinton once said “In a 30-dimensional
grocery store, anchovies can be next to fish and next to pizza toppings.” This illustrates how difficult it is to
imagine and how easy it is to be misled by high dimensional spaces. In high dimensions everything can be next
to everything. Even if we visualize word embeddings we can really just do it in 2D, everything else is beyond
the scope of our imagination.
Figure 147: Word embeddings. Two-dimensional t-SNE visualization of a word embedding model.
Key Ideas:
• Associate with each word in the vocabulary a distributed
word feature vector (a real-valued vector in RM )
• Express the joint probability function of word sequences in terms
of the feature vectors of these words in the sequence, and
• Learn simultaneously the word feature vectors and the parameters
of that probability function
So we have an end-to-end training process. This was the first end-to-end model that simultaneously learned
the word feature vectors and the parameters of the probability function.
Feedforward Model:
Formal description of the Feedforward Model:
• Input: Sequence of words
• Output: Probability of the next word P (wt |wt−1 , . . . , wt−n+1 ) = e^{y_{idx(wt)}} / ∑_i e^{y_i}
• 3 Layers: y = b + Wx + U tanh(d + Hx) with x = (C wt−n+1 , . . . , C wt−1 )
  1. Fully connected ⇒ embedding
  2. Fully connected + tanh
  3. Fully connected + softmax
• Input to 2nd layer: vector of concatenated word embeddings
• Optional: direct connections
• C ∈ R^{M×|V|} ; thus the model scales linearly with |V| and linearly with n
The model is shown in Fig. 148. The model has some inputs at the bottom of the figure which are the indices of
the words, where the words are one-hot-vectors. This builds a sequence of words which is then transformed by
the matrix C from a one-hot-vector to a word embedding. So for each word we get a high-dimensional word
embedding, and all of these word embeddings get concatenated into a vector, which forms the first layer. This
is followed by a fully connected layer with a tanh as the hidden second layer. Next, this is fully connected and
followed by a softmax as the third layer. This softmax outputs the probability of the next word wt in the
sequence given all the previous words wt−n+1 until wt−1 .
There also exists a small variation of this model which is visualized with the dotted lines in Fig. 148. These
lines show direct connections from the first layer to the last layer.
In the formal description of this feedforward model, the probability for the next word wt is given by a softmax:
y_{idx(wt)} selects the score at the index of the one-hot-vector of wt , which is then normalized by all the other
scores. The scores y are the sum of a linear prediction from the optional direct connections (Wx) and a prediction
passed through the tanh non-linearity of the hidden layer (U tanh(d + Hx)), where x is the concatenation of the
one-hot-vectors of the previous words multiplied with the embedding matrix C ∈ R^{M×|V|} . We can see that the
model scales linearly with the vocabulary size because if we increase the vocabulary then C also has to grow.
But, in contrast to before, the model now also scales only linearly with n: before we had |V|^n parameters,
whereas now increasing n only increases the length of the concatenated vector that is input to the second layer,
so the number of parameters in that second layer grows linearly, not exponentially.
This is actually a common trick when working with neural networks: instead of representing some conditioning
explicitly, the conditioning variables are fed as an input to the neural network and therefore, in terms of
complexity, the model becomes more tractable. This is, for instance, also used in the context of reinforcement
learning. This way a lot more sharing of information and parameters takes place.
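A minimal NumPy sketch of the forward pass of such a feed-forward language model (without the optional direct connections unless W is given); all sizes below are toy choices of ours, not the values from the paper:

import numpy as np

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()

def ffnn_lm_forward(context_ids, C, H, d, U, b, W=None):
    # Embed each of the n-1 previous words via C and concatenate -> x (first layer).
    x = np.concatenate([C[:, i] for i in context_ids])
    hidden = np.tanh(d + H @ x)               # second layer
    y = b + U @ hidden                        # third layer (pre-softmax scores)
    if W is not None:
        y = y + W @ x                         # optional direct connections
    return softmax(y)                         # P(w_t | w_{t-n+1}, ..., w_{t-1})

V, M, n, Hdim = 1000, 30, 4, 50               # vocabulary, embedding, context size + 1, hidden units
rng = np.random.default_rng(0)
p = ffnn_lm_forward([3, 17, 42], C=rng.normal(size=(M, V)),
                    H=rng.normal(size=(Hdim, (n - 1) * M)), d=np.zeros(Hdim),
                    U=rng.normal(size=(V, Hdim)), b=np.zeros(V))
print(p.shape, p.sum())                       # (1000,) and approximately 1.0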
• Neural probabilistic language models lead to significantly better results in comparison with the best of
the n-grams, with a test perplexity difference of about 24% on Brown and about 8% on AP News (which
are both datasets), when taking the MLP versus the n-gram that worked best on the validation set.
• The results also suggest that the neural network was able to take advantage of more context (on Brown,
going from 2 words of context to 4 words brought improvements to the neural network, but not to the n-grams).
This is because of the similarities that are learned and the generalization they imply.
• Also showed that the hidden units are useful (MLP3 vs MLP1 and MLP4 vs MLP2), and that mixing the
output probabilities of the neural network with the interpolated trigram always helps to reduce perplexity.
• Skip-grams predict a word in their surrounding context, which is kind of an unsupervised training task
• Instead of predicting a distribution over words,
switch to a binary prediction problem
• The model is given pairs of words and needs to distinguish if the words occur next to each other in the
training corpus or they are sampled randomly
• Logistic regression on the inner product of the two word embeddings (see the sketch after this list)
• Can be trained very efficiently with lots of data
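A sketch of this binary prediction objective (skip-gram with negative sampling): the score of a word pair is the inner product of their embeddings, squashed by a sigmoid; observed pairs should score high, randomly sampled pairs low. Matrix names and sizes below are our own assumptions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_loss(E_in, E_out, center, context, label):
    # Logistic regression on the inner product of the two word embeddings;
    # label = 1 for pairs seen next to each other in the corpus, 0 for negative samples.
    p = sigmoid(E_in[center] @ E_out[context])
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

V, M = 10000, 100
rng = np.random.default_rng(0)
E_in, E_out = 0.01 * rng.normal(size=(V, M)), 0.01 * rng.normal(size=(V, M))
print(pair_loss(E_in, E_out, center=12, context=345, label=1))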
Using this, we can do some word vector arithmetic:
Word Vector Arithmetic
In Fig. 149 we can see some expression on the left and the nearest most likely token on the right. For example,
we can take the word embedding vector of Paris and subtract the word embedding of France and add Italy and
we get Rome as an outcome. This also shows that the semantics that this word embeddings have learned is
quite meaningful.
• Encoding operates in reverse order of the input sentence to introduce short-term dependencies.
Because we end with the first word and then the decoder starts to decode with the first word in the
target language. By reversing the input sentence you get shorter dependencies because the first word in
the input sentence is likely related to earlier words in the output sentence.
• Otherwise this is a simple LSTM that processes the reverse ordered words, so that the words are embedded
in a distributed word representation. Also there is a hidden state that is carried forward through this
recurrent neural network and then there’s an intermediate representation:
• The intermediate representation produced by the encoder is called the thought vector. This is the represen-
tation that stores the meaning of the sentence that shall be translated. Once we reach the end-of-sentence
symbol, the hidden representation is the thought vector, which is a global representation of the input sentence,
and this thought vector is then decoded. The reason why we have this interface of the thought vector is that
the input sentence and the output sentence may have different lengths; also, the word order may be very
different in different languages depending on the grammar. So we create a global representation of the input
sentence, the thought vector, that is passed to the decoder.
• Encoding using 1000 dim. word embeddings, decoding via beam search
• First end-to-end system that outperforms rule-based models and also the first system that demonstrated
that this is really possible at large scale which led to deployment.
9.4.2 Decoding
Let w1 , . . . , wT denote the target sentence and let v denote the thought vector. Sampling a translation from
the LSTM decoder is simple because of the autoregressive nature: we can just sample a new word from the
distribution that is predicted by the LSTM, given the thought vector and the previous words, where the previous
words are summarized in the hidden state in the case of an LSTM:
wt ∼ p(wt |v, w1 , . . . , wt−1 )
But this is not what we want to do in translation, in that case we like to compute the most probable
translation:
w1 , . . . , wT = argmax p(w1 , . . . , wT |v)
w1 ,...,wT
This is costly because there is a large number of possibilities, so searching exhaustively is intractable. But a
greedy algorithm often works well in practice:
wt = argmax p(wt |v, w1 , . . . , wt−1 )
wt
So we only take the argmax over wt instead of over the whole sequence w1 , . . . , wT . The probability is calculated
for wt given the thought vector v and the previously decoded words w1 , . . . , wt−1 .
In Fig. 151 an illustration of beam search can be seen. We start with an empty sequence and have A, B, C, D, E
as candidates; A and C are most likely, so with a beam of size two we keep track of A and C. Next, for A the
most likely continuation is B and for C it is E, so we continue with AB and CE. We always maintain this short
list of two hypotheses. If both continuations of A (e.g. AB and AC) had been more likely than CE, then we
would of course have continued with those two possibilities.
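A minimal sketch of beam search over such a conditional distribution; the scoring function below is a toy stand-in for the LSTM decoder conditioned on the thought vector:

import numpy as np

def beam_search(log_prob_next, vocab, beam_size=2, max_len=10, eos="<EOS>"):
    beams = [([], 0.0)]                          # (sequence, accumulated log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:           # finished hypotheses are kept unchanged
                candidates.append((seq, score))
                continue
            logp = log_prob_next(seq)            # log p(w_t | v, w_1, ..., w_{t-1})
            for i, w in enumerate(vocab):
                candidates.append((seq + [w], score + logp[i]))
        # Keep only the beam_size best partial hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

vocab = ["A", "B", "C", "<EOS>"]
toy = lambda seq: np.log([0.5, 0.3, 0.1, 0.1]) if len(seq) < 3 else np.log([0.05, 0.05, 0.05, 0.85])
print(beam_search(toy, vocab))                   # ['A', 'A', 'A', '<EOS>']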
Self-Attention
• The fundamental innovation of the Transformer is the self-attention layer
• For each position t in the sequence we compute an attention over the other positions in the sequence
• The transformer uses multiple heads (because multiple heads are empirically better), i.e., it computes
the attention operation multiple times (K = 8 in the original implementation)
• Self-attention then constructs a tensor A[k, t1 , t2 ] – the strength of the attention weight from t1 to t2 for
head k. So the attention that t1 pays to t2 in a particular layer of the transformer for a particular head k.
• In the paper, an embedding dimension of DJ = 512 is chosen per token
• Using K = 8 heads, this results in a dimension of DQ = DK = DV = 64 for the query, key and value
embeddings that are used for each token (can be different). So they do computations at 64 dimensions
but they do it eight times and then they concatenate them again to get 512 dimensions.
Multi-Headed Self-Attention:
In the following we see the equations for the (multi-headed) self-attention-layer, the most fundamental layer of
The Transformer. Note that there are more layers which are not described here.
For each head k and word position t, we compute a key, query and value vector. The queries Q and the keys
Figure 152: The Transformer. http://jalammar.github.io/illustrated-transformer/
K are used to compute the self-attention matrix A for head k. A is then multiplied with the values V to yield
embedding vectors H that are concatenated.
Q_{ℓ+1} [k, t, i] = W^Q_{ℓ+1} [k, i, J] L_ℓ [t, J]                                          (41)
K_{ℓ+1} [k, t, i] = W^K_{ℓ+1} [k, i, J] L_ℓ [t, J]                                          (42)
V_{ℓ+1} [k, t, i] = W^V_{ℓ+1} [k, i, J] L_ℓ [t, J]                                          (43)
A_{ℓ+1} [k, t1 , t2 ] = softmax_{t2} [ (1/√DQ) Q_{ℓ+1} [k, t1 , I] K_{ℓ+1} [k, t2 , I] ]    (44)
H_{ℓ+1} [k, t, i] = A_{ℓ+1} [k, t, T ] V_{ℓ+1} [k, T, i]                                    (45)
L_{ℓ+1} [t, j] = W^L_{ℓ+1} [j, I] (H_{ℓ+1} [1, t, I], . . . , H_{ℓ+1} [K, t, I])            (46)
(41)-(43): We start at the bottom where we have just the words, represented as one-hot-vectors or word
embeddings of one-hot-vectors. Then we multiply this first layer with a matrix W to get a vector. This
multiplication sums over the embedding index J and yields a 64-dimensional vector for each head and each word.
This is done three times: once for the query vector, once for the key vector and once for the value vector.
(44): For computing the attention matrix A, we first take the dot product of the query and the key vector. The
division by the square root of DQ serves as a normalization factor. All of this is then put into a softmax over t2 .
As can be seen in the equation, Q goes over t1 and K goes over t2 , so we get the matrix A where t1 attends to t2 .
(45): Next we multiply this attention matrix A with the value vector V , where we now sum over the T dimension
of V and the t2 dimension of A.
(46): Now we have these 64-dimensional vectors H that we concatenate over the different heads, and then we
compute a matrix product with W^L in order to get the output of that layer.
In summary we compute queries, we compute keys and we compute values for each head and for each time
step, or word, and then from the queries and keys we compute attention and that attention is multiplied with
the values so we take the values where the model attends to and these are then concatenated and fed into the
next layer. Fig. 152 shows an illustration of the transformer that includes the here described self-attention layer
which is the core of the transformer.
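A minimal NumPy sketch of one multi-headed self-attention layer following Eqs. (41)-(46); the toy dimensions (2 heads, 8-dimensional embeddings) and weight initialisation are our own choices, and layer normalization, residual connections and the feed-forward sub-layer are omitted:

import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(L, Wq, Wk, Wv, Wl):
    # L: (T, D) token embeddings; Wq/Wk/Wv: (K, Dh, D) per-head projections; Wl: (D, K*Dh).
    Q = np.einsum('kij,tj->kti', Wq, L)               # queries, Eq. (41)
    K = np.einsum('kij,tj->kti', Wk, L)               # keys,    Eq. (42)
    V = np.einsum('kij,tj->kti', Wv, L)               # values,  Eq. (43)
    A = softmax(np.einsum('kti,ksi->kts', Q, K) / np.sqrt(Q.shape[-1]))   # Eq. (44)
    H = np.einsum('kts,ksi->kti', A, V)               # attended values, Eq. (45)
    H_cat = np.concatenate([H[k] for k in range(H.shape[0])], axis=-1)
    return H_cat @ Wl.T                               # output embeddings, Eq. (46)

T, D, K_heads, Dh = 5, 8, 2, 4
rng = np.random.default_rng(0)
out = self_attention(rng.normal(size=(T, D)),
                     rng.normal(size=(K_heads, Dh, D)), rng.normal(size=(K_heads, Dh, D)),
                     rng.normal(size=(K_heads, Dh, D)), rng.normal(size=(D, K_heads * Dh)))
print(out.shape)                                      # (5, 8)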
If one wants to use the transformer for translation, there is a sequence of input words that is encoded, depicted
on the bottom left of Fig. 152, underneath the encoder. Then information from this encoder is given to the
decoder. In particular, the keys and values computed from the encoded input sequence are given to the decoder,
where they are combined with the queries computed from the already generated target words. The output
sentence is successively generated by the decoder by combining keys and values from the encoder with queries
from the decoder and predicting the next word.
10 Graph Convolution Networks
10.1 Machine Learning on Graphs
This section gives a motivation why GCNs have a better performance compared to other network architectures
and illustrates some areas of application of GCNs.
10.1.1 Motivation
Multi layer perceptrons (MLPs) are very flexible function approximators. Theoretically a MLP with only one
layer can already act as a universal function approximator, given that this layer can grow infinitely wide.
However, MLPs do not scale well. If the input to an MLP is large the amount of model parameters grows large
as well. Furthermore, the more parameters a neural network has, the more it tends to overfit: Its generalization
capability decreases. For structured signal grids like 2D images or 1D time series, a CNN addresses this issue and
allows good generalization due to its convolution operation over the regular grid of signals, which reduces the
number of parameters. Unfortunately, a lot of signals cannot be described in such a structured way, such as
molecules or natural language. These signals may be better described with the help of graphs. In order to
exploit graph-structured data, a model class is necessary that scales better than an MLP when receiving large
input data, without losing its predictive power. Furthermore, this model has to be more flexible than a CNN in
order to exploit the local connectivity structure of any graph - not only structured ones - as prior information.
Figure 153: Comparison of graph structures. An illustration of a regular structured graph (left) and an
unstructured graph (right). The red dot marks the point of interest that is calculated by exploiting the graph
structure and the neighboring nodes (green).
Graphs are descriptors of a signal structure, where the signals are described as nodes (points) and the similarity
between signals with edges (lines between signals). Fig. 153 shows a comparison between a structured graph
(left) and an unstructured graph (right). On both graphs a convolution operation is applied in order to calculate
the red dot by exploiting the graph structure and the neighboring nodes (green). On the structured graph on
the left a convolution filter is applied, that computes a value for the red dot by calculating the dot product
between the elements in green and red. The equivalent operation is done on the right by using a convolution
that exploits the locality in the graph. These convolutions on unstructured graphs are polynomials conditioned
on the graph structure with the graph structure being encoded in a matrix derived from the graph.
10.1.2 Applications
Since different types of data can be represented in form of a graph, GCNs can be applied in several tasks in
multiple domains. Some of these applications are scene graph generation, multi-object tracking, authorship
attribution, recommendation systems, learning molecular fingerprints, protein interface prediction, interaction
networks, learnable physics engines for control, decentralized control of robot swarms.
10.2 Graphs
GCNs operate on graphs. Therefore some basic knowledge of graphs and operations on graphs is presented in
this section.
A graph can be represented as a triplet G = (V, E, W) with vertices V, edges E and weights W. Vertices or
nodes are defined as a set of N numerical labels V = {1, . . . , N }. Edges are ordered pairs of these numerical
labels (i, j), where (i, j) ∈ E is interpreted as ”node i is influenced by node j”. Weights wij ∈ R are numbers
associated to edges (i, j) that determine the strength of the influence that node j has on node i. Depending on
these basic properties, a graph can be classified as a directed graph or as a symmetric graph.
In the case of a directed graph (Fig. 154) the edge (i, j) differs from edge (j, i). Thus, a connection between
two nodes can be one-way only, meaning that (i, j) ∈ E and (j, i) ∉ E. Furthermore, if a connection between
two nodes is bidirectional, {(i, j), (j, i)} ⊆ E, their weights can be different, wij ≠ wji ; e.g. the connections
between nodes 3 and 5 in Fig. 154 could have different weights.
The undirected or symmetric graph is a directed graph, with the special property that its edge set and its
weights are symmetric. Therefore, if the edge set E contains (i, j) it implies that (j, i) ∈ E as well. This is
illustrated in Fig. 155 with the red line with arrows on both sides. For the weights of these connections (i, j)
that means that wij = wji for all (i, j) ∈ E.
A special case in directed and symmetric graphs is the unweighted graph, where all existing connections
(i, j) ∈ E have weights wij = 1. Since the weights of a graph convey valuable information as a prior for GCNs,
most graphs that are encountered in the context of deep learning are weighted.
Figure 154: Directed graph. Nodes are illustrated by numerically labelled purple dots, edges are denoted by
arrow lines and weights with a description in the form wij .
Figure 155: Symmetric graph. The illustration is similar to Fig. 154. Due to the symmetry of the graph the
edges are directed in both directions and there is exactly one weight per edge.
Graph Matrix Representation. The visual graph with its connections can be mathematically represented
with the help of different matrices. A common representation of the edges of a graph G is the adjacency
matrix A. In this sparse N × N matrix, each row and each column represents a vertex and each entry Aij
contains the weight wij of the connection (i, j) ∈ E. Furthermore, if the graph is symmetric it holds that A = A^T ,
as seen in Fig. 156.
A = ( 0 1 1 0 0 )          D = ( 2 0 0 0 0 )
    ( 1 0 1 1 0 )              ( 0 3 0 0 0 )
    ( 1 1 0 0 1 )              ( 0 0 3 0 0 )
    ( 0 1 0 0 1 )              ( 0 0 0 2 0 )
    ( 0 0 1 1 0 )              ( 0 0 0 0 2 )
Figure 156: Unweighted graph and corresponding matrices. The unweighted graph is transformed into
respective matrices A and D. Colors mark the corresponding location of the edges in the matrices
Another representation is the degree matrix D that contains the degree of a vertex on its diagonal axis (see
Fig. 156). The degree di of node i is the sum of weights of its incident edges:
di = ∑_{j∈N (i)} wij ,   (47)
where N (i) - the neighborhood - is the set of nodes that influence node i:

N (i) = {j : (i, j) ∈ E}   (48)

The diagonal entry Dii therefore contains the degree of vertex i: Dii = di . D can be expressed in terms of the
adjacency matrix as D = diag(A1).
The Laplacian matrix L combines matrices D and A as follows:
L = D − A. (49)
Both the Laplacian and the adjacency matrix can be normalized to achieve a more homogeneous representation
of a graph. That is especially helpful for asymmetric graphs, where some nodes have a lot of neighbors and/or
a high degree and some nodes have only a few neighbors and/or a low degree. The normalized adjacency
matrix expresses weights relative to node degrees:
Ā = D^{−1/2} A D^{−1/2}   ⇒   Āij = wij / √(di dj) .   (50)
The Laplacian matrix, the adjacency matrix and the normalized forms of both are in the following represented
by the Graph Shift Operators S. That is, because for the theoretical analysis of GCNs the specific graph
matrix representation is irrelevant. Nevertheless during deployment the specific representation matters and
leads to different results. It holds that if G is symmetric S = S> .
Graph Signal Diffusion. Given the mathematical representation of a graph G as S that captures the struc-
ture of this graph, a graph signal can be defined as a vector x ∈ RN that assigns a value xi ∈ R to every node
i. In that way S encodes the expected proximity or similarity between components of x. A multiplication Sx
yields a diffused signal y over G:

y = Sx   (52)

If S is the adjacency matrix, this yields yi = ∑_j wij xj . The operation mixes the neighboring values of
xi and diffuses its signal along the edges over G. Since one application of S moves the signal of x only one edge
at a time, a diffusion sequence over k steps can be defined as the sequence x, Sx, S²x, . . . , S^k x. A graph
convolution (or graph filter) with coefficients h = (h0 , . . . , hK−1 ) is the linear combination y = ∑_{k=0}^{K−1} hk S^k x.
The convolution therefore successively aggregates information from local to global neighborhoods. This is done
by a linear combination of the elements of the diffusion sequence S^k x weighted by the filter coefficients h.
Fig. 157 shows an exemplary application of a graph convolution.
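A graph filter y = ∑_k hk S^k x can be implemented exactly as the block diagram in Fig. 157 suggests: repeatedly shift the signal with S and accumulate the weighted terms. A minimal NumPy sketch (the toy graph and coefficients are our own choices):

import numpy as np

def graph_filter(S, x, h):
    # y = sum_k h_k S^k x: each application of S diffuses the signal one hop further.
    y = np.zeros_like(x)
    diffused = x.copy()                  # S^0 x
    for h_k in h:
        y += h_k * diffused
        diffused = S @ diffused          # next element of the diffusion sequence
    return y

# Toy symmetric unweighted graph on 5 nodes (adjacency matrix used as shift operator).
S = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])            # impulse at node 1
print(graph_filter(S, x, h=[1.0, 0.5, 0.25]))      # the signal has spread two hops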
Time Convolutions as Graph Convolutions. Time can be represented in the form of a 1D line graph of
infinite length, where an input signal is shifted infinitely many times. A resulting graph shift operator could
therefore be an infinitely large adjacency matrix A with ones just left of the diagonal. Now, given an input
signal x and S = A, a time convolution can be thought of as a polynomial on the adjacency matrix of the line
graph: y = ∑_{k=0}^{∞} hk S^k x.
Figure 157: Graph convolutions as diffusion operators. A graph convolution can be visualized in a block
diagram. Iteratively a shift of S, a weighting of Sk x with hk and a summation of the weighted terms is applied.
Graph Fourier Transforms. Graph Fourier transforms (GFT) are equivalent to standard Fourier transforms
in the regular domain but on graphs. Given the eigendecomposition of the graph shift operator S = VΛV∗ ,
the GFT of a graph signal is given by
x̃ = V∗ x. (57)
The GFT is thus a projection onto the eigenspace of S. The GFT is a helpful tool for analyzing graph
information processing systems, since in the GFT domain graph convolutions are only pointwise operations:
ỹi = ∑_{k=0}^{∞} hk λ_i^k x̃i . Further information on GFTs can be found at https://GCN.seas.upenn.edu/.
During the training a loss L is minimized between the predicted output signal ŷ = fh (x, S) and the original
output signal y in order to find the optimal values for the filter coefficients h:
h∗ = argmin_h ∑_{(x,y,S)∈D} L (fh (x, S), y)   (59)
If the inference problem at hand has a distinct input and output dimension, a readout layer is needed. Let
N denote the number of vertices in our graph and thus the input dimension of x ∈ R^N , and let M denote the
output dimension of ŷ ∈ R^M , with N ≠ M . The readout layer is defined as:

ŷ = fh (x, S) = R ∑_{k=0}^{K−1} hk S^k x   (60)
with the readout matrix R ∈ R^{M×N} that matches the input and output dimensions. Typically, R is not learned
but a design choice of the programmer. R could be such that it reads out only the value of node i by using
a unit vector at location i, R = e_i^T . Another exemplary readout matrix could be the summation over all N
elements of the graph, R = 1^T . This is useful for classification purposes.
Graph Perceptron. The graph filters as defined above have limited expressive power as they can only learn
linear mappings. In order to achieve a higher expressive power, the graph filters are combined with point-
wise non-linearities g(·), such as sigmoid, tanh or ReLU activation functions. This function is named graph
perceptron, since it introduces the same features for GCN as the perceptron for general neural networks. The
graph perceptron can be expressed as:
fh (x, S) = g( ∑_{k=0}^{K−1} hk S^k x ) .   (61)
Because of the introduced non-linearity, the graph perceptron is able to express a larger function class.
Graph Convolution Networks. A GCN can now be defined by stacking multiple layers of graph perceptrons
on top of each other. Similar to a MLP, the GCN is recursively composed
x_ℓ = g( ∑_{k=0}^{K−1} h_{ℓk} S^k x_{ℓ−1} ) ,   (62)
where it is assumed that the input to the first layer is set to the input signal x0 = x. Note that compared to
(61) x and hk have index ` now to denote the layer. A short hand notation for the recursive application of the
graph perceptron over L layers is:
ŷ = fH (x, S) = xL , (63)
where H denotes a set of L vectors of trainable filter coefficients H = {h1 , . . . , hL }. The recursive composition
of a GCN is illustrated in Fig. 161. Empirical evidence suggests that, similar to CNNs, deeper GCNs have a
better prediction performance due to more layers of non-linearities.
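Stacking graph perceptrons as in (62) gives the full GCN forward pass. A minimal NumPy sketch reusing the diffusion idea from above; filter order, depth and the ReLU non-linearity are our own toy choices:

import numpy as np

def gcn_forward(S, x, H, K=3):
    # H is a list of per-layer coefficient vectors h_l of length K, cf. Eqs. (62)-(63).
    for h_l in H:
        z = np.zeros_like(x)
        diffused = x.copy()
        for k in range(K):
            z += h_l[k] * diffused       # accumulate h_{lk} S^k x_{l-1}
            diffused = S @ diffused
        x = np.maximum(z, 0.0)           # point-wise non-linearity g (here ReLU)
    return x

rng = np.random.default_rng(0)
S = rng.random((5, 5)); S = (S + S.T) / 2          # toy symmetric graph shift operator
x = rng.normal(size=5)
print(gcn_forward(S, x, H=[rng.normal(size=3) for _ in range(2)]))   # 2-layer GCN output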
Learning Graph Convolution Networks. The filter parameters of a GCN are learned in same way as in
the single-layer-case but with the filter coefficients of all layers H as the optimization objective:
H∗ = argmin_H ∑_{(x,y,S)∈D} L (fH (x, S), y) .   (64)
Although GCNs are a special case of MLPs, their generalization capabilities to unseen input signals are better.
That is because GCNs are able to exploit a priori information about the symmetries of a graph that are carried
with the graph shift operator S. GCNs learn the structure of a graph as seen in Fig. 158. Furthermore, unlike
MLPs, GCNs can be trained on different graphs by passing different graph shift operators to them.
Figure 158: Comparing the generalization of MLPs and GCNs. The three images depict an output
signal on a graph, where the color indicates the signal strength. An MLP and a GCN receive the same input
and output pairs during training and produce a similar output signal (left). During inference time an unseen
signal is passed into the model and the MLP predicts an output signal similar to the predictions during training
(center). The GCN in contrast generalizes better since it knows the underlying structure of the graph. Therefore
it is able to predict an output signal that has a structure similar to the structures predicted during training but
on another part of the graph (right). This is due to the translation equivariance of such a network.
Figure 159: Filter banks. The block diagram on the left receives a graph signal x and passes it separately
through 3 different graph filters (G = 3). This produces a graph signal matrix Z = (z1 , z2 , z3 ) with each vector
containing one feature per node (right).
If now the input to a MIMO GCN is multidimensional with F features per node, each feature vector x^f is
processed through G filters weighted by coefficients h_k^{fg} :

u^{fg} = ∑_{k=0}^{K−1} h_k^{fg} S^k x^f   (66)
The MIMO graph filter thus generates an output with F × G features per node. Applying filter banks
over multiple layers would therefore lead to an unwanted exponential growth of features and parameters. For
this reason, in each layer the filter outputs are summed over the input features. This reduces the number of
features to G (see Fig. 160):

z^g = ∑_{f=1}^{F} u^{fg} = ∑_{f=1}^{F} ∑_{k=0}^{K−1} h_k^{fg} S^k x^f   (67)
where, in matrix notation,

Z = ∑_{k=0}^{K−1} S^k X H_k   (68)

with Z = (z^1 , . . . , z^G ) ∈ R^{N×G} , S^k ∈ R^{N×N} , X = (x^1 , . . . , x^F ) ∈ R^{N×F} and the coefficient matrices

H_k = ( h_k^{11} . . . h_k^{1G} ; . . . ; h_k^{F1} . . . h_k^{FG} ) ∈ R^{F×G} .
A MIMO GCN is now defined by stacking multiple MIMO Graph Filters on top of each other for L layers.
This yields to the following equation:
X_ℓ = g( ∑_{k=0}^{K−1} S^k X_{ℓ−1} H_{ℓk} )   (69)
with X0 = X being the input signal matrix to the first layer. (69) is expressed as a short-hand notation as:
Ŷ = fH (X, S) = XL (70)
11 Autoencoders
11.1 Latent Variable Models
Latent variable models capture the structure of the data space using latent variables that can be modelled
using unsupervised learning without the need for data-label pairs. Formally, a latent variable model relates the
observation space x ∈ RD and the latent space z ∈ RQ with N << Q. The mapping can be fw : x → z referred
as encoder or gw : z → x referred as decoder. Latent variable models always have a decoder but may or may not
have an encoder. Furthermore, the mapping can either be linear or non-linear and deterministic or probabilistic.
Table 2 cites one example for each of mapping types. Latent variable models capture the underlying structure
and semantics of the data manifold usually in a lower dimensional space.
Figure 160: Filter Banks with multiple inputs. Each of the F feature vectors xf is applied to G graph
filters and produces the outputs uf g that are then summed up to z g .
Figure 161: Block diagram of a GCN. The input vector x0 is passed into the first graph perceptron in the first
layer ℓ = 1. The output x1 is equally passed into layer ℓ = ℓ + 1 until the final output x3 is produced.
Figure 162: Block diagram of a MIMO GCN. In contrast to the GCN, the MIMO GCN receives an input matrix
X and produces with the parameter coefficient matrix H the output matrix Z containing the values from the
filter bank.
Generative latent variable models define the data distribution as the marginal p(x) = ∫ p(x|z) p(z) dz, where p(z) is the prior probability over the latent variable z ∈ R^Q and p(x|z) represents the likelihood of x given z. Given a dataset X, generative latent variable models aim to maximize p(x) for x ∈ X by learning a prior p(z) and the likelihood p(x|z). Each data point x is associated with a unique latent variable z. Fig. 163 shows a graphical representation of a generative latent variable model.
Table 2: Examples of latent variable models for each type of mapping.
• Linear, deterministic: Principal Component Analysis (PCA)
• Linear, probabilistic: Probabilistic PCA
• Non-linear with encoder, deterministic: Autoencoder
• Non-linear with encoder, probabilistic: Variational Autoencoder
• Non-linear without encoder: Generative Adversarial Networks
Figure 163: Plate notation of generative latent variable models. The model parameters w are fixed for all the N data points. However, each input is associated with one distinct latent variable.
PCA reconstructs each data point x_i as x̂_i = x̄ + Σ_{j=1}^Q z_{ij} v_j, where x̂_i is the reconstruction/prediction of the data point x_i, x̄ = (1/N) Σ_{i=1}^N x_i is the data mean, and V = (v_1, . . . , v_Q) is an orthonormal basis of the latent space. Essentially, PCA constitutes a decoder given by x̂ = x̄ + Vz (obtained by rewriting the above equation in matrix form) and an encoder z = V^T(x − x̄) obtained by inverting this linear mapping. The goal of PCA is to minimize the L2 reconstruction loss wrt the latent variables Z and the orthonormal basis V. We formalize the reconstruction loss below:
\[ L(Z, V) = \sum_{i=1}^{N} \|\hat{x}_i - x_i\|^2 = \sum_{i=1}^{N} \Big\| \underbrace{\bar{x} + \sum_{j=1}^{Q} z_{ij} v_j}_{\hat{x}_i} - x_i \Big\|^2 \]
\[ = \sum_{i=1}^{N} \Big\| \sum_{j=1}^{Q} z_{ij} v_j + \bar{x} - x_i \Big\|^2 \]
\[ = \sum_{i=1}^{N} \Big( \sum_{j=1}^{Q} z_{ij}^2 + 2 \sum_{j=1}^{Q} z_{ij}\, v_j^T (\bar{x} - x_i) + \|\bar{x} - x_i\|^2 \Big) \]
where the last step uses the orthonormality of the basis vectors v_j.
Setting the partial derivative of the loss wrt z_{ij} to zero yields the optimal latent variables:
\[ \frac{\partial L(Z, V)}{\partial z_{ij}} = 2 z_{ij} + 2 v_j^T (\bar{x} - x_i) \overset{!}{=} 0 \;\implies\; z_{ij}^{*} = -v_j^T (\bar{x} - x_i) \]
We plug the solution Z = Z^* into the reconstruction loss L(Z, V), which can be simplified and rewritten as
\[ L(Z^*, V) = \sum_{i=1}^{N} \Big( -\sum_{j=1}^{Q} z_{ij}^{*2} + \|\bar{x} - x_i\|^2 \Big) = -\sum_{j=1}^{Q} v_j^T S v_j + \sum_{i=1}^{N} \|\bar{x} - x_i\|^2 \]
where S = Σ_{i=1}^N (x̄ − x_i)(x̄ − x_i)^T is the scatter matrix of the dataset X.
We then proceed to find the basis V using this simplified form. Since V is constrained to be an orthonormal basis, the solution V^* that minimizes L(Z^*, V) is subject to the equality constraints v_j^T v_j = 1 for all j ∈ {1, . . . , Q}. We employ Lagrange multipliers to solve for V; the corresponding Lagrangian is given by:
\[ L(Z^*, V, \lambda) = -\sum_{j=1}^{Q} v_j^T S v_j + \sum_{i=1}^{N} \|\bar{x} - x_i\|^2 + \sum_{j=1}^{Q} \lambda_j (v_j^T v_j - 1) \]
We find V^* by setting the partials of the above Lagrangian wrt V to zero:
\[ \frac{\partial L(Z^*, V, \lambda)}{\partial v_j} = -2 S v_j + 2 \lambda_j v_j \overset{!}{=} 0 \;\implies\; S v_j = \lambda_j v_j \]
The possible solutions for (λ_j, v_j) are the eigenvalues and eigenvectors of the scatter matrix S ∈ R^{D×D}. There exist up to D eigenvectors. For a latent space with dimension Q < D we choose the eigenvectors with the Q largest eigenvalues. This ensures the reconstruction loss is minimized, as the loss is proportional to the sum of the pruned (discarded) eigenvalues; note that Σ_{j=1}^Q v_j^T S v_j = Σ_{j=1}^Q λ_j.
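As a sanity check of this derivation, here is a small NumPy sketch of PCA fit via an eigendecomposition of the scatter matrix; the random data and the latent dimension Q = 2 are purely illustrative assumptions.

import numpy as np

def pca_fit(X, Q):
    # X: (N, D) data matrix. Returns the mean x_bar (D,) and the basis V (D, Q).
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = Xc.T @ Xc                          # scatter matrix, (D, D)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]      # eigenvalues in descending order
    V = eigvecs[:, order[:Q]]              # keep eigenvectors of the Q largest eigenvalues
    return x_bar, V

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))          # toy dataset, N=100, D=5
x_bar, V = pca_fit(X, Q=2)
Z = (X - x_bar) @ V                        # encoder: z = V^T (x - x_bar)
X_hat = x_bar + Z @ V.T                    # decoder: x_hat = x_bar + V z
print(np.mean((X_hat - X) ** 2))           # reconstruction error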
PCA can equivalently be derived by maximizing the variance of the latent coordinates z = v^T(x − x̄):
\[ \mathrm{Var}(z) = \mathbb{E}\big[ \big( v^T(x - \bar{x}) - \mathbb{E}[ v^T(x - \bar{x}) ] \big)^2 \big] \propto v^T S v \]
which shows that maximizing the variance amounts to maximizing the term v^T S v. Thus, one can solve the optimization objective below wrt V for variance maximization of the latent variables, where V is again constrained to be an orthonormal basis:
\[ (V^*, \lambda^*) = \operatorname*{argmax}_{V, \lambda} \; \sum_{j=1}^{Q} v_j^T S v_j + \sum_{j=1}^{Q} \lambda_j (v_j^T v_j - 1) \]
The above objective is maximized by the Q largest eigenvalues and their corresponding eigenvectors.
11.2.3 Applications
Despite being a linear model, PCA yields good reconstructions with very low dimensional latent spaces. Besides dimensionality reduction, PCA can also be used to explore the latent properties of a dataset. Fig. 164 showcases two applications of PCA on image manifolds.
Figure 164: Applications of PCA to image manifolds. a) Principal component analysis applied to the MNIST digit dataset, using only images of the digit 3. The reconstruction becomes better as the number of basis vectors is increased. b) Reconstruction of face images using only 3 eigencomponents, achieving a significant dimensionality reduction.
11.3 Autoencoder
Autoencoders generally consist of a deterministic non-linear encoder f_w and decoder g_w. They are usually trained to predict the input x ∈ R^D as the output x̂ ∈ R^D. The encoder f_w : x → z maps the input to a latent code z ∈ R^Q while the decoder g_w : z → x tries to reconstruct the input from the latent code z. Hence, autoencoders are trained to minimize a reconstruction loss such as the squared error ‖x̂ − x‖². Fig. 165 presents a schematic illustration of autoencoders.
Figure 165: Schema of an autoencoder. The autoencoder takes the input x and encodes it into the latent representation z using the encoder network f_w(x). The latent code z is then decoded back using the decoder network g_w(z), which outputs the prediction x̂. To learn the model parameters w, the squared reconstruction error is used.
The non-linear mappings f_w and g_w are parameterized by neural networks with weights w. One can choose an appropriate class of neural network (convolutional neural networks, multi-layer perceptrons, etc.) based on the nature of the input data.
Since the goal of autoencoders is to minimize the reconstruction error, the optimization objective can be written as:
\[ w^* = \operatorname*{argmin}_{w} \sum_{i=1}^{N} \|\hat{x}_i - x_i\|^2 = \operatorname*{argmin}_{w} \sum_{i=1}^{N} \|g_w(f_w(x_i)) - x_i\|^2 \]
For a linear autoencoder with encoder f_w(x) = Ax + a and decoder g_w(z) = Bz + b, this reduces to a single affine mapping:
\[ w^* = \operatorname*{argmin}_{w} \sum_{i=1}^{N} \|(B(Ax_i + a) + b) - x_i\|^2 = \operatorname*{argmin}_{w} \sum_{i=1}^{N} \big\| \underbrace{(Cx_i + c)}_{\hat{x}_i} - x_i \big\|^2 \]
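For illustration, the following is a minimal PyTorch sketch of a non-linear autoencoder trained with the squared reconstruction loss; the layer sizes, optimizer, and the dummy data are assumptions chosen only to make the objective above executable.

import torch
import torch.nn as nn

D, Q = 784, 2                          # e.g. flattened 28x28 images, 2D latent space
encoder = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, Q))
decoder = nn.Sequential(nn.Linear(Q, 128), nn.ReLU(), nn.Linear(128, D))
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.rand(64, D)                  # dummy minibatch standing in for real data
for step in range(100):
    x_hat = decoder(encoder(x))                   # x_hat = g_w(f_w(x))
    loss = ((x_hat - x) ** 2).sum(dim=1).mean()   # squared reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()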
Figure 166: Comparison of PCA and autoencoders on a cosine manifold. a) PCA applied to cosine data with a 1-dimensional latent space. b) Linear autoencoder with the same latent space as PCA; interestingly, after a sufficient number of iterations the reconstructions of PCA and the linear autoencoder are almost indistinguishable. c) Non-linear autoencoder with a 1-dimensional latent space; with non-linear functions the autoencoder produces much better reconstructions than standard PCA. d) With a 2-dimensional latent space the reconstruction completely overlaps with the data points.
11.4 Variational Autoencoders (VAE)
11.4.1 Intractability of Learning Generative Latent Variable Models
Generative latent variable models discussed in Section 11.1.1 essentially capture the underlying data distribution p(x) by considering a simple Bayesian model with a prior probability p(z) over the latent space and a conditional probability p(x|z) over the input space given z. However, learning optimally parameterized distributions is often intractable. Consider p_w(x) to be a distribution parameterized by some weights w. The goal of generative latent variable models is to maximize p(x) for the data points x ∈ X in our dataset. To find the optimal weights w^* that maximize p(x), one can use the familiar (negative log-likelihood) minimization objective:
\[ w^* = \operatorname*{argmin}_{w} \sum_{i=1}^{N} -\log p_w(x_i) = \operatorname*{argmin}_{w} \sum_{i=1}^{N} -\log \mathbb{E}_{z \sim p(z)} \big[ p_w(x_i \,|\, z) \big] \]
This objective requires computing an expectation over the latent variable z with distribution p(z). One is forced to draw a huge number of samples from p(z) to approximate the expectation, and this number grows exponentially with the latent dimension, making the computation of the objective intractable.
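As a hedged, purely illustrative toy example of this intractability (the 1D Gaussian decoder below is an assumption, not the lecture's model), a naive Monte-Carlo estimate of p(x) from prior samples has very high variance when the observation lies far from where the prior puts its mass:

import numpy as np

rng = np.random.default_rng(0)

def likelihood(x, z):
    # assumed toy decoder: p(x|z) = N(x; z, 0.1^2)
    return np.exp(-0.5 * ((x - z) / 0.1) ** 2) / (0.1 * np.sqrt(2 * np.pi))

x = 3.0                                  # an observation far from the prior mean
for S in [10, 1000, 100000]:
    z = rng.standard_normal(S)           # samples from the prior p(z) = N(0, 1)
    print(S, likelihood(x, z).mean())    # very noisy for small S: few samples land near x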
Figure 167: Illustration of the intractability caused by sampling from the prior p(z). We wish to estimate the probability p(x_i) by drawing three samples from p(z). For all three samples p(x_i|z) is almost zero, so many more samples from p(z) would be needed. However, using the recognition model q(z|x_i), the search space for z is reduced, leading to a meaningful estimate of p(x_i) with just three samples.
11.4.3 The Evidence Lower Bound
The log likelihood log p_w(x) can be rewritten as follows using the recognition model q_w(z|x). We drop the parameter w in the equations below as the bound holds in general.
\[ \log p(x) = \mathbb{E}_{z \sim q(z|x)} \left[ \log \frac{p(x)\, p(z|x)}{p(z|x)} \right] = \mathbb{E}_{z \sim q(z|x)} \left[ \log \frac{p(x, z)}{q(z|x)} \right] + \mathbb{E}_{z \sim q(z|x)} \left[ \log \frac{q(z|x)}{p(z|x)} \right] = \mathbb{E}_{z \sim q(z|x)} \left[ \log \frac{p(x, z)}{q(z|x)} \right] + \underbrace{KL\big( q(z|x),\, p(z|x) \big)}_{\geq 0} \]
Note that the above reformulation maintains equality. The KL divergence term is intractable as it relies on the true posterior p(z|x). However, the KL divergence is a non-negative distance measure. Thus, we can drop the intractable KL divergence term and obtain a lower bound on the log likelihood, the evidence lower bound (ELBO):
\[ \log p(x) \;\geq\; \underbrace{\mathbb{E}_{z \sim q(z|x)} \left[ \log \frac{p(x, z)}{q(z|x)} \right]}_{\text{ELBO}} \quad\Longrightarrow\quad -\log p(x) \;\leq\; \mathbb{E}_{z \sim q(z|x)} \left[ \log \frac{q(z|x)}{p(x, z)} \right] \]
Alternatively, one can upper bound the negative log likelihood with the negative ELBO. For a better interpretation we expand the negative ELBO into:
\[ \mathbb{E}_{z \sim q(z|x)} \left[ \log \frac{q(z|x)}{p(x, z)} \right] = \mathbb{E}_{z \sim q(z|x)} \left[ \log \frac{q(z|x)}{p(z)} - \log p(x|z) \right] = KL\big( q(z|x),\, p(z) \big) - \mathbb{E}_{z \sim q(z|x)} \big[ \log p(x|z) \big] \]
One can interpret this bound as follows: the KL term keeps the approximate recognition model q(z|x) from diverging too far from the prior distribution p(z), while the second term measures the reconstruction error wrt the input x.
In a VAE, the recognition model is chosen to be a Gaussian whose moments are predicted from x:
\[ q_w(z|x) = \frac{1}{(2\pi)^{Q/2}\, |\Sigma_w(x)|^{1/2}} \exp\!\left( -\frac{1}{2} (z - \mu_w(x))^T\, \Sigma_w(x)^{-1}\, (z - \mu_w(x)) \right) \]
Figure 168: In this example the random variable z on the left follows a standard Gaussian distribution. However, by passing it through a suitable function, its samples can follow a very different distribution: here, with g(z) = z/10 + z/‖z‖, the normally distributed random variable generates the distribution shown on the right. This is how VAEs can be expressive despite a simple prior distribution.
where Σ_w(x) and µ_w(x) are parameterized by a neural network with weights w. Typically, Σ_w(x) = diag(σ²_w(x)) is a diagonal matrix and shares the same backbone with µ_w(x). The recognition model and the prior are restricted to Gaussian distributions so that the KL divergence term in the learning objective is tractable and can be computed directly with an analytical solution.
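For reference, with a standard normal prior p(z) = N(0, I) and the diagonal Gaussian recognition model above, this analytical solution takes the well-known closed form (dropping the dependence on x for brevity):
\[ KL\big( q_w(z|x),\, p(z) \big) = \frac{1}{2} \sum_{q=1}^{Q} \left( \mu_q^2 + \sigma_q^2 - \log \sigma_q^2 - 1 \right) \]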
One final problem with the current framework is that the gradients of the reconstruction term wrt w need to backpropagate through the sampling operation, which is non-differentiable. This makes training the parameters with stochastic gradient methods challenging. The VAE resolves this problem by invoking an alternative method for sampling from q_w(z|x), called the reparameterization trick.
The reparameterization trick moves the sampling step to an input layer and avoids sampling directly from q_w(z|x). Given an input x_i, Σ_w(x_i) and µ_w(x_i) can be computed deterministically. Instead of sampling directly from the distribution N(µ_w(x_i), Σ_w(x_i)), one can sample ε ∼ N(0, I) from a standard Gaussian distribution and construct the latent sample as
\[ z = \mu_w(x_i) + \sigma_w(x_i) \odot \epsilon \]
In this way one can sample the latent variables by first sampling the noise variable ε and then computing z using deterministic functions. With the above trick the reconstruction term can be written as
\[ \mathbb{E}_{z \sim q_w(z|x_i)} \big[ -\log p_w(x_i|z) \big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \big[ -\log p_w\big(x_i \,\big|\, z = \mu_w(x_i) + \sigma_w(x_i) \odot \epsilon \big) \big] \]
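Putting the ELBO, the closed-form KL and the reparameterization trick together, here is a minimal, hedged PyTorch sketch of a VAE training objective; the network sizes, the squared-error reconstruction term (a Gaussian decoder up to constants) and the dummy minibatch are illustrative assumptions.

import torch
import torch.nn as nn

D, Q = 784, 8
enc = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, 2 * Q))   # predicts mu and log sigma^2
dec = nn.Sequential(nn.Linear(Q, 128), nn.ReLU(), nn.Linear(128, D))

def negative_elbo(x):
    mu, logvar = enc(x).chunk(2, dim=-1)
    sigma = torch.exp(0.5 * logvar)
    eps = torch.randn_like(sigma)            # eps ~ N(0, I), sampled at the "input"
    z = mu + sigma * eps                     # reparameterization: z = mu + sigma * eps
    x_hat = dec(z)
    recon = ((x_hat - x) ** 2).sum(dim=-1)   # reconstruction term (Gaussian decoder up to constants)
    kl = 0.5 * (mu ** 2 + sigma ** 2 - logvar - 1).sum(dim=-1)   # closed-form KL(q || N(0, I))
    return (recon + kl).mean()

x = torch.rand(32, D)                        # dummy minibatch
loss = negative_elbo(x)
loss.backward()                              # gradients flow through mu and sigma, not through the sampling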
11.4.6 Applications
Although variational autoencoders assume the prior distribution of the latent space p_w(z) to be a standard normal distribution, with powerful neural networks they can be quite expressive. A VAE can learn a mapping from the standard normal distribution to any distribution of the latent variables using the first few layers of the decoder network. See Fig. 168 for an example. A broad range of recent demonstrations of VAEs is illustrated in Fig. 170.
12.1.2 GANs
Generative Adversarial Networks are currently the most popular generative models. Just like (Variational)
Autoencoders they are non-linear, but GANs are implicit models, do not have an encoder and are trained
differently.
GANs use an adversarial process in which two models are trained simultaneously. Given a prior (e.g. Gaussian) p(z) over latent variables z ∈ R^Q as well as observations x ∈ R^D, the two models can be described as:
• a generator GwG : RQ 7→ RD that gets a sample z ∼ p(z) as input and generates x̂ ∼ pmodel with pmodel
approximating pdata
• a discriminator DwD : RD 7→ [0, 1] that estimates the probability that a sample comes from pdata
The goal of G is to maximize the probability of D making a mistake while the goal of D is to make as few mistakes as possible. Since the two models play against each other, both of them need to improve constantly. This can be described as a two-player minimax game with the value function V(G, D):
\[ V(G, D) = \mathbb{E}_{x \sim p_{data}} \big[ \log D(x) \big] + \mathbb{E}_{z \sim p(z)} \big[ \log\big(1 - D(G(z))\big) \big] \]
V(G, D) describes the expectation of D recognizing samples from p_data and rejecting samples generated by G. The parameters of D are optimized towards maximizing this expectation, while the parameters of G are optimized towards minimizing V(G, D). The optimization of both models is done via backpropagation.
Generator and discriminator can be implemented with different types of neural networks (e.g. MLPs,
CNNs, RNNs). After training only the generator is kept in order to sample from pmodel . The discriminator
can be seen as a learned loss function on x̂ that is only necessary for training the generator. Fig. 171 shows an
overview of the functionality of GANs.
12.1.3 Algorithm
In order to train GANs, the discriminator D is updated first. For k iterations, an equal number of real samples x and latent samples z is drawn. The latent samples are used to generate samples x̂. In each iteration the weights of the discriminator are updated by stochastic gradient ascent on V(G, D).
Afterwards, the generator is updated. After drawing new latent samples z, the weights of G are updated via
stochastic gradient descent. Here, only the second part of V (G, D) is used since the first part of the sum is
independent of the generator.
D is optimized k times (typically with k ∈ {1, . . . , 5}) for each optimization of G in order to maintain it near
its optimal solution, while preventing overfitting if the dataset is finite. For this to work, the generator has to
be changed slowly, i.e. its learning rate must be small. The algorithm stops when the value function converges
to some extreme.
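The following is a hedged PyTorch sketch of this training loop on a toy 2D data distribution; the architectures, optimizers, k = 1, and the non-saturating generator loss (the gradient trick of Fig. 172) are illustrative assumptions rather than the exact algorithm from the lecture.

import torch
import torch.nn as nn

Q, D_dim = 16, 2
G = nn.Sequential(nn.Linear(Q, 64), nn.ReLU(), nn.Linear(64, D_dim))
D = nn.Sequential(nn.Linear(D_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)

def sample_data(n):                          # stand-in for samples from p_data
    return torch.randn(n, D_dim) + 3.0

k = 1
for step in range(1000):
    for _ in range(k):                       # k discriminator updates (gradient ascent on V)
        x = sample_data(64)
        x_fake = G(torch.randn(64, Q)).detach()
        loss_D = -(torch.log(D(x) + 1e-8).mean() + torch.log(1 - D(x_fake) + 1e-8).mean())
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    z = torch.randn(64, Q)                   # one generator update
    loss_G = -torch.log(D(G(z)) + 1e-8).mean()   # non-saturating variant (gradient trick)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()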
12.1.5 Expressiveness
GANs are very expressive. Fig. 173 shows an one-dimensional example of a gaussian pmodel distribution shifting
its mean and increasing the variance in order to adapt to pdata . It can be seen how with increasingly overlapping
distributions pmodel and pdata the discriminator becomes increasingly uncertain until it returns a probability of
about 0.5 everywhere.
Proposition 1. Optimal discriminator
For a fixed generator G, the optimal discriminator is
\[ D_G^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)} \]
Proof. The training criterion for the discriminator D is to maximize (wrt D):
\[ V(G, D) = \int_x p_{data}(x) \log(D(x)) \, dx + \int_z p(z) \log(1 - D(G(z))) \, dz \tag{73} \]
Substituting the generated samples x = G(z), both terms can be written as a single integral over x:
\[ V(G, D) = \int_x p_{data}(x) \log(D(x)) + p_{model}(x) \log(1 - D(x)) \, dx \]
For a fixed x, the integrand has the form a log(y) + b log(1 − y) with a = p_data(x), b = p_model(x) and y = D(x).
Set the derivative to 0 to calculate the maximum:
\[ 0 = \frac{a}{y} - \frac{b}{1 - y} \;\Leftrightarrow\; \frac{a}{y} = \frac{b}{1 - y} \tag{78} \]
\[ \Leftrightarrow\; a - ay = by \;\Leftrightarrow\; a = by + ay \tag{79} \]
\[ \Leftrightarrow\; y = \frac{a}{a + b} \tag{80} \]
This proposition shows that the optimal discriminator will return the probability of an input x being part of
the data distribution pdata .
Here, it is proved that for the value function V(G, D_G^*) to be minimal even though the discriminator is optimal, p_data has to be equal to p_model.
Proposition 2. Convergence
If G and D have enough capacity, at each update step the discriminator is allowed to reach D = D_G^*, and p_model is updated so as to improve
\[ V(p_{model}, D_G^*) = \mathbb{E}_{x \sim p_{data}} \big[ \log D_G^*(x) \big] + \mathbb{E}_{x \sim p_{model}} \big[ \log(1 - D_G^*(x)) \big] \tag{86} \]
\[ \propto \sup_D \int_x p_{model}(x) \log(1 - D(x)) \, dx \tag{87} \]
then p_model converges to p_data.
Proof. The argument of the supremum is convex in p_model. The supremum doesn't change convexity, thus V(p_model, D_G^*) is also convex in p_model with global optimum p_model = p_data, as shown in Theorem 1.
This proposition shows that if G and D are large and complex enough, and the discriminator is allowed to reach optimality each time it is updated, the generator will adapt p_model exactly to p_data.
12.1.7 Mode Collapse
A common failure in GANs is mode collapse. If the generator learns to cover a part of pdata very well, the
gradients regarding this part are low while the gradients regarding the rest of pdata are high. This can lead to
the generator repeatedly learning different parts of the data distribution while unlearning the previous parts.
Thus, the generator is always capable of producing high-quality samples but with very low variability.
There exist different strategies for avoiding mode collapse.
• encourage diversity: in minibatch discrimination the discriminator outputs a probability regarding the whole batch. As a result, a broader range of samples has to be generated close to pdata
• anticipate counterplay: stabilize the training of the generator by anticipating the response of the discriminator. Fig. 174 shows an example of anticipating counterplay. This strategy requires backpropagating the generator gradient through the anticipation steps.
• experience replay: minimizes hopping back and forth between modes
• train multiple GANs: approximate the whole data distribution with multiple GANs covering a range of
modes
• optimization objective: use different optimization strategies in order to improve the learning
The FID has been shown to increase when manipulating images or using images from different distributions. While it can detect mode collapse within a class of images, it is not suitable for measuring mode collapse over the whole data distribution, unless one knows where to find the modes and tests each of them separately.
The regularizer penalizes large gradients of the discriminator with respect to x. This has been shown to lead to
converging instead of oscillating parameters (Fig. 176). When applying gradient penalties to DCGANs, high
quality images can be produced without needing to rely on ideas like the ones mentioned in 12.2.1.
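As a hedged sketch (the exact regularizer used in the lecture may differ, and the weight gamma is an assumption), such a penalty on the discriminator's input gradients can be implemented as follows:

import torch

def gradient_penalty(D, x_real, gamma=10.0):
    # Penalize large gradients of the discriminator D with respect to its input x.
    x_real = x_real.clone().requires_grad_(True)
    d_out = D(x_real).sum()
    grad, = torch.autograd.grad(d_out, x_real, create_graph=True)
    return 0.5 * gamma * (grad.reshape(grad.shape[0], -1) ** 2).sum(dim=1).mean()

# During discriminator training one would add this penalty to the adversarial loss:
# loss_D = loss_D_adversarial + gradient_penalty(D, x_real)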
12.2.4 CycleGAN
CycleGANs are used for image-to-image translation, i.e. they map images from one domain X to another domain Y and back. A cycle consistency loss is used to check how well an image x is reconstructed after being mapped to the other domain and back. See Fig. 177 for a visualization. Possible use cases are the creation of paintings from photographs or of photographs from line drawings.
12.2.6 BigGANs
Really large image sets like ImageNet are difficult for GANs to handle due to their diversity. The idea of BigGANs is to create class-conditional GANs which receive the class label as an additional input. Instead of using regularizers, which might hamper the performance of GANs, the authors monitored the singular values of the weight matrices of the generator and discriminator. By means of these singular values they could determine when mode collapse started to happen and stop training at that point.
A 3D shape can also be represented implicitly by a neural network
\[ f_\theta : \mathbb{R}^3 \times \mathcal{X} \to [0, 1] \]
The first input is a 3D location and X is a condition, e.g. an image; the neural network then outputs an occupancy probability. The representation can then be learned by a VAE in order to create new samples. Moreover, it can be extended to 4-dimensional space and used, e.g., to model conditional surface light fields or Generative Radiance Fields.
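A minimal, hedged PyTorch sketch of such an occupancy function follows; the conditioning on a flat 128-dimensional feature vector and the MLP architecture are illustrative assumptions (an actual image encoder would be needed in practice):

import torch
import torch.nn as nn

class OccupancyNetwork(nn.Module):
    def __init__(self, cond_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),          # occupancy probability in [0, 1]
        )

    def forward(self, p, c):
        # p: (B, 3) 3D locations, c: (B, cond_dim) condition, e.g. an image feature vector
        return self.net(torch.cat([p, c], dim=-1)).squeeze(-1)

f = OccupancyNetwork()
p = torch.rand(8, 3)                                  # query points
c = torch.randn(8, 128)                               # condition features (e.g. from an image encoder)
print(f(p, c).shape)                                  # torch.Size([8])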
Figure 172: The gradient trick. While log(1 − D(G(z))) has a low gradient in the beginning, log(D(G(z)))
has a high gradient and is thus more useful for training.
Figure 174: Anticipating the response of the discriminator. The first row shows the result of a GAN on
a target distribution with multiple modes when using an anticipation strategy. The second row shows how the
GAN jumps between modes if no anticipation strategy is used.
Figure 175: Vector arithmetic on averaged z vectors of samples. z vectors are combined via arithmetic
operations and Gaussian noise is added to produce new samples.
Figure 176: Gradient penalties. (a) the parameters oscillate. (g) and (h) the parameters converge to an equilibrium.
Figure 177: CycleGAN image-to-image translation.
Figure 178: 3D representations of images. 3D-R2N2 uses voxels, PSGN uses points, and Pix2Mesh and AtlasNet use meshes.