A Little Book of Deep Learning

François Fleuret
beta-2023.05.15
François Fleuret is professor of computer science
at the University of Geneva, Switzerland.
List of figures
Foreword

I Foundations

1 Machine Learning
1.1 Learning from data
1.2 Basis function regression
1.3 Under and over-fitting
1.4 Categories of models

2 Efficient computation
2.1 GPUs, TPUs, and batches
2.2 Tensors

3 Training
3.1 Losses
3.2 Autoregressive models
3.3 Gradient descent
3.4 Backpropagation
3.5 Training protocols
3.6 Training data

II Deep models

4 Model components
4.1 The notion of layer
4.2 Linear layers
4.3 Activation functions
4.4 Pooling
4.5 Dropout
4.6 Normalizing layers
4.7 Skip connections
4.8 Attention layers
4.9 Token embedding
4.10 Positional encoding

5 Architectures
5.1 Multi-Layer Perceptrons
5.2 Convolutional networks
5.3 Attention models

Afterword
Bibliography
Index
List of Figures

4.1 1d convolution
4.2 2d convolution
4.3 Stride, padding, and dilation
4.4 Receptive field
4.5 Activation functions
4.6 Max pooling
4.7 Dropout
4.8 Batch normalization
4.9 Skip connections
4.10 Interpretation of the attention operator
4.11 Attention operator
4.12 Multi-Head Attention layer
Foreword
If you did not get this book from its official url
https://fleuret.org/public/lbdl.pdf
François Fleuret
April 21, 2023
Part I
Foundations
Chapter 1
Machine Learning
1.1 Learning from data
The simplest use case for a model trained from
data is when a signal x is accessible, for instance
the picture of a license plate, from which one
wants to predict a quantity y, such as the string
of characters written on the plate.
1.2 Basis function regression
We can illustrate the training of a model in a sim-
ple case where xn and yn are two real numbers,
the loss is the mean squared error
$$\mathscr{L}(w) = \frac{1}{N} \sum_{n=1}^{N} \big(y_n - f(x_n; w)\big)^2, \qquad (1.1)$$
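As an illustration, here is a minimal sketch of such a fit with a polynomial basis, using a closed-form least-squares solution; the synthetic data and the degree K = 5 are hypothetical choices, not taken from the book.

    # Minimal sketch: basis function regression with a polynomial basis,
    # fitted in closed form by least squares, which minimizes the MSE of
    # Equation 1.1. Data and degree are hypothetical.
    import torch

    def fit_poly(x, y, K):
        # Design matrix whose columns are the basis functions 1, x, ..., x^K.
        X = torch.stack([x**k for k in range(K + 1)], dim=1)
        w = torch.linalg.lstsq(X, y.unsqueeze(1)).solution.squeeze(1)
        return w

    def f(x, w):
        return sum(w[k] * x**k for k in range(len(w)))

    x = torch.linspace(0, 1, 100)
    y = torch.sin(2 * torch.pi * x) + 0.1 * torch.randn(100)   # noisy samples
    w = fit_poly(x, y, K=5)
    mse = ((y - f(x, w))**2).mean()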
This is over-fitting.
1.4 Categories of models
We can organize the use of machine learning
models into three broad categories:
Chapter 2
Efficient computation
2.1 GPUs, TPUs, and batches
Graphical Processing Units were originally designed for real-time image synthesis, which requires highly parallel architectures that happen to be well suited to deep models. As their usage for AI has increased, GPUs have been equipped with dedicated sub-components referred to as tensor cores, and deep-learning specialized chips such as Google's Tensor Processing Units (TPUs) have been produced.
to the cache memory near the actual computing
units. Proceeding by batches allows for copying
the model parameters only once, instead of doing
it for every sample. In practice a GPU processes
a batch that fits in memory almost as quickly as
a single sample.
2.2 Tensors
GPUs and deep learning frameworks such as Py-
Torch or JAX manipulate the quantities to pro-
cess by organizing them as tensors, which are
series of scalars arranged along several discrete
axes. They are elements of ℝ^(N_1 × ··· × N_D) that generalize the notion of vector and matrix.
Chapter 3
Training
3.1 Losses
The example of the mean squared error of Equa-
tion 1.1 is a standard loss for predicting a con-
tinuous value.
$$\hat P(Y = y \mid X = x) = \frac{\exp f(x;w)_y}{\sum_z \exp f(x;w)_z}.$$
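In code, this corresponds to applying a softargmax to the logits f(x;w) produced by the model, or equivalently to using a cross-entropy loss; a minimal sketch with hypothetical logits and labels:

    # Minimal sketch: the predicted posterior above is a softargmax (softmax)
    # of the logits f(x; w); the corresponding loss is the cross-entropy.
    # The logits and labels here are random placeholders.
    import torch
    import torch.nn.functional as F

    logits = torch.randn(8, 10)        # f(x; w) for a batch of 8, 10 classes
    y = torch.randint(0, 10, (8,))     # ground-truth class indices

    p_hat = F.softmax(logits, dim=1)   # estimate of P(Y = y | X = x) per class
    loss = F.cross_entropy(logits, y)  # mean of -log p_hat[n, y_n] over the batch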
For density modeling, the standard loss is the likelihood of the data. If f(x;w) is to be interpreted as a normalized log-probability or density, the loss is the opposite of the sum of its values over the training samples.
3.2 Autoregressive models
Many spectacular applications in computer vision and natural language processing have been tackled by modeling the distribution of a high-dimensional discrete vector with the chain rule:
[Figure: the tokens x_1, x_2, ..., x_T of an autoregressive model, each predicted from the preceding ones, starting from a constant token.]
3.3 Gradient descent
Except in specific cases like the linear regression
we saw in § 1.2, the optimal parameters w∗ do
not have a closed form expression. In the general
case the tool of choice to minimize a function
is gradient descent. It consists of initializing the
parameters with a random w0 , and then improv-
ing this estimate by iterating gradient steps, each
consisting of computing the gradient of the loss
with respect to the parameters, and subtracting
a fraction of it
[Figure: gradient descent on a one-dimensional loss landscape ℒ(w).]
As for many algorithms, intuition tends to break down in very high dimensions, and although it seems that this procedure would be very easily trapped in a local minimum, in reality, due to the number of parameters, the design of the models, and the stochasticity in the data, its efficiency is far greater than one could expect.
where

$$\ell_n(w) = L\big(f(x_n; w),\, y_n\big)$$

for some L, and the gradient is then

$$\nabla \mathscr{L}\big|_w(w) = \frac{1}{N} \sum_{n=1}^{N} \nabla \ell_n\big|_w(w). \qquad (3.2)$$
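A minimal sketch of gradient steps computed on mini-batches, which approximate the average of the per-sample gradients of Equation 3.2; the linear model, synthetic data, batch size, and learning rate are hypothetical.

    # Minimal sketch: stochastic gradient descent on mini-batches. The model,
    # data, batch size, and learning rate are hypothetical choices.
    import torch

    x = torch.randn(256, 10)
    y = torch.randn(256, 1)
    model = torch.nn.Linear(10, 1)
    lr = 1e-2

    for _ in range(100):
        idx = torch.randint(0, 256, (32,))       # random mini-batch of 32 samples
        loss = torch.nn.functional.mse_loss(model(x[idx]), y[idx])
        model.zero_grad()
        loss.backward()                          # gradients of the mini-batch loss
        with torch.no_grad():
            for p in model.parameters():
                p -= lr * p.grad                 # gradient step on every parameter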
ent training speeds in different parts of a model.
3.4 Backpropagation
Using gradient descent requires a technical means to compute ∇𝓁|w(w) where 𝓁 = L(f(x;w), y). Given that f and L are both compositions of standard tensor operations, as for any mathematical expression, the chain rule allows one to obtain an expression for it.
[Figure: the forward and backward passes through a layer f_d(·; w_d). The forward pass maps x^(d−1) to x^(d); the backward pass computes ∇𝓁|x^(d−1) from ∇𝓁|x^(d) through a product with the Jacobian J_{f_d}|x, and ∇𝓁|w_d through a product with the Jacobian J_{f_d}|w.]
Forward and backward passes
Consider the simple case of a composition of
mappings
f = f1 ◦f2 ◦···◦fD .
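A minimal sketch of how such a composition is handled in practice: the forward pass applies the mappings in sequence and keeps the intermediate activations, and autograd applies the chain rule in the backward pass; the mappings below are hypothetical.

    # Minimal sketch: a composition of D mappings applied in sequence.
    # The forward pass keeps the intermediate activations; calling backward()
    # applies the chain rule to get the gradients w.r.t. all the parameters.
    import torch

    D = 4
    layers = [torch.nn.Linear(16, 16) for _ in range(D)]

    x = torch.randn(8, 16)
    activations = [x]
    for f_d in layers:                           # forward pass
        activations.append(torch.tanh(f_d(activations[-1])))

    loss = activations[-1].pow(2).mean()
    loss.backward()                              # backward pass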
Resource usage
Regarding the computational cost, as we will see, the bulk of the computation goes into linear operations, which require one matrix product for the forward pass and two for the products by the Jacobians for the backward pass. This makes the latter roughly twice as costly as the former.
Vanishing gradient

A key historical issue when training a large network is that when the gradient propagates backwards through an operator, it may be rescaled by a multiplicative factor, and consequently decrease or increase exponentially when it traverses many layers. When it decreases exponentially, this is called the vanishing gradient, and it may make the training impossible or, in its milder form, cause different parts of the model to be updated at different speeds, degrading their co-adaptation [Glorot and Bengio, 2010].
3.5 Training protocols
Training a deep network requires defining a pro-
tocol to make the most of computation and data,
and ensure that performance will be good on
new data.
[Figure: the training and validation losses as a function of the number of epochs.]
An important design choice is the learning rate schedule during training. The general policy is that the learning rate should be initially large to avoid having the optimization trapped in a bad local minimum early, and that it should be reduced so that the optimized parameter values do not bounce around and can reach a good minimum in a narrow valley of the loss landscape.
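A minimal sketch of such a schedule with a standard PyTorch scheduler; the model, initial learning rate, decay factor, and decay interval are hypothetical choices.

    # Minimal sketch: start with a large learning rate and reduce it during
    # training. All the values here are hypothetical.
    import torch

    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    for epoch in range(100):
        # ... one pass over the training set, with optimizer.step() per batch ...
        optimizer.step()      # placeholder for the per-batch parameter updates
        scheduler.step()      # divides the learning rate by 10 every 30 epochs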
3.6 Training data
One key aspect of deep learning is the steady improvement of performance with the training set size. There has been an accumulation of em-
pirical results showing that performance, for in-
stance estimated through the loss on test data,
improves with the amount of data according to
accurate scaling laws, as long as the model size
increases correspondingly [Kaplan et al., 2020].
tributed nature of these models, that can take ad-
vantage of massively parallel computing devices,
and by the stochastic gradient descent, which
requires only a tiny fraction of the data at a time,
and can operate with data sets whose size is or-
ders of magnitude greater than that of the com-
puting device’s memory.
Part II
Deep models
Chapter 4
Model components
[Figure: example of the graphical notation used for models in this book, here a chain from an input X of size 32×32 through a module f repeated ×K and a module g with meta-parameter n=4 to an output Y of size 4×4.]
• non-default valued meta-parameters are
added in blue on their right,
4.2 Linear layers
Linear layers are the most important modules
in terms of computation and number of parame-
ters. They benefit from decades of research and
engineering in algorithmic and chip design for
matrix operations.
$$\forall d_1, \ldots, d_K, \quad Y[d_1, \ldots, d_K] = W\,X[d_1, \ldots, d_K] + b.$$
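A minimal example of such a fully connected layer in PyTorch, applied to the last axis of a tensor of arbitrary leading shape; the sizes are hypothetical.

    # Minimal sketch: an affine layer applied to the last axis of a tensor.
    import torch

    linear = torch.nn.Linear(in_features=50, out_features=25)
    X = torch.randn(16, 8, 50)     # any leading shape, last dimension 50
    Y = linear(X)                  # shape (16, 8, 25)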
Convolutional layers
A linear layer can take as input an arbitrarily
shaped tensor by reshaping it into a vector, as
long as it has the right number of coefficients.
However, such a layer is poorly adapted to dealing with large tensors, since the number of parameters and the number of operations are proportional to the product of the input and output dimensions. For instance, to process an RGB image of size 256×256 as input and compute a result of the same size, it would require ≃ 4×10^10 parameters and multiplications.
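For comparison, a convolutional layer shares the same small affine mapping across all positions, which keeps the parameter count low; a minimal sketch with hypothetical channel counts and kernel size:

    # Minimal sketch: a 2d convolution on a 3-channel 256x256 image with a
    # shared 3x3 kernel uses only out_channels * (in_channels * 3 * 3 + 1)
    # parameters, instead of the ~4e10 of a fully connected layer.
    import torch

    conv = torch.nn.Conv2d(in_channels=3, out_channels=3, kernel_size=3, padding=1)
    X = torch.randn(1, 3, 256, 256)
    Y = conv(X)                                             # shape (1, 3, 256, 256)
    n_params = sum(p.numel() for p in conv.parameters())    # 3 * (3*3*3 + 1) = 84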
Figure 4.1: A 1d convolution (left) takes as input
a D×T tensor X, applies the same affine mapping
ϕ(·;w) to every sub-tensor of shape D×K, and stores
the resulting D′ ×1 tensors into Y . A 1d transposed
convolution (right) takes as input a D×T tensor, ap-
plies the same affine mapping ψ(·;w) to every sub-
tensor of shape D×1, and sums the shifted resulting
D′ ×K tensors. Both can process inputs of different
size.
Figure 4.2: A 2d convolution (left) takes as input a
D×H ×W tensor X, applies the same affine map-
ping ϕ(·;w) to every sub-tensor of shape D×K ×L,
and stores the resulting D′ ×1×1 tensors into Y . A
2d transposed convolution (right) takes as input a
D×H ×W tensor, applies the same affine mapping
ψ(·;w) to every D×1×1 sub-tensor, and sums the
shifted resulting D′ ×K ×L tensors into Y .
Figure 4.3: Beside its kernel size and number of input / output channels, a convolution admits three meta-parameters: the stride s (left) modulates the step size when going through the input tensor, the padding p (top right) specifies how many zero entries are added around the input tensor before processing it, and the dilation d (bottom right) parameterizes the index count between coefficients of the filter.
the same operator everywhere.
than one. Its default value is 0.
Figure 4.4: Given an activation in a series of convolu-
tion layers, here in red, its receptive field is the area in
the input signal, in blue, that modulates its value. Each
intermediate convolutional layer increases the width
and height of that area by roughly those of the kernel.
A converse operation is the transposed convolution, which is also a localized affine operator, defined by similar meta- and trainable parameters as the convolution, but which applies, for instance in the 1d case, an affine mapping ψ(·;w): ℝ^(D×1) → ℝ^(D′×K) to every D×1 sub-tensor of the input, and sums the shifted D′×K resulting tensors to compute its output. Such an operator increases the size of the signal and can be understood intuitively as a synthesis process (see Figure 4.1, right, and Figure 4.2, right).
4.3 Activation functions
If a network combined only linear components, it would itself be a linear operator, so it is essential to have non-linear operations. They are implemented in particular with activation functions, which are layers that transform every component of the input tensor individually through a mapping, resulting in a tensor of the same shape.
[Figure 4.5: activation functions, including Tanh and ReLU.]

gelu(x) = x P(Z ≤ x),
4.4 Pooling
A classical strategy to reduce the signal size is to
use a pooling operation that combines multiple
activations into one that ideally summarizes the
information. The most standard operation of this
class is the max pooling layer which, similarly
to convolution, can operate in 1d and 2d, and is
defined by a kernel size.
[Figure 4.6: 1d max pooling.]
max over the sub-tensors. This is a linear opera-
tion, while max pooling is not.
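A minimal sketch of max and average pooling in PyTorch; the kernel size and tensor shape are hypothetical.

    # Minimal sketch: 2d max pooling and average pooling with a 2x2 kernel,
    # which halves the spatial size.
    import torch

    X = torch.randn(1, 64, 32, 32)
    Y_max = torch.nn.MaxPool2d(kernel_size=2)(X)   # (1, 64, 16, 16), max per 2x2 block
    Y_avg = torch.nn.AvgPool2d(kernel_size=2)(X)   # (1, 64, 16, 16), average per 2x2 block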
4.5 Dropout
Some layers have been designed to explicitly
facilitate training, or improve the quality of the
learned representations.
Figure 4.7: Dropout can process a tensor of arbitrary
shape. During training (left), it sets activations at ran-
dom to zero with probability p and applies a multiply-
ing factor to keep the expected values unchanged. Dur-
ing test (right), it keeps all the activations unchanged.
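In code, the train/test behavior of Figure 4.7 corresponds to switching the module between training and evaluation modes; a minimal sketch with a hypothetical drop probability:

    # Minimal sketch: dropout zeroes activations at random during training and
    # rescales the remaining ones to preserve the expectation; at test time it
    # is the identity.
    import torch

    dropout = torch.nn.Dropout(p=0.5)
    X = torch.ones(2, 8)

    dropout.train()
    Y_train = dropout(X)   # about half the entries zeroed, the others scaled to 2.0

    dropout.eval()
    Y_test = dropout(X)    # identical to X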
4.6 Normalizing layers
An important class of operators to facilitate the
training of deep architectures are the
normalizing layers
which force the empirical mean and
variance of groups of activations.
[Figure 4.8: batch normalization (left) and layer normalization (right). Both normalize activations as (x − m̂)/√(v̂ + ϵ) and then rescale them component-wise as x ⊙ γ + β; they differ in the group of activations, along the batch, channel D, and spatial H, W axes, over which the statistics m̂ and v̂ are computed.]
viation γ_d

$$z_{b,d} = \frac{x_{b,d} - \hat m_d}{\sqrt{\hat v_d + \epsilon}}, \qquad y_{b,d} = \gamma_d\, z_{b,d} + \beta_d.$$
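A minimal sketch of a batch normalization layer matching these formulas; the feature dimension and batch size are hypothetical.

    # Minimal sketch: BatchNorm1d normalizes every component with the empirical
    # mean and variance computed over the batch, then rescales and shifts it
    # with the learnt parameters gamma (weight) and beta (bias).
    import torch

    bn = torch.nn.BatchNorm1d(num_features=32)
    X = torch.randn(64, 32)
    Y = bn(X)    # per component: (x - mean) / sqrt(var + eps) * gamma + beta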
4.7 Skip connections
Another technique that mitigates the vanishing gradient and allows the training of deep architectures is the use of skip connections [Long et al., 2014;
Ronneberger et al., 2015]. They are not layers
per se, but an architectural design in which out-
puts of some layers are transported as-is to other
layers further in the model, bypassing process-
ing in-between. This unmodified signal can be
concatenated or added to the input to the layer
the connection branches into (see Figure 4.9). A
particular type of skip connections are the
residual connections
which combine the signal with
a sum, and usually skip only a few layers (see
Figure 4.9, right).
[Figure 4.9: skip connections (left), which transport the output of an earlier layer unchanged to a later point in the computation, and residual connections (right), which add the unmodified signal to the output of a few skipped layers.]
Their role can also be to facilitate multi-scale rea-
soning in models that reduce the signal size be-
fore re-expanding it, by connecting layers with
compatible size. In the case of residual connec-
tions, they may also facilitate the learning by
simplifying the task to finding a differential im-
provement instead of a full update.
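A minimal sketch of a residual connection wrapping a couple of layers; the dimensions and inner layers are hypothetical.

    # Minimal sketch: a residual connection adds the unmodified input back to
    # the output of a few layers.
    import torch

    class ResidualBlock(torch.nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.f = torch.nn.Sequential(
                torch.nn.Linear(dim, dim),
                torch.nn.ReLU(),
                torch.nn.Linear(dim, dim),
            )

        def forward(self, x):
            return x + self.f(x)    # the input x is transported as-is and added back

    block = ResidualBlock(64)
    y = block(torch.randn(8, 64))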
4.8 Attention layers
In many applications there is a need for a pro-
cessing able to combine local information at lo-
cations far apart in a tensor. This can be for
instance distant details for coherent and realistic
image synthesis, or words at different positions
in a paragraph to make a grammatical or seman-
tic decision in natural language processing.
Attention operator
Given
Y = att(K,Q,V )
[Figure 4.11: the attention operator. The attention matrix A is computed from Q and K with a softargmax (an exp followed by a 1/Σ_k normalization), optionally masked by a Boolean matrix M, passed through dropout, and multiplied by V to produce Y.]
matrix can be masked by multiplying it, before the softargmax normalization, by a Boolean matrix M. This allows, for instance, making the operator causal by taking M full of 1s below the diagonal and zeros above, preventing Y_q from depending on keys and values of indices k greater than q. Second, the attention matrix is processed by a dropout layer (see § 4.5) before being multiplied by V, providing the usual benefits during training.
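A minimal sketch of the attention operator as described above, with optional causal masking and dropout; it uses the usual scaling of the scores by the square root of their dimension, and implements the mask by setting the excluded scores to −∞ before the softargmax, which has the same effect as the Boolean multiplication. The sizes are hypothetical.

    # Minimal sketch of the attention operator: the attention matrix A is a
    # (possibly masked) softargmax of the Q K^T scores, passed through dropout,
    # and used to average the rows of V.
    import math
    import torch

    def att(Q, K, V, causal=False, p_drop=0.0):
        A = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
        if causal:
            keep = torch.ones(A.shape[-2:], dtype=torch.bool).tril()
            A = A.masked_fill(~keep, float("-inf"))   # Y_q ignores indices k > q
        A = A.softmax(dim=-1)
        A = torch.nn.functional.dropout(A, p=p_drop)
        return A @ V

    Q = torch.randn(10, 64)
    K = torch.randn(12, 64)
    V = torch.randn(12, 64)
    Y = att(Q, K, V)    # shape (10, 64)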
• W^Q of size H × D × D_QK,
• W^K of size H × D × D_QK, and
• W^V of size H × D × D_V,
Figure 4.12: The Multi-head Attention layer applies
for each of its h = 1,...,H heads a parametrized lin-
ear transformation to individual elements of the input
sequences X Q ,X K ,X V to get sequences Q,K,V that
are processed by the attention operator to compute Yh .
These H sequences are concatenated along features,
and individual elements are passed through one last
linear operator to get the final result sequence Y .
• X^Q of size N^Q × D,
• X^K of size N^KV × D, and
• X^V of size N^KV × D,

Y = (Y_1 | ··· | Y_H) W^O.
4.9 Token embedding
In many situations, we need to convert discrete
tokens into vectors. This can be done with an
embedding layer
which consists of a lookup table
that directly maps integers to vectors.
$$\forall d_1, \ldots, d_K, \quad Y[d_1, \ldots, d_K] = M\big[X[d_1, \ldots, d_K]\big].$$
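A minimal sketch of an embedding layer; the vocabulary size and embedding dimension are hypothetical.

    # Minimal sketch: an embedding layer is a lookup table M mapping every
    # integer token index to a learnt vector.
    import torch

    embed = torch.nn.Embedding(num_embeddings=1000, embedding_dim=64)
    X = torch.randint(0, 1000, (2, 16))   # a batch of two token sequences
    Y = embed(X)                          # shape (2, 16, 64)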
4.10 Positional encoding
While the processing of a fully connected layer
is specific to both the positions of the features
in the input tensor, and to the position of the
resulting activation in the output tensor, convo-
lutional layers and multi-head attention layers
are oblivious to the absolute position in the ten-
sor. This is key to their strong invariance and inductive bias, which is beneficial for dealing with a stationary signal.
D, Vaswani et al. [2017] add

$$\text{pos-enc}[t,d] =
\begin{cases}
\sin\!\big(t \,/\, T^{d/D}\big) & \text{if } d \in 2\mathbb{N} \\
\cos\!\big(t \,/\, T^{(d-1)/D}\big) & \text{otherwise,}
\end{cases}$$

with T = 10^4.
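A minimal sketch computing this sinusoidal positional encoding; the sequence length and dimension D are hypothetical.

    # Minimal sketch of the positional encoding above: even dimensions get
    # sin(t / T^(d/D)) and odd dimensions get cos(t / T^((d-1)/D)).
    import torch

    def pos_enc(seq_len, D, T=10_000.0):
        t = torch.arange(seq_len).unsqueeze(1).float()   # (seq_len, 1)
        d = torch.arange(D).unsqueeze(0)                 # (1, D)
        exponent = (2 * (d // 2)).float() / D            # d/D if d even, (d-1)/D if odd
        angle = t / T ** exponent
        return torch.where(d % 2 == 0, torch.sin(angle), torch.cos(angle))

    pe = pos_enc(seq_len=128, D=64)   # shape (128, 64), added to the token embeddings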
Chapter 5
Architectures
5.1 Multi-Layer Perceptrons
The simplest deep architecture is the
Multi-Layer Perceptron
(MLP), which takes the form
of a succession of fully connected layers sepa-
rated by activation functions. See an example in Figure 5.1. For historical reasons, in such a
model, the number of hidden layers refers to the
number of linear layers, excluding the last one.
Figure 5.1: This multi-layer perceptron takes as input a one-dimensional tensor of size 50, and is composed of three fully connected layers with outputs of dimensions respectively 25, 10, and 2, the first two followed by ReLU layers.
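A possible PyTorch implementation of the MLP of Figure 5.1:

    # The MLP of Figure 5.1: input of size 50, hidden layers of sizes 25 and 10
    # followed by ReLU, and an output of size 2.
    import torch

    mlp = torch.nn.Sequential(
        torch.nn.Linear(50, 25),
        torch.nn.ReLU(),
        torch.nn.Linear(25, 10),
        torch.nn.ReLU(),
        torch.nn.Linear(10, 2),
    )

    y = mlp(torch.randn(16, 50))   # shape (16, 2)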
any continuous function f can be approximated
arbitrarily well uniformly on a compact by a
model of the form l2 ◦σ◦l1 where l1 and l2 are
affine. Such a model is an MLP with a single hid-
den layer, and this result implies that it can ap-
proximate anything of practical value. However
this approximation holds if the dimension of the
first linear layer’s output can be arbitrarily large.
5.2 Convolutional networks
The standard architecture for processing images
is a convolutional network, or convnet, that
combines multiple convolutional layers, either
to reduce the signal size before it can be pro-
cessed by fully connected layers, or to output a
2d signal also of large size.
LeNet-like
The original LeNet model for image classifica-
tion [LeCun et al., 1998] combines a series of 2d
convolutional layers and max pooling layers that
play the role of feature extractor, with a series of
fully connected layers which act like an MLP and perform the classification per se. See Figure 5.2
for an example.
Residual networks
Standard convolutional neural networks that fol-
low the architecture of the LeNet family are not
easily extended to deep architectures and suffer
[Figure 5.2 diagram: X (1×28×28) → conv-2d k=5 → (32×24×24) → maxpool k=3 → (32×8×8) → relu → conv-2d k=5 → (64×4×4) → maxpool k=2 → (64×2×2) → relu → reshape → (256) → fully-conn → (200) → relu → fully-conn → (10) → P̂(Y), with the convolutional part acting as feature extractor and the fully connected part as classifier.]
Figure 5.2: Example of a small LeNet-like network for classifying 28×28 grayscale images of handwritten digits [LeCun et al., 1998]. Its first half is convolutional, and alternates convolutional layers per se and max pooling layers, reducing the signal dimension from 28×28 scalars to 256. Its second half processes this 256-dimension feature vector through a one-hidden-layer perceptron to compute 10 logit scores corresponding to the ten possible digits.
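A possible PyTorch implementation of the LeNet-like network of Figure 5.2:

    # The LeNet-like network of Figure 5.2: a convolutional feature extractor
    # followed by a one-hidden-layer perceptron producing 10 logits.
    import torch

    lenet = torch.nn.Sequential(
        torch.nn.Conv2d(1, 32, kernel_size=5),    # 1x28x28 -> 32x24x24
        torch.nn.MaxPool2d(kernel_size=3),        # -> 32x8x8
        torch.nn.ReLU(),
        torch.nn.Conv2d(32, 64, kernel_size=5),   # -> 64x4x4
        torch.nn.MaxPool2d(kernel_size=2),        # -> 64x2x2
        torch.nn.ReLU(),
        torch.nn.Flatten(),                       # -> 256
        torch.nn.Linear(256, 200),
        torch.nn.ReLU(),
        torch.nn.Linear(200, 10),                 # 10 logit scores
    )

    logits = lenet(torch.randn(1, 1, 28, 28))     # shape (1, 10)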
[Figure 5.3 diagram: X (C×H×W) → conv-2d k=1 → (C/2×H×W) → batchnorm → relu → conv-2d k=3 p=1 → batchnorm → relu → conv-2d k=1 → (C×H×W) → batchnorm → addition of X → relu → Y (C×H×W).]
Figure 5.3: A residual block.
[Figure 5.4 diagram: main branch X (C×H×W) → conv-2d k=1 → (C/S×H×W) → batchnorm → relu → conv-2d k=3 s=S p=1 → (C/S × H/S × W/S) → batchnorm → relu → conv-2d k=1 → (4C/S × H/S × W/S) → batchnorm; shortcut branch X → conv-2d k=1 s=S → batchnorm; the two branches are added and passed through a relu to produce Y (4C/S × H/S × W/S).]
Figure 5.4: A downscaling residual block. It admits a
meta-parameter S, the stride of the first convolution
layer, which modulates the reduction of the tensor size.
[Figure 5.5 diagram: X (3×224×224) → conv-2d k=7 s=2 p=3 → (64×112×112) → batchnorm → relu → maxpool k=3 s=2 p=1 → (64×56×56) → dresblock S=1 → (256×56×56) → resblock ×2 → dresblock S=2 → (512×28×28) → resblock ×3 → dresblock S=2 → (1024×14×14) → resblock ×5 → dresblock S=2 → (2048×7×7) → resblock ×2 → avgpool k=7 → (2048×1×1) → reshape → (2048) → fully-conn → (1000) → P̂(Y).]
Figure 5.5: Structure of the ResNet-50 [He et al., 2015].
tation. However the parameter count of a con-
volutional layer, and its computational cost, are
quadratic with the number of channels. This
residual block mitigates this problem by first re-
ducing the number of channels with a 1×1 con-
volution, then operating spatially with a 3×3
convolution on this reduced number of chan-
nels, and then up-scaling the number of chan-
nels, again with a 1×1 convolution.
blocks. Surprisingly, in the first section, there
is no downscaling, only an increase of the num-
ber of channels by a factor of 4. The output of
the last residual block is 2048×7×7, which is
converted to a vector of dimension 2048 by an
average pooling of kernel size 7×7, and then
processed through a fully connected layer to get
the final logits, here for 1000 classes.
5.3 Attention models
As stated in § 4.8, many applications, in partic-
ular from natural language processing, greatly
benefit from models that include attention mech-
anisms. The architecture of choice for such tasks,
which has been instrumental in recent advances
in deep learning, is the Transformer proposed
by Vaswani et al. [2017].
Transformer
The original Transformer, pictured on Figure 5.7,
was designed for sequence-to-sequence trans-
lation. It combines an encoder that processes
the input sequence to get a refined representa-
tion, and an autoregressive decoder that gener-
ates each token of the result sequence, given the
encoder’s representation of the input sequence,
and the output tokens generated so far. As the
residual convolutional networks of § 5.2, both
the encoder and the decoder of the Transformer
are sequences of compounded blocks built with
residual connections.
[Figure 5.6 diagram: in both blocks, the input goes through a layernorm, a multi-head attention layer, and a residual addition, then through a layernorm, a two-layer MLP (fully-conn, gelu, fully-conn), a dropout, and a second residual addition, producing Y. In the self-attention block, Q, K, and V are all computed from X^QKV; in the cross-attention block, Q is computed from X^Q and K, V from X^KV.]
Figure 5.6: Self-attention block (left) and
cross-attention block
(right). These specific structures proposed by
Radford et al. [2018] differ slightly from the original
architecture of Vaswani et al. [2017], in particular by
having the layer normalization first in the residual
blocks.
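A minimal sketch of such a pre-norm self-attention block, using PyTorch's built-in multi-head attention; the dimensions, the number of heads, the 4× expansion of the MLP, and the dropout probability are hypothetical choices.

    # Minimal sketch of a self-attention block as in Figure 5.6 (left):
    # layernorm, multi-head attention, residual addition, then layernorm,
    # a two-layer MLP with GELU, dropout, and a second residual addition.
    import torch

    class SelfAttentionBlock(torch.nn.Module):
        def __init__(self, dim, num_heads, p_drop=0.1):
            super().__init__()
            self.ln1 = torch.nn.LayerNorm(dim)
            self.mha = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.ln2 = torch.nn.LayerNorm(dim)
            self.mlp = torch.nn.Sequential(
                torch.nn.Linear(dim, 4 * dim),
                torch.nn.GELU(),
                torch.nn.Linear(4 * dim, dim),
                torch.nn.Dropout(p_drop),
            )

        def forward(self, x):
            z = self.ln1(x)
            x = x + self.mha(z, z, z, need_weights=False)[0]   # residual connection
            x = x + self.mlp(self.ln2(x))                      # residual connection
            return x

    block = SelfAttentionBlock(dim=64, num_heads=4)
    y = block(torch.randn(2, 16, 64))   # a batch of two sequences of length 16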
[Figure 5.7: the original encoder-decoder Transformer. The encoder embeds the input tokens X_1,...,X_T, adds a positional encoding, and applies N self-attention blocks to produce a representation Z_1,...,Z_T of size T×D. The decoder embeds the shifted output tokens 0,Y_1,...,Y_{S−1}, adds a positional encoding, and applies N pairs of causal self-attention and cross-attention blocks, the latter attending to Z_1,...,Z_T; a final fully connected layer maps the S×D result to the logits P̂(Y_1),...,P̂(Y_S | Y_{s<S}) over the vocabulary of size V.]
input two sequences, one to compute the queries,
and one the keys and values.
[Figure 5.8: the GPT-style decoder-only model. The shifted tokens 0,X_1,...,X_{T−1} are embedded, combined with a positional encoding, processed by N causal self-attention blocks, and mapped by a final fully connected layer to the logits P̂(X_1),...,P̂(X_T | X_{t<T}) over the vocabulary of size V.]
[Figure 5.9: the Vision Transformer. The M image patches X_1,...,X_M (each of dimension 3P²) are mapped by a linear embedding W^E to E_1,...,E_M, a learnt token E_0 is prepended, a positional encoding is added, and the (M+1)×D sequence is processed by N self-attention blocks; an MLP readout maps the first output token Z_0 to the C class logits P̂(Y).]
Vision Transformer
Transformers have been put to use for image
classification with the Vision Transformer (ViT)
model [Dosovitskiy et al., 2020], see Figure 5.9.
Part III
Applications
Chapter 6
Prediction
6.1 Image denoising
A direct application of deep models to image pro-
cessing is to recover from degradation by using
the redundancy in the statistical structure of im-
ages. The petals of a sunflower on a grayscale
picture can be colored with high confidence, and
the texture of a geometric shape such as a table
on a low-light grainy picture can be corrected
by averaging it over a large area likely to be
uniform.
6.2 Image classification
Image classification is the simplest strategy to
extract semantics from an image, and consists
of predicting a class among a finite predefined
number of classes, given an input image.
6.3 Object detection
A more complex task for image understanding
is object detection, in which case the objective
is, given an input image, to predict the classes
and positions of objects of interest.
[Figure: feature maps Z_1, ..., Z_S of decreasing resolution computed from the input image X.]
Figure 6.2: Examples of object detection with the Single-
Shot Detector [Liu et al., 2015].
it. This results in a non-ambiguous matching of any bounding box (x1, x2, y1, y2) to a triple (s, h, w), determined respectively by max(x2 − x1, y2 − y1), (y1 + y2)/2, and (x1 + x2)/2.
regression of geometric quantities.
6.4 Semantic segmentation
The finest grain prediction task for image under-
standing is semantic segmentation, which con-
sists of predicting for every pixel the class of the
object it belongs to. This can be achieved with
a standard convolutional neural network, which outputs a convolutional map with as many channels as there are classes, carrying the estimated logits for every pixel.
Figure 6.3: Semantic segmentation results with the
Pyramid Scene Parsing Network [Zhao et al., 2016].
backbone, concatenate the resulting multi-scale
representation after upscaling, before making
the final per-pixel prediction [Zhao et al., 2016].
6.5 Speech recognition
Speech recognition consists of converting a
sound sample into a sequence of words. There
have been plenty of approaches to this problem
historically, but a conceptually simple and recent
one proposed by Radford et al. [2022] consists of
casting it as a sequence-to-sequence translation
and then solving it with a standard attention-
based Transformer, that we described in § 5.3.
6.6 Text-image representations
A powerful approach to image understanding
consists of learning consistent image and text
representations.
resulting in an N×N matrix of similarity scores
Figure 6.4: The CLIP text-image embedding [Radford et al., 2021] allows zero-shot prediction by predicting which class description embedding is the most consistent with the image embedding.
Chapter 7
Synthesis
7.1 Text generation
The standard approach to text synthesis is to use
an attention-based autoregressive model. The
most successful in this domain is the GPT [Radford et al., 2018], which we described in § 5.3.
7.2 Image generation
Multiple deep methods have been developed to model and sample from a high-dimensional density.
A powerful one for image synthesis relies on
inverting a diffusion process.
re-normalizing the variance to 1. This process
reduces exponentially the importance of x0 , and
xt ’s density can rapidly be approximated with a
normal.
The missing bits
niques now used for deep architectures such as
rectifiers and gating, a form of skip connections
which are modulated dynamically.
Autoencoder
An autoencoder is a model that maps the input
signal, possibly of high dimension, to a low-
dimension latent representation, and maps it
back to the original signal, ensuring that infor-
mation has been preserved. We saw it in § 6.1
for denoising, but it can be used to discover au-
tomatically a meaningful low-dimension param-
eterization of the data manifold. The
Variational Autoencoder
(VAE) proposed by Kingma and
Welling [2013] has a similar structure, but im-
poses through the loss a distribution on the latent representation. After training, it allows generating signals by sampling the latent representation
according to this imposed distribution, and then
mapping back through the decoder.
Reinforcement Learning
Many problems necessitate a model to estimate
an accumulated long-term reward given action
choices and an observable state, and what ac-
tions to choose to maximize that reward.
Reinforcement Learning
(RL) is the standard frame-
work to formalize such problems, and strategy
games or robotic control, for instance, can be for-
mulated in it. Deep models, in particular convo-
lutional neural networks, have demonstrated ex-
cellent performance for this class of tasks [Mnih
et al., 2015].
Fine-tuning
As we saw in § 6.3 for object detection, or in § 6.4
for semantic segmentation, fine-tuning deep architectures is an efficient strategy to deal with small training sets. Besides, due to the dramatic
increase in the size of architectures, particularly
that of Large Language Models, training a single
model can cost several millions of dollars, and
fine-tuning is a crucial, and often the only way,
to achieve high performance on a specific task.
the feature vectors they carry.
Self-supervised training
As stated in § 7.1, even though they are trained
only to predict the next word,
Large Language Models
trained on large unlabeled data-sets such
as GPT (see § 5.3) are able to solve various tasks
such as identifying the grammatical role of a
word, answering questions, or even translating
from one language to another [Radford et al., 2019].
Such models constitute one category of a larger
class of methods that fall under the name of self-supervised learning, and try to take advantage
of unlabeled data-sets [Balestriero et al., 2023].
The key principle of these methods is to define a
task that does not require labels but necessitates
feature representations which are useful for the
real task of interest, for which a small labeled
data set exists. In computer vision, for instance,
a standard approach consists of optimizing im-
age features so that they are invariant to data
transformations that do not change the semantic
content of the image, while being statistically
uncorrelated [Zbontar et al., 2021].
Afterword
Bibliography
J. Devlin, M. Chang, K. Lee, and K. Toutanova.
BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding.
CoRR, abs/1810.04805, 2018. [pdf]. 21, 100
A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al.
An Image is Worth 16x16 Words: Transform-
ers for Image Recognition at Scale. CoRR,
abs/2010.11929, 2020. [pdf]. 99, 100
K. Fukushima. Neocognitron: A self-organizing
neural network model for a mechanism of
pattern recognition unaffected by shift in po-
sition. Biological Cybernetics, 36(4):193–202,
April 1980. [pdf]. 2
Y. Gal and Z. Ghahramani. Dropout as
a Bayesian Approximation: Representing
Model Uncertainty in Deep Learning. CoRR,
abs/1506.02142, 2015. [pdf]. 66
X. Glorot and Y. Bengio. Understanding the dif-
ficulty of training deep feedforward neural
networks. In International Conference on Arti-
ficial Intelligence and Statistics (AISTATS), 2010.
[pdf]. 40, 51
X. Glorot, A. Bordes, and Y. Bengio. Deep Sparse
Rectifier Neural Networks. In International
Conference on Artificial Intelligence and Statis-
tics (AISTATS), 2011. [pdf]. 59
A. Gomez, M. Ren, R. Urtasun, and R. Grosse.
The Reversible Residual Network: Backprop-
agation Without Storing Activations. CoRR,
abs/1707.04585, 2017. [pdf]. 39
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza,
et al. Generative Adversarial Networks. CoRR,
abs/1406.2661, 2014. [pdf]. 127
K. He, X. Zhang, S. Ren, and J. Sun. Deep Resid-
ual Learning for Image Recognition. CoRR,
abs/1512.03385, 2015. [pdf]. 21, 71, 72, 89, 91
D. Hendrycks and K. Gimpel. Gaussian Error
Linear Units (GELUs). CoRR, abs/1606.08415,
2016. [pdf]. 61
D. Hendrycks, K. Zhao, S. Basart, et al. Natural
Adversarial Examples. CoRR, abs/1907.07174,
2019. [pdf]. 117
J. Ho, A. Jain, and P. Abbeel. Denoising Diffusion
Probabilistic Models. CoRR, abs/2006.11239,
2020. [pdf]. 122, 123, 124
S. Hochreiter and J. Schmidhuber. Long Short-
Term Memory. Neural Computation, 9(8):1735–
1780, 1997. [pdf]. 126
S. Ioffe and C. Szegedy. Batch Normalization: Ac-
celerating Deep Network Training by Reduc-
ing Internal Covariate Shift. In International
Conference on Machine Learning (ICML), 2015.
[pdf]. 67
W. Liu, D. Anguelov, D. Erhan, et al. SSD: Single
Shot MultiBox Detector. CoRR, abs/1512.02325,
2015. [pdf]. 106, 108
A. Radford, J. Kim, T. Xu, et al. Robust Speech
Recognition via Large-Scale Weak Supervi-
sion. CoRR, abs/2212.04356, 2022. [pdf]. 114
J. Sevilla, L. Heim, A. Ho, et al. Compute Trends
Across Three Eras of Machine Learning. CoRR,
abs/2202.05924, 2022. [pdf]. 9, 21
H. Zhao, J. Shi, X. Qi, et al. Pyramid Scene
Parsing Network. CoRR, abs/1612.01105, 2016.
[pdf]. 112, 113
Index
1d convolution, 55
2d convolution, 55
activation, 22, 37
activation function, 59, 85
activation map, 56
Adam, 34
artificial neural network, 8, 11
attention layer, 74
attention operator, 75
autoencoder, 127
autograd, 38
autoregressive model, 29, 120
average pooling, 62
backpropagation, 38
backward pass, 38
basis function regression, 14
batch, 20, 34
batch normalization, 67
bias vector, 50, 55
BPE, 30, 114, 120
Byte Pair Encoding, 30
cache memory, 20
capacity, 15
causal, 30, 77
causal model, 29, 78, 97
channel, 22
checkpointing, 39
classification, 17
CLIP, 116
CLS token, 100
computational cost, 39
contrastive loss, 26, 116
convnet, 87
convolutional layer, 53, 87
convolutional network, 87
cross-attention block, 80, 95
cross-entropy, 25
filter, 55
fine tuning, 129
flops, 21
forward pass, 37
foundation models, 121
FP32, 21
framework, 22
fully connected layer, 50, 85, 87
GAN, 127
GELU, 61
Generative Adversarial Networks, 127
generator, 127
GNN, 129
GPT, 98, 116, 120, 130
GPU, 8, 19
gradient descent, 31, 33, 36
gradient step, 31
Graph Neural Network, 129
Graphical Processing Unit, 8, 19
ground truth, 17
hidden layer, 85
hidden state, 126
image classification, 105
image processing, 87
image synthesis, 74, 122
inductive bias, 16, 42, 53
max pooling, 62
mean squared error, 14, 25
memory requirement, 39
memory speed, 20
meta parameter, 13, 41
metric learning, 26
MLP, 85, 95
model, 12
Multi-Head Attention, 78, 94
multi-layer perceptron, 85
padding, 55, 62
parameter, 12
parametric model, 12
peak performance, 21
pooling, 62
positional encoding, 82, 97
posterior probability, 25
pre-trained model, 109, 113
query, 75
random initialization, 51
receptive field, 56, 57, 106
rectified linear unit, 59, 127
recurrent neural network, 126
regression, 17
reinforcement learning, 128
ReLU, 59
residual block, 90
residual connection, 71, 89
residual network, 71, 89
resnet, 71, 89
ResNet-50, 89
reversible layer, 39
RL, 128
RNN, 126
scaling laws, 44
self-attention block, 80, 94, 95
self-supervised learning, 130
semantic segmentation, 111
SGD, 34
Single Shot Detector, 106
skip connection, 71, 112, 127
softargmax, 25, 76
softmax, 25
speech recognition, 114
SSD, 106
stochastic gradient descent, 34, 45
stride, 56, 62
supervised learning, 18
tanh, 60
tensor, 22
tensor cores, 20
Tensor Processing Units, 20
test set, 41
text synthesis, 120
tokenizer, 30, 114, 120
tokens, 28
TPU, 20
trainable parameter, 12
training, 12
training set, 12, 24, 41, 44
Transformer, 71, 75, 94, 96, 114
transposed convolution, 58
under-fitting, 15
universal approximation theorem, 85
unsupervised learning, 18
VAE, 127
validation set, 41
value, 75
vanishing gradient, 40, 47
variational autoencoder, 127
variational bound, 124
Vision Transformer, 100
ViT, 100, 116
vocabulary, 28
weight, 13
weight decay, 27
weight matrix, 50
This book is licensed under the Creative Com-
mons BY-NC-SA 4.0 International License.