The Little Book
of
Deep Learning
François Fleuret
François Fleuret is a professor of computer sci-
ence at the University of Geneva, Switzerland.
Contents 5
List of figures 7
Foreword 8
I Foundations 10
1 Machine Learning 11
1.1 Learning from data . . . . . . . 12
1.2 Basis function regression . . . . 14
1.3 Under and overfitting . . . . . . 16
1.4 Categories of models . . . . . . 18
2 Efficient computation 20
2.1 GPUs, TPUs, and batches . . . . 21
2.2 Tensors . . . . . . . . . . . . . . 23
3 Training 25
3.1 Losses . . . . . . . . . . . . . . 26
3.2 Autoregressive models . . . . . 30
3.3 Gradient descent . . . . . . . . 34
3.4 Backpropagation . . . . . . . . 39
3.5 The value of depth . . . . . . . 44
3.6 Training protocols . . . . . . . 47
3.7 The benefits of scale . . . . . . 50
II Deep models 55
4 Model components 56
4.1 The notion of layer . . . . . . . 57
4.2 Linear layers . . . . . . . . . . . 59
4.3 Activation functions . . . . . . 69
4.4 Pooling . . . . . . . . . . . . . . 72
4.5 Dropout . . . . . . . . . . . . . 75
4.6 Normalizing layers . . . . . . . 78
4.7 Skip connections . . . . . . . . 82
4.8 Attention layers . . . . . . . . . 85
4.9 Token embedding . . . . . . . . 92
4.10 Positional encoding . . . . . . . 93
5 Architectures 95
5.1 Multi-Layer Perceptrons . . . . 96
5.2 Convolutional networks . . . . 98
5.3 Attention models . . . . . . . . 105
Afterword 143
Bibliography 144
Index 153
List of Figures
4.1 1D convolution . . . . . . . . . . . . 61
4.2 2D convolution . . . . . . . . . . . . 62
4.3 Stride, padding, and dilation . . . . 63
4.4 Receptive field . . . . . . . . . . . . 66
4.5 Activation functions . . . . . . . . . 70
4.6 Max pooling . . . . . . . . . . . . . 73
4.7 Dropout . . . . . . . . . . . . . . . . 76
4.8 Batch normalization . . . . . . . . . 79
4.9 Skip connections . . . . . . . . . . . 83
4.10 Attention operator interpretation . 86
4.11 Complete attention operator . . . . 88
4.12 Multi-Head Attention layer . . . . . 90
Foreword
If you did not get this book from its official URL
https://fleuret.org/public/lbdl.pdf
please do so, so that I can estimate the number of readers.
François Fleuret,
May 21, 2023
Part I
Foundations
Chapter 1
Machine Learning
1.1 Learning from data
The simplest use case for a model trained from
data is when a signal x is accessible, for instance,
the picture of a license plate, from which one
wants to predict a quantity y, such as the string
of characters written on the plate.
1.2 Basis function regression
We can illustrate the training of a model in a sim-
ple case where xn and yn are two real numbers,
the loss is the mean squared error:
\mathscr{L}(w) = \frac{1}{N} \sum_{n=1}^{N} \big( y_n - f(x_n; w) \big)^2, \tag{1.1}
the loss ℒ (w) is quadratic with respect to the
wk s, and finding w∗ that minimizes it boils down
to solving a linear system. See Figure 1.1 for an
example with Gaussian kernels as fk .
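As a concrete illustration, here is a minimal PyTorch sketch of this situation; the kernel centers and width are arbitrary assumptions, and the quadratic loss is minimized by solving the corresponding linear least-squares problem.

import torch

# Synthetic 1D data: y = sin(3x) + noise
x = torch.rand(100)
y = torch.sin(3 * x) + 0.1 * torch.randn(100)

# Gaussian basis functions f_k(x) = exp(-(x - c_k)^2 / (2 sigma^2))
centers = torch.linspace(0, 1, 10)   # hypothetical kernel centers
sigma = 0.1
Phi = torch.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))  # N x K

# The MSE loss is quadratic in w, so its minimizer solves a linear system
w = torch.linalg.lstsq(Phi, y[:, None]).solution.squeeze(1)

y_hat = Phi @ w   # model predictions f(x_n; w) = sum_k w_k f_k(x_n)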
1.3 Under and overfitting
A key element is the interplay between the ca-
pacity of the model, that is its flexibility and
ability to fit diverse data, and the amount and
quality of the training data. When the capacity
is insufficient, the model cannot fit the data and
the error during training is high. This is referred
to as underfitting.
Conversely, when the capacity is too large relative to the amount and quality of the training data, the model fits peculiarities of the training examples and performs poorly on new data. This is overfitting.
1.4 Categories of models
We can organize the use of machine learning
models into three broad categories:
Chapter 2
Efficient computation
2.1 GPUs, TPUs, and batches
Graphical Processing Units were originally de-
signed for real-time image synthesis, which re-
quires highly parallel architectures that happen
to be fitting for deep models. As their usage for
AI has increased, GPUs have been equipped with
dedicated sub-components referred to as tensor
cores, and deep-learning specialized chips such
as Google’s Tensor Processing Units (TPUs) have
been developed.
Batches also reduce the overhead of moving data from the main memory to the cache memory near the actual computing units. Proceeding by batches allows for copying the model parameters only once, instead of doing it for every sample. In practice, a GPU processes a batch that fits in memory almost as quickly as a single sample.
2.2 Tensors
GPUs and deep learning frameworks such as Py-
Torch or JAX manipulate the quantities to be
processed by organizing them as tensors, which
are series of scalars arranged along several dis-
crete axes. They are elements of ℝ^{N_1 × ··· × N_D}
that generalize the notion of vector and matrix.
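As a brief illustration, here is a small PyTorch sketch of how such tensors are created and manipulated; the shapes are arbitrary examples.

import torch

# A batch of 16 RGB images of resolution 32x32: a 16 x 3 x 32 x 32 tensor
x = torch.randn(16, 3, 32, 32)

print(x.shape)    # torch.Size([16, 3, 32, 32])
print(x.dtype)    # torch.float32
print(x.device)   # cpu (or cuda:0 once moved to a GPU)

# Tensors generalize vectors and matrices: axes can be reshaped or permuted
flat = x.reshape(16, -1)             # 16 x 3072, one row per sample
hwc = x.permute(0, 2, 3, 1)          # 16 x 32 x 32 x 3

# Operations are applied in parallel over the whole batch
mean_per_channel = x.mean(dim=(0, 2, 3))   # tensor of 3 values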
Chapter 3
Training
3.1 Losses
The example of the mean squared error of Equa-
tion 1.1 is a standard loss for predicting a con-
tinuous value.
Cross-entropy
For classification, the usual strategy is that the
output of the model is a vector with one com-
ponent f (x;w)y per class y, interpreted as the
logarithm of a non-normalized probability, or
logit.
\hat{P}(Y = y \mid X = x) = \frac{\exp f(x; w)_y}{\sum_z \exp f(x; w)_z}.
To be consistent with this interpretation the
model should be trained to maximize the proba-
bility of the true classes, hence to minimize the
cross-entropy, expressed as:
\mathscr{L}_{\mathrm{ce}}(w) = -\frac{1}{N} \sum_{n=1}^{N} \log \hat{P}(Y = y_n \mid X = x_n)
= \frac{1}{N} \sum_{n=1}^{N} \underbrace{-\log \frac{\exp f(x_n; w)_{y_n}}{\sum_z \exp f(x_n; w)_z}}_{L_{\mathrm{ce}}(f(x_n; w),\, y_n)}.
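As a minimal sketch, the per-sample term L_ce can be computed from the logits as follows in PyTorch; torch.nn.functional.cross_entropy applies the softargmax normalization internally.

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)           # f(x_n; w): 4 samples, 10 classes
targets = torch.tensor([3, 7, 0, 2])  # true classes y_n

# Built-in: mean over the batch of -log P_hat(Y = y_n | X = x_n)
loss = F.cross_entropy(logits, targets)

# Equivalent explicit computation
log_probs = logits - logits.logsumexp(dim=1, keepdim=True)
loss_manual = -log_probs[torch.arange(4), targets].mean()

assert torch.allclose(loss, loss_manual)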
Contrastive loss
In certain setups, even though the value to be
predicted is continuous, the supervision takes
the form of ranking constraints. The typical do-
main where this is the case is metric learning,
where the objective is to learn a measure of dis-
tance between samples such that a sample xa
from a certain semantic class is closer to any
sample xb of the same class than to any sample
xc from another class. For instance, xa and xb
can be two pictures of a certain person, and xc a
picture of someone else.
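The exact contrastive loss used for metric learning is not reproduced here; one common instantiation of such ranking constraints is the triplet loss, sketched below with a hypothetical margin.

import torch
import torch.nn.functional as F

def triplet_loss(xa, xb, xc, margin=1.0):
    # Encourage d(xa, xb) to be smaller than d(xa, xc) by at least `margin`,
    # where xa and xb share a semantic class and xc comes from another class.
    d_pos = F.pairwise_distance(xa, xb)
    d_neg = F.pairwise_distance(xa, xc)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# Example with random 128-dimensional features for a batch of 8 triplets
xa, xb, xc = (torch.randn(8, 128) for _ in range(3))
print(triplet_loss(xa, xb, xc))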
3.2 Autoregressive models
Many spectacular applications in computer vi-
sion and natural language processing have been
tackled with autoregressive models of discrete
sequences.
Then, given a model of the conditional distributions P̂(X_t | X_1 = x_1, ..., X_{t−1} = x_{t−1}; w), the chain rule states that one can sample a full sequence of length T by sampling the x_t's one after another, each according to the predicted posterior distribution, given the x_1, ..., x_{t−1} already sampled. This is an autoregressive generative model.
Causal models
Training such a model could be achieved naively
by minimizing the sum across training sequences
x and time steps t of
L_{\mathrm{ce}}\big( f(x_1, \dots, x_{t-1}, \varnothing, \dots, \varnothing; w),\, x_t \big),
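In practice, with a causal model, all these terms can be computed in a single forward pass by predicting every next token from the prefix that precedes it. A minimal sketch, with random logits standing in for the model's output and a hypothetical vocabulary size:

import torch
import torch.nn.functional as F

V, T, B = 1000, 32, 8                  # vocabulary size, sequence length, batch size
x = torch.randint(V, (B, T))           # a batch of training sequences

inputs, targets = x[:, :-1], x[:, 1:]  # predict x_t from x_1, ..., x_{t-1}

# Stand-in for the output of a causal model f(inputs; w), of shape (B, T-1, V);
# position t of a causal model only depends on inputs[:, :t+1]
logits = torch.randn(B, T - 1, V, requires_grad=True)

# One cross-entropy term per sequence and per time step, averaged
loss = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))
loss.backward()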
Tokenizer
One important technical detail when dealing
with natural languages is that the representation
as tokens can be done in multiple ways, from the
finest granularity of individual symbols to entire
words. The conversion to and from the token
representation is done by a separate algorithm
called a tokenizer.
3.3 Gradient descent
Except in specific cases like the linear regression
we saw in § 1.2, the optimal parameters w∗ do
not have a closed-form expression. In the general
case, the tool of choice to minimize a function is
gradient descent. It starts by initializing the pa-
rameters with a random w0 , and then improves
this estimate by iterating gradient steps, each
consisting of computing the gradient of the loss
with respect to the parameters, and subtracting
a fraction of it:

w_{n+1} = w_n - \eta \, \nabla \mathscr{L}|_{w_n}.
Learning rate
The meta-parameter η is referred to as the learn-
ing rate. It is a positive value that modulates
how quickly the minimization is done, and has
to be chosen carefully.
If the learning rate is too small, the minimization is slow; if it is too large, the value of the loss may bounce around a good minimum and never descend into it. As we will see in § 3.6, it can depend on the iteration number n.
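A minimal sketch of the resulting procedure, here as plain gradient descent on a hypothetical quadratic loss:

import torch

# Toy quadratic loss L(w) = mean((A w - b)^2) with random A and b
A, b = torch.randn(50, 10), torch.randn(50)

w = torch.randn(10, requires_grad=True)   # random initialization w_0
eta = 0.01                                # learning rate

for n in range(1000):
    loss = ((A @ w - b) ** 2).mean()
    loss.backward()                       # gradient of the loss w.r.t. w
    with torch.no_grad():
        w -= eta * w.grad                 # w_{n+1} = w_n - eta * grad
        w.grad.zero_()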
3.4 Backpropagation
Using gradient descent requires a technical means to compute ∇ℓ|_w, where ℓ = L(f(x; w); y). Given that f and L are both compositions of standard tensor operations, as for any mathematical expression, the chain rule from differential calculus allows us to get an expression of it.
(Figure 3.4: the forward pass computes x^(d) = f^(d)(x^(d−1); w_d); the backward pass obtains ∇ℓ|_{x^(d−1)} from ∇ℓ|_{x^(d)} through a product with the Jacobian J_{f^(d)}|_x, and ∇ℓ|_{w_d} through a product with J_{f^(d)}|_w.)
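As a small sketch, deep learning frameworks carry out this backward pass automatically; in PyTorch it is triggered by .backward() on the loss:

import torch

w1 = torch.randn(5, 3, requires_grad=True)
w2 = torch.randn(1, 5, requires_grad=True)
x, y = torch.randn(3), torch.randn(1)

# Forward pass: a composition of standard tensor operations
h = torch.tanh(w1 @ x)          # x^(1) = f^(1)(x; w1)
y_hat = w2 @ h                  # x^(2) = f^(2)(x^(1); w2)
loss = ((y_hat - y) ** 2).sum()

# Backward pass: the chain rule applied operation by operation
loss.backward()
print(w1.grad.shape, w2.grad.shape)   # gradients of the loss w.r.t. w1 and w2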
Resource usage
Regarding the computational cost, as we will
see, the bulk of the computation goes into linear
operations that require one matrix product for
the forward pass and two for the products by
the Jacobians for the backward pass, making the
latter roughly twice as costly as the former.
Vanishing gradient
A key historical issue when training a large net-
work is that when the gradient propagates back-
wards through an operator, it may be rescaled
by a multiplicative factor, and consequently decrease or increase exponentially when it traverses many layers. When it decreases exponen-
tially, this is called the vanishing gradient, and
it may make the training impossible, or, in its
milder form, cause different parts of the model
to be updated at different speeds, degrading their
co-adaptation [Glorot and Bengio, 2010].
3.5 The value of depth
As the term “deep learning” indicates, useful
models are generally compositions of long se-
ries of mappings. Training them with gradient
descent results in a sophisticated co-adaptation
of the mappings, even though this procedure is
gradual and local.
Depth also allows for the manipulation of representations in high dimensions, which, in particular, facilitates the optimization by providing many degrees of freedom.
3.6 Training protocols
Training a deep network requires defining a pro-
tocol to make the most of computation and data,
and ensure that performance will be good on
new data.
(Figure: evolution of the training and validation losses as a function of the number of epochs.)
Remarkably, very large models can keep improving their performance on new data even after reaching a near-perfect fit on the training set [Belkin et al., 2018].
3.7 The benefits of scale
There is an accumulation of empirical results
showing that performance, for instance, esti-
mated through the loss on test data, improves
with the amount of data according to remarkable
scaling laws, as long as the model size increases
correspondingly [Kaplan et al., 2020] (see Figure
3.6).
(Figure 3.6: test loss as a function of compute in peta-FLOP/s-days, and as a function of the number of parameters [Kaplan et al., 2020].)
Figure 3.7: Training costs in number of FLOP of some
landmark models [Sevilla et al., 2023]. The colors in-
dicate the domains of application: Computer Vision
(blue), Natural Language Processing (red), or other
(black). The dashed lines correspond to the energy con-
sumption using A100 SXM GPUs in 16-bit precision.
The most impressive current successes of artifi-
cial intelligence rely on the so-called Large Lan-
guage Models (LLMs), which we will see in § 5.3
and § 7.1, trained on extremely large text datasets
(see Table 3.1).
Part II
Deep models
Chapter 4
Model components
4.1 The notion of layer
We call layers standard complex compounded
tensor operations that have been designed and
empirically identified as being generic and effi-
cient. They often incorporate trainable param-
eters and correspond to a convenient level of
granularity for designing and describing large
deep models. The term is inherited from sim-
ple multi-layer neural networks, even though
modern models may take the form of a complex
graph of such modules, incorporating multiple
parallel pathways.
(Figure: graphical notation for deep models — a tensor X of size 32 × 32 is processed by a layer f repeated ×K, then by a layer g with meta-parameter n = 4, producing a tensor Y of size 4 × 4.)
4.2 Linear layers
The most important modules in terms of compu-
tation and number of parameters are the Linear
layers. They benefit from decades of research
and engineering in algorithmic and chip design
for matrix operations.
A fully connected layer applies the same affine mapping, parameterized by a trainable weight matrix W and bias vector b, to every index tuple of the input tensor:

\forall d_1, \dots, d_K, \quad Y[d_1, \dots, d_K] = W \, X[d_1, \dots, d_K] + b.
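A minimal sketch in PyTorch: nn.Linear applies exactly this kind of affine mapping, here to the vectors along the last axis of a tensor of any shape.

import torch
import torch.nn as nn

layer = nn.Linear(in_features=50, out_features=25)  # W is 25 x 50, b has 25 entries

x = torch.randn(16, 8, 50)   # arbitrary leading axes, here d_1 = 16, d_2 = 8
y = layer(x)                 # same affine map applied to every x[d_1, d_2, :]
print(y.shape)               # torch.Size([16, 8, 25])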
Convolutional layers
A linear layer can take as input an arbitrarily-
shaped tensor by reshaping it into a vector, as
long as it has the correct number of coefficients.
However, such a layer is poorly adapted to dealing with large signals, since its number of parameters grows with the product of the input and output dimensions, and it ignores the structure of the signal.
Figure 4.1: A 1D convolution (left) takes as input
a D × T tensor X, applies the same affine mapping
ϕ(·;w) to every sub-tensor of shape D × K, and stores
the resulting D′ × 1 tensors into Y . A 1D transposed
convolution (right) takes as input a D × T tensor, ap-
plies the same affine mapping ψ(·;w) to every sub-
tensor of shape D × 1, and sums the shifted resulting
D′ × K tensors. Both can process inputs of different
size.
Figure 4.2: A 2D convolution (left) takes as input a
D × H × W tensor X, applies the same affine mapping
ϕ(·;w) to every sub-tensor of shape D × K × L, and
stores the resulting D′ × 1 × 1 tensors into Y . A 2D
transposed convolution (right) takes as input a D ×
H × W tensor, applies the same affine mapping ψ(·;w)
to every D × 1 × 1 sub-tensor, and sums the shifted
resulting D′ × K × L tensors into Y .
Figure 4.3: Beside its kernel size and number of input/output channels, a convolution admits three meta-parameters: the stride s (left) modulates the step size when going through the input tensor, the padding p (top right) specifies how many zero entries are added around the input tensor before processing it, and the dilation d (bottom right) parameterizes the index step between the coefficients of the filter.
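A small sketch showing how these meta-parameters change the output size of a 1D convolution, whose length is ⌊(T + 2p − d(K−1) − 1)/s⌋ + 1:

import torch
import torch.nn as nn

x = torch.randn(1, 4, 32)   # one sample, D = 4 channels, T = 32 positions

for kwargs in ({}, {"stride": 2}, {"padding": 2}, {"dilation": 2}):
    conv = nn.Conv1d(in_channels=4, out_channels=8, kernel_size=5, **kwargs)
    print(kwargs, conv(x).shape)

# {}               -> torch.Size([1, 8, 28])
# {'stride': 2}    -> torch.Size([1, 8, 14])
# {'padding': 2}   -> torch.Size([1, 8, 32])
# {'dilation': 2}  -> torch.Size([1, 8, 24])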
Natural signals such as images exhibit strong local redundancy and statistical stationarity to translation, scaling, and certain symmetries. This is not reflected in the inductive bias of a fully connected layer, which completely ignores the signal structure.
Reducing the spatial size of the representation is also useful for synthesis, as it is easier to learn a density model over a low-dimension representation. We will come back to this in § 5.2.
4.3 Activation functions
If a network were combining only linear compo-
nents, it would itself be a linear operator, so it is
essential to have non-linear operations. They are
implemented in particular with activation func-
tions, which are layers that transform each component of the input tensor individually through a mapping, resulting in a tensor of the same shape.
Variants of the ReLU include the leaky ReLU (see Figure 4.5, bottom left):

\mathrm{leakyrelu}(x) = \begin{cases} ax & \text{if } x < 0, \\ x & \text{otherwise,} \end{cases}

where a is a small positive value, and the GELU, defined with the cumulative distribution function of a standard Gaussian variable Z:

\mathrm{gelu}(x) = x \, P(Z \leq x).
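A brief sketch applying these element-wise activations in PyTorch; torch.nn.functional provides relu, leaky_relu, and gelu.

import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

print(F.relu(x))                            # max(x, 0), component-wise
print(F.leaky_relu(x, negative_slope=0.1))  # a*x for x < 0, x otherwise
print(F.gelu(x))                            # x * P(Z <= x), Z standard Gaussian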
4.4 Pooling
A classical strategy to reduce the signal size is to
use a pooling operation that combines multiple
activations into one that ideally summarizes the
information. The most standard operation of this
class is the max pooling layer, which, similarly
to convolution, can operate in 1D and 2D, and is
defined by a kernel size.
(Figure 4.6: 1D max pooling — the maximum is taken over each sub-tensor of the kernel size.)
A common alternative is the average pooling layer, which computes the average instead of the maximum over the sub-tensors. This is a linear operation, whereas max pooling is not.
4.5 Dropout
Some layers have been designed to explicitly
facilitate training or improve the learned repre-
sentations.
Figure 4.7: Dropout can process a tensor of arbitrary
shape. During training (left), it sets activations at ran-
dom to zero with probability p and applies a multiply-
ing factor to keep the expected values unchanged. Dur-
ing test (right), it keeps all the activations unchanged.
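A minimal sketch of this train/test behavior: PyTorch's nn.Dropout rescales the kept activations by 1/(1 − p) during training, which keeps expected values unchanged in the same way.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(2, 8)

drop.train()
print(drop(x))   # roughly half the entries are zero, the rest are scaled to 2.0

drop.eval()
print(drop(x))   # identity: all activations kept unchanged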
Dropout can also be kept active at inference time to obtain multiple stochastic predictions for the same input, whose dispersion provides confidence scores [Gal and Ghahramani, 2015].
4.6 Normalizing layers
An important class of operators to facilitate the
training of deep architectures are the normaliz-
ing layers, which force the empirical mean and
variance of groups of activations.
(Figure 4.8: batchnorm (left) and layernorm (right) both compute (x − m̂)/√(v̂ + ϵ) followed by x ⊙ γ + β; they differ in the groups of activations over which the statistics m̂ and v̂ are computed.)
During training, batch normalization standardizes each component using the empirical mean m̂_d and variance v̂_d computed over the batch, and then applies a trainable shift β_d and scale γ_d:

z_{b,d} = \frac{x_{b,d} - \hat{m}_d}{\sqrt{\hat{v}_d + \epsilon}}, \qquad y_{b,d} = \gamma_d \, z_{b,d} + \beta_d.
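A brief sketch of the same computation with PyTorch's nn.BatchNorm1d, which additionally maintains running statistics for use at test time:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=4, eps=1e-5)
x = torch.randn(32, 4) * 3.0 + 1.0    # batch of 32 samples, 4 components each

bn.train()
y = bn(x)
# Roughly zero mean and unit variance, since gamma and beta start at 1 and 0
print(y.mean(dim=0), y.var(dim=0, unbiased=False))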
4.7 Skip connections
Another technique that mitigates the vanishing
gradient and allows the training of deep archi-
tectures are skip connections [Long et al., 2014;
Ronneberger et al., 2015]. They are not layers
per se, but an architectural design in which out-
puts of some layers are transported as-is to other
layers further in the model, bypassing process-
ing in-between. This unmodified signal can be
concatenated or added to the input to the layer
the connection branches into (see Figure 4.9). A
particular type of skip connections is the resid-
ual connection which combines the signal with
a sum, and usually skips only a few layers (see
Figure 4.9, right).
(Figure 4.9: skip connections transporting signals from earlier layers f^(1), f^(2), ... further into the model and combining them by addition, and residual connections (right) adding the signal back after only a few layers.)
Their role can also be to facilitate multi-scale rea-
soning in models that reduce the signal size be-
fore re-expanding it, by connecting layers with
compatible sizes. In the case of residual con-
nections, they may also facilitate learning by
simplifying the task to finding a differential im-
provement instead of a full update.
4.8 Attention layers
In many applications, there is a need for an op-
eration able to combine local information at lo-
cations far apart in a tensor. For instance, this
could be distant details for coherent and realistic
image synthesis, or words at different positions
in a paragraph to make a grammatical or seman-
tic decision in natural language processing.
Attention operator

Given a query tensor Q, a key tensor K, and a value tensor V, the attention operator computes a tensor

Y = \mathrm{att}(Q, K, V) = \underbrace{\mathrm{softargmax}\!\left( \frac{Q K^{\top}}{\sqrt{D_{QK}}} \right)}_{A} V,

where A is the attention matrix.
(Figure 4.11: the complete attention operator — the exponentiated scores exp(Q K^⊤ / √D_QK) are optionally multiplied by the mask M, normalized by 1/Σ_k to form the masked softargmax, passed through dropout to give the attention matrix A, which finally multiplies V to produce Y.)
This operator is usually extended in two ways,
as depicted in Figure 4.11. First, the attention
matrix can be masked by multiplying it before
the softargmax normalization by a Boolean ma-
trix M . This allows, for instance, to make the
operator causal by taking M full of 1s below the
diagonal and zero above, preventing Yq from de-
pending on keys and values of indices k greater
than q. Second, the attention matrix is processed
by a dropout layer (see § 4.5) before being multi-
plied by V , providing the usual benefits during
training.
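A compact sketch of this complete operator for a batch of sequences; here the multiplicative Boolean mask is implemented with the equivalent −∞ fill before the softargmax.

import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None, p_drop=0.0):
    # Q: (B, NQ, D_QK), K: (B, NKV, D_QK), V: (B, NKV, D_V)
    A = Q @ K.transpose(1, 2) / K.shape[-1] ** 0.5    # (B, NQ, NKV) scores
    if mask is not None:
        A = A.masked_fill(mask == 0, float("-inf"))   # forbidden entries get weight 0
    A = F.dropout(A.softmax(dim=-1), p=p_drop)
    return A @ V                                      # (B, NQ, D_V)

B, N, D = 2, 6, 8
Q = K = V = torch.randn(B, N, D)
causal = torch.tril(torch.ones(N, N))   # 1s on and below the diagonal
Y = attention(Q, K, V, mask=causal)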
Multi-Head Attention layer

The Multi-Head Attention layer (see Figure 4.12) applies H attention heads in parallel. It is parameterized by weight tensors:

• W^Q of size H × D × D_QK,
• W^K of size H × D × D_QK, and
• W^V of size H × D × D_V,

as well as a final weight matrix W^O.
Figure 4.12: The Multi-head Attention layer applies
for each of its h = 1,...,H heads a parametrized lin-
ear transformation to individual elements of the input
sequences X Q ,X K ,X V to get sequences Q,K,V that
are processed by the attention operator to compute Yh .
These H sequences are concatenated along features,
and individual elements are passed through one last
linear operator to get the final result sequence Y .
It takes as input three sequences:

• X^Q of size N^Q × D,
• X^K of size N^KV × D, and
• X^V of size N^KV × D,

from which every head h computes Q_h, K_h, V_h and Y_h = att(Q_h, K_h, V_h); the per-head results are concatenated along features and mapped by W^O:

Y = (Y_1 \mid \cdots \mid Y_H) \, W^O.
4.9 Token embedding

When the input is a sequence of discrete tokens, each token is converted to a vector by a lookup table M, the embedding, so that

\forall d_1, \dots, d_K, \quad Y[d_1, \dots, d_K] = M\big[ X[d_1, \dots, d_K] \big].
4.10 Positional encoding
While the processing of a fully connected layer
is specific to both the positions of the features
in the input tensor and to the position of the
resulting activation in the output tensor, con-
volutional layers and Multi-Head Attention lay-
ers are oblivious to the absolute position in the
tensor. This is key to their strong invariance and
inductive bias, which is beneficial for dealing
with a stationary signal.
\text{pos-enc}[t, d] = \begin{cases} \sin\!\left( \dfrac{t}{T^{d/D}} \right) & \text{if } d \in 2\mathbb{N}, \\[1ex] \cos\!\left( \dfrac{t}{T^{(d-1)/D}} \right) & \text{otherwise,} \end{cases}

with T = 10^4.
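A short sketch computing this table for a sequence of length L and feature dimension D:

import torch

def pos_enc(L, D, T=10_000.0):
    t = torch.arange(L)[:, None].float()    # positions 0 .. L-1
    d = torch.arange(D)[None, :].float()    # feature indices 0 .. D-1
    angle = t / T ** ((d - d % 2) / D)      # exponent d/D (even d) or (d-1)/D (odd d)
    pe = torch.where(d % 2 == 0, torch.sin(angle), torch.cos(angle))
    return pe                               # shape (L, D), added to the embeddings

print(pos_enc(50, 64).shape)   # torch.Size([50, 64])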
Chapter 5
Architectures
5.1 Multi-Layer Perceptrons
The simplest deep architecture is the Multi-Layer
Perceptron (MLP), which takes the form of a
succession of fully connected layers separated
by activation functions. See an example in Figure
5.1. For historical reasons, in such a model, the
number of hidden layers refers to the number of
linear layers, excluding the last one.
(Diagram: X (dimension 50) → fully-conn → 25 → relu → fully-conn → 10 → relu → fully-conn → 2 → Y; the two intermediate activations are the hidden layers.)
Figure 5.1: This multi-layer perceptron takes as input a one-dimension tensor of size 50, and is composed of three fully connected layers with outputs of dimensions respectively 25, 10, and 2, the first two followed by ReLU layers.
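The model of Figure 5.1 can be written as a sketch in PyTorch as follows:

import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(50, 25), nn.ReLU(),   # first hidden layer
    nn.Linear(25, 10), nn.ReLU(),   # second hidden layer
    nn.Linear(10, 2),               # output layer, no activation
)

x = torch.randn(16, 50)   # a batch of 16 inputs of dimension 50
print(mlp(x).shape)       # torch.Size([16, 2])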
Provided the activation function σ is continuous and not polynomial, any continuous function f can be approximated arbitrarily well, uniformly on a compact domain (that is, a domain that is bounded and contains its boundary), by a model of the form l_2 ∘ σ ∘ l_1 where l_1 and l_2 are affine. Such a model is an MLP with a single hidden layer, and this universal approximation result implies that it can approximate anything of practical value. However, this approximation only holds if the dimension of the first linear layer's output can be arbitrarily large.
5.2 Convolutional networks
The standard architecture for processing images
is a convolutional network, or convnet, that com-
bines multiple convolutional layers, either to re-
duce the signal size before it can be processed by
fully connected layers, or to output a 2D signal
also of large size.
LeNet-like
The original LeNet model for image classifica-
tion [LeCun et al., 1998] combines a series of 2D
convolutional layers and max pooling layers that
play the role of feature extractor, with a series
of fully connected layers which act like a MLP
and perform the classification per se (see Figure
5.2).
Residual networks
Standard convolutional neural networks that follow the architecture of the LeNet family are not easily extended to deep architectures and suffer from the vanishing gradient problem.
(Diagram: X (1×28×28) → conv-2d k=5 → 32×24×24 → maxpool k=3 → 32×8×8 → relu → conv-2d k=5 → 64×4×4 → maxpool k=2 → 64×2×2 → relu → reshape → 256 → fully-conn → 200 → relu → fully-conn → 10 → P̂(Y); the convolutional part is the feature extractor, the fully connected part the classifier.)
Figure 5.2: Example of a small LeNet-like network for classifying 28 × 28 grayscale images of handwritten digits [LeCun et al., 1998]. Its first half is convolutional, and alternates convolutional layers per se and max pooling layers, reducing the signal dimension from 28 × 28 scalars to 256. Its second half processes this 256-dimension feature vector through a one-hidden-layer perceptron to compute 10 logit scores corresponding to the ten possible digits.
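As an illustration, here is a sketch of this model in PyTorch; the layer order and shapes follow Figure 5.2.

import torch
import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5),    # 1x28x28  -> 32x24x24
    nn.MaxPool2d(kernel_size=3),        # 32x24x24 -> 32x8x8
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=5),   # 32x8x8   -> 64x4x4
    nn.MaxPool2d(kernel_size=2),        # 64x4x4   -> 64x2x2
    nn.ReLU(),
    nn.Flatten(),                       # -> 256
    nn.Linear(256, 200), nn.ReLU(),
    nn.Linear(200, 10),                 # 10 logits
)

x = torch.randn(8, 1, 28, 28)
print(lenet(x).shape)                   # torch.Size([8, 10])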
(Diagram: X (C×H×W) → conv-2d k=1 → C/2×H×W → batchnorm → relu → conv-2d k=3 p=1 → batchnorm → relu → conv-2d k=1 → C×H×W → batchnorm → add X → relu → Y.)
Figure 5.3: A residual block.
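A possible PyTorch sketch of this block; the channel count is an arbitrary example.

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(c, c // 2, kernel_size=1), nn.BatchNorm2d(c // 2), nn.ReLU(),
            nn.Conv2d(c // 2, c // 2, kernel_size=3, padding=1),
            nn.BatchNorm2d(c // 2), nn.ReLU(),
            nn.Conv2d(c // 2, c, kernel_size=1), nn.BatchNorm2d(c),
        )

    def forward(self, x):
        return torch.relu(x + self.f(x))   # residual connection, then ReLU

x = torch.randn(4, 64, 56, 56)
print(ResBlock(64)(x).shape)               # torch.Size([4, 64, 56, 56])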
(Diagram: main path X (C×H×W) → conv-2d k=1 → C/S×H×W → batchnorm → relu → conv-2d k=3 s=S p=1 → C/S×H/S×W/S → batchnorm → relu → conv-2d k=1 → 4C/S×H/S×W/S → batchnorm; skip path X → conv-2d k=1 s=S → 4C/S×H/S×W/S → batchnorm; the two are summed and passed through a relu to give Y.)
Figure 5.4: A downscaling residual block. It admits a
meta-parameter S, the stride of the first convolution
layer, which modulates the reduction of the tensor size.
(Diagram: X (3×224×224) → conv-2d k=7 s=2 p=3 → 64×112×112 → batchnorm → relu → maxpool k=3 s=2 p=1 → 64×56×56 → dresblock S=1 → 256×56×56 → resblock ×2 → dresblock S=2 → 512×28×28 → resblock ×3 → dresblock S=2 → 1024×14×14 → resblock ×5 → dresblock S=2 → 2048×7×7 → resblock ×2 → avgpool k=7 → 2048×1×1 → reshape → 2048 → fully-conn → 1000 → P̂(Y).)
Figure 5.5: Structure of the ResNet-50 [He et al., 2015].
The number of channels determines the capacity of the representation. However, the parameter count of a convolutional layer, and its computational cost, are quadratic in the number of channels. This residual block mitigates this problem by first reducing the number of channels with a 1 × 1 convolution, then operating spatially with a 3 × 3 convolution on this reduced number of channels, and then upscaling the number of channels, again with a 1 × 1 convolution.
Surprisingly, in the first section, there is no downscaling, only an increase of the number of channels by a factor of 4. The output of the last residual block is 2048 × 7 × 7, which is converted to a vector of dimension 2048 by an average pooling of kernel size 7 × 7, and then processed through a fully connected layer to get the final logits, here for 1000 classes.
5.3 Attention models
As stated in § 4.8, many applications, in partic-
ular from natural language processing, greatly
benefit from models that include attention mech-
anisms. The architecture of choice for such tasks,
which has been instrumental in recent advances
in deep learning, is the Transformer proposed
by Vaswani et al. [2017].
Transformer
The original Transformer, pictured in Figure 5.7,
was designed for sequence-to-sequence trans-
lation. It combines an encoder that processes
the input sequence to get a refined representa-
tion, and an autoregressive decoder that gener-
ates each token of the result sequence, given the
encoder’s representation of the input sequence
and the output tokens generated so far. As with the residual convolutional networks of § 5.2, both the encoder and the decoder of the Transformer are sequences of compounded blocks built with residual connections.
(Diagram: in each block, the input goes through layernorm → mha (Q, K, V) and is added back to the input, then through layernorm → fully-conn → gelu → fully-conn → dropout and is added back again to give Y; the self-attention block (left) computes Q, K, and V from the same sequence X^QKV, while the cross-attention block (right) computes Q from X^Q and K, V from X^KV.)
Figure 5.6: Self-attention block (left) and cross-atten-
tion block (right). These specific structures proposed by
Radford et al. [2018] differ slightly from the original
architecture of Vaswani et al. [2017], in particular by
having the layer normalization first in the residual
blocks.
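A sketch of the self-attention block of Figure 5.6 using PyTorch's nn.MultiheadAttention; the hidden width of 4D in the MLP and the head count are assumptions for illustration.

import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    def __init__(self, d, n_heads, p_drop=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(
            nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d), nn.Dropout(p_drop)
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.mha(h, h, h, need_weights=False)[0]   # residual around attention
        x = x + self.mlp(self.ln2(x))                       # residual around the MLP
        return x

x = torch.randn(2, 16, 64)                 # 2 sequences of length 16, D = 64
print(SelfAttentionBlock(64, 8)(x).shape)  # torch.Size([2, 16, 64])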
(Figure 5.7: the original encoder-decoder Transformer. The encoder embeds X_1,...,X_T, adds a positional encoding, and applies N self-attention blocks to produce Z_1,...,Z_T of size T × D. The decoder embeds the shifted sequence 0, Y_1,...,Y_{S−1}, adds a positional encoding, and applies N blocks, each combining a causal self-attention block and a cross-attention block whose keys and values come from the Z's; a final fully connected layer produces the logits P̂(Y_1),...,P̂(Y_S | Y_{s<S}) of size S × V.)
Each block combines a Multi-Head Attention layer, which lets every position gather information from any other positions, with a one-hidden-layer MLP that updates representations at every position separately. This block can be made causal by using an adequate attention mask, as described in § 4.8.
(Figure: a decoder-only model — the shifted sequence 0, X_1,...,X_{T−1} is embedded, a positional encoding is added, N causal self-attention blocks are applied, and a final fully connected layer produces the logits P̂(X_1),...,P̂(X_T | X_{t<T}) of size T × V.)
(Figure 5.9: the Vision Transformer — the image is cut into M patches X_1,...,X_M of dimension 3P², each mapped by a matrix W^E to an embedding E_1,...,E_M; a learned embedding E_0 is prepended, a positional encoding is added, N self-attention blocks produce Z_0, Z_1,...,Z_M of size (M+1) × D, and an MLP readout maps the first element, of dimension D, to the C class logits P̂(Y).)
Vision Transformer
Transformers have been put to use for image
classification with the Vision Transformer (ViT)
model [Dosovitskiy et al., 2020] (see Figure 5.9).
Part III
Applications
Chapter 6
Prediction
6.1 Image denoising
A direct application of deep models to image
processing is to recover from degradation by
utilizing the redundancy in the statistical struc-
ture of images. The petals of a sunflower on a
grayscale picture can be colored with high confi-
dence, and the texture of a geometric shape such
as a table on a low-light grainy picture can be
corrected by averaging it over a large area likely
to be uniform.
Such models can also restore images degraded, for instance, with a lossy compression method.
6.2 Image classification
Image classification is the simplest strategy for
extracting semantics from an image and consists
of predicting a class from a finite, predefined
number of classes, given an input image.
6.3 Object detection
A more complex task for image understanding is
object detection, in which the objective is, given
an input image, to predict the classes and posi-
tions of objects of interest.
(Figure: the input image X is mapped to a series of representations Z_1, Z_2, ..., Z_{S−1}, Z_S.)
Figure 6.2: Examples of object detection with the Single-
Shot Detector [Liu et al., 2015].
it. This results in a non-ambiguous matching of any bounding box (x_1, x_2, y_1, y_2) to an (s, h, w), determined respectively by max(x_2 − x_1, y_2 − y_1), (y_1 + y_2)/2, and (x_1 + x_2)/2.
metric quantities.
6.4 Semantic segmentation
The finest-grain prediction task for image un-
derstanding is semantic segmentation, which
consists of predicting, for every pixel, the class
of the object to which it belongs. This can be
achieved with a standard convolutional neural
network that outputs a convolutional map with
as many channels as classes, carrying the esti-
mated logits for every pixel.
Figure 6.3: Semantic segmentation results with the
Pyramid Scene Parsing Network [Zhao et al., 2016].
The standard approach processes the image with a pre-trained backbone, computes representations at several scales, concatenates the resulting multi-scale representation after upscaling, and makes the final per-pixel prediction from it [Zhao et al., 2016].
6.5 Speech recognition
Speech recognition consists of converting a
sound sample into a sequence of words. There
have been plenty of approaches to this problem
historically, but a conceptually simple and recent
one proposed by Radford et al. [2022] consists of
casting it as a sequence-to-sequence translation
and then solving it with a standard attention-
based Transformer, as described in § 5.3.
This approach allows leveraging extremely large
data sets that combine multiple types of sound
sources with diverse ground truth.
6.6 Text-image representations
A powerful approach to image understanding
consists of learning consistent image and text
representations, such that an image, or a textual
description of it, would be mapped to the same
feature vector.
Chapter 7
Synthesis
7.1 Text generation
The standard approach to text synthesis is to
use an attention-based, autoregressive model.
The most successful in this domain is the GPT model [Radford et al., 2018], which we described in § 5.3.
7.2 Image generation
Multiple deep methods have been developed to
model and sample from a high-dimensional den-
sity. A powerful approach for image synthesis
relies on inverting a diffusion process.
The diffusion process gradually degrades a sample x_0 by repeatedly adding a small amount of Gaussian noise and re-normalizing the variance to 1. This process exponentially reduces the importance of x_0, and x_t's density can rapidly be approximated with a normal.
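A minimal sketch of one such degradation step under these assumptions; the schedule value beta is hypothetical.

import torch

def diffusion_step(x_t, beta=0.02):
    # Add a small Gaussian noise and rescale so that the variance stays equal to 1
    return (1 - beta) ** 0.5 * x_t + beta ** 0.5 * torch.randn_like(x_t)

x = torch.randn(3, 64, 64)    # x_0, assumed normalized to unit variance
for t in range(1000):
    x = diffusion_step(x)     # x_t drifts toward a standard normal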
The missing bits
Autoencoder
An autoencoder is a model that maps an input
signal, possibly of high dimension, to a low-
dimension latent representation, and then maps
it back to the original signal, ensuring that infor-
mation has been preserved. We saw it in § 6.1
for denoising, but it can also be used to auto-
matically discover a meaningful low-dimension
parameterization of the data manifold.
Generative Adversarial Networks

Another generative approach combines a generator, which takes a random input following a fixed distribution and produces a structured signal such as an image, and a discriminator, which takes as input a sample and predicts whether it comes from the training set or was generated by the generator.
Reinforcement Learning
Many problems require a model to estimate
an accumulated long-term reward given action
choices and an observable state, and what ac-
tions to choose to maximize that reward. Rein-
forcement Learning (RL) is the standard frame-
work to formalize such problems, and strategy
games or robotic control, for instance, can be
formulated within it. Deep models, particularly
convolutional neural networks, have demon-
strated excellent performance for this class of
tasks [Mnih et al., 2015].
Fine-tuning
As we saw in § 6.3 for object detection, and in § 6.4 for semantic segmentation, starting from a pre-trained model and fine-tuning it to the task at hand is an efficient strategy to deal with small training sets.
Self-supervised training
As stated in § 7.1, even though they are trained
only to predict the next word, Large Language
Models trained on large unlabeled data sets such
as GPT (see § 5.3) are able to solve various tasks
such as identifying the grammatical role of a
word, answering questions, or even translating
from one language to another [Radford et al.,
2019].
Another approach is to train a model to compute image features so that they are invariant to data transformations that do not change the semantic content of the image, while being statistically uncorrelated [Zbontar et al., 2021].
Afterword
Bibliography
J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805, 2018. [pdf]. 52, 111
A. Gomez, M. Ren, R. Urtasun, and R. Grosse.
The Reversible Residual Network: Backprop-
agation Without Storing Activations. CoRR,
abs/1707.04585, 2017. [pdf]. 42
S. Ioffe and C. Szegedy. Batch Normalization: Ac-
celerating Deep Network Training by Reduc-
ing Internal Covariate Shift. In International
Conference on Machine Learning (ICML), 2015.
[pdf]. 78
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [pdf]. 98, 99
A. Radford, J. W. Kim, et al. Learning Transferable Visual Models From Natural Language Supervision. CoRR, abs/2103.00020, 2021. [pdf]. 127, 129
M. D. Zeiler and R. Fergus. Visualizing and Un-
derstanding Convolutional Networks. In Eu-
ropean Conference on Computer Vision (ECCV),
2014. [pdf]. 67
Index
1D convolution, 64
2D convolution, 64
activation, 23, 40
function, 69, 96
map, 66
Adam, 38
affine operation, 59
artificial neural network, 8, 11
attention operator, 86
autoencoder, 138
denoising, 114
Autograd, 41
autoregressive model, see model, autoregressive
average pooling, 72
backpropagation, 41
backward pass, 41
basis function regression, 14
batch, 21, 37
batch normalization, 78, 101
bias vector, 59, 64
BPE, see Byte Pair Encoding
Byte Pair Encoding, 33, 125, 131
cache memory, 21
capacity, 16
causal, 32, 88, 108
model, see model, causal
chain rule (derivative), 39
chain rule (probability), 30
channel, 23
checkpointing, 42
classification, 18, 26, 98, 116
CLIP, see Contrastive Language-Image
Pre-training
CLS token, 111
computational cost, 42
Contrastive Language-Image Pre-training, 127
contrastive loss, 27, 127
convnet, see convolutional network
convolution, 64
convolutional layer, see layer, convolutional
convolutional network, 98
cross-attention block, 91, 106, 108
cross-entropy, 27, 32, 44
filter, 64
fine-tuning, 140
flops, 22
forward pass, 40
foundation model, 132
FP32, 22
framework, 23
image processing, 98
image synthesis, 85, 133
inductive bias, 17, 48, 64
query, 86
random initialization, 60
receptive field, 66, 117
rectified linear unit, 69, 137
recurrent neural network, 137
regression, 18
Reinforcement Learning, 139
ReLU, see rectified linear unit
residual
block, 101
connection, 82, 100
network, 46, 82, 100
ResNet-50, 100
reversible layer, see layer, reversible
RL, see Reinforcement Learning
RNN, see recurrent neural network
scaling laws, 50
self-attention block, 91, 105, 106
self-supervised learning, 141
semantic segmentation, 122
SGD, see stochastic gradient descent
Single Shot Detector, 117
skip connection, 82, 123, 137
softargmax, 26, 87
softmax, 26
speech recognition, 125
SSD, see Single Shot Detector
stochastic gradient descent, 37, 44, 50
stride, 65, 72
supervised learning, 19
underfitting, 16
universal approximation theorem, 96
unsupervised learning, 19
weight, 13
decay, 28
matrix, 59
This book is licensed under the Creative Com-
mons BY-NC-SA 4.0 International License.