A Little Book of Deep Learning

François Fleuret
beta-2023.05.15
François Fleuret is professor of computer science
at the University of Geneva, Switzerland.
List of figures
Foreword

I Foundations

1 Machine Learning
1.1 Learning from data
1.2 Basis function regression
1.3 Under and over-fitting
1.4 Categories of models

2 Efficient computation
2.1 GPUs, TPUs, and batches
2.2 Tensors

3 Training
3.1 Losses
3.2 Autoregressive models
3.3 Gradient descent
3.4 Backpropagation
3.5 Training protocols
3.6 Training data

II Deep models

4 Model components
4.1 The notion of layer
4.2 Linear layers
4.3 Activation functions
4.4 Pooling
4.5 Dropout
4.6 Normalizing layers
4.7 Skip connections
4.8 Attention layers
4.9 Token embedding
4.10 Positional encoding

5 Architectures
5.1 Multi-Layer Perceptrons
5.2 Convolutional networks
5.3 Attention models

Afterword
Bibliography
Index
List of Figures

4.1 1d convolution
4.2 2d convolution
4.3 Stride, padding, and dilation
4.4 Receptive field
4.5 Activation functions
4.6 Max pooling
4.7 Dropout
4.8 Batch normalization
4.9 Skip connections
4.10 Interpretation of the attention operator
4.11 Attention operator
4.12 Multi-Head Attention layer
Foreword
If you did not get this book from its official url
https://fleuret.org/public/lbdl.pdf
François Fleuret
April 21, 2023
Part I
Foundations
Chapter 1
Machine Learning
1.1 Learning from data
The simplest use case for a model trained from
data is when a signal x is accessible, for instance
the picture of a license plate, from which one
wants to predict a quantity y, such as the string
of characters written on the plate.
1.2 Basis function regression
We can illustrate the training of a model in a sim-
ple case where xn and yn are two real numbers,
the loss is the mean squared error
$$\mathscr{L}(w) = \frac{1}{N} \sum_{n=1}^{N} \big(y_n - f(x_n; w)\big)^2, \qquad (1.1)$$
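As an illustration, here is a minimal sketch of such a fit with a polynomial basis, using a closed-form least-squares solution; the synthetic data and the degree K = 5 are hypothetical choices, not taken from the book.

    # Minimal sketch: basis function regression with a polynomial basis,
    # fitted in closed form by least squares, which minimizes the MSE of
    # Equation 1.1. Data and degree are hypothetical.
    import torch

    def fit_poly(x, y, K):
        # Design matrix whose columns are the basis functions 1, x, ..., x^K.
        X = torch.stack([x**k for k in range(K + 1)], dim=1)
        w = torch.linalg.lstsq(X, y.unsqueeze(1)).solution.squeeze(1)
        return w

    def f(x, w):
        return sum(w[k] * x**k for k in range(len(w)))

    x = torch.linspace(0, 1, 100)
    y = torch.sin(2 * torch.pi * x) + 0.1 * torch.randn(100)   # noisy samples
    w = fit_poly(x, y, K=5)
    mse = ((y - f(x, w))**2).mean()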
This is over-fitting.
1.4 Categories of models
We can organize the use of machine learning
models into three broad categories:
Chapter 2
Efficient computation
2.1 GPUs, TPUs, and batches
Graphical Processing Units were originally designed for real-time image synthesis, which requires highly parallel architectures that happen to be well suited to deep models. As their usage for AI has increased, GPUs have been equipped with dedicated sub-components referred to as tensor cores, and deep-learning specialized chips such as Google's Tensor Processing Units (TPUs) have been produced.
to the cache memory near the actual computing
units. Proceeding by batches allows for copying
the model parameters only once, instead of doing
it for every sample. In practice a GPU processes
a batch that fits in memory almost as quickly as
a single sample.
2.2 Tensors
GPUs and deep learning frameworks such as Py-
Torch or JAX manipulate the quantities to pro-
cess by organizing them as tensors, which are
series of scalars arranged along several discrete
axes. They are elements of ℝ^(N_1 × ··· × N_D) that generalize the notion of vector and matrix.
Chapter 3
Training
3.1 Losses
The example of the mean squared error of Equa-
tion 1.1 is a standard loss for predicting a con-
tinuous value.
$$\hat P(Y = y \mid X = x) = \frac{\exp f(x;w)_y}{\sum_z \exp f(x;w)_z}.$$
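In code, this corresponds to applying a softargmax to the logits f(x;w) produced by the model, or equivalently to using a cross-entropy loss; a minimal sketch with hypothetical logits and labels:

    # Minimal sketch: the predicted posterior above is a softargmax (softmax)
    # of the logits f(x; w); the corresponding loss is the cross-entropy.
    # The logits and labels here are random placeholders.
    import torch
    import torch.nn.functional as F

    logits = torch.randn(8, 10)        # f(x; w) for a batch of 8, 10 classes
    y = torch.randint(0, 10, (8,))     # ground-truth class indices

    p_hat = F.softmax(logits, dim=1)   # estimate of P(Y = y | X = x) per class
    loss = F.cross_entropy(logits, y)  # mean of -log p_hat[n, y_n] over the batch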
For density modeling, the standard loss is the likelihood of the data. If f(x;w) is to be interpreted as a normalized log-probability or density, the loss is the opposite of the sum of its values over the training samples.
3.2 Autoregressive models
Many spectacular applications in computer vision and natural language processing have been tackled by modeling the distribution of a high-dimensional discrete vector with the chain rule:
[Figure: the tokens x_1, x_2, ..., x_T of an autoregressive model, each predicted from the preceding ones, starting from a constant token.]
3.3 Gradient descent
Except in specific cases like the linear regression
we saw in § 1.2, the optimal parameters w∗ do
not have a closed form expression. In the general
case the tool of choice to minimize a function
is gradient descent. It consists of initializing the
parameters with a random w0 , and then improv-
ing this estimate by iterating gradient steps, each
consisting of computing the gradient of the loss
with respect to the parameters, and subtracting
a fraction of it
[Figure: gradient descent on a one-dimensional loss landscape ℒ(w).]
As for many algorithms, intuition tends to break down in very high dimensions, and although it seems that this procedure would be very easily trapped in a local minimum, in reality, due to the number of parameters, the design of the models, and the stochasticity in the data, its efficiency is far greater than one could expect.
where

$$\ell_n(w) = L\big(f(x_n; w),\, y_n\big)$$

for some L, and the gradient is then

$$\nabla \mathscr{L}\big|_w(w) = \frac{1}{N} \sum_{n=1}^{N} \nabla \ell_n\big|_w(w). \qquad (3.2)$$
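A minimal sketch of gradient steps computed on mini-batches, which approximate the average of the per-sample gradients of Equation 3.2; the linear model, synthetic data, batch size, and learning rate are hypothetical.

    # Minimal sketch: stochastic gradient descent on mini-batches. The model,
    # data, batch size, and learning rate are hypothetical choices.
    import torch

    x = torch.randn(256, 10)
    y = torch.randn(256, 1)
    model = torch.nn.Linear(10, 1)
    lr = 1e-2

    for _ in range(100):
        idx = torch.randint(0, 256, (32,))       # random mini-batch of 32 samples
        loss = torch.nn.functional.mse_loss(model(x[idx]), y[idx])
        model.zero_grad()
        loss.backward()                          # gradients of the mini-batch loss
        with torch.no_grad():
            for p in model.parameters():
                p -= lr * p.grad                 # gradient step on every parameter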
ent training speeds in different parts of a model.
3.4 Backpropagation
Using gradient descent requires a technical means to compute ∇𝓁|w(w) where 𝓁 = L(f(x;w), y). Given that f and L are both compositions of standard tensor operations, as for any mathematical expression, the chain rule allows one to obtain an expression for it.
[Figure: the forward and backward passes through a layer f_d(·; w_d). The forward pass maps x^(d−1) to x^(d); the backward pass computes ∇𝓁|x^(d−1) from ∇𝓁|x^(d) through a product with the Jacobian J_{f_d}|x, and ∇𝓁|w_d through a product with the Jacobian J_{f_d}|w.]
Forward and backward passes
Consider the simple case of a composition of
mappings
f = f1 ◦f2 ◦···◦fD .
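A minimal sketch of how such a composition is handled in practice: the forward pass applies the mappings in sequence and keeps the intermediate activations, and autograd applies the chain rule in the backward pass; the mappings below are hypothetical.

    # Minimal sketch: a composition of D mappings applied in sequence.
    # The forward pass keeps the intermediate activations; calling backward()
    # applies the chain rule to get the gradients w.r.t. all the parameters.
    import torch

    D = 4
    layers = [torch.nn.Linear(16, 16) for _ in range(D)]

    x = torch.randn(8, 16)
    activations = [x]
    for f_d in layers:                           # forward pass
        activations.append(torch.tanh(f_d(activations[-1])))

    loss = activations[-1].pow(2).mean()
    loss.backward()                              # backward pass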
Resource usage
Regarding the computational cost, as we will see, the bulk of the computation goes into linear operations, which require one matrix product for the forward pass and two for the products by the Jacobians for the backward pass. This makes the latter roughly twice as costly as the former.
Vanishing gradient

A key historical issue when training a large network is that when the gradient propagates backwards through an operator, it may be rescaled by a multiplicative factor, and consequently decrease or increase exponentially when it traverses many layers. When it decreases exponentially, this is called the vanishing gradient, and it may make the training impossible or, in its milder form, cause different parts of the model to be updated at different speeds, degrading their co-adaptation [Glorot and Bengio, 2010].
3.5 Training protocols
Training a deep network requires defining a pro-
tocol to make the most of computation and data,
and ensure that performance will be good on
new data.
[Figure: the training and validation losses as a function of the number of epochs.]
An important design choice is the learning rate schedule during training. The general policy is that the learning rate should be initially large to avoid having the optimization trapped in a bad local minimum early, and that it should be reduced so that the optimized parameter values do not bounce around and can reach a good minimum in a narrow valley of the loss landscape.
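A minimal sketch of such a schedule with a standard PyTorch scheduler; the model, initial learning rate, decay factor, and decay interval are hypothetical choices.

    # Minimal sketch: start with a large learning rate and reduce it during
    # training. All the values here are hypothetical.
    import torch

    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    for epoch in range(100):
        # ... one pass over the training set, with optimizer.step() per batch ...
        optimizer.step()      # placeholder for the per-batch parameter updates
        scheduler.step()      # divides the learning rate by 10 every 30 epochs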
3.6 Training data
One key aspect of deep learning is the steady improvement of performance with the training set size. There has been an accumulation of em-
pirical results showing that performance, for in-
stance estimated through the loss on test data,
improves with the amount of data according to
accurate scaling laws, as long as the model size
increases correspondingly [Kaplan et al., 2020].
tributed nature of these models, that can take ad-
vantage of massively parallel computing devices,
and by the stochastic gradient descent, which
requires only a tiny fraction of the data at a time,
and can operate with data sets whose size is or-
ders of magnitude greater than that of the com-
puting device’s memory.
Part II
Deep models
Chapter 4
Model components
[Figure: example of the graphical notation used for models in this book, here a chain from an input X of size 32×32 through a module f repeated ×K and a module g with meta-parameter n=4 to an output Y of size 4×4.]
• non-default valued meta-parameters are
added in blue on their right,
4.2 Linear layers
Linear layers are the most important modules
in terms of computation and number of parame-
ters. They benefit from decades of research and
engineering in algorithmic and chip design for
matrix operations.
$$\forall d_1, \ldots, d_K, \quad Y[d_1, \ldots, d_K] = W\,X[d_1, \ldots, d_K] + b.$$
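A minimal example of such a fully connected layer in PyTorch, applied to the last axis of a tensor of arbitrary leading shape; the sizes are hypothetical.

    # Minimal sketch: an affine layer applied to the last axis of a tensor.
    import torch

    linear = torch.nn.Linear(in_features=50, out_features=25)
    X = torch.randn(16, 8, 50)     # any leading shape, last dimension 50
    Y = linear(X)                  # shape (16, 8, 25)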
Convolutional layers
A linear layer can take as input an arbitrarily
shaped tensor by reshaping it into a vector, as
long as it has the right number of coefficients.
However, such a layer is poorly adapted to dealing with large tensors, since the number of parameters and the number of operations are proportional to the product of the input and output dimensions. For instance, to process an RGB image of size 256×256 as input and compute a result of the same size, it would require ≃ 4×10^10 parameters and multiplications.
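For comparison, a convolutional layer shares the same small affine mapping across all positions, which keeps the parameter count low; a minimal sketch with hypothetical channel counts and kernel size:

    # Minimal sketch: a 2d convolution on a 3-channel 256x256 image with a
    # shared 3x3 kernel uses only out_channels * (in_channels * 3 * 3 + 1)
    # parameters, instead of the ~4e10 of a fully connected layer.
    import torch

    conv = torch.nn.Conv2d(in_channels=3, out_channels=3, kernel_size=3, padding=1)
    X = torch.randn(1, 3, 256, 256)
    Y = conv(X)                                             # shape (1, 3, 256, 256)
    n_params = sum(p.numel() for p in conv.parameters())    # 3 * (3*3*3 + 1) = 84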
Figure 4.1: A 1d convolution (left) takes as input
a D×T tensor X, applies the same affine mapping
ϕ(·;w) to every sub-tensor of shape D×K, and stores
the resulting D′ ×1 tensors into Y . A 1d transposed
convolution (right) takes as input a D×T tensor, ap-
plies the same affine mapping ψ(·;w) to every sub-
tensor of shape D×1, and sums the shifted resulting
D′ ×K tensors. Both can process inputs of different
size.
Figure 4.2: A 2d convolution (left) takes as input a
D×H ×W tensor X, applies the same affine map-
ping ϕ(·;w) to every sub-tensor of shape D×K ×L,
and stores the resulting D′ ×1×1 tensors into Y . A
2d transposed convolution (right) takes as input a
D×H ×W tensor, applies the same affine mapping
ψ(·;w) to every D×1×1 sub-tensor, and sums the
shifted resulting D′ ×K ×L tensors into Y .
Figure 4.3: Beside its kernel size and number of input / output channels, a convolution admits three meta-parameters: the stride s (left) modulates the step size when going through the input tensor, the padding p (top right) specifies how many zero entries are added around the input tensor before processing it, and the dilation d (bottom right) parameterizes the index count between coefficients of the filter.
the same operator everywhere.
than one. Its default value is 0.
Figure 4.4: Given an activation in a series of convolu-
tion layers, here in red, its receptive field is the area in
the input signal, in blue, that modulates its value. Each
intermediate convolutional layer increases the width
and height of that area by roughly those of the kernel.
A converse operation is the transposed convolution, which is also a localized affine operator, defined by similar meta- and trainable parameters as the convolution, but which applies, for instance in the 1d case, an affine mapping ψ(·;w): ℝ^(D×1) → ℝ^(D′×K) to every D×1 sub-tensor of the input, and sums the shifted D′×K resulting tensors to compute its output. Such an operator increases the size of the signal and can be understood intuitively as a synthesis process (see Figure 4.1, right, and Figure 4.2, right).
4.3 Activation functions
If a network combined only linear components, it would itself be a linear operator, so it is essential to have non-linear operations. They are implemented in particular with activation functions, which are layers that transform every component of the input tensor individually through a mapping, resulting in a tensor of the same shape.
[Figure 4.5: activation functions, including Tanh and ReLU.]

gelu(x) = x P(Z ≤ x),
4.4 Pooling
A classical strategy to reduce the signal size is to
use a pooling operation that combines multiple
activations into one that ideally summarizes the
information. The most standard operation of this
class is the max pooling layer which, similarly
to convolution, can operate in 1d and 2d, and is
defined by a kernel size.
[Figure 4.6: 1d max pooling.]
max over the sub-tensors. This is a linear opera-
tion, while max pooling is not.
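A minimal sketch of max and average pooling in PyTorch; the kernel size and tensor shape are hypothetical.

    # Minimal sketch: 2d max pooling and average pooling with a 2x2 kernel,
    # which halves the spatial size.
    import torch

    X = torch.randn(1, 64, 32, 32)
    Y_max = torch.nn.MaxPool2d(kernel_size=2)(X)   # (1, 64, 16, 16), max per 2x2 block
    Y_avg = torch.nn.AvgPool2d(kernel_size=2)(X)   # (1, 64, 16, 16), average per 2x2 block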
4.5 Dropout
Some layers have been designed to explicitly
facilitate training, or improve the quality of the
learned representations.
Figure 4.7: Dropout can process a tensor of arbitrary
shape. During training (left), it sets activations at ran-
dom to zero with probability p and applies a multiply-
ing factor to keep the expected values unchanged. Dur-
ing test (right), it keeps all the activations unchanged.
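In code, the train/test behavior of Figure 4.7 corresponds to switching the module between training and evaluation modes; a minimal sketch with a hypothetical drop probability:

    # Minimal sketch: dropout zeroes activations at random during training and
    # rescales the remaining ones to preserve the expectation; at test time it
    # is the identity.
    import torch

    dropout = torch.nn.Dropout(p=0.5)
    X = torch.ones(2, 8)

    dropout.train()
    Y_train = dropout(X)   # about half the entries zeroed, the others scaled to 2.0

    dropout.eval()
    Y_test = dropout(X)    # identical to X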
4.6 Normalizing layers
An important class of operators to facilitate the
training of deep architectures are the
normalizing layers
which force the empirical mean and
variance of groups of activations.
[Figure 4.8: batch normalization (left) and layer normalization (right). Both normalize activations as (x − m̂)/√(v̂ + ϵ) and then rescale them component-wise as x ⊙ γ + β; they differ in the group of activations, along the batch, channel D, and spatial H, W axes, over which the statistics m̂ and v̂ are computed.]
viation γ_d

$$z_{b,d} = \frac{x_{b,d} - \hat m_d}{\sqrt{\hat v_d + \epsilon}}, \qquad y_{b,d} = \gamma_d\, z_{b,d} + \beta_d.$$
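A minimal sketch of a batch normalization layer matching these formulas; the feature dimension and batch size are hypothetical.

    # Minimal sketch: BatchNorm1d normalizes every component with the empirical
    # mean and variance computed over the batch, then rescales and shifts it
    # with the learnt parameters gamma (weight) and beta (bias).
    import torch

    bn = torch.nn.BatchNorm1d(num_features=32)
    X = torch.randn(64, 32)
    Y = bn(X)    # per component: (x - mean) / sqrt(var + eps) * gamma + beta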
4.7 Skip connections
Another technique that mitigates the vanishing gradient and allows the training of deep architectures is the use of skip connections [Long et al., 2014;
Ronneberger et al., 2015]. They are not layers
per se, but an architectural design in which out-
puts of some layers are transported as-is to other
layers further in the model, bypassing process-
ing in-between. This unmodified signal can be
concatenated or added to the input to the layer
the connection branches into (see Figure 4.9). A
particular type of skip connections are the
residual connections
which combine the signal with
a sum, and usually skip only a few layers (see
Figure 4.9, right).
[Figure 4.9: skip connections (left), which transport the output of an earlier layer unchanged to a later point in the computation, and residual connections (right), which add the unmodified signal to the output of a few skipped layers.]
Their role can also be to facilitate multi-scale rea-
soning in models that reduce the signal size be-
fore re-expanding it, by connecting layers with
compatible size. In the case of residual connec-
tions, they may also facilitate the learning by
simplifying the task to finding a differential im-
provement instead of a full update.
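A minimal sketch of a residual connection wrapping a couple of layers; the dimensions and inner layers are hypothetical.

    # Minimal sketch: a residual connection adds the unmodified input back to
    # the output of a few layers.
    import torch

    class ResidualBlock(torch.nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.f = torch.nn.Sequential(
                torch.nn.Linear(dim, dim),
                torch.nn.ReLU(),
                torch.nn.Linear(dim, dim),
            )

        def forward(self, x):
            return x + self.f(x)    # the input x is transported as-is and added back

    block = ResidualBlock(64)
    y = block(torch.randn(8, 64))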
4.8 Attention layers
In many applications there is a need for a pro-
cessing able to combine local information at lo-
cations far apart in a tensor. This can be for
instance distant details for coherent and realistic
image synthesis, or words at different positions
in a paragraph to make a grammatical or seman-
tic decision in natural language processing.
Attention operator
Given
Y = att(K,Q,V )
[Figure 4.11: the attention operator. The attention matrix A is computed from Q and K with a softargmax (an exp followed by a 1/Σ_k normalization), optionally masked by a Boolean matrix M, passed through dropout, and multiplied by V to produce Y.]
matrix can be masked by multiplying it, before the softargmax normalization, by a Boolean matrix M. This allows, for instance, making the operator causal by taking M full of 1s below the diagonal and zeros above, preventing Y_q from depending on keys and values of indices k greater than q. Second, the attention matrix is processed by a dropout layer (see § 4.5) before being multiplied by V, providing the usual benefits during training.
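A minimal sketch of the attention operator as described above, with optional causal masking and dropout; it uses the usual scaling of the scores by the square root of their dimension, and implements the mask by setting the excluded scores to −∞ before the softargmax, which has the same effect as the Boolean multiplication. The sizes are hypothetical.

    # Minimal sketch of the attention operator: the attention matrix A is a
    # (possibly masked) softargmax of the Q K^T scores, passed through dropout,
    # and used to average the rows of V.
    import math
    import torch

    def att(Q, K, V, causal=False, p_drop=0.0):
        A = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
        if causal:
            keep = torch.ones(A.shape[-2:], dtype=torch.bool).tril()
            A = A.masked_fill(~keep, float("-inf"))   # Y_q ignores indices k > q
        A = A.softmax(dim=-1)
        A = torch.nn.functional.dropout(A, p=p_drop)
        return A @ V

    Q = torch.randn(10, 64)
    K = torch.randn(12, 64)
    V = torch.randn(12, 64)
    Y = att(Q, K, V)    # shape (10, 64)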
• W^Q of size H × D × D_QK,
• W^K of size H × D × D_QK, and
• W^V of size H × D × D_V,
Figure 4.12: The Multi-head Attention layer applies
for each of its h = 1,...,H heads a parametrized lin-
ear transformation to individual elements of the input
sequences X Q ,X K ,X V to get sequences Q,K,V that
are processed by the attention operator to compute Yh .
These H sequences are concatenated along features,
and individual elements are passed through one last
linear operator to get the final result sequence Y .
• X^Q of size N^Q × D,
• X^K of size N^KV × D, and
• X^V of size N^KV × D,

Y = (Y_1 | ··· | Y_H) W^O.
4.9 Token embedding
In many situations, we need to convert discrete
tokens into vectors. This can be done with an
embedding layer
which consists of a lookup table
that directly maps integers to vectors.
$$\forall d_1, \ldots, d_K, \quad Y[d_1, \ldots, d_K] = M\big[X[d_1, \ldots, d_K]\big].$$
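A minimal sketch of an embedding layer; the vocabulary size and embedding dimension are hypothetical.

    # Minimal sketch: an embedding layer is a lookup table M mapping every
    # integer token index to a learnt vector.
    import torch

    embed = torch.nn.Embedding(num_embeddings=1000, embedding_dim=64)
    X = torch.randint(0, 1000, (2, 16))   # a batch of two token sequences
    Y = embed(X)                          # shape (2, 16, 64)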
4.10 Positional encoding
While the processing of a fully connected layer
is specific to both the positions of the features
in the input tensor, and to the position of the
resulting activation in the output tensor, convo-
lutional layers and multi-head attention layers
are oblivious to the absolute position in the ten-
sor. This is key to their strong invariance and inductive bias, which is beneficial for dealing with a stationary signal.
D, Vaswani et al. [2017] add

$$\text{pos-enc}[t,d] =
\begin{cases}
\sin\!\big(t \,/\, T^{d/D}\big) & \text{if } d \in 2\mathbb{N} \\
\cos\!\big(t \,/\, T^{(d-1)/D}\big) & \text{otherwise,}
\end{cases}$$

with T = 10^4.
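A minimal sketch computing this sinusoidal positional encoding; the sequence length and dimension D are hypothetical.

    # Minimal sketch of the positional encoding above: even dimensions get
    # sin(t / T^(d/D)) and odd dimensions get cos(t / T^((d-1)/D)).
    import torch

    def pos_enc(seq_len, D, T=10_000.0):
        t = torch.arange(seq_len).unsqueeze(1).float()   # (seq_len, 1)
        d = torch.arange(D).unsqueeze(0)                 # (1, D)
        exponent = (2 * (d // 2)).float() / D            # d/D if d even, (d-1)/D if odd
        angle = t / T ** exponent
        return torch.where(d % 2 == 0, torch.sin(angle), torch.cos(angle))

    pe = pos_enc(seq_len=128, D=64)   # shape (128, 64), added to the token embeddings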
Chapter 5
Architectures
5.1 Multi-Layer Perceptrons
The simplest deep architecture is the
Multi-Layer Perceptron
(MLP), which takes the form
of a succession of fully connected layers sepa-
rated by activation functions. See an example in Figure 5.1. For historical reasons, in such a
model, the number of hidden layers refers to the
number of linear layers, excluding the last one.
Figure 5.1: This multi-layer perceptron takes as input a one-dimensional tensor of size 50, and is composed of three fully connected layers with outputs of dimensions respectively 25, 10, and 2, the first two followed by ReLU layers.
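A possible PyTorch implementation of the MLP of Figure 5.1:

    # The MLP of Figure 5.1: input of size 50, hidden layers of sizes 25 and 10
    # followed by ReLU, and an output of size 2.
    import torch

    mlp = torch.nn.Sequential(
        torch.nn.Linear(50, 25),
        torch.nn.ReLU(),
        torch.nn.Linear(25, 10),
        torch.nn.ReLU(),
        torch.nn.Linear(10, 2),
    )

    y = mlp(torch.randn(16, 50))   # shape (16, 2)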
any continuous function f can be approximated
arbitrarily well uniformly on a compact by a
model of the form l2 ◦σ◦l1 where l1 and l2 are
affine. Such a model is an MLP with a single hid-
den layer, and this result implies that it can ap-
proximate anything of practical value. However
this approximation holds if the dimension of the
first linear layer’s output can be arbitrarily large.
5.2 Convolutional networks
The standard architecture for processing images
is a convolutional network, or convnet, that
combines multiple convolutional layers, either
to reduce the signal size before it can be pro-
cessed by fully connected layers, or to output a
2d signal also of large size.
LeNet-like
The original LeNet model for image classifica-
tion [LeCun et al., 1998] combines a series of 2d
convolutional layers and max pooling layers that
play the role of feature extractor, with a series of
fully connected layers which act like an MLP and perform the classification per se. See Figure 5.2
for an example.
Residual networks
Standard convolutional neural networks that fol-
low the architecture of the LeNet family are not
easily extended to deep architectures and suffer
[Figure 5.2 diagram: X (1×28×28) → conv-2d k=5 → (32×24×24) → maxpool k=3 → (32×8×8) → relu → conv-2d k=5 → (64×4×4) → maxpool k=2 → (64×2×2) → relu → reshape → (256) → fully-conn → (200) → relu → fully-conn → (10) → P̂(Y), with the convolutional part acting as feature extractor and the fully connected part as classifier.]
Figure 5.2: Example of a small LeNet-like network for classifying 28×28 grayscale images of handwritten digits [LeCun et al., 1998]. Its first half is convolutional, and alternates convolutional layers per se and max pooling layers, reducing the signal dimension from 28×28 scalars to 256. Its second half processes this 256-dimension feature vector through a one-hidden-layer perceptron to compute 10 logit scores corresponding to the ten possible digits.
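A possible PyTorch implementation of the LeNet-like network of Figure 5.2:

    # The LeNet-like network of Figure 5.2: a convolutional feature extractor
    # followed by a one-hidden-layer perceptron producing 10 logits.
    import torch

    lenet = torch.nn.Sequential(
        torch.nn.Conv2d(1, 32, kernel_size=5),    # 1x28x28 -> 32x24x24
        torch.nn.MaxPool2d(kernel_size=3),        # -> 32x8x8
        torch.nn.ReLU(),
        torch.nn.Conv2d(32, 64, kernel_size=5),   # -> 64x4x4
        torch.nn.MaxPool2d(kernel_size=2),        # -> 64x2x2
        torch.nn.ReLU(),
        torch.nn.Flatten(),                       # -> 256
        torch.nn.Linear(256, 200),
        torch.nn.ReLU(),
        torch.nn.Linear(200, 10),                 # 10 logit scores
    )

    logits = lenet(torch.randn(1, 1, 28, 28))     # shape (1, 10)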
[Figure 5.3 diagram: X (C×H×W) → conv-2d k=1 → (C/2×H×W) → batchnorm → relu → conv-2d k=3 p=1 → batchnorm → relu → conv-2d k=1 → (C×H×W) → batchnorm → addition of X → relu → Y (C×H×W).]
Figure 5.3: A residual block.
[Figure 5.4 diagram: main branch X (C×H×W) → conv-2d k=1 → (C/S×H×W) → batchnorm → relu → conv-2d k=3 s=S p=1 → (C/S × H/S × W/S) → batchnorm → relu → conv-2d k=1 → (4C/S × H/S × W/S) → batchnorm; shortcut branch X → conv-2d k=1 s=S → batchnorm; the two branches are added and passed through a relu to produce Y (4C/S × H/S × W/S).]
Figure 5.4: A downscaling residual block. It admits a
meta-parameter S, the stride of the first convolution
layer, which modulates the reduction of the tensor size.
[Figure 5.5 diagram: X (3×224×224) → conv-2d k=7 s=2 p=3 → (64×112×112) → batchnorm → relu → maxpool k=3 s=2 p=1 → (64×56×56) → dresblock S=1 → (256×56×56) → resblock ×2 → dresblock S=2 → (512×28×28) → resblock ×3 → dresblock S=2 → (1024×14×14) → resblock ×5 → dresblock S=2 → (2048×7×7) → resblock ×2 → avgpool k=7 → (2048×1×1) → reshape → (2048) → fully-conn → (1000) → P̂(Y).]
Figure 5.5: Structure of the ResNet-50 [He et al., 2015].
tation. However the parameter count of a con-
volutional layer, and its computational cost, are
quadratic with the number of channels. This
residual block mitigates this problem by first re-
ducing the number of channels with a 1×1 con-
volution, then operating spatially with a 3×3
convolution on this reduced number of chan-
nels, and then up-scaling the number of chan-
nels, again with a 1×1 convolution.
blocks. Surprisingly, in the first section, there
is no downscaling, only an increase of the num-
ber of channels by a factor of 4. The output of
the last residual block is 2048×7×7, which is
converted to a vector of dimension 2048 by an
average pooling of kernel size 7×7, and then
processed through a fully connected layer to get
the final logits, here for 1000 classes.
5.3 Attention models
As stated in § 4.8, many applications, in partic-
ular from natural language processing, greatly
benefit from models that include attention mech-
anisms. The architecture of choice for such tasks,
which has been instrumental in recent advances
in deep learning, is the Transformer proposed
by Vaswani et al. [2017].
Transformer
The original Transformer, pictured on Figure 5.7,
was designed for sequence-to-sequence trans-
lation. It combines an encoder that processes
the input sequence to get a refined representa-
tion, and an autoregressive decoder that gener-
ates each token of the result sequence, given the
encoder’s representation of the input sequence,
and the output tokens generated so far. As the
residual convolutional networks of § 5.2, both
the encoder and the decoder of the Transformer
are sequences of compounded blocks built with
residual connections.
[Figure 5.6 diagram: in both blocks, the input goes through a layernorm, a multi-head attention layer, and a residual addition, then through a layernorm, a two-layer MLP (fully-conn, gelu, fully-conn), a dropout, and a second residual addition, producing Y. In the self-attention block, Q, K, and V are all computed from X^QKV; in the cross-attention block, Q is computed from X^Q and K, V from X^KV.]
Figure 5.6: Self-attention block (left) and
cross-attention block
(right). These specific structures proposed by
Radford et al. [2018] differ slightly from the original
architecture of Vaswani et al. [2017], in particular by
having the layer normalization first in the residual
blocks.
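A minimal sketch of such a pre-norm self-attention block, using PyTorch's built-in multi-head attention; the dimensions, the number of heads, the 4× expansion of the MLP, and the dropout probability are hypothetical choices.

    # Minimal sketch of a self-attention block as in Figure 5.6 (left):
    # layernorm, multi-head attention, residual addition, then layernorm,
    # a two-layer MLP with GELU, dropout, and a second residual addition.
    import torch

    class SelfAttentionBlock(torch.nn.Module):
        def __init__(self, dim, num_heads, p_drop=0.1):
            super().__init__()
            self.ln1 = torch.nn.LayerNorm(dim)
            self.mha = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.ln2 = torch.nn.LayerNorm(dim)
            self.mlp = torch.nn.Sequential(
                torch.nn.Linear(dim, 4 * dim),
                torch.nn.GELU(),
                torch.nn.Linear(4 * dim, dim),
                torch.nn.Dropout(p_drop),
            )

        def forward(self, x):
            z = self.ln1(x)
            x = x + self.mha(z, z, z, need_weights=False)[0]   # residual connection
            x = x + self.mlp(self.ln2(x))                      # residual connection
            return x

    block = SelfAttentionBlock(dim=64, num_heads=4)
    y = block(torch.randn(2, 16, 64))   # a batch of two sequences of length 16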
[Figure 5.7: the original encoder-decoder Transformer. The encoder embeds the input tokens X_1,...,X_T, adds a positional encoding, and applies N self-attention blocks to produce a representation Z_1,...,Z_T of size T×D. The decoder embeds the shifted output tokens 0,Y_1,...,Y_{S−1}, adds a positional encoding, and applies N pairs of causal self-attention and cross-attention blocks, the latter attending to Z_1,...,Z_T; a final fully connected layer maps the S×D result to the logits P̂(Y_1),...,P̂(Y_S | Y_{s<S}) over the vocabulary of size V.]
input two sequences, one to compute the queries,
and one the keys and values.
[Figure 5.8: the GPT-style decoder-only model. The shifted tokens 0,X_1,...,X_{T−1} are embedded, combined with a positional encoding, processed by N causal self-attention blocks, and mapped by a final fully connected layer to the logits P̂(X_1),...,P̂(X_T | X_{t<T}) over the vocabulary of size V.]
[Figure 5.9: the Vision Transformer. The M image patches X_1,...,X_M (each of dimension 3P²) are mapped by a linear embedding W^E to E_1,...,E_M, a learnt token E_0 is prepended, a positional encoding is added, and the (M+1)×D sequence is processed by N self-attention blocks; an MLP readout maps the first output token Z_0 to the C class logits P̂(Y).]
Vision Transformer
Transformers have been put to use for image
classification with the Vision Transformer (ViT)
model [Dosovitskiy et al., 2020], see Figure 5.9.
Part III
Applications
Chapter 6
Prediction
6.1 Image denoising
A direct application of deep models to image pro-
cessing is to recover from degradation by using
the redundancy in the statistical structure of im-
ages. The petals of a sunflower on a grayscale
picture can be colored with high confidence, and
the texture of a geometric shape such as a table
on a low-light grainy picture can be corrected
by averaging it over a large area likely to be
uniform.
6.2 Image classification
Image classification is the simplest strategy to
extract semantics from an image, and consists
of predicting a class among a finite predefined
number of classes, given an input image.
6.3 Object detection
A more complex task for image understanding
is object detection, in which case the objective
is, given an input image, to predict the classes
and positions of objects of interest.
[Figure: feature maps Z_1, ..., Z_S of decreasing resolution computed from the input image X.]
Figure 6.2: Examples of object detection with the Single-
Shot Detector [Liu et al., 2015].
it. This results in a non-ambiguous matching of any bounding box (x1, x2, y1, y2) to a triple (s, h, w), determined respectively by max(x2 − x1, y2 − y1), (y1 + y2)/2, and (x1 + x2)/2.
regression of geometric quantities.
6.4 Semantic segmentation
The finest grain prediction task for image under-
standing is semantic segmentation, which con-
sists of predicting for every pixel the class of the
object it belongs to. This can be achieved with
a standard convolutional neural network, which outputs a convolutional map with as many channels as there are classes, carrying the estimated logits for every pixel.
Figure 6.3: Semantic segmentation results with the
Pyramid Scene Parsing Network [Zhao et al., 2016].
backbone, concatenate the resulting multi-scale
representation after upscaling, before making
the final per-pixel prediction [Zhao et al., 2016].
6.5 Speech recognition
Speech recognition consists of converting a
sound sample into a sequence of words. There
have been plenty of approaches to this problem
historically, but a conceptually simple and recent
one proposed by Radford et al. [2022] consists of
casting it as a sequence-to-sequence translation
and then solving it with a standard attention-
based Transformer, that we described in § 5.3.
6.6 Text-image representations
A powerful approach to image understanding
consists of learning consistent image and text
representations.
resulting in an N×N matrix of similarity scores
Figure 6.4: The CLIP text-image embedding [Radford et al., 2021] allows zero-shot prediction by predicting which class description embedding is the most consistent with the image embedding.
Chapter 7
Synthesis
7.1 Text generation
The standard approach to text synthesis is to use
an attention-based autoregressive model. The
most successful in this domain is the GPT [Radford et al., 2018], which we described in § 5.3.
7.2 Image generation
Multiple deep methods have been developed to model and sample from a high-dimensional density.
A powerful one for image synthesis relies on
inverting a diffusion process.
re-normalizing the variance to 1. This process
reduces exponentially the importance of x0 , and
xt ’s density can rapidly be approximated with a
normal.
The missing bits
niques now used for deep architectures such as
rectifiers and gating, a form of skip connections
which are modulated dynamically.
Autoencoder
An autoencoder is a model that maps the input
signal, possibly of high dimension, to a low-
dimension latent representation, and maps it
back to the original signal, ensuring that infor-
mation has been preserved. We saw it in § 6.1
for denoising, but it can be used to discover au-
tomatically a meaningful low-dimension param-
eterization of the data manifold. The
Variational Autoencoder
(VAE) proposed by Kingma and
Welling [2013] has a similar structure, but im-
poses through the loss a distribution on the latent representation. After training, it allows generating signals by sampling the latent representation
according to this imposed distribution, and then
mapping back through the decoder.
Reinforcement Learning
Many problems necessitate a model to estimate
an accumulated long-term reward given action
choices and an observable state, and what ac-
tions to choose to maximize that reward.
Reinforcement Learning
(RL) is the standard frame-
work to formalize such problems, and strategy
games or robotic control, for instance, can be for-
mulated in it. Deep models, in particular convo-
lutional neural networks, have demonstrated ex-
cellent performance for this class of tasks [Mnih
et al., 2015].
Fine-tuning
As we saw in § 6.3 for object detection, or in § 6.4
for semantic segmentation, fine-tuning deep architectures is an efficient strategy to deal with small training sets. Besides, due to the dramatic
increase in the size of architectures, particularly
that of Large Language Models, training a single
model can cost several millions of dollars, and
fine-tuning is a crucial, and often the only way,
to achieve high performance on a specific task.
the feature vectors they carry.
Self-supervised training
As stated in § 7.1, even though they are trained
only to predict the next word,
Large Language Models
trained on large unlabeled data-sets such
as GPT (see § 5.3) are able to solve various tasks
such as identifying the grammatical role of a
word, answering questions, or even translating
from one language to another [Radford et al., 2019].
Such models constitute one category of a larger
class of methods that fall under the name of self-supervised learning, and try to take advantage
of unlabeled data-sets [Balestriero et al., 2023].
The key principle of these methods is to define a
task that does not require labels but necessitates
feature representations which are useful for the
real task of interest, for which a small labeled
data set exists. In computer vision, for instance,
a standard approach consists of optimizing im-
age features so that they are invariant to data
transformations that do not change the semantic
content of the image, while being statistically
uncorrelated [Zbontar et al., 2021].
Afterword
Bibliography
J. Devlin, M. Chang, K. Lee, and K. Toutanova.
BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding.
CoRR, abs/1810.04805, 2018. [pdf]. 21, 100
A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al.
An Image is Worth 16x16 Words: Transform-
ers for Image Recognition at Scale. CoRR,
abs/2010.11929, 2020. [pdf]. 99, 100
K. Fukushima. Neocognitron: A self-organizing
neural network model for a mechanism of
pattern recognition unaffected by shift in po-
sition. Biological Cybernetics, 36(4):193–202,
April 1980. [pdf]. 2
Y. Gal and Z. Ghahramani. Dropout as
a Bayesian Approximation: Representing
Model Uncertainty in Deep Learning. CoRR,
abs/1506.02142, 2015. [pdf]. 66
X. Glorot and Y. Bengio. Understanding the dif-
ficulty of training deep feedforward neural
networks. In International Conference on Arti-
ficial Intelligence and Statistics (AISTATS), 2010.
[pdf]. 40, 51
X. Glorot, A. Bordes, and Y. Bengio. Deep Sparse
Rectifier Neural Networks. In International
Conference on Artificial Intelligence and Statis-
tics (AISTATS), 2011. [pdf]. 59
A. Gomez, M. Ren, R. Urtasun, and R. Grosse.
The Reversible Residual Network: Backprop-
agation Without Storing Activations. CoRR,
abs/1707.04585, 2017. [pdf]. 39
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza,
et al. Generative Adversarial Networks. CoRR,
abs/1406.2661, 2014. [pdf]. 127
K. He, X. Zhang, S. Ren, and J. Sun. Deep Resid-
ual Learning for Image Recognition. CoRR,
abs/1512.03385, 2015. [pdf]. 21, 71, 72, 89, 91
D. Hendrycks and K. Gimpel. Gaussian Error
Linear Units (GELUs). CoRR, abs/1606.08415,
2016. [pdf]. 61
D. Hendrycks, K. Zhao, S. Basart, et al. Natural
Adversarial Examples. CoRR, abs/1907.07174,
2019. [pdf]. 117
J. Ho, A. Jain, and P. Abbeel. Denoising Diffusion
Probabilistic Models. CoRR, abs/2006.11239,
2020. [pdf]. 122, 123, 124
S. Hochreiter and J. Schmidhuber. Long Short-
Term Memory. Neural Computation, 9(8):1735–
1780, 1997. [pdf]. 126
S. Ioffe and C. Szegedy. Batch Normalization: Ac-
celerating Deep Network Training by Reduc-
ing Internal Covariate Shift. In International
Conference on Machine Learning (ICML), 2015.
[pdf]. 67
W. Liu, D. Anguelov, D. Erhan, et al. SSD: Single
Shot MultiBox Detector. CoRR, abs/1512.02325,
2015. [pdf]. 106, 108
A. Radford, J. Kim, T. Xu, et al. Robust Speech
Recognition via Large-Scale Weak Supervi-
sion. CoRR, abs/2212.04356, 2022. [pdf]. 114
J. Sevilla, L. Heim, A. Ho, et al. Compute Trends
Across Three Eras of Machine Learning. CoRR,
abs/2202.05924, 2022. [pdf]. 9, 21
H. Zhao, J. Shi, X. Qi, et al. Pyramid Scene
Parsing Network. CoRR, abs/1612.01105, 2016.
[pdf]. 112, 113
Index
1d convolution, 55
2d convolution, 55
activation, 22, 37
activation function, 59, 85
activation map, 56
Adam, 34
artificial neural network, 8, 11
attention layer, 74
attention operator, 75
autoencoder, 127
autograd, 38
autoregressive model, 29, 120
average pooling, 62
backpropagation, 38
backward pass, 38
basis function regression, 14
batch, 20, 34
batch normalization, 67
bias vector, 50, 55
BPE, 30, 114, 120
Byte Pair Encoding, 30
cache memory, 20
capacity, 15
causal, 30, 77
causal model, 29, 78, 97
channel, 22
checkpointing, 39
classification, 17
CLIP, 116
CLS token, 100
computational cost, 39
contrastive loss, 26, 116
convnet, 87
convolutional layer, 53, 87
convolutional network, 87
cross-attention block, 80, 95
cross-entropy, 25
filter, 55
fine tuning, 129
flops, 21
forward pass, 37
foundation models, 121
FP32, 21
framework, 22
fully connected layer, 50, 85, 87
GAN, 127
GELU, 61
Generative Adversarial Networks, 127
generator, 127
GNN, 129
GPT, 98, 116, 120, 130
GPU, 8, 19
gradient descent, 31, 33, 36
gradient step, 31
Graph Neural Network, 129
Graphical Processing Unit, 8, 19
ground truth, 17
hidden layer, 85
hidden state, 126
image classification, 105
image processing, 87
image synthesis, 74, 122
inductive bias, 16, 42, 53
max pooling, 62
mean squared error, 14, 25
memory requirement, 39
memory speed, 20
meta parameter, 13, 41
metric learning, 26
MLP, 85, 95
model, 12
Multi-Head Attention, 78, 94
multi-layer perceptron, 85
padding, 55, 62
parameter, 12
parametric model, 12
peak performance, 21
pooling, 62
positional encoding, 82, 97
posterior probability, 25
pre-trained model, 109, 113
query, 75
random initialization, 51
receptive field, 56, 57, 106
rectified linear unit, 59, 127
recurrent neural network, 126
regression, 17
reinforcement learning, 128
ReLU, 59
residual block, 90
residual connection, 71, 89
residual network, 71, 89
resnet, 71, 89
ResNet-50, 89
reversible layer, 39
RL, 128
RNN, 126
scaling laws, 44
self-attention block, 80, 94, 95
self-supervised learning, 130
semantic segmentation, 111
SGD, 34
Single Shot Detector, 106
skip connection, 71, 112, 127
softargmax, 25, 76
softmax, 25
speech recognition, 114
SSD, 106
stochastic gradient descent, 34, 45
stride, 56, 62
supervised learning, 18
tanh, 60
tensor, 22
tensor cores, 20
Tensor Processing Units, 20
test set, 41
text synthesis, 120
tokenizer, 30, 114, 120
tokens, 28
TPU, 20
trainable parameter, 12
training, 12
training set, 12, 24, 41, 44
Transformer, 71, 75, 94, 96, 114
transposed convolution, 58
under-fitting, 15
universal approximation theorem, 85
unsupervised learning, 18
VAE, 127
validation set, 41
value, 75
vanishing gradient, 40, 47
variational autoencoder, 127
variational bound, 124
Vision Transformer, 100
ViT, 100, 116
vocabulary, 28
weight, 13
weight decay, 27
weight matrix, 50
This book is licensed under the Creative Com-
mons BY-NC-SA 4.0 International License.