Deep Learning
Gabriel Peyré
CNRS & DMA
École Normale Supérieure
gabriel.peyre@ens.fr
https://mathematical-tours.github.io
www.numerical-tours.com
Before detailing deep architectures and their use, we start this chapter by presenting two essential computational tools that are used to train these models: stochastic optimization methods and automatic differentiation. In practice, they work hand in hand and make it possible to painlessly learn complicated non-linear models on large-scale datasets.
This function $h_{W,u}(\cdot)$ is thus a weighted sum of $q$ "ridge functions" $\rho(\langle \cdot, w_k \rangle)$. These functions are constant in the direction orthogonal to the neuron $w_k$ and have a profile defined by $\rho$.
The most popular non-linearities are sigmoid functions such as
$$ \rho(r) = \frac{e^r}{1+e^r} \qquad \text{and} \qquad \rho(r) = \frac{1}{\pi}\operatorname{atan}(r) + \frac{1}{2}, $$
and the rectified linear unit (ReLU) function $\rho(r) = \max(r, 0)$.
One often adds a bias term in these models, and considers functions of the form $\rho(\langle \cdot, w_k \rangle + z_k)$, but this bias term can be integrated in the weights as usual by writing $\langle a, w_k \rangle + z_k = \langle (a,1), (w_k, z_k) \rangle$, so we ignore it in the following section. This simply amounts to replacing $a \in \mathbb{R}^p$ by $(a,1) \in \mathbb{R}^{p+1}$ and adding a dimension $p \mapsto p+1$, as a pre-processing of the features.
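To make this concrete, here is a minimal NumPy sketch of the one-hidden-layer model $h_{W,u}(a) = \sum_k u_k \rho(\langle a, w_k\rangle)$ together with the bias trick; the array names, sizes and the choice of the sigmoid are illustrative, not fixed by the text.

```python
import numpy as np

def rho(r):
    # sigmoid non-linearity rho(r) = e^r / (1 + e^r)
    return 1.0 / (1.0 + np.exp(-r))

def mlp(A, W, u):
    # A: (n, p) features, W: (q, p) neurons, u: (q,) output weights
    # returns h_{W,u}(a_i) = sum_k u_k rho(<a_i, w_k>) for each row a_i of A
    return rho(A @ W.T) @ u

# bias trick: append a constant feature so that <(a,1), (w,z)> = <a,w> + z
n, p, q = 50, 3, 10
A = np.random.randn(n, p)
A1 = np.hstack([A, np.ones((n, 1))])   # features now in R^{p+1}
W1 = np.random.randn(q, p + 1)         # each row stacks (w_k, z_k)
u = np.random.randn(q)
print(mlp(A1, W1, u).shape)            # (n,)
```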
Expressiveness. In order to define functions of arbitrary complexity when $q$ increases, it is important that $\rho$ is non-linear. Indeed, if $\rho(s) = s$, then $h_{W,u}(a) = \langle W a, u \rangle = \langle a, W^\top u \rangle$. It is thus a linear function with weights $W^\top u$, whatever the number $q$ of neurons. Similarly, if $\rho$ is a polynomial on $\mathbb{R}$ of degree $d$, then $h_{W,u}(\cdot)$ is itself a polynomial of degree $d$ on $\mathbb{R}^p$, which belongs to a linear space $V$ of finite dimension $\dim(V) = O(p^d)$. So even if $q$ increases, the dimension $\dim(V)$ stays fixed and $h_{W,u}(\cdot)$ cannot approximate an arbitrary function outside $V$. In sharp contrast, one can show that if $\rho$ is not polynomial, then $h_{W,u}(\cdot)$ can approximate any continuous function, as studied in Section 15.1.3.
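As a quick numerical sanity check of this collapse (a sketch with arbitrary random sizes), one can verify that with $\rho(s)=s$ the network output coincides with the single linear model of weights $W^\top u$, whatever $q$ is.

```python
import numpy as np

n, p, q = 20, 4, 100                      # q can be arbitrarily large
A = np.random.randn(n, p)
W = np.random.randn(q, p)
u = np.random.randn(q)

h_linear_rho = (A @ W.T) @ u              # network with rho(s) = s
h_collapsed = A @ (W.T @ u)               # single linear model with weights W^T u
print(np.allclose(h_linear_rho, h_collapsed))   # True
```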
Note that here, the parameters being optimized are (W, u) ∈ Rq×p × Rq .
Optimizing with respect to u. This function $f$ is convex with respect to $u$, since it is a quadratic function. Its gradient with respect to $u$ can be computed as in (13.8) and thus
$$ \nabla_u f(W, u) = \rho(AW^\top)^\top \big( \rho(AW^\top) u - y \big), $$
and one can compute the solution in closed form (assuming $\ker(\rho(AW^\top)) = \{0\}$) as
$$ u^\star = [\rho(AW^\top)^\top \rho(AW^\top)]^{-1} \rho(AW^\top)^\top y = [\rho(W A^\top)\rho(AW^\top)]^{-1} \rho(W A^\top) y. $$
When $W = \mathrm{Id}_p$ and $\rho(s) = s$, one recovers the least squares formula (13.9).
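The following sketch solves this quadratic problem in $u$ for a sigmoid $\rho$ and random data (all sizes are illustrative). It uses `np.linalg.lstsq` rather than forming the normal equations explicitly, which is numerically safer but yields the same $u^\star$ when $\rho(AW^\top)$ has full column rank.

```python
import numpy as np

def rho(r):
    # sigmoid non-linearity (redefined so the snippet is self-contained)
    return 1.0 / (1.0 + np.exp(-r))

np.random.seed(0)
n, p, q = 200, 5, 12
A = np.random.randn(n, p)
y = np.random.randn(n)
W = np.random.randn(q, p)

Phi = rho(A @ W.T)                        # features rho(A W^T), shape (n, q)
u_star, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# equivalent to the normal-equations formula [Phi^T Phi]^{-1} Phi^T y
u_check = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(np.linalg.norm(u_star - u_check))   # small when Phi is well conditioned
```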
Optimizing with respect to W. The function $f$ is non-convex with respect to $W$ because the function $\rho$ is itself non-linear. Training an MLP is thus a delicate process, and one can only hope to obtain a local minimum of $f$. It is also important to initialize the neurons $(w_k)_k$ correctly (for instance as unit-norm random vectors, but bias terms might need some adjustment), while $u$ can usually be initialized at $0$.
To compute its gradient with respect to $W$, we first note that for a perturbation $\varepsilon \in \mathbb{R}^{q \times p}$, one has
$$ \rho(A(W+\varepsilon)^\top) = \rho(AW^\top + A\varepsilon^\top) = \rho(AW^\top) + \rho'(AW^\top) \odot (A\varepsilon^\top) + o(\|\varepsilon\|), $$
where we have denoted "$\odot$" the entry-wise multiplication of matrices, i.e. $U \odot V = (U_{i,j} V_{i,j})_{i,j}$. One thus has
$$ f(W+\varepsilon, u) = \frac{1}{2}\big\| e + [\rho'(AW^\top) \odot (A\varepsilon^\top)] u \big\|^2 + o(\|\varepsilon\|) \quad\text{where}\quad e \stackrel{\text{def.}}{=} \rho(AW^\top) u - y \in \mathbb{R}^n $$
$$ = f(W, u) + \langle e, [\rho'(AW^\top) \odot (A\varepsilon^\top)] u \rangle + o(\|\varepsilon\|) $$
$$ = f(W, u) + \langle A\varepsilon^\top, \rho'(AW^\top) \odot (e u^\top) \rangle + o(\|\varepsilon\|) $$
$$ = f(W, u) + \langle \varepsilon^\top, A^\top [\rho'(AW^\top) \odot (e u^\top)] \rangle + o(\|\varepsilon\|), $$
so that the gradient reads $\nabla_W f(W, u) = [\rho'(AW^\top) \odot (e u^\top)]^\top A \in \mathbb{R}^{q \times p}$.
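This gradient formula is easy to check against finite differences; below is an illustrative sketch using the sigmoid $\rho$ (whose derivative is $\rho' = \rho(1-\rho)$) and arbitrary random sizes.

```python
import numpy as np

def rho(r):
    return 1.0 / (1.0 + np.exp(-r))

def drho(r):
    s = rho(r)
    return s * (1.0 - s)

def f(W, u, A, y):
    # f(W, u) = 1/2 ||rho(A W^T) u - y||^2
    return 0.5 * np.sum((rho(A @ W.T) @ u - y) ** 2)

def grad_W(W, u, A, y):
    e = rho(A @ W.T) @ u - y                        # residual, shape (n,)
    return (drho(A @ W.T) * np.outer(e, u)).T @ A   # [rho'(AW^T) . (e u^T)]^T A

np.random.seed(1)
n, p, q = 30, 4, 6
A, y = np.random.randn(n, p), np.random.randn(n)
W, u = np.random.randn(q, p), np.random.randn(q)

# finite-difference check of one arbitrary entry of the gradient
G = grad_W(W, u, A, y)
eps = 1e-6
E = np.zeros_like(W); E[2, 1] = eps
fd = (f(W + E, u, A, y) - f(W - E, u, A, y)) / (2 * eps)
print(G[2, 1], fd)    # the two values should agree up to O(eps^2)
```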
15.1.3 Universality
In this section, to ease the exposition, we explicitly introduce the bias and use the variable "$x \in \mathbb{R}^p$" in place of "$a \in \mathbb{R}^p$". We thus write the function computed by the MLP (including explicitly the bias $z_k$) as
$$ h_{W,z,u}(x) \stackrel{\text{def.}}{=} \sum_{k=1}^q u_k \varphi_{w_k, z_k}(x) \quad\text{where}\quad \varphi_{w,z}(x) \stackrel{\text{def.}}{=} \rho(\langle x, w \rangle + z). $$
The function $\varphi_{w,z}(x)$ is a ridge function in the direction orthogonal to $\bar w \stackrel{\text{def.}}{=} w/\|w\|$ and passing around the point $-\frac{z}{\|w\|}\bar w$.
In the following we assume that $\rho : \mathbb{R} \to \mathbb{R}$ is a bounded function such that
$$ \rho(r) \xrightarrow{r\to-\infty} 0 \quad\text{and}\quad \rho(r) \xrightarrow{r\to+\infty} 1. \tag{15.1} $$
Note in particular that such a function cannot be a polynomial and that the ReLU function does not satisfy these hypotheses (universality for the ReLU is more involved to show). The goal is to show the following theorem.
Theorem 25 (Cybenko, 1989). For any compact set $\Omega \subset \mathbb{R}^p$, the space spanned by the functions $\{\varphi_{w,z}\}_{w,z}$ is dense in $C(\Omega)$ for the uniform convergence. This means that for any continuous function $f$ and any $\varepsilon > 0$, there exist $q \in \mathbb{N}$ and weights $(w_k, z_k, u_k)_{k=1}^q$ such that
$$ \forall\, x \in \Omega, \quad \Big| f(x) - \sum_{k=1}^q u_k \varphi_{w_k, z_k}(x) \Big| \leq \varepsilon. $$
In a typical ML scenario, this implies that one can "overfit" the data, since using a large enough $q$ ensures that the training error can be made arbitrarily small. Of course, there is a bias-variance tradeoff, and $q$ needs to be cross-validated to account for the finite number $n$ of data points and to ensure good generalization properties.
Proof in dimension p = 1. In 1D, the approximation $h_{W,z,u}$ can be thought of as an approximation using smoothed step functions. Indeed, introducing a parameter $\varepsilon > 0$, one has (assuming the function is Lipschitz to ensure uniform convergence)
$$ \varphi_{\frac{w}{\varepsilon}, \frac{z}{\varepsilon}} \xrightarrow{\varepsilon \to 0} 1_{[-z/w, +\infty[}, $$
which is a piecewise constant function. Conversely, any piecewise constant function can be written this way. Indeed, if $h$ assumes the value $d_k$ on each interval $[t_k, t_{k+1}[$, then it can be written as
$$ h = \sum_k d_k \big( 1_{[t_k, +\infty[} - 1_{[t_{k+1}, +\infty[} \big). $$
Since the space of piecewise constant functions is dense in the space of continuous functions on an interval, this proves the theorem.
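A small numerical illustration of this 1D argument (a sketch; the target function, grid and sharpness $\varepsilon$ are arbitrary choices): a piecewise-constant approximation of $f$ is rebuilt from differences of sharply rescaled sigmoids $\rho((x - t_k)/\varepsilon)$.

```python
import numpy as np

rho = lambda r: 1.0 / (1.0 + np.exp(-r))        # sigmoid satisfying (15.1)

f = lambda x: np.sin(2 * np.pi * x)             # arbitrary continuous target on [0, 1]
x = np.linspace(0, 1, 1000)

q = 40                                          # number of steps / neurons
t = np.linspace(0, 1, q + 1)                    # breakpoints t_k
d = f(0.5 * (t[:-1] + t[1:]))                   # value d_k on [t_k, t_{k+1}[
eps = 1e-2                                      # ridge sharpness

# h(x) = sum_k d_k ( rho((x - t_k)/eps) - rho((x - t_{k+1})/eps) )
h = sum(dk * (rho((x - tk) / eps) - rho((x - tk1) / eps))
        for dk, tk, tk1 in zip(d, t[:-1], t[1:]))

print(np.max(np.abs(f(x) - h)))                 # sup error, decreases as q grows
```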
Proof in arbitrary dimension p. We start by proving the following dual characterization of density, using bounded Borel measures $\mu \in \mathcal{M}(\Omega)$, i.e. such that $\mu(\Omega) < +\infty$.
Proposition 48. If $\rho$ is such that, for any Borel measure $\mu \in \mathcal{M}(\Omega)$,
$$ \Big( \forall\, (w,z), \ \int_\Omega \rho(\langle x, w \rangle + z)\, \mathrm{d}\mu(x) = 0 \Big) \ \Longrightarrow\ \mu = 0, \tag{15.2} $$
then the conclusion of Theorem 25 holds, i.e. the span of $\{\varphi_{w,z}\}_{w,z}$ is dense in $C(\Omega)$.
Proof. We consider the linear space
$$ \mathcal{S} \stackrel{\text{def.}}{=} \Big\{ \sum_{k=1}^q u_k \varphi_{w_k, z_k} \ ;\ q \in \mathbb{N},\ w_k \in \mathbb{R}^p,\ u_k \in \mathbb{R},\ z_k \in \mathbb{R} \Big\} \subset C(\Omega), $$
and assume by contradiction that its closure $\bar{\mathcal{S}}$ (for the uniform norm) is not equal to $C(\Omega)$. One can then pick $g \in C(\Omega) \setminus \bar{\mathcal{S}}$ and define on $\bar{\mathcal{S}} \oplus \mathrm{span}(g)$ the linear form $L(f + \lambda g) \stackrel{\text{def.}}{=} \lambda$ for $f \in \bar{\mathcal{S}}$ and $\lambda \in \mathbb{R}$, so that $L = 0$ on $\bar{\mathcal{S}}$ and $L(g) = 1$. $L$ is a bounded linear form, so that by the Hahn-Banach theorem, it can be extended into a bounded linear form $\bar{L} : C(\Omega) \to \mathbb{R}$. Since $\bar{L} \in C(\Omega)^*$ (the dual space of continuous linear forms), and since this dual space is identified with Borel measures, there exists $\mu \in \mathcal{M}(\Omega)$, with $\mu \neq 0$, such that for any continuous function $h$, $\bar{L}(h) = \int_\Omega h(x)\, \mathrm{d}\mu(x)$. But since $\bar{L} = 0$ on $\bar{\mathcal{S}}$, $\int_\Omega \rho(\langle \cdot, w \rangle + z)\, \mathrm{d}\mu = 0$ for all $(w, z)$, and thus by hypothesis, $\mu = 0$, which is a contradiction.
The theorem now follows from the following proposition.
Proposition 49. If ρ is continuous and satisfies (15.1), then it satisfies (15.2).
Proof. One has, for any $w \in \mathbb{R}^p$ and $u, t \in \mathbb{R}$,
$$ \varphi_{\frac{w}{\varepsilon}, \frac{u}{\varepsilon}+t}(x) = \rho\Big( \frac{\langle x, w \rangle + u}{\varepsilon} + t \Big) \xrightarrow{\varepsilon \to 0} \gamma(x) \stackrel{\text{def.}}{=} \begin{cases} 1 & \text{if } x \in H_{w,u}, \\ \rho(t) & \text{if } x \in P_{w,u}, \\ 0 & \text{if } \langle w, x \rangle + u < 0, \end{cases} $$
where we defined $H_{w,u} \stackrel{\text{def.}}{=} \{x \ ;\ \langle w, x \rangle + u > 0\}$ and $P_{w,u} \stackrel{\text{def.}}{=} \{x \ ;\ \langle w, x \rangle + u = 0\}$. By Lebesgue dominated convergence (since the involved quantities are bounded uniformly on a compact set),
$$ \int \varphi_{\frac{w}{\varepsilon}, \frac{u}{\varepsilon}+t}\, \mathrm{d}\mu \xrightarrow{\varepsilon \to 0} \int \gamma\, \mathrm{d}\mu = \rho(t)\mu(P_{w,u}) + \mu(H_{w,u}). $$
Thus, if $\mu$ is such that all the integrals $\int \varphi_{w,z}\, \mathrm{d}\mu$ vanish, then $\rho(t)\mu(P_{w,u}) + \mu(H_{w,u}) = 0$ for all $(w, u, t)$; letting $t \to +\infty$ and using (15.1) gives $\mu(P_{w,u}) + \mu(H_{w,u}) = 0$ for all $(w, u)$. Now fix $w$ and define, for a bounded function $h : \mathbb{R} \to \mathbb{R}$,
$$ F(h) \stackrel{\text{def.}}{=} \int_\Omega h(\langle w, x \rangle)\, \mathrm{d}\mu(x). $$
$F : L^\infty(\mathbb{R}) \to \mathbb{R}$ is a bounded linear form since $|F(h)| \leq \|h\|_\infty \mu(\Omega)$ and $\mu(\Omega) < +\infty$. One has
$$ F(1_{[-u, +\infty[}) = \int_\Omega 1_{[-u, +\infty[}(\langle w, x \rangle)\, \mathrm{d}\mu(x) = \mu(P_{w,u}) + \mu(H_{w,u}) = 0. $$
By linearity, $F(h) = 0$ for all piecewise constant functions, and $F$ is a continuous linear form, so that by density $F(h) = 0$ for all functions $h \in L^\infty(\mathbb{R})$. Applying this to $h(r) = e^{\mathrm{i} r}$, one obtains
$$ \hat\mu(w) \stackrel{\text{def.}}{=} \int_\Omega e^{\mathrm{i} \langle x, w \rangle}\, \mathrm{d}\mu(x) = 0. $$
Since this holds for all $w$, the Fourier transform of $\mu$ vanishes, and hence $\mu = 0$.
Quantitative rates. Note that Theorem 25 is not constructive, in the sense that it does not explain how to compute the weights $(w_k, u_k, z_k)_k$ to reach a desired accuracy. Since for a fixed $q$ the training problem is non-convex, this is not surprising. Some recent studies show that if $q$ is large enough, a simple gradient descent is able to reach an arbitrarily good accuracy, but it might require a very large $q$.
Theorem 25 is also not quantitative, since it does not tell how many neurons $q$ are needed to reach a desired accuracy. To obtain quantitative bounds, continuity is not enough; one needs to add smoothness constraints. For instance, Barron proved that if
$$ \int \|\omega\|\, |\hat f(\omega)|\, \mathrm{d}\omega \leq C_f, $$
where $\hat f(\omega) = \int f(x) e^{-\mathrm{i}\langle x, \omega \rangle}\, \mathrm{d}x$ is the Fourier transform of $f$, then for any $q \in \mathbb{N}$ there exist weights $(w_k, u_k, z_k)_{k=1}^q$ such that
$$ \frac{1}{\mathrm{Vol}(B(0,r))} \int_{\|x\| \leq r} \Big| f(x) - \sum_{k=1}^q u_k \varphi_{w_k, z_k}(x) \Big|^2 \mathrm{d}x \leq \frac{(2 r C_f)^2}{q}. $$
The surprising part of this theorem is that the $1/q$ decay is independent of the dimension $p$. Note however that the constant $C_f$ involved might depend on $p$.
While it is possible to consider more complicated architectures (in particular recurrent ones), we restrict our attention here to these simple linear graph computation structures (so-called feedforward networks).
The supervised learning of these parameters $\beta$ is usually done by empirical risk minimization (12.11) using SGD-type methods as explained in Section 14.2. Note that this results in highly non-convex optimization problems. In particular, strong convergence guarantees such as Theorem 24 do not hold anymore, and only weak convergence (toward stationary points) holds. SGD-type techniques are however found to work surprisingly well in practice, and it is now believed that the success of these deep-architecture approaches (in particular the ability of these over-parameterized models to generalize well) is in large part due to the dynamics of SGD itself, which induces an implicit regularization effect.
For these simple linear architectures, the gradient of the ERM loss (14.13) can be computed using the reverse-mode computation detailed in Section ??; in the context of deep learning, this leads to formula (15.4) below. One should however keep in mind that for more complicated (e.g. recursive) architectures, such a simple formula is no longer available, and one should resort to reverse-mode automatic differentiation (see Section ??), which, while being conceptually simple, actually implements possibly highly non-trivial and computationally optimal recursive differentiation.
In most successful applications of deep learning, each computational block $f_\ell(\cdot, \beta_\ell)$ is actually very simple, and is the composition of
• an affine map $x \mapsto B_\ell x + b_\ell$, with a matrix $B_\ell \in \mathbb{R}^{\tilde n_\ell \times n_\ell}$ and a vector $b_\ell \in \mathbb{R}^{\tilde n_\ell}$ parametrized (in most cases linearly) by $\beta_\ell$,
• a fixed (not depending on $\beta_\ell$) non-linearity $\rho_\ell : \mathbb{R}^{\tilde n_\ell} \to \mathbb{R}^{n_{\ell+1}}$,
which we write as
$$ \forall\, x_\ell \in \mathbb{R}^{n_\ell}, \quad f_\ell(x_\ell, \beta_\ell) = \rho_\ell(B_\ell x_\ell + b_\ell) \in \mathbb{R}^{n_{\ell+1}}. \tag{15.3} $$
Figure 15.1: Left: example of a fully connected network. Right: example of a convolutional neural network.
In the simplest case, the so-called "fully connected" setting, one has $(B_\ell, b_\ell) = \beta_\ell$, i.e. $B_\ell$ is a full matrix and its entries (together with the bias $b_\ell$) are the set of parameters $\beta_\ell$. Also in the simplest cases, $\rho_\ell$ is a pointwise non-linearity $\rho_\ell(z) = (\tilde\rho_\ell(z_k))_k$, where $\tilde\rho_\ell : \mathbb{R} \to \mathbb{R}$ is non-linear. The most usual choices are the rectified linear unit (ReLU) $\tilde\rho_\ell(s) = \max(s, 0)$ and the sigmoid $\tilde\rho_\ell(s) = \theta(s) = (1 + e^{-s})^{-1}$.
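A minimal sketch of a stack of fully connected layers implementing (15.3); the layer sizes, the ReLU choice and the $\sqrt{2/n}$ scaling of the random initialization are illustrative choices, not prescribed by the text.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def forward(x, params):
    # params is a list of (B_l, b_l); each layer applies x -> rho(B_l x + b_l)
    for B, b in params:
        x = relu(B @ x + b)
    return x

# layer sizes n_0, ..., n_L (illustrative values)
sizes = [784, 256, 64, 10]
rng = np.random.default_rng(0)
params = [(rng.standard_normal((m, n)) * np.sqrt(2.0 / n), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x0 = rng.standard_normal(sizes[0])
print(forward(x0, params).shape)   # (10,)
```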
The important point here is that the interleaving of non-linear maps progressively increases the complexity of the function $f(\cdot, \beta)$.
The parameters $\beta = (B_\ell, b_\ell)_\ell$ of such a deep network are then trained by minimizing the ERM functional (12.11) using SGD-type stochastic optimization methods. The gradient can be computed efficiently (with complexity proportional to the application of the model, i.e. $O(\sum_\ell n_\ell^2)$) by automatic differentiation. Since such models are purely feedforward, one can directly use the back-propagation formula (14.20).
For regression tasks, one can directly use the output of the last layer (using e.g. a ReLU non-linearity) in conjunction with an $\ell^2$ squared loss $L$. For classification tasks, the output of the last layer needs to be transformed into class probabilities by a multi-class logistic map (??).
An issue with such a fully connected setting is that the number of parameters is too large to be applicable to large-scale data such as images. Furthermore, it ignores any prior knowledge about the data, such as for instance invariances. This is addressed in more structured architectures, such as the convolutional networks detailed in Section 15.2.3.
The resulting one-layer network $f(x, \beta) = \theta(\langle x, \beta \rangle)$ (possibly including a bias term by adding one dummy dimension to $x$) is trained, for binary classes $y \in \{0, 1\}$, using the logistic loss
$$ L(t, y) = -y \log(t) - (1 - y)\log(1 - t). $$
Multi-class models with $K$ classes are obtained by computing $B_0 x = (\langle x, \beta_k \rangle)_{k=1}^K$ and applying a normalized logistic map
$$ f(x, \beta) = \mathcal{N}\big( (\exp(\langle x, \beta_k \rangle))_k \big) \quad\text{where}\quad \mathcal{N}(u) = \frac{u}{\sum_k u_k}, $$
and, assuming the classes are represented using vectors $y$ on the probability simplex, one should use as loss
$$ L(t, y) = -\sum_{k=1}^K y_k \log(t_k). $$
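A sketch of this normalized logistic (soft-max) map and of the associated loss; the max-shift and the small constant inside the log are standard numerical-stability devices, not required by the formulas themselves, and the names and sizes are illustrative.

```python
import numpy as np

def softmax(s):
    # N(exp(s)) with the usual max-shift for numerical stability
    e = np.exp(s - np.max(s))
    return e / np.sum(e)

def cross_entropy(t, y):
    # L(t, y) = -sum_k y_k log(t_k), with y on the probability simplex
    return -np.sum(y * np.log(t + 1e-12))

K, p = 4, 10
rng = np.random.default_rng(0)
B0 = rng.standard_normal((K, p))            # rows are the beta_k
x = rng.standard_normal(p)
y = np.eye(K)[2]                            # one-hot class representation

t = softmax(B0 @ x)                         # class probabilities f(x, beta)
print(t, cross_entropy(t, y))
```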
In the literature, it has been proposed to replace linear sub-sampling by non-linear sub-sampling, for instance the so-called max-pooling (which operates by taking the maximum among groups of $m_\ell$ successive values), but it seems that linear sub-sampling is sufficient in practice when used in conjunction with very deep (large $L$) architectures.
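For instance, max-pooling by groups of $m$ successive values takes a couple of NumPy lines; this sketch assumes the signal length is a multiple of $m$.

```python
import numpy as np

def max_pool(x, m):
    # non-linear sub-sampling: keep the maximum of each group of m successive values
    return x.reshape(-1, m).max(axis=1)

x = np.array([1., 5., 2., 0., 3., 7., 4., 4.])
print(max_pool(x, 2))   # [5. 2. 7. 4.]
```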
The intuition behind such models is that as one moves deeper through the layers, the neurons are receptive to larger areas in the image domain (although, since the transform is non-linear, precisely giving sense to this statement and defining a proper "receptive field" is non-trivial). Using an increasing number of channels helps to define different classes of "detectors" (for the first layer, they detect simple patterns such as edges and corners, and they progressively capture more elaborate shapes).
In practice, the last few layers (2 or 3) of such CNN architectures are chosen to be fully connected. This is possible because, thanks to the sub-sampling, the dimensions of these layers are small.
The parameters of such a model are the filters $\beta = (\psi_{\ell,r,s})_{\ell,r,s}$, and they are trained by minimizing the ERM functional (12.11). The gradient is typically computed by backpropagation. Indeed, when computing the gradient with respect to some filter $\psi_{\ell,r,s}$, the feedforward computational graph has the form (14.20). For simplicity, we re-formulate this computation in the case of a single channel per layer (multiple channels can be understood as replacing convolutions by matrix-domain convolutions). The forward pass computes all the inner coefficients, by traversing the network from $\ell = 0$ to $\ell = L-1$,
$$ x_{\ell+1} = \lambda_\ell(\psi_\ell \star x_\ell), $$
where $\lambda_\ell(u) = (\tilde\lambda_\ell(u_i))_i$ is applied component-wise. Then, denoting $E(\beta) \stackrel{\text{def.}}{=} L(x_L, y)$ the loss to be minimized with respect to the set of filters $\beta = (\psi_\ell)_\ell$, denoting $\nabla E_\ell(\beta) \stackrel{\text{def.}}{=} \frac{\partial E(\beta)}{\partial x_\ell}$ the gradient with respect to the layer values $x_\ell$ and $\nabla_\ell E(\beta) \stackrel{\text{def.}}{=} \frac{\partial E(\beta)}{\partial \psi_\ell}$ the gradient with respect to $\psi_\ell$, one computes all the gradients by traversing the network in reverse order, from $\ell = L-1$ to $\ell = 0$,
$$ \nabla E_\ell(\beta) = \bar\psi_\ell \star \big[ \lambda'_\ell(\psi_\ell \star x_\ell) \odot \nabla E_{\ell+1}(\beta) \big] \quad\text{and}\quad \nabla_\ell E(\beta) = \bar x_\ell \star \big[ \lambda'_\ell(\psi_\ell \star x_\ell) \odot \nabla E_{\ell+1}(\beta) \big], \tag{15.4} $$
where $\lambda'_\ell(u) = (\tilde\lambda'_\ell(u_i))_i$ applies the derivative of $\tilde\lambda_\ell$ component-wise, $\bar\psi_\ell = \psi_\ell(-\cdot)$ and $\bar x_\ell = x_\ell(-\cdot)$ are the reversed filter and signal, and $\odot$ is the pointwise multiplication of vectors. The recursion is initialized as $\nabla E_L(\beta) = \nabla L(x_L, y)$, the gradient of the loss itself.
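A minimal sketch of this recursion for a single-channel network with periodic (circular) convolutions; the tanh non-linearity, the square loss and all sizes are illustrative choices, and the analytic gradient of one filter coefficient is compared with a finite difference.

```python
import numpy as np

def cconv(a, b):
    # circular (periodic) convolution of two vectors of the same length
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def reverse(a):
    # a_bar[k] = a[-k mod N]: reversed filter / signal
    return np.roll(a[::-1], 1)

lam = np.tanh                                 # illustrative non-linearity lambda
dlam = lambda u: 1.0 - np.tanh(u) ** 2        # its derivative lambda'

def forward(x0, psis):
    xs, zs = [x0], []
    for psi in psis:
        zs.append(cconv(psi, xs[-1]))         # psi_l * x_l
        xs.append(lam(zs[-1]))                # x_{l+1} = lambda(psi_l * x_l)
    return xs, zs

def backward(xs, zs, psis, g_L):
    # recursion (15.4): propagate g_l = dE/dx_l and collect dE/dpsi_l
    g, grads = g_L, [None] * len(psis)
    for l in reversed(range(len(psis))):
        v = dlam(zs[l]) * g                   # lambda'(psi_l * x_l) (.) dE/dx_{l+1}
        grads[l] = cconv(reverse(xs[l]), v)   # dE/dpsi_l = x_bar_l * v
        g = cconv(reverse(psis[l]), v)        # dE/dx_l   = psi_bar_l * v
    return grads

rng = np.random.default_rng(0)
N, L = 16, 3
x0, y = rng.standard_normal(N), rng.standard_normal(N)
psis = [rng.standard_normal(N) * 0.3 for _ in range(L)]

xs, zs = forward(x0, psis)
grads = backward(xs, zs, psis, g_L=xs[-1] - y)   # square loss E = ||x_L - y||^2 / 2

# finite-difference check of one filter coefficient
eps = 1e-6
l, j = 1, 5
psis[l][j] += eps; Ep = 0.5 * np.sum((forward(x0, psis)[0][-1] - y) ** 2)
psis[l][j] -= 2 * eps; Em = 0.5 * np.sum((forward(x0, psis)[0][-1] - y) ** 2)
print(grads[l][j], (Ep - Em) / (2 * eps))        # the two values should agree
```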
This recursion (15.4) is the celebrated backpropagation algorithm put forward by Yann LeCun. Note that to understand and code these iterations, one does not need to rely on the advanced machinery of reverse-mode automatic differentiation exposed in Section ??. The general automatic differentiation method is however crucial to master, because advanced deep-learning architectures are not purely feedforward and might include recursive connections. Furthermore, automatic differentiation is useful outside deep learning, and considerably eases prototyping for modern data science with complicated non-linear models.