Deep Learning
Gabriel Peyré
CNRS & DMA
École Normale Supérieure
gabriel.peyre@ens.fr
https://mathematical-tours.github.io
www.numerical-tours.com
Before detailing deep architectures and their use, we start this chapter by presenting two essential computational tools that are used to train these models: stochastic optimization methods and automatic differentiation. In practice, they work hand in hand and make it possible to painlessly learn complicated non-linear models on large-scale datasets.
This function $h_{W,u}(\cdot)$ is thus a weighted sum of $q$ "ridge functions" $\rho(\langle \cdot, w_k \rangle)$. These functions are constant in the direction orthogonal to the neuron $w_k$ and have a profile defined by $\rho$.
The most popular non-linearities are sigmoid functions such as
$$ \rho(r) = \frac{e^r}{1+e^r} \qquad \text{and} \qquad \rho(r) = \frac{1}{\pi}\operatorname{atan}(r) + \frac{1}{2}, $$
and the rectified linear unit (ReLU) function $\rho(r) = \max(r, 0)$.
One often adds a bias term in these models, and considers functions of the form $\rho(\langle \cdot, w_k \rangle + z_k)$, but this bias term can be integrated in the weights as usual by writing $\langle a, w_k \rangle + z_k = \langle (a,1), (w_k, z_k) \rangle$, so we ignore it in the following section. This simply amounts to replacing $a \in \mathbb{R}^p$ by $(a,1) \in \mathbb{R}^{p+1}$ and adding a dimension $p \mapsto p+1$, as a pre-processing of the features.
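To make this concrete, here is a minimal NumPy sketch of the one-hidden-layer model $h_{W,u}(a) = \sum_k u_k \rho(\langle a, w_k\rangle)$ together with the bias trick; the array names, sizes and the choice of the sigmoid are illustrative, not fixed by the text.

```python
import numpy as np

def rho(r):
    # sigmoid non-linearity rho(r) = e^r / (1 + e^r)
    return 1.0 / (1.0 + np.exp(-r))

def mlp(A, W, u):
    # A: (n, p) features, W: (q, p) neurons, u: (q,) output weights
    # returns h_{W,u}(a_i) = sum_k u_k rho(<a_i, w_k>) for each row a_i of A
    return rho(A @ W.T) @ u

# bias trick: append a constant feature so that <(a,1), (w,z)> = <a,w> + z
n, p, q = 50, 3, 10
A = np.random.randn(n, p)
A1 = np.hstack([A, np.ones((n, 1))])   # features now in R^{p+1}
W1 = np.random.randn(q, p + 1)         # each row stacks (w_k, z_k)
u = np.random.randn(q)
print(mlp(A1, W1, u).shape)            # (n,)
```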
Expressiveness. In order to define functions of arbitrary complexity when $q$ increases, it is important that $\rho$ is non-linear. Indeed, if $\rho(s) = s$, then $h_{W,u}(a) = \langle W a, u \rangle = \langle a, W^\top u \rangle$. It is thus a linear function with weights $W^\top u$, whatever the number $q$ of neurons. Similarly, if $\rho$ is a polynomial on $\mathbb{R}$ of degree $d$, then $h_{W,u}(\cdot)$ is itself a polynomial of degree $d$ on $\mathbb{R}^p$, which belongs to a linear space $V$ of finite dimension $\dim(V) = O(p^d)$. So even if $q$ increases, the dimension $\dim(V)$ stays fixed and $h_{W,u}(\cdot)$ cannot approximate an arbitrary function outside $V$. In sharp contrast, one can show that if $\rho$ is not polynomial, then $h_{W,u}(\cdot)$ can approximate any continuous function, as studied in Section 15.1.3.
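As a quick numerical sanity check of this collapse (a sketch with arbitrary random sizes), one can verify that with $\rho(s)=s$ the network output coincides with the single linear model of weights $W^\top u$, whatever $q$ is.

```python
import numpy as np

n, p, q = 20, 4, 100                      # q can be arbitrarily large
A = np.random.randn(n, p)
W = np.random.randn(q, p)
u = np.random.randn(q)

h_linear_rho = (A @ W.T) @ u              # network with rho(s) = s
h_collapsed = A @ (W.T @ u)               # single linear model with weights W^T u
print(np.allclose(h_linear_rho, h_collapsed))   # True
```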
Note that here, the parameters being optimized are (W, u) ∈ Rq×p × Rq .
Optimizing with respect to u. This function $f$ is convex with respect to $u$, since it is a quadratic function. Its gradient with respect to $u$ can be computed as in (13.8) and thus
$$ \nabla_u f(W, u) = \rho(AW^\top)^\top \big( \rho(AW^\top) u - y \big), $$
and one can compute the solution in closed form (assuming $\ker(\rho(AW^\top)) = \{0\}$) as
$$ u^\star = [\rho(AW^\top)^\top \rho(AW^\top)]^{-1} \rho(AW^\top)^\top y = [\rho(W A^\top)\rho(AW^\top)]^{-1} \rho(W A^\top) y. $$
When $W = \mathrm{Id}_p$ and $\rho(s) = s$, one recovers the least squares formula (13.9).
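The following sketch solves this quadratic problem in $u$ for a sigmoid $\rho$ and random data (all sizes are illustrative). It uses `np.linalg.lstsq` rather than forming the normal equations explicitly, which is numerically safer but yields the same $u^\star$ when $\rho(AW^\top)$ has full column rank.

```python
import numpy as np

def rho(r):
    # sigmoid non-linearity (redefined so the snippet is self-contained)
    return 1.0 / (1.0 + np.exp(-r))

np.random.seed(0)
n, p, q = 200, 5, 12
A = np.random.randn(n, p)
y = np.random.randn(n)
W = np.random.randn(q, p)

Phi = rho(A @ W.T)                        # features rho(A W^T), shape (n, q)
u_star, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# equivalent to the normal-equations formula [Phi^T Phi]^{-1} Phi^T y
u_check = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(np.linalg.norm(u_star - u_check))   # small when Phi is well conditioned
```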
Optimizing with respect to W. The function $f$ is non-convex with respect to $W$ because the function $\rho$ is itself non-linear. Training an MLP is thus a delicate process, and one can only hope to obtain a local minimum of $f$. It is also important to initialize the neurons $(w_k)_k$ correctly (for instance as unit-norm random vectors, but bias terms might need some adjustment), while $u$ can usually be initialized at $0$.
To compute its gradient with respect to $W$, we first note that for a perturbation $\varepsilon \in \mathbb{R}^{q \times p}$, one has
$$ \rho(A(W+\varepsilon)^\top) = \rho(AW^\top + A\varepsilon^\top) = \rho(AW^\top) + \rho'(AW^\top) \odot (A\varepsilon^\top) + o(\|\varepsilon\|), $$
where we have denoted "$\odot$" the entry-wise multiplication of matrices, i.e. $U \odot V = (U_{i,j} V_{i,j})_{i,j}$. One thus has
$$ f(W+\varepsilon, u) = \frac{1}{2}\big\| e + [\rho'(AW^\top) \odot (A\varepsilon^\top)] u \big\|^2 + o(\|\varepsilon\|) \quad\text{where}\quad e \stackrel{\text{def.}}{=} \rho(AW^\top) u - y \in \mathbb{R}^n $$
$$ = f(W, u) + \langle e, [\rho'(AW^\top) \odot (A\varepsilon^\top)] u \rangle + o(\|\varepsilon\|) $$
$$ = f(W, u) + \langle A\varepsilon^\top, \rho'(AW^\top) \odot (e u^\top) \rangle + o(\|\varepsilon\|) $$
$$ = f(W, u) + \langle \varepsilon^\top, A^\top [\rho'(AW^\top) \odot (e u^\top)] \rangle + o(\|\varepsilon\|), $$
so that the gradient reads $\nabla_W f(W, u) = [\rho'(AW^\top) \odot (e u^\top)]^\top A \in \mathbb{R}^{q \times p}$.
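This gradient formula is easy to check against finite differences; below is an illustrative sketch using the sigmoid $\rho$ (whose derivative is $\rho' = \rho(1-\rho)$) and arbitrary random sizes.

```python
import numpy as np

def rho(r):
    return 1.0 / (1.0 + np.exp(-r))

def drho(r):
    s = rho(r)
    return s * (1.0 - s)

def f(W, u, A, y):
    # f(W, u) = 1/2 ||rho(A W^T) u - y||^2
    return 0.5 * np.sum((rho(A @ W.T) @ u - y) ** 2)

def grad_W(W, u, A, y):
    e = rho(A @ W.T) @ u - y                        # residual, shape (n,)
    return (drho(A @ W.T) * np.outer(e, u)).T @ A   # [rho'(AW^T) . (e u^T)]^T A

np.random.seed(1)
n, p, q = 30, 4, 6
A, y = np.random.randn(n, p), np.random.randn(n)
W, u = np.random.randn(q, p), np.random.randn(q)

# finite-difference check of one arbitrary entry of the gradient
G = grad_W(W, u, A, y)
eps = 1e-6
E = np.zeros_like(W); E[2, 1] = eps
fd = (f(W + E, u, A, y) - f(W - E, u, A, y)) / (2 * eps)
print(G[2, 1], fd)    # the two values should agree up to O(eps^2)
```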
15.1.3 Universality
In this section, to ease the exposition, we explicitly introduce the bias and use the variable "$x \in \mathbb{R}^p$" in place of "$a \in \mathbb{R}^p$". We thus write the function computed by the MLP (including explicitly the bias $z_k$) as
$$ h_{W,z,u}(x) \stackrel{\text{def.}}{=} \sum_{k=1}^q u_k \varphi_{w_k, z_k}(x) \quad\text{where}\quad \varphi_{w,z}(x) \stackrel{\text{def.}}{=} \rho(\langle x, w \rangle + z). $$
The function $\varphi_{w,z}(x)$ is a ridge function in the direction orthogonal to $\bar w \stackrel{\text{def.}}{=} w/\|w\|$ and passing around the point $-\frac{z}{\|w\|}\bar w$.
In the following we assume that $\rho : \mathbb{R} \to \mathbb{R}$ is a bounded function such that
$$ \rho(r) \xrightarrow{r\to-\infty} 0 \quad\text{and}\quad \rho(r) \xrightarrow{r\to+\infty} 1. \tag{15.1} $$
Note in particular that such a function cannot be a polynomial and that the ReLU function does not satisfy these hypotheses (universality for the ReLU is more involved to show). The goal is to show the following theorem.
Theorem 25 (Cybenko, 1989). For any compact set $\Omega \subset \mathbb{R}^p$, the space spanned by the functions $\{\varphi_{w,z}\}_{w,z}$ is dense in $C(\Omega)$ for the uniform convergence. This means that for any continuous function $f$ and any $\varepsilon > 0$, there exist $q \in \mathbb{N}$ and weights $(w_k, z_k, u_k)_{k=1}^q$ such that
$$ \forall\, x \in \Omega, \quad \Big| f(x) - \sum_{k=1}^q u_k \varphi_{w_k, z_k}(x) \Big| \leq \varepsilon. $$
In a typical ML scenario, this implies that one can "overfit" the data, since using a large enough $q$ ensures that the training error can be made arbitrarily small. Of course, there is a bias-variance tradeoff, and $q$ needs to be cross-validated to account for the finite number $n$ of data points and to ensure good generalization properties.
Proof in dimension p = 1. In 1D, the approximation $h_{W,z,u}$ can be thought of as an approximation using smoothed step functions. Indeed, introducing a parameter $\varepsilon > 0$, one has (assuming the function is Lipschitz to ensure uniform convergence)
$$ \varphi_{\frac{w}{\varepsilon}, \frac{z}{\varepsilon}} \xrightarrow{\varepsilon \to 0} 1_{[-z/w, +\infty[}, $$
which is a piecewise constant function. Conversely, any piecewise constant function can be written this way. Indeed, if $h$ assumes the value $d_k$ on each interval $[t_k, t_{k+1}[$, then it can be written as
$$ h = \sum_k d_k \big( 1_{[t_k, +\infty[} - 1_{[t_{k+1}, +\infty[} \big). $$
Since the space of piecewise constant functions is dense in the space of continuous functions on an interval, this proves the theorem.
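A small numerical illustration of this 1D argument (a sketch; the target function, grid and sharpness $\varepsilon$ are arbitrary choices): a piecewise-constant approximation of $f$ is rebuilt from differences of sharply rescaled sigmoids $\rho((x - t_k)/\varepsilon)$.

```python
import numpy as np

rho = lambda r: 1.0 / (1.0 + np.exp(-r))        # sigmoid satisfying (15.1)

f = lambda x: np.sin(2 * np.pi * x)             # arbitrary continuous target on [0, 1]
x = np.linspace(0, 1, 1000)

q = 40                                          # number of steps / neurons
t = np.linspace(0, 1, q + 1)                    # breakpoints t_k
d = f(0.5 * (t[:-1] + t[1:]))                   # value d_k on [t_k, t_{k+1}[
eps = 1e-2                                      # ridge sharpness

# h(x) = sum_k d_k ( rho((x - t_k)/eps) - rho((x - t_{k+1})/eps) )
h = sum(dk * (rho((x - tk) / eps) - rho((x - tk1) / eps))
        for dk, tk, tk1 in zip(d, t[:-1], t[1:]))

print(np.max(np.abs(f(x) - h)))                 # sup error, decreases as q grows
```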
Proof in arbitrary dimension p. We start by proving the following dual characterization of density, using bounded Borel measures $\mu \in \mathcal{M}(\Omega)$, i.e. such that $\mu(\Omega) < +\infty$.
Proposition 48. If $\rho$ is such that, for any Borel measure $\mu \in \mathcal{M}(\Omega)$,
$$ \Big( \forall\, (w,z), \ \int_\Omega \rho(\langle x, w \rangle + z)\, \mathrm{d}\mu(x) = 0 \Big) \ \Longrightarrow\ \mu = 0, \tag{15.2} $$
then the conclusion of Theorem 25 holds, i.e. the span of $\{\varphi_{w,z}\}_{w,z}$ is dense in $C(\Omega)$.
Proof. We consider the linear space
$$ \mathcal{S} \stackrel{\text{def.}}{=} \Big\{ \sum_{k=1}^q u_k \varphi_{w_k, z_k} \ ;\ q \in \mathbb{N},\ w_k \in \mathbb{R}^p,\ u_k \in \mathbb{R},\ z_k \in \mathbb{R} \Big\} \subset C(\Omega), $$
and assume by contradiction that its closure $\bar{\mathcal{S}}$ (for the uniform norm) is not equal to $C(\Omega)$. One can then pick $g \in C(\Omega) \setminus \bar{\mathcal{S}}$ and define on $\bar{\mathcal{S}} \oplus \mathrm{span}(g)$ the linear form $L(f + \lambda g) \stackrel{\text{def.}}{=} \lambda$ for $f \in \bar{\mathcal{S}}$ and $\lambda \in \mathbb{R}$, so that $L = 0$ on $\bar{\mathcal{S}}$ and $L(g) = 1$. $L$ is a bounded linear form, so that by the Hahn-Banach theorem, it can be extended into a bounded linear form $\bar{L} : C(\Omega) \to \mathbb{R}$. Since $\bar{L} \in C(\Omega)^*$ (the dual space of continuous linear forms), and since this dual space is identified with Borel measures, there exists $\mu \in \mathcal{M}(\Omega)$, with $\mu \neq 0$, such that for any continuous function $h$, $\bar{L}(h) = \int_\Omega h(x)\, \mathrm{d}\mu(x)$. But since $\bar{L} = 0$ on $\bar{\mathcal{S}}$, $\int_\Omega \rho(\langle \cdot, w \rangle + z)\, \mathrm{d}\mu = 0$ for all $(w, z)$, and thus by hypothesis, $\mu = 0$, which is a contradiction.
The theorem now follows from the following proposition.
Proposition 49. If ρ is continuous and satisfies (15.1), then it satisfies (15.2).
Proof. One has, for any $w \in \mathbb{R}^p$ and $u, t \in \mathbb{R}$,
$$ \varphi_{\frac{w}{\varepsilon}, \frac{u}{\varepsilon}+t}(x) = \rho\Big( \frac{\langle x, w \rangle + u}{\varepsilon} + t \Big) \xrightarrow{\varepsilon \to 0} \gamma(x) \stackrel{\text{def.}}{=} \begin{cases} 1 & \text{if } x \in H_{w,u}, \\ \rho(t) & \text{if } x \in P_{w,u}, \\ 0 & \text{if } \langle w, x \rangle + u < 0, \end{cases} $$
where we defined $H_{w,u} \stackrel{\text{def.}}{=} \{x \ ;\ \langle w, x \rangle + u > 0\}$ and $P_{w,u} \stackrel{\text{def.}}{=} \{x \ ;\ \langle w, x \rangle + u = 0\}$. By Lebesgue dominated convergence (since the involved quantities are bounded uniformly on a compact set),
$$ \int \varphi_{\frac{w}{\varepsilon}, \frac{u}{\varepsilon}+t}\, \mathrm{d}\mu \xrightarrow{\varepsilon \to 0} \int \gamma\, \mathrm{d}\mu = \rho(t)\mu(P_{w,u}) + \mu(H_{w,u}). $$
Thus, if $\mu$ is such that all the integrals $\int \varphi_{w,z}\, \mathrm{d}\mu$ vanish, then $\rho(t)\mu(P_{w,u}) + \mu(H_{w,u}) = 0$ for all $(w, u, t)$; letting $t \to +\infty$ and using (15.1) gives $\mu(P_{w,u}) + \mu(H_{w,u}) = 0$ for all $(w, u)$. Now fix $w$ and define, for a bounded function $h : \mathbb{R} \to \mathbb{R}$,
$$ F(h) \stackrel{\text{def.}}{=} \int_\Omega h(\langle w, x \rangle)\, \mathrm{d}\mu(x). $$
$F : L^\infty(\mathbb{R}) \to \mathbb{R}$ is a bounded linear form since $|F(h)| \leq \|h\|_\infty \mu(\Omega)$ and $\mu(\Omega) < +\infty$. One has
$$ F(1_{[-u, +\infty[}) = \int_\Omega 1_{[-u, +\infty[}(\langle w, x \rangle)\, \mathrm{d}\mu(x) = \mu(P_{w,u}) + \mu(H_{w,u}) = 0. $$
By linearity, $F(h) = 0$ for all piecewise constant functions, and $F$ is a continuous linear form, so that by density $F(h) = 0$ for all functions $h \in L^\infty(\mathbb{R})$. Applying this to $h(r) = e^{\mathrm{i} r}$, one obtains
$$ \hat\mu(w) \stackrel{\text{def.}}{=} \int_\Omega e^{\mathrm{i} \langle x, w \rangle}\, \mathrm{d}\mu(x) = 0. $$
Since this holds for all $w$, the Fourier transform of $\mu$ vanishes, and hence $\mu = 0$.
Quantitative rates. Note that Theorem 25 is not constructive, in the sense that it does not explain how to compute the weights $(w_k, u_k, z_k)_k$ to reach a desired accuracy. Since for a fixed $q$ the training problem is non-convex, this is not surprising. Some recent studies show that if $q$ is large enough, a simple gradient descent is able to reach an arbitrarily good accuracy, but it might require a very large $q$.
Theorem 25 is also not quantitative, since it does not tell how many neurons $q$ are needed to reach a desired accuracy. To obtain quantitative bounds, continuity is not enough; one needs to add smoothness constraints. For instance, Barron proved that if
$$ \int \|\omega\|\, |\hat f(\omega)|\, \mathrm{d}\omega \leq C_f, $$
where $\hat f(\omega) = \int f(x) e^{-\mathrm{i}\langle x, \omega \rangle}\, \mathrm{d}x$ is the Fourier transform of $f$, then for any $q \in \mathbb{N}$ there exist weights $(w_k, u_k, z_k)_{k=1}^q$ such that
$$ \frac{1}{\mathrm{Vol}(B(0,r))} \int_{\|x\| \leq r} \Big| f(x) - \sum_{k=1}^q u_k \varphi_{w_k, z_k}(x) \Big|^2 \mathrm{d}x \leq \frac{(2 r C_f)^2}{q}. $$
The surprising part of this theorem is that the $1/q$ decay is independent of the dimension $p$. Note however that the constant $C_f$ involved might depend on $p$.
While it is possible to consider more complicated architectures (in particular recurrent ones), we restrict our attention here to these simple linear graph computation structures (so-called feedforward networks).
The supervised learning of these parameters $\beta$ is usually done by empirical risk minimization (12.11) using SGD-type methods as explained in Section 14.2. Note that this results in highly non-convex optimization problems. In particular, strong convergence guarantees such as Theorem 24 do not hold anymore, and only weak convergence (toward stationary points) holds. SGD-type techniques are however found to work surprisingly well in practice, and it is now believed that the success of these deep-architecture approaches (in particular the ability of these over-parameterized models to generalize well) is in large part due to the dynamics of SGD itself, which induces an implicit regularization effect.
For these simple linear architectures, the gradient of the ERM loss (14.13) can be computed using the reverse-mode computation detailed in Section ??; in the context of deep learning, this leads to formula (15.4) below. One should however keep in mind that for more complicated (e.g. recursive) architectures, such a simple formula is no longer available, and one should resort to reverse-mode automatic differentiation (see Section ??), which, while being conceptually simple, actually implements possibly highly non-trivial and computationally optimal recursive differentiation.
In most successful applications of deep learning, each computational block $f_\ell(\cdot, \beta_\ell)$ is actually very simple, and is the composition of
• an affine map $x \mapsto B_\ell x + b_\ell$, with a matrix $B_\ell \in \mathbb{R}^{\tilde n_\ell \times n_\ell}$ and a vector $b_\ell \in \mathbb{R}^{\tilde n_\ell}$ parametrized (in most cases linearly) by $\beta_\ell$,
• a fixed (not depending on $\beta_\ell$) non-linearity $\rho_\ell : \mathbb{R}^{\tilde n_\ell} \to \mathbb{R}^{n_{\ell+1}}$,
which we write as
$$ \forall\, x_\ell \in \mathbb{R}^{n_\ell}, \quad f_\ell(x_\ell, \beta_\ell) = \rho_\ell(B_\ell x_\ell + b_\ell) \in \mathbb{R}^{n_{\ell+1}}. \tag{15.3} $$
Figure 15.1: Left: example of a fully connected network. Right: example of a convolutional neural network.
In the simplest case, the so-called "fully connected" setting, one has $(B_\ell, b_\ell) = \beta_\ell$, i.e. $B_\ell$ is a full matrix and its entries (together with the bias $b_\ell$) are the set of parameters $\beta_\ell$. Also in the simplest cases, $\rho_\ell$ is a pointwise non-linearity $\rho_\ell(z) = (\tilde\rho_\ell(z_k))_k$, where $\tilde\rho_\ell : \mathbb{R} \to \mathbb{R}$ is non-linear. The most usual choices are the rectified linear unit (ReLU) $\tilde\rho_\ell(s) = \max(s, 0)$ and the sigmoid $\tilde\rho_\ell(s) = \theta(s) = (1 + e^{-s})^{-1}$.
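A minimal sketch of a stack of fully connected layers implementing (15.3); the layer sizes, the ReLU choice and the $\sqrt{2/n}$ scaling of the random initialization are illustrative choices, not prescribed by the text.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def forward(x, params):
    # params is a list of (B_l, b_l); each layer applies x -> rho(B_l x + b_l)
    for B, b in params:
        x = relu(B @ x + b)
    return x

# layer sizes n_0, ..., n_L (illustrative values)
sizes = [784, 256, 64, 10]
rng = np.random.default_rng(0)
params = [(rng.standard_normal((m, n)) * np.sqrt(2.0 / n), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x0 = rng.standard_normal(sizes[0])
print(forward(x0, params).shape)   # (10,)
```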
The important point here is that the interleaving of non-linear maps progressively increases the complexity of the function $f(\cdot, \beta)$.
The parameters $\beta = (B_\ell, b_\ell)_\ell$ of such a deep network are then trained by minimizing the ERM functional (12.11) using SGD-type stochastic optimization methods. The gradient can be computed efficiently (with complexity proportional to the application of the model, i.e. $O(\sum_\ell n_\ell^2)$) by automatic differentiation. Since such models are purely feedforward, one can directly use the back-propagation formula (14.20).
For regression tasks, one can directly use the output of the last layer (using e.g. a ReLU non-linearity) in conjunction with an $\ell^2$ squared loss $L$. For classification tasks, the output of the last layer needs to be transformed into class probabilities by a multi-class logistic map (??).
An issue with such a fully connected setting is that the number of parameters is too large to be applicable to large-scale data such as images. Furthermore, it ignores any prior knowledge about the data, such as for instance invariances. This is addressed in more structured architectures, such as the convolutional networks detailed in Section 15.2.3.
The resulting one-layer network $f(x, \beta) = \theta(\langle x, \beta \rangle)$ (possibly including a bias term by adding one dummy dimension to $x$) is trained, for binary classes $y \in \{0, 1\}$, using the logistic loss
$$ L(t, y) = -y \log(t) - (1 - y)\log(1 - t). $$
Multi-class models with $K$ classes are obtained by computing $B_0 x = (\langle x, \beta_k \rangle)_{k=1}^K$ and applying a normalized logistic map
$$ f(x, \beta) = \mathcal{N}\big( (\exp(\langle x, \beta_k \rangle))_k \big) \quad\text{where}\quad \mathcal{N}(u) = \frac{u}{\sum_k u_k}, $$
and, assuming the classes are represented using vectors $y$ on the probability simplex, one should use as loss
$$ L(t, y) = -\sum_{k=1}^K y_k \log(t_k). $$
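A sketch of this normalized logistic (soft-max) map and of the associated loss; the max-shift and the small constant inside the log are standard numerical-stability devices, not required by the formulas themselves, and the names and sizes are illustrative.

```python
import numpy as np

def softmax(s):
    # N(exp(s)) with the usual max-shift for numerical stability
    e = np.exp(s - np.max(s))
    return e / np.sum(e)

def cross_entropy(t, y):
    # L(t, y) = -sum_k y_k log(t_k), with y on the probability simplex
    return -np.sum(y * np.log(t + 1e-12))

K, p = 4, 10
rng = np.random.default_rng(0)
B0 = rng.standard_normal((K, p))            # rows are the beta_k
x = rng.standard_normal(p)
y = np.eye(K)[2]                            # one-hot class representation

t = softmax(B0 @ x)                         # class probabilities f(x, beta)
print(t, cross_entropy(t, y))
```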
In the literature, it has been proposed to replace linear sub-sampling by non-linear sub-sampling, for instance the so-called max-pooling (which operates by taking the maximum among groups of $m_\ell$ successive values), but it seems that linear sub-sampling is sufficient in practice when used in conjunction with very deep (large $L$) architectures.
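For instance, max-pooling by groups of $m$ successive values takes a couple of NumPy lines; this sketch assumes the signal length is a multiple of $m$.

```python
import numpy as np

def max_pool(x, m):
    # non-linear sub-sampling: keep the maximum of each group of m successive values
    return x.reshape(-1, m).max(axis=1)

x = np.array([1., 5., 2., 0., 3., 7., 4., 4.])
print(max_pool(x, 2))   # [5. 2. 7. 4.]
```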
The intuition behind such models is that as one moves deeper through the layers, the neurons are receptive to larger areas in the image domain (although, since the transform is non-linear, precisely giving sense to this statement and defining a proper "receptive field" is non-trivial). Using an increasing number of channels helps to define different classes of "detectors" (for the first layer, they detect simple patterns such as edges and corners, and they progressively capture more elaborate shapes).
In practice, the last few layers (2 or 3) of such CNN architectures are chosen to be fully connected. This is possible because, thanks to the sub-sampling, the dimensions of these layers are small.
The parameters of such a model are the filters $\beta = (\psi_{\ell,r,s})_{\ell,r,s}$, and they are trained by minimizing the ERM functional (12.11). The gradient is typically computed by backpropagation. Indeed, when computing the gradient with respect to some filter $\psi_{\ell,r,s}$, the feedforward computational graph has the form (14.20). For simplicity, we re-formulate this computation in the case of a single channel per layer (multiple channels can be understood as replacing convolutions by matrix-domain convolutions). The forward pass computes all the inner coefficients, by traversing the network from $\ell = 0$ to $\ell = L-1$,
$$ x_{\ell+1} = \lambda_\ell(\psi_\ell \star x_\ell), $$
where $\lambda_\ell(u) = (\tilde\lambda_\ell(u_i))_i$ is applied component-wise. Then, denoting $E(\beta) \stackrel{\text{def.}}{=} L(x_L, y)$ the loss to be minimized with respect to the set of filters $\beta = (\psi_\ell)_\ell$, denoting $\nabla E_\ell(\beta) \stackrel{\text{def.}}{=} \frac{\partial E(\beta)}{\partial x_\ell}$ the gradient with respect to the layer values $x_\ell$ and $\nabla_\ell E(\beta) \stackrel{\text{def.}}{=} \frac{\partial E(\beta)}{\partial \psi_\ell}$ the gradient with respect to $\psi_\ell$, one computes all the gradients by traversing the network in reverse order, from $\ell = L-1$ to $\ell = 0$,
$$ \nabla E_\ell(\beta) = \bar\psi_\ell \star \big[ \lambda'_\ell(\psi_\ell \star x_\ell) \odot \nabla E_{\ell+1}(\beta) \big] \quad\text{and}\quad \nabla_\ell E(\beta) = \bar x_\ell \star \big[ \lambda'_\ell(\psi_\ell \star x_\ell) \odot \nabla E_{\ell+1}(\beta) \big], \tag{15.4} $$
where $\lambda'_\ell(u) = (\tilde\lambda'_\ell(u_i))_i$ applies the derivative of $\tilde\lambda_\ell$ component-wise, $\bar\psi_\ell = \psi_\ell(-\cdot)$ and $\bar x_\ell = x_\ell(-\cdot)$ are the reversed filter and signal, and $\odot$ is the pointwise multiplication of vectors. The recursion is initialized as $\nabla E_L(\beta) = \nabla L(x_L, y)$, the gradient of the loss itself.
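A minimal sketch of this recursion for a single-channel network with periodic (circular) convolutions; the tanh non-linearity, the square loss and all sizes are illustrative choices, and the analytic gradient of one filter coefficient is compared with a finite difference.

```python
import numpy as np

def cconv(a, b):
    # circular (periodic) convolution of two vectors of the same length
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def reverse(a):
    # a_bar[k] = a[-k mod N]: reversed filter / signal
    return np.roll(a[::-1], 1)

lam = np.tanh                                 # illustrative non-linearity lambda
dlam = lambda u: 1.0 - np.tanh(u) ** 2        # its derivative lambda'

def forward(x0, psis):
    xs, zs = [x0], []
    for psi in psis:
        zs.append(cconv(psi, xs[-1]))         # psi_l * x_l
        xs.append(lam(zs[-1]))                # x_{l+1} = lambda(psi_l * x_l)
    return xs, zs

def backward(xs, zs, psis, g_L):
    # recursion (15.4): propagate g_l = dE/dx_l and collect dE/dpsi_l
    g, grads = g_L, [None] * len(psis)
    for l in reversed(range(len(psis))):
        v = dlam(zs[l]) * g                   # lambda'(psi_l * x_l) (.) dE/dx_{l+1}
        grads[l] = cconv(reverse(xs[l]), v)   # dE/dpsi_l = x_bar_l * v
        g = cconv(reverse(psis[l]), v)        # dE/dx_l   = psi_bar_l * v
    return grads

rng = np.random.default_rng(0)
N, L = 16, 3
x0, y = rng.standard_normal(N), rng.standard_normal(N)
psis = [rng.standard_normal(N) * 0.3 for _ in range(L)]

xs, zs = forward(x0, psis)
grads = backward(xs, zs, psis, g_L=xs[-1] - y)   # square loss E = ||x_L - y||^2 / 2

# finite-difference check of one filter coefficient
eps = 1e-6
l, j = 1, 5
psis[l][j] += eps; Ep = 0.5 * np.sum((forward(x0, psis)[0][-1] - y) ** 2)
psis[l][j] -= 2 * eps; Em = 0.5 * np.sum((forward(x0, psis)[0][-1] - y) ** 2)
print(grads[l][j], (Ep - Em) / (2 * eps))        # the two values should agree
```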
This recursion (15.4) is the celebrated backpropagation algorithm put forward by Yann LeCun. Note that to understand and code these iterations, one does not need to rely on the advanced machinery of reverse-mode automatic differentiation exposed in Section ??. The general automatic differentiation method is however crucial to master, because advanced deep-learning architectures are not purely feedforward and might include recursive connections. Furthermore, automatic differentiation is useful outside deep learning, and considerably eases prototyping for modern data science with complicated non-linear models.