
Mathematical Foundations of Data Sciences

Gabriel Peyré
CNRS & DMA
École Normale Supérieure
gabriel.peyre@ens.fr
https://mathematical-tours.github.io
www.numerical-tours.com

November 18, 2020


Chapter 15

Deep Learning

Before detailing deep architectures and their use, we start this chapter by presenting two essential computational tools that are used to train these models: stochastic optimization methods and automatic differentiation. In practice, they work hand in hand to painlessly learn complicated non-linear models on large-scale datasets.

15.1 Multi-Layer Perceptrons


In this section, we study the simplest example of non-linear parametric models, namely Multi-Layer Perceptrons (MLPs) with a single hidden layer (so they have in total 2 layers). The perceptron (with no hidden layer) corresponds to the linear models studied in the previous sections. MLPs with more layers are obtained by stacking together several such simple MLPs, and are studied in Section ??, since the computation of their derivatives is very well suited to automatic-differentiation methods.

15.1.1 MLP and its derivative


The basic MLP a ↦ h_{W,u}(a) takes as input a feature vector a ∈ R^p, computes an intermediate hidden representation b = W a ∈ R^q using q “neurons” stored as the rows w_k ∈ R^p of the weight matrix W ∈ R^{q×p}, passes these through a non-linearity ρ : R → R, i.e. ρ(b) = (ρ(b_k))_{k=1}^q, and then outputs a scalar value as a linear combination with output weights u ∈ R^q, i.e.

    h_{W,u}(a) = ⟨ρ(W a), u⟩ = ∑_{k=1}^q u_k ρ((W a)_k) = ∑_{k=1}^q u_k ρ(⟨a, w_k⟩).

This function h_{W,u}(·) is thus a weighted sum of q “ridge functions” ρ(⟨·, w_k⟩). These functions are constant in the directions orthogonal to the neuron w_k and have a profile defined by ρ.
The most popular non-linearities are sigmoid functions such as

    ρ(r) = e^r / (1 + e^r)   and   ρ(r) = (1/π) atan(r) + 1/2,

and the rectified linear unit (ReLu) function ρ(r) = max(r, 0).
One often adds a bias term in these models, and considers functions of the form ρ(⟨·, w_k⟩ + z_k), but this bias term can be integrated in the weights as usual by writing ⟨a, w_k⟩ + z_k = ⟨(a, 1), (w_k, z_k)⟩, so we ignore it in the following section. This simply amounts to replacing a ∈ R^p by (a, 1) ∈ R^{p+1} and adding a dimension p ↦ p + 1, as a pre-processing of the features.
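As an illustration (not part of the original text), here is a minimal numpy sketch of this forward pass with the bias folded into the weights as described above; the helper names mlp_forward and with_bias are ours, and the ReLU is just one possible choice of ρ.

```python
import numpy as np

def mlp_forward(a, W, u, rho=lambda r: np.maximum(r, 0.0)):
    # h_{W,u}(a) = <rho(W a), u>, here with the ReLU non-linearity rho
    return np.dot(u, rho(W @ a))

def with_bias(a):
    # fold the bias in by appending a constant feature, a -> (a, 1)
    return np.append(a, 1.0)

rng = np.random.default_rng(0)
p, q = 5, 8
W = rng.standard_normal((q, p + 1))    # rows (w_k, z_k): weights and biases together
u = rng.standard_normal(q)
a = rng.standard_normal(p)
print(mlp_forward(with_bias(a), W, u))
```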

Expressiveness. In order to define functions of arbitrary complexity when q increases, it is important that ρ is non-linear. Indeed, if ρ(s) = s, then h_{W,u}(a) = ⟨W a, u⟩ = ⟨a, W^⊤ u⟩. It is thus a linear function with weights W^⊤ u, whatever the number q of neurons. Similarly, if ρ is a polynomial on R of degree d, then h_{W,u}(·) is itself a polynomial of degree d on R^p, which belongs to a linear space V of finite dimension dim(V) = O(p^d). So even if q increases, the dimension dim(V) stays fixed and h_{W,u}(·) cannot approximate an arbitrary function outside V. In sharp contrast, one can show that if ρ is not polynomial, then h_{W,u}(·) can approximate any continuous function, as studied in Section 15.1.3.
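This collapse in the linear case is easy to check numerically; the following tiny sketch (ours) verifies that, with the identity activation, the MLP output coincides with the linear model ⟨a, W^⊤ u⟩ for an arbitrary q.

```python
import numpy as np

# With a linear "activation" rho(s) = s, the MLP collapses to a linear model:
# h_{W,u}(a) = <W a, u> = <a, W^T u>, whatever the number q of neurons.
rng = np.random.default_rng(0)
p, q = 4, 50
a = rng.standard_normal(p)
W = rng.standard_normal((q, p))
u = rng.standard_normal(q)
print(np.allclose(np.dot(W @ a, u), np.dot(a, W.T @ u)))   # True
```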

15.1.2 MLP and Gradient Computation


Given pairs of features and data values (a_i, y_i)_{i=1}^n, and as usual storing the features in the rows of A ∈ R^{n×p}, we consider the following least square regression functional (similar computations can be done for classification losses)

    min_{x=(W,u)} f(W, u),   where   f(W, u) := (1/2) ∑_{i=1}^n (h_{W,u}(a_i) − y_i)² = (1/2) ||ρ(A W^⊤) u − y||².

Note that here, the parameters being optimized are (W, u) ∈ R^{q×p} × R^q.

Optimizing with respect to u. This function f is convex with respect to u, since it is a quadratic function. Its gradient with respect to u can be computed as in (13.8), and thus

    ∇_u f(W, u) = ρ(A W^⊤)^⊤ (ρ(A W^⊤) u − y),

and one can compute in closed form the solution (assuming ker(ρ(A W^⊤)) = {0}) as

    u⋆ = [ρ(A W^⊤)^⊤ ρ(A W^⊤)]^{−1} ρ(A W^⊤)^⊤ y = [ρ(W A^⊤) ρ(A W^⊤)]^{−1} ρ(W A^⊤) y.

When W = Id_p and ρ(s) = s, one recovers the least square formula (13.9).
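Since this closed form is an ordinary linear least squares problem in u, it can be checked numerically. The sketch below (ours, using a sigmoid ρ and random data) solves for u⋆ with np.linalg.lstsq, which is more stable than explicitly forming the normal equations, and verifies that ∇_u f vanishes at u⋆.

```python
import numpy as np

rho = lambda r: 1.0 / (1.0 + np.exp(-r))        # sigmoid non-linearity

rng = np.random.default_rng(0)
n, p, q = 200, 5, 20
A = rng.standard_normal((n, p))                 # features stored in the rows of A
y = rng.standard_normal(n)
W = rng.standard_normal((q, p))                 # hidden-layer weights, kept fixed here

Phi = rho(A @ W.T)                              # rho(A W^T), of shape (n, q)
u_star, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # minimizes 1/2 ||Phi u - y||^2
grad_u = Phi.T @ (Phi @ u_star - y)             # nabla_u f(W, u_star), should vanish
print(np.linalg.norm(grad_u))                   # ~ 0 up to numerical precision
```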

Optimizing with respect to W. The function f is non-convex with respect to W because the function ρ is itself non-linear. Training an MLP is thus a delicate process, and one can only hope to obtain a local minimum of f. It is also important to initialize the neurons (w_k)_k correctly (for instance as unit norm random vectors, but the bias terms might need some adjustment), while u can usually be initialized at 0.
To compute its gradient with respect to W, we first note that for a perturbation ε ∈ R^{q×p}, one has

    ρ(A(W + ε)^⊤) = ρ(A W^⊤ + A ε^⊤) = ρ(A W^⊤) + ρ'(A W^⊤) ⊙ (A ε^⊤) + o(||ε||),

where we have denoted “⊙” the entry-wise multiplication of matrices, i.e. U ⊙ V = (U_{i,j} V_{i,j})_{i,j}. One thus has

    f(W + ε, u) = (1/2) ||e + [ρ'(A W^⊤) ⊙ (A ε^⊤)] u||² + o(||ε||)   where   e := ρ(A W^⊤) u − y ∈ R^n
                = f(W, u) + ⟨e, [ρ'(A W^⊤) ⊙ (A ε^⊤)] u⟩ + o(||ε||)
                = f(W, u) + ⟨A ε^⊤, ρ'(A W^⊤) ⊙ (e u^⊤)⟩ + o(||ε||)
                = f(W, u) + ⟨ε^⊤, A^⊤ [ρ'(A W^⊤) ⊙ (e u^⊤)]⟩ + o(||ε||).

The gradient thus reads

    ∇_W f(W, u) = [ρ'(W A^⊤) ⊙ (u e^⊤)] A ∈ R^{q×p}.
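These gradient formulas can be validated against finite differences; the following sketch (ours, again with a sigmoid ρ and hypothetical random data) implements ∇_W f as derived above and compares one of its entries with a finite-difference quotient.

```python
import numpy as np

rho  = lambda r: 1.0 / (1.0 + np.exp(-r))       # sigmoid
drho = lambda r: rho(r) * (1.0 - rho(r))        # its derivative rho'

def f(W, u, A, y):
    # f(W, u) = 1/2 ||rho(A W^T) u - y||^2
    return 0.5 * np.sum((rho(A @ W.T) @ u - y) ** 2)

def grad_W(W, u, A, y):
    e = rho(A @ W.T) @ u - y                     # residual e = rho(A W^T) u - y
    return (drho(W @ A.T) * np.outer(u, e)) @ A  # [rho'(W A^T) (.) (u e^T)] A

rng = np.random.default_rng(0)
n, p, q = 50, 4, 6
A, y = rng.standard_normal((n, p)), rng.standard_normal(n)
W, u = rng.standard_normal((q, p)), rng.standard_normal(q)

G, eps = grad_W(W, u, A, y), 1e-6                # finite-difference check of one entry
W2 = W.copy(); W2[2, 1] += eps
print(G[2, 1], (f(W2, u, A, y) - f(W, u, A, y)) / eps)   # the two values roughly agree
```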

15.1.3 Universality
In this section, to ease the exposition, we explicitly introduce the bias and use the variable “x ∈ R^p” in place of “a ∈ R^p”. We thus write the function computed by the MLP (including explicitly the bias z_k) as

    h_{W,z,u}(x) := ∑_{k=1}^q u_k φ_{w_k,z_k}(x)   where   φ_{w,z}(x) := ρ(⟨x, w⟩ + z).

The function φ_{w,z}(x) is a ridge function: it is constant in the directions orthogonal to w̄ := w/||w||, and its transition takes place around the point −(z/||w||) w̄.
In the following we assume that ρ : R → R is a bounded function such that

    ρ(r) → 0 as r → −∞   and   ρ(r) → 1 as r → +∞.   (15.1)

Note in particular that such a function cannot be a polynomial and that the ReLu function does not satisfy these hypotheses (universality for the ReLu is more involved to show). The goal is to show the following theorem.
Theorem 25 (Cybenko, 1989). For any compact set Ω ⊂ R^p, the space spanned by the functions {φ_{w,z}}_{w,z} is dense in C(Ω) for the uniform convergence. This means that for any continuous function f and any ε > 0, there exists q ∈ N and weights (w_k, z_k, u_k)_{k=1}^q such that

    ∀ x ∈ Ω,   |f(x) − ∑_{k=1}^q u_k φ_{w_k,z_k}(x)| ≤ ε.

In a typical ML scenario, this implies that one can “overfit” the data, since using a q large enough ensures that the training error can be made arbitrarily small. Of course, there is a bias-variance tradeoff, and q needs to be cross-validated to account for the finite number n of data points, and to ensure good generalization properties.

Proof in dimension p = 1. In 1D, the approximation h_{W,z,u} can be thought of as an approximation using smoothed step functions. Indeed, introducing a parameter ε > 0, one has (assuming the function is Lipschitz to ensure uniform convergence)

    φ_{w/ε, z/ε} → 1_{[−z/w,+∞[}   as ε → 0.

This means that

    h_{W/ε, z/ε, u} → ∑_k u_k 1_{[−z_k/w_k,+∞[}   as ε → 0,

which is a piecewise constant function. Conversely, any piecewise constant function can be written this way. Indeed, if h assumes the value d_k on each interval [t_k, t_{k+1}[, then it can be written as

    h = ∑_k d_k (1_{[t_k,+∞[} − 1_{[t_{k+1},+∞[}).

Since the space of piecewise constant functions is dense in the space of continuous functions over an interval, this proves the theorem.
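The 1-D construction of the proof can be illustrated numerically. The sketch below (ours) uses a sigmoid ρ, an arbitrary target f = cos, and the telescoped jump coefficients of the proof; the reported sup-norm error shrinks as q grows.

```python
import numpy as np

sigma = lambda r: 0.5 * (1.0 + np.tanh(0.5 * r))   # logistic sigmoid (tanh form avoids overflow)

def mlp_1d(x, t, c, eps=1e-3):
    # sum_k c_k rho((x - t_k)/eps): q = len(t) neurons with w_k = 1/eps, z_k = -t_k/eps
    return sum(ck * sigma((x - tk) / eps) for tk, ck in zip(t, c))

f = np.cos                                         # target continuous function on [0, 3]
q = 200
t = np.linspace(-0.1, 3.0, q)                      # step locations t_k
c = np.diff(f(t), prepend=0.0)                     # telescoped jumps c_k, so that the partial
                                                   # sums reproduce the piecewise constant h
x = np.linspace(0.0, 3.0, 1000)
print(np.max(np.abs(f(x) - mlp_1d(x, t, c))))      # small, and decreases as q grows
```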

Proof in arbitrary dimension p. We start by proving the following dual characterization of density, using bounded Borel measures µ ∈ M(Ω), i.e. such that µ(Ω) < +∞.
Proposition 48. If ρ is such that, for any Borel measure µ ∈ M(Ω),

    ( ∀ (w, z),   ∫ ρ(⟨x, w⟩ + z) dµ(x) = 0 )   =⇒   µ = 0,   (15.2)

then Theorem 25 holds.

Proof. We consider the linear space

    S := { ∑_{k=1}^q u_k φ_{w_k,z_k} ; q ∈ N, w_k ∈ R^p, u_k ∈ R, z_k ∈ R } ⊂ C(Ω).

Let S̄ be its closure in C(Ω) for || · ||_∞, which is a Banach space. If S̄ ≠ C(Ω), let us pick g ≠ 0, g ∈ C(Ω)\S̄. We define the linear form L on S̄ ⊕ span(g) as

    ∀ s ∈ S̄, ∀ λ ∈ R,   L(s + λ g) = λ,

so that L = 0 on S̄. L is a bounded linear form, so that by the Hahn-Banach theorem, it can be extended into a bounded linear form L̄ : C(Ω) → R. Since L̄ ∈ C(Ω)* (the dual space of continuous linear forms), and since this dual space is identified with Borel measures, there exists µ ∈ M(Ω), with µ ≠ 0, such that for any continuous function h, L̄(h) = ∫_Ω h(x) dµ(x). But since L̄ = 0 on S̄, ∫ ρ(⟨·, w⟩ + z) dµ = 0 for all (w, z), and thus by hypothesis, µ = 0, which is a contradiction.
The theorem now follows from the following proposition.
Proposition 49. If ρ is continuous and satisfies (15.1), then it satisfies (15.2).
Proof. One has

    φ_{w/ε, u/ε + t}(x) = ρ( (⟨x, w⟩ + u)/ε + t )  →  γ(x)  as ε → 0,   where   γ(x) :=  1     if x ∈ H_{w,u},
                                                                                        ρ(t)  if x ∈ P_{w,u},
                                                                                        0     if ⟨w, x⟩ + u < 0,

where we defined H_{w,u} := {x ; ⟨w, x⟩ + u > 0} and P_{w,u} := {x ; ⟨w, x⟩ + u = 0}. By Lebesgue dominated convergence (since the involved quantities are bounded uniformly on a compact set),

    ∫ φ_{w/ε, u/ε + t} dµ  →  ∫ γ dµ = ρ(t) µ(P_{w,u}) + µ(H_{w,u})   as ε → 0.

Thus if µ is such that all these integrals vanish, then

    ∀ (w, u, t),   ρ(t) µ(P_{w,u}) + µ(H_{w,u}) = 0.

By selecting (t, t') such that ρ(t) ≠ ρ(t'), one obtains

    ∀ (w, u),   µ(P_{w,u}) = µ(H_{w,u}) = 0.

We now need to show that µ = 0. For a fixed w ∈ R^p, we consider the function F defined, for h ∈ L^∞(R), by

    F(h) := ∫ h(⟨w, x⟩) dµ(x).

F : L^∞(R) → R is a bounded linear form since |F(h)| ≤ ||h||_∞ µ(Ω) and µ(Ω) < +∞. One has

    F(1_{[−u,+∞[}) = ∫ 1_{[−u,+∞[}(⟨w, x⟩) dµ(x) = µ(P_{w,u}) + µ(H_{w,u}) = 0.

By linearity, F(h) = 0 for all piecewise constant functions, and F is a continuous linear form, so that by density F(h) = 0 for all functions h ∈ L^∞(R). Applying this to h(r) = e^{i r}, one obtains

    µ̂(w) := ∫ e^{i⟨x, w⟩} dµ(x) = 0.

This means that the Fourier transform of µ is zero, so that µ = 0.

Quantitative rates. Note that Theorem 25 is not constructive, in the sense that it does not explain how to compute the weights (w_k, u_k, z_k)_k to reach a desired accuracy. Since for a fixed q the corresponding optimization problem is non-convex, this is not surprising. Some recent studies show that if q is large enough, a simple gradient descent is able to reach an arbitrarily good accuracy, but it might require a very large q.
Theorem 25 is also not quantitative, since it does not tell how many neurons q are needed to reach a desired accuracy. To obtain quantitative bounds, continuity is not enough; one needs to add smoothness constraints.
For instance, Barron proved that if

    ∫ ||ω|| |f̂(ω)| dω ≤ C_f,

where f̂(ω) = ∫ f(x) e^{−i⟨x, ω⟩} dx is the Fourier transform of f, then for any q ∈ N there exists (w_k, u_k, z_k)_k such that

    (1/Vol(B(0, r))) ∫_{||x||≤r} |f(x) − ∑_{k=1}^q u_k φ_{w_k,z_k}(x)|² dx ≤ (2 r C_f)² / q.

The surprising part of this theorem is that the 1/q decay is independent of the dimension p. Note however that the constant C_f involved might depend on p.

15.2 Deep Discriminative Models


15.2.1 Deep Network Structure
Deep learning models are estimators f(x, β) which are built as compositions of simple building blocks. In their simplest form (non-recursive), they correspond to a simple linear computational graph as already defined in (14.20) (without the loss L), and we write this as

    f(·, β) = f_{L−1}(·, β_{L−1}) ∘ f_{L−2}(·, β_{L−2}) ∘ … ∘ f_0(·, β_0),

where β = (β_0, …, β_{L−1}) is the set of parameters, and

    f_ℓ(·, β_ℓ) : R^{n_ℓ} → R^{n_{ℓ+1}}.

While it is possible to consider more complicated architectures (in particular recurrent ones), we restrict here our attention to these simple linear graph computation structures (so-called feedforward networks).
The supervised learning of these parameters β is usually done by empirical risk minimization (12.11) using SGD-type methods as explained in Section 14.2. Note that this results in highly non-convex optimization problems. In particular, strong convergence guarantees such as Theorem 24 do not hold anymore, and only weak convergence (toward stationary points) holds. SGD-type techniques are however found to work surprisingly well in practice, and it is now believed that the success of these deep-architecture approaches (in particular the ability of these over-parameterized models to generalize well) is in large part due to the dynamics of the SGD itself, which induces an implicit regularization effect.
For these simple linear architectures, the gradient of the ERM loss (14.13) can be computed using the reverse mode computation detailed in Section ??; in the context of deep learning, this leads to formula (15.4) below. One should however keep in mind that for more complicated (e.g. recursive) architectures, such a simple formula is no longer available, and one should resort to reverse mode automatic differentiation (see Section ??), which, while being conceptually simple, actually implements a possibly highly non-trivial and computationally optimal recursive differentiation.
In most successful applications of deep learning, each computational block f_ℓ(·, β_ℓ) is actually very simple, and is the composition of an affine map x ↦ B_ℓ x + b_ℓ, with a matrix B_ℓ ∈ R^{ñ_ℓ×n_ℓ} and a vector b_ℓ ∈ R^{ñ_ℓ} parametrized (in most cases linearly) by β_ℓ, and of a fixed (not depending on β_ℓ) non-linearity ρ_ℓ : R^{ñ_ℓ} → R^{n_{ℓ+1}}, which we write as

    ∀ x_ℓ ∈ R^{n_ℓ},   f_ℓ(x_ℓ, β_ℓ) = ρ_ℓ(B_ℓ x_ℓ + b_ℓ) ∈ R^{n_{ℓ+1}}.   (15.3)

Figure 15.1: Left: example of fully connected network. Right: example of convolutional neural network.
In the simplest case, the so-called “fully connected” one, one has (B_ℓ, b_ℓ) = β_ℓ, i.e. B_ℓ is a full matrix and its entries (together with the bias b_ℓ) are equal to the set of parameters β_ℓ. Also in the simplest cases, ρ_ℓ is a pointwise non-linearity ρ_ℓ(z) = (ρ̃_ℓ(z_k))_k, where ρ̃_ℓ : R → R is non-linear. The most usual choices are the rectified linear unit (ReLu) ρ̃_ℓ(s) = max(s, 0) and the sigmoid ρ̃_ℓ(s) = θ(s) = (1 + e^{−s})^{−1}.
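For concreteness, here is a minimal sketch (ours, with hypothetical layer dimensions) of the feedforward evaluation of such a fully connected network, applying (15.3) layer by layer with a ReLU non-linearity.

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def feedforward(x, params, rho=relu):
    # evaluates f(x, beta) = f_{L-1}( ... f_0(x, beta_0) ..., beta_{L-1})
    # with f_l(x_l, beta_l) = rho(B_l x_l + b_l), cf. (15.3), fully connected case
    for B, b in params:
        x = rho(B @ x + b)
    return x

rng = np.random.default_rng(0)
dims = [10, 32, 16, 1]                                  # hypothetical n_0, ..., n_L
params = [(0.1 * rng.standard_normal((m, n)), np.zeros(m))
          for n, m in zip(dims[:-1], dims[1:])]
print(feedforward(rng.standard_normal(dims[0]), params))
```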
The important point here is that the interleaving of non-linear maps progressively increases the complexity of the function f(·, β).
The parameters β = (B_ℓ, b_ℓ)_ℓ of such a deep network are then trained by minimizing the ERM functional (12.11) using SGD-type stochastic optimization methods. The gradient can be computed efficiently (with complexity proportional to the application of the model, i.e. O(∑_ℓ n_ℓ²)) by automatic differentiation. Since such models are purely feedforward, one can directly use the back-propagation formula (14.20).
For regression tasks, one can directly use the output of the last layer (using e.g. a ReLu non-linearity) in conjunction with a squared ℓ² loss L. For classification tasks, the output of the last layer needs to be transformed into class probabilities by a multi-class logistic map (??).
An issue with such a fully connected setting is that the number of parameters is too large to be applicable to large-scale data such as images. Furthermore, it ignores any prior knowledge about the data, such as, for instance, invariances. This is addressed by more structured architectures, such as the convolutional networks detailed in Section 15.2.3.

15.2.2 Perceptron and Shallow Models


Before going on with the description of deep architectures, let us re-interpret the logistic classification method detailed in Sections 12.4.2 and 12.4.3.
The two-class logistic classification model (12.21) is equal to a single-layer (L = 1) network of the form (15.3) (ignoring the constant bias term) where

    B_0 x = ⟨x, β⟩   and   ρ̃_0(u) = θ(u).

The resulting one-layer network f(x, β) = θ(⟨x, β⟩) (possibly including a bias term by adding one dummy dimension to x) is trained using the loss, for binary classes y ∈ {0, 1},

    L(t, y) = − log(t^y (1 − t)^{1−y}) = −y log(t) − (1 − y) log(1 − t).

In this case, the ERM optimization is of course a convex program.

Multi-class models with K classes are obtained by computing B_0 x = (⟨x, β_k⟩)_{k=1}^K, and a normalized logistic map

    f(x, β) = N((exp(⟨x, β_k⟩))_k)   where   N(u) = u / ∑_k u_k,

and, assuming the classes are represented using vectors y on the probability simplex, one should use as loss

    L(t, y) = − ∑_{k=1}^K y_k log(t_k).
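A small sketch (ours) of this normalized logistic map and of the associated cross-entropy loss; the max-shift inside the exponential is a standard numerical-stability trick, not something discussed in the text.

```python
import numpy as np

def softmax(scores):
    # N((exp(<x, beta_k>))_k); subtracting the max is a standard stability trick
    e = np.exp(scores - scores.max())
    return e / e.sum()

def cross_entropy(t, y):
    # L(t, y) = -sum_k y_k log(t_k), with y on the probability simplex
    return -np.sum(y * np.log(t))

rng = np.random.default_rng(1)
K, p = 3, 5
beta = rng.standard_normal((K, p))          # rows beta_k
x = rng.standard_normal(p)
t = softmax(beta @ x)                       # class probabilities f(x, beta)
y = np.array([0.0, 1.0, 0.0])               # one-hot representation of the class
print(t, cross_entropy(t, y))
```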

15.2.3 Convolutional Neural Networks


In order to be able to tackle data of large size, and also to improve the performance, it is important to leverage some prior knowledge about the structure of the typical data to process. For instance, for signals, images or videos, it is important to make use of the spatial location of the pixels and the translation invariance (up to boundary handling issues) of the domain.
Convolutional neural networks are obtained by considering that the manipulated vectors x_ℓ ∈ R^{n_ℓ} at depth ℓ in the network are of the form x_ℓ ∈ R^{n̄_ℓ×d_ℓ}, where n̄_ℓ is the number of “spatial” positions (typically along a 1-D, 2-D, or 3-D grid) and d_ℓ is the number of “channels”. For instance, for color images, one starts with n̄_0 being the number of pixels and d_0 = 3.
The linear operator B_ℓ : R^{n̄_ℓ×d_ℓ} → R^{n̄_ℓ×d_{ℓ+1}} is then (up to boundary artefacts) translation invariant and hence a convolution along each channel (note that the number of channels can change between layers). It is thus parameterized by a set of filters (ψ_{ℓ,r,s})_{r=1,…,d_{ℓ+1}, s=1,…,d_ℓ}. Denoting x_ℓ = (x_{ℓ,s,·})_{s=1}^{d_ℓ} the different channels composing x_ℓ, the linear map reads

    ∀ r ∈ {1, …, d_{ℓ+1}},   (B_ℓ x_ℓ)_{r,·} = ∑_{s=1}^{d_ℓ} ψ_{ℓ,r,s} ⋆ x_{ℓ,s,·},

and the bias term b_ℓ ∈ R is constant (to maintain translation invariance).
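For concreteness, here is a minimal 1-D sketch (our own code) of this multi-channel convolution; the filter shapes and the use of np.convolve with “same” boundary handling are illustrative choices.

```python
import numpy as np

def conv_layer(x, psi, b):
    # B_l x_l + b_l for 1-D signals: x has shape (n_spatial, d_in), psi has shape
    # (d_out, d_in, m) and b has shape (d_out,); output has shape (n_spatial, d_out)
    n_bar, d_in = x.shape
    d_out = psi.shape[0]
    y = np.zeros((n_bar, d_out))
    for r in range(d_out):                 # (B_l x_l)_{r,.} = sum_s psi_{l,r,s} * x_{l,s,.}
        for s in range(d_in):
            y[:, r] += np.convolve(x[:, s], psi[r, s], mode="same")
        y[:, r] += b[r]                    # constant bias per output channel
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 3))           # 32 spatial positions, 3 input channels
psi = rng.standard_normal((8, 3, 5))       # 8 output channels, filters of length 5
print(conv_layer(x, psi, np.zeros(8)).shape)   # (32, 8)
```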


The non-linear maps across layers serve two purposes: as before, a pointwise non-linearity is applied, and then a sub-sampling helps to reduce the computational complexity of the network. This is very similar to the construction of the fast wavelet transform. Denoting by m_ℓ the amount of down-sampling, where usually m_ℓ = 1 (no reduction) or m_ℓ = 2 (reduction by a factor two in each direction), one has

    λ_ℓ(u) = ( λ̃_ℓ(u_{s, m_ℓ ·}) )_{s=1,…,d_{ℓ+1}}.

In the literature, it has been proposed to replace linear sub-sampling by non-linear sub-sampling, for instance the so-called max-pooling (which operates by taking the maximum among groups of m_ℓ successive values), but it seems that linear sub-sampling is sufficient in practice when used in conjunction with very deep (large L) architectures.
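A small sketch (ours, 1-D case) contrasting this linear sub-sampling applied after a pointwise non-linearity with the max-pooling alternative.

```python
import numpy as np

def pool_linear(u, m=2, rho=lambda v: np.maximum(v, 0.0)):
    # lambda_l: pointwise non-linearity followed by linear sub-sampling by a factor m
    # u has shape (n_spatial, n_channels); keep one spatial sample out of m
    return rho(u)[::m, :]

def max_pool(u, m=2):
    # non-linear alternative: maximum over groups of m successive spatial values
    n, d = u.shape
    return u[: n - n % m].reshape(n // m, m, d).max(axis=1)

u = np.random.default_rng(0).standard_normal((32, 4))
print(pool_linear(u).shape, max_pool(u).shape)    # (16, 4) (16, 4)
```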
The intuition behind such models is that, as one moves deeper through the layers, the neurons become receptive to larger areas in the image domain (although, since the transform is non-linear, precisely giving sense to this statement and defining a proper “receptive field” is non-trivial). Using an increasing number of channels helps to define different classes of “detectors” (at the first layer, they detect simple patterns such as edges and corners, and progressively capture more elaborate shapes).
In practice, the last few layers (2 or 3) of such CNN architectures are chosen to be fully connected. This is possible because, thanks to the sub-sampling, the dimensions of these layers are small.
The parameters of such a model are the filters β = (ψ_{ℓ,r,s})_{ℓ,s,r}, and they are trained by minimizing the ERM functional (12.11). The gradient is typically computed by backpropagation. Indeed, when computing the gradient with respect to some filter ψ_{ℓ,r,s}, the feedforward computational graph has the form (14.20). For simplicity, we re-formulate this computation in the case of a single channel per layer (multiple channels can be understood by replacing scalar convolution by matrix-valued convolution). The forward pass computes all the inner coefficients, by traversing the network from ℓ = 0 to ℓ = L − 1,

    x_{ℓ+1} = λ_ℓ(ψ_ℓ ⋆ x_ℓ),

where λ_ℓ(u) = (λ̃_ℓ(u_i))_i is applied component-wise. Then, denoting E(β) = L(x_L, y) the loss to be minimized with respect to the set of filters β = (ψ_ℓ)_ℓ, and denoting ∇_ℓ E(β) = ∂E(β)/∂x_ℓ the gradient with respect to x_ℓ, one computes all these gradients by traversing the network in reverse order, from ℓ = L − 1 to ℓ = 0,

    ∇_ℓ E(β) = ψ̄_ℓ ⋆ [λ'_ℓ(ψ_ℓ ⋆ x_ℓ) ⊙ ∇_{ℓ+1} E(β)],   (15.4)

where λ'_ℓ(u) = (λ̃'_ℓ(u_i))_i applies the derivative of λ̃_ℓ component-wise, ψ̄_ℓ = ψ_ℓ(−·) is the reversed filter, and ⊙ is the pointwise multiplication of vectors. The recursion is initialized as ∇_L E(β) = ∇L(x_L, y), the gradient of the loss itself. The gradient with respect to the filter ψ_ℓ is then obtained from the same intermediate quantity, as ∂E(β)/∂ψ_ℓ = x̄_ℓ ⋆ [λ'_ℓ(ψ_ℓ ⋆ x_ℓ) ⊙ ∇_{ℓ+1} E(β)].
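The forward pass and the backward recursion (15.4) can be implemented in a few lines; the sketch below (ours) assumes a 1-D, single-channel network with circular convolutions, a ReLU non-linearity and a quadratic loss, and ends with a finite-difference check on one filter coefficient.

```python
import numpy as np

def relu(u):  return np.maximum(u, 0.0)
def drelu(u): return (u > 0.0).astype(float)

def circ_conv(h, x):       # circular convolution h * x (via the FFT)
    return np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(x)))

def circ_corr(h, x):       # convolution with the reversed filter h(-.)
    return np.real(np.fft.ifft(np.conj(np.fft.fft(h)) * np.fft.fft(x)))

def forward(x0, filters):
    xs = [x0]
    for psi in filters:                    # x_{l+1} = relu(psi_l * x_l)
        xs.append(relu(circ_conv(psi, xs[-1])))
    return xs                              # xs[l] = x_l, xs[-1] = x_L

def backward(xs, filters, y):
    g = xs[-1] - y                         # nabla_L E for E = 1/2 ||x_L - y||^2
    grad_filters = []
    for psi, x in zip(reversed(filters), reversed(xs[:-1])):
        delta = drelu(circ_conv(psi, x)) * g          # lambda'(psi*x) (.) nabla_{l+1} E
        grad_filters.append(circ_corr(x, delta))      # dE/dpsi_l = xbar_l * delta
        g = circ_corr(psi, delta)                     # nabla_l E = psibar_l * delta
    return grad_filters[::-1]

n, L = 16, 3
rng = np.random.default_rng(0)
x0, y = rng.standard_normal(n), rng.standard_normal(n)
filters = [0.5 * rng.standard_normal(n) for _ in range(L)]
loss = lambda fs: 0.5 * np.sum((forward(x0, fs)[-1] - y) ** 2)
grads = backward(forward(x0, filters), filters, y)
eps = 1e-6
fs2 = [f.copy() for f in filters]; fs2[1][3] += eps
print(grads[1][3], (loss(fs2) - loss(filters)) / eps)   # the two values roughly agree
```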
This recursion (15.4) is the celebrated backpropagation algorithm put forward by Yann LeCun. Note that to understand and code these iterations, one does not need to rely on the advanced machinery of reverse mode automatic differentiation exposed in Section ??. The general automatic differentiation method is however crucial to master, because advanced deep-learning architectures are not purely feedforward, and might include recursive connections. Furthermore, automatic differentiation is useful outside deep learning, and considerably eases prototyping for modern data science with complicated non-linear models.

15.2.4 Scattering Transform


The scattering transform, introduced by Mallat and his collaborators, is a specific instance of a deep convolutional network, where the filters (ψ_{ℓ,r,s})_{ℓ,s,r} are not trained, but are fixed to be wavelet filters. This network can be understood as a non-linear extension of the wavelet transform. In practice, the fact that it is fixed prevents it from being applied to arbitrary data (it is used mostly on signals and images) and it does not lead to state-of-the-art results for natural images. Nevertheless, it allows one to derive some regularity properties of the feature extraction map f(·, β) computed by the network, in terms of stability to diffeomorphisms. It can also be used as a set of fixed initial features which can be further enhanced by a trained deep network, as shown by Edouard Oyallon.

