
Mathematical Foundations of Data Sciences

Gabriel Peyré
CNRS & DMA
École Normale Supérieure
gabriel.peyre@ens.fr
https://mathematical-tours.github.io
www.numerical-tours.com

December 11, 2024


Chapter 1

Shallow Learning

In this chapter, we study the simplest example of non-linear parametric models, namely the Multi-Layer Perceptron (MLP) with a single hidden layer (so it has two layers in total). The perceptron (with no hidden layer) corresponds to the linear models studied in the previous chapter. MLPs with more layers are obtained by stacking several such simple MLPs, and are studied in Section ??, since the computation of their derivatives is well suited to automatic-differentiation methods.

1.1 Multi-layer Perceptron


1.1.1 Multi-layer

Let us first consider the general case of an arbitrary number of layers, which defines a mapping f_θ : R^d → R^{d_S}. Starting from x_0 := x ∈ R^{d_0} = R^d, one iterates along the depth index s = 0, . . . , S − 1

x_{s+1} = σ(W_s x_s + b_s)

where W_s ∈ R^{d_{s+1} × d_s} and the bias is b_s ∈ R^{d_{s+1}}. Here σ is a non-linear (and in fact non-polynomial) function applied componentwise, i.e. we denote σ(z) = (σ(z_i))_i.
The most popular non-linearities are sigmoid functions such as

ρ(r) = e^r / (1 + e^r)   and   ρ(r) = (1/π) atan(r) + 1/2,

and the rectified linear unit (ReLU) function ρ(r) = max(r, 0). There is an important difference, both in practice and in theory, between these two classes of activations (bounded vs. unbounded). The ReLU works better in practice because there is less saturation effect, so that gradients do not vanish when the values computed by the network are large. Also, the ReLU is positively 1-homogeneous, which allows one to rescale the weights and, in some proofs, to assume that these weights lie on a unit sphere. A difficulty, however, is that the ReLU is not differentiable at 0, which makes some rigorous proofs delicate (but in practice this non-smoothness seems harmless).
In order to define functions of arbitrary complexity as the width (the number of neurons per layer) increases, it is important that σ is non-polynomial. Otherwise, f_θ would be a polynomial whose degree grows with the depth S, so these functions would for instance not be dense in the space of continuous functions. Note however that the linear case σ = Id is of independent interest for computing matrix factorizations, but this does not correspond to a supervised learning problem (it rather corresponds to dimensionality reduction, using PCA or non-negative matrix factorization).
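
To make the recursion x_{s+1} = σ(W_s x_s + b_s) concrete, here is a minimal NumPy sketch of a forward pass through an MLP of arbitrary depth; the layer widths, the ReLU activation and the random Gaussian initialization are illustrative assumptions, not prescriptions of the text.

import numpy as np

def relu(z):
    # ReLU applied componentwise, sigma(z) = (max(z_i, 0))_i
    return np.maximum(z, 0.0)

def mlp_forward(x, weights, biases, sigma=relu):
    # Iterates x_{s+1} = sigma(W_s x_s + b_s), as in the recursion above.
    # (In practice the last layer is often kept linear; here we follow the text.)
    for W, b in zip(weights, biases):
        x = sigma(W @ x + b)
    return x

# Illustrative example with widths d_0 = 3, d_1 = 8, d_2 = 8, d_3 = 1
rng = np.random.default_rng(0)
dims = [3, 8, 8, 1]
weights = [rng.standard_normal((dims[s + 1], dims[s])) for s in range(len(dims) - 1)]
biases = [rng.standard_normal(dims[s + 1]) for s in range(len(dims) - 1)]
x0 = rng.standard_normal(dims[0])
print(mlp_forward(x0, weights, biases))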

1.1.2 Two-layer MLPs
We consider two-layer neural networks of the form

f_θ(x) := Σ_{k=1}^n u_k σ(⟨v_k, x⟩ + b_k),   ∀x ∈ R^d,   (1.1)

where σ is the activation function. The parameters of the network are denoted θ_k = (u_k ∈ R^{d′}, v_k ∈ R^d, b_k ∈ R) for k = 1, . . . , n. In most of the following, for the sake of simplicity, we consider d′ = 1, i.e. real-valued outputs.
In practice, neural networks are trained by gradient descent, i.e. we consider a loss (here quadratic for the sake of simplicity)

min_θ E(θ) := ∫ ∥f_θ(x) − y∥^2 dρ(x, y)   (1.2)

and use the gradient iteration (it is of course possible to use SGD instead)

θ_{t+1} = θ_t − τ_t ∇E(θ_t).

Gradient computation. Ignoring the biases b_k for simplicity, we can write the network in matrix form as f_θ(x) = U σ(V^⊤ x), where U ∈ R^{d′ × n} and V ∈ R^{d × n}. For the sake of simplicity, we assume there is a finite number N of data points X = (x_i)_{i=1}^N ∈ R^{d × N} and Y = (y_i)_{i=1}^N ∈ R^{d′ × N}. Training with an ℓ^2 loss thus reads

min_{U,V} E(U, V) := (1/2) ∥U σ(V^⊤ X) − Y∥^2.

If we denote Z := σ(V^⊤ X) (which can be thought of as applying the feature map x ↦ σ(V^⊤ x) to the data), then training U is a classical least squares problem (1/2) ∥U Z − Y∥^2, and the gradient reads

∇_U E(U, V) = (U Z − Y) Z^⊤.

To compute the gradient with respect to V, we perform a Taylor expansion. Denoting R := U σ(V^⊤ X) − Y, S := σ′(V^⊤ X) and H := U [S ⊙ (D^⊤ X)],

E(U, V + εD) = (1/2) ∥R + ε U [S ⊙ (D^⊤ X)]∥^2 = E(U, V) + ε ⟨R, H⟩ + O(ε^2),

where we denoted A ⊙ B := (A_{i,j} B_{i,j})_{i,j}, and where

⟨R, H⟩ = ⟨R, U [S ⊙ (D^⊤ X)]⟩ = ⟨D^⊤ X, (U^⊤ R) ⊙ S⟩ = ⟨X^⊤ D, (R^⊤ U) ⊙ S^⊤⟩,

which leads to

∇_V E(U, V) = X [(R^⊤ U) ⊙ S^⊤].
This computation is quite painful, and the advice is not to use this kind of derivation for deeper networks: not only does it become overly complicated, it is also vastly sub-optimal. The correct way to compute this gradient is the back-propagation method, which corresponds to reverse-mode automatic differentiation.
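
As a sanity check, the closed-form gradients above can be compared against a finite-difference approximation on random data. The NumPy sketch below is only an illustration (the dimensions, the tanh activation and the random data are arbitrary choices); the resulting ∇_U E and ∇_V E are exactly what one would plug into the iteration θ_{t+1} = θ_t − τ_t ∇E(θ_t).

import numpy as np

rng = np.random.default_rng(1)
d, dp, n, N = 4, 2, 6, 50            # input dim d, output dim d', n neurons, N samples
X = rng.standard_normal((d, N))
Y = rng.standard_normal((dp, N))
U = rng.standard_normal((dp, n))
V = rng.standard_normal((d, n))

sigma = np.tanh                      # smooth activation, so sigma' is well defined
dsigma = lambda z: 1.0 - np.tanh(z) ** 2

def energy(U, V):
    # E(U, V) = (1/2) || U sigma(V^T X) - Y ||^2
    return 0.5 * np.sum((U @ sigma(V.T @ X) - Y) ** 2)

# Closed-form gradients derived in the text
Z = sigma(V.T @ X)                   # n x N feature matrix
R = U @ Z - Y                        # d' x N residual
S = dsigma(V.T @ X)                  # n x N
grad_U = R @ Z.T                     # (U Z - Y) Z^T
grad_V = X @ ((R.T @ U) * S.T)       # X [(R^T U) ⊙ S^T]

# Finite-difference check of grad_V along a random direction D
eps = 1e-6
D = rng.standard_normal(V.shape)
fd = (energy(U, V + eps * D) - energy(U, V - eps * D)) / (2 * eps)
print(abs(fd - np.sum(grad_V * D)))  # should be very small (roughly 1e-7 or below)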

1.2 L∞ non-quantitative universality


If σ is a sigmoid function, a theorem of George Cybenko, later refined by Kurt Hornik, Maxwell Stinchcombe, and Halbert White, shows that the functions f_θ can approximate any continuous function uniformly on a compact domain. We thus insist here on an L^∞ approximation, which is strictly stronger (and more difficult) than the L^2 error control considered during training in (1.2).

Proposition 1. If σ is an increasing (not necessarily continuous) function satisfying

lim_{s→−∞} σ(s) = 0   and   lim_{s→+∞} σ(s) = 1,

and K ⊂ R^d is compact, then for any continuous function f on K and any ε > 0, there exist n and parameters (θ_k)_k such that

sup_{x∈K} |f(x) − f_θ(x)| ⩽ ε.

This theorem establishes the universal approximation property of two-layer neural networks. However, it does not provide bounds on the number of neurons n required as a function of ε. Furthermore, the proof does not constructively specify how to determine the parameters of the approximating network f_θ. The first proof was given by Cybenko [15] using a duality argument. We detail next the proof due to Hornik et al., which is somewhat more constructive and relies on the Stone-Weierstrass theorem to perform a Fourier-type approximation. In contrast to a direct Fourier series expansion, this yields a uniform approximation of continuous functions, which Fourier series do not provide.
Proof. The proof first considers the activation σ = cos (note that the initial density argument would also work with σ = exp, which interestingly is an unbounded activation). Consider the function space

A := { Σ_{k=1}^n u_k cos(⟨v_k, x⟩ + b_k) : n ∈ N, (u_k, b_k, v_k)_k }.

This space is an algebra of continuous functions on the compact set K. It contains the constant functions and separates points; that is, for x ≠ x′, there exists w such that cos(⟨w, x⟩) ≠ cos(⟨w, x′⟩). By the Stone-Weierstrass theorem, A is dense in the space of continuous functions on K.
Let r := max_k (∥v_k∥ · Radius(K) + |b_k|). By the previous density result, to approximate functions on K it thus suffices to approximate cos(s) on the interval [−r, r]. Splitting this interval into subintervals where cos(s) is monotone, the problem reduces to approximating the rectified cosine squashing function

cos_+(s) := 0 for s ⩽ 0,   cos_+(s) := 1 for s ⩾ π/2,   cos_+(s) := 1 − cos(s) for s ∈ [0, π/2].

The goal is thus to construct σ-based functions satisfying

| Σ_k u_k σ(v_k s + b_k) − cos_+(s) | ⩽ ε,

where u_k, b_k, v_k ∈ R.
Divide [0, π/2] into Q subintervals [s_k, s_{k+1}], where s_k := cos_+^{−1}(k/Q). Choose M > 0 large enough so that

σ(−M) < ε/(2Q)   and   σ(M) > 1 − ε/(2Q).

Define v_k and b_k such that the affine map s ↦ v_k s + b_k sends [s_k, s_{k+1}] to [−M, M], and set the weights u_k = 1/Q. For s in a given subinterval [s_j, s_{j+1}], the terms with k < j are then within ε/(2Q^2) of 1/Q, the terms with k > j are within ε/(2Q^2) of 0, and the j-th term lies in [0, 1/Q], so that

| Σ_k u_k σ(v_k s + b_k) − cos_+(s) | ⩽ 1/Q + ε/2 ⩽ ε,

provided Q ⩾ 2/ε. Combining the results, the network f_θ can approximate any continuous function f on K to within ε, completing the proof.
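
The staircase construction of this proof is easy to reproduce numerically. The sketch below is only a rough illustration (it uses the logistic sigmoid and arbitrary values of Q and M): it builds the weights u_k = 1/Q, v_k, b_k as described and measures the uniform error against cos_+.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def cos_plus(s):
    return np.where(s <= 0, 0.0, np.where(s >= np.pi / 2, 1.0, 1.0 - np.cos(s)))

Q = 50
M = 20.0                                    # sigmoid(-M) and 1 - sigmoid(M) are then tiny
sk = np.arccos(1.0 - np.arange(Q + 1) / Q)  # breakpoints s_k = cos_+^{-1}(k/Q) in [0, pi/2]
v = 2 * M / (sk[1:] - sk[:-1])              # affine maps sending [s_k, s_{k+1}] to [-M, M]
b = -M - v * sk[:-1]
u = np.full(Q, 1.0 / Q)                     # weights u_k = 1/Q

def approx(s):
    s = np.atleast_1d(s)[:, None]
    return (u * sigmoid(v * s + b)).sum(axis=1)

grid = np.linspace(-2.0, 4.0, 2001)
print(np.max(np.abs(approx(grid) - cos_plus(grid))))   # roughly of order 1/Q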

1.3 L2 Quantitative Approximation (Barron’s theorem)
In contrast to the uniform error control of the previous section, we consider here the L^2 approximation appearing in the initial loss (1.2). We focus only on the approximation error, so we assume that the data satisfy exactly y = f(x) for some function f to be approximated, and that x is distributed according to some measure ρ(x). Another limitation of the theory detailed next is that we assume ρ is compactly supported on a ball of radius R. Without additional regularity hypotheses on f, it is not possible to show any rate (i.e. approximation by a network can be arbitrarily slow). The functional space on which one obtains fast rates (independent of the dimension) is called the Barron space, and was introduced by Andrew Barron.

1.3.1 Barron’s space


For an integrable function f, its Fourier transform is defined, for any ξ ∈ R^d, by

f̂(ξ) ≜ ∫_{R^d} f(x) e^{i⟨ξ, x⟩} dx.

The Barron space [3] is the set of functions such that the semi-norm

∥f∥_B ≜ ∫_{R^d} ∥ξ∥ |f̂(ξ)| dξ

is finite. If we impose that f(0) is fixed, one can show that this defines a norm and that the Barron space is a Banach space. Since the Fourier transform of a partial derivative satisfies |(∂_{x_k} f)^∧(ξ)| = |ξ_k| |f̂(ξ)|, one has

∥f∥_B = ∫_{R^d} ∥(∇f)^∧(ξ)∥ dξ,

i.e. ∥f∥_B is the L^1 norm of the Fourier transform of the gradient; this shows that the functions of the Barron space are quite regular. Here are some examples of function classes with the corresponding Barron norm.
• Gaussians: for f(x) = e^{−∥x∥^2/2}, one has ∥f∥_B ⩽ 2√d.

• Ridge functions: let f(x) = ψ(⟨x, b⟩ + c) where ψ : R → R; then one has

∥f∥_B ⩽ ∥b∥ ∫_R |u ψ̂(u)| du.

In particular, if ψ is C^{2+δ} for some δ > 0, then f is in the Barron space. If the activation ρ satisfies this hypothesis, the “neuron” functions are themselves in the Barron space.

• Regular functions with s derivatives: for all s > d/2, one has ∥f∥_B ⩽ C(d, s) ∥f∥_{H^s}, where the Sobolev norm is

∥f∥_{H^s}^2 ≜ ∫_{R^d} |f̂(ξ)|^2 (1 + ∥ξ∥^{2s}) dξ ∼ ∥f∥_{L^2(dx)}^2 + Σ_{k=1}^d ∥∂_{x_k}^s f∥_{L^2(dx)}^2,

and C(d, s) < ∞ is a constant. This shows that if f has at least d/2 derivatives in L^2, then it is in the Barron space. Beware that the converse is false: the Barron space can contain less regular functions, as seen in the previous examples. This somehow shows that the Barron space is larger than an RKHS with a fixed smoothness degree.

1.3.2 Barron’s Theorem


The main result is as follows.

Theorem 1 (Barron [3]). Assume ρ is supported on B(0, R). For all n, there exists f_θ with n neurons such that

∥f(0) + f_θ − f∥_{L^2(ρ)} ⩽ 2R ∥f∥_B / √n.

Furthermore, one can impose that Σ_k |u_k| ⩽ 2R ∥f∥_B.

This result shows that if f is in the Barron space, the decrease of the error does not depend on the dimension: this is often referred to as “overcoming the curse of dimensionality”. Be careful, however: the constant ∥f∥_B can itself depend on the dimension; this is the case for Gaussian functions (where it grows like √d) but not for ridge functions.

1.3.3 Mean field representation.


The proof of Barron’s theorem involves rescaling the coefficients u_k by 1/n and rewriting the neural network of Equation (1.1) as

f_θ(x) := (1/n) Σ_{k=1}^n φ(x, ω_k),

where θ = (ω_k)_{k=1}^n, ω_k = (u_k, v_k, b_k) ∈ R^{d′} × R^{d+1} and φ(x, ω) := u σ(⟨v, x⟩ + b). Introducing the empirical measure

µ̂ := (1/n) Σ_{k=1}^n δ_{ω_k},

this neural network can be expressed as an integral

f_θ(x) := ∫_Ω φ(x, ω) dµ̂(ω),

where Ω ⊂ R^{d′} × R^{d+1} is the set of admissible parameters (we will see below that it is important to be able to restrict u to a compact domain). An advantage of this integral representation is that it is linear in the measure µ. This eliminates the need to restrict to discrete measures and allows for a general probabilistic interpretation of µ.
For the sake of simplicity, we consider the one-dimensional output case, d′ = 1. The core of the proof of Barron’s theorem is to show that if the Barron norm ∥f∥_B of f is finite, then f can be represented by a measure.

Proposition 2. If ∥f∥_B < +∞, there exists a probability measure µ such that

f(x) − f(0) = Φ(µ)(x),   where   Φ(µ)(x) := ∫_Ω φ(x, ω) dµ(ω).   (1.3)

Furthermore, the measure µ can be restricted to have compact support in the outer weight, supp(µ) ⊂ Ω where

Ω := [−M, M] × R^{d+1},

with M := R ∥f∥_B, and R is the radius of the domain K on which the approximation is performed.
Proof. We only sketch the construction. Using the Fourier inversion formula (absorbing the normalization constant into f̂) and the fact that f(x) is real, one has

f(x) − f(0) = ℜ( ∫_{R^d} f̂(ξ) (e^{i⟨ξ, x⟩} − 1) dξ ) = ℜ( ∫_{R^d} |f̂(ξ)| e^{iΘ(ξ)} (e^{i⟨ξ, x⟩} − 1) dξ )

            = ∫_{R^d} ( cos(⟨ξ, x⟩ + Θ(ξ)) − cos(Θ(ξ)) ) |f̂(ξ)| dξ

            = ∫_{R^d} (∥f∥_B / ∥ξ∥) ( cos(⟨ξ, x⟩ + Θ(ξ)) − cos(Θ(ξ)) ) · ( ∥ξ∥ |f̂(ξ)| / ∥f∥_B ) dξ = ∫_{R^d} g_ξ(x) dΓ(ξ),

where g_ξ(x) ≜ (∥f∥_B / ∥ξ∥) ( cos(⟨ξ, x⟩ + Θ(ξ)) − cos(Θ(ξ)) ) and dΓ(ξ) ≜ ( ∥ξ∥ |f̂(ξ)| / ∥f∥_B ) dξ, so that Γ is a probability measure. Note that

|g_ξ(x)| ⩽ (∥f∥_B / ∥ξ∥) |⟨ξ, x⟩| ⩽ ∥f∥_B R,

so the functions g_ξ behave like bounded sigmoid-type functions. This calculation shows that the decomposition (1.3) holds, but with the functions g_ξ in place of the neuron functions φ(·, ω). One then proceeds by showing that the function cos can be written using translates and dilates of the activation ρ, which yields the sought-after integral formula.

1.3.4 Probabilistic proof


A first proof uses the so-called “probabilistic method”: one draws a random neural network and shows that the probability of reaching the desired O(1/n) error is non-zero, which proves the existence of a network with this error bound. We thus consider µ̂ = (1/n) Σ_{i=1}^n δ_{ω_i}, where the (ω_i)_i are now random vectors, independent of each other and each with law µ, where µ is the measure constructed above such that Φ(µ) = f (we assume f(0) = 0 for simplicity, so that the constant can be ignored). Beware that Φ(µ̂) is now a random function, and note that

E_µ̂(Φ(µ̂))(x) = (1/n) Σ_i E_{ω_i}(φ(x, ω_i)) = f(x),

i.e. E_µ̂(Φ(µ̂)) = f. In the following, we denote φ_ω(x) := φ(x, ω) for ease of writing. We consider the average error with respect to the data distribution ρ(x) on the x variable. This corresponds to the classical error of a Monte-Carlo estimation of an integral (except that here the value of the integral is a function, and not just a scalar as is usually the case). In the following, we use the shorthand notation ∥·∥ = ∥·∥_{L^2(ρ)}, and inner products are also taken in L^2(ρ):

E_µ̂ ∥Φ(µ̂) − f∥^2 = E_µ̂ ∥Φ(µ̂)∥^2 − 2 ⟨E_µ̂ Φ(µ̂), f⟩ + ∥f∥^2 = E_µ̂ ∥Φ(µ̂)∥^2 − ∥f∥^2.

We now compute the first expectation, using the fact that for i ≠ j, ω_i and ω_j are independent:

E_µ̂ ∥Φ(µ̂)∥^2 = (1/n^2) Σ_i E_{ω_i} ∥φ_{ω_i}∥^2 + (1/n^2) Σ_{i≠j} ⟨E_{ω_i} φ_{ω_i}, E_{ω_j} φ_{ω_j}⟩ = (1/n) E_ω ∥φ_ω∥^2 + (1 − 1/n) ∥f∥^2.

Putting all this together leads to the bound

E_µ̂ ∥Φ(µ̂) − f∥^2 = ( E_ω ∥φ_ω∥^2 − ∥f∥^2 ) / n ⩽ E_ω ∥φ_ω∥^2 / n.

One has E_ω ∥φ_ω∥^2 ⩽ ∥φ∥_{L^∞(K×Ω)}^2 ⩽ C := R^2 ∥f∥_B^2 ∥σ∥_∞^2. This means that the probability of the event ∥Φ(µ̂) − f∥^2 ⩽ C/n is non-zero, which proves the theorem.
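
The Monte-Carlo argument above is easy to check empirically. In the sketch below, all specific choices (a discrete ground-truth measure µ, the tanh activation, a Gaussian distribution for ρ) are illustrative assumptions: we draw n i.i.d. neurons ω_i ∼ µ for a measure µ such that Φ(µ) = f, and observe that the averaged squared L^2(ρ) error decays like C/n.

import numpy as np

rng = np.random.default_rng(0)
d = 5
sigma = np.tanh                              # bounded activation

# A "ground-truth" measure mu with J atoms omega_j = (u_j, v_j, b_j) and weights p_j,
# so that f = Phi(mu) is exactly computable.
J = 300
u = rng.uniform(-1, 1, J)
V = rng.standard_normal((J, d))
b = rng.standard_normal(J)
p = rng.random(J)
p /= p.sum()

def phi(X, u, V, b):
    # X: (m, d) samples of x; returns the (m, len(u)) matrix of phi(x_i, omega_j)
    return u * sigma(X @ V.T + b)

X = rng.standard_normal((2000, d))           # samples of rho, used to estimate the L2(rho) norm
f = phi(X, u, V, b) @ p                      # f(x) = sum_j p_j phi(x, omega_j) = Phi(mu)(x)

for n in [10, 100, 1000]:
    errs = []
    for _ in range(20):                      # average over random networks mu_hat
        idx = rng.choice(J, size=n, p=p)     # omega_i ~ mu, i.i.d.
        f_hat = phi(X, u[idx], V[idx], b[idx]).mean(axis=1)   # Phi(mu_hat)
        errs.append(np.mean((f_hat - f) ** 2))
    print(n, np.mean(errs))                  # decays roughly like C/n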

1.3.5 Proof by optimization


A second proof is fully deterministic and relies on running n steps of an optimization algorithm (the Frank-Wolfe method), for which an O(1/n) convergence rate is known. To prove the existence of a discrete measure achieving the desired error, we consider the following approximation problem over the space of probability measures P(Ω):

inf_{µ∈P(Ω)} E(µ) := (1/2) ∫_K (Φ(µ)(x) − f(x))^2 dx,   (1.4)

where dx is the integration measure supported on K. This optimization problem is infinite-dimensional.

The classical way to solve it is to restrict the previous optimization to discrete measures µ̂ = (1/n) Σ_i δ_{ω_i} with n neurons and to perform gradient descent on the neurons’ parameters (ω_i)_i. Since the resulting function is non-convex, this descent might get trapped in a local minimum. A recent breakthrough was obtained by Chizat and Bach. They remarked that this flow is equivalent to a Wasserstein gradient flow (a gradient flow for the optimal transport distance). This allows one to consider the mean-field limit n → +∞, and in this limit, provided the initialization has a density, they showed that the flow can never be trapped in a local minimizer. This in turn ensures that, if the number of neurons n is large enough, and if they are initialized at random according to some distribution with a density, then the usual gradient descent cannot be trapped in a local minimum (if it converges, it converges to the global minimizer, hence to a zero loss). Note however that it is not possible to know how many neurons are needed for this conclusion to hold, so it is not known whether the O(1/n) rate can be reached by a gradient descent algorithm.
To make the proof, one has to rely on another algorithm with known convergence guarantees, coming from classical convex optimization. The advantage is that it leads to a constructive proof, but the issue is that this algorithm relies on the computation of an oracle which is a priori not tractable (the exact optimization of a single neuron). So this algorithm cannot be used in practice in high dimensions.

First order variations. To derive this algorithm, we rely on a linearization, which we detail in the general context of a Banach space (it can in fact be carried out even more abstractly, without a norm structure, by only relying on directional derivatives); for our purpose it is applied to the space of probability measures equipped with the total variation norm. In the following, to ease the exposition, we denote the integration of a function against a measure using an inner-product notation

⟨f, µ⟩ := ∫ f(x) dµ(x).

Let µ + ερ be a small perturbation of µ, where ρ is another (signed) measure. The first variation ∇E(µ) is then a function defined by the directional derivative rule

E(µ + ερ) = E(µ) + ε ⟨∇E(µ), ρ⟩ + o(ε),

so that ∇E(µ) plays the role of the Fréchet derivative of E, and is also called the first variation. In our case, we have

E(µ + ερ) = (1/2) ∫_K ( Φ(µ)(x) − f(x) + ε Φ(ρ)(x) )^2 dx,

which expands to

E(µ + ερ) = E(µ) + ε ∫_K Φ(ρ)(x) ( Φ(µ)(x) − f(x) ) dx + O(ε^2).

Rewriting this in terms of φ, we find

∇E(µ)(ω) = ∫_K φ(x, ω) ( Φ(µ)(x) − f(x) ) dx,

which is a continuous function of ω.
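
This first-variation formula can be checked numerically by discretizing both K and the neuron space. In the sketch below (the finite family of neurons, the tanh activation and the target f are illustrative assumptions), the directional derivative of E along ρ = ν − µ is compared with ⟨∇E(µ), ρ⟩ computed from the formula above.

import numpy as np

rng = np.random.default_rng(2)
sigma = np.tanh

xs = np.linspace(-1.0, 1.0, 500)                         # quadrature grid for K = [-1, 1]
dx = xs[1] - xs[0]
f = np.cos(4 * xs)                                       # arbitrary target on K

# A small finite family of neurons omega_p = (u_p, v_p, b_p)
P = 50
u = rng.uniform(-1, 1, P)
v = rng.uniform(-5, 5, P)
b = rng.uniform(-2, 2, P)
Phi = u[:, None] * sigma(v[:, None] * xs + b[:, None])   # Phi[p] = phi(., omega_p) sampled on xs

def energy(mu):
    # E(mu) = (1/2) int_K (Phi(mu)(x) - f(x))^2 dx
    return 0.5 * dx * np.sum((mu @ Phi - f) ** 2)

def grad_energy(mu):
    # first variation at each omega_p: int_K phi(x, omega_p)(Phi(mu)(x) - f(x)) dx
    return dx * (Phi @ (mu @ Phi - f))

mu = rng.random(P)
mu /= mu.sum()                                           # a probability vector on the family
nu = rng.random(P)
nu /= nu.sum()                                           # another probability vector
rho = nu - mu                                            # admissible perturbation direction

eps = 1e-6
fd = (energy(mu + eps * rho) - energy(mu)) / eps
print(abs(fd - rho @ grad_energy(mu)))                   # ~O(eps): matches <grad E(mu), rho>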

Frank-Wolfe algorithm. The Frank-Wolfe algorithm seeks to minimize a function over a convex subset C of a Banach space,

min_{µ∈C} E(µ).

It operates by successive linearizations of the objective E(µ). It initializes µ_0 arbitrarily (e.g., as a Dirac measure). At each iteration k, for a step size τ_k, the measure is updated as

µ_{k+1} = (1 − τ_k) µ_k + τ_k ν_k^*,

where ν_k^* is a measure minimizing the linearized functional,

ν_k^* ∈ argmin_{ν∈P(Ω)} ⟨∇E(µ_k), ν⟩.

We call the computation of ν_k^* an “oracle” since it is a priori not always simple to obtain. In finite dimension, if the measures are supported on a grid, it can be carried out, but, as we will see, in the general setting it requires solving a non-convex optimization problem over the space Ω of (single) neurons. In our specific case, C = P(Ω) is endowed with the total variation norm ∥µ∥_TV = |µ|(Ω) (the extension to measures of the L^1 norm of functions). In this special case, a key property of the algorithm is that the solution ν_k^* can always be taken to be a Dirac measure since, denoting g_k := ∇E(µ_k),

ν_k^* = δ_{ω_k^*},   where   ω_k^* ∈ argmin_{ω∈Ω} g_k(ω).

This holds because for any ν ∈ P(Ω),

∫ g_k(ω) dν(ω) ⩾ min(g_k),

and equality is achieved when ν = δ_{ω_k^*}. Therefore, if µ_0 is initialized as a Dirac measure, each iteration of the algorithm ensures that µ_k remains a sum of at most k + 1 Dirac masses.
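
Below is a minimal numerical sketch of this Frank-Wolfe iteration in dimension d = 1, where the intractable oracle is replaced by a brute-force search over a finite grid of single neurons; the grid, the tanh activation, the step size τ_k = 2/(k + 2) and the target f are illustrative assumptions, not part of the text. Each iteration adds at most one Dirac mass, so after k steps the measure has at most k + 1 atoms, mirroring the discussion above.

import numpy as np

sigma = np.tanh

# Discretize K = [-1, 1] for the L2(dx) loss, and pick a target f on K
xs = np.linspace(-1.0, 1.0, 400)
dx = xs[1] - xs[0]
f = np.sin(3 * xs) + 0.5 * np.cos(7 * xs)

# Finite grid Omega of single neurons omega = (u, v, b): this replaces the
# intractable oracle by a brute-force search over the grid.
us = np.array([-2.0, -1.0, 1.0, 2.0])
vs = np.linspace(-10, 10, 41)
bs = np.linspace(-10, 10, 41)
U, Vg, B = np.meshgrid(us, vs, bs, indexing="ij")
omegas = np.stack([U.ravel(), Vg.ravel(), B.ravel()], axis=1)     # (P, 3) grid of neurons
Phi = omegas[:, 0:1] * sigma(omegas[:, 1:2] * xs + omegas[:, 2:3])
# Phi[p] is the function x -> phi(x, omega_p) sampled on xs

mu = np.zeros(len(omegas))
mu[0] = 1.0                                 # mu_0: a single Dirac mass

for k in range(200):
    residual = mu @ Phi - f                 # Phi(mu_k) - f on the x grid
    grad = dx * (Phi @ residual)            # g_k(omega) = int_K phi(x, omega) residual(x) dx
    p_star = np.argmin(grad)                # oracle: the best single neuron on the grid
    tau = 2.0 / (k + 2.0)
    mu = (1 - tau) * mu                     # mu_{k+1} = (1 - tau) mu_k + tau delta_{omega*}
    mu[p_star] += tau
    if k % 50 == 0:
        print(k, 0.5 * dx * np.sum(residual ** 2))   # loss decreases toward the best fit on the grid

On this discretized problem the objective is a convex quadratic in µ, so Theorem 2 below applies directly and predicts an O(1/k) decay of the suboptimality.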

Convergence Rate. The following theorem establishes the convergence rate of the Frank-Wolfe algorithm. In our case, ∥·∥ = ∥·∥_TV is the total variation norm on measures, and the dual norm ∥·∥_* = ∥·∥_∞ is the L^∞ norm on functions. We first recall that a function with a Lipschitz gradient has a quadratic upper bound.

Lemma 1. Let E : C → R be a differentiable function with L-Lipschitz gradient with respect to the norm ∥·∥, that is, for all µ, ν ∈ C,

∥∇E(µ) − ∇E(ν)∥_* ⩽ L ∥ν − µ∥,

where ∥·∥_* denotes the dual norm of ∥·∥. Then, for all µ, ν ∈ C,

E(ν) ⩽ E(µ) + ⟨∇E(µ), ν − µ⟩ + (L/2) ∥ν − µ∥^2.
Proof. By the fundamental theorem of calculus, we can express E(ν) as

E(ν) = E(µ) + ∫_0^1 ⟨∇E(µ + t(ν − µ)), ν − µ⟩ dt.

Adding and subtracting ∇E(µ) inside the integrand,

E(ν) = E(µ) + ⟨∇E(µ), ν − µ⟩ + ∫_0^1 ⟨∇E(µ + t(ν − µ)) − ∇E(µ), ν − µ⟩ dt.

Using the L-Lipschitz property of the gradient, we bound the difference

∥∇E(µ + t(ν − µ)) − ∇E(µ)∥_* ⩽ L t ∥ν − µ∥.

Substituting this bound into the integral,

∫_0^1 ⟨∇E(µ + t(ν − µ)) − ∇E(µ), ν − µ⟩ dt ⩽ ∫_0^1 L t ∥ν − µ∥^2 dt = (L/2) ∥ν − µ∥^2.

Thus

E(ν) ⩽ E(µ) + ⟨∇E(µ), ν − µ⟩ + (L/2) ∥ν − µ∥^2.

Theorem 2. Let E be convex and assume that ∇E is L-Lipschitz, i.e.

∥∇E(µ) − ∇E(µ′)∥_* ⩽ L ∥µ − µ′∥.

For the step size τ_k = 2/(k + 2), the Frank-Wolfe iterates for the minimization of E over a set C of radius

r := sup_{(µ,µ′)∈C^2} ∥µ − µ′∥

satisfy, denoting E^* := inf_{µ∈C} E(µ),

E(µ_k) − E^* ⩽ 2 L r^2 / (k + 1).

Proof. Using the L-Lipschitz gradient property with respect to the Banach norm ∥·∥, Lemma 1 gives the quadratic upper bound

E(ν) ⩽ E(µ) + ⟨∇E(µ), ν − µ⟩ + (L/2) ∥ν − µ∥^2,   ∀µ, ν ∈ C.

One-Step Improvement. The Frank-Wolfe update is

µ_{k+1} = µ_k + τ_k (ν_k − µ_k),

where τ_k = 2/(k + 2) and ν_k ∈ argmin_{ν∈C} ⟨∇E(µ_k), ν⟩. By smoothness of E, we have

E(µ_{k+1}) ⩽ E(µ_k) + τ_k ⟨∇E(µ_k), ν_k − µ_k⟩ + (L/2) τ_k^2 ∥ν_k − µ_k∥_TV^2.

Furthermore, the boundedness of C ensures ∥ν_k − µ_k∥ ⩽ r. Substituting, we get

E(µ_{k+1}) ⩽ E(µ_k) + τ_k g_k + (L/2) τ_k^2 r^2,

where g_k := ⟨∇E(µ_k), ν_k − µ_k⟩. Defining h_k := E(µ_k) − E^* as the suboptimality at iteration k, we have

h_{k+1} ⩽ h_k + τ_k g_k + (L/2) τ_k^2 r^2.   (1.5)

We now bound g_k. Using the optimality of ν_k,

g_k := ⟨∇E(µ_k), ν_k − µ_k⟩ = min_{ν∈C} ⟨∇E(µ_k), ν − µ_k⟩,

and by convexity, for any ν ∈ C,

E(ν) ⩾ E(µ_k) + ⟨∇E(µ_k), ν − µ_k⟩,

so that ⟨∇E(µ_k), ν − µ_k⟩ ⩽ E(ν) − E(µ_k), hence

g_k ⩽ min_{ν∈C} E(ν) − E(µ_k) = E^* − E(µ_k) = −h_k.

Plugging this into (1.5), we obtain the fundamental descent property

h_{k+1} ⩽ h_k − τ_k h_k + (L/2) τ_k^2 r^2.

Substituting τ_k = 2/(k + 2),

h_{k+1} ⩽ h_k ( 1 − 2/(k + 2) ) + 2 L r^2 / (k + 2)^2.

Recursion Argument. We prove by induction that h_k ⩽ 2 L r^2 / (k + 1) for k ⩾ 1. For k = 1, the descent property with τ_0 = 1 gives h_1 ⩽ (L/2) r^2 ⩽ L r^2, so the claim holds. Assume now the inductive hypothesis

h_k ⩽ 2 L r^2 / (k + 1);

we prove that

h_{k+1} ⩽ 2 L r^2 / (k + 2).

Using the inductive hypothesis in the recursion for h_{k+1},

h_{k+1} ⩽ ( 2 L r^2 / (k + 1) ) · ( k / (k + 2) ) + 2 L r^2 / (k + 2)^2 = ( 2 L r^2 / (k + 2) ) · ( k/(k + 1) + 1/(k + 2) ).

Since k/(k + 1) + 1/(k + 2) ⩽ 1, we conclude that

h_{k+1} ⩽ 2 L r^2 / (k + 2),

which completes the induction.

In the case of MLP training, where E is defined in (1.4), the proposition below shows that L ⩽ M^2 ∥σ∥_∞^2, where M = R ∥f∥_B, and we have r = 2 (the radius of the space of probability measures for the TV norm). Recall that ∥f∥_B is the Barron norm of the target function f. Furthermore, we know that E(µ^*) = 0, since the existence of a valid representing measure was established in Equation (1.3). By applying the Frank-Wolfe algorithm, we deduce the existence of a discrete measure µ_k consisting of at most k + 1 Dirac masses which achieves an approximation error

E(µ_k) = O(1/k).

Thus, the Frank-Wolfe algorithm constructs a sparse representation of the target function with a provably decreasing error bound as the number of iterations k increases.
Proposition 3. The first variation ∇E(µ) of the functional E(µ) defined in (1.4), which is

∇E(µ)(ω) = ∫_K φ(x, ω) ( Φ(µ)(x) − f(x) ) dx,

where Φ(µ)(x) = ∫_Ω φ(x, ω) dµ(ω) and φ(x, ω) = u σ(⟨v, x⟩ + b), is L-Lipschitz with respect to the total variation norm. Specifically, for any µ, µ′ ∈ P(Ω),

∥∇E(µ) − ∇E(µ′)∥_∞ ⩽ L ∥µ − µ′∥_TV,

where L = ∥σ∥_∞^2 M^2.
Proof. The difference between ∇E(µ) and ∇E(µ′) is

∇E(µ)(ω) − ∇E(µ′)(ω) = ∫_K φ(x, ω) ( Φ(µ)(x) − Φ(µ′)(x) ) dx,   where   Φ(µ)(x) − Φ(µ′)(x) = ∫_Ω φ(x, ω′) d(µ − µ′)(ω′).

Substituting the second expression into the first, using Fubini’s theorem and introducing the kernel k(ω, ω′) := ∫_K φ(x, ω) φ(x, ω′) dx,

∇E(µ)(ω) − ∇E(µ′)(ω) = ∫_Ω k(ω, ω′) d(µ − µ′)(ω′).

Taking the L^∞ norm with respect to ω,

∥∇E(µ) − ∇E(µ′)∥_∞ = sup_{ω∈Ω} | ∫_Ω k(ω, ω′) d(µ − µ′)(ω′) |,

and the triangle inequality (the duality between the L^∞ and TV norms) gives

∥∇E(µ) − ∇E(µ′)∥_∞ ⩽ ∥k∥_{L^∞(Ω×Ω)} ∥µ − µ′∥_TV.

One has

∥k∥_{L^∞(Ω×Ω)} = sup_{(ω,ω′)∈Ω^2} | ∫_K φ(x, ω) φ(x, ω′) dx | ⩽ ∥φ∥_{L^∞(K×Ω)}^2 ⩽ M^2 ∥σ∥_∞^2.

Bibliography

[1] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of
Machine Learning Research, 18(1):629–681, 2017.

[2] Francis Bach. Learning theory from first principles. 2021.

[3] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE
Transactions on Information theory, 39(3):930–945, 1993.

[4] Amir Beck. Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications with MAT-
LAB. SIAM, 2014.

[5] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization
and statistical learning via the alternating direction method of multipliers. Foundations and Trends®
in Machine Learning, 3(1):1–122, 2011.

[6] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

[7] E. Candès and D. Donoho. New tight frames of curvelets and optimal representations of objects with
piecewise C2 singularities. Commun. on Pure and Appl. Math., 57(2):219–266, 2004.

[8] E. J. Candès, L. Demanet, D. L. Donoho, and L. Ying. Fast discrete curvelet transforms. SIAM
Multiscale Modeling and Simulation, 5:861–899, 2005.

[9] A. Chambolle. An algorithm for total variation minimization and applications. J. Math. Imaging Vis.,
20:89–97, 2004.

[10] Antonin Chambolle, Vicent Caselles, Daniel Cremers, Matteo Novaga, and Thomas Pock. An intro-
duction to total variation for image analysis. Theoretical foundations and numerical methods for sparse
recovery, 9(263-340):227, 2010.

[11] Antonin Chambolle and Thomas Pock. An introduction to continuous optimization for imaging. Acta
Numerica, 25:161–319, 2016.

[12] S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal
on Scientific Computing, 20(1):33–61, 1999.

[13] Philippe G Ciarlet. Introduction à l’analyse numérique matricielle et à l’optimisation. 1982.

[14] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. SIAM
Multiscale Modeling and Simulation, 4(4), 2005.

[15] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control,


signals and systems, 2(4):303–314, 1989.

[16] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems
with a sparsity constraint. Commun. on Pure and Appl. Math., 57:1413–1541, 2004.

[17] D. Donoho and I. Johnstone. Ideal spatial adaptation via wavelet shrinkage. Biometrika, 81:425–455,
Dec 1994.
[18] Heinz Werner Engl, Martin Hanke, and Andreas Neubauer. Regularization of inverse problems, volume
375. Springer Science & Business Media, 1996.

[19] M. Figueiredo and R. Nowak. An EM Algorithm for Wavelet-Based Image Restoration. IEEE Trans.
Image Proc., 12(8):906–916, 2003.
[20] Simon Foucart and Holger Rauhut. A mathematical introduction to compressive sensing, volume 1.
Birkhäuser Basel, 2013.
[21] Stephane Mallat. A wavelet tour of signal processing: the sparse way. Academic press, 2008.

[22] D. Mumford and J. Shah. Optimal approximation by piecewise smooth functions and associated varia-
tional problems. Commun. on Pure and Appl. Math., 42:577–685, 1989.
[23] Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and Trends® in Optimization,
1(3):127–239, 2014.

[24] Gabriel Peyré. L’algèbre discrète de la transformée de Fourier. Ellipses, 2004.


[25] J. Portilla, V. Strela, M.J. Wainwright, and Simoncelli E.P. Image denoising using scale mixtures of
Gaussians in the wavelet domain. IEEE Trans. Image Proc., 12(11):1338–1351, November 2003.
[26] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Phys.
D, 60(1-4):259–268, 1992.
[27] Otmar Scherzer, Markus Grasmair, Harald Grossauer, Markus Haltmeier, Frank Lenzen, and L Sirovich.
Variational methods in imaging. Springer, 2009.
[28] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal,
27(3):379–423, 1948.

[29] Jean-Luc Starck, Fionn Murtagh, and Jalal Fadili. Sparse image and signal processing: Wavelets and
related geometric multiscale analysis. Cambridge university press, 2015.

