
Mathematical Foundations of Data Sciences

Gabriel Peyré
CNRS & DMA
École Normale Supérieure
gabriel.peyre@ens.fr
https://mathematical-tours.github.io
www.numerical-tours.com

December 11, 2024


Chapter 1

Shallow Learning

In this chapter, we study the simplest example of non-linear parametric models, namely the Multi-Layer Perceptron (MLP) with a single hidden layer (so it has two layers in total). The perceptron (with no hidden layer) corresponds to the linear models studied in the previous chapter. MLPs with more layers are obtained by stacking several such simple MLPs, and are studied in Section ??, since the computation of their derivatives is well suited to automatic-differentiation methods.

1.1 Multi-layer Perceptron


1.1.1 Multi-layer

Let us first consider the general case of an arbitrary number of layers, which defines a mapping f_θ : R^d → R^{d_S}. Starting from x_0 := x ∈ R^{d_0} = R^d, one iterates along the depth index s = 0, . . . , S − 1

x_{s+1} = σ(W_s x_s + b_s)

where W_s ∈ R^{d_{s+1} × d_s} and the bias is b_s ∈ R^{d_{s+1}}. Here σ is a non-linear (and in fact non-polynomial) function applied componentwise, i.e. we denote σ(z) = (σ(z_i))_i.
The most popular non-linearities are sigmoid functions such as

ρ(r) = e^r / (1 + e^r)   and   ρ(r) = (1/π) atan(r) + 1/2,

and the rectified linear unit (ReLU) function ρ(r) = max(r, 0). There is an important difference, both in practice and in theory, between these two classes of activations (bounded vs. unbounded). The ReLU works better in practice because there is less saturation effect, so that gradients do not vanish when the values computed by the network are large. Also, the ReLU is positively 1-homogeneous, which allows one to rescale the weights and, in some proofs, to assume that these weights lie on a unit sphere. A difficulty, however, is that the ReLU is not differentiable at 0, which makes some rigorous proofs delicate (but in practice this non-smoothness seems harmless).
In order to define functions of arbitrary complexity as the width (the number of neurons per layer) increases, it is important that σ is non-polynomial. Otherwise, f_θ would be a polynomial whose degree grows with the depth S, so these functions would for instance not be dense in the space of continuous functions. Note however that the linear case σ = Id is of independent interest for computing matrix factorizations, but this does not correspond to a supervised learning problem (it rather corresponds to dimensionality reduction, using PCA or non-negative matrix factorization).
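
To make the recursion x_{s+1} = σ(W_s x_s + b_s) concrete, here is a minimal NumPy sketch of a forward pass through an MLP of arbitrary depth; the layer widths, the ReLU activation and the random Gaussian initialization are illustrative assumptions, not prescriptions of the text.

import numpy as np

def relu(z):
    # ReLU applied componentwise, sigma(z) = (max(z_i, 0))_i
    return np.maximum(z, 0.0)

def mlp_forward(x, weights, biases, sigma=relu):
    # Iterates x_{s+1} = sigma(W_s x_s + b_s), as in the recursion above.
    # (In practice the last layer is often kept linear; here we follow the text.)
    for W, b in zip(weights, biases):
        x = sigma(W @ x + b)
    return x

# Illustrative example with widths d_0 = 3, d_1 = 8, d_2 = 8, d_3 = 1
rng = np.random.default_rng(0)
dims = [3, 8, 8, 1]
weights = [rng.standard_normal((dims[s + 1], dims[s])) for s in range(len(dims) - 1)]
biases = [rng.standard_normal(dims[s + 1]) for s in range(len(dims) - 1)]
x0 = rng.standard_normal(dims[0])
print(mlp_forward(x0, weights, biases))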

1.1.2 Two-layer MLPs
We consider two-layer neural networks of the form

f_θ(x) := Σ_{k=1}^n u_k σ(⟨v_k, x⟩ + b_k),   ∀x ∈ R^d,   (1.1)

where σ is the activation function. The parameters of the network are denoted θ_k = (u_k ∈ R^{d′}, v_k ∈ R^d, b_k ∈ R) for k = 1, . . . , n. In most of the following, for the sake of simplicity, we consider d′ = 1, i.e. real-valued outputs.
In practice, neural networks are trained by gradient descent, i.e. we consider a loss (here quadratic for the sake of simplicity)

min_θ E(θ) := ∫ ∥f_θ(x) − y∥^2 dρ(x, y)   (1.2)

and use the gradient iteration (it is of course possible to use SGD instead)

θ_{t+1} = θ_t − τ_t ∇E(θ_t).

Gradient computation. Ignoring the biases b_k for simplicity, we can write the network in matrix form as f_θ(x) = U σ(V^⊤ x), where U ∈ R^{d′ × n} and V ∈ R^{d × n}. For the sake of simplicity, we assume there is a finite number N of data points X = (x_i)_{i=1}^N ∈ R^{d × N} and Y = (y_i)_{i=1}^N ∈ R^{d′ × N}. Training with an ℓ^2 loss thus reads

min_{U,V} E(U, V) := (1/2) ∥U σ(V^⊤ X) − Y∥^2.

If we denote Z := σ(V^⊤ X) (which can be thought of as applying the feature map x ↦ σ(V^⊤ x) to the data), then training U is a classical least squares problem (1/2) ∥U Z − Y∥^2, and the gradient reads

∇_U E(U, V) = (U Z − Y) Z^⊤.

To compute the gradient with respect to V, we perform a Taylor expansion. Denoting R := U σ(V^⊤ X) − Y, S := σ′(V^⊤ X) and H := U [S ⊙ (D^⊤ X)],

E(U, V + εD) = (1/2) ∥R + ε U [S ⊙ (D^⊤ X)]∥^2 = E(U, V) + ε ⟨R, H⟩ + O(ε^2),

where we denoted A ⊙ B := (A_{i,j} B_{i,j})_{i,j}, and where

⟨R, H⟩ = ⟨R, U [S ⊙ (D^⊤ X)]⟩ = ⟨D^⊤ X, (U^⊤ R) ⊙ S⟩ = ⟨X^⊤ D, (R^⊤ U) ⊙ S^⊤⟩,

which leads to

∇_V E(U, V) = X [(R^⊤ U) ⊙ S^⊤].
This computation is quite painful, and the advice is not to use this kind of derivation for deeper networks: not only does it become overly complicated, it is also vastly sub-optimal. The correct way to compute this gradient is the back-propagation method, which corresponds to reverse-mode automatic differentiation.
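
As a sanity check, the closed-form gradients above can be compared against a finite-difference approximation on random data. The NumPy sketch below is only an illustration (the dimensions, the tanh activation and the random data are arbitrary choices); the resulting ∇_U E and ∇_V E are exactly what one would plug into the iteration θ_{t+1} = θ_t − τ_t ∇E(θ_t).

import numpy as np

rng = np.random.default_rng(1)
d, dp, n, N = 4, 2, 6, 50            # input dim d, output dim d', n neurons, N samples
X = rng.standard_normal((d, N))
Y = rng.standard_normal((dp, N))
U = rng.standard_normal((dp, n))
V = rng.standard_normal((d, n))

sigma = np.tanh                      # smooth activation, so sigma' is well defined
dsigma = lambda z: 1.0 - np.tanh(z) ** 2

def energy(U, V):
    # E(U, V) = (1/2) || U sigma(V^T X) - Y ||^2
    return 0.5 * np.sum((U @ sigma(V.T @ X) - Y) ** 2)

# Closed-form gradients derived in the text
Z = sigma(V.T @ X)                   # n x N feature matrix
R = U @ Z - Y                        # d' x N residual
S = dsigma(V.T @ X)                  # n x N
grad_U = R @ Z.T                     # (U Z - Y) Z^T
grad_V = X @ ((R.T @ U) * S.T)       # X [(R^T U) ⊙ S^T]

# Finite-difference check of grad_V along a random direction D
eps = 1e-6
D = rng.standard_normal(V.shape)
fd = (energy(U, V + eps * D) - energy(U, V - eps * D)) / (2 * eps)
print(abs(fd - np.sum(grad_V * D)))  # should be very small (roughly 1e-7 or below)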

1.2 L∞ non-quantitative universality


If σ is a sigmoid function, a theorem of George Cybenko, later refined by Kurt Hornik, Maxwell Stinchcombe, and Halbert White, shows that the functions f_θ can approximate any continuous function uniformly on a compact domain. We thus insist here on an L^∞ approximation, which is strictly stronger (and more difficult) than the L^2 error control considered during training in (1.2).

Proposition 1. If σ is an increasing (not necessarily continuous) function satisfying

lim_{s→−∞} σ(s) = 0   and   lim_{s→+∞} σ(s) = 1,

and K ⊂ R^d is compact, then for any continuous function f on K and any ε > 0, there exist n and parameters (θ_k)_k such that

sup_{x∈K} |f(x) − f_θ(x)| ⩽ ε.

This theorem establishes the universal approximation property of two-layer neural networks. However, it does not provide bounds on the number of neurons n required as a function of ε. Furthermore, the proof does not constructively specify how to determine the parameters of the approximating network f_θ. The first proof was given by Cybenko [15] using a duality argument. We detail next the proof due to Hornik et al., which is somewhat more constructive and relies on the Stone-Weierstrass theorem to perform a Fourier-type approximation. In contrast to a direct Fourier series expansion, this yields a uniform approximation of continuous functions, which Fourier series do not provide.
Proof. The proof first considers the activation σ = cos (note that the initial density argument would also work with σ = exp, which interestingly is an unbounded activation). Consider the function space

A := { Σ_{k=1}^n u_k cos(⟨v_k, x⟩ + b_k) : n ∈ N, (u_k, b_k, v_k)_k }.

This space is an algebra of continuous functions on the compact set K. It contains the constant functions and separates points; that is, for x ≠ x′, there exists w such that cos(⟨w, x⟩) ≠ cos(⟨w, x′⟩). By the Stone-Weierstrass theorem, A is dense in the space of continuous functions on K.
Let r := max_k (∥v_k∥ · Radius(K) + |b_k|). By the previous density result, to approximate functions on K it thus suffices to approximate cos(s) on the interval [−r, r]. Splitting this interval into subintervals where cos(s) is monotone, the problem reduces to approximating the rectified cosine squashing function

cos_+(s) := 0 for s ⩽ 0,   cos_+(s) := 1 for s ⩾ π/2,   cos_+(s) := 1 − cos(s) for s ∈ [0, π/2].

The goal is thus to construct σ-based functions satisfying

| Σ_k u_k σ(v_k s + b_k) − cos_+(s) | ⩽ ε,

where u_k, b_k, v_k ∈ R.
Divide [0, π/2] into Q subintervals [s_k, s_{k+1}], where s_k := cos_+^{−1}(k/Q). Choose M > 0 large enough so that

σ(−M) < ε/(2Q)   and   σ(M) > 1 − ε/(2Q).

Define v_k and b_k such that the affine map s ↦ v_k s + b_k sends [s_k, s_{k+1}] to [−M, M], and set the weights u_k = 1/Q. For s in a given subinterval [s_j, s_{j+1}], the terms with k < j are then within ε/(2Q^2) of 1/Q, the terms with k > j are within ε/(2Q^2) of 0, and the j-th term lies in [0, 1/Q], so that

| Σ_k u_k σ(v_k s + b_k) − cos_+(s) | ⩽ 1/Q + ε/2 ⩽ ε,

provided Q ⩾ 2/ε. Combining the results, the network f_θ can approximate any continuous function f on K to within ε, completing the proof.
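
The staircase construction of this proof is easy to reproduce numerically. The sketch below is only a rough illustration (it uses the logistic sigmoid and arbitrary values of Q and M): it builds the weights u_k = 1/Q, v_k, b_k as described and measures the uniform error against cos_+.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def cos_plus(s):
    return np.where(s <= 0, 0.0, np.where(s >= np.pi / 2, 1.0, 1.0 - np.cos(s)))

Q = 50
M = 20.0                                    # sigmoid(-M) and 1 - sigmoid(M) are then tiny
sk = np.arccos(1.0 - np.arange(Q + 1) / Q)  # breakpoints s_k = cos_+^{-1}(k/Q) in [0, pi/2]
v = 2 * M / (sk[1:] - sk[:-1])              # affine maps sending [s_k, s_{k+1}] to [-M, M]
b = -M - v * sk[:-1]
u = np.full(Q, 1.0 / Q)                     # weights u_k = 1/Q

def approx(s):
    s = np.atleast_1d(s)[:, None]
    return (u * sigmoid(v * s + b)).sum(axis=1)

grid = np.linspace(-2.0, 4.0, 2001)
print(np.max(np.abs(approx(grid) - cos_plus(grid))))   # roughly of order 1/Q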

1.3 L2 Quantitative Approximation (Barron’s theorem)
In contrast to the uniform error control of the previous section, we consider here the L^2 approximation appearing in the initial loss (1.2). We focus only on the approximation error, so we assume that the data satisfy exactly y = f(x) for some function f to be approximated, and that x is distributed according to some measure ρ(x). Another limitation of the theory detailed next is that we assume ρ is compactly supported on a ball of radius R. Without additional regularity hypotheses on f, it is not possible to show any rate (i.e. approximation by a network can be arbitrarily slow). The functional space on which one obtains fast rates (independent of the dimension) is called the Barron space, and was introduced by Andrew Barron.

1.3.1 Barron’s space


For an integrable function f, its Fourier transform is defined, for any ξ ∈ R^d, by

f̂(ξ) ≜ ∫_{R^d} f(x) e^{i⟨ξ, x⟩} dx.

The Barron space [3] is the set of functions such that the semi-norm

∥f∥_B ≜ ∫_{R^d} ∥ξ∥ |f̂(ξ)| dξ

is finite. If we impose that f(0) is fixed, one can show that this defines a norm and that the Barron space is a Banach space. Since the Fourier transform of a partial derivative satisfies |(∂_{x_k} f)^∧(ξ)| = |ξ_k| |f̂(ξ)|, one has

∥f∥_B = ∫_{R^d} ∥(∇f)^∧(ξ)∥ dξ,

i.e. ∥f∥_B is the L^1 norm of the Fourier transform of the gradient; this shows that the functions of the Barron space are quite regular. Here are some examples of function classes with the corresponding Barron norm.
• Gaussians: for f(x) = e^{−∥x∥^2/2}, one has ∥f∥_B ⩽ 2√d.

• Ridge functions: let f(x) = ψ(⟨x, b⟩ + c) where ψ : R → R; then one has

∥f∥_B ⩽ ∥b∥ ∫_R |u ψ̂(u)| du.

In particular, if ψ is C^{2+δ} for some δ > 0, then f is in the Barron space. If the activation ρ satisfies this hypothesis, the “neuron” functions are themselves in the Barron space.

• Regular functions with s derivatives: for all s > d/2, one has ∥f∥_B ⩽ C(d, s) ∥f∥_{H^s}, where the Sobolev norm is

∥f∥_{H^s}^2 ≜ ∫_{R^d} |f̂(ξ)|^2 (1 + ∥ξ∥^{2s}) dξ ∼ ∥f∥_{L^2(dx)}^2 + Σ_{k=1}^d ∥∂_{x_k}^s f∥_{L^2(dx)}^2,

and C(d, s) < ∞ is a constant. This shows that if f has at least d/2 derivatives in L^2, then it is in the Barron space. Beware that the converse is false: the Barron space can contain less regular functions, as seen in the previous examples. This somehow shows that the Barron space is larger than an RKHS with a fixed smoothness degree.

1.3.2 Barron’s Theorem


The main result is as follows.

Theorem 1 (Barron [3]). Assume ρ is supported on B(0, R). For all n, there exists f_θ with n neurons such that

∥f(0) + f_θ − f∥_{L^2(ρ)} ⩽ 2R ∥f∥_B / √n.

Furthermore, one can impose that Σ_k |u_k| ⩽ 2R ∥f∥_B.

This result shows that if f is in the Barron space, the decrease of the error does not depend on the dimension: this is often referred to as “overcoming the curse of dimensionality”. Be careful, however: the constant ∥f∥_B can itself depend on the dimension; this is the case for Gaussian functions (where it grows like √d) but not for ridge functions.

1.3.3 Mean field representation.


The proof of Barron’s theorem involves rescaling the coefficients u_k by 1/n and rewriting the neural network of Equation (1.1) as

f_θ(x) := (1/n) Σ_{k=1}^n φ(x, ω_k),

where θ = (ω_k)_{k=1}^n, ω_k = (u_k, v_k, b_k) ∈ R^{d′} × R^{d+1} and φ(x, ω) := u σ(⟨v, x⟩ + b). Introducing the empirical measure

µ̂ := (1/n) Σ_{k=1}^n δ_{ω_k},

this neural network can be expressed as an integral

f_θ(x) := ∫_Ω φ(x, ω) dµ̂(ω),

where Ω ⊂ R^{d′} × R^{d+1} is the set of admissible parameters (we will see below that it is important to be able to restrict u to a compact domain). An advantage of this integral representation is that it is linear in the measure µ. This eliminates the need to restrict to discrete measures and allows for a general probabilistic interpretation of µ.
For the sake of simplicity, we consider the one-dimensional output case, d′ = 1. The core of the proof of Barron’s theorem is to show that if the Barron norm ∥f∥_B of f is finite, then f can be represented by a measure.

Proposition 2. If ∥f∥_B < +∞, there exists a probability measure µ such that

f(x) − f(0) = Φ(µ)(x),   where   Φ(µ)(x) := ∫_Ω φ(x, ω) dµ(ω).   (1.3)

Furthermore, the measure µ can be restricted to have compact support in the outer weight, supp(µ) ⊂ Ω where

Ω := [−M, M] × R^{d+1},

with M := R ∥f∥_B, and R is the radius of the domain K on which the approximation is performed.
Proof. We only sketch the construction. Using the Fourier inversion formula (absorbing the normalization constant into f̂) and the fact that f(x) is real, one has

f(x) − f(0) = ℜ( ∫_{R^d} f̂(ξ) (e^{i⟨ξ, x⟩} − 1) dξ ) = ℜ( ∫_{R^d} |f̂(ξ)| e^{iΘ(ξ)} (e^{i⟨ξ, x⟩} − 1) dξ )

            = ∫_{R^d} ( cos(⟨ξ, x⟩ + Θ(ξ)) − cos(Θ(ξ)) ) |f̂(ξ)| dξ

            = ∫_{R^d} (∥f∥_B / ∥ξ∥) ( cos(⟨ξ, x⟩ + Θ(ξ)) − cos(Θ(ξ)) ) · ( ∥ξ∥ |f̂(ξ)| / ∥f∥_B ) dξ = ∫_{R^d} g_ξ(x) dΓ(ξ),

where g_ξ(x) ≜ (∥f∥_B / ∥ξ∥) ( cos(⟨ξ, x⟩ + Θ(ξ)) − cos(Θ(ξ)) ) and dΓ(ξ) ≜ ( ∥ξ∥ |f̂(ξ)| / ∥f∥_B ) dξ, so that Γ is a probability measure. Note that

|g_ξ(x)| ⩽ (∥f∥_B / ∥ξ∥) |⟨ξ, x⟩| ⩽ ∥f∥_B R,

so the functions g_ξ behave like bounded sigmoid-type functions. This calculation shows that the decomposition (1.3) holds, but with the functions g_ξ in place of the neuron functions φ(·, ω). One then proceeds by showing that the function cos can be written using translates and dilates of the activation ρ, which yields the sought-after integral formula.

1.3.4 Probabilistic proof


A first proof uses the so-called “probabilistic method”: one draws a random neural network and shows that the probability of reaching the desired O(1/n) error is non-zero, which proves the existence of a network with this error bound. We thus consider µ̂ = (1/n) Σ_{i=1}^n δ_{ω_i}, where the (ω_i)_i are now random vectors, independent of each other and each with law µ, where µ is the measure constructed above such that Φ(µ) = f (we assume f(0) = 0 for simplicity, so that the constant can be ignored). Beware that Φ(µ̂) is now a random function, and note that

E_µ̂(Φ(µ̂))(x) = (1/n) Σ_i E_{ω_i}(φ(x, ω_i)) = f(x),

i.e. E_µ̂(Φ(µ̂)) = f. In the following, we denote φ_ω(x) := φ(x, ω) for ease of writing. We consider the average error with respect to the data distribution ρ(x) on the x variable. This corresponds to the classical error of a Monte-Carlo estimation of an integral (except that here the value of the integral is a function, and not just a scalar as is usually the case). In the following, we use the shorthand notation ∥·∥ = ∥·∥_{L^2(ρ)}, and inner products are also taken in L^2(ρ):

E_µ̂ ∥Φ(µ̂) − f∥^2 = E_µ̂ ∥Φ(µ̂)∥^2 − 2 ⟨E_µ̂ Φ(µ̂), f⟩ + ∥f∥^2 = E_µ̂ ∥Φ(µ̂)∥^2 − ∥f∥^2.

We now compute the first expectation, using the fact that for i ≠ j, ω_i and ω_j are independent:

E_µ̂ ∥Φ(µ̂)∥^2 = (1/n^2) Σ_i E_{ω_i} ∥φ_{ω_i}∥^2 + (1/n^2) Σ_{i≠j} ⟨E_{ω_i} φ_{ω_i}, E_{ω_j} φ_{ω_j}⟩ = (1/n) E_ω ∥φ_ω∥^2 + (1 − 1/n) ∥f∥^2.

Putting all this together leads to the bound

E_µ̂ ∥Φ(µ̂) − f∥^2 = ( E_ω ∥φ_ω∥^2 − ∥f∥^2 ) / n ⩽ E_ω ∥φ_ω∥^2 / n.

One has E_ω ∥φ_ω∥^2 ⩽ ∥φ∥_{L^∞(K×Ω)}^2 ⩽ C := R^2 ∥f∥_B^2 ∥σ∥_∞^2. This means that the probability of the event ∥Φ(µ̂) − f∥^2 ⩽ C/n is non-zero, which proves the theorem.
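
The Monte-Carlo argument above is easy to check empirically. In the sketch below, all specific choices (a discrete ground-truth measure µ, the tanh activation, a Gaussian distribution for ρ) are illustrative assumptions: we draw n i.i.d. neurons ω_i ∼ µ for a measure µ such that Φ(µ) = f, and observe that the averaged squared L^2(ρ) error decays like C/n.

import numpy as np

rng = np.random.default_rng(0)
d = 5
sigma = np.tanh                              # bounded activation

# A "ground-truth" measure mu with J atoms omega_j = (u_j, v_j, b_j) and weights p_j,
# so that f = Phi(mu) is exactly computable.
J = 300
u = rng.uniform(-1, 1, J)
V = rng.standard_normal((J, d))
b = rng.standard_normal(J)
p = rng.random(J)
p /= p.sum()

def phi(X, u, V, b):
    # X: (m, d) samples of x; returns the (m, len(u)) matrix of phi(x_i, omega_j)
    return u * sigma(X @ V.T + b)

X = rng.standard_normal((2000, d))           # samples of rho, used to estimate the L2(rho) norm
f = phi(X, u, V, b) @ p                      # f(x) = sum_j p_j phi(x, omega_j) = Phi(mu)(x)

for n in [10, 100, 1000]:
    errs = []
    for _ in range(20):                      # average over random networks mu_hat
        idx = rng.choice(J, size=n, p=p)     # omega_i ~ mu, i.i.d.
        f_hat = phi(X, u[idx], V[idx], b[idx]).mean(axis=1)   # Phi(mu_hat)
        errs.append(np.mean((f_hat - f) ** 2))
    print(n, np.mean(errs))                  # decays roughly like C/n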

1.3.5 Proof by optimization


A second proof is fully deterministic and relies on running n steps of an optimization algorithm (the Frank-Wolfe method), for which an O(1/n) convergence rate is known. To prove the existence of a discrete measure achieving the desired error, we consider the following approximation problem over the space of probability measures P(Ω):

inf_{µ∈P(Ω)} E(µ) := (1/2) ∫_K (Φ(µ)(x) − f(x))^2 dx,   (1.4)

where dx is the integration measure supported on K. This optimization problem is infinite-dimensional.

The classical way to solve it is to restrict the previous optimization to discrete measures µ̂ = (1/n) Σ_i δ_{ω_i} with n neurons and to perform gradient descent on the neurons’ parameters (ω_i)_i. Since the resulting function is non-convex, this descent might get trapped in a local minimum. A recent breakthrough was obtained by Chizat and Bach. They remarked that this flow is equivalent to a Wasserstein gradient flow (a gradient flow for the optimal transport distance). This allows one to consider the mean-field limit n → +∞, and in this limit, provided the initialization has a density, they showed that the flow can never be trapped in a local minimizer. This in turn ensures that, if the number of neurons n is large enough, and if they are initialized at random according to some distribution with a density, then the usual gradient descent cannot be trapped in a local minimum (if it converges, it converges to the global minimizer, hence to a zero loss). Note however that it is not possible to know how many neurons are needed for this conclusion to hold, so it is not known whether the O(1/n) rate can be reached by a gradient descent algorithm.
To make the proof, one has to rely on another algorithm with known convergence guarantees, coming from classical convex optimization. The advantage is that it leads to a constructive proof, but the issue is that this algorithm relies on the computation of an oracle which is a priori not tractable (the exact optimization of a single neuron). So this algorithm cannot be used in practice in high dimensions.

First order variations. To derive this algorithm, we rely on a linearization, which we detail in the general context of a Banach space (it can in fact be carried out even more abstractly, without a norm structure, by only relying on directional derivatives); for our purpose it is applied to the space of probability measures equipped with the total variation norm. In the following, to ease the exposition, we denote the integration of a function against a measure using an inner-product notation

⟨f, µ⟩ := ∫ f(x) dµ(x).

Let µ + ερ be a small perturbation of µ, where ρ is another (signed) measure. The first variation ∇E(µ) is then a function defined by the directional derivative rule

E(µ + ερ) = E(µ) + ε ⟨∇E(µ), ρ⟩ + o(ε),

so that ∇E(µ) plays the role of the Fréchet derivative of E, and is also called the first variation. In our case, we have

E(µ + ερ) = (1/2) ∫_K ( Φ(µ)(x) − f(x) + ε Φ(ρ)(x) )^2 dx,

which expands to

E(µ + ερ) = E(µ) + ε ∫_K Φ(ρ)(x) ( Φ(µ)(x) − f(x) ) dx + O(ε^2).

Rewriting this in terms of φ, we find

∇E(µ)(ω) = ∫_K φ(x, ω) ( Φ(µ)(x) − f(x) ) dx,

which is a continuous function of ω.
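
This first-variation formula can be checked numerically by discretizing both K and the neuron space. In the sketch below (the finite family of neurons, the tanh activation and the target f are illustrative assumptions), the directional derivative of E along ρ = ν − µ is compared with ⟨∇E(µ), ρ⟩ computed from the formula above.

import numpy as np

rng = np.random.default_rng(2)
sigma = np.tanh

xs = np.linspace(-1.0, 1.0, 500)                         # quadrature grid for K = [-1, 1]
dx = xs[1] - xs[0]
f = np.cos(4 * xs)                                       # arbitrary target on K

# A small finite family of neurons omega_p = (u_p, v_p, b_p)
P = 50
u = rng.uniform(-1, 1, P)
v = rng.uniform(-5, 5, P)
b = rng.uniform(-2, 2, P)
Phi = u[:, None] * sigma(v[:, None] * xs + b[:, None])   # Phi[p] = phi(., omega_p) sampled on xs

def energy(mu):
    # E(mu) = (1/2) int_K (Phi(mu)(x) - f(x))^2 dx
    return 0.5 * dx * np.sum((mu @ Phi - f) ** 2)

def grad_energy(mu):
    # first variation at each omega_p: int_K phi(x, omega_p)(Phi(mu)(x) - f(x)) dx
    return dx * (Phi @ (mu @ Phi - f))

mu = rng.random(P)
mu /= mu.sum()                                           # a probability vector on the family
nu = rng.random(P)
nu /= nu.sum()                                           # another probability vector
rho = nu - mu                                            # admissible perturbation direction

eps = 1e-6
fd = (energy(mu + eps * rho) - energy(mu)) / eps
print(abs(fd - rho @ grad_energy(mu)))                   # ~O(eps): matches <grad E(mu), rho>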

Frank-Wolfe algorithm. The Frank-Wolfe algorithm seeks to minimize a function over a convex subset C of a Banach space,

min_{µ∈C} E(µ).

It operates by successive linearizations of the objective E(µ). It initializes µ_0 arbitrarily (e.g., as a Dirac measure). At each iteration k, for a step size τ_k, the measure is updated as

µ_{k+1} = (1 − τ_k) µ_k + τ_k ν_k^*,

where ν_k^* is a measure minimizing the linearized functional,

ν_k^* ∈ argmin_{ν∈P(Ω)} ⟨∇E(µ_k), ν⟩.

We call the computation of ν_k^* an “oracle” since it is a priori not always simple to obtain. In finite dimension, if the measures are supported on a grid, it can be carried out, but, as we will see, in the general setting it requires solving a non-convex optimization problem over the space Ω of (single) neurons. In our specific case, C = P(Ω) is endowed with the total variation norm ∥µ∥_TV = |µ|(Ω) (the extension to measures of the L^1 norm of functions). In this special case, a key property of the algorithm is that the solution ν_k^* can always be taken to be a Dirac measure since, denoting g_k := ∇E(µ_k),

ν_k^* = δ_{ω_k^*},   where   ω_k^* ∈ argmin_{ω∈Ω} g_k(ω).

This holds because for any ν ∈ P(Ω),

∫ g_k(ω) dν(ω) ⩾ min(g_k),

and equality is achieved when ν = δ_{ω_k^*}. Therefore, if µ_0 is initialized as a Dirac measure, each iteration of the algorithm ensures that µ_k remains a sum of at most k + 1 Dirac masses.
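
Below is a minimal numerical sketch of this Frank-Wolfe iteration in dimension d = 1, where the intractable oracle is replaced by a brute-force search over a finite grid of single neurons; the grid, the tanh activation, the step size τ_k = 2/(k + 2) and the target f are illustrative assumptions, not part of the text. Each iteration adds at most one Dirac mass, so after k steps the measure has at most k + 1 atoms, mirroring the discussion above.

import numpy as np

sigma = np.tanh

# Discretize K = [-1, 1] for the L2(dx) loss, and pick a target f on K
xs = np.linspace(-1.0, 1.0, 400)
dx = xs[1] - xs[0]
f = np.sin(3 * xs) + 0.5 * np.cos(7 * xs)

# Finite grid Omega of single neurons omega = (u, v, b): this replaces the
# intractable oracle by a brute-force search over the grid.
us = np.array([-2.0, -1.0, 1.0, 2.0])
vs = np.linspace(-10, 10, 41)
bs = np.linspace(-10, 10, 41)
U, Vg, B = np.meshgrid(us, vs, bs, indexing="ij")
omegas = np.stack([U.ravel(), Vg.ravel(), B.ravel()], axis=1)     # (P, 3) grid of neurons
Phi = omegas[:, 0:1] * sigma(omegas[:, 1:2] * xs + omegas[:, 2:3])
# Phi[p] is the function x -> phi(x, omega_p) sampled on xs

mu = np.zeros(len(omegas))
mu[0] = 1.0                                 # mu_0: a single Dirac mass

for k in range(200):
    residual = mu @ Phi - f                 # Phi(mu_k) - f on the x grid
    grad = dx * (Phi @ residual)            # g_k(omega) = int_K phi(x, omega) residual(x) dx
    p_star = np.argmin(grad)                # oracle: the best single neuron on the grid
    tau = 2.0 / (k + 2.0)
    mu = (1 - tau) * mu                     # mu_{k+1} = (1 - tau) mu_k + tau delta_{omega*}
    mu[p_star] += tau
    if k % 50 == 0:
        print(k, 0.5 * dx * np.sum(residual ** 2))   # loss decreases toward the best fit on the grid

On this discretized problem the objective is a convex quadratic in µ, so Theorem 2 below applies directly and predicts an O(1/k) decay of the suboptimality.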

Convergence Rate. The following theorem establishes the convergence rate of the Frank-Wolfe algorithm. In our case, ∥·∥ = ∥·∥_TV is the total variation norm on measures, and the dual norm ∥·∥_* = ∥·∥_∞ is the L^∞ norm on functions. We first recall that a function with a Lipschitz gradient has a quadratic upper bound.

Lemma 1. Let E : C → R be a differentiable function with L-Lipschitz gradient with respect to the norm ∥·∥, that is, for all µ, ν ∈ C,

∥∇E(µ) − ∇E(ν)∥_* ⩽ L ∥ν − µ∥,

where ∥·∥_* denotes the dual norm of ∥·∥. Then, for all µ, ν ∈ C,

E(ν) ⩽ E(µ) + ⟨∇E(µ), ν − µ⟩ + (L/2) ∥ν − µ∥^2.
Proof. By the fundamental theorem of calculus, we can express E(ν) as

E(ν) = E(µ) + ∫_0^1 ⟨∇E(µ + t(ν − µ)), ν − µ⟩ dt.

Adding and subtracting ∇E(µ) inside the integrand,

E(ν) = E(µ) + ⟨∇E(µ), ν − µ⟩ + ∫_0^1 ⟨∇E(µ + t(ν − µ)) − ∇E(µ), ν − µ⟩ dt.

Using the L-Lipschitz property of the gradient, we bound the difference

∥∇E(µ + t(ν − µ)) − ∇E(µ)∥_* ⩽ L t ∥ν − µ∥.

Substituting this bound into the integral,

∫_0^1 ⟨∇E(µ + t(ν − µ)) − ∇E(µ), ν − µ⟩ dt ⩽ ∫_0^1 L t ∥ν − µ∥^2 dt = (L/2) ∥ν − µ∥^2.

Thus

E(ν) ⩽ E(µ) + ⟨∇E(µ), ν − µ⟩ + (L/2) ∥ν − µ∥^2.

Theorem 2. Let E be convex and assume that ∇E is L-Lipschitz, i.e.

∥∇E(µ) − ∇E(µ′)∥_* ⩽ L ∥µ − µ′∥.

For the step size τ_k = 2/(k + 2), the Frank-Wolfe iterates for the minimization of E over a set C of radius

r := sup_{(µ,µ′)∈C^2} ∥µ − µ′∥

satisfy, denoting E^* := inf_{µ∈C} E(µ),

E(µ_k) − E^* ⩽ 2 L r^2 / (k + 1).

Proof. Using the L-Lipschitz gradient property with respect to the Banach norm ∥·∥, Lemma 1 gives the quadratic upper bound

E(ν) ⩽ E(µ) + ⟨∇E(µ), ν − µ⟩ + (L/2) ∥ν − µ∥^2,   ∀µ, ν ∈ C.

One-Step Improvement. The Frank-Wolfe update is

µ_{k+1} = µ_k + τ_k (ν_k − µ_k),

where τ_k = 2/(k + 2) and ν_k ∈ argmin_{ν∈C} ⟨∇E(µ_k), ν⟩. By smoothness of E, we have

E(µ_{k+1}) ⩽ E(µ_k) + τ_k ⟨∇E(µ_k), ν_k − µ_k⟩ + (L/2) τ_k^2 ∥ν_k − µ_k∥_TV^2.

Furthermore, the boundedness of C ensures ∥ν_k − µ_k∥ ⩽ r. Substituting, we get

E(µ_{k+1}) ⩽ E(µ_k) + τ_k g_k + (L/2) τ_k^2 r^2,

where g_k := ⟨∇E(µ_k), ν_k − µ_k⟩. Defining h_k := E(µ_k) − E^* as the suboptimality at iteration k, we have

h_{k+1} ⩽ h_k + τ_k g_k + (L/2) τ_k^2 r^2.   (1.5)

We now bound g_k. Using the optimality of ν_k,

g_k := ⟨∇E(µ_k), ν_k − µ_k⟩ = min_{ν∈C} ⟨∇E(µ_k), ν − µ_k⟩,

and by convexity, for any ν ∈ C,

E(ν) ⩾ E(µ_k) + ⟨∇E(µ_k), ν − µ_k⟩,

so that ⟨∇E(µ_k), ν − µ_k⟩ ⩽ E(ν) − E(µ_k), hence

g_k ⩽ min_{ν∈C} E(ν) − E(µ_k) = E^* − E(µ_k) = −h_k.

Plugging this into (1.5), we obtain the fundamental descent property

h_{k+1} ⩽ h_k − τ_k h_k + (L/2) τ_k^2 r^2.

Substituting τ_k = 2/(k + 2),

h_{k+1} ⩽ h_k ( 1 − 2/(k + 2) ) + 2 L r^2 / (k + 2)^2.

Recursion Argument. We prove by induction that h_k ⩽ 2 L r^2 / (k + 1) for k ⩾ 1. For k = 1, the descent property with τ_0 = 1 gives h_1 ⩽ (L/2) r^2 ⩽ L r^2, so the claim holds. Assume now the inductive hypothesis

h_k ⩽ 2 L r^2 / (k + 1);

we prove that

h_{k+1} ⩽ 2 L r^2 / (k + 2).

Using the inductive hypothesis in the recursion for h_{k+1},

h_{k+1} ⩽ ( 2 L r^2 / (k + 1) ) · ( k / (k + 2) ) + 2 L r^2 / (k + 2)^2 = ( 2 L r^2 / (k + 2) ) · ( k/(k + 1) + 1/(k + 2) ).

Since k/(k + 1) + 1/(k + 2) ⩽ 1, we conclude that

h_{k+1} ⩽ 2 L r^2 / (k + 2),

which completes the induction.

In the case of MLP training, where E is defined in (1.4), the proposition below shows that L ⩽ M^2 ∥σ∥_∞^2, where M = R ∥f∥_B, and we have r = 2 (the radius of the space of probability measures for the TV norm). Recall that ∥f∥_B is the Barron norm of the target function f. Furthermore, we know that E(µ^*) = 0, since the existence of a valid representing measure was established in Equation (1.3). By applying the Frank-Wolfe algorithm, we deduce the existence of a discrete measure µ_k consisting of at most k + 1 Dirac masses which achieves an approximation error

E(µ_k) = O(1/k).

Thus, the Frank-Wolfe algorithm constructs a sparse representation of the target function with a provably decreasing error bound as the number of iterations k increases.
Proposition 3. The first variation ∇E(µ) of the functional E(µ) defined in (1.4), which is

∇E(µ)(ω) = ∫_K φ(x, ω) ( Φ(µ)(x) − f(x) ) dx,

where Φ(µ)(x) = ∫_Ω φ(x, ω) dµ(ω) and φ(x, ω) = u σ(⟨v, x⟩ + b), is L-Lipschitz with respect to the total variation norm. Specifically, for any µ, µ′ ∈ P(Ω),

∥∇E(µ) − ∇E(µ′)∥_∞ ⩽ L ∥µ − µ′∥_TV,

where L = ∥σ∥_∞^2 M^2.
Proof. The difference between ∇E(µ) and ∇E(µ′) is

∇E(µ)(ω) − ∇E(µ′)(ω) = ∫_K φ(x, ω) ( Φ(µ)(x) − Φ(µ′)(x) ) dx,   where   Φ(µ)(x) − Φ(µ′)(x) = ∫_Ω φ(x, ω′) d(µ − µ′)(ω′).

Substituting the second expression into the first, using Fubini’s theorem and introducing the kernel k(ω, ω′) := ∫_K φ(x, ω) φ(x, ω′) dx,

∇E(µ)(ω) − ∇E(µ′)(ω) = ∫_Ω k(ω, ω′) d(µ − µ′)(ω′).

Taking the L^∞ norm with respect to ω,

∥∇E(µ) − ∇E(µ′)∥_∞ = sup_{ω∈Ω} | ∫_Ω k(ω, ω′) d(µ − µ′)(ω′) |,

and the triangle inequality (the duality between the L^∞ and TV norms) gives

∥∇E(µ) − ∇E(µ′)∥_∞ ⩽ ∥k∥_{L^∞(Ω×Ω)} ∥µ − µ′∥_TV.

One has

∥k∥_{L^∞(Ω×Ω)} = sup_{(ω,ω′)∈Ω^2} | ∫_K φ(x, ω) φ(x, ω′) dx | ⩽ ∥φ∥_{L^∞(K×Ω)}^2 ⩽ M^2 ∥σ∥_∞^2.

Bibliography

[1] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of
Machine Learning Research, 18(1):629–681, 2017.

[2] Francis Bach. Learning theory from first principles. 2021.

[3] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE
Transactions on Information theory, 39(3):930–945, 1993.

[4] Amir Beck. Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications with MAT-
LAB. SIAM, 2014.

[5] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization
and statistical learning via the alternating direction method of multipliers. Foundations and Trends®
in Machine Learning, 3(1):1–122, 2011.

[6] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.

[7] E. Candès and D. Donoho. New tight frames of curvelets and optimal representations of objects with
piecewise C2 singularities. Commun. on Pure and Appl. Math., 57(2):219–266, 2004.

[8] E. J. Candès, L. Demanet, D. L. Donoho, and L. Ying. Fast discrete curvelet transforms. SIAM
Multiscale Modeling and Simulation, 5:861–899, 2005.

[9] A. Chambolle. An algorithm for total variation minimization and applications. J. Math. Imaging Vis.,
20:89–97, 2004.

[10] Antonin Chambolle, Vicent Caselles, Daniel Cremers, Matteo Novaga, and Thomas Pock. An intro-
duction to total variation for image analysis. Theoretical foundations and numerical methods for sparse
recovery, 9(263-340):227, 2010.

[11] Antonin Chambolle and Thomas Pock. An introduction to continuous optimization for imaging. Acta
Numerica, 25:161–319, 2016.

[12] S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal
on Scientific Computing, 20(1):33–61, 1999.

[13] Philippe G Ciarlet. Introduction à l’analyse numérique matricielle et à l’optimisation. 1982.

[14] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. SIAM
Multiscale Modeling and Simulation, 4(4), 2005.

[15] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control,


signals and systems, 2(4):303–314, 1989.

[16] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems
with a sparsity constraint. Commun. on Pure and Appl. Math., 57:1413–1541, 2004.

[17] D. Donoho and I. Johnstone. Ideal spatial adaptation via wavelet shrinkage. Biometrika, 81:425–455,
Dec 1994.
[18] Heinz Werner Engl, Martin Hanke, and Andreas Neubauer. Regularization of inverse problems, volume
375. Springer Science & Business Media, 1996.

[19] M. Figueiredo and R. Nowak. An EM Algorithm for Wavelet-Based Image Restoration. IEEE Trans.
Image Proc., 12(8):906–916, 2003.
[20] Simon Foucart and Holger Rauhut. A mathematical introduction to compressive sensing, volume 1.
Birkhäuser Basel, 2013.
[21] Stephane Mallat. A wavelet tour of signal processing: the sparse way. Academic press, 2008.

[22] D. Mumford and J. Shah. Optimal approximation by piecewise smooth functions and associated varia-
tional problems. Commun. on Pure and Appl. Math., 42:577–685, 1989.
[23] Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and Trends® in Optimization,
1(3):127–239, 2014.

[24] Gabriel Peyré. L’algèbre discrète de la transformée de Fourier. Ellipses, 2004.


[25] J. Portilla, V. Strela, M.J. Wainwright, and Simoncelli E.P. Image denoising using scale mixtures of
Gaussians in the wavelet domain. IEEE Trans. Image Proc., 12(11):1338–1351, November 2003.
[26] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Phys.
D, 60(1-4):259–268, 1992.
[27] Otmar Scherzer, Markus Grasmair, Harald Grossauer, Markus Haltmeier, Frank Lenzen, and L Sirovich.
Variational methods in imaging. Springer, 2009.
[28] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal,
27(3):379–423, 1948.

[29] Jean-Luc Starck, Fionn Murtagh, and Jalal Fadili. Sparse image and signal processing: Wavelets and
related geometric multiscale analysis. Cambridge university press, 2015.

