
STOCHASTIC DIFFERENTIAL EQUATIONS MODELS FOR LEAST-SQUARES STOCHASTIC GRADIENT DESCENT

ADRIEN SCHERTZER† AND LOUCAS PILLAUD-VIVIEN⋆

ABSTRACT. We study the dynamics of a continuous-time model of Stochastic Gradient Descent (SGD) for the least-squares problem. Pursuing the work of [1], we analyze Stochastic Differential Equations (SDEs) that model SGD either in the case of the training loss (finite samples) or the population one (online setting). A key qualitative feature of the dynamics is the existence of a perfect interpolator of the data, irrespective of the sample size. In both scenarios, we provide precise, non-asymptotic rates of convergence to the (possibly degenerate) stationary distribution. Additionally, we describe this asymptotic distribution, offering estimates of its mean, deviations from it, and a proof of the emergence of heavy tails related to the step-size magnitude. Numerical simulations supporting our findings are also presented.

arXiv:2407.02322v1 [cs.LG] 2 Jul 2024

1. INTRODUCTION

Stochastic gradient descent (SGD) is the workhorse of any large-scale machine learning pipeline. Described more than seventy years ago as a versatile stochastic algorithm [2], it has been studied thoroughly since then (e.g. [3, 4]), and applied with success as an efficient computational and statistical device for large-scale machine learning [5]. Yet, since the emergence of deep neural networks (DNNs), new light has been shed on the use of the algorithm: among others, the role of SGD's noise in the good generalization performance of DNNs [6] has been described empirically [7], while the particular shape of its covariance is expected to play a predominant role in its dynamics [8].
In this direction, [1] proposed to approximate the SGD dynamics by a Stochastic Differential Equation (SDE) whose noise covariance matches that of SGD. Since then, many works have leveraged this continuous perspective, attempting to better describe some of SGD's phenomenology: among others, the role of the step-size [9], the escape time and direction from local minimizers [10], the study of the invariant distribution [11] or the heavy-tail phenomenon [12]. However, lured by the strong analytical tools that SDE models offer, some misconceptions on the basic nature of SGD's noise in the machine learning context have also emerged. Among others, one important message of [13] is to recall that SGD's noise has a particular shape and intensity that make SGD far from a Langevin type of dynamics where isotropic noise is added to the gradient at each step. Notably, if the data model can be interpolated perfectly, the invariant measure of SGD can be largely degenerate without any need for step-size decay: this is the case in the overparametrized regime [14, 15].
In this article, we take a step back from general-purpose studies of SGD and focus on the specific case of least-squares, where the predictor is a linear function of fixed features. Needless to say, linear predictors do not lead to state-of-the-art performance in most modern tasks; yet the abundant literature on the neural tangent kernel [16] reminded us that we still have to understand linear models better. Moreover, even if the article is not written within the setting of kernel methods for the sake of simplicity, every assumption is made in order to easily adapt to this setting, replacing ℝ^d equipped with its canonical Euclidean structure by an abstract reproducing kernel Hilbert space.

† Universität Bonn, email: aschertz@uni-bonn.de
⋆ École des Ponts ParisTech - CERMICS, email: loucas.pillaud-vivien@enpc.fr
Purpose and contributions. The aim of the present paper is to show that, with a proper model, SDEs offer nice analytical tools that help streamline the analysis while capturing the qualitative as well as the quantitative essence of SGD's dynamics. We present a wide range of results on both online and empirical SGD, demonstrating through quantitative analysis that the key difference lies in the model's ability to achieve perfect interpolation. We also systematically present non-asymptotic rates of convergence in all the settings we consider, either in ℓ2-norm in the interpolation regime (Theorems 4.3 and 5.1), or in Wasserstein distance (towards the invariant distribution) in the case of a noisy system (Theorems 4.6 and 5.4). In the latter case, we further investigate the invariant distribution: we pinpoint its location (Proposition 4.7) and, more importantly, we demonstrate in Proposition 4.9 that although there is no heavy-tail phenomenon in finite time, it emerges asymptotically. We finally address the convergence of variance reduction techniques like time-averaging (Proposition 4.10) and step-size decay (Proposition 4.11). Throughout the article, we try to present some theoretical background on the study of SDEs: among others, the use of Lyapunov potentials and of coupling methods that enable a quantitative study of the speed of convergence to equilibrium.
Further related work. Formal links between the true stochastic gradient descent and its continuous model are studied in [1] on the theoretical side, where a weak error analysis is provided, and in [17] on the experimental side. In an article similar in spirit, [18] provides convergence results for general convex functions, but with a systematic polynomial step-size decay. Some results using SDEs to study the influence of the noise on the implicit bias of the algorithm are given in the least-squares case in [19], an active topic in general [20, 21, 22, 23]. Quantitative studies of the invariant distribution focusing on the particular shape and intensity of the noise covariance include [24, 25]. Note that when the step-size is properly re-scaled with respect to the dimension, SGD converges in the high-dimensional limit to an SDE that is similar to the one we study here [26], which is called homogenized SGD in the least-squares context [27]. Power laws of convergence toward stationarity related to the eigenvalue decay of the covariance matrix (capacity condition) are ubiquitous in statistics [28] and in the study of SGD for least-squares [29, 14, 30, 31, 32]. Finally, the heavy-tail phenomenon has been re-discovered lately as an interesting feature of SGD with multiplicative noise [33, 12] and more recently in the context of SDEs in the work of [34].
Organisation of the paper. In Section 2, we present the general set-up of SGD in both the population and empirical cases, as well as the possibility, in both cases, of the data being fully interpolated. Section 3 explains the relevance of building a consistent SDE model of SGD and recalls technical details related to SDEs. In Section 4, the results concerning the dynamics of SGD in the training case are given, both in the interpolation regime (Section 4.1) and in the noisy regime (Section 4.2). Section 5 is built similarly to the previous one and is devoted to online SGD. The proofs are postponed to the Appendix, for which precise references can be found in the main text.

2. SET-UP: STOCHASTIC GRADIENT DESCENT ON LEAST-SQUARES PROBLEMS

In this section, we introduce the least-squares problem that we consider throughout the article, putting emphasis on the difference between empirical distributions (training loss) and true ones (population loss). Nonetheless, the central argument of the article is that, aside from the known differences between the empirical and test cases, the primary qualitative distinction hinges on whether the loss can be zero (interpolation), irrespective of the discrete nature of the distribution. The stochastic gradient descent is introduced at the end of the section.

2.1. The least-squares problem: population and empirical losses.


We consider a regression problem with input/output pair (X, Y) ∈ ℝ^d × ℝ distributed according to a joint probability law ρ ∈ P(ℝ^d × ℝ). To learn the rule linking inputs to outputs, we take a linear family of predictors {fθ : x ↦ ⟨θ, x⟩, θ ∈ ℝ^d} and aim at minimizing the average of the square loss ℓ(θ, (X, Y)) := (1/2)(⟨θ, X⟩ − Y)², that is,

(1)    L(θ) := (1/2) E_{(X,Y)∼ρ}[(⟨θ, X⟩ − Y)²].

Remark 2.1 (Link with RKHS). Here, the family of predictors consists of linear functions of the input data x ∈ ℝ^d. While, for the sake of clarity, we keep this linear family throughout the article, note that the same results apply to any family of linear predictors in an abstract reproducing kernel Hilbert space (H, ⟨·, ·⟩_H) with feature map ℝ^d ∋ x ↦ φ(x) ∈ H, changing the dot product and the family of predictors according to the natural structure: {fθ : x ↦ ⟨θ, φ(x)⟩_H, θ ∈ H}.
In this article, in order to put emphasis on the possible difference between the two settings, we make a clear distinction between the population and the empirical cases. To keep notations simple, we refer to the population case whenever Ω := supp(ρ) is an open set of ℝ^d. On the contrary, we refer to the empirical case when ρ is a finite sum of atomic measures, i.e. there exist (x1, y1), …, (xn, yn) ∈ ℝ^d × ℝ such that ρ = (1/n) Σ_{i=1}^n δ_{(xi, yi)}. Obviously, the term empirical refers to the fact that, in this case, Eq. (1) can be seen as the training loss

(2)    L(θ) = (1/2n) Σ_{i=1}^n (⟨θ, xi⟩ − yi)².

In this case, the number n ∈ ℕ* will always denote the number of observed input/output pairs (xi, yi)_{i=1…n}, and we will use the notations X = [x1, …, xn]⊤ ∈ ℝ^{n×d} for the design matrix and y = (y1, …, yn)⊤ ∈ ℝ^n for the output vector. With these notations, the training loss rewrites L(θ) = (1/2n) ∥Xθ − y∥₂².
Note that, even though for the sake of clarity we will sometimes treat them separately, these cases fit into the same framework and notations.

2.2. Noisy and noiseless settings.


For this, let us introduce the set of interpolators:
n o
(3) I = θ ∈ Rd , such that L(θ) = 0 .

We distinguish between the two following settings.


Noisy setting I = ∅. In this case, the loss L is strictly lower bounded, i.e. there exists σ > 0 such that L(θ) ≥ σ² for all θ ∈ ℝ^d. The two following examples show that this situation typically arises for different reasons in the population and the empirical cases. In the former, this is an effect of the noise model on the output, whereas in the latter, this is only a consequence of underparametrization.
Example 2.2 (Noisy model). Assume that there exist θ∗ ∈ ℝ^d and ξ ∈ ℝ, a random variable independent of X with mean zero and variance 2σ², such that Y = ⟨θ∗, X⟩ + ξ. Then

    L(θ) = (1/2) ∥Σ^{1/2}(θ − θ∗)∥² + (1/2) E[ξ²] ≥ σ²,

where we use the notation Σ := Eρ[XX⊤] ∈ ℝ^{d×d} for the input covariance matrix.

Example 2.3 (Underparametrized setting). Consider the underparametrized regime for which n > d, and assume that the i.i.d. input/output couples (xi, yi)_{i=1…n} come from independent distributions that have densities (i.e. for all i ∈ J1, nK, xi and yi are independent and are distributed according to laws that are absolutely continuous with respect to the Lebesgue measure). Then, almost surely, I = ∅.
Noiseless setting I ≠ ∅. This means that there exists at least one perfect linear interpolator of the model. In the population setting, this corresponds to the strong assumption that the model is well-specified and noiseless (formally ξ = 0, if we refer to Example 2.2). Notably, this regime has received a lot of attention recently to model the large expressive power of neural networks [19, 14, 35]. In the empirical case, this typically and simply corresponds to the overparametrized regime d ≥ n.
Example 2.4 (Overparametrized setting). Consider the overparametrized regime for which d ≥ n, and assume that ((x1, y1), …, (xn, yn)) are i.i.d. samples drawn from a distribution that has a density. Then the zero-loss set is the affine set I = θ∗ + Ker(X), where θ∗ is any element of I. Furthermore, dim I ≥ d − n ≥ 0 almost surely.

2.3. The stochastic gradient descent


Stochastic gradient descent (SGD) aims at minimizing a function through unbiased estimates of its gradient. While this method was developed for a different purpose in the early 1950s [2], it is remarkable how well SGD fits the modern large-scale machine learning framework [36]. Indeed, the SGD iterative procedure corresponds to sampling at each time t ∈ ℕ* an independent draw (xt, yt) ∼ ρ and updating the predictor θ with respect to the local gradient calculated on this sample:

(4)    θt+1 = θt − γ ∇θ ℓ(θt, (xt, yt)),
where γ > 0 is the step size. Even if the population and empirical cases fall under the same general
framework of stochastic approximation, let us detail the setups for these.
The population case: online SGD. The population case corresponds to what is referred to as online SGD in the literature [5]. In this case, for each time t ∈ ℕ*, we have an independent draw (xt, yt) ∼ ρ, and if we denote Ft = ((xk, yk), k ≤ t) the natural adapted filtration, then E[∇θ ℓ(θt, (xt, yt)) | Ft−1] = ∇L(θt). That is, ∇θ ℓ(θt, (xt, yt)) is an unbiased estimate of the true gradient of the risk, and the recursion writes in explicit form

(5)    θt+1 = θt − γ (⟨θt, xt⟩ − yt) xt.

Remark that, even though at iteration t it has seen only t samples, the SGD recursion directly optimizes the population loss L. This is due to the fact that, as long as the recursion goes, we have access to fresh samples from the distribution ρ. This can be seen informally as the case n = +∞ and is in contrast to the finite-n empirical case. A convenient way to present this dynamics is to make the true gradient appear and rewrite the rest as a martingale increment. Indeed, Eq. (5) reads

       θt+1 = θt − γ ∇θ L(θt) + γ (∇θ L(θt) − (⟨θt, xt⟩ − yt) xt)
(6)         = θt − γ Eρ[(⟨θt, X⟩ − Y) X] + γ m(θt, (xt, yt)),

where m(θt, (xt, yt)) := Eρ[(⟨θt, X⟩ − Y) X] − (⟨θt, xt⟩ − yt) xt.
The empirical case: training SGD. In this case, the main difference is that we consider a finite number of samples, so that the SGD recursion iterates over them uniformly at random. Mathematically, this corresponds to taking at each time t ∈ ℕ* an independent sample of the uniform distribution it ∼ Unif({1, …, n}) and considering the unbiased estimate ∇θ ℓ(θt, (xit, yit)) of the true gradient of the training loss Eq. (2). Here, the adapted filtration writes Ft = ((xk, yk)_{k≤n}, (ik)_{k≤t}) and the recursion reads:

(7)    θt+1 = θt − γ (⟨θt, xit⟩ − yit) xit.

Despite the fact that this dynamics looks very similar, the important difference is that, the batch of samples being fixed, the dynamics can select the same pair (xk, yk) several times. This is the reason why, informally, after t ≥ Θ(n) iterations, the training dynamics Eq. (7) deviates from the online SGD presented in Eq. (5). To end this section, let us also rewrite this dynamics as the gradient descent plus a martingale increment:

       θt+1 = θt − γ ∇θ L(θt) + γ (∇θ L(θt) − (⟨θt, xit⟩ − yit) xit)
(8)         = θt − γ (1/n) Σ_{i=1}^n (⟨θt, xi⟩ − yi) xi + γ m(θt, (xit, yit)),

where m(θt, (xit, yit)) := (1/n) Σ_{i=1}^n (⟨θt, xi⟩ − yi) xi − (⟨θt, xit⟩ − yit) xit. Note that these equations are in fact exactly the same as the ones presented in the population case, simply considering that the distribution is an empirical one, i.e. ρ = (1/n) Σ_{i=1}^n δ_{(xi, yi)}. We nonetheless decided to present them explicitly for the sake of clarity.
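To make the two recursions concrete, the following minimal NumPy sketch implements both Eq. (5) (online SGD, fresh samples at every step) and Eq. (7) (training SGD, resampling from a fixed batch); the data-generating model and the values of d, n, γ are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, gamma, T = 20, 100, 0.01, 5000          # illustrative sizes and step-size

# Synthetic least-squares data (hypothetical generating model).
theta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ theta_true + 0.1 * rng.normal(size=n)

def sgd_step(theta, x, y_val, gamma):
    # One SGD update: theta <- theta - gamma * (<theta, x> - y) x
    return theta - gamma * (theta @ x - y_val) * x

# Online SGD, Eq. (5): a fresh sample from rho at every iteration.
theta_online = np.zeros(d)
for _ in range(T):
    x = rng.normal(size=d)
    y_val = x @ theta_true + 0.1 * rng.normal()
    theta_online = sgd_step(theta_online, x, y_val, gamma)

# Training SGD, Eq. (7): indices resampled uniformly from the fixed batch.
theta_train = np.zeros(d)
for _ in range(T):
    i = rng.integers(n)
    theta_train = sgd_step(theta_train, X[i], y[i], gamma)

print(np.linalg.norm(theta_online - theta_true),
      np.linalg.norm(theta_train - theta_true))
```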

3. CONTINUOUS MODEL OF SGD

In this section, we give stochastic differential equation (SDE) models for SGD. We first provide general conditions required to model SGD well, and then instantiate them precisely.

3.1. The requirements of an SDE model


As said above, the decomposition of the SGD recursion as in equations (6), (8) is a generic feature of stochastic descent [4], and it has led to the celebrated ODE method [37] to study stochastic approximations of this type. Going further, this decomposition into a drift ∇L and a local martingale term m(·, (x, y)) is reminiscent of the decomposition occurring for Itô processes, i.e. solutions to an SDE of the type

(9)    dθt = b(t, θt) dt + √γ σ(t, θt) dBt,

where (Bt)t≥0 is a Brownian motion of ℝ^d. These types of models have been largely studied in the literature [38], as, beyond their large modelling abilities, they offer useful tools, e.g. Itô calculus, when it comes to mathematical analysis. One of the aims of this article is to link the SGD dynamics to processes that are exemplary in the SDE literature. In order to establish this link, we first have to answer the following question:

What SDE model fits well with the SGD dynamics?

This question has received a lot of attention in the last decade, and a good principle to answer it is to turn to stochastic modified equations [1]. This is a natural way to build SDE models since they are consistent with SGD in the infinitesimal step-size limit. In order to build such a model, there are two requirements:
(i) The drift term b(t, θt) should match −∇L(θt).
(ii) The noise factor σ should have the same covariance as the local martingale m, i.e.

(10)    σ(t, θt) σ(t, θt)⊤ = Eρ[m(θt, (xt, yt)) m(θt, (xt, yt))⊤ | Ft−1].

Besides technical assumptions, these are the two requirements presented in [1, Theorem 3] to show that the SDE model is consistent with the SGD recursion in the small step-size limit. Going beyond the approximation concerns tackled by (i) and (ii), it has recently been observed that the SGD noise carries a specific shape that the SDE model should carry as well [13, 22]. This requirement has a more qualitative nature but is important to fully capture the essence of the SGD dynamics:

(iii) The noise term σ(t, θt) dBt should span the same space as m(θt, (xt, yt)).
We will see below that, in order to build an SDE model of SGD, this third requirement is particularly important in the empirical case, where the noise has a strong degeneracy.

3.2. Explicit form of the SDE models


Let us first write explicitly the multiplicative noise factor σ(t, θt ) in the population and empirical cases.
Then, we derive the expression of the SDE models that we analyze later.
The population case. In the population case, the calculation has already been made in [19]: defining the residual random variable rX(θ) := ⟨θ, X⟩ − Y ∈ ℝ, we have that

    σ(t, θt) σ(t, θt)⊤ = Eρ[rX(θt)² XX⊤] − Eρ[rX(θt) X] Eρ[rX(θt) X]⊤.

In this case, there is no geometric specificity of the noise and we choose σ(t, θt) as the PSD square root of the right-hand side, that is, we define, for all θ ∈ ℝ^d,

(11)    σ(θ) := (Eρ[rX(θ)² XX⊤] − Eρ[rX(θ) X] Eρ[rX(θ) X]⊤)^{1/2} ∈ ℝ^{d×d},

and we study the following SDE model:

(12)    dθt = −Eρ[(⟨θt, X⟩ − Y) X] dt + √γ σ(θt) dBt,

where (Bt)t≥0 is a Brownian motion of ℝ^d and σ ∈ ℝ^{d×d} is given by (11).
The empirical case. The empirical case is, from the geometric perspective, a bit more subtle. Indeed, as can be seen directly in Eq. (7), the iterates of SGD stay in the low-dimensional affine space of dimension at most n: θ0 + span(x1, …, xn) ⊂ ℝ^d. Hence, it is essential for a good SDE model that the noise carries this degenerate structure. Let us carry out the calculation of the SDE noise factor. Defining rx(θ) := (⟨θ, x1⟩ − y1, …, ⟨θ, xn⟩ − yn)⊤ ∈ ℝ^n, the vector of the residuals, we have

    σ(t, θt) σ(t, θt)⊤ = (1/n) X⊤ diag(rx(θt))² X − (1/n²) X⊤ rx(θt) rx(θt)⊤ X
                       = (1/n) X⊤ (diag(rx(θt))² − (1/n) rx(θt) rx(θt)⊤) X
                       = (1/n) X⊤ (diag(rx(θt)) − (1/n) rx(θt) 1⊤)(diag(rx(θt)) − (1/n) rx(θt) 1⊤)⊤ X,

where, for any vector v, diag(v) denotes the diagonal matrix with the coordinates of v as diagonal entries, and 1 ∈ ℝ^n is the vector of ones. In the case where n < d, the overall covariance matrix is degenerate as its rank is at most n: this is because the "noise of SGD" belongs naturally to span(x1, …, xn). Hence, it is important to keep this degeneracy, and this is the reason why we choose the square root that preserves it:

    σ(θt) := (1/√n) X⊤ Rx(θt) ∈ ℝ^{d×n},

where we defined Rx(θ) := diag(rx(θ)) − (1/n) rx(θ) 1⊤ ∈ ℝ^{n×n}. With these notations, we study the following SDE model:

(13)    dθt = −(1/n) X⊤ (Xθt − y) dt + √(γ/n) X⊤ Rx(θt) dBt,

where (Bt)t≥0 is a Brownian motion of ℝ^n.
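As an illustration of the model (13), here is a hedged NumPy sketch that discretizes it with an Euler-Maruyama scheme in an interpolable (noiseless) empirical case; the sizes, the step-size γ, the discretization step and the horizon are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 100                       # overparametrized example (illustrative sizes)
gamma, dt, T = 0.05, 1e-3, 20_000    # SGD step-size gamma, Euler-Maruyama step dt

X = rng.normal(size=(n, d)) / np.sqrt(d)
y = X @ rng.normal(size=d)           # interpolable data: the set I is non-empty

def noise_factor(theta):
    # sigma(theta) = (1/sqrt(n)) X^T R_x(theta),
    # with R_x(theta) = diag(r_x(theta)) - (1/n) r_x(theta) 1^T.
    r = X @ theta - y
    R = np.diag(r) - np.outer(r, np.ones(n)) / n
    return X.T @ R / np.sqrt(n)

theta = rng.normal(size=d)           # initialization theta_0
for _ in range(T):
    drift = -X.T @ (X @ theta - y) / n
    dB = np.sqrt(dt) * rng.normal(size=n)
    theta = theta + drift * dt + np.sqrt(gamma) * noise_factor(theta) @ dB

print("training loss:", np.linalg.norm(X @ theta - y) ** 2 / (2 * n))
```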

Initial condition and moments. The initialization is taken at some θ0 ∈ ℝ^d, which can be considered as a random variable. Standard choices for the law of θ0 include the standard Gaussian of ℝ^d or a Dirac measure at some vector θ0, e.g. θ0 = 0. In either case, its law ρ0 has moments of all orders, and since the drift and the multiplicative noise of the SDEs (12)-(13) have at most linear growth, Theorem 3.5 of [38] shows that the marginal laws (ρt)t≥0 have moments of all orders at any time t ≥ 0.
Now that we have motivated the SDE models and stated them in Eqs. (12)-(13), we study their convergence properties. This is the purpose of the main results stated in the two following sections.

4. SGD ON THE TRAINING LOSS

Recall that this corresponds to the finite data set case with n input/output pairs (xi, yi)_{i=1…n}, stacked into the data matrix X ∈ ℝ^{n×d} and the output vector y ∈ ℝ^n. We study in this section the SDE given in equation (13). We assume that all data are bounded, i.e. there is some K > 0 such that ∥xi∥² ≤ K for all i ∈ J1, nK. Let us first introduce an important element of the model.
Definition 4.1. Let X† denote the pseudo-inverse of the design matrix X. We define
(14)    θ∗ = X† y + (I − X† X) θ0.
Note that, in the generic underparametrized case for which d ≤ n, we have X† = (X⊤X)^{−1} X⊤. Then X†X = I, and θ∗ = X†y is the ordinary least-squares estimator and does not depend on θ0. In the generic overparametrized case for which n ≤ d, we have X† = X⊤(XX⊤)^{−1}, and hence Xθ∗ = XX†y + (X − XX†X)θ0 = y, that is to say that θ∗ ∈ I.
Finally, for both cases, we define Σ := (1/n) X⊤X ∈ ℝ^{d×d}, the design covariance matrix. With this notation, the SDE becomes

    dθt = −Σ(θt − θ∗) dt + √(γ/n) X⊤ Rx(θt) dBt,

where (Bt)t≥0 is a Brownian motion of ℝ^n.
Comparison with standard processes. As already said, the dynamics in the noiseless and noisy cases are of different natures because of the possibility of cancelling the multiplicative noise term. Indeed, for clarity, imagine that n = d and X/√n = Σ = Id. If the noise can cancel, Rx(θt) has the shape of a linear term like diag(θ − θ∗), and each coordinate of the difference η = θ − θ∗ follows a one-dimensional geometric Brownian motion dηt = −ηt dt + √γ ηt dBt, for which it is known that ηt → 0 almost surely, which corresponds to θt → θ∗. This comparison is the governing principle of the analysis of this setup. Otherwise, if the noise cannot cancel and is strictly lower bounded, then, under the same proxy, the movement of each coordinate of η resembles the SDE dηt = −ηt dt + √γ √(ηt² + σ²) dBt, which looks like an Ornstein-Uhlenbeck process if η ≪ σ, but with a noise that has a multiplicative part ηt dBt when η ≫ σ. These are known in one dimension under the name of Pearson diffusions [39] and exhibit stationary distributions with heavy tails.
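The two one-dimensional proxies can be simulated directly. The sketch below (with arbitrary values of γ, σ and the discretization step) contrasts the geometric Brownian motion of the noiseless case, whose trajectories collapse onto 0, with the Pearson-type diffusion of the noisy case, which keeps fluctuating at stationarity.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, sigma, dt, T = 0.2, 1.0, 1e-3, 50_000   # illustrative parameters

def simulate(noisy, eta0=5.0):
    # Noiseless proxy: d eta = -eta dt + sqrt(gamma) * eta dB           (geometric BM)
    # Noisy proxy:     d eta = -eta dt + sqrt(gamma * (eta^2 + sigma^2)) dB
    eta, path = eta0, np.empty(T)
    for t in range(T):
        vol = np.sqrt(eta ** 2 + sigma ** 2) if noisy else abs(eta)
        eta += -eta * dt + np.sqrt(gamma * dt) * vol * rng.normal()
        path[t] = eta
    return path

print("noiseless, final |eta|     :", abs(simulate(noisy=False)[-1]))
print("noisy, std over second half:", simulate(noisy=True)[T // 2:].std())
```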
The difference between these two settings is the reason why we divide the results into two different
subsections.

4.1. The noiseless case


In this section, we let (θt)t≥0 follow the dynamics given by Eq. (13), initialized at θ0 ∈ ℝ^d, and assume that Ker(X) ≠ {0} and that I ≠ ∅. Note that this generically occurs in the overparametrized case for which n ≤ d. Recall that we have defined θ∗ = X†y + (I − X†X)θ0 in Eq. (14); we now present a geometric interpretation of θ∗ in this case.

Lemma 4.2. The vector θ∗ is the orthogonal projection of θ0 onto I, that is

(15)    θ∗ = argmin_{Xθ=y} ∥θ − θ0∥².

The proof is deferred to Section A.1 of the Appendix. Note that this fact is often known as the implicit bias
of least-squares methods and has statistical consequences studied under the name of “benign overfitting” [40,
41].
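A quick numerical sanity check of Definition 4.1 and Lemma 4.2, with arbitrary Gaussian data and illustrative sizes n < d:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 80                                  # overparametrized (illustrative sizes)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
theta0 = rng.normal(size=d)

X_pinv = np.linalg.pinv(X)                                    # X^dagger
theta_star = X_pinv @ y + (np.eye(d) - X_pinv @ X) @ theta0   # Eq. (14)

# theta_star interpolates the data: X theta_star = y, i.e. theta_star is in I.
print(np.allclose(X @ theta_star, y))

# Lemma 4.2: theta_star = theta0 + argmin_{X delta = y - X theta0} ||delta||,
# the projection of theta0 onto the affine set {X theta = y}.
delta, *_ = np.linalg.lstsq(X, y - X @ theta0, rcond=None)
print(np.allclose(theta_star, theta0 + delta))
```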
Remarkably, despite the randomness of the SDE (13), the convergence is almost sure towards θ∗ , with
explicit rates that we show below.
Theorem 4.3. Let (θt)t≥0 follow the dynamics given by Eq. (13) initialized at θ0 ∈ ℝ^d. Then, for γ < 1/(3K), (θt)t≥0 converges almost surely to θ∗ with the following rates.
(i) Parametric rate. For all t ≥ 0, we have

(16)    E[∥θt − θ∗∥²] ≤ ∥θ0 − θ∗∥² e^{−µ(2−Kγ)t},

where µ > 0 is the smallest non-zero eigenvalue of Σ.
(ii) Non-parametric rate. For all t ≥ 0 and all α > 0,

(17)    E[∥θt − θ∗∥²] ≤ (1 / (∥θ0 − θ∗∥^{−2/α} + Cα t))^α,

where Cα = (1/(2α)) (⟨θ0 − θ∗, Σ^{−α}(θ0 − θ∗)⟩ + (γKα/(2 − Kγ)) ∥θ0 − θ∗∥²)^{−1/α}, and Kα = max_{i≤n} ⟨xi, Σ^{−α} xi⟩.

Hence, for all t ≥ 0,

    E[∥θt − θ∗∥²] ≤ min( ∥θ0 − θ∗∥² e^{−µt}, inf_{α>0} (1 / (∥θ0 − θ∗∥^{−2/α} + Cα t))^α ),

and, if µ is non-zero but very small (e.g. 10^{−10}), this inequality describes well the difference between the transient regime of convergence, which is polynomial, and the asymptotic regime, which is exponential but occurring after the time-scale 1/µ. These polynomial convergence bounds, relying on the covariance upper bound Kα, are common in the study of SGD in RKHS [29, 42, 14]. Second, by Markov's inequality, remark that the estimates (i) and (ii) give convergence rates in probability. Furthermore, one readily checks that one can derive the same kind of inequality as in (i) for the expected iterates (E[θt])t≥0. In this case, one obtains the estimate ∥θ0 − θ∗∥² e^{−2µt}, so that, from the perspective of the rates given by the theorem, the stochastic nature of the dynamics seems to slow down the convergence. Finally, note that, considering θ0 as a random variable, all expectations can be thought of as conditional expectations E[· | θ0]. The results are illustrated in Figures 1 and 2.

4.2. The noisy case


In this section too, (θt)t≥0 follows the dynamics given by Eq. (13) initialized at θ0 ∈ ℝ^d. However, we now assume that Ker(X) = {0} and that I = ∅. Note that this is generically the case if n ≥ d. The important consequence is that the loss is uniformly lower bounded by some positive number. Indeed, we have:
Lemma 4.4. Let σ² = L(θ∗). Then, for all θ ∈ ℝ^d, L(θ) ≥ σ².
Proof. The function θ ↦ L(θ) is convex, hence any minimizer of L satisfies the equation ∇L(θ) = 0. This is the case of θ∗, as ∇L(θ) = Σ(θ − θ∗). ■
Hence, the noise covariance matrix is (most of the time) uniformly bounded away from zero, and this implies that there exists a unique invariant measure for the SDE. This is in contrast to the noiseless case (I ≠ ∅) where any measure that is supported in I is naturally invariant.

[Figure 1: error ∥θt − θ∗∥² of SGD against time, log-log scale.]

FIGURE 1. Plot showing the error of SGD along time for an overparametrized regime where n = 100 and d = 200. The samples (xi)_{i≤n} come from a Gaussian distribution with a covariance whose eigenvalues decay as a power law. The vertical dotted orange line illustrates the separation between the two regimes depicted by Theorem 4.3: the polynomial one before (a straight line in a log-log plot) and the exponential one after the typical time-scale 1/µ. This illustrates perfectly the rates of convergence shown in Theorem 4.3.

[Figure 2: trajectories start at θ0 and converge to θ∗.]

FIGURE 2. Display of a two-dimensional projection of 10 trajectories of SGD for n = 100, d = 200, in the case where there is a perfect interpolator θ∗. The ellipses represent the level curves of the training loss.

Semi-group. Due to the non-degeneracy of the noise term in this case, it is more convenient to track the dynamics of the probability measure ρt := Law(θt) for any t ≥ 0, which is the time marginal of the process initialized at θ0 and defined such that E[f(θt)] = ∫ f dρt for any f : ℝ^d → ℝ smooth enough. In fact, the SDE Eq. (13) has an associated semi-group Pt, defined so that for all η ∈ ℝ^d, (Pt f)(η) := E[f(θt) | θ_{t=0} = η]. Formally, if ρt has a density, we have ρt(θ) = (Pt δθ)(θ0) for all θ ∈ ℝ^d. Note that, by local strong ellipticity, even if ρ0 = δθ0 is singular, regularization properties of SDEs ensure that for any t > 0 the measure ρt has a density with respect to the Lebesgue measure as well as a second-order moment [43, Section 4 of Chapter 9]. We place ourselves in the setup where ρ0 ∈ P2(ℝ^d), the set of probability measures µ such that ∫ ∥θ∥² dµ(θ) < +∞.
Infinitesimal generator. It is known that the time-marginal law of (θt)t≥0 satisfies the (parabolic) Fokker-Planck equation (at least in the weak sense)

    ∂t ρt = L* ρt,

with the operator L* defined, for all θ ∈ ℝ^d, by

    (L* f)(θ) = div[Σ(θ − θ∗) f(θ)] + (γ/(2n)) Σ_{i,j=1}^d ∂²_{ij}([X⊤ Rx(θ)² X]_{ij} f(θ)),

whose adjoint (with respect to the canonical dot product of L²(ℝ^d)) is often referred to as the infinitesimal generator of the dynamics and writes

(18)    (Lf)(θ) = −⟨Σ(θ − θ∗), ∇f(θ)⟩ + (γ/(2n)) Σ_{i,j=1}^d [X⊤ Rx(θ)² X]_{ij} ∂²_{ij} f(θ),

for all test functions f : ℝ^d → ℝ sufficiently smooth. Recall furthermore that the evolution of expectations of observables (f(θt))t≥0 is given by Dynkin's formula [38, Lemma 3.2]: (d/dt) E[f(θt)] = E[(Lf)(θt)]. This identity enables, as in the deterministic case of the gradient flow, the use of Lyapunov functions, a very useful tool to study the asymptotic behavior of stochastic processes. This is the objective of the following lemma:
Lemma 4.5. Let V(θ) = (1/2) ∥θ − θ∗∥². Then, for all θ ∈ ℝ^d,

(19)    LV(θ) ≤ −2(1 − γK/2) L(θ) + 2σ².

In consequence, for γ ≤ 1/(3K), there exists a stationary process for the SDE (13).
The proof is postponed to the Appendix A.2.
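As a simple illustration of how the generator is used beyond Lyapunov arguments, applying the identity d/dt E[f(θt)] = E[(Lf)(θt)] to the linear observables f(θ) = ⟨v, θ⟩, whose second derivatives vanish, gives a closed evolution for the mean; this one-line computation is consistent with the remark on expected iterates after Theorem 4.3 and with Proposition 4.7 below.

```latex
\frac{d}{dt}\,\mathbb{E}[\langle v,\theta_t\rangle]
  = -\,\mathbb{E}\big[\langle \Sigma(\theta_t-\theta_*),\, v\rangle\big]
  \quad \text{for all } v \in \mathbb{R}^d,
\qquad\text{i.e.}\qquad
\frac{d}{dt}\,\mathbb{E}[\theta_t] = -\Sigma\big(\mathbb{E}[\theta_t]-\theta_*\big),
\quad\text{so}\quad
\mathbb{E}[\theta_t] = \theta_* + e^{-\Sigma t}\big(\theta_0-\theta_*\big).
```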

4.2.1. Invariant measure and convergence.


Any invariant measure ρ∞ satisfies the (elliptic) partial differential equation (PDE) L* ρ∞ = 0. As said before, if the multiplicative noise does not cancel, L* is uniformly elliptic and the uniqueness of such a measure is ensured. This is generically the case if n ≥ 2d; however, for the sake of completeness, we prove that in any case there exists a unique invariant measure (at the expense of a bit more technicality). Moreover, by ergodicity, the law of the iterates eventually converges towards this unique invariant measure. In the following result, we also provide a quantitative statement, in Wasserstein distance, on the speed of convergence of the dynamics towards such a measure.
Theorem 4.6. Let (θt)t≥0 follow the dynamics given by Eq. (13) initialized at θ0 distributed according to ρ0 ∈ P2(ℝ^d). Then, for γ < 1/(3K), there exists a unique stationary distribution ρ∗ ∈ P2(ℝ^d), and quantitatively,
(i) Parametric rate. For all t ≥ 0, we have

(20)    W₂²(ρt, ρ∗) ≤ W₂²(ρ0, ρ∗) e^{−2µ(1−2γK)t},

where µ > 0 is the smallest non-zero eigenvalue of Σ.
(ii) Non-parametric rate. For all t ≥ 0 and all α > 0,

(21)    W₂²(ρt, ρ∗) ≤ (1 / (W₂²(ρ0, ρ∗)^{−2/α} + Cα t))^α,

with Cα = (2(1−2γK)/α) (E[⟨θ0 − Θ∗, Σ^{−α}(θ0 − Θ∗)⟩] + (2γKα/(1−2γK)) W₂²(ρ0, ρ∗))^{−1/α}, where the expectation is taken w.r.t. the optimal Wasserstein coupling between θ0 ∼ ρ0 and Θ∗ ∼ ρ∗.
The proof of these results is deferred to the Appendix, Section A.2.1, and rests on coupling techniques that are standard in the study of the asymptotic behavior of SDEs [44] in the Wasserstein distance. One could adapt the argument to derive rates of convergence for other probabilistic distances that have a coupling representation, like the total variation [45]; however, due to the multiplicative noise, convergence rates in the "natural metric" given by L²(ρ∗) are more difficult to obtain. This could be the occasion for a future investigation (see the recent work [46] for a possibility to overcome this). Similarly to the noiseless case, there is a difference in time-scales of convergence between the early regime t ≪ 1/µ, which is polynomial, and the asymptotic regime t ≥ 1/µ, which is exponentially fast.

4.2.2. Localization of the invariant measure.


Above, we have shown that the law of the iterates converges towards a measure ρ∗ at a certain speed. We know that the invariant measure satisfies the equation L*ρ∗ = 0; yet, this does not really give any practical description of, or insight into, it. Notably, we want to understand how its localization depends on θ∗ and on the parameters σ², γ and X. This is the purpose of the following proposition.
Proposition 4.7. Let Θ∗ ∼ ρ∗. Then, for γ < 1/(3K), we have

(22)    E[Θ∗] = θ∗,   and   E[∥Θ∗ − θ∗∥²] ≤ γKσ² / (µ(1 − γK)).
The proof can be found in the Appendix, Section A.2.2. Note that, as expected, the deviation of Θ∗ from its mean goes to 0 as σ or γ goes to 0. Moreover, in the limit γ ≪ 1, the bound on E[∥Θ∗ − θ∗∥²] scales as γKσ²/µ, which reflects, up to constants, the scale given by the Gaussian calculation provided in Remark 4.8.
Remark 4.8. If we consider that the residuals equilibrate near the noise level σ > 0, we can model Rx ≃ σ In, so that the SDE dynamics rewrites

    dθt = −(1/n) X⊤(Xθt − y) dt + √(γσ²/n) X⊤ dBt.

This simple model is particularly convenient as it allows to compute the invariant distribution in closed form. Indeed, solving L*ρ∗ = 0 in this case gives, for all θ ∈ ℝ^d,

    ρ∗(θ) ∝ e^{−∥θ−θ∗∥²/(γσ²)}.

That is to say, ρ∗ is the Gaussian of mean θ∗ and covariance (γσ²/2) Id.
However, recall that Rx(·) is in fact linear; this implies that, for large ∥θ∥, the noise in the SDE amplifies. This has consequences on the tails of the distribution, see the next paragraph.
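For the record, the covariance in Remark 4.8 follows from the standard Lyapunov equation for the stationary covariance C of an Ornstein-Uhlenbeck process dθt = −A(θt − θ∗)dt + B dBt, namely AC + CA⊤ = BB⊤; here is a sketch of that one-line computation with A = Σ and BB⊤ = (γσ²/n) X⊤X = γσ²Σ.

```latex
\Sigma C + C\Sigma = \gamma\sigma^2\,\Sigma
\qquad\Longrightarrow\qquad
C = \frac{\gamma\sigma^2}{2}\, I_d ,
\qquad\text{i.e.}\qquad
\rho_*(\theta) \propto \exp\!\Big(-\tfrac12(\theta-\theta_*)^\top C^{-1}(\theta-\theta_*)\Big)
              = e^{-\|\theta-\theta_*\|^2/(\gamma\sigma^2)} .
```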

FIGURE 3. Four plots showing the trajectories of SGD in the noisy setting. The arrow of time goes from top left to bottom right. We see that the two variance reduction methods (time average and decaying step-sizes) converge towards θ∗ (confirming Propositions 4.10 and 4.11), while plain SGD has a stationary distribution with certain fluctuations around its mean θ∗, as explained in Theorem 4.6 and Proposition 4.7. Plain SGD reaches its invariant distribution faster than the variance reduction methods, as shown by the convergence rates provided in the results.

Heavy tails or not? Recall that at any time t > 0 all the moments of the law of θt exist (provided the initial distribution has such moments). Yet, depending on the step-size, moments of the invariant distribution might not exist. More precisely, we show that if one fixes a step-size γ > 0, all moments of the invariant distribution up to a certain value α(γ) exist and all higher moments do not. In other words, the step-size directly controls the tail of the asymptotic distribution.

Proposition 4.9. For n ≥ 2d and a fixed γ < 1/(3K), there exists α > 0 large enough such that

    E[∥Θ∗∥^α] = +∞.

The proof of this result can be found in the Appendix, Section A.2.2. This result is in accordance with the fact that multiplicative noise in SGD induces heavy tails in its asymptotic distribution [33, 12, 34].

4.2.3. Convergence of variance reduction techniques


If the aim is to have an estimator that converges at long times to the optimal θ∗, it is necessary to use variance reduction techniques: here we show how time averaging and step-size decay can be used in order to achieve this.
Ergodicity: convergence of the time average. Classical ergodic theorems tell us that the time average along a trajectory converges towards the spatial mean taken according to the invariant measure, i.e. (1/t) ∫₀ᵗ φ(θs) ds → Eρ∗[φ(Θ∗)] as t goes to infinity, for every smooth function φ. Taking φ to be the identity, we thus expect that θ̄t := (1/t) ∫₀ᵗ θs ds → Eρ∗[Θ∗] = θ∗ as t goes to infinity, as explained in Proposition 4.7. We quantify the convergence speed in the following.
Proposition 4.10. For all t ≥ 0, we have

    E[∥Σ(θ̄t − θ∗)∥²] ≤ 8γKσ²/t + 10∥θ0 − θ∗∥²/t².
Step-size decay. In this section, the step-size γ = γt depends on the time t and tends to 0 as t tends to ∞.
In this case, (θt )t≥0 eventually converges.
Proposition 4.11. Let γt = 1/(2K + t^α) with α > 1. Then we have

    E[∥θt − θ∗∥²] ≤ Cα / t^{α−1},

with Cα = e^{−α} E[∥θ0 − θ∗∥²] e^{(2(α−1)/µ)^{α−1}} + (2α/µ)^α σ² + 2^{1+α} Kσ²/(α−1).

This result of polynomial decay when the step-size decays polynomially is similar to the one provided in [18], and is re-proven here for the sake of completeness. For both results, the proofs are postponed to the Appendix, Section A.2.3. All results on convergence are illustrated in Figure 3.
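To illustrate the two variance reduction mechanisms on the SDE (13), here is a hedged Euler-Maruyama sketch comparing plain SGD, its running time average, and a variant with a decaying γt (with α > 1, scaling only the noise, in the spirit of Proposition 4.11); all numerical values and the precise decay schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 60, 20                        # underparametrized noisy setting (illustrative)
gamma, dt, T, alpha = 0.1, 2e-3, 20_000, 1.5

X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
theta_star = np.linalg.pinv(X) @ y   # ordinary least-squares estimator

def noise_factor(theta):
    r = X @ theta - y
    R = np.diag(r) - np.outer(r, np.ones(n)) / n
    return X.T @ R / np.sqrt(n)      # sigma(theta) of Eq. (13)

def run(decay):
    theta, theta_bar = np.zeros(d), np.zeros(d)
    for k in range(1, T + 1):
        t = k * dt
        g = 1.0 / (1.0 / gamma + t ** alpha) if decay else gamma   # gamma_t scales only the noise
        drift = -X.T @ (X @ theta - y) / n
        theta = theta + drift * dt + np.sqrt(g) * noise_factor(theta) @ (np.sqrt(dt) * rng.normal(size=n))
        theta_bar += (theta - theta_bar) / k                       # running time average
    return theta, theta_bar

plain, averaged = run(decay=False)
decayed, _ = run(decay=True)
for name, th in [("plain SGD", plain), ("time average", averaged), ("decaying step-size", decayed)]:
    print(name, np.linalg.norm(th - theta_star))
```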

5. ONLINE SGD (POPULATION LOSS)

In the previous section, on SGD in the empirical setting, we have shown that the core of the qualitative description relies on whether the noise can cancel or not. The situation is similar for online SGD. In fact, most of the calculations shown so far transfer almost immediately to this setting. This is the reason why we only show the results concerning convergence in the noisy and noiseless settings and briefly describe which properties are similar to the ones exhibited in the previous section.
First, we recall the main equation governing the dynamics (12):

    dθt = −Eρ[(⟨θt, X⟩ − Y) X] dt + √γ σ(θt) dBt,

where (Bt)t≥0 is a Brownian motion of ℝ^d and σ ∈ ℝ^{d×d} is given by

    σ(θ) := (Eρ[rX(θ)² XX⊤] − Eρ[rX(θ) X] Eρ[rX(θ) X]⊤)^{1/2} ∈ ℝ^{d×d}.

Let us define Σ = Eρ[XX⊤] ∈ ℝ^{d×d}, the covariance matrix, which we assume invertible, and θ∗ = Σ^{−1} Eρ[Y X] ∈ ℝ^d. We have, for all θ ∈ ℝ^d, Eρ[(⟨θ, X⟩ − Y) X] = Σ(θ − θ∗), and hence the dynamics writes:

    dθt = −Σ(θt − θ∗) dt + √γ σ(θt) dBt.
Despite the difference between the empirical and population learning setups, the story here is similar to what is described in Section 4. Indeed, the core of the behavior of the algorithm relies only on whether θ∗ also cancels the noise matrix σ, i.e. σ(θ∗) = 0. If this is the case, i.e. we are in the interpolation (or noiseless) regime, then the same analysis as in Subsection 4.1 applies and the movement resembles a multivariate geometric Brownian motion. If σ(θ) ≻ 0 uniformly, then there is a non-degenerate invariant measure and the same results as in Subsection 4.2 go through.

5.1. The noiseless case


Here we assume that the distribution ρ is such that Y = ⟨X, θ∗⟩ and that the features are almost surely bounded, i.e., ∥X∥² ≤ K for some K > 0. This means that we are in the noiseless/interpolation regime where the input/output distribution admits a linear interpolator. As previously, in this case, despite the randomness of the SDE, the convergence is almost sure towards θ∗, with explicit rates that we show below.
Theorem 5.1. Let (θt)t≥0 follow the dynamics given by Eq. (12) initialized at θ0 ∈ ℝ^d. Then, for γ < 1/(3K), ∥θt − θ∗∥ converges almost surely to 0 with the following rates.
(i) Parametric rate. For all t ≥ 0, we have

(23)    E[∥θt − θ∗∥²] ≤ ∥θ0 − θ∗∥² e^{−µ(2−Kγ)t},

where µ > 0 is the smallest non-zero eigenvalue of Σ.
(ii) Non-parametric rate. For all t ≥ 0 and all α > 0,

(24)    E[∥θt − θ∗∥²] ≤ (1 / (∥θ0 − θ∗∥^{−2/α} + Cα t))^α,

where we defined Cα = (1/(2α)) (⟨η0, Σ^{−α} η0⟩ + (γKα/(2 − Kγ)) ∥θ0 − θ∗∥²)^{−1/α} with η0 := θ0 − θ∗, and Kα = max_{i≤n} ⟨xi, Σ^{−α} xi⟩.

Hence, as in the case of the training dynamics, we have, for all t ≥ 0,

    E[∥θt − θ∗∥²] ≤ min( ∥θ0 − θ∗∥² e^{−µt}, inf_{α>0} (1 / (∥θ0 − θ∗∥^{−2/α} + Cα t))^α ),

and, if µ is non-zero but very small (e.g. 10^{−10}), this inequality describes well the difference between the transient regime of convergence, which is polynomial, and the asymptotic regime, which is exponential but occurring after the time-scale 1/µ. We recalled here the result as in the previous section for the sake of completeness, but the result and the proof of the theorem (which can be found in the Appendix, Section B.1) are rather similar to the training case. This shows that what really matters is the ability of the model to be interpolated, not the finite sample size.

5.2. The noisy case


In this section, we suppose that we are in the noisy regime. This corresponds to the existence of a constant a > 0 such that, for all θ ∈ ℝ^d, we have σ(θ)²/L(θ) ⪰ a² Id as well as L(θ) ≥ a², that is to say that neither the loss nor the multiplicative noise can cancel. Let us first give a concrete example of when this happens.

Example 5.2 (Gaussian model). Assume that X ∼ N(0, Σ) with Σ ⪰ µId, and that there exist θ∗ ∈ ℝ^d and ξ ∈ ℝ, a random variable independent of X with mean zero and variance 2σ², such that Y = ⟨θ∗, X⟩ + ξ. Then, by the fourth-order Gaussian moment calculation given in [14, Lemma 1], we have the exact expression

    σ(θ)² = (Σ(θ − θ∗))(Σ(θ − θ∗))⊤ + 2L(θ)Σ ⪰ 2µ L(θ) Id,

as well as L(θ) ≥ σ².
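Under the Gaussian model of Example 5.2, the closed form of σ(θ)² makes the population SDE (12) straightforward to simulate with a PSD matrix square root, as in Eq. (11); the following sketch uses arbitrary parameter values and an illustrative covariance Σ.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(5)
d, gamma, sigma2, dt, T = 10, 0.05, 0.5, 1e-3, 20_000   # illustrative parameters

Sigma = np.diag(1.0 / np.arange(1, d + 1))   # covariance with eigenvalue decay (assumption)
theta_star = rng.normal(size=d)

def loss(theta):
    # L(theta) = 1/2 <theta - theta_*, Sigma (theta - theta_*)> + sigma^2   (Example 2.2)
    delta = theta - theta_star
    return 0.5 * delta @ Sigma @ delta + sigma2

def noise_sqrt(theta):
    # sigma(theta)^2 = (Sigma(theta-theta_*))(Sigma(theta-theta_*))^T + 2 L(theta) Sigma
    g = Sigma @ (theta - theta_star)
    return np.real(sqrtm(np.outer(g, g) + 2.0 * loss(theta) * Sigma))   # PSD square root

theta = np.zeros(d)
for _ in range(T):
    dB = np.sqrt(dt) * rng.normal(size=d)
    theta = theta - Sigma @ (theta - theta_star) * dt + np.sqrt(gamma) * noise_sqrt(theta) @ dB

print("||theta_T - theta_*|| =", np.linalg.norm(theta - theta_star))
```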

Semi-group and Fokker-Planck equation. As for the training setup, due to the non-degeneracy of the noise term in this case, it is more convenient to track the dynamics of the probability measure ρt := Law(θt) for any t ≥ 0. We place ourselves in the situation ρ0 ∈ P2(ℝ^d) and recall that (ρt)t≥0 satisfies the (parabolic) Fokker-Planck equation (at least in the weak sense) ∂t ρt = L* ρt, with the operator L* being the adjoint (with respect to the canonical dot product of L²(ℝ^d)) of the infinitesimal generator of the dynamics, which writes

(25)    (Lf)(θ) = −⟨Σ(θ − θ∗), ∇f(θ)⟩ + (γ/2) Σ_{i,j=1}^d [σ(θ)σ(θ)⊤]_{ij} ∂²_{ij} f(θ),

for all test functions f : ℝ^d → ℝ sufficiently smooth. Recall furthermore that the evolution of expectations of observables (f(θt))t≥0 is given by the action of the generator:

(26)    (d/dt) E[f(θt)] = E[(Lf)(θt)].
5.2.1. Invariant measure and convergence.

As the multiplicative noise does not cancel, L* is uniformly elliptic and there is existence and uniqueness of the invariant measure ρ∞ of the SDE, which satisfies the PDE L*ρ∞ = 0. Moreover, by ergodicity, the law of the iterates eventually converges towards this unique invariant measure. In order to show this quantitatively, we first prove a useful lemma on the Lipschitz behavior of the multiplicative noise.
Lemma 5.3. There exists a constant c > 0, depending only on the distribution ρ, such that for all θ, η ∈ ℝ^d, we have

(27)    ∥σ(θ) − σ(η)∥²_HS ≤ 2cK ⟨Σ(θ − η), θ − η⟩.
The proof is in the Appendix, Section B.2. We are now ready to state the main theorem of the section, in which we provide a quantitative statement, in Wasserstein distance, on the speed of convergence of the dynamics towards the invariant measure.
Theorem 5.4. Let (θt)t≥0 follow the dynamics given by Eq. (12) initialized at θ0 ∈ ℝ^d. Then, for γ < 1/(Kc), there exists a unique stationary distribution ρ∗ ∈ P2(ℝ^d), and quantitatively,
(i) Parametric rate. For all t ≥ 0, we have

(28)    W₂²(ρt, ρ∗) ≤ W₂²(ρ0, ρ∗) e^{−2µ(1−γKc)t},

where µ > 0 is the smallest non-zero eigenvalue of Σ.
(ii) Non-parametric rate. Assume that we have, for all α > 0, the inequality

(29)    ∥Σ^{−α/2}(σ(θ) − σ(η))∥²_HS ≤ 2cα Kα ⟨Σ(θ − η), θ − η⟩;

then, for all t ≥ 0 and all α > 0,

(30)    W₂²(ρt, ρ∗) ≤ (1 / (W₂²(ρ0, ρ∗)^{−2/α} + Cα t))^α,

with Cα = (2(1−γcK)/α) (E[⟨θ0 − Θ∗, Σ^{−α}(θ0 − Θ∗)⟩] + (γcα Kα/(1−γcK)) W₂²(ρ0, ρ∗))^{−1/α}, where the expectation is taken w.r.t. the optimal Wasserstein coupling between θ0 ∼ ρ0 and Θ∗ ∼ ρ∗.
As before, the result is similar to the one obtained in the underparametrized regime for the training dynamics. This shows once again that what really matters is the ability of the model to be interpolated, not the finite sample size. The proof of the theorem can be found in the Appendix, Section B.2.

5.2.2. Further (expected) properties.


In the following paragraphs, we state the properties that the dynamics should satisfy, similarly to the case of the SDE on the training loss. The proofs could be adapted, but, to avoid being redundant, we state the expected properties directly without re-doing the work of the previous section. To be fair, we do not state these results as propositions and theorems. We will also abuse the notation c > 0, which plays the role of a universal constant in the expressions below.
It is possible to localize the invariant measure similarly to what has been done in Proposition 4.7. We give below estimates on its mean and the deviation around it. Indeed, if Θ∗ is a random variable distributed according to ρ∗, then we expect that

(31)    E[Θ∗] = θ∗,   and   E[∥Θ∗ − θ∗∥²] ≤ c γKσ²/µ.

We expect from ergodicity that time averages along a trajectory converge towards the spatial mean taken according to the invariant measure, i.e. θ̄t := (1/t) ∫₀ᵗ θs ds → Eρ∗[Θ∗] = θ∗ as t goes to infinity, as stated in the equation above. A quantification of this fact would give

    E[∥Σ(θ̄t − θ∗)∥²] ≤ c (γKσ² + ∥θ0 − θ∗∥²)/t.
Step-size decay would help cancel the noise at large times: indeed, choosing the step-size sequence as γt = 1/(K + t^α) for α > 1, we would get

    E[∥θt − θ∗∥²] ≤ c / t^{α−1}.
The heavy-tail phenomenon is also expected to hold, similarly to the training case. More precisely, for a fixed step-size γ > 0, all moments of the invariant distribution up to a certain value α(γ) should exist and all higher moments should not. In other words, the step-size directly controls the tail of the asymptotic distribution.

6. CONCLUSION AND PERSPECTIVES

In this article, we have shown how SGD can be efficiently modeled by an SDE that reflects its main qualitative and quantitative features: convergence speed, difference between the noisy and noiseless settings, and study of the asymptotic distribution. The specificity of the least-squares set-up enabled us to show some localization of the invariant measure ρ∗ in the noisy context; however, it seems possible to improve this understanding towards its precise shape, and some questions remain: is ρ∗ log-concave? Is it possible to characterize its covariance in order to better apprehend its shape? Also, regarding its heavy-tail behavior, it would be valuable to have a precise estimate of the exponent at which the moments of the stationary distribution explode.
This work has been done in order to convey the idea that the SDE framework can improve the understanding of the SGD dynamics. This is clear for the least-squares setup, yet the important question is to go beyond it and try to apply the same methodology to the non-convex dynamics arising from the training of non-linear neural networks. The study of single- or multi-index models [26, 47] could be a first step toward broadening this systematic study.
Acknowledgments. AS acknowledges support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – GZ 2047/1, Projekt-ID 390685813. AS and LP warmly thank the Simons Foundation and especially the Flatiron Institute for their support, as this research was initiated while AS was visiting New York. Finally, the authors extend their gratitude to the Incubateur de Fraîcheur for hosting them and providing an ideal atmosphere that fostered exceptional discussions.

REFERENCES

[1] Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and dynamics of stochastic gradient algorithms i:
Mathematical foundations. The Journal of Machine Learning Research, 20(1):1474–1520, 2019. 1, 2, 5
[2] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, 22:400–407, 1951. 1, 4
[3] Gilles Pagès. Sur quelques algorithmes récursifs pour les probabilités numériques. ESAIM: Probability and Statistics, 5:141–
170, 2001. 1
[4] Michel Benaïm. Dynamics of stochastic approximation algorithms. In Seminaire de probabilites XXXIII, pages 1–68.
Springer, 2006. 1, 5
[5] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems,
2008. 1, 4
[6] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires
rethinking generalization. Communications of the ACM, 64(3):107–115, 2021. 1
[7] Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with sgd. arXiv preprint arXiv:1802.08770, 2018.
1
[8] Maksym Andriushchenko, Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. Sgd with large step sizes
learns sparse features. In International Conference on Machine Learning, pages 903–925. PMLR, 2023. 1
[9] Jeff Z HaoChen, Colin Wei, Jason Lee, and Tengyu Ma. Shape matters: Understanding the implicit bias of the noise covari-
ance. In Conference on Learning Theory, pages 2315–2357. PMLR, 2021. 1
[10] Zeke Xie, Issei Sato, and Masashi Sugiyama. A diffusion theory for deep learning dynamics: Stochastic gradient descent
exponentially favors flat minima. arXiv preprint arXiv:2002.03495, 2020. 1
[11] Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for
deep networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–10. IEEE, 2018. 1
[12] Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu. The heavy-tail phenomenon in sgd. In International Conference on
Machine Learning, pages 3964–3975. PMLR, 2021. 1, 2, 12
[13] Stephan Wojtowytsch. Stochastic gradient descent with noise of machine learning type part i: Discrete time analysis. Journal
of Nonlinear Science, 33(3):45, 2023. 1, 5
[14] R. Berthier, F. Bach, and P. Gaillard. Tight nonparametric convergence rates for stochastic gradient descent under the noiseless
linear model. In Advances in Neural Information Processing Systems, 2020. 1, 2, 4, 8, 14
[15] Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. Last iterate convergence of sgd for least-squares in
the interpolation regime. Advances in Neural Information Processing Systems, 34:21581–21591, 2021. 1
[16] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural net-
works. Advances in neural information processing systems, 31, 2018. 1
[17] Zhiyuan Li, Sadhika Malladi, and Sanjeev Arora. On the validity of modeling sgd with stochastic differential equations (sdes).
Advances in Neural Information Processing Systems, 34:12712–12725, 2021. 2
[18] Xavier Fontaine, Valentin De Bortoli, and Alain Durmus. Convergence rates and approximation results for sgd and its
continuous-time counterpart. In Conference on Learning Theory, pages 1965–2058. PMLR, 2021. 2, 13
[19] Alnur Ali, Edgar Dobriban, and Ryan Tibshirani. The implicit regularization of stochastic gradient flow for least squares. In
International conference on machine learning, pages 233–244. PMLR, 2020. 2, 4, 6
[20] Liu Ziyin, Kangqiao Liu, Takashi Mori, and Masahito Ueda. Strength of minibatch noise in sgd. arXiv preprint
arXiv:2102.05375, 2021. 2
[21] Lei Wu, Chao Ma, et al. How sgd selects the global minima in over-parameterized learning: A dynamical stability perspective.
Advances in Neural Information Processing Systems, 31, 2018. 2
[22] Scott Pesme, Loucas Pillaud-Vivien, and Nicolas Flammarion. Implicit bias of sgd for diagonal linear networks: a provable
benefit of stochasticity. Advances in Neural Information Processing Systems, 34:29218–29230, 2021. 2, 5
[23] Loucas Pillaud-Vivien, Julien Reygner, and Nicolas Flammarion. Label noise (stochastic) gradient descent implicitly solves
the lasso for quadratic parametrisation. In Conference on Learning Theory, pages 2127–2159. PMLR, 2022. 2
[24] Takashi Mori, Liu Ziyin, Kangqiao Liu, and Masahito Ueda. Power-law escape rate of sgd. In International Conference on
Machine Learning, pages 15959–15975. PMLR, 2022. 2

[25] Stephan Wojtowytsch. Stochastic gradient descent with noise of machine learning type part ii: Continuous time analysis.
Journal of Nonlinear Science, 34(1):16, 2024. 2
[26] Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. High-dimensional limit theorems for sgd: Effective dynamics and
critical scaling. Advances in Neural Information Processing Systems, 35:25349–25362, 2022. 2, 16
[27] Courtney Paquette, Elliot Paquette, Ben Adlam, and Jeffrey Pennington. Homogenization of sgd in high-dimensions: Exact
dynamics and generalization properties. arXiv preprint arXiv:2205.07069, 2022. 2
[28] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational
Mathematics, 7(3):331–368, 2007. 2
[29] A. Dieuleveut and F. Bach. Nonparametric stochastic approximation with large step-sizes. Annals of Statistics, 44(4):1363–
1399, 2016. 2, 8
[30] L. Pillaud-Vivien, A. Rudi, and F. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through
multiple passes. Advances in Neural Information Processing Systems, 31:8114–8124, 2018. 2
[31] Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Generalization error rates in kernel regression: The
crossover from the noiseless to noisy regime. Advances in Neural Information Processing Systems, 34:10131–10143, 2021. 2
[32] Blake Bordelon and Cengiz Pehlevan. Learning curves for sgd on structured features. arXiv preprint arXiv:2106.02713, 2021.
2
[33] Liam Hodgkinson and Michael Mahoney. Multiplicative noise and heavy tails in stochastic optimization. In International
Conference on Machine Learning, pages 4262–4274. PMLR, 2021. 2, 12
[34] Zhe Jiao and Martin Keller-Ressel. Emergence of heavy tails in homogenized stochastic gradient descent. arXiv preprint
arXiv:2402.01382, 2024. 2, 12
[35] S. Vaswani, F. Bach, and M. Schmidt. Fast and faster convergence of sgd for over-parameterized models and an accelerated
perceptron. In International Conference on Artificial Intelligence and Statistics, pages 1195–1204. PMLR, 2019. 4
[36] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review,
60(2):223–311, 2018. 4
[37] J Harold, G Kushner, and George Yin. Stochastic approximation and recursive algorithm and applications. Application of
Mathematics, 35, 1997. 5
[38] Rafail Khasminskii. Stochastic stability of differential equations, volume 66. Springer Science & Business Media, 2011. 5, 7,
10, 23
[39] Julie Lyng Forman and Michael Sørensen. The pearson diffusions: A class of statistically tractable diffusion processes.
Scandinavian Journal of Statistics, 35(3):438–465, 2008. 7
[40] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of
the National Academy of Sciences, 117(48):30063–30070, 2020. 8
[41] Peter L Bartlett, Andrea Montanari, and Alexander Rakhlin. Deep learning: a statistical viewpoint. Acta numerica, 30:87–201,
2021. 8
[42] Junhong Lin and Lorenzo Rosasco. Optimal learning for multi-pass stochastic gradient methods. Advances in Neural
Information Processing Systems, 29, 2016. 8
[43] Avner Friedman. Partial differential equations of parabolic type. Courier Dover Publications, 2008. 10
[44] Max-K von Renesse and Karl-Theodor Sturm. Transport inequalities, gradient estimates, entropy and ricci curvature.
Communications on pure and applied mathematics, 58(7):923–940, 2005. 11
[45] Martin Hairer. Convergence of markov processes. Lecture notes, 18:26, 2010. 11
[46] Patrick Cattiaux and Arnaud Guillin. A journey with the integrated Γ2 criterion and its weak forms. In Geometric Aspects of
Functional Analysis: Israel Seminar (GAFA) 2020–2022, pages 167–208. Springer, 2023. 11
[47] Alberto Bietti, Joan Bruna, and Loucas Pillaud-Vivien. On learning gaussian multi-index models with gradient flow. arXiv
preprint arXiv:2310.19793, 2023. 16
[48] Cédric Villani et al. Optimal transport: old and new, volume 338. Springer, 2009. 23, 34
[49] Jonathan C Mattingly, Andrew M Stuart, and Michael V Tretyakov. Convergence of numerical time-averaging and stationary
measures via poisson equations. SIAM Journal on Numerical Analysis, 48(2):552–577, 2010. 29

ORGANIZATION OF THE APPENDIX

A. SGD on the training loss: Proofs of Section 4.

A.1: The noiseless case. This includes the proof of the fact that θ∗ is the orthogonal projection of θ0 onto I
(Lemma 4.2) and the proof of the convergence theorem of θ to θ∗ (Theorem 4.3).
A.2: The noisy case. We prove here the existence of a stationary distribution (Lemma 4.5).
A.2.1: Invariant measure and convergence. In this subsection, we prove the quantitative convergence to
the stationary distribution (Theorem 4.6) with the help of a technical lemma (Lemma A.1).
A.2.2: Localization of the invariant measure. We prove the insights given on the first and second mo-
ments of the stationary distribution (Proposition 4.7). Then, we show the moment explosion of the invariant
distribution (Proposition 4.9).
A.2.3: Convergence of variance reduction techniques. In this subsection, we prove the convergence of
the time-average of the iterates to θ∗, i.e. ergodicity (Proposition 4.10), as well as the convergence of θt to
θ∗ in the case of step-size decay (Proposition 4.11).

B: Online SGD: Proofs of Section 5.

B.1: The noiseless case. We give the proof of the convergence to θ∗ with rates (Theorem 5.1).
B.2: The noisy case. We first prove that the multiplicative noise carries some Lipschitz property (Lemma 5.3).
This allows us to prove the quantitative convergence to the stationary distribution (Theorem 5.4).

APPENDIX A. SGD ON THE TRAINING LOSS: PROOFS OF SECTION 4

A.1. The noiseless case


We begin by proving Lemma 4.2 on the fact that θ∗ is the orthogonal projection of θ0 onto I, that is
\[
\theta_* = \operatorname*{argmin}_{X\theta = y} \|\theta - \theta_0\|^2 .
\]

Lemma 4.2. The proof of the lemma follows from the Karush–Kuhn–Tucker conditions. Indeed, the argmin
is unique, being the projection of θ0 onto an affine set, and it satisfies that there exist Lagrange multipliers
λ ∈ Rⁿ such that
\[
\theta_* - \theta_0 = X^\top \lambda , \qquad \text{and} \qquad X\theta_* = y .
\]
This means that λ = (XX⊤)⁻¹(y − Xθ0), and hence
\[
\theta_* = \theta_0 + X^\top (XX^\top)^{-1}(y - X\theta_0) = X^\dagger y + (I - X^\dagger X)\theta_0 ,
\]
which corresponds to the given definition of θ∗. ■
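As a quick numerical sanity check (not part of the original argument), the closed form above can be verified directly; the sketch below uses hypothetical dimensions with n < d so that XX⊤ is invertible and the data is interpolable.

```python
import numpy as np

# Hypothetical overparametrized instance: n < d, so XX^T is invertible.
rng = np.random.default_rng(0)
n, d = 5, 12
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
theta0 = rng.standard_normal(d)

# Closed form of Lemma 4.2: theta_* = X^+ y + (I - X^+ X) theta0.
X_pinv = np.linalg.pinv(X)
theta_star = X_pinv @ y + (np.eye(d) - X_pinv @ X) @ theta0

# theta_* interpolates the data, and theta_* - theta0 = X^T lambda (KKT stationarity).
assert np.allclose(X @ theta_star, y)
lam = np.linalg.solve(X @ X.T, y - X @ theta0)
assert np.allclose(theta_star - theta0, X.T @ lam)
```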

We now prove the main convergence theorem of this section, Theorem 4.3.
Theorem 4.3. (i) The key ingredient of the proof is the Gronwall Lemma. Combining the Itô formula with (13) gives us
\[
\begin{aligned}
\frac{d}{dt}\,\mathbb{E}\|\theta_t - \theta_*\|^2
&= -2\,\mathbb{E}\langle \theta_t - \theta_*, \Sigma(\theta_t - \theta_*)\rangle + \frac{\gamma}{n}\,\mathbb{E}\operatorname{Tr}\!\big(XX^\top R_x^2(\theta_t)\big) \\
&\leq -2\,\mathbb{E}\langle \theta_t - \theta_*, \Sigma(\theta_t - \theta_*)\rangle + \frac{\gamma}{n}\,\mathbb{E}\sum_{i=1}^n \|x_i\|^2\big(\langle \theta_t, x_i\rangle - y_i\big)^2 \\
&\leq -2\,\mathbb{E}\langle \theta_t - \theta_*, \Sigma(\theta_t - \theta_*)\rangle + \gamma K\,\mathbb{E}\langle \theta_t - \theta_*, \Sigma(\theta_t - \theta_*)\rangle \\
&\leq -(2-\gamma K)\,\mathbb{E}\langle \theta_t - \theta_*, \Sigma(\theta_t - \theta_*)\rangle .
\end{aligned}
\]
By integrating the latter, we get
\[
\mathbb{E}\|\theta_t - \theta_*\|^2 = \|\theta_0 - \theta_*\|^2 - \frac{2}{n}\int_0^t \mathbb{E}\|X(\theta_u - \theta_*)\|^2\,du
+ \frac{\gamma}{n}\int_0^t \mathbb{E}\operatorname{Tr}\!\Big(X^\top\Big(\operatorname{diag}(r_x(\theta_u))^2 - \tfrac{1}{n}\, r_x(\theta_u) r_x(\theta_u)^\top\Big)X\Big)\,du .
\]
Note that
\[
\begin{aligned}
\frac{\gamma}{n}\int_0^t \mathbb{E}\operatorname{Tr}\!\big(X^\top \operatorname{diag}(r_x(\theta_u))^2 X\big)\,du
&= \frac{\gamma}{n}\int_0^t \mathbb{E}\sum_{i=1}^n r_x(\theta_u)_i^2\,(XX^\top)_{ii}\,du
= \frac{\gamma}{n}\int_0^t \mathbb{E}\sum_{i=1}^n r_x(\theta_u)_i^2\,\|x_i\|^2\,du \\
&\leq \frac{\gamma}{n}\int_0^t \max_i \|x_i\|^2\;\mathbb{E}\sum_{i=1}^n r_x(\theta_u)_i^2\,du
\;\leq\; 2\gamma K \int_0^t \mathbb{E} L(\theta_u)\,du ,
\end{aligned}
\]
and
\[
\frac{\gamma}{n^2}\int_0^t \mathbb{E}\operatorname{Tr}\!\big(X^\top r_x(\theta_u) r_x(\theta_u)^\top X\big)\,du
= \frac{\gamma}{n^2}\int_0^t \mathbb{E}\big[r_x(\theta_u)^\top XX^\top r_x(\theta_u)\big]\,du
\;\geq\; \lambda_{\min}(\Sigma)\,\frac{\gamma}{n}\int_0^t \mathbb{E}\big[r_x(\theta_u)^\top r_x(\theta_u)\big]\,du
= 2\gamma \lambda_{\min}(\Sigma)\int_0^t \mathbb{E} L(\theta_u)\,du .
\]
Collecting the estimates, we obtain
\[
\mathbb{E}\|\theta_t - \theta_*\|^2 \leq \|\theta_0 - \theta_*\|^2 - \big(4 + 2\gamma\lambda_{\min}(\Sigma) - 2\gamma K\big)\int_0^t \mathbb{E} L(\theta_u)\,du .
\]
We remark that θu − θ∗ ∈ Ran(X⊤). Indeed, we have θu − θ∗ = (θu − θ0) + (θ0 − θ∗): the first term on the
r.h.s. is in Ran(X⊤) by (13) and the second term also by (14). We recall that R^d = Ran(X⊤) ⊕ Ker(X), and
therefore X restricted to Ran(X⊤) is a bijection onto its image. Combining the two last facts yields
\[
\frac{\lambda_{\min}(\Sigma)}{2}\,\mathbb{E}\|\theta_u - \theta_*\|^2 \;\leq\; \frac{1}{2n}\,\mathbb{E}\|X(\theta_u - \theta_*)\|^2 \;=\; \mathbb{E} L(\theta_u) ,
\]
because XX⊤ has the same non-zero spectrum as X⊤X. All in all, we have
\[
\mathbb{E}\|\theta_t - \theta_*\|^2 \leq \|\theta_0 - \theta_*\|^2 - \lambda_{\min}(\Sigma)\big(2 + \gamma\lambda_{\min}(\Sigma) - \gamma K\big)\int_0^t \mathbb{E}\|\theta_u - \theta_*\|^2\,du ,
\]
and the first statement of the Theorem follows from the Gronwall Lemma. We move to the proof of the second
statement of the Theorem.
(ii) We define ηt := θt − θ∗ and recall that
\[
\mathbb{E}\|\eta_t\|^2 \leq \|\theta_0 - \theta_*\|^2 - 2(2-\gamma K)\int_0^t \mathbb{E} L(\theta_u)\,du .
\]
The first consequence of this inequality is that
\[
\int_0^t \mathbb{E} L(\theta_u)\,du \leq \frac{1}{2(2-\gamma K)}\,\|\theta_0 - \theta_*\|^2 .
\]
We want to lower bound the term EL(θu) without using the smallest eigenvalue of Σ, which may be arbitrarily
small. First note that X⊤X is symmetric and thus diagonalizable in an orthonormal basis (vi). We readily
check that Ker(X⊤X) = Ker(X), so that R^d = Ran(X⊤) ⊕ Ker(X⊤X). We denote by λ1 ≤ · · · ≤ λ_{d∗} the
non-zero eigenvalues of Σ (that is, 1/n times the non-zero eigenvalues of X⊤X), where d∗ ≤ d is their number,
and by (v1, . . . , v_{d∗}) the corresponding eigenvectors. For all t > 0, we thus have the decomposition
\(\eta_t = \sum_{k=1}^{d_*} \eta_k^t v_k\). Thanks to the Hölder inequality, we claim that the following holds true:
for p, q ∈ (0, 1) with p + q = 1, we have, for all t ≥ 0,
\[
\big(\mathbb{E}\|\eta_t\|^2\big)^{1/p} \leq 2\,\mathbb{E}[L(\theta_t)]\,\Big(\mathbb{E}\big[\langle \eta_t, \Sigma^{-p/q}\eta_t\rangle\big]\Big)^{q/p} . \tag{32}
\]
We move to the proof of the above claim. To lighten the notation, we write η_k^t = η_k in what follows, to wit
\[
\mathbb{E}\|\eta_t\|^2 = \sum_{k=1}^{d_*}\mathbb{E}\big[\eta_k^2\big]
= \sum_{k=1}^{d_*}\mathbb{E}\big[(\eta_k^2\lambda_k)^p(\eta_k^2\lambda_k^{-p/q})^q\big]
\leq \sum_{k=1}^{d_*}\big(\mathbb{E}[\eta_k^2\lambda_k]\big)^p\big(\mathbb{E}[\eta_k^2\lambda_k^{-p/q}]\big)^q ,
\]
thanks to the Hölder inequality with respect to the expectation. Then, applying once again the Hölder inequality for the sum, we have
\[
\mathbb{E}\|\eta_t\|^2 \leq \Big(\sum_{k=1}^{d_*}\mathbb{E}[\eta_k^2\lambda_k]\Big)^p\Big(\sum_{k=1}^{d_*}\mathbb{E}[\eta_k^2\lambda_k^{-p/q}]\Big)^q
= \big(2\,\mathbb{E} L(\theta_t)\big)^p\Big(\mathbb{E}\big[\langle \eta_t, \Sigma^{-p/q}\eta_t\rangle\big]\Big)^q ,
\]
and the claim (32) follows. We now prove that t ↦ E[⟨ηt, Σ^{−p/q}ηt⟩] is bounded. Indeed, we proceed as before with the Itô formula to get
\[
\begin{aligned}
\frac{d}{dt}\,\mathbb{E}\big[\langle \eta_t, \Sigma^{-p/q}\eta_t\rangle\big]
&= -2\,\mathbb{E}\langle \Sigma\eta_t, \Sigma^{-p/q}\eta_t\rangle + \frac{\gamma}{n}\,\mathbb{E}\operatorname{Tr}\!\big((X^\top R_x(\theta_t))^\top \Sigma^{-p/q} X^\top R_x(\theta_t)\big) \\
&= -2\,\mathbb{E}\langle \eta_t, \Sigma^{1-p/q}\eta_t\rangle + \frac{\gamma}{n}\,\mathbb{E}\operatorname{Tr}\!\Big(\Sigma^{-p/q} X^\top\Big(\operatorname{diag}(r_x(\theta_t))^2 - \tfrac{1}{n}\, r_x(\theta_t) r_x(\theta_t)^\top\Big)X\Big) \\
&\leq \frac{\gamma}{n}\,\mathbb{E}\sum_{i=1}^n \langle x_i, \Sigma^{-p/q}x_i\rangle\big(\langle \theta_t, x_i\rangle - y_i\big)^2 \\
&\leq 2\gamma K_{p/q}\,\mathbb{E} L(\theta_t) ,
\end{aligned}
\]
where we have used that for all i ∈ {1, . . . , n}, ⟨xi, Σ^{−p/q}xi⟩ ≤ K_{p/q}. Then, integrating with respect to t yields
\[
\mathbb{E}\big[\langle \eta_t, \Sigma^{-p/q}\eta_t\rangle\big] \leq \langle \eta_0, \Sigma^{-p/q}\eta_0\rangle + 2\gamma K_{p/q}\int_0^t \mathbb{E} L(\theta_u)\,du
\leq \langle \eta_0, \Sigma^{-p/q}\eta_0\rangle + \frac{\gamma K_{p/q}}{2-\gamma K}\,\|\theta_0 - \theta_*\|^2 .
\]
Hence, calling \(C = \frac{1}{2}\big(\langle \eta_0, \Sigma^{-p/q}\eta_0\rangle + \frac{\gamma K_{p/q}}{2-\gamma K}\|\theta_0 - \theta_*\|^2\big)^{-q/p}\), we have the inequality, for all t ≥ 0,
\[
\mathbb{E} L(\theta_t) \geq C\,\big(\mathbb{E}\|\eta_t\|^2\big)^{1/p} ,
\]
and this yields (up to replacing C by 2(2 − γK)C, which we still denote C) the inequality
\[
\mathbb{E}\|\eta_t\|^2 \leq \|\eta_0\|^2 - C\int_0^t \big(\mathbb{E}\|\eta_u\|^2\big)^{1/p}\,du , \tag{33}
\]
which implies, from a slight modification of the Gronwall Lemma, that for all t ≥ 0, we have
\[
\mathbb{E}\|\eta_t\|^2 \leq \Big(\|\eta_0\|^{-2(1/p-1)} + (1/p-1)\,C\,t\Big)^{-\frac{1}{1/p-1}} ;
\]
this gives the result claimed in the theorem. To see how the last inequality goes, we define
\(g(t) = \|\eta_0\|^2 - C\int_0^t(\mathbb{E}\|\eta_u\|^2)^{1/p}\,du\) (which is positive) and we rewrite (33) as
\[
\left(\frac{g'(t)}{-C}\right)^p \leq g(t) \iff \frac{g'(t)}{g(t)^{1/p}} \geq -C \implies \frac{1}{-1/p+1}\big(g(t)^{-1/p+1} - g(0)^{-1/p+1}\big) \geq -Ct ,
\]
and thus
\[
g(t) \leq \Big(-Ct\big(-\tfrac{1}{p}+1\big) + \|\eta_0\|^{2(-1/p+1)}\Big)^{\frac{1}{-1/p+1}} ,
\]
and we conclude with (33).


To prove the almost sure convergence, we use the Itô formula to obtain
\[
\begin{aligned}
\mathbb{E}\big[\|\theta_t - \theta_*\|^2 \,\big|\, \mathcal{F}_s\big]
= \|\theta_0 - \theta_*\|^2 &+ 2\sqrt{\tfrac{\gamma}{n}}\int_0^s \big\langle R_x(\theta_u)\,X(\theta_u - \theta_*),\, dB_u\big\rangle \\
&+ \mathbb{E}\bigg[\int_0^t \Big(\frac{\gamma}{n}\operatorname{Tr}\!\Big(X^\top\Big(\operatorname{diag}(r_x(\theta_u))^2 - \tfrac{1}{n}\,r_x(\theta_u)r_x(\theta_u)^\top\Big)X\Big) - \frac{2}{n}\big\langle \theta_u - \theta_*,\, X^\top X(\theta_u - \theta_*)\big\rangle\Big)du \,\bigg|\, \mathcal{F}_s\bigg] .
\end{aligned}
\]
For γ < 1/(2K), we have proven in (i) that the term inside the conditional expectation is non-positive. We can
thus overestimate the latter by integrating only from 0 to s, which gives
\[
\mathbb{E}\big[\|\theta_t - \theta_*\|^2 \,\big|\, \mathcal{F}_s\big] \leq \|\theta_s - \theta_*\|^2 ,
\]
so that ‖θt − θ∗‖² is a non-negative supermartingale, bounded in L¹, and therefore converges almost surely;
since E‖θt − θ∗‖² → 0 by (i), the almost sure limit is 0. ■
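To illustrate the statement (this is only an assumed, illustrative discretization, not part of the proof), one can integrate the SDE (13) with an Euler–Maruyama scheme on an interpolable instance and observe the trajectory converging to the projection θ∗ of Lemma 4.2; all dimensions and the step size below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 12                                  # n < d: the data can be interpolated
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
theta0 = rng.standard_normal(d)

Sigma = X.T @ X / n
X_pinv = np.linalg.pinv(X)
theta_star = X_pinv @ y + (np.eye(d) - X_pinv @ X) @ theta0     # projection of theta0 (Lemma 4.2)
K = np.max(np.sum(X**2, axis=1))
gamma = 0.1 / K                               # well below the 1/(2K) threshold used in the proof

def R_x(theta):
    """Symmetric square root of diag(r)^2 - r r^T / n with r = X theta - y (the SGD noise shape)."""
    r = X @ theta - y
    M = np.diag(r**2) - np.outer(r, r) / n
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

dt, T = 1e-3, 40.0
theta = theta0.copy()
for _ in range(int(T / dt)):
    dB = np.sqrt(dt) * rng.standard_normal(n)
    theta += -Sigma @ (theta - theta_star) * dt + np.sqrt(gamma / n) * X.T @ (R_x(theta) @ dB)

print("||theta_T - theta_*|| =", np.linalg.norm(theta - theta_star))   # decays towards 0
```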

A.2. The noisy case


We begin by showing why V(θ) := ½‖θ − θ∗‖² is a Lyapunov function for the dynamics. This is Lemma 4.5.

Lemma 4.5. First, as L is a quadratic function, we have that for all θ ∈ R^d,
\[
\langle \nabla L(\theta), \theta - \theta_*\rangle = 2\big(L(\theta) - L(\theta_*)\big) .
\]
We recall that X⊤Xθ∗ = X⊤y, so that ∇L(θ) = Σ(θ − θ∗). We thus deduce that for all θ ∈ R^d,
\[
\begin{aligned}
\mathcal{L}V(\theta) &= -\langle \Sigma(\theta - \theta_*), \theta - \theta_*\rangle + \frac{\gamma}{2n}\operatorname{Tr}\!\big[X^\top R_x^2(\theta)X\big] \\
&= 2\big(L(\theta_*) - L(\theta)\big) + \frac{\gamma}{2n}\operatorname{Tr}\!\big[X^\top R_x^2(\theta)X\big] \\
&\leq 2\big(L(\theta_*) - L(\theta)\big) + \frac{\gamma}{2n}\sum_{i=1}^n\big(\langle x_i,\theta\rangle - y_i\big)^2\|x_i\|^2 \\
&\leq 2\Big(L(\theta_*) - L(\theta)\big(1 - \tfrac{\gamma K}{2}\big)\Big) .
\end{aligned}
\]
This Lyapunov inequality implies the existence of a stationary distribution, as explained in [38, Theorem 3.7].
It also implies that, for all t ≥ 0, the dynamics does not explode. ■
A.2.1. Invariant measure and convergence
We now turn to proving the main theorem of this part, on the quantitative convergence to the stationary
distribution.

Theorem 4.6. (i) The Wasserstein contraction comes essentially from coupling arguments. Let γ ≤ K⁻¹ and
let ρ₀¹, ρ₀² ∈ P₂(R^d) be two possible initial distributions. Then, by [48, Theorem 4.1], there exists a couple of
random variables (θ₀¹, θ₀²) such that W₂²(ρ₀¹, ρ₀²) = E[‖θ₀¹ − θ₀²‖²]. Let (θt¹)_{t≥0} (resp. (θt²)_{t≥0}) be the solution
of the SDE (13) started from θ₀¹ (resp. θ₀²), both sharing the same Brownian motion (Bt)_{t≥0}. Then, for all t ≥ 0, the random variable
(θt¹, θt²) is a coupling between ρt¹ and ρt², and hence
\[
W_2^2(\rho_t^2, \rho_t^1) \leq \mathbb{E}\big[\|\theta_t^1 - \theta_t^2\|^2\big] .
\]
Moreover, denoting by ‖·‖_HS the Frobenius norm, the Itô formula gives
\[
\begin{aligned}
\frac{d}{dt}\,\mathbb{E}\big[\|\theta_t^1 - \theta_t^2\|^2\big]
&= -\frac{2}{n}\,\mathbb{E}\big\langle \theta_t^1 - \theta_t^2,\, X^\top(X\theta_t^1 - y) - X^\top(X\theta_t^2 - y)\big\rangle + \frac{\gamma}{n}\,\mathbb{E}\big\|X^\top R_x(\theta_t^1) - X^\top R_x(\theta_t^2)\big\|_{HS}^2 \\
&= -\frac{2}{n}\,\mathbb{E}\big\|X(\theta_t^1 - \theta_t^2)\big\|^2 + \frac{\gamma}{n}\,\mathbb{E}\big\|X^\top\big(R_x(\theta_t^1) - R_x(\theta_t^2)\big)\big\|_{HS}^2 \\
&\leq -\frac{2}{n}\,\mathbb{E}\big\|X(\theta_t^1 - \theta_t^2)\big\|^2
+ \frac{2\gamma}{n}\,\mathbb{E}\Big[\big\|X^\top \operatorname{diag}\big((\langle \theta_t^1 - \theta_t^2, x_i\rangle)_i\big)\big\|_{HS}^2 + \frac{1}{n^2}\big\|X^\top\big(\langle \theta_t^1 - \theta_t^2, x_i\rangle\big)_i \mathbf{1}^\top\big\|_{HS}^2\Big] .
\end{aligned}
\]
Furthermore, we have for all θ¹, θ² ∈ R^d that
\[
\big\|X^\top \operatorname{diag}\big((\langle \theta^1 - \theta^2, x_i\rangle)_i\big)\big\|_{HS}^2
= \operatorname{Tr}\!\Big(XX^\top \operatorname{diag}\big((\langle \theta^1 - \theta^2, x_i\rangle^2)_i\big)\Big)
= \sum_{i=1}^n \|x_i\|^2\langle \theta^1 - \theta^2, x_i\rangle^2
\leq K\,\big\|X(\theta^1 - \theta^2)\big\|^2 ,
\]
and
\[
\frac{1}{n^2}\big\|X^\top\big(\langle \theta^1 - \theta^2, x_i\rangle\big)_i \mathbf{1}^\top\big\|_{HS}^2
= \frac{1}{n}\operatorname{Tr}\!\Big(XX^\top\big(\langle \theta^1 - \theta^2, x_i\rangle\big)_i\big(\langle \theta^1 - \theta^2, x_i\rangle\big)_i^\top\Big)
\leq \frac{1}{n}\operatorname{Tr}\!\big(XX^\top\big)\operatorname{Tr}\!\Big(\big(\langle \theta^1 - \theta^2, x_i\rangle\big)_i\big(\langle \theta^1 - \theta^2, x_i\rangle\big)_i^\top\Big)
\leq K\,\big\|X(\theta^1 - \theta^2)\big\|^2 ,
\]
where we used, from the first to the second line, the inequality Tr(AB) ≤ Tr(A)Tr(B) for any A and B
positive semi-definite. Altogether, this gives the inequality:
\[
\frac{d}{dt}\,\mathbb{E}\big[\|\theta_t^1 - \theta_t^2\|^2\big] \leq -\frac{2 - 4\gamma K}{n}\,\mathbb{E}\big\|X(\theta_t^1 - \theta_t^2)\big\|^2 \tag{34}
\]
\[
\leq -2\mu(1 - 2\gamma K)\,\mathbb{E}\big\|\theta_t^1 - \theta_t^2\big\|^2 , \tag{35}
\]
and by the Gronwall Lemma, denoting c_γ = 2μ(1 − 2γK), this gives that
\[
W_2^2(\rho_t^1, \rho_t^2) \leq \mathbb{E}\big[\|\theta_t^1 - \theta_t^2\|^2\big] \leq e^{-c_\gamma t}\,\mathbb{E}\big[\|\theta_0^1 - \theta_0^2\|^2\big] = e^{-c_\gamma t}\,W_2^2(\rho_0^2, \rho_0^1) .
\]
Now, for all s ≥ 0, setting ρ₀¹ = ρ₀ ∈ P₂(R^d) and ρ₀² = ρs ∈ P₂(R^d), we have for all t ≥ 0,
\[
W_2^2(\rho_t, \rho_{t+s}) \leq e^{-c_\gamma t}\,W_2^2(\rho_0, \rho_s) ,
\]
which shows that the process (ρt)_{t≥0} is of Cauchy type, and since (P₂(R^d), W₂) is a Polish space, ρt →
ρ∗ ∈ P₂(R^d) as t grows to infinity. Now, since there exists a stationary solution to the process, let us fix
ρ₀¹ = ρ∗ ∈ P₂(R^d) and ρ₀² = ρ₀ ∈ P₂(R^d). We have then
\[
W_2^2(\rho_t, \rho_*) \leq e^{-c_\gamma t}\,W_2^2(\rho_0, \rho_*) ,
\]
which concludes the first part of the Theorem.

(ii) We will use the same steps as for the proof of Theorem 4.3 (ii). Again, one readily checks that for
p, q ∈ (0, 1) with p + q = 1, we have, for all t ≥ 0,
\[
\frac{1}{n}\,\mathbb{E}\big[\|X(\theta_t^1 - \theta_t^2)\|^2\big] \;\geq\; \frac{\big(\mathbb{E}\big[\|\theta_t^1 - \theta_t^2\|^2\big]\big)^{1/p}}{\big(\mathbb{E}\big[\langle \theta_t^1 - \theta_t^2, \Sigma^{-p/q}(\theta_t^1 - \theta_t^2)\rangle\big]\big)^{q/p}} . \tag{36}
\]
By the Itô formula, skipping the details, we get
\[
\begin{aligned}
\frac{d}{dt}\,&\mathbb{E}\big[\langle \theta_t^1 - \theta_t^2, \Sigma^{-p/q}(\theta_t^1 - \theta_t^2)\rangle\big] \\
&= -2\,\mathbb{E}\langle \theta_t^1 - \theta_t^2, \Sigma^{1-p/q}(\theta_t^1 - \theta_t^2)\rangle
+ \frac{\gamma}{n}\,\mathbb{E}\operatorname{Tr}\!\Big(\Sigma^{-p/q} X^\top\big(R_x(\theta^1) - R_x(\theta^2)\big)\big(R_x(\theta^1) - R_x(\theta^2)\big)^\top X\Big) \\
&\leq \frac{\gamma}{n}\,\mathbb{E}\operatorname{Tr}\!\Big(\Sigma^{-p/q} X^\top\big(R_x(\theta^1) - R_x(\theta^2)\big)\big(R_x(\theta^1) - R_x(\theta^2)\big)^\top X\Big)
= \frac{\gamma}{n}\,\mathbb{E}\big\|\big(R_x(\theta^1) - R_x(\theta^2)\big)^\top X\Sigma^{-p/(2q)}\big\|_{HS}^2 \\
&\leq \frac{2\gamma}{n}\,\mathbb{E}\Big[\big\|\Sigma^{-p/(2q)}X^\top \operatorname{diag}\big((\langle \theta_t^1 - \theta_t^2, x_i\rangle)_i\big)\big\|_{HS}^2 + \frac{1}{n^2}\big\|\Sigma^{-p/(2q)}X^\top\big(\langle \theta_t^1 - \theta_t^2, x_i\rangle\big)_i\mathbf{1}^\top\big\|_{HS}^2\Big] \\
&= \frac{2\gamma}{n}\,\mathbb{E}\Big[\operatorname{Tr}\!\Big(X\Sigma^{-p/q}X^\top \operatorname{diag}\big((\langle \theta^1 - \theta^2, x_i\rangle^2)_i\big)\Big) + \frac{1}{n}\operatorname{Tr}\!\Big(X\Sigma^{-p/q}X^\top\big(\langle \theta_t^1 - \theta_t^2, x_i\rangle\big)_i\big(\langle \theta_t^1 - \theta_t^2, x_i\rangle\big)_i^\top\Big)\Big] \\
&\leq \frac{2\gamma}{n}\,\mathbb{E}\Big[\sum_{i=1}^n\langle x_i, \Sigma^{-p/q}x_i\rangle\langle \theta^1 - \theta^2, x_i\rangle^2 + \frac{1}{n}\operatorname{Tr}\!\big(X\Sigma^{-p/q}X^\top\big)\operatorname{Tr}\!\Big(\big(\langle \theta_t^1 - \theta_t^2, x_i\rangle\big)_i\big(\langle \theta_t^1 - \theta_t^2, x_i\rangle\big)_i^\top\Big)\Big] ,
\end{aligned}
\]
the last line by the fact that Tr(AB) ≤ Tr(A)Tr(B) for A and B positive semi-definite. We thus conclude that
\[
\frac{d}{dt}\,\mathbb{E}\big[\langle \theta_t^1 - \theta_t^2, \Sigma^{-p/q}(\theta_t^1 - \theta_t^2)\rangle\big] \leq \frac{4\gamma K_{p/q}}{n}\,\mathbb{E}\big\|X(\theta^1 - \theta^2)\big\|^2 ,
\]
where we have used that for all i ∈ {1, . . . , n}, ⟨xi, Σ^{−p/q}xi⟩ ≤ K_{p/q}. In addition, by (34), we get
\[
\int_0^t \mathbb{E}\big[\|X(\theta_u^1 - \theta_u^2)\|^2\big]\,du \leq \frac{n}{2 - 4\gamma K}\,\mathbb{E}\|\theta_0^1 - \theta_0^2\|^2 .
\]
Collecting the estimates, we thus get
\[
\frac{1}{n}\,\mathbb{E}\big[\|X(\theta_t^1 - \theta_t^2)\|^2\big] \geq C\,\big(\mathbb{E}\big[\|\theta_t^1 - \theta_t^2\|^2\big]\big)^{1/p} ,
\]
with \(C = \Big(\mathbb{E}\big[\langle \theta_0^1 - \theta_0^2, \Sigma^{-p/q}(\theta_0^1 - \theta_0^2)\rangle\big] + \frac{2\gamma K_{p/q}}{1 - 2\gamma K}\,\mathbb{E}\big(\|\theta_0^1 - \theta_0^2\|^2\big)\Big)^{-q/p}\), which, combined with (34), gives
\[
\mathbb{E}\big[\|\theta_t^1 - \theta_t^2\|^2\big] \leq \mathbb{E}\big[\|\theta_0^1 - \theta_0^2\|^2\big] - C(2 - 4\gamma K)\int_0^t \big(\mathbb{E}\|\theta_u^1 - \theta_u^2\|^2\big)^{1/p}\,du .
\]
This implies, from a slight modification of the Gronwall Lemma (and writing again C for 2C(1 − 2γK)), that for all t ≥ 0,
\[
\mathbb{E}\big[\|\theta_t^1 - \theta_t^2\|^2\big] \leq \Big[\big(\mathbb{E}\big[\|\theta_0^1 - \theta_0^2\|^2\big]\big)^{1-1/p} + (1/p - 1)\,C\,t\Big]^{\frac{1}{1-1/p}} ,
\]
and we conclude as for (i). ■
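The synchronous-coupling argument above can also be illustrated numerically (purely as an assumed, illustrative experiment): two Euler–Maruyama trajectories of the SDE (13), started from distant points but driven by the same Brownian increments, should see their squared gap decay roughly like e^{−c_γ t}. Dimensions and step size below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 5                                    # noisy case: n > d, the data is not interpolable
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)
Sigma = X.T @ X / n
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
K = np.max(np.sum(X**2, axis=1))
gamma = 0.05 / K

def R_x(theta):
    r = X @ theta - y
    M = np.diag(r**2) - np.outer(r, r) / n
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def em_step(theta, dB, dt):
    return theta - Sigma @ (theta - theta_star) * dt + np.sqrt(gamma / n) * X.T @ (R_x(theta) @ dB)

dt, T = 1e-3, 10.0
th1 = theta_star + 3.0 * rng.standard_normal(d)
th2 = theta_star + 3.0 * rng.standard_normal(d)
gap0 = np.linalg.norm(th1 - th2) ** 2
for _ in range(int(T / dt)):
    dB = np.sqrt(dt) * rng.standard_normal(n)   # shared Brownian increment: synchronous coupling
    th1, th2 = em_step(th1, dB, dt), em_step(th2, dB, dt)

print("squared gap ratio:", np.linalg.norm(th1 - th2) ** 2 / gap0)   # expected to be << 1
```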

A.2.2. Localization of the invariant measure.

We now turn to the localization of the invariant measure presented in Proposition 4.7.

Proposition 4.7. We proved that θt converges weakly to Θ∗ for γ ≤ 1/K. To ensure that the first and second
moments of θt converge to the first two moments of Θ∗, we first prove that there exists M such that, for all t,
E‖θt − θ∗‖⁴ ≤ M. By the Itô formula,
\[
\begin{aligned}
\frac{d}{dt}\,\mathbb{E}[V(\theta_t)] = \frac{d}{dt}\,\mathbb{E}\big[\tfrac12\|\theta_t - \theta_*\|^2\big]
&\leq \mathbb{E}\Big[-\frac{1}{n}\|X(\theta_t - \theta_*)\|^2 + \frac{\gamma K}{2n}\|X\theta_t - y\|^2\Big] \\
&\leq \mathbb{E}\Big[\Big(-\frac{1}{n} + \frac{\gamma K}{n}\Big)\|X(\theta_t - \theta_*)\|^2 + \frac{\gamma K}{n}\|X\theta_* - y\|^2\Big] \\
&\leq (-2 + 2\gamma K)\mu\,\mathbb{E}[V(\theta_t)] + 2\gamma K\sigma^2 ,
\end{aligned}
\]
the second line by the triangle inequality and the third line because L(θ∗) = σ². This implies by the Gronwall
Lemma that
\[
\mathbb{E}[V(\theta_t)] \leq e^{-2(1-\gamma K)\mu t}\Big(\tfrac12\|\theta_0 - \theta_*\|^2 - \frac{\gamma K\sigma^2}{(1-\gamma K)\mu}\Big) + \frac{\gamma K\sigma^2}{(1-\gamma K)\mu} , \tag{37}
\]
thus t ↦ E[‖θt‖²] is uniformly upper bounded for γ < 1/K. Again, by the Itô formula, we get
\[
\begin{aligned}
\frac{d}{dt}\,\mathbb{E}\|\theta_t - \theta_*\|^4
&= -4\,\mathbb{E}\big[\|\theta_t - \theta_*\|^2\langle \theta_t - \theta_*, \Sigma(\theta_t - \theta_*)\rangle\big]
+ \frac{\gamma}{2n}\,\mathbb{E}\operatorname{Tr}\!\Big(X^\top R_x(\theta_t)R_x(\theta_t)^\top X\Big(8(\theta_t - \theta_*)(\theta_t - \theta_*)^\top + 4\|\theta_t - \theta_*\|^2 I_d\Big)\Big) \\
&\leq -\frac{4}{n}\,\mathbb{E}\big[\|\theta_t - \theta_*\|^2\|X(\theta_t - \theta_*)\|^2\big]
+ \frac{\gamma}{2n}\,\mathbb{E}\operatorname{Tr}\!\Big(X^\top \operatorname{diag}(r_x(\theta_t))^2 X\Big(8(\theta_t - \theta_*)(\theta_t - \theta_*)^\top + 4\|\theta_t - \theta_*\|^2 I_d\Big)\Big) \\
&\leq -\frac{4}{n}\,\mathbb{E}\big[\|\theta_t - \theta_*\|^2\|X(\theta_t - \theta_*)\|^2\big]
+ \frac{8\gamma}{2n}\,\mathbb{E}\operatorname{Tr}\!\Big(X^\top \operatorname{diag}(r_x(\theta_t))^2 X\,(\theta_t - \theta_*)(\theta_t - \theta_*)^\top\Big) + \frac{4\gamma K}{2n}\,\mathbb{E}\big[\|\theta_t - \theta_*\|^2\|X\theta_t - y\|^2\big] \\
&\leq -\frac{4}{n}\,\mathbb{E}\big[\|\theta_t - \theta_*\|^2\|X(\theta_t - \theta_*)\|^2\big] + \frac{6\gamma K}{n}\,\mathbb{E}\big[\|X\theta_t - y\|^2\|\theta_t - \theta_*\|^2\big] ,
\end{aligned}
\]
the last line by the fact that Tr(AB) ≤ Tr(A)Tr(B) for A and B positive semi-definite. We then deduce with
the triangle inequality that
\[
\begin{aligned}
\frac{d}{dt}\,\mathbb{E}\|\theta_t - \theta_*\|^4
&\leq \Big(-\frac{4}{n} + \frac{12\gamma K}{n}\Big)\mathbb{E}\big[\|\theta_t - \theta_*\|^2\|X(\theta_t - \theta_*)\|^2\big] + \frac{12\gamma K}{n}\,\mathbb{E}\big[\|X\theta_* - y\|^2\|\theta_t - \theta_*\|^2\big] \\
&\leq (-4 + 12\gamma K)\mu\,\mathbb{E}\|\theta_t - \theta_*\|^4 + 24\gamma K\sigma^2\,\mathbb{E}\|\theta_t - \theta_*\|^2 ;
\end{aligned}
\]
the last term is bounded by (37), and we conclude with the Gronwall Lemma, for γ < 1/(3K), the claim on the fourth
moment.
For the mean, we consider the equation satisfied by E(θt): d/dt E(θt) = −∇L(E(θt)), by linearity of
θ ↦ ∇L(θ). Hence, we have
\[
\frac{d}{dt}\,\tfrac12\|\mathbb{E}(\theta_t) - \theta_*\|^2 = -\langle \Sigma(\mathbb{E}(\theta_t) - \theta_*), \mathbb{E}(\theta_t) - \theta_*\rangle \leq -\mu\,\|\mathbb{E}(\theta_t) - \theta_*\|^2 ,
\]
which gives that E(θt) → θ∗ as t goes to infinity. Combining the latter with the boundedness of the fourth
moment (one actually only needs a 1 + ϵ moment) and the weak convergence of θt to Θ∗ implies the first
claim of the proposition. By (37), we recall that
\[
\mathbb{E}[V(\theta_t)] \leq e^{-2(1-\gamma K)\mu t}\Big(\tfrac12\,\mathbb{E}\|\theta_0 - \theta_*\|^2 - \frac{\gamma K\sigma^2}{(1-\gamma K)\mu}\Big) + \frac{\gamma K\sigma^2}{(1-\gamma K)\mu} ,
\]
which, combined with the boundedness of the fourth moment, implies, by taking the limit in the inequality
and using the weak convergence of ρt towards ρ∗, that Θ∗ ∼ ρ∗ satisfies the claimed inequality in the propo-
sition. ■
We now turn to proving the moment explosion of the invariant distribution. This corresponds to proving
Proposition 4.9. First, we need the following lemma, which aims at lower bounding a quartic form of θ.
Lemma A.1. For n ≥ 2d, there exists a constant c > 0 that depends only on X, y such that for all η ∈ R^d,
\[
\langle X\eta, R_x^2 X\eta\rangle \geq c\,\|\eta\|^2\|X\eta\|^2 .
\]
We assume the validity of Lemma A.1 for the time being and turn to the proof of Proposition 4.9.
Proof of Proposition 4.9. Using Itô's Lemma, we obtain that
\[
\frac{d}{dt}\,\mathbb{E}\big[\tfrac12\|\eta_t\|^{2\alpha}\big]
= -\alpha\,\mathbb{E}\big[\|\eta_t\|^{2(\alpha-1)}\langle \Sigma\eta_t, \eta_t\rangle\big]
+ \frac{\gamma\alpha(\alpha-1)}{n}\,\mathbb{E}\big[\|\eta_t\|^{2(\alpha-2)}\operatorname{Tr}\!\big(X^\top R_x^2 X\,\eta_t\eta_t^\top\big)\big]
+ \frac{\gamma\alpha}{2n}\,\mathbb{E}\big[\|\eta_t\|^{2(\alpha-1)}\operatorname{Tr}\!\big(X^\top R_x^2 X\big)\big] .
\]
We suppose that ‖Θ∗‖ has moments of order 2α. We take η₀ = Θ∗, so that the left-hand side vanishes and,
dropping the last (non-negative) term, we obtain
\[
\mathbb{E}\big[\|\Theta_*\|^{2(\alpha-1)}\langle \Sigma\Theta_*, \Theta_*\rangle\big] \geq \frac{\gamma(\alpha-1)}{n}\,\mathbb{E}\big[\|\Theta_*\|^{2(\alpha-2)}\operatorname{Tr}\!\big(X^\top R_x^2 X\,\Theta_*\Theta_*^\top\big)\big] .
\]
Using Lemma A.1, we deduce that
\[
\mathbb{E}\|\Theta_*\|^{2\alpha} \geq \frac{\mu}{\lambda}\,c\,\gamma(\alpha-1)\,\mathbb{E}\|\Theta_*\|^{2\alpha} ,
\]
which leads to a contradiction for α large enough; we thus deduce that ‖Θ∗‖ has no moments of order
2α for α > 1 + λ/(μcγ). ■
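A crude numerical probe of this heavy-tail phenomenon (an assumed, illustrative experiment, not a substitute for the proposition) is to integrate the SDE for a long time at two values of γ and compare normalized high-order empirical moments of ‖θ − θ∗‖: for the larger step size these are typically much larger, reflecting the loss of high moments at stationarity. All parameters below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 12, 3                                    # n >= 2d, as in Lemma A.1
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + rng.standard_normal(n)
Sigma = X.T @ X / n
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
K = np.max(np.sum(X**2, axis=1))

def simulate_norms(gamma, T=100.0, dt=2e-3, burn=20.0):
    """Euler-Maruyama trajectory of the training-loss SDE; returns ||theta_t - theta_*|| after burn-in."""
    theta, out = theta_star.copy(), []
    for k in range(int(T / dt)):
        r = X @ theta - y
        M = np.diag(r**2) - np.outer(r, r) / n
        w, V = np.linalg.eigh(M)
        Rx = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
        dB = np.sqrt(dt) * rng.standard_normal(n)
        theta = theta - Sigma @ (theta - theta_star) * dt + np.sqrt(gamma / n) * X.T @ (Rx @ dB)
        if k * dt > burn:
            out.append(np.linalg.norm(theta - theta_star))
    return np.array(out)

for gamma in (0.02 / K, 0.3 / K):
    eta = simulate_norms(gamma)
    m2, m8 = np.mean(eta**2), np.mean(eta**8)
    print(f"gamma*K = {gamma*K:.2f}   normalized 8th moment m8/m2^4 = {m8 / m2**4:.1f}")
```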
It remains to prove Lemma A.1. For ease of writing and clarity, we now denote by r = Xθ − y ∈ Rⁿ the
residual and by σ = Xθ∗ − y ∈ Rⁿ the constant noise vector, and we write in this proof R_x² ≡ R_r² and
Xη = r − σ to emphasize that everything here can be expressed as a function of the residuals r.
Proof of Lemma A.1. We first give the kernel of R_r² as a function of r. Let I = {i ∈ {1, . . . , n} : ri = 0} and
α_r = (1/r₁, 1/r₂, . . . , 1/r_n)⊤; then
(i) if I ≠ ∅, Ker(R_r²) = span((ei)_{i∈I});
(ii) if I = ∅, Ker(R_r²) = span(α_r).
We fix r ∈ Rⁿ and let z ∈ Ker(R_r²), that is
\[
\Big(\operatorname{diag}(r)^2 - \frac{1}{n}\,rr^\top\Big)z = 0 ,
\]
i.e., for all i ∈ {1, . . . , n},
\[
r_i^2 z_i = \frac{\langle r, z\rangle}{n}\,r_i .
\]
If I = ∅, we have the relationship, for all i, j ∈ {1, . . . , n}, \(r_i z_i = \frac{\langle r, z\rangle}{n} = r_j z_j\), which instantly
gives the result. Otherwise, if I ≠ ∅, for all i ∈ I^c we have \(r_i z_i = \frac{\langle r, z\rangle}{n}\), and summing these
equalities over i ∈ I^c, we have
\[
\langle r, z\rangle = \sum_{i\in I^c} r_i z_i = (n - |I|)\,\frac{\langle r, z\rangle}{n} ,
\]
which implies, since |I| ≥ 1, that z ⊥ r. Recalling that for i ∈ I^c we then have \(z_i = \frac{\langle r, z\rangle}{r_i n} = 0\),
(i) and (ii) follow.
Let us remark that r ↦ R_r² carries the homogeneity property that for all r ∈ Rⁿ,
\[
R^2_{r/\|r\|} = R_r^2/\|r\|^2 ,
\]
so that, denoting the sphere S^{n−1} := {v ∈ Rⁿ : ‖v‖ = 1}, for u = r/‖r‖ ∈ S^{n−1} and
v = (r − σ)/‖r − σ‖ ∈ S^{n−1} ∩ Ran(X) (note that v is defined if and only if η is not equal to 0), we have
\[
\frac{\langle X\eta, R_x^2 X\eta\rangle}{\|X\eta\|^2\,\|X\eta + \sigma\|^2} = \langle v, R_u^2 v\rangle . \tag{38}
\]
Now we also need the following intermediate result: let Π_u be the orthogonal projector onto Ker(R_u²)^⊥;
then there exists a constant c > 0 that depends only on the data such that for all η ∈ R^d,
\[
\|\Pi_u v\| \geq c .
\]
There are two cases, given by (i) and (ii), which allow us to compute Π_u explicitly for all u ∈ Rⁿ:
(i) If I ≠ ∅, we have
\[
\|\Pi_u v\|^2 = \frac{\big\langle \sum_{i\in I^c} x_i x_i^\top \eta, \eta\big\rangle}{\big\langle \sum_{i=1}^n x_i x_i^\top \eta, \eta\big\rangle} \geq c ,
\]
because any covariance matrix of the form \(\sum_{i\in I^c} x_i x_i^\top\) is invertible if |I^c| ≥ d, which is verified for n ≥ 2d.
(ii) If I = ∅,
\[
\|\Pi_u v\|^2 = 1 - \frac{\langle v, \alpha_r\rangle^2}{\|\alpha_r\|^2} = 1 - \varphi(\eta) ,
\quad\text{where}\quad
\varphi(\eta) = \frac{\Big(\sum_{i=1}^n \frac{\langle x_i, \eta\rangle}{\langle x_i, \eta\rangle - \sigma_i}\Big)^2}{\Big(\sum_{i=1}^n \langle x_i, \eta\rangle^2\Big)\Big(\sum_{i=1}^n \big(\langle x_i, \eta\rangle - \sigma_i\big)^{-2}\Big)} .
\]
From the Cauchy–Schwarz inequality, φ ∈ [0, 1]; let m = sup_{η∈R^d} φ(η). We show below that m < 1.
First, let us study the situation at infinity. For this, we write η = η̄‖η‖, where η̄ = η/‖η‖ ∈ S^{d−1}, and set
J = {i ∈ {1, . . . , n} : ⟨xi, η̄⟩ = 0}. We know that |J| ≤ d. We fix η̄ and study the limit of φ as ‖η‖ grows.
If J ≠ ∅, we have
\[
\varphi(\eta) \underset{\|\eta\|\to\infty}{\sim} \frac{(n - |J|)^2}{\Big(\sum_{i\in J^c}\langle x_i, \eta\rangle^2\Big)\Big(\sum_{i\in J}\sigma_i^{-2}\Big)} .
\]
Hence, if J ≠ ∅, then φ(η) → 0 when ‖η‖ → ∞. Now, if J = ∅,
\[
\varphi(\eta) \underset{\|\eta\|\to\infty}{\sim} \frac{n^2}{\Big(\sum_{i=1}^n\langle x_i, \bar\eta\rangle^2\Big)\Big(\sum_{i=1}^n\langle x_i, \bar\eta\rangle^{-2}\Big)} =: \phi(\bar\eta) .
\]
Moreover, M := sup_{η̄∈S^{d−1}} ϕ(η̄) < 1: indeed, if M = 1, the supremum would be attained at a certain η̄∗ by compact-
ness of S^{d−1} and continuity of ϕ, so that ϕ(η̄∗) = 1; by the equality case of the Cauchy–Schwarz inequality,
this corresponds to the existence of λ ∈ R such that |Xη̄∗| = λ coordinate-wise, which has no solution for n > d
(recall that, because η̄∗ ∈ S^{d−1}, λ = 0 offers no solution either). We conclude from this that M < 1.
Now, either the supremum of φ is M, and in this case we have m = M < 1 and we are done with the
proof; or m > M, and in this case the supremum is attained in a compact set of R^d, hence attained
at a certain η∗ ∈ R^d by continuity of φ. As previously, if m = 1, this corresponds to the equality case of
Cauchy–Schwarz and hence to the existence of λ ∈ R such that, for all i ∈ {1, . . . , n},
\[
\big(\langle x_i, \eta_*\rangle - \sigma_i\big)^2\langle x_i, \eta_*\rangle^2 = \lambda ,
\]
which has no solution generically for n > d. Finally, m = 1 is impossible and we have proven that m < 1,
which corresponds to c = 1 − m > 0.
Now, going back to Eq. (38), remark that, writing Π̄_u v := Π_u v/‖Π_u v‖,
\[
\langle v, R_u^2 v\rangle = \langle \Pi_u v, R_u^2\,\Pi_u v\rangle = \|\Pi_u v\|^2\,\langle \bar\Pi_u v, R_u^2\,\bar\Pi_u v\rangle
\geq c\,\langle \bar\Pi_u v, R_u^2\,\bar\Pi_u v\rangle \geq c\,\lambda_{\min}\big(R_u^2|_{\operatorname{Ker}(R_u^2)^\perp}\big) .
\]
Now define Ξ : S^{n−1} → R_{≥0} by \(\Xi(u) = \lambda_{\min}\big(R_u^2|_{\operatorname{Ker}(R_u^2)^\perp}\big)\). We look for ℓ := min_{u∈S^{n−1}} Ξ(u). As Ξ is
a continuous function defined on a compact set, it attains its minimum within this compact. If ℓ = 0, this
means that there exists u∗ ∈ S^{n−1} such that \(\Xi(u_*) = \lambda_{\min}\big(R_{u_*}^2|_{\operatorname{Ker}(R_{u_*}^2)^\perp}\big) = 0\), which is impossible by the
fact that we have restricted R_u² orthogonally to its kernel. Hence ℓ > 0 and finally we have
\[
\frac{\langle X\eta, R_x^2 X\eta\rangle}{\|X\eta\|^2\,\|X\eta + \sigma\|^2} = \langle v, R_u^2 v\rangle \geq c\ell > 0 ,
\]
which leads to the proof of the lemma, considering that there exists K̃ > 0 such that ‖Xη + σ‖² ≥ K̃‖η‖²,
and renaming c := cℓK̃. Indeed, combining the fact that ⟨Σηt, ηt⟩ ≤ 4(L(θt) + L(θ∗)) by the triangle inequality
with the fact that L(θ∗) ≤ L(θt), we obtain that ⟨Σηt, ηt⟩ ≤ 8L(θt), which implies that
\[
\langle \eta_t, \eta_t\rangle \leq \frac{4}{\mu n}\,\|X\eta + \sigma\|^2 . \qquad\blacksquare
\]
A.2.3. Convergence of variance reduction techniques

We now turn to the convergence of the time-average of the iterates. This corresponds to the proof of
Proposition 4.10.

Proposition 4.10. The idea we use to show the quantitative convergence of the averaged iterates comes
from [49] and the use of the Poisson equation. Indeed, let us define the map φ : R^d → R^d, φ(θ) = Σ⁻¹θ,
and denote by (φi(θ))_{i∈{1,...,d}} its coordinates. Note that, for all i ∈ {1, . . . , d}, we have ∇φi(θ) = Σ⁻¹ei, where
(ei)_{i∈{1,...,d}} is the canonical basis of R^d. Hence, we have the Poisson equation, for all i ∈ {1, . . . , d} and θ ∈ R^d,
\[
\mathcal{L}\varphi_i(\theta) = -\langle \Sigma(\theta - \theta_*), \Sigma^{-1}e_i\rangle = -(\theta_i - \theta_{*,i}) .
\]
Hence, considering the action of L on vector fields of R^d applied coordinate-wise, we have more generally
Lφ(θ) = −(θ − θ∗). Thus, we have by Itô calculus that for all s ≥ 0,
\[
d\varphi(\theta_s) = -(\theta_s - \theta_*)\,ds + \sqrt{\tfrac{\gamma}{n}}\,\Sigma^{-1}X^\top R_x(\theta_s)\,dB_s ,
\]
and thus, integrating from 0 to t, dividing by t, and defining the martingale \(M_t = \frac{1}{\sqrt{n}}X^\top\int_0^t R_x(\theta_s)\,dB_s\),
we have
\[
\bar\theta_t - \theta_* = -\frac{1}{t}\big(\varphi(\theta_t) - \varphi(\theta_0)\big) + \frac{\sqrt{\gamma}}{t}\,\Sigma^{-1}M_t
= -\frac{\Sigma^{-1}(\theta_t - \theta_0)}{t} + \frac{\sqrt{\gamma}}{t}\,\Sigma^{-1}M_t .
\]
Hence, multiplying by Σ, taking norms and then expectations, we have
\[
\mathbb{E}\big\|\Sigma(\bar\theta_t - \theta_*)\big\|^2 \leq \frac{2\,\mathbb{E}\|\theta_t - \theta_0\|^2}{t^2} + \frac{2\gamma}{t^2}\,\mathbb{E}\|M_t\|^2 .
\]
Then, by the Itô isometry, we have
\[
\mathbb{E}\|M_t\|^2 = \frac{1}{n}\,\mathbb{E}\Big[\int_0^t\big\|X^\top R_x(\theta_s)\big\|_{HS}^2\,ds\Big]
= \frac{1}{n}\,\mathbb{E}\Big[\int_0^t\operatorname{Tr}\!\big(XX^\top R_x^2(\theta_s)\big)\,ds\Big]
\leq 2K\int_0^t\mathbb{E}[L(\theta_s)]\,ds .
\]
Moreover, we have seen in Lemma 4.5 that for γ ≤ K⁻¹, we have
\[
\frac{d}{dt}\,\tfrac12\,\mathbb{E}\|\theta_t - \theta_*\|^2 \leq -\mathbb{E} L(\theta_t) + 2\sigma^2 ,
\]
and hence we have the upper bound on the loss \(\int_0^t\mathbb{E}[L(\theta_s)]\,ds \leq \tfrac12\|\theta_0 - \theta_*\|^2 + 2\sigma^2 t\). Finally, this gives that
\[
\mathbb{E}\|M_t\|^2 \leq K\|\theta_0 - \theta_*\|^2 + 4K\sigma^2 t .
\]
On the other hand, we have for all t ≥ 0,
\[
\begin{aligned}
\mathbb{E}\|\theta_t - \theta_0\|^2 &\leq 2\,\mathbb{E}\|\theta_t - \theta_*\|^2 + 2\,\mathbb{E}\|\theta_* - \theta_0\|^2 \\
&\leq 4\,e^{-2(1-\gamma K)\mu t}\Big(\tfrac12\|\theta_0 - \theta_*\|^2 - \frac{\gamma K\sigma^2}{(1-\gamma K)\mu}\Big) + \frac{4\gamma K\sigma^2}{(1-\gamma K)\mu} + 2\|\theta_* - \theta_0\|^2 \\
&\leq 4\,\|\theta_0 - \theta_*\|^2 ,
\end{aligned}
\]
the second line by (37) for γ < 1/K. Overall, we have the bound
\[
\mathbb{E}\big\|\Sigma(\bar\theta_t - \theta_*)\big\|^2 \leq \frac{8\gamma K\sigma^2}{t} + \frac{10\,\|\theta_0 - \theta_*\|^2}{t^2} . \qquad\blacksquare
\]
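As an assumed, illustrative companion to this bound, the following sketch adds a running (Polyak–Ruppert) time-average on top of the same Euler–Maruyama discretization used in the previous sketches and prints ‖Σ(θ̄t − θ∗)‖², which should decay roughly like 1/t once the iterate has reached its stationary fluctuations; all parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 30, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)
Sigma = X.T @ X / n
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
K = np.max(np.sum(X**2, axis=1))
gamma, dt = 0.2 / K, 1e-3

theta = theta_star + rng.standard_normal(d)
theta_bar = np.zeros(d)
for k in range(1, int(50.0 / dt) + 1):
    r = X @ theta - y
    M = np.diag(r**2) - np.outer(r, r) / n
    w, V = np.linalg.eigh(M)
    Rx = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
    dB = np.sqrt(dt) * rng.standard_normal(n)
    theta = theta - Sigma @ (theta - theta_star) * dt + np.sqrt(gamma / n) * X.T @ (Rx @ dB)
    theta_bar += (theta - theta_bar) / k                    # running time-average of the iterates
    if k % int(10.0 / dt) == 0:
        err = np.linalg.norm(Sigma @ (theta_bar - theta_star)) ** 2
        print(f"t = {k * dt:5.1f}   ||Sigma(theta_bar - theta_*)||^2 = {err:.3e}")
```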

Now, we turn to the convergence of θt in the case of step-size decay. This corresponds to the proof
of Proposition 4.11.

Proposition 4.11. By the Itô formula, we obtain that
\[
\begin{aligned}
\frac{d}{dt}\,\mathbb{E}[\|\theta_t - \theta_*\|^2]
&\leq \mathbb{E}\Big[-\frac{2}{n}\|X(\theta_t - \theta_*)\|^2 + \frac{\gamma_t K}{n}\|X\theta_t - y\|^2\Big] \\
&\leq \mathbb{E}\Big[\Big(-\frac{2}{n} + \frac{2\gamma_t K}{n}\Big)\|X(\theta_t - \theta_*)\|^2 + \frac{2\gamma_t K}{n}\|X\theta_* - y\|^2\Big] \\
&\leq (-2 + 2\gamma_t K)\mu\,\mathbb{E}[\|\theta_t - \theta_*\|^2] + 4\gamma_t K\sigma^2 \\
&\leq -\mu\,\mathbb{E}[\|\theta_t - \theta_*\|^2] + 4\gamma_t K\sigma^2 .
\end{aligned}
\]
Applying the Gronwall lemma, we get
\[
\begin{aligned}
\mathbb{E}[\|\theta_t - \theta_*\|^2]
&\leq \mathbb{E}[\|\theta_0 - \theta_*\|^2]\,e^{-\mu t} + \int_0^t e^{-\mu(t-s)}\,\frac{4K\sigma^2}{2K + s^\alpha}\,ds \\
&\leq \mathbb{E}[\|\theta_0 - \theta_*\|^2]\,e^{-\mu t} + e^{-\mu t/2}\int_0^{t/2}\frac{4K\sigma^2}{2K}\,ds + \int_{t/2}^t\frac{4K\sigma^2}{s^\alpha}\,ds \\
&\leq \big(\mathbb{E}[\|\theta_0 - \theta_*\|^2] + t\sigma^2\big)\,e^{-\mu t/2} + 4K\sigma^2\,\frac{(t/2)^{1-\alpha}}{\alpha - 1} \\
&\leq \frac{1}{t^{\alpha-1}}\left(e^{-(\alpha-1)}\,\mathbb{E}[\|\theta_0 - \theta_*\|^2]\Big(\frac{2(\alpha-1)}{\mu}\Big)^{\alpha-1} + e^{-\alpha}\Big(\frac{2\alpha}{\mu}\Big)^{\alpha}\sigma^2 + \frac{2^{1+\alpha}K\sigma^2}{\alpha-1}\right) . \qquad\blacksquare
\end{aligned}
\]

APPENDIX B. PROOFS OF SECTION 5: ONLINE SGD

B.1. The noiseless case

We begin this section with the proof of convergence in the noiseless case. This corresponds to
Theorem 5.1.

Theorem 5.1. (i) The key ingredient of the proof is the Gronwall Lemma. Combining the Itô formula with
(13) gives us
\[
\begin{aligned}
\frac{d}{dt}\,\mathbb{E}\|\theta_t - \theta_*\|^2
&= -2\,\mathbb{E}\langle \theta_t - \theta_*, \Sigma(\theta_t - \theta_*)\rangle + \gamma\,\mathbb{E}\operatorname{Tr}\!\big(\sigma(\theta_t)\sigma(\theta_t)^\top\big) \\
&\leq -2\,\mathbb{E}\langle \theta_t - \theta_*, \Sigma(\theta_t - \theta_*)\rangle + \gamma\,\mathbb{E}\big[\operatorname{Tr}\!\big((\langle \theta_t, X\rangle - Y)^2 XX^\top\big)\big] \\
&\leq -2\,\mathbb{E}\langle \theta_t - \theta_*, \Sigma(\theta_t - \theta_*)\rangle + \gamma K\,\mathbb{E}\langle \theta_t - \theta_*, \Sigma(\theta_t - \theta_*)\rangle \\
&\leq -(2 - \gamma K)\,\mathbb{E}\langle \theta_t - \theta_*, \Sigma(\theta_t - \theta_*)\rangle \\
&\leq -\mu(2 - \gamma K)\,\mathbb{E}\|\theta_t - \theta_*\|^2 .
\end{aligned}
\]
By integrating the latter thanks to the Gronwall Lemma, we get the result claimed in (i).

(ii) We define ηt := θt − θ∗ and recall that, thanks to the penultimate inequality of (i), we have by integration
that:
\[
\mathbb{E}\|\eta_t\|^2 \leq \|\theta_0 - \theta_*\|^2 - 2(2 - \gamma K)\int_0^t \mathbb{E} L(\theta_u)\,du .
\]
The first consequence of this inequality is that
\[
\int_0^t \mathbb{E} L(\theta_u)\,du \leq \frac{1}{2(2 - \gamma K)}\,\|\theta_0 - \theta_*\|^2 .
\]
We want to lower bound the term EL(θu) without using the smallest eigenvalue of Σ, which may be arbitrarily
small. Similarly to what was done before, thanks to the Hölder inequality, if p, q ∈ (0, 1) with p + q = 1, we have, for all t ≥ 0,
\[
\big(\mathbb{E}\|\eta_t\|^2\big)^{1/p} \leq 2\,\mathbb{E}[L(\theta_t)]\,\Big(\mathbb{E}\big[\langle \eta_t, \Sigma^{-p/q}\eta_t\rangle\big]\Big)^{q/p} . \tag{39}
\]

We now prove that t ↦ E[⟨ηt, Σ^{−p/q}ηt⟩] is bounded. Indeed, we proceed as before with the Itô formula to get
\[
\begin{aligned}
\frac{d}{dt}\,\mathbb{E}\big[\langle \eta_t, \Sigma^{-p/q}\eta_t\rangle\big]
&= -2\,\mathbb{E}\langle \eta_t, \Sigma^{1-p/q}\eta_t\rangle + \gamma\,\mathbb{E}\operatorname{Tr}\!\big(\sigma\sigma^\top\Sigma^{-p/q}\big) \\
&\leq \gamma\,\mathbb{E}\operatorname{Tr}\!\big(\Sigma^{-p/q}(\langle \theta_t, X\rangle - Y)^2 XX^\top\big)
\leq \gamma\,\mathbb{E}\big[\langle X, \Sigma^{-p/q}X\rangle(\langle \theta_t, X\rangle - Y)^2\big]
\leq 2\gamma K_{p/q}\,\mathbb{E} L(\theta_t) ,
\end{aligned}
\]
where we have used that, almost surely, ⟨X, Σ^{−p/q}X⟩ ≤ K_{p/q}. Then, integrating with respect to t yields
\[
\mathbb{E}\big[\langle \eta_t, \Sigma^{-p/q}\eta_t\rangle\big] \leq \langle \eta_0, \Sigma^{-p/q}\eta_0\rangle + 2\gamma K_{p/q}\int_0^t \mathbb{E} L(\theta_u)\,du
\leq \langle \eta_0, \Sigma^{-p/q}\eta_0\rangle + \frac{\gamma K_{p/q}}{2 - \gamma K}\,\|\theta_0 - \theta_*\|^2 .
\]
Hence, calling \(C = \frac{1}{2}\big(\langle \eta_0, \Sigma^{-p/q}\eta_0\rangle + \frac{\gamma K_{p/q}}{2-\gamma K}\|\theta_0 - \theta_*\|^2\big)^{-q/p}\), we have the inequality, for all t ≥ 0,
\[
\mathbb{E} L(\theta_t) \geq C\,\big(\mathbb{E}\|\eta_t\|^2\big)^{1/p} ,
\]
and this yields (up to replacing C by 2(2 − γK)C, which we still denote C) the inequality
\[
\mathbb{E}\|\eta_t\|^2 \leq \|\eta_0\|^2 - C\int_0^t\big(\mathbb{E}\|\eta_u\|^2\big)^{1/p}\,du , \tag{40}
\]
which implies, from a slight modification of the Gronwall Lemma, that for all t ≥ 0, we have
\[
\mathbb{E}\|\eta_t\|^2 \leq \Big(\|\eta_0\|^{-2(1/p-1)} + (1/p-1)\,C\,t\Big)^{-\frac{1}{1/p-1}} ;
\]
this gives the result claimed in the theorem. To see how the last inequality goes, we define
\(g(t) = \|\eta_0\|^2 - C\int_0^t(\mathbb{E}\|\eta_u\|^2)^{1/p}\,du\) (which is positive) and we rewrite (40) as
\[
\left(\frac{g'(t)}{-C}\right)^p \leq g(t) \iff \frac{g'(t)}{g(t)^{1/p}} \geq -C \implies \frac{1}{-1/p+1}\big(g(t)^{-1/p+1} - g(0)^{-1/p+1}\big) \geq -Ct ,
\]
and thus
\[
g(t) \leq \Big(-Ct\big(-\tfrac1p + 1\big) + \|\eta_0\|^{2(-1/p+1)}\Big)^{\frac{1}{-1/p+1}} ,
\]
and we conclude with (40).
To prove the almost sure convergence, we use the Itô formula to obtain
\[
\mathbb{E}\big[\|\theta_t - \theta_*\|^2\,\big|\,\mathcal{F}_s\big]
= \|\theta_0 - \theta_*\|^2 + 2\sqrt{\gamma}\int_0^s\big\langle \theta_u - \theta_*,\, \sigma(\theta_u)\,dB_u\big\rangle
+ \mathbb{E}\Big[\int_0^t\Big(\gamma\operatorname{Tr}\!\big(\sigma\sigma^\top(\theta_u)\big) - 4L(\theta_u)\Big)du\,\Big|\,\mathcal{F}_s\Big] .
\]
For γ < 1/(2K), we have proven in (i) that the term inside the conditional expectation is non-positive. We can
thus overestimate the latter by integrating only from 0 to s, which gives
\[
\mathbb{E}\big[\|\theta_t - \theta_*\|^2\,\big|\,\mathcal{F}_s\big] \leq \|\theta_s - \theta_*\|^2 ,
\]
so that ‖θt − θ∗‖² is a non-negative supermartingale, bounded in L¹, and therefore converges almost surely;
since E‖θt − θ∗‖² → 0 by (i), the almost sure limit is 0. ■
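For a concrete picture of this online dynamics, one can specialize to an assumed toy model where X ∼ N(0, I_d) and Y = ⟨θ∗, X⟩ with no label noise (this model is purely illustrative and is not used in the proofs). In that case Σ = I_d and, by Isserlis' theorem, f(θ) = ‖η‖²I_d + 2ηη⊤ and g(θ) = ηη⊤ with η = θ − θ∗, so σ(θ)² = ‖η‖²I_d + ηη⊤ and σ(θ) = ‖η‖(I_d + (√2 − 1)êê⊤) with ê = η/‖η‖. The sketch below integrates the corresponding SDE and exhibits the convergence of Theorem 5.1.

```python
import numpy as np

# Assumed toy model: X ~ N(0, I_d), Y = <theta_*, X>; then Sigma = I_d and
# sigma(theta) = ||eta|| (I_d + (sqrt(2)-1) e e^T) with eta = theta - theta_*, e = eta/||eta||.
rng = np.random.default_rng(5)
d = 10
theta_star = rng.standard_normal(d)
theta = theta_star + 5.0 * rng.standard_normal(d)
gamma, dt, T = 0.05, 1e-3, 30.0

for _ in range(int(T / dt)):
    eta = theta - theta_star
    nrm = np.linalg.norm(eta)
    if nrm > 0.0:
        e = eta / nrm
        sigma = nrm * (np.eye(d) + (np.sqrt(2.0) - 1.0) * np.outer(e, e))
    else:
        sigma = np.zeros((d, d))
    dB = np.sqrt(dt) * rng.standard_normal(d)
    theta = theta - eta * dt + np.sqrt(gamma) * sigma @ dB     # d theta = -Sigma eta dt + sqrt(gamma) sigma dB

print("||theta_T - theta_*|| =", np.linalg.norm(theta - theta_star))    # converges to 0
```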

B.2. The noisy case


We begin this section by proving that the multiplicative noise carries some Lipschitz property. This corresponds to Lemma 5.3.

Lemma 5.3. Recall that the diffusion matrix writes
\[
\sigma(\theta) = \big(f(\theta) - g(\theta)\big)^{1/2}, \quad\text{with}\quad
f(\theta) = \mathbb{E}_\rho\big[(\langle \theta, X\rangle - Y)^2 XX^\top\big], \quad
g(\theta) = \mathbb{E}_\rho\big[(\langle \theta, X\rangle - Y)X\big]\,\mathbb{E}_\rho\big[(\langle \theta, X\rangle - Y)X\big]^\top .
\]
Let us introduce, for all θ ∈ R^d, the differential of σ at θ, that is dσ_θ : R^d → R^{d×d} such that for all
h ∈ R^d,
\[
\sigma(\theta + h) = \sigma(\theta) + d\sigma_\theta(h) + o(\|h\|) .
\]
We calculate, for all θ, h ∈ R^d,
\[
\sigma(\theta + h) = \big(f(\theta + h) - g(\theta + h)\big)^{1/2} = \big(f(\theta) - g(\theta) + l_\theta(h) + o(\|h\|)\big)^{1/2} ,
\]
where l_θ(h) = df_θ(h) − dg_θ(h) and
\[
df_\theta(h) = 2\,\mathbb{E}_\rho\big[(\langle \theta, X\rangle - Y)\langle h, X\rangle XX^\top\big], \qquad
dg_\theta(h) = \mathbb{E}\big[\langle h, X\rangle X\big]\,\mathbb{E}_\rho\big[(\langle \theta, X\rangle - Y)X\big]^\top + \mathbb{E}_\rho\big[(\langle \theta, X\rangle - Y)X\big]\,\mathbb{E}\big[\langle h, X\rangle X\big]^\top .
\]
Hence, introducing the square-root operator ψ : S⁺_d → S⁺_d such that for all M ∈ S⁺_d, ψ(M) = M^{1/2}, we
have, for all θ, h ∈ R^d,
\[
d\sigma_\theta(h) = d\psi_{\sigma^2(\theta)}\big(l_\theta(h)\big) ,
\]
that is, dσ_θ(h) is the unique solution to the matrix Lyapunov equation
\[
\sigma(\theta)\,d\sigma_\theta(h) + d\sigma_\theta(h)\,\sigma(\theta) = l_\theta(h) ,
\]
or equivalently, expressed in closed form as
\[
d\sigma_\theta(h) = \int_0^{+\infty} e^{-s\sigma(\theta)}\,l_\theta(h)\,e^{-s\sigma(\theta)}\,ds .
\]
That being written, let us set, for t ∈ [0, 1], Ψ(t) = σ(tθ + (1 − t)η). We have
\[
\sigma(\theta) - \sigma(\eta) = \Psi(1) - \Psi(0) = \int_0^1 \Psi'(u)\,du ,
\]
and, knowing that Ψ'(u) = dσ_{m_u}(θ − η), where (m_u)_{0≤u≤1} parametrizes the segment joining θ and η,
m_u = uθ + (1 − u)η, the triangle inequality for the Hilbert–Schmidt norm gives
\[
\|\sigma(\theta) - \sigma(\eta)\|_{HS} \leq \int_0^1 \|d\sigma_{m_u}(\theta - \eta)\|_{HS}\,du .
\]
Hence, it remains to upper bound the Hilbert–Schmidt norm of the differential of σ thanks to its integral
representation presented above. In fact, for all θ, h ∈ R^d, we have
\[
\|d\sigma_\theta(h)\|_{HS} \leq \|l_\theta(h)\|\int_0^{+\infty}\big\|e^{-s\sigma(\theta)}\big\|_{HS}^2\,ds .
\]
Moreover, we have ‖l_θ(h)‖ ≤ ‖df_θ(h)‖ + ‖dg_θ(h)‖ and, using that ‖·‖_HS is sub-multiplicative, we have
\[
\|df_\theta(h)\| \leq 2\,\mathbb{E}\big[|\langle \theta, X\rangle - Y|\,|\langle h, X\rangle|\,\|XX^\top\|\big]
\leq 2K\sqrt{\mathbb{E}\big[|\langle \theta, X\rangle - Y|^2\big]}\sqrt{\mathbb{E}\big[|\langle h, X\rangle|^2\big]}
= 2\sqrt{2}\,K\sqrt{L(\theta)}\sqrt{\langle \Sigma h, h\rangle} ,
\]
and similarly,
\[
\|dg_\theta(h)\| \leq 2\sqrt{2}\,K\sqrt{L(\theta)}\sqrt{\langle \Sigma h, h\rangle} .
\]
Finally, we have the bound
\[
\int_0^{+\infty}\big\|e^{-s\sigma(\theta)}\big\|_{HS}^2\,ds = \int_0^{+\infty}\operatorname{Tr}\!\big(e^{-2s\sigma(\theta)}\big)\,ds = \frac{1}{2}\operatorname{Tr}\!\big(\sigma^{-1}(\theta)\big) .
\]
As σ²(θ) ⪰ a²L(θ)Id, we deduce that
\[
L(\theta)\operatorname{Tr}\!\big(\sigma^{-1}(\theta)\big)^2 \leq d\,L(\theta)\operatorname{Tr}\!\big(\sigma^{-2}(\theta)\big) \leq \frac{d^2}{a^2} ,
\]
the first inequality by Cauchy–Schwarz, and Lemma 5.3 follows. ■
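The integral representation of dσ_θ(h) above is the usual formula for the Fréchet derivative of the matrix square root. As an assumed, illustrative sanity check (an arbitrary positive definite matrix stands in for σ²(θ) and a symmetric perturbation for l_θ(h)), the sketch below solves the Lyapunov equation in the eigenbasis and verifies both the equation and the first-order expansion.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 6
A = rng.standard_normal((d, d)); S2 = A @ A.T + d * np.eye(d)    # stands in for sigma(theta)^2
B = rng.standard_normal((d, d)); L = B + B.T                     # stands in for l_theta(h)

def sqrtm_psd(M):
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

S = sqrtm_psd(S2)

# Solve S D + D S = L in the eigenbasis of S: D~_ij = L~_ij / (mu_i + mu_j).
mu, V = np.linalg.eigh(S)
D = V @ ((V.T @ L @ V) / (mu[:, None] + mu[None, :])) @ V.T

assert np.allclose(S @ D + D @ S, L)                             # Lyapunov equation holds
eps = 1e-6
finite_diff = (sqrtm_psd(S2 + eps * L) - S) / eps                # (sigma^2 + eps*l)^{1/2} ~ sigma + eps*D
print("first-order error:", np.linalg.norm(finite_diff - D))     # should be of order eps
```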
We can now turn to the proof of the main theorem of this section, namely the convergence in the noisy
case. This corresponds to proving Theorem 5.4.

Theorem 5.4. (i) The Wasserstein contraction comes essentially from coupling arguments. Let γ ≤ K⁻¹ and
let ρ₀¹, ρ₀² ∈ P₂(R^d) be two possible initial distributions. Then, by [48, Theorem 4.1], there exists a couple of
random variables (θ₀¹, θ₀²) such that W₂²(ρ₀¹, ρ₀²) = E[‖θ₀¹ − θ₀²‖²]. Let (θt¹)_{t≥0} (resp. (θt²)_{t≥0}) be the solution
of the SDE (13) started from θ₀¹ (resp. θ₀²), both sharing the same Brownian motion (Bt)_{t≥0}. Then, for all t ≥ 0, the random variable
(θt¹, θt²) is a coupling between ρt¹ and ρt², and hence
\[
W_2^2(\rho_t^2, \rho_t^1) \leq \mathbb{E}\big[\|\theta_t^1 - \theta_t^2\|^2\big] .
\]
Moreover, denoting by ‖·‖_HS the Frobenius norm, the Itô formula gives
\[
\begin{aligned}
\frac{d}{dt}\,\mathbb{E}\big[\|\theta_t^1 - \theta_t^2\|^2\big]
&= -2\,\mathbb{E}\big[\langle \theta_t^1 - \theta_t^2, \Sigma(\theta_t^1 - \theta_*) - \Sigma(\theta_t^2 - \theta_*)\rangle\big] + \gamma\,\mathbb{E}\big\|\sigma(\theta_t^1) - \sigma(\theta_t^2)\big\|_{HS}^2 \\
&\leq -2\,\mathbb{E}\big[\langle \Sigma(\theta_t^1 - \theta_t^2), \theta_t^1 - \theta_t^2\rangle\big] + 2\gamma Kc\,\mathbb{E}\big[\langle \Sigma(\theta_t^1 - \theta_t^2), \theta_t^1 - \theta_t^2\rangle\big] ,
\end{aligned}
\]
thanks to Lemma 5.3. Hence, this gives the inequality:
\[
\frac{d}{dt}\,\mathbb{E}\big[\|\theta_t^1 - \theta_t^2\|^2\big] \leq -2(1 - \gamma Kc)\,\mathbb{E}\big[\langle \Sigma(\theta_t^1 - \theta_t^2), \theta_t^1 - \theta_t^2\rangle\big] \tag{41}
\]
\[
\leq -2\mu(1 - \gamma Kc)\,\mathbb{E}\big\|\theta_t^1 - \theta_t^2\big\|^2 , \tag{42}
\]
and by the Gronwall Lemma, denoting c_γ = 2μ(1 − γKc), this gives that
\[
W_2^2(\rho_t^1, \rho_t^2) \leq \mathbb{E}\big[\|\theta_t^1 - \theta_t^2\|^2\big] \leq e^{-c_\gamma t}\,\mathbb{E}\big[\|\theta_0^1 - \theta_0^2\|^2\big] = e^{-c_\gamma t}\,W_2^2(\rho_0^2, \rho_0^1) .
\]
Now, for all s ≥ 0, setting ρ₀¹ = ρ₀ ∈ P₂(R^d) and ρ₀² = ρs ∈ P₂(R^d), we have for all t ≥ 0,
\[
W_2^2(\rho_t, \rho_{t+s}) \leq e^{-c_\gamma t}\,W_2^2(\rho_0, \rho_s) ,
\]
which shows that the process (ρt)_{t≥0} is of Cauchy type, and since (P₂(R^d), W₂) is a Polish space, ρt →
ρ∗ ∈ P₂(R^d) as t grows to infinity. Now, since there exists a stationary solution to the process, let us fix
ρ₀¹ = ρ∗ ∈ P₂(R^d) and ρ₀² = ρ₀ ∈ P₂(R^d). We have then
\[
W_2^2(\rho_t, \rho_*) \leq e^{-c_\gamma t}\,W_2^2(\rho_0, \rho_*) ,
\]
which concludes the first part of the Theorem.

(ii) We will use the same steps as for the proof of Theorem 4.3 (ii). Again, one readily checks that for
p, q ∈ (0, 1) with p + q = 1, we have, for all t ≥ 0,
\[
\mathbb{E}\big[\|\Sigma^{1/2}(\theta_t^1 - \theta_t^2)\|^2\big] \geq \frac{\big(\mathbb{E}\big[\|\theta_t^1 - \theta_t^2\|^2\big]\big)^{1/p}}{\big(\mathbb{E}\big[\langle \theta_t^1 - \theta_t^2, \Sigma^{-p/q}(\theta_t^1 - \theta_t^2)\rangle\big]\big)^{q/p}} . \tag{43}
\]
By the Itô formula, skipping the details, we get
\[
\frac{d}{dt}\,\mathbb{E}\big[\langle \theta_t^1 - \theta_t^2, \Sigma^{-p/q}(\theta_t^1 - \theta_t^2)\rangle\big]
\leq \gamma\,\mathbb{E}\operatorname{Tr}\!\big(\Sigma^{-p/q}(\sigma(\theta^1) - \sigma(\theta^2))^2\big)
\leq 2\gamma c_{p/q}K_{p/q}\,\mathbb{E}\big[\langle \Sigma(\theta^1 - \theta^2), \theta^1 - \theta^2\rangle\big] ,
\]
thanks to assumption (29). In addition, by (41), we get
\[
\int_0^t \mathbb{E}\big[\|\Sigma^{1/2}(\theta_u^1 - \theta_u^2)\|^2\big]\,du \leq \frac{1}{2(1 - \gamma Kc)}\,\mathbb{E}\|\theta_0^1 - \theta_0^2\|^2 .
\]
Collecting the estimates, we thus get
\[
\mathbb{E}\big[\langle \theta_t^1 - \theta_t^2, \Sigma^{-p/q}(\theta_t^1 - \theta_t^2)\rangle\big]
\leq \frac{\gamma c_{p/q}K_{p/q}}{1 - \gamma Kc}\,\mathbb{E}\|\theta_0^1 - \theta_0^2\|^2 + \mathbb{E}\big[\langle \theta_0^1 - \theta_0^2, \Sigma^{-p/q}(\theta_0^1 - \theta_0^2)\rangle\big] .
\]
That is to say that
\[
\mathbb{E}\big[\|\Sigma^{1/2}(\theta_t^1 - \theta_t^2)\|^2\big] \geq C\,\big(\mathbb{E}\big[\|\theta_t^1 - \theta_t^2\|^2\big]\big)^{1/p} , \tag{44}
\]
with \(C = \Big(\mathbb{E}\big[\langle \theta_0^1 - \theta_0^2, \Sigma^{-p/q}(\theta_0^1 - \theta_0^2)\rangle\big] + \frac{\gamma c_{p/q}K_{p/q}}{1 - \gamma cK}\,\mathbb{E}\big(\|\theta_0^1 - \theta_0^2\|^2\big)\Big)^{-q/p}\), which, combined with equa-
tion (41), gives
\[
\mathbb{E}\big[\|\theta_t^1 - \theta_t^2\|^2\big] \leq \mathbb{E}\big[\|\theta_0^1 - \theta_0^2\|^2\big] - 2C(1 - \gamma cK)\int_0^t\big(\mathbb{E}\big[\|\theta_u^1 - \theta_u^2\|^2\big]\big)^{1/p}\,du .
\]
This implies, from a slight modification of the Gronwall Lemma (and writing again C for 2C(1 − γcK)), that for all t ≥ 0,
\[
\mathbb{E}\big[\|\theta_t^1 - \theta_t^2\|^2\big] \leq \Big[\big(\mathbb{E}\big[\|\theta_0^1 - \theta_0^2\|^2\big]\big)^{1-1/p} + (1/p - 1)\,C\,t\Big]^{\frac{1}{1-1/p}} ,
\]
and we conclude as for (i). ■
