SDE For SGD
ABSTRACT. We study the dynamics of a continuous-time model of the Stochastic Gradient Descent (SGD)
for the least-squares problem. Indeed, pursuing the work of [1], we analyze Stochastic Differential Equations
(SDEs) that model SGD either in the case of the training loss (finite samples) or the population one (online set-
ting). A key qualitative feature of the dynamics is the existence of a perfect interpolator of the data, irrespective
of the sample size. In both scenarios, we provide precise, non-asymptotic rates of convergence to the (possibly
degenerate) stationary distribution. Additionally, we describe this asymptotic distribution, offering estimates
of its mean, deviations from it, and a proof of the emergence of heavy-tails related to the step-size magnitude.
Numerical simulations supporting our findings are also presented.
1. INTRODUCTION
The stochastic gradient descent (SGD) is the workhorse of any large-scale machine learning pipeline. De-
scribed more than seventy years ago as a versatile stochastic algorithm [2], it has been studied thoroughly
since then (e.g. [3, 4]), and applied with success as an efficient computational and statistical device for large
scale machine learning [5]. Yet, since the emergence of deep neural networks (DNNs), new light has been shed on the use of the algorithm: among others, the role of SGD's noise in the good generalization performance of DNNs [6] has been described empirically [7], while its particular covariance shape is expected to play a predominant role in its dynamics [8].
In this direction, [1] proposed to approximate the SGD dynamics thanks to a Stochastic Differential
Equation (SDE) whose noise’s covariance matches the one of SGD. Since then, many works have leveraged
this continuous perspective, attempting to better describe some of SGD's phenomenology: among others, the role of the step-size [9], the escape time and direction from local minimizers [10], the study of the invariant distribution [11] or the heavy-tail phenomenon [12]. However, lured by the strong analytical tools that SDE models offer, some misconceptions on the basic nature of SGD's noise in the machine learning context have also emerged. Among others, one important message of [13] is to recall that SGD's noise has a particular shape and intensity that make SGD far from a Langevin type of dynamics, where isotropic noise is added to the gradient at each step. Notably, if the data model can be interpolated perfectly, the invariant measure of SGD can be largely degenerate without any need for step-size decay: this is the case in the overparametrized
regime [14, 15].
In this article, we take a step back from general purpose studies of SGD and focus on the specific case of
least-squares, where the predictor is a linear function of fixed features. Needless to say, linear predictors do not lead to state-of-the-art performance in most modern tasks; yet, the abundant literature on the neural tangent kernel [16] has reminded us that linear models still deserve to be better understood. Moreover, even though, for the sake of simplicity, the article is not written within the setting of kernel methods, every assumption is made so that it adapts easily to this setting, replacing Rd equipped with its canonical Euclidean structure by an abstract reproducing kernel Hilbert space.

† Universität Bonn, email: aschertz@uni-bonn.de
⋆ Ecole des Ponts ParisTech - CERMICS, email: loucas.pillaud-vivien@enpc.fr
Purpose and contributions. The aim of the present paper is to show that, with a proper model, SDEs offer
nice analytical tools that help streamline the analysis while capturing the qualitative as well as the quantita-
tive essence of SGD’s dynamics. We present a wide range of results on both online and empirical SGD,
demonstrating through quantitative analysis that the key difference lies in the model’s ability to achieve
perfect interpolation. We also systematically present non-asymptotic rates of convergence in all the settings
we present, either in ℓ2 -norm in the interpolation regime (Theorems 4.3 and 5.1), or in Wasserstein dis-
tance (towards the invariant distribution) in the case of a noisy system (Theorems 4.6 and 5.4). In the latter
case, we further investigate the invariant distribution: we pinpoint its location (Proposition 4.7), and more
importantly, we demonstrate in Proposition 4.9 that although there is no heavy tail phenomenon in finite
time, it emerges asymptotically. We finally address convergence of variance reduction techniques like time-
averaging (Proposition 4.10) and decay of step-size (Proposition 4.11). Throughout the article, we try to
present some theoretical background on the study of SDEs: among others, the use of Lyapunov potentials and of coupling methods, which enable a quantitative study of the speed of convergence to equilibrium.
Further related work. Formal links between the true stochastic gradient descent and its continuous model
are studied in [1] on the theoretical side, where a weak error analysis is provided, and in [17] on the experi-
mental side. In an article similar in spirit, [18] provides convergence results on general convex functions, but with a systematic polynomial step-size decay. Some results involving the use of SDEs to study the influence of the noise on the implicit bias of the algorithm are given in the least-squares case in [19], and this is an active topic
in general [20, 21, 22, 23]. Quantitative studies of the invariant distribution focusing on the particular shape
and intensity of the noise covariance include [24, 25]. Note that when the step-size is properly re-scaled
with respect to the dimension, SGD converges in the high-dimensional limit to an SDE that is similar to the one we study here [26], which is called homogenized SGD in the least-squares context [27]. Power laws of
convergence toward stationarity related to the eigenvalue decay of the covariance matrix (capacity condi-
tion) are ubiquitous in statistics [28] and in the study of SGD for least-squares [29, 14, 30, 31, 32]. Finally, the heavy-tail phenomenon has been re-discovered lately as an interesting feature of SGD with multiplicative
noise [33, 12] and more recently in the context of SDE in the work of [34].
Organisation of the paper. In Section 2, we present the general set-up of SGD both in the population and empirical cases, as well as the possibility, in both cases, that the data can be fully interpolated. Section 3 explains the relevance of building a consistent SDE model of SGD and recalls technical details related to
SDEs. In Section 4, the results concerning the dynamics of SGD in the training case are given, both in the
interpolation regime (Section 4.1) and the noisy regime (Section 4.2). Section 5 is built similarly to the
previous one and is devoted to online SGD. The proofs are postponed to the Appendix, for which precise
references can be found in the main text.
In this section, we introduce the least-square problem that we consider throughout the article, putting em-
phasis on the difference between empirical distributions (training loss) and true ones (population loss).
Nonetheless, the central argument of the article is that, aside from the known differences between the empirical and population cases, the primary qualitative distinction hinges on whether the loss can reach zero (interpolation), irrespective of the discrete nature of the distribution. The stochastic gradient descent is introduced
at the end of the section.
Remark 2.1 (Link with RKHS). Here, the family of predictors consists of linear functions of the input data x ∈ Rd. While, for the sake of clarity, we will keep this linear family throughout the article, note that the same
results apply for any family of linear predictors in an abstract reproducing kernel Hilbert space (H, ⟨·, ·⟩H ),
with feature Rd ∋ x 7→ φ(x) ∈ H, changing the dot product and family of predictors with the natural
structure: {fθ : x 7→ ⟨θ, φ(x)⟩H , θ ∈ H}.
In this article, in order to put emphasis on the possible difference between the two settings, we make a
clear distinction between the population and the empirical cases. To keep notations simple, we refer to the
population case whenever Ω := supp(ρ) is an open set of Rd . On the contrary, we refer to the empirical
case when ρ is a finite sum of atomic measures, i.e. there exist (x1, y1), . . . , (xn, yn) ∈ Rd × R such that ρ = (1/n) Σ_{i=1}^n δ(xi ,yi). Obviously, the term empirical refers to the fact that, in this case, Eq.(1) can be seen as the training loss

(2)   L(θ) = (1/2n) Σ_{i=1}^n (⟨θ, xi⟩ − yi)².

In this case, the number n ∈ N∗ will always denote the number of observed input/output pairs (xi, yi)i=1...n, and we will use the notations X = [x1, . . . , xn]⊤ ∈ Rn×d to denote the design matrix, and y = (y1, . . . , yn)⊤ ∈ Rn to denote the output vector. With these notations, the training loss rewrites L(θ) = (1/2n)∥Xθ − y∥².
Note that even though for the sake of clarity we will sometimes treat them separately, these cases fit into
the same framework and notations.
Example 2.3 (Underparametrized setting). Consider the underparametrized regime for which n > d, and
assume that the i.i.d. input/output couples (xi, yi)i=1...n come from independent distributions that have densities (i.e. for all i ∈ J1, nK, xi and yi are independent and distributed according to laws that are absolutely continuous with respect to the Lebesgue measure). Then, almost surely, I = ∅.
Noiseless setting I ̸= ∅. This means that there exists at least one perfect linear interpolator of the model.
In the population setting, this corresponds to the strong assumption that the model is well-specified and
noiseless (formally ξ = 0, if we refer to Example 2.2). Notably, this regime has received a lot of attention recently as a model of the large expressive power of neural networks [19, 14, 35]. Yet, in the empirical case, this
typically and simply corresponds to the overparametrized regime d ≥ n.
Example 2.4 (Overparametrized setting). Consider the overparametrized regime for which d ≥ n, and
assume that ((x1, y1), . . . , (xn, yn)) are i.i.d. samples drawn from a distribution that has a density. Then, the zero-loss set is the affine set I = θ∗ + Ker(X), where θ∗ is any element of I. Furthermore,
dim I ≥ d − n ≥ 0 almost surely.
Despite the fact that this dynamics looks very similar, the important difference is that, the batch of samples being fixed, the dynamics can select the same pair (xk, yk) several times. This is the reason why, informally, after t ≥ Θ(n) iterations, the training dynamics Eq.(7) will deviate from the online SGD presented in Eq.(5).
To end this section, let us also rewrite this dynamics as the full gradient descent step plus a martingale increment:
θt+1 = θt − γ∇θ L(θt) + γ (∇θ L(θt) − (⟨θt, xit⟩ − yit) xit)

(8)        = θt − (γ/n) Σ_{i=1}^n (⟨θt, xi⟩ − yi) xi + γ m(θt, (xit, yit)),

where m(θt, (xit, yit)) := (1/n) Σ_{i=1}^n (⟨θt, xi⟩ − yi) xi − (⟨θt, xit⟩ − yit) xit. Note that these equations are in fact exactly the same as the ones presented in the population case, simply considering that the distribution is an empirical one, i.e. ρ = (1/n) Σ_{i=1}^n δ(xi ,yi). We nonetheless decided to present them explicitly for the sake of clarity.
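To make the recursion concrete, here is a minimal NumPy sketch of the training dynamics (7)-(8); the data, step-size and horizon below are arbitrary illustrative choices, not the ones used in the paper's experiments.

```python
import numpy as np

def sgd_least_squares(X, y, theta0, gamma, n_steps, rng):
    """Single-sample SGD on the training loss L(theta) = ||X theta - y||^2 / (2n)."""
    n, d = X.shape
    theta = theta0.copy()
    for _ in range(n_steps):
        i = rng.integers(n)                       # pick a pair (x_i, y_i) uniformly, with replacement
        residual = X[i] @ theta - y[i]            # <theta_t, x_i> - y_i
        theta = theta - gamma * residual * X[i]   # theta_{t+1} = theta_t - gamma (<theta_t, x_i> - y_i) x_i
    return theta

rng = np.random.default_rng(0)
n, d = 100, 20                                    # illustrative sizes (underparametrized here)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
theta_hat = sgd_least_squares(X, y, np.zeros(d), gamma=0.005, n_steps=50000, rng=rng)
print(np.linalg.norm(X @ theta_hat - y) ** 2 / (2 * n))   # final training loss
```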
In this section we give stochastic differential equation (SDE) models for SGD. First, we provide general conditions that are necessary to model SGD well. We then instantiate them more precisely.
This question has received a lot of attention in the last decade, and a good principle to answer it is to turn to stochastic modified equations [1]. This is a natural way to build SDE models since they are consistent with SGD in the infinitesimal step-size limit. In order to build such a model, there are two requirements:
(i) The drift term b(t, θt ) should match −∇L(θt ).
(ii) The noise factor σ should have the same covariance as the local martingale m, i.e.

(10)   σ(t, θt)σ(t, θt)⊤ = Eρ[ m(θt, (xt, yt)) m(θt, (xt, yt))⊤ | Ft−1 ].
Besides technical assumptions, these are the two requirements presented in [1, Theorem 3] to show that the
SDE model is consistent in the small step-size limit with the SGD recursion. Going beyond the approxi-
mation concerns tackled by (i) and (ii), it has recently been observed that the SGD noise carries a specific
shape that the SDE model should carry as well [13, 22]. This requirement has a more qualitative nature but
is important to fully capture the essence of the SGD dynamics:
(iii) The noise term σ(t, θt )dBt should span the same space as m(θt , (xt , yt )).
We will see below that, in order to build an SDE model of SGD, this third requirement is particularly important in the empirical case, where the noise has a strong degeneracy.
Initial condition and moments. Initialization will be taken at some θ0 ∈ Rd, which can be considered as a random variable. Standard choices for the law of θ0 include the standard Gaussian of Rd, or a Dirac measure on some vector θ0, e.g. θ0 = 0. In either case, its law ρ0 has moments of all orders, and since the drift and the multiplicative noise of the SDEs (12)-(13) have at most linear growth, Theorem 3.5 of [38] shows that the marginal laws (ρt)t≥0 have moments of all orders at any time t ≥ 0.
Now that we have motivated the SDE models in the previous part and stated them in Eqs.(12)-(13), we study their convergence properties. This is the purpose of the main results stated in the two following sections.
Recall that this corresponds to the finite data set case with n input/output pairs (xi, yi)i=1...n, stacked into the data matrix X ∈ Rn×d and the output vector y ∈ Rn. We study in this section the SDE given in equation (13). We assume that all data are bounded, i.e. there is some K > 0 such that ∥xi∥² ≤ K for all i ∈ J1, nK. Let us first introduce an important element of the model.
Definition 4.1. Let X† denote the pseudo-inverse of the design matrix X. We define
(14) θ∗ = X† y + (I − X† X)θ0 .
Note that, in the generic underparametrized case for which d ≤ n, we have X† = (X⊤X)−1X⊤. Then, X†X = I, and θ∗ = X†y is the ordinary least-squares estimator and does not depend on θ0. In the generic overparametrized case for which n ≤ d, we have X† = X⊤(XX⊤)−1, and hence Xθ∗ = XX†y + (X − XX†X)θ0 = y, that is to say that θ∗ ∈ I.
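As a quick sanity check, Definition 4.1 can be evaluated numerically with the Moore-Penrose pseudo-inverse; the sketch below (with arbitrary Gaussian data, an assumption of this illustration) verifies that in an overparametrized draw θ∗ indeed interpolates the data, i.e. Xθ∗ = y.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 25                                  # overparametrized draw: d >= n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
theta0 = rng.standard_normal(d)

X_pinv = np.linalg.pinv(X)                     # Moore-Penrose pseudo-inverse X^dagger
theta_star = X_pinv @ y + (np.eye(d) - X_pinv @ X) @ theta0   # Eq. (14)

print(np.allclose(X @ theta_star, y))          # True: theta_star interpolates, i.e. theta_star in I
print(np.linalg.norm(theta_star - theta0))     # distance from theta0 to its projection onto I
```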
Finally, for both cases, we define Σ := (1/n) X⊤X ∈ Rd×d the design covariance matrix. With this notation, the SDE becomes

dθt = −Σ(θt − θ∗)dt + √(γ/n) X⊤Rx(θt)dBt,

where (Bt)t≥0 is a Brownian motion of Rn.
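A minimal Euler-Maruyama sketch of this SDE follows. Since the excerpt does not spell out Rx(θ) explicitly, we assume Rx(θ) = diag(rx(θ)) − (1/n) rx(θ)1⊤ with rx(θ) = Xθ − y, which reproduces the covariance diag(rx)² − (1/n) rx rx⊤ used in the appendix; all numerical choices are illustrative.

```python
import numpy as np

def euler_maruyama_sde(X, y, theta0, gamma, dt, n_steps, rng):
    """Euler-Maruyama discretization of
       d theta_t = -Sigma (theta_t - theta_*) dt + sqrt(gamma / n) X^T R_x(theta_t) dB_t,
    with the (assumed) choice R_x(theta) = diag(r) - (1/n) r 1^T, r = X theta - y."""
    n, d = X.shape
    Sigma = X.T @ X / n
    pinv = np.linalg.pinv(X)
    theta_star = pinv @ y + (np.eye(d) - pinv @ X) @ theta0   # Eq. (14)
    theta = theta0.copy()
    for _ in range(n_steps):
        r = X @ theta - y
        R = np.diag(r) - np.outer(r, np.ones(n)) / n
        dB = np.sqrt(dt) * rng.standard_normal(n)             # Brownian increment in R^n
        theta = theta - Sigma @ (theta - theta_star) * dt + np.sqrt(gamma / n) * X.T @ (R @ dB)
    return theta, theta_star

rng = np.random.default_rng(2)
n, d = 20, 40                                                 # overparametrized: interpolation regime
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)                 # normalize so that ||x_i||^2 = K = 1
y = rng.standard_normal(n)
theta_T, theta_star = euler_maruyama_sde(X, y, np.zeros(d), gamma=0.2, dt=0.05, n_steps=20000, rng=rng)
print(np.linalg.norm(theta_T - theta_star))                   # small: the noise vanishes on I
```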
Comparison with standard processes. As already said, the dynamics in the noiseless and noisy cases are of different natures because of the possibility to cancel the multiplicative noise term. Indeed, for clarity, imagine that n = d and X/√n = Σ = Id. If the noise can cancel, Rx(θt) has the shape of a linear term like diag(θ − θ∗), and each coordinate of the difference η = θ − θ∗ follows a one-dimensional geometric Brownian motion dηt = −ηt dt + √γ ηt dBt, for which it is known that ηt → 0 almost surely, which corresponds to θt → θ∗. This comparison is the governing principle of the analysis of this setup. Otherwise, if the noise cannot cancel and is strictly lower bounded, then, under the same proxy, the movement of each coordinate of η resembles the SDE dηt = −ηt dt + √γ √(ηt² + σ²) dBt, which looks like an Ornstein-Uhlenbeck process when η ≪ σ, but with a noise that has a multiplicative part ηt dBt when η ≫ σ. Such diffusions are known in one dimension under the name of Pearson diffusions [39] and exhibit stationary distributions with heavy tails.
The difference between these two settings is the reason why we divide the results into two different subsections.
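These two one-dimensional proxies are easy to simulate; the short sketch below (step-size, noise level and horizon are arbitrary) contrasts the geometric Brownian motion, which collapses onto 0, with the Pearson-type diffusion, which keeps fluctuating at scale σ.

```python
import numpy as np

def simulate_1d(eta0, gamma, sigma, dt, n_steps, rng, noisy):
    """Euler-Maruyama for d eta = -eta dt + sqrt(gamma) * a(eta) dB,
    with a(eta) = eta (geometric BM proxy) or a(eta) = sqrt(eta^2 + sigma^2) (Pearson-type proxy)."""
    eta = eta0
    for _ in range(n_steps):
        amp = np.sqrt(eta**2 + sigma**2) if noisy else eta
        eta += -eta * dt + np.sqrt(gamma) * amp * np.sqrt(dt) * rng.standard_normal()
    return eta

rng = np.random.default_rng(3)
gamma, sigma, dt, T = 0.5, 0.1, 1e-3, 20.0
steps = int(T / dt)
print(abs(simulate_1d(1.0, gamma, sigma, dt, steps, rng, noisy=False)))  # ~0: noise cancels at eta = 0
print(abs(simulate_1d(1.0, gamma, sigma, dt, steps, rng, noisy=True)))   # O(sigma): persistent fluctuations
```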
The proof is deferred to Section A.1 of the Appendix. Note that this fact is often known as the implicit bias
of least-squares methods and has statistical consequences studied under the name of “benign overfitting” [40,
41].
Remarkably, despite the randomness of the SDE (13), the convergence is almost sure towards θ∗ , with
explicit rates that we show below.
Theorem 4.3. Let (θt)t≥0 follow the dynamics given by Eq.(13) initialized at θ0 ∈ Rd. Then, for γ < 1/(3K), (θt)t≥0 converges almost surely to θ∗ with the following rates.
(i) Parametric rate. For all t ≥ 0, we have that

(16)   E[∥θt − θ∗∥²] ≤ ∥θ0 − θ∗∥² e^{−µ(2−Kγ)t},

where µ > 0 is the smallest non-zero eigenvalue of Σ.
(ii) Non-parametric rate. For all t ≥ 0 and all α > 0, we have that

(17)   E[∥θt − θ∗∥²] ≤ ( 1 / (∥θ0 − θ∗∥^{−2/α} + Cα t) )^α,

where Cα = (1/(2α)) (⟨θ0 − θ∗, Σ^{−α}(θ0 − θ∗)⟩ + (γKα/(2 − Kγ)) ∥θ0 − θ∗∥²)^{−1/α}, and Kα = max_{i≤n} ⟨xi, Σ^{−α} xi⟩.
FIGURE 1. Plot showing the error of SGD along time for an overparametrized regime where n = 100 and d = 200. The samples (xi)i≤n come from a Gaussian distribution with a covariance whose eigenvalues decay as a power law. The vertical dotted orange line illustrates the separation between the two regimes depicted by Theorem 4.3: the polynomial one before (a straight line in a log-log plot) and the exponential one after the typical time scale 1/µ. This illustrates perfectly the rates of convergence shown in Theorem 4.3.
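For readers who want to reproduce a plot in the spirit of Figure 1, the following sketch sets up an overparametrized problem with a power-law covariance (the exact decay exponent, step-size and horizon are our own illustrative choices) and tracks ∥θt − θ∗∥² along an SGD run; plotted on log-log axes, the two regimes of Theorem 4.3 should appear.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 100, 200                                           # overparametrized, as in Figure 1
spectrum = np.arange(1, d + 1, dtype=float) ** -2.0       # power-law covariance decay (assumed exponent)
X = rng.standard_normal((n, d)) * np.sqrt(spectrum)       # rows x_i ~ N(0, diag(spectrum))
y = rng.standard_normal(n)

pinv = np.linalg.pinv(X)
theta0 = np.zeros(d)
theta_star = pinv @ y + (np.eye(d) - pinv @ X) @ theta0   # Eq. (14): limit point of the dynamics

K = np.max(np.sum(X**2, axis=1))
gamma = 1.0 / (4.0 * K)                                   # safely below the 1/(3K) threshold of Theorem 4.3
theta, errors = theta0.copy(), []
for t in range(1, 200001):
    i = rng.integers(n)
    theta -= gamma * (X[i] @ theta - y[i]) * X[i]
    if t % 1000 == 0:
        errors.append(np.linalg.norm(theta - theta_star) ** 2)
print(errors[::20])                                       # error decreases; log-log axes reveal the two regimes
```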
Semi-group. Due to the non-degeneracy of the noise term in this case, it is more convenient to track the dynamics of the probability measure ρt := Law(θt) for any t ≥ 0, which is the time marginal of the process initialized at θ0 and defined such that E[f(θt)] = ∫ f dρt, for any f : Rd → R smooth enough. In fact the SDE Eq.(13) has an associated semi-group Pt, defined so that for all η ∈ Rd, (Pt f)(η) := E[f(θt) | θt=0 = η]. Formally, if ρt has a density, we have ρt(θ) = (Pt δθ)(θ0) for all θ ∈ Rd. Note that by local strong ellipticity, even if ρ0 = δθ0 is singular, regularization properties of SDEs ensure that at t = 0+, the measure ρt has a density with respect to Lebesgue as well as a second order moment [43, Section 4 of Chapter 9]. We place ourselves in this setup where ρ0 ∈ P2(Rd), the set of probability measures µ such that ∫ ∥θ∥² dµ(θ) < +∞.
Infinitesimal generator. It is known that the time-marginal law of (θt)t≥0 satisfies the (parabolic) Fokker-Planck equation (at least in the weak sense):

∂t ρt = L∗ ρt,

with the operator L∗ being defined as, for all θ ∈ Rd,

(L∗f)(θ) = div[Σ(θ − θ∗) f(θ)] + (γ/(2n)) Σ_{i,j=1}^d ∂²_{ij} ( [X⊤Rx(θ)²X]_{ij} f(θ) ),

whose adjoint (with respect to the canonical dot product of L²(Rd)) is often referred to as the infinitesimal generator of the dynamics and writes,

(18)   (Lf)(θ) = −⟨Σ(θ − θ∗), ∇f(θ)⟩ + (γ/(2n)) Σ_{i,j=1}^d [X⊤Rx(θ)²X]_{ij} ∂²_{ij} f(θ),
for all test functions f : Rd → R sufficiently smooth. Recall furthermore that the evolution of expectations of observables (f(θt))t≥0 is given by the Dynkin formula [38, Lemma 3.2]: (d/dt) E[f(θt)] = E[(Lf)(θt)]. This identity enables, as in the deterministic case for gradient flow, the use of Lyapunov functions, which are a very useful tool to study the asymptotic behavior of stochastic processes. This is the objective of the following lemma:
Lemma 4.5. Let V (θ) = 12 ∥θ − θ∗ ∥2 , we have the inequality, for all θ ∈ Rd ,
(19) LV (θ) ≤ −2(1 − γK/2)L(θ) + 2σ 2 .
In consequence, for γ ≤ 1/(3K), there exists a stationary process for the SDE (13).
The proof is postponed to the Appendix A.2.
FIGURE 3. Four plots showing the trajectory of SGD in the noisy setting. The arrow of time goes from top left to bottom right. We see that the two variance reduction methods (time averaging and decaying step-sizes) converge towards θ∗ (confirming Propositions 4.10 and 4.11), while plain SGD has a stationary distribution with certain fluctuations around its mean θ∗, as explained in Theorem 4.6 and Proposition 4.7. Plain SGD reaches its invariant distribution faster than the variance reduction methods, as shown by the convergence rates provided in the results.
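The qualitative behaviour of Figure 3 can be checked with a short simulation; in the sketch below (sizes, noise level, step-sizes and horizons are arbitrary choices), plain SGD stalls at a noise floor around θ∗, while Polyak-Ruppert averaging and a 1/√t step-size decay keep improving.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 500, 20                                            # underparametrized: no interpolator, noisy regime
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true + 0.5 * rng.standard_normal(n)         # noisy labels
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]         # ordinary least-squares estimator

def run(mode, gamma0=0.005, n_steps=200000):
    theta, theta_bar = np.zeros(d), np.zeros(d)
    for t in range(1, n_steps + 1):
        gamma = gamma0 / np.sqrt(t) if mode == "decay" else gamma0
        i = rng.integers(n)
        theta -= gamma * (X[i] @ theta - y[i]) * X[i]
        theta_bar += (theta - theta_bar) / t              # running time-average (Polyak-Ruppert)
    est = theta_bar if mode == "average" else theta
    return np.linalg.norm(est - theta_star) ** 2

for mode in ["plain", "average", "decay"]:
    print(mode, run(mode))                                 # averaging / decay get closer to theta_star
```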
Heavy-tails or not? Recall that at any time t > 0 all the moments of the law of θt exist (provided the initial
distribution had such moments). Yet, depending on the step-size, moments of the invariant distribution
might not exist. More precisely, we show that if one fixes a step-size γ > 0, all moments of the invariant distribution up to a certain value α(γ) exist and all higher moments do not. In other words, the step-size directly controls the tail of the asymptotic distribution.
Proposition 4.9. For n ≥ 2d and a fixed γ < 1/(3K), there exists α > 0 large enough such that

E(∥Θ∗∥^α) = +∞.
The proof of this result can be found in the Appendix, Section A.2.2. This result is in accordance with the fact that multiplicative noise in SGD induces heavy tails of its asymptotic distribution [33, 12, 34].
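As a numerical illustration, one can run the noisy recursion for a long time and apply a Hill-type estimator to ∥θ − θ∗∥ after burn-in; everything in the sketch below (burn-in, thinning, number of order statistics, and the two step-sizes) is an ad-hoc choice, but a larger step-size should yield a smaller estimated tail index.

```python
import numpy as np

def hill_estimator(samples, k):
    """Classical Hill estimator of the tail index from the k largest samples."""
    s = np.sort(samples)
    top = s[-k:]
    return k / np.sum(np.log(top / s[-k - 1]))

rng = np.random.default_rng(6)
n, d = 50, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + rng.standard_normal(n)     # no exact interpolator (n > d)
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]

def tail_of_norms(gamma, n_steps=500000, burn_in=100000, thin=50):
    theta, norms = np.zeros(d), []
    for t in range(n_steps):
        i = rng.integers(n)
        theta -= gamma * (X[i] @ theta - y[i]) * X[i]
        if t >= burn_in and t % thin == 0:
            norms.append(np.linalg.norm(theta - theta_star))
    return hill_estimator(np.array(norms), k=200)

for gamma in [0.003, 0.015]:
    print(gamma, tail_of_norms(gamma))                       # heavier tail (smaller index) for larger gamma
```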
This result of polynomial decay when the step-size decays polynomially is similar to the one provided in the work of [18], and is re-proven here for the sake of completeness. For both results, the proofs are postponed to the Appendix, Section A.2.3. All results on convergence are illustrated in Figure 3.
In the previous section, on SGD in the empirical setting, we have shown that the core of the qualitative description relies on whether the noise can cancel or not. The situation is similar for online SGD. In fact, most of the calculations that we have shown so far transfer almost immediately to this setting. This is the reason why we only show the results concerning the convergence in the noisy and the noiseless settings and describe briefly which properties are similar to the ones exhibited in the previous section.
First, we recall the main equation governing the dynamics (12):

dθt = −Eρ[(⟨θt, X⟩ − Y) X] dt + √γ σ(θt)dBt,

where (Bt)t≥0 is a Brownian motion of Rd and σ ∈ Rd×d is given by

σ(θ) := ( Eρ[rX(θ)² XX⊤] − Eρ[rX(θ)X] Eρ[rX(θ)X]⊤ )^{1/2} ∈ Rd×d.

Let us define Σ = Eρ[XX⊤] ∈ Rd×d the covariance matrix that we assume invertible, and θ∗ = Σ^{−1}Eρ[Y X] ∈ Rd. We have that for all θ ∈ Rd, Eρ[(⟨θ, X⟩ − Y) X] = Σ(θ − θ∗), and hence the dynamics writes:

dθt = −Σ(θt − θ∗)dt + √γ σ(θt)dBt.
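The noise factor σ(θ) can be approximated numerically by plain Monte Carlo followed by a symmetric matrix square root; the sketch below uses an arbitrary well-specified Gaussian data model purely for illustration.

```python
import numpy as np

def sigma_of_theta(theta, sample, n_mc, rng):
    """Monte Carlo estimate of
    sigma(theta) = ( E[ r^2 X X^T ] - E[ r X ] E[ r X ]^T )^{1/2},  with r = <theta, X> - Y."""
    d = theta.shape[0]
    m2, m1 = np.zeros((d, d)), np.zeros(d)
    for _ in range(n_mc):
        x, y = sample(rng)
        r = x @ theta - y
        m2 += r * r * np.outer(x, x)
        m1 += r * x
    m2, m1 = m2 / n_mc, m1 / n_mc
    cov = m2 - np.outer(m1, m1)
    w, V = np.linalg.eigh(cov)                       # symmetric PSD square root via eigendecomposition
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

d = 5
Sigma = np.diag(1.0 / np.arange(1, d + 1))
theta_star = np.ones(d)

def sample(rng):
    x = rng.standard_normal(d) * np.sqrt(np.diag(Sigma))
    return x, x @ theta_star + 0.3 * rng.standard_normal()   # well-specified noisy linear model

rng = np.random.default_rng(7)
S = sigma_of_theta(np.zeros(d), sample, n_mc=20000, rng=rng)
print(np.round(S, 2))                                # sigma(theta) does not vanish: noisy regime
```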
Despite the difference between the empirical and population learning setups, the story here is similar to what is described in Section 4. Indeed, the core of the behavior of the algorithm relies only on whether θ∗ also cancels the noise matrix σ, i.e. σ(θ∗) = 0. If this is the case, i.e. we are in the interpolation (or noiseless) regime, then the same analysis as in Subsection 4.1 applies and the movement resembles a multivariate geometric Brownian motion. If uniformly σ(θ) ≻ 0, then there is a non-degenerate invariant measure and the same results as in Subsection 4.2 go through.
Hence, as in the case of the training dynamics, we have that for all t ≥ 0,

E[∥θt − θ∗∥²] ≤ min( ∥θ0 − θ∗∥² e^{−µt}, inf_{α∈R+} ( 1/(∥θ0 − θ∗∥^{−2/α} + Cα t) )^α ),

and, if µ is non-zero but very small (e.g. 10^{−10}), this inequality describes well the difference between the transient regime of convergence, which is polynomial, and the asymptotic regime, which is exponential but occurring after the time-scale 1/µ. We recalled here the result as in the previous section for the sake of completeness, but the result and the proof of the theorem (which can be found in Appendix, Section B.1) are rather similar to the training case. This shows that what really matters is the ability of the model to be interpolated or not, and not the finite sample size.
Example 5.2 (Gaussian model). Assume that X ∼ N(0, Σ), with Σ ⪰ µId, and that there exist θ∗ ∈ Rd and ξ ∈ R, a random variable independent of X with mean zero and variance 2σ², such that Y = ⟨θ∗, X⟩ + ξ. Then, by a fourth order Gaussian moment calculation given in [14, Lemma 1], we have an exact expression for the noise covariance σ(θ)², as well as L(θ) ≥ σ².
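The exact expression itself did not survive extraction; as a hedged reconstruction (to be checked against [14, Lemma 1]), the standard Gaussian fourth-moment identity E[⟨a, X⟩² XX⊤] = ⟨a, Σa⟩Σ + 2Σaa⊤Σ for X ∼ N(0, Σ) suggests, with η := θ − θ∗,

σ(θ)² = E[(⟨η, X⟩ − ξ)² XX⊤] − Ση η⊤Σ = Ση η⊤Σ + (⟨η, Ση⟩ + 2σ²) Σ = Ση η⊤Σ + 2L(θ) Σ ⪰ 2µ L(θ) Id,

since 2L(θ) = E[(⟨θ, X⟩ − Y)²] = ⟨η, Ση⟩ + 2σ², which also yields L(θ) ≥ σ².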
Semi-group and Fokker-Planck equation. As for the training setup, due to the non-degeneracy of the
noise term in this case, it is more convenient to track the dynamics of the probability measure ρt := Law(θt )
for any t ≥ 0. We place ourselves in the situation ρ0 ∈ P2 (Rd ) and recall that (θt )t≥0 satisfies the (parabolic)
Fokker-Planck equation (at least in the weak sense) ∂t ρt = L∗ ρt , with the operator L∗ being the adjoint (with
respect to the canonical dot product of L2 (Rd )) of the infinitesimal generator of the dynamics that writes,
(25)   (Lf)(θ) = −⟨Σ(θ − θ∗), ∇f(θ)⟩ + (γ/2) Σ_{i,j=1}^d [σ(θ)σ(θ)⊤]_{ij} ∂²_{ij} f(θ),
for all test functions f : Rd → R sufficiently smooth. Recall furthermore that the evolution of expectations of observables (f(θt))t≥0 is given by the action of the generator:

(26)   (d/dt) E[f(θt)] = E[(Lf)(θt)].
5.2.1. Invariant measure and convergence.
As the multiplicative noise does not cancel, L∗ is uniformly elliptic and there is existence and uniqueness
of the invariant measure ρ∞ of the SDE, which satisfies the PDE L∗ ρ∞ = 0. Moreover, by ergodicity,
the law of the iterates eventually converges towards this unique invariant measure. In order to show this quantitatively, we first prove a useful lemma on the Lipschitz behavior of the multiplicative noise:
Lemma 5.3. There exists a constant c > 0, depending only on the distribution ρ, such that for all θ, η ∈ Rd,
we have
(27) ∥σ(θ) − σ(η)∥2HS ≤ 2cK⟨Σ(θ − η), θ − η⟩ .
The proof is in the Appendix, Section B.2. We are now ready to state the main theorem of the section.
Indeed, in the following result we also provide a quantitative statement, in Wasserstein distance, on the speed
of convergence of the dynamics towards such a measure.
Theorem 5.4. Let (θt)t≥0 follow the dynamics given by Eq.(12) initialized at θ0 ∈ Rd. Then, for γ < 1/(Kc), there exists a unique stationary distribution ρ∗ ∈ P2(Rd), and quantitatively,
(i) Parametric rate. For all t ≥ 0, we have that

(28)   W2²(ρt, ρ∗) ≤ W2²(ρ0, ρ∗) e^{−2µ(1−γKc)t},

where µ > 0 is the smallest non-zero eigenvalue of Σ.
(ii) Non-parametric rate. Assume that we have the inequality, for all α > 0,

(29)   ∥Σ^{−α/2}(σ(θ) − σ(η))∥²_HS ≤ 2cα Kα ⟨Σ(θ − η), θ − η⟩,

then, for all t ≥ 0 and all α > 0, we have

(30)   W2²(ρt, ρ∗) ≤ ( 1 / (W2(ρ0, ρ∗)^{−2/α} + Cα t) )^α,

with Cα = (2(1 − γcK)/α) ( E[⟨θ0 − Θ∗, Σ^{−α}(θ0 − Θ∗)⟩] + (γcα Kα/(1 − γcK)) W2²(ρ0, ρ∗) )^{−1/α}, where the expectation is taken w.r.t. the optimal Wasserstein coupling between θ0 ∼ ρ0 and Θ∗ ∼ ρ∗.
As before, the result is similar to the one of the underparametrized regime for the training dynamics. This shows once again that what really matters is the ability of the model to be interpolated or not, and not the finite sample size. The proof of the theorem can be found in Appendix, Section B.2.
In this article, we have shown how SGD can be efficiently modeled by an SDE that reflects its main qualitative and quantitative features: convergence speed, difference between the noisy and noiseless settings, and study of the asymptotic distribution. The specificity of the least-squares set-up enabled us to show some localization of the invariant measure ρ∗ in the noisy context; however, it seems possible to improve this understanding towards its precise shape, and some questions remain: is ρ∗ log-concave? Is it possible to characterize its covariance in order to better apprehend its shape? Also, regarding its heavy-tail behavior, it would be valuable to have a precise estimate of the exponent beyond which the moments of the stationary distribution explode.
This work has been done in order to convey the idea that the SDE framework can improve the under-
standing of the SGD dynamics. This is clear for the least-squares setup, yet the important question is to go
beyond this and try to apply the same methodology to the non-convex dynamics arising from the training of
non-linear neural networks. The study of single or multi-index models [26, 47] could be a first step toward
broadening this systematic study.
Acknowledgments. AS acknowledges support by the Deutsche Forschungsgemeinschaft (DFG, German
Research Foundation) under Germany’s Excellence Strategy – GZ 2047/1, Projekt-ID 390685813. AS and
LP warmly thank the Simons Foundation and especially the Flatiron Institute for its support, as this research was initiated while AS was visiting New York. Finally, the authors extend their gratitude to
the Incubateur de Fraîcheur for hosting them and providing an ideal atmosphere that fostered exceptional
discussions.
REFERENCES
[1] Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and dynamics of stochastic gradient algorithms i:
Mathematical foundations. The Journal of Machine Learning Research, 20(1):1474–1520, 2019. 1, 2, 5
[2] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, 22:400–407, 1951. 1, 4
[3] Gilles Pagès. Sur quelques algorithmes récursifs pour les probabilités numériques. ESAIM: Probability and Statistics, 5:141–
170, 2001. 1
[4] Michel Benaïm. Dynamics of stochastic approximation algorithms. In Seminaire de probabilites XXXIII, pages 1–68.
Springer, 2006. 1, 5
[5] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems,
2008. 1, 4
[6] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires
rethinking generalization. Communications of the ACM, 64(3):107–115, 2021. 1
[7] Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with sgd. arXiv preprint arXiv:1802.08770, 2018.
1
[8] Maksym Andriushchenko, Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. Sgd with large step sizes
learns sparse features. In International Conference on Machine Learning, pages 903–925. PMLR, 2023. 1
[9] Jeff Z HaoChen, Colin Wei, Jason Lee, and Tengyu Ma. Shape matters: Understanding the implicit bias of the noise covari-
ance. In Conference on Learning Theory, pages 2315–2357. PMLR, 2021. 1
[10] Zeke Xie, Issei Sato, and Masashi Sugiyama. A diffusion theory for deep learning dynamics: Stochastic gradient descent
exponentially favors flat minima. arXiv preprint arXiv:2002.03495, 2020. 1
[11] Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for
deep networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–10. IEEE, 2018. 1
[12] Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu. The heavy-tail phenomenon in sgd. In International Conference on
Machine Learning, pages 3964–3975. PMLR, 2021. 1, 2, 12
[13] Stephan Wojtowytsch. Stochastic gradient descent with noise of machine learning type part i: Discrete time analysis. Journal
of Nonlinear Science, 33(3):45, 2023. 1, 5
[14] R. Berthier, F. Bach, and P. Gaillard. Tight nonparametric convergence rates for stochastic gradient descent under the noiseless
linear model. In Advances in Neural Information Processing Systems, 2020. 1, 2, 4, 8, 14
[15] Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. Last iterate convergence of sgd for least-squares in
the interpolation regime. Advances in Neural Information Processing Systems, 34:21581–21591, 2021. 1
[16] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural net-
works. Advances in neural information processing systems, 31, 2018. 1
[17] Zhiyuan Li, Sadhika Malladi, and Sanjeev Arora. On the validity of modeling sgd with stochastic differential equations (sdes).
Advances in Neural Information Processing Systems, 34:12712–12725, 2021. 2
[18] Xavier Fontaine, Valentin De Bortoli, and Alain Durmus. Convergence rates and approximation results for sgd and its
continuous-time counterpart. In Conference on Learning Theory, pages 1965–2058. PMLR, 2021. 2, 13
[19] Alnur Ali, Edgar Dobriban, and Ryan Tibshirani. The implicit regularization of stochastic gradient flow for least squares. In
International conference on machine learning, pages 233–244. PMLR, 2020. 2, 4, 6
[20] Liu Ziyin, Kangqiao Liu, Takashi Mori, and Masahito Ueda. Strength of minibatch noise in sgd. arXiv preprint
arXiv:2102.05375, 2021. 2
[21] Lei Wu, Chao Ma, et al. How sgd selects the global minima in over-parameterized learning: A dynamical stability perspective.
Advances in Neural Information Processing Systems, 31, 2018. 2
[22] Scott Pesme, Loucas Pillaud-Vivien, and Nicolas Flammarion. Implicit bias of sgd for diagonal linear networks: a provable
benefit of stochasticity. Advances in Neural Information Processing Systems, 34:29218–29230, 2021. 2, 5
[23] Loucas Pillaud-Vivien, Julien Reygner, and Nicolas Flammarion. Label noise (stochastic) gradient descent implicitly solves
the lasso for quadratic parametrisation. In Conference on Learning Theory, pages 2127–2159. PMLR, 2022. 2
[24] Takashi Mori, Liu Ziyin, Kangqiao Liu, and Masahito Ueda. Power-law escape rate of sgd. In International Conference on
Machine Learning, pages 15959–15975. PMLR, 2022. 2
[25] Stephan Wojtowytsch. Stochastic gradient descent with noise of machine learning type part ii: Continuous time analysis.
Journal of Nonlinear Science, 34(1):16, 2024. 2
[26] Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. High-dimensional limit theorems for sgd: Effective dynamics and
critical scaling. Advances in Neural Information Processing Systems, 35:25349–25362, 2022. 2, 16
[27] Courtney Paquette, Elliot Paquette, Ben Adlam, and Jeffrey Pennington. Homogenization of sgd in high-dimensions: Exact
dynamics and generalization properties. arXiv preprint arXiv:2205.07069, 2022. 2
[28] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational
Mathematics, 7(3):331–368, 2007. 2
[29] A. Dieuleveut and F. Bach. Nonparametric stochastic approximation with large step-sizes. Annals of Statistics, 44(4):1363–
1399, 2016. 2, 8
[30] L. Pillaud-Vivien, A. Rudi, and F. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through
multiple passes. Advances in Neural Information Processing Systems, 31:8114–8124, 2018. 2
[31] Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Generalization error rates in kernel regression: The
crossover from the noiseless to noisy regime. Advances in Neural Information Processing Systems, 34:10131–10143, 2021. 2
[32] Blake Bordelon and Cengiz Pehlevan. Learning curves for sgd on structured features. arXiv preprint arXiv:2106.02713, 2021.
2
[33] Liam Hodgkinson and Michael Mahoney. Multiplicative noise and heavy tails in stochastic optimization. In International
Conference on Machine Learning, pages 4262–4274. PMLR, 2021. 2, 12
[34] Zhe Jiao and Martin Keller-Ressel. Emergence of heavy tails in homogenized stochastic gradient descent. arXiv preprint
arXiv:2402.01382, 2024. 2, 12
[35] S. Vaswani, F. Bach, and M. Schmidt. Fast and faster convergence of sgd for over-parameterized models and an accelerated
perceptron. In International Conference on Artificial Intelligence and Statistics, pages 1195–1204. PMLR, 2019. 4
[36] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review,
60(2):223–311, 2018. 4
[37] J Harold, G Kushner, and George Yin. Stochastic approximation and recursive algorithm and applications. Application of
Mathematics, 35, 1997. 5
[38] Rafail Khasminskii. Stochastic stability of differential equations, volume 66. Springer Science & Business Media, 2011. 5, 7,
10, 23
[39] Julie Lyng Forman and Michael Sørensen. The pearson diffusions: A class of statistically tractable diffusion processes.
Scandinavian Journal of Statistics, 35(3):438–465, 2008. 7
[40] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of
the National Academy of Sciences, 117(48):30063–30070, 2020. 8
[41] Peter L Bartlett, Andrea Montanari, and Alexander Rakhlin. Deep learning: a statistical viewpoint. Acta numerica, 30:87–201,
2021. 8
[42] Junhong Lin and Lorenzo Rosasco. Optimal learning for multi-pass stochastic gradient methods. Advances in Neural
Information Processing Systems, 29, 2016. 8
[43] Avner Friedman. Partial differential equations of parabolic type. Courier Dover Publications, 2008. 10
[44] Max-K von Renesse and Karl-Theodor Sturm. Transport inequalities, gradient estimates, entropy and ricci curvature.
Communications on pure and applied mathematics, 58(7):923–940, 2005. 11
[45] Martin Hairer. Convergence of markov processes. Lecture notes, 18:26, 2010. 11
[46] Patrick Cattiaux and Arnaud Guillin. A journey with the integrated γ 2 criterion and its weak forms. In Geometric Aspects of
Functional Analysis: Israel Seminar (GAFA) 2020-2022, pages 167–208. Springer, 2023. 11
[47] Alberto Bietti, Joan Bruna, and Loucas Pillaud-Vivien. On learning gaussian multi-index models with gradient flow. arXiv
preprint arXiv:2310.19793, 2023. 16
[48] Cédric Villani et al. Optimal transport: old and new, volume 338. Springer, 2009. 23, 34
[49] Jonathan C Mattingly, Andrew M Stuart, and Michael V Tretyakov. Convergence of numerical time-averaging and stationary
measures via poisson equations. SIAM Journal on Numerical Analysis, 48(2):552–577, 2010. 29
A.1: The noiseless case. This includes the proof of the fact that θ∗ is the orthogonal projection of θ0 onto I
(Lemma 4.2) and the proof of the convergence theorem of θ to θ∗ (Theorem 4.3).
A.2: The noisy case. We prove here the existence of a stationary distribution (Lemma 4.5).
A.2.1: Invariant measure and convergence. In this subsection, we prove the quantitative convergence to the stationary distribution (Theorem 4.6) with the help of a technical lemma (Lemma A.1).
A.2.2: Localization of the invariant measure. We prove the insights given on the first and second mo-
ments of the stationary distribution (Proposition 4.7). Then, we show the moment explosion of the invariant
distribution (Proposition 4.9).
A.2.3: Convergence of variance reduction techniques. In this subsection, we prove the convergence of
the time-average of the iterates to θ∗ , i.e. ergodicity (Proposition 4.10) as well as the convergence of θt to
θ∗ in the case of the step-size decay (Proposition 4.11).
B.1: The noiseless case. We give the proof of the convergence to θ∗ with rates (Theorem 5.1).
B.2: The noisy case. We first prove that the multiplicative noise carries some Lipschitz property (Lemma 5.3). This allows us to prove the quantitative convergence to the stationary distribution (Theorem 5.4).
Lemma 4.2. The proof of the lemma follows from the Karush-Kuhn-Tucker conditions. Indeed, the argmin is unique, being the projection of θ0 onto an affine set, and it satisfies that there exist Lagrange multipliers λ ∈ Rn such that

θ∗ − θ0 = X⊤λ,   and   Xθ∗ = y.
We prove now the main convergence theorem of this section. This is Theorem 4.3.
Theorem 4.3. (i) The key ingredient of the proof is the Gronwall Lemma. Combining the Itô formula with (13) gives us

(d/dt) E∥θt − θ∗∥² = −2E⟨θt − θ∗, Σ(θt − θ∗)⟩ + (γ/n) E Tr( XX⊤Rx²(θt) )
   ≤ −2E⟨θt − θ∗, Σ(θt − θ∗)⟩ + (γ/n) E Σ_{i=1}^n ∥xi∥²(⟨θt, xi⟩ − yi)²
   ≤ −2E⟨θt − θ∗, Σ(θt − θ∗)⟩ + γK E⟨θt − θ∗, Σ(θt − θ∗)⟩
   ≤ −(2 − γK) E⟨θt − θ∗, Σ(θt − θ∗)⟩.
By integrating the latter, we get

E∥θt − θ∗∥² = ∥θ0 − θ∗∥² − (2/n) ∫_0^t E∥X(θu − θ∗)∥² du + (γ/n) ∫_0^t E Tr[ X⊤( diag(rx(θu))² − (1/n) rx(θu)rx(θu)⊤ )X ] du.

Note that

(γ/n) ∫_0^t E Tr[ X⊤ diag(rx(θu))² X ] du = (γ/n) ∫_0^t E Σ_{i=1}^n rx(θu)i² (XX⊤)ii du
   = (γ/n) ∫_0^t E Σ_{i=1}^n rx(θu)i² ∥xi∥² du
   ≤ (γ/n) ∫_0^t E[ max_i ∥xi∥² Σ_{i=1}^n rx(θu)i² ] du
   ≤ 2γK ∫_0^t E L(θu) du,

and

(γ/n²) ∫_0^t E Tr[ X⊤ rx(θu) rx(θu)⊤ X ] du = (γ/n²) ∫_0^t E Tr[ rx(θu)⊤ XX⊤ rx(θu) ] du
   ≥ λmin(Σ) (γ/n) ∫_0^t E[ rx(θu)⊤ rx(θu) ] du = 2γλmin(Σ) ∫_0^t E L(θu) du.

Collecting the estimates, we obtain

E∥θt − θ∗∥² ≤ ∥θ0 − θ∗∥² − (4 + 2γλmin(Σ) − 2γK) ∫_0^t E L(θu) du.
We remark that θu − θ∗ ∈ Ran(X⊤). Indeed, we have

θu − θ∗ = (θu − θ0) + (θ0 − θ∗),

where the first term on the r.h.s. is in Ran(X⊤) by (13) and the second term also by (14). We recall that Rd = Ran(X⊤) ⊕ Ker(X) and therefore note that X|Ran(X⊤) is a bijection onto its image. Combining the two last facts yields

(λmin(Σ)/2) E∥θu − θ∗∥² ≤ (1/2n) E∥X(θu − θ∗)∥² = E L(θu),

and the first statement of the Theorem follows with the Gronwall Lemma. We move to the proof of the second statement of the Theorem.
(ii) We define ηt := θt − θ∗ and recall that

E∥ηt∥² ≤ ∥θ0 − θ∗∥² − 2(2 − γK) ∫_0^t E L(θu) du.

We want to lower bound the term E L(θu) without using the smallest eigenvalue of Σ, which may be arbitrarily small. First note that X⊤X is symmetric and thus diagonalizable in an orthonormal basis (vi). We readily check that Ker(X⊤X) = Ker(X) by invertibility of XX⊤. We thus have Rd = Ran(X⊤) ⊕ Ker(X⊤X). We denote by λ1 ≤ · · · ≤ λd∗ the non-zero eigenvalues of X⊤X, where d∗ ≤ d is the number of non-zero eigenvalues, and by (v1, . . . , vd∗) the corresponding eigenvectors. For all t > 0, we thus have the decomposition ηt = Σ_{k=1}^{d∗} ηkt vk.
Thanks to the Hölder inequality, we claim that the following holds true: let p, q ∈ (0, 1) with p + q = 1; then, for all t ≥ 0,

(32)   (E∥ηt∥²)^{1/p} ≤ 2E[L(θt)] (E⟨ηt, Σ^{−p/q}ηt⟩)^{q/p}.

We move to the proof of the above claim. To lighten the notation, we write ηkt = ηk in the following, to wit

E∥ηt∥² = E Σ_{k=1}^{d∗} ηk² = Σ_{k=1}^{d∗} E[ (ηk² λk)^p (ηk² λk^{−p/q})^q ] ≤ Σ_{k=1}^{d∗} (E[ηk² λk])^p (E[ηk² λk^{−p/q}])^q,

thanks to the Hölder inequality w.r.t. the expectation. Then, applying once again the Hölder inequality for the sum, we have

E∥ηt∥² ≤ ( Σ_{k=1}^{d∗} E[ηk² λk] )^p ( Σ_{k=1}^{d∗} E[ηk² λk^{−p/q}] )^q = (E[2L(θ)])^p (E⟨ηt, Σ^{−p/q}ηt⟩)^q,
and the claim (32) follows. We now prove that t ↦ E⟨ηt, Σ^{−p/q}ηt⟩ is bounded. Indeed, we proceed as before with the Itô formula to get

(d/dt) E⟨ηt, Σ^{−p/q}ηt⟩ = −2E⟨Σηt, Σ^{−p/q}ηt⟩ + (γ/n) E Tr[ (X⊤Rx(θt))⊤ Σ^{−p/q} X⊤Rx(θt) ]
   = −2E⟨ηt, Σ^{1−p/q}ηt⟩ + (γ/n) E Tr[ Σ^{−p/q} X⊤( diag(rx(θt))² − (1/n) rx(θt)rx(θt)⊤ )X ]
   ≤ (γ/n) E Σ_{i=1}^n ⟨xi, Σ^{−p/q}xi⟩(⟨θt, xi⟩ − yi)²
   ≤ 2γKp/q E L(θt),

where we have used that for all i ∈ J1, nK, we have ⟨xi, Σ^{−p/q}xi⟩ ≤ Kp/q. Then, by integrating with respect to t, it yields

E⟨ηt, Σ^{−p/q}ηt⟩ ≤ ⟨η0, Σ^{−p/q}η0⟩ + 2γKp/q ∫_0^t E L(θu) du ≤ ⟨η0, Σ^{−p/q}η0⟩ + (γKp/q/(2 − γK)) ∥θ0 − θ∗∥².
Hence, calling C = ½ (⟨η0, Σ^{−p/q}η0⟩ + (γKp/q/(2 − γK)) ∥θ0 − θ∗∥²)^{−q/p}, we have the inequality, for all t ≥ 0,

(33)   E L(θt) ≥ C (E∥ηt∥²)^{1/p},

which implies, from a slight modification of the Gronwall Lemma, that for all t ≥ 0,

E∥ηt∥² ≤ ( 1 / (∥η0∥^{−2(1/p−1)} + (1/p − 1)Ct) )^{1/(1/p−1)},

which gives the result claimed in the theorem. To see how the last inequality goes, we define g(t) = ∥η0∥² − C ∫_0^t (E∥ηu∥²)^{1/p} du (which is positive) and we rewrite (33) as

g′(t)/(−C) ≤ g(t)^{1/p} ⟺ g′(t)/g(t)^{1/p} ≥ −C ⟹ (1/(−1/p + 1)) (g(t)^{−1/p+1} − g(0)^{−1/p+1}) ≥ −Ct,

and thus

g(t) ≤ ( −Ct(−1/p + 1) + ∥η0∥^{2(−1/p+1)} )^{1/(−1/p+1)},

and we deduce that ∥θt − θ∗∥² is a positive supermartingale, bounded in L¹, and therefore converges almost surely to 0. ■
Lemma 4.5. First, as L is a quadratic function, we have that for all θ ∈ Rd,

⟨∇L(θ), θ − θ∗⟩ = 2(L(θ) − L(θ∗)).

We recall that X⊤Xθ∗ = X⊤y, and it follows that ∇L(θ) = Σ(θ − θ∗). We thus deduce that for all θ ∈ Rd,

LV(θ) = −⟨Σ(θ − θ∗), θ − θ∗⟩ + (γ/2n) Tr[ X⊤Rx²(θ)X ]
   = 2(L(θ∗) − L(θ)) + (γ/2n) Tr[ X⊤Rx²(θ)X ]
   ≤ 2(L(θ∗) − L(θ)) + (γ/2n) Σ_{i=1}^n (⟨xi, θ⟩ − yi)² ∥xi∥²
   ≤ 2( L(θ∗) − L(θ)(1 − γK/2) ).

This Lyapunov inequality implies the existence of a stationary distribution as explained in [38, Theorem 3.7]. It also implies that, for all t ≥ 0, the dynamics does not explode. ■
A.2.1. Invariant measure and convergence.
We now turn to proving the main theorem of this part on the quantitative convergence to the stationary distribution.
Theorem 4.6. (i) The Wasserstein contraction comes essentially from coupling arguments. Let γ ≤ K^{−1} and ρ01, ρ02 ∈ P2(Rd) be two possible initial distributions. Then, by [48, Theorem 4.1], there exists a couple of random variables (θ01, θ02) such that W2²(ρ01, ρ02) = E∥θ01 − θ02∥². Let (θt1)t≥0 (resp. (θt2)t≥0) be the solution of the SDE (13) started from θ01 (resp. θ02), sharing the same Brownian motion (Bt)t≥0. Then, for all t ≥ 0, the random variable (θt1, θt2) is a coupling between ρt1 and ρt2, and hence

W2²(ρt2, ρt1) ≤ E∥θt1 − θt2∥².

Moreover, we denote by ∥·∥HS the Frobenius norm, and by the Itô formula, we have

(d/dt) E∥θt1 − θt2∥² = −(2/n) E⟨θt1 − θt2, X⊤(Xθt1 − y) − X⊤(Xθt2 − y)⟩ + (γ/n) E∥X⊤Rx(θt1) − X⊤Rx(θt2)∥²_HS
   = −(2/n) E∥X(θt1 − θt2)∥² + (γ/n) E∥X⊤(Rx(θt1) − Rx(θt2))∥²_HS
   ≤ −(2/n) E∥X(θt1 − θt2)∥² + (2γ/n) E[ ∥X⊤diag(⟨θt1 − θt2, xi⟩i)∥²_HS + (1/n²)∥X⊤(⟨θt1 − θt2, xi⟩)i 1⊤∥²_HS ].
Furthermore, we have for all θ1, θ2 ∈ Rd that

∥X⊤diag(⟨θ1 − θ2, xi⟩i)∥²_HS = Tr[ XX⊤ diag(⟨θ1 − θ2, xi⟩²i) ] = Σ_{i=1}^n ∥xi∥² ⟨θ1 − θ2, xi⟩² ≤ K∥X(θ1 − θ2)∥²,

and

(1/n²) ∥X⊤(⟨θ1 − θ2, xi⟩)i 1⊤∥²_HS = (1/n) Tr[ XX⊤ (⟨θ1 − θ2, xi⟩)i (⟨θ1 − θ2, xi⟩)i⊤ ]
   ≤ (1/n) Tr(XX⊤) Tr[ (⟨θ1 − θ2, xi⟩)i (⟨θ1 − θ2, xi⟩)i⊤ ]
   ≤ K∥X(θ1 − θ2)∥²,

where we use, from the first to the second line, the inequality Tr(AB) ≤ Tr(A)Tr(B) for any A and B positive semi-definite. Altogether, this gives the inequality:

(34)   (d/dt) E∥θt1 − θt2∥² ≤ −((2 − 4γK)/n) E∥X(θt1 − θt2)∥²
(35)      ≤ −2µ(1 − 2γK) E∥θt1 − θt2∥²,

and by the Gronwall Lemma, denoting cγ = 2µ(1 − 2γK), this gives

W2²(ρt1, ρt2) ≤ E∥θt1 − θt2∥² ≤ e^{−cγt} E∥θ01 − θ02∥² = e^{−cγt} W2²(ρ02, ρ01).
Now, for all s ≥ 0 setting ρ10 = ρ0 ∈ P2 (Rd ) and ρ20 = ρs ∈ P2 (Rd ), we have for all t ≥ 0,
W22 (ρt , ρt+s ) ≤ e−cγ t W22 (ρ0 , ρs ) ,
which shows that the process (ρt )t≥0 is of Cauchy type, and since (P2 (Rd ), W2 ) is a Polish space, ρt →
ρ∗ ∈ P2 (Rd ) as t grows to infinity. Now, since there exists a stationary solution to the process, let us fix
ρ10 = ρ∗ ∈ P2 (Rd ) and ρ20 = ρ0 ∈ P2 (Rd ). We have then,
W22 (ρt , ρ∗ ) ≤ e−cγ t W22 (ρ0 , ρ∗ ) ,
which concludes the first part of the Theorem.
(ii) We will use the same steps as for the proof of Theorem 4.3 (ii). Again, one readily checks that for p, q ∈ (0, 1) with p + q = 1, we have, for all t ≥ 0,

(36)   (1/n) E∥X(θt1 − θt2)∥² ≥ (E∥θt1 − θt2∥²)^{1/p} / (E⟨θt1 − θt2, Σ^{−p/q}(θt1 − θt2)⟩)^{q/p},
the last line following from the fact that Tr(AB) ≤ Tr(A)Tr(B) for A and B positive semi-definite. We thus conclude that

(d/dt) E⟨θt1 − θt2, Σ^{−p/q}(θt1 − θt2)⟩ ≤ (4γ/n) Kp/q E∥X(θt1 − θt2)∥²,

where we have used that for all i ∈ J1, nK, we have ⟨xi, Σ^{−p/q}xi⟩ ≤ Kp/q. In addition, by (34), we get

∫_0^t E∥X(θu1 − θu2)∥² du ≤ (n/(2 − 4γK)) E∥θ01 − θ02∥².

Collecting the estimates, we thus get

(1/n) E∥X(θt1 − θt2)∥² ≥ C (E∥θt1 − θt2∥²)^{1/p},

with C = ( E⟨θ01 − θ02, Σ^{−p/q}(θ01 − θ02)⟩ + (2γKp/q/(1 − 2γK)) E∥θ01 − θ02∥² )^{−q/p}, which, combined with (34), gives

E∥θt1 − θt2∥² ≤ E∥θ01 − θ02∥² − C(2 − 4γK) ∫_0^t (E∥θu1 − θu2∥²)^{1/p} du,

which implies, from a slight modification of the Gronwall Lemma, that for all t ≥ 0 (with C̄ = 2C(1 − 2γK)), we have

(E∥θt1 − θt2∥²)^{1−1/p} ≤ (E∥θ01 − θ02∥²)^{1−1/p} + (1/p − 1)C̄t,

and we conclude as for (i). ■
Proposition 4.7. We proved that θt tends weakly to Θ∗ for γ ≤ 1/K. To ensure that the first and second moments of θt tend to those of Θ∗, we first prove that there exists M such that, for all t,

E[V(θt)] ≤ e^{−2(1−γK)t} ( ½ E∥θ0 − θ∗∥² − γKσ²/((1 − γK)µ) ) + γKσ²/((1 − γK)µ),

which, combined with the boundedness of the fourth moment and the weak convergence of ρt towards ρ∗, implies, by taking the limit in the inequality, that Θ∗ ∼ ρ∗ satisfies the claimed inequality in the proposition. ■
We now turn to proving the moment explosion of the invariant distribution. This corresponds to proving Proposition 4.9. First, we need the following lemma that aims at lower bounding some quartic form of θ:

Lemma A.1. For n ≥ 2d, there exists a constant c > 0 that depends only on X, y such that ∀η ∈ Rd,

⟨Xη, Rx² Xη⟩ ≥ c∥η∥²∥Xη∥².

We assume the validity of Lemma A.1 for the time being and turn to the proof of Proposition 4.9.
Proof of Proposition 4.9. Using Itô's Lemma, we obtain that

(d/dt) E[½∥ηt∥^{2α}] = −αE[∥ηt∥^{2(α−1)}⟨Σηt, ηt⟩] + (γα(α−1)/n) E[∥ηt∥^{2(α−2)} Tr( X⊤Rx²X ηtηt⊤ )] + (γα/(2n)) E[∥ηt∥^{2(α−1)} Tr( X⊤Rx²X )].

We suppose that ∥Θ∗∥ has moments of order 2α. We take η0 = Θ∗ and thus obtain

E[∥Θ∗∥^{2(α−1)}⟨ΣΘ∗, Θ∗⟩] ≥ (γ(α−1)/n) E[∥Θ∗∥^{2(α−2)} Tr( X⊤Rx²X Θ∗Θ∗⊤ )].

Using Lemma A.1, we deduce that

E∥Θ∗∥^{2α} ≥ c (µ/λ) γ(α−1) E∥Θ∗∥^{2α},

which leads to a contradiction for α large enough, and we thus deduce that ∥Θ∗∥ has no moments of order 2α for α > 1 + λ/(µcγ). ■
It remains to prove Lemma A.1. For ease of writing and clarity, we now denote the residual r = Xθ − y ∈ Rn and the constant noise vector σ = Xθ∗ − y ∈ Rn, and we write in this proof Rx² ≡ Rr² and Xη = r − σ to emphasize that everything here can be expressed as a function of the residuals r.
Proof of Lemma A.1. We first give the kernel of Rr² as a function of r. Let I = {i ∈ J1, nK, ri = 0} and αr = (1/r1, 1/r2, . . . , 1/rn)⊤; then
(i) if I ̸= ∅, Ker(Rr²) = span(ei)i∈I ;
(ii) if I = ∅, Ker(Rr²) = span(αr).
We fix r ∈ Rn and let z ∈ Ker(Rr²), that is

( diag(r)² − (1/n) r r⊤ ) z = 0,

i.e. for all i ∈ J1, nK,

ri² zi = (⟨r, z⟩/n) ri.

If I = ∅, we have the relationship, for all i, j ∈ J1, nK, ri zi = ⟨r, z⟩/n = rj zj, which instantly gives the result. Otherwise, if I ̸= ∅, for all i ∈ I^c we have ri zi = ⟨r, z⟩/n, and summing all these equalities over i ∈ I^c, we have

⟨r, z⟩ = Σ_{i∈I^c} ri zi = (n − |I|) ⟨r, z⟩/n,
From the Cauchy-Schwarz inequality, φ ∈ [0, 1]; let m = sup_{η∈Rd} φ(η). We show below that m < 1. First, let us study the situation at infinity. For this, we write η = η̄∥η∥, where η̄ = η/∥η∥ ∈ S^{d−1}, and call J = {i ∈ J1, nK such that ⟨xi, η̄⟩ = 0}. We know that |J| ≤ d. We fix η̄ and study the limit of φ as ∥η∥ grows. If J ̸= ∅, we have

φ(η) ∼_{∥η∥→∞} (n − |J|)² / ( Σ_{i∈J^c} ⟨xi, η⟩² · Σ_{i∈J} σi^{−2} ).

Hence, if J ̸= ∅, then φ(η) → 0 when ∥η∥ → ∞. Now if J = ∅,

φ(η) ∼_{∥η∥→∞} n² / ( (Σ_{i=1}^n ⟨xi, η̄⟩²)(Σ_{i=1}^n ⟨xi, η̄⟩^{−2}) ) =: ϕ(η̄).

Moreover, M := sup_{η̄∈S^{d−1}} ϕ(η̄) < 1. Indeed, if M = 1, the supremum would be attained at a certain η̄∗ by compactness of S^{d−1} and continuity of ϕ, so that ϕ(η̄∗) = 1; by the equality case of Cauchy-Schwarz, this corresponds to the existence of λ ∈ R such that |⟨xi, η̄∗⟩| = λ for all i, which has no solution for n > d (recall that, because η̄∗ ∈ S^{d−1}, λ = 0 offers no solution either). We conclude from this that M < 1.
Now, either the supremum of φ is M, in which case m = M < 1 and we are done with the proof; or m > M, in which case the supremum is attained in a compact set of Rd, hence attained at a certain η∗ ∈ Rd by continuity of φ. As previously, if m = 1, this corresponds to the equality case of Cauchy-Schwarz and hence to the existence of λ ∈ R such that

(⟨xi, η∗⟩ − σi)² ⟨xi, η∗⟩² = λ   for all i ∈ J1, nK,

which has no solution generically for n > d. Finally, m = 1 is impossible and we have proven that m < 1, which corresponds to c = 1 − m > 0.
Now, taking back Eq.(38), we remark that

⟨v, Ru² v⟩ = ⟨Πu v, Ru² Πu v⟩ = ∥Πu v∥² ⟨Π̄u v, Ru² Π̄u v⟩ ≥ c ⟨Π̄u v, Ru² Π̄u v⟩ ≥ c λmin( Ru²|_{Ker(Ru²)⊥} ).
Proposition 4.10. The idea we use to show the quantitative convergence of the averaged iterates comes from [49] and the use of the Poisson equation. Indeed, let us define the map φ : Rd → Rd, φ(θ) = Σ^{−1}θ, and denote by (φi(θ))i∈J1,dK its coordinates. Note that, for all i ∈ J1, dK, we have ∇φi(θ) = Σ^{−1}ei, where (ei)i∈J1,dK is the canonical basis of Rd. Hence, we have a Poisson equation for each coordinate; considering the action of L on vector fields of Rd applied coordinate-wise, we have Lφ(θ) = θ − θ∗. Thus, we have by Itô calculus that, for all s ≥ 0,

dφ(θs) = (θs − θ∗)ds + √(γ/n) Σ^{−1}X⊤Rx(θs)dBs,
n
Rt
and thus integrating from 0 to t and dividing by t, and defining the martingale Mt = − √1n X⊤ 0 Rx (θs )dBs
we have,
√ √
1 γ −1 Σ−1 (θt − θ0 ) γ −1
θ̄t − θ∗ = (φ(θt ) − φ(θ0 )) + Σ Mt = + Σ Mt .
t t t t
Hence, multiplying by Σ, taking norms and then expectations, we have
2E∥θt − θ0 ∥2 2γ
E∥Σ(θ̄t − θ∗ )∥2 ≤ + 2 E∥Mt ∥2 .
t2 t
30 STOCHASTIC DIFFERENTIAL EQUATIONS MODELS FOR LEAST-SQUARES STOCHASTIC GRADIENT DESCENT
Moreover, we have seen that in Lemma 4.5 that for γ ≤ K−1 , we have
d1
E∥θt − θ∗ ∥2 ≤ −EL(θt ) + 2σ 2 ,
dt 2
Rt
and hence we have the upper bound on the loss 0 E [L(θs )] ds ≤ 12 ∥θ0 − θ∗ ∥2 + 2σ 2 t. Finally, this gives
that
the second line by (37) for γ < 1/K. Overall, we have the bound,
8γKσ 2 10∥θ0 − θ∗ ∥2
E∥Σ(θ̄t − θ∗ )∥2 ≤ + .
t t2
■
Now, we turn to the convergence of θt in the case of step-size decay. This corresponds to the proof
of Proposition 4.11.
Theorem 5.1. (i) The key ingredient of the proof is the Gronwall Lemma. Combining the Itô formula with (12) gives us

(d/dt) E∥θt − θ∗∥² = −2E⟨θt − θ∗, Σ(θt − θ∗)⟩ + γE Tr( σ(θt)σ(θt)⊤ )
   ≤ −2E⟨θt − θ∗, Σ(θt − θ∗)⟩ + γE[ Tr( (⟨θt, X⟩ − Y)² XX⊤ ) ]
   ≤ −2E⟨θt − θ∗, Σ(θt − θ∗)⟩ + γK E⟨θt − θ∗, Σ(θt − θ∗)⟩
   ≤ −(2 − γK) E⟨θt − θ∗, Σ(θt − θ∗)⟩
   ≤ −µ(2 − γK) E∥θt − θ∗∥².
By integrating the latter thanks to the Gronwall Lemma, we get the result claimed in (i).
(ii) We define ηt := θt − θ∗ and recall that, thanks to the penultimate inequality of (i), we have by integration that

E∥ηt∥² ≤ ∥θ0 − θ∗∥² − 2(2 − γK) ∫_0^t E L(θu) du.

The first consequence of this inequality is that

∫_0^t E L(θu) du ≤ ∥θ0 − θ∗∥² / (2(2 − γK)).

We want to lower bound the term E L(θu) without using the smallest eigenvalue of Σ, which may be arbitrarily small. Similarly to what was done before, thanks to the Hölder inequality, if p, q ∈ (0, 1) with p + q = 1, we have, for all t ≥ 0,

(39)   (E∥ηt∥²)^{1/p} ≤ 2E[L(θt)] (E⟨ηt, Σ^{−p/q}ηt⟩)^{q/p}.
We now prove that t ↦ E⟨ηt, Σ^{−p/q}ηt⟩ is bounded. Indeed, we proceed as before with the Itô formula to get

(d/dt) E⟨ηt, Σ^{−p/q}ηt⟩ = −2E⟨ηt, Σ^{1−p/q}ηt⟩ + γE Tr( σσ⊤ Σ^{−p/q} )
   ≤ γE Tr( Σ^{−p/q} (⟨θt, X⟩ − Y)² XX⊤ )
   ≤ γE[ ⟨X, Σ^{−p/q}X⟩ (⟨θt, X⟩ − Y)² ]
   ≤ 2γKp/q E L(θt),

where we have used that, almost surely, ⟨X, Σ^{−p/q}X⟩ ≤ Kp/q. Then, by integrating with respect to t, it yields

E⟨ηt, Σ^{−p/q}ηt⟩ ≤ ⟨η0, Σ^{−p/q}η0⟩ + 2γKp/q ∫_0^t E L(θu) du ≤ ⟨η0, Σ^{−p/q}η0⟩ + (γKp/q/(2 − γK)) ∥θ0 − θ∗∥².

Hence, calling C = ½ (⟨η0, Σ^{−p/q}η0⟩ + (γKp/q/(2 − γK)) ∥θ0 − θ∗∥²)^{−q/p}, we have the inequality, for all t ≥ 0,

E L(θt) ≥ C (E∥ηt∥²)^{1/p},

and this yields the inequality

(40)   E∥ηt∥² ≤ ∥η0∥² − C ∫_0^t (E∥ηu∥²)^{1/p} du,

which implies, from a slight modification of the Gronwall Lemma, that for all t ≥ 0,

E∥ηt∥² ≤ ( 1 / (∥η0∥^{−2(1/p−1)} + (1/p − 1)Ct) )^{1/(1/p−1)},

which gives the result claimed in the theorem. To see how the last inequality goes, we define g(t) = ∥η0∥² − C ∫_0^t (E∥ηu∥²)^{1/p} du (which is positive) and we rewrite (40) as

g′(t)/(−C) ≤ g(t)^{1/p} ⟺ g′(t)/g(t)^{1/p} ≥ −C ⟹ (1/(−1/p + 1)) (g(t)^{−1/p+1} − g(0)^{−1/p+1}) ≥ −Ct,

and thus

g(t) ≤ ( −Ct(−1/p + 1) + ∥η0∥^{2(−1/p+1)} )^{1/(−1/p+1)},

and we conclude with (40).
To prove the almost sure convergence, we use the Itô formula to obtain

E[ ∥θt − θ∗∥² | Fs ] = ∥θ0 − θ∗∥² + 2√γ ∫_0^s ⟨θu − θ∗, σ(θu)dBu⟩ + E[ ∫_0^t ( γ Tr(σσ⊤(θu)) − 4L(θu) ) du | Fs ].

For γ < 1/(2K), we have proven in (i) that the integrand inside the conditional expectation is negative. We can thus overestimate the latter by integrating from 0 to s only, to obtain

E[ ∥θt − θ∗∥² | Fs ] ≤ ∥θs − θ∗∥²,

and we deduce that ∥θt − θ∗∥² is a positive supermartingale, bounded in L¹, and therefore converges almost surely to 0. ■
Let us introduce, for all θ ∈ Rd, the differential operator of σ, that is dσθ : Rd → Rd×d, such that for all h ∈ Rd,

σ(θ + h) = σ(θ) + dσθ(h) + o(∥h∥).

We calculate, for all θ, h ∈ Rd,

σ(θ + h) = (f(θ + h) − g(θ + h))^{1/2} = (f(θ) − g(θ) + lθ(h) + o(∥h∥))^{1/2},

where lθ(h) = dfθ(h) − dgθ(h) and

dfθ(h) = 2Eρ[ (⟨θ, X⟩ − Y) ⟨h, X⟩ XX⊤ ].

Hence, introducing the square root operator ψ : S+^d → S+^d such that, for all M ∈ S+^d, ψ(M) = M^{1/2}, we have, for all θ, h ∈ Rd,

dσθ(h) = dψ_{σ²(θ)}(lθ(h)),

that is, dσθ(h) is the unique solution to the matrix Lyapunov equation

σ(θ)dσθ(h) + dσθ(h)σ(θ) = lθ(h),

or equivalently, expressed in closed form,

dσθ(h) = ∫_0^{+∞} e^{−sσ(θ)} lθ(h) e^{−sσ(θ)} ds.
With this in hand, let us set, for t ∈ [0, 1], Ψ(t) = σ(tθ + (1 − t)η). We have

σ(θ) − σ(η) = Ψ(1) − Ψ(0) = ∫_0^1 Ψ′(u) du,

and, knowing that Ψ′(u) = dσ_{mu}(θ − η), where mu = uθ + (1 − u)η, the triangle inequality for the Hilbert-Schmidt norm gives

∥σ(θ) − σ(η)∥HS ≤ ∫_0^1 ∥dσ_{mu}(θ − η)∥HS du.

Hence, it remains to upper bound the Hilbert-Schmidt norm of the differential of σ thanks to its integral representation presented above. In fact, for all θ, h ∈ Rd, we have

∥dσθ(h)∥HS ≤ ∥lθ(h)∥ ∫_0^{+∞} ∥e^{−sσ(θ)}∥²_HS ds.

Moreover, we have ∥lθ(h)∥ ≤ ∥dfθ(h)∥ + ∥dgθ(h)∥ and, using that ∥·∥HS is sub-multiplicative, we have

∥dfθ(h)∥ ≤ 2E[ |⟨θ, X⟩ − Y| |⟨h, X⟩| ∥XX⊤∥ ] ≤ 2K √(E[|⟨θ, X⟩ − Y|²]) √(E[|⟨h, X⟩|²]) = 2√2 K √(L(θ)) √(⟨Σh, h⟩),

and similarly,

∥dgθ(h)∥ ≤ 2√2 K √(L(θ)) √(⟨Σh, h⟩).

Finally, we have the bound

∫_0^{+∞} ∥e^{−sσ(θ)}∥²_HS ds = ∫_0^{+∞} Tr( e^{−2sσ(θ)} ) ds = ½ Tr( σ^{−1}(θ) ).

As σ²(θ) ⪰ a²L(θ)Id, we deduce that

L(θ) Tr( σ^{−1}(θ) )² ≤ Tr( L(θ)σ^{−2}(θ) ) d ≤ d²/a²,

the first inequality by Cauchy-Schwarz, and Lemma 5.3 follows. ■
We can now turn to the proof of the main theorem of this section: the proof of convergence in the noisy
case. This corresponds to proving Theorem 5.4.
Theorem 5.4. (i) The Wasserstein contraction comes essentially from coupling arguments. Let γ ≤ K^{−1} and ρ01, ρ02 ∈ P2(Rd) be two possible initial distributions. Then, by [48, Theorem 4.1], there exists a couple of random variables (θ01, θ02) such that W2²(ρ01, ρ02) = E∥θ01 − θ02∥². Let (θt1)t≥0 (resp. (θt2)t≥0) be the solution of the SDE (12) started from θ01 (resp. θ02), sharing the same Brownian motion (Bt)t≥0. Then, for all t ≥ 0, the random variable (θt1, θt2) is a coupling between ρt1 and ρt2, and hence

W2²(ρt2, ρt1) ≤ E∥θt1 − θt2∥².

Moreover, we denote by ∥·∥HS the Frobenius norm, and by the Itô formula, we have

(d/dt) E∥θt1 − θt2∥² = −2E⟨θt1 − θt2, Σ(θt1 − θ∗) − Σ(θt2 − θ∗)⟩ + γE∥σ(θt1) − σ(θt2)∥²_HS
(41)   ≤ −2E⟨Σ(θt1 − θt2), θt1 − θt2⟩ + 2γKc E⟨Σ(θt1 − θt2), θt1 − θt2⟩ = −2(1 − γKc) E⟨Σ(θt1 − θt2), θt1 − θt2⟩,

and, by the Gronwall Lemma, denoting cγ = 2µ(1 − γKc), this gives W2²(ρt1, ρt2) ≤ e^{−cγt} W2²(ρ01, ρ02).
Now, for all s ≥ 0 setting ρ10 = ρ0 ∈ P2 (Rd ) and ρ20 = ρs ∈ P2 (Rd ), we have for all t ≥ 0,
W22 (ρt , ρt+s ) ≤ e−cγ t W22 (ρ0 , ρs ) ,
which shows that the process (ρt )t≥0 is of Cauchy type, and since (P2 (Rd ), W2 ) is a Polish space, ρt →
ρ∗ ∈ P2 (Rd ) as t grows to infinity. Now, since there exists a stationary solution to the process, let us fix
ρ10 = ρ∗ ∈ P2 (Rd ) and ρ20 = ρ0 ∈ P2 (Rd ). We have then,
W22 (ρt , ρ∗ ) ≤ e−cγ t W22 (ρ0 , ρ∗ ) ,
which concludes the first part of the Theorem.
(ii) We will use the same steps as for the proof of Theorem 4.3 (ii). Again, one readily checks that for p, q ∈ (0, 1) with p + q = 1, we have, for all t ≥ 0,

(43)   E[ ∥Σ^{1/2}(θt1 − θt2)∥² ] ≥ (E∥θt1 − θt2∥²)^{1/p} / (E⟨θt1 − θt2, Σ^{−p/q}(θt1 − θt2)⟩)^{q/p}.

By the Itô formula, skipping the details, we get

(d/dt) E⟨θt1 − θt2, Σ^{−p/q}(θt1 − θt2)⟩ ≤ γE Tr( Σ^{−p/q}(σ(θt1) − σ(θt2))² ) ≤ 2γ cp/q Kp/q E⟨Σ(θt1 − θt2), θt1 − θt2⟩,

thanks to assumption (29). In addition, by (41), we get

∫_0^t E[ ∥Σ^{1/2}(θu1 − θu2)∥² ] du ≤ E∥θ01 − θ02∥² / (2(1 − γKc)).
Collecting the estimates, we thus get

E⟨θt1 − θt2, Σ^{−p/q}(θt1 − θt2)⟩ ≤ E⟨θ01 − θ02, Σ^{−p/q}(θ01 − θ02)⟩ + (γ cp/q Kp/q/(1 − γKc)) E∥θ01 − θ02∥².

That is to say that

(44)   E[ ∥Σ^{1/2}(θt1 − θt2)∥² ] ≥ C (E∥θt1 − θt2∥²)^{1/p},

with C = ( E⟨θ01 − θ02, Σ^{−p/q}(θ01 − θ02)⟩ + (γ cp/q Kp/q/(1 − γcK)) E∥θ01 − θ02∥² )^{−q/p}, which, combined with equation (41), gives

E∥θt1 − θt2∥² ≤ E∥θ01 − θ02∥² − 2C(1 − γcK) ∫_0^t (E∥θu1 − θu2∥²)^{1/p} du,

which implies, from a slight modification of the Gronwall Lemma, that for all t ≥ 0 (with C̄ = 2C(1 − γcK)), we have

(E∥θt1 − θt2∥²)^{1−1/p} ≤ (E∥θ01 − θ02∥²)^{1−1/p} + (1/p − 1)C̄t,

and we conclude as for (i). ■