Convergence of Markov Chains For Constant Step-Size Stochastic Gradient Descent With Separable Functions
arXiv:2409.12243v1 [math.OC] 18 Sep 2024

Abstract. Stochastic gradient descent (SGD) is a popular algorithm for minimizing objective functions that arise in machine learning. For constant step-size SGD, the iterates form a Markov chain on a general state space. Focusing on a class of separable (non-convex) objective functions, we establish a “Doeblin-type decomposition,” in that the state space decomposes into a uniformly transient set and a disjoint union of absorbing sets. Each of the absorbing sets contains a unique invariant measure, with the set of all invariant measures being their convex hull. Moreover, the set of invariant measures is shown to be a global attractor of the Markov chain, with a geometric convergence rate.
convergence rate. The theory is highlighted with examples that show: (1) the failure of the diffusion approximation
to characterize the long-time dynamics of SGD; (2) the global minimum of an objective function may lie outside
the support of the invariant measures (i.e., even if initialized at the global minimum, SGD iterates will leave); and
(3) bifurcations may enable the SGD iterates to transition between two local minima. Key ingredients in the theory
involve viewing the SGD dynamics as a monotone iterated function system and establishing a “splitting condition”
of Dubins and Freedman (1966) and Bhattacharya and Lee (1988).
Key words. Stochastic gradient descent, Diffusion approximation, Doeblin-type decomposition, Markov
chains, Spectral gap, Constant step-size, Bifurcations, Iterated function systems
1. Introduction. In recent years, stochastic gradient descent (SGD) [39] has become an immensely popular algorithm for minimizing objective functions F : Rd → R of the form

(1.1)    F (x) = (1/n) ∑_{i=1}^{n} f_i (x) ,   where f_i : Rd → R  (f_i ≠ 0) .
Here P(Rd ) is the space of probability measures over the Borel σ-algebra B(Rd ).
The probability laws for a Markov chain then evolve via deterministic linear dynamics according to a Markov operator
(1.4) µk+1 = Pµk .
∗ Submitted to the editors September 20, 2024.
† Department of Mathematical Sciences, New Jersey Institute of Technology, Newark, NJ (shirokof@njit.edu).
‡ Corresponding author, Department of Mathematical Sciences, New Jersey Institute of Technology, Newark, NJ
(pz85@njit.edu).
2 D. SHIROKOFF, P. ZALESKI
where p : Rd × B(Rd ) → [0, 1] is the transition kernel. For each x ∈ Rd , p(x, ·) is a probability measure, while for each Borel set A, p(·, A) is a measurable function, so that P is well defined acting on probability measures (or, more generally, finite measures).
Intuitively, the value of p(x, A) measures the probability that the Markov chain transitions from a point x into the set A (which is the infinite-dimensional analogue of the matrix elements of a Markov matrix). Hence, for the SGD Markov chain (1.3), p(x, A) is the fraction of the maps φi that map x into A:

(1.6)    p(x, A) = (1/n) ∑_{i=1}^{n} χ_A ( φi (x) ) ,
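As a concrete illustration (ours, not part of the analysis above), the kernel (1.6) can be evaluated directly for the random-walk example of §1.2 (F = 0 with f1 (x) = x, f2 (x) = −x); the helper names p and step below are illustrative:

```python
import random

# Random-walk example: F = 0 with f_1(x) = x, f_2(x) = -x,
# so phi_1(x) = x - eta and phi_2(x) = x + eta.
eta = 0.1
maps = [lambda x: x - eta,   # phi_1: f_1'(x) = 1
        lambda x: x + eta]   # phi_2: f_2'(x) = -1

def p(x, A):
    """Transition kernel (1.6): fraction of maps phi_i sending x into A,
    where the Borel set A is encoded as an indicator function."""
    return sum(A(phi(x)) for phi in maps) / len(maps)

def step(x):
    """One step of the Markov chain (1.3): apply a uniformly random map."""
    return random.choice(maps)(x)

A = lambda y: y >= 0.0        # the Borel set A = [0, infinity)
print(p(0.0, A))              # 0.5: exactly one of the two maps lands in A
```

From x = 0 the two maps land at ∓0.1, so exactly half of them enter A, matching (1.6).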
(1.8) Pµ⋆ = µ⋆ .
The purpose of this work is to establish the convergence of the probability measures µk for
constant step-size SGD. We restrict our attention to separable (but non-convex) objective functions
F and fi . This has the advantage of enabling a simple, yet relatively complete, characterization
of how the probability measures µk converge to a convex combination of the extreme points
of the invariant measures. Our results make use of techniques from the theory of iterated function
systems for monotone maps. A technical part of the proofs involves verifying a splitting condition
[14, 6] associated to Markov operators for these maps.
Analyzing the exact dynamics defined by the operator P then enables a rigorous study of
bifurcations. In particular, we provide an exact bifurcation study of whether iterates of SGD escape a local minimum of F, contradicting predictions inferred by the diffusion approximation.
1.1. Background on SGD with Adaptive ηk → 0 . An intuitive motivation for the
dynamics (1.3) is that the expectation of Xk+1 is a gradient step of F evaluated at Xk , i.e.,
Equation (1.9) implies that on average, one expects the iterates Xk to move towards local minima
of F (x). This observation can be made precise and has led to a significant body of work on SGD in the small step-size limit η ≪ 1. When the step-size η in (1.3) is adaptive, that is, η is replaced with ηk where ηk → 0 as k → ∞, it is well established that the iterates Xk converge to the dynamics
of the ordinary differential equation
For instance, see [15], [4, Proposition 4.1–4.2] and [9, Chapter 2] (and references within). Closely
related to the asymptotic trajectory (1.10) are a range of results establishing that Xk converges
(almost surely) to minimizers of F , e.g. [43, 12] for functions F satisfying a Lojasiewicz inequality
in lieu of convexity.
SGD MARKOV CHAINS WITH SEPARABLE FUNCTIONS 3
1.2. Background for SGD with Constant η. While much is known about SGD with
adaptive step-sizes, far less is known in the constant step-size setting when F is non-convex. In
particular, regarding (1.3): Does the Markov chain Xk remain trapped in an energy well of F ?
Or explore all minima of F ? And if so, over what time-scales? Answers to these questions may
help to provide insight into the initial phase of adaptive step-size SGD.
The difficulty in establishing a general theory for either the Markov chain Xk , or the associated
probability laws µk , is highlighted by the generality of SGD. For instance, the dynamics (1.3)
include, as special cases:
• All continuous 1-dimensional deterministic iterative maps. This includes both the Logistic map and the Tent map. For instance, take n = 1 and F (x) = −(3/2)x² + (4/3)x³ so that Xk+1 = 4Xk (1 − Xk ) when η = 1. Varying η ∈ (0, 1], the iterates Xk exhibit the classic behavior of period doubling and the emergence of chaos;
• Random walks, e.g., set F (x) = 0 with f1 (x) = x and f2 (x) = −x;
• Infinite Bernoulli convolutions, which up to a linear change of variables, have the form
F (x) = x2 , with f1 (x) = (x−1)2 , f2 (x) = (x+1)2 . Erdős [16] showed that in this quadratic
setting the corresponding invariant measures may be singular. Quadratic models have
also found recent applications in biological settings [11], and remain an area of research
in dynamical systems [17, 5, 30, 31, 1, 2].
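The first bullet can be verified directly; a minimal sketch (ours) checking that the SGD update for F (x) = −(3/2)x² + (4/3)x³ with n = 1 and η = 1 reproduces the logistic map:

```python
def sgd_step(x, eta):
    """n = 1 SGD step for F(x) = -(3/2) x^2 + (4/3) x^3, so F'(x) = -3x + 4x^2."""
    return x - eta * (-3.0 * x + 4.0 * x**2)

def logistic(x):
    return 4.0 * x * (1.0 - x)

# With eta = 1 the SGD update is exactly the logistic map x -> 4x(1 - x):
# x - (-3x + 4x^2) = 4x - 4x^2.
xs = [0.1 * k for k in range(11)]
print(all(abs(sgd_step(x, 1.0) - logistic(x)) < 1e-9 for x in xs))  # True
```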
In light of the (extremely) broad class of dynamics given by (1.3), we focus on separable fi .
Continuous time partial differential equations (PDEs) provide one approach for approximating
the evolution (1.4). Treating η ≪ 1 as a small parameter, (1.10) can be viewed as a leading
order approximation to the dynamics for Xk ; this also yields a corresponding advection PDE
approximation for (1.4). Truncating formal expansions of the operator P at progressively higher orders in the asymptotic parameter η [32] then yields successive improvements. For instance, a second-order expansion of P in η yields a diffusion approximation to (1.4) (see §3.1). The diffusion approximation is known to accurately describe the evolution of µk for finite time [32, 27, 19], and
infinite time when F is convex [18]. The diffusion approximation has also been used to gain
insight into SGD dynamics and (in the regime for which it is valid) can estimate the probability
distribution of Xk near the minimum of F (cf. [38] for estimates and convergence rates when fi
are convex). As we discuss below, the diffusion approximation can fail significantly to capture the
correct long-time dynamics such as the number of, and regularity of, invariant measures (cf. [34]).
Variants of the diffusion equation have also been used to model SGD dynamics [33, 10].
Rigorous PDE theory, such as the existence of and geometric convergence to the invariant measure, has also been established for diffusion equation models [44]. Partial differential equation models
can also arise from SGD as limiting dynamics where the asymptotic parameter is the dimension
(not η!) [3].
One approach to establishing convergence of P k µ to a unique invariant measure (on a state space X) is through a Doeblin condition of the form

(1.11)    p(x, A) ≥ ϵ ν(A)   for all x ∈ X and A ∈ B(X),

which holds for some ν ∈ P(X), ϵ > 0. Some algorithmic variants of SGD, such as those that add random
noise at each step (e.g., stochastic gradient Langevin dynamics), or make strong assumptions on fi
which mimic random noise (e.g., [47]), closely resemble a stochastic ODE. In these cases, conditions
such as (1.11) may be verified to prove convergence to a unique invariant measure. The version of
SGD (1.3), in general, does not satisfy (1.11) (even when X is restricted to the absorbing sets).
Our main result establishes the convergence of the exact Markov chain dynamics (1.4). We
emphasize that we do not make use of concepts that often arise in Markov chains with random
noise, e.g., φ-irreducibility (which is a notion of irreducibility for general state spaces), detailed
balance/reversibility, or (1.11). We also make no continuous time approximation. Rather, the
starting point is to view (1.3) as a random iterated function system (IFS). We then show that
the SGD dynamics (1.3) satisfy the splitting conditions [14, 6] that guarantees convergence to an
invariant measure for IFS with monotone maps.
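The IFS viewpoint is easy to simulate. A minimal sketch (ours) of the Bernoulli-convolution example from §1.2, where the two maps φi are affine contractions chosen uniformly at random:

```python
import random

random.seed(1)
# Bernoulli-convolution example: F(x) = x^2 with f_1(x) = (x - 1)^2 and
# f_2(x) = (x + 1)^2, so that phi_i(x) = x - eta * f_i'(x) = (1 - 2*eta)*x -+ 2*eta:
# two affine contractions applied at random (a monotone IFS).
eta = 0.2
a = 1.0 - 2.0 * eta             # common contraction factor of both maps

x, samples = 0.0, []
for k in range(20_000):
    x = a * x + 2.0 * eta * random.choice([-1.0, 1.0])
    if k >= 100:                # discard a short burn-in
        samples.append(x)

# The chain is absorbed into [-1, 1], the interval bounded by the
# fixed points of phi_1 and phi_2.
m = max(abs(s) for s in samples)
```

With this seed the maximal excursion m stays within the invariant interval [−1, 1]; a histogram of samples approximates the (possibly singular [16]) invariant measure.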
With this convention, f_i^{(j)} is allowed to be 0; however, to avoid trivialities, fi ≠ 0.
For functions of the form (2.1), the maps φi : Rd → Rd become

(2.2)    φi (x) = ( φ_i^{(1)}(x1 ), φ_i^{(2)}(x2 ), . . . , φ_i^{(d)}(xd ) ) ,   where   φ_i^{(j)}(s) = s − η (d/ds) f_i^{(j)}(s)   (1 ≤ j ≤ d) .

For all non-zero f_i^{(j)} (i = 1, . . . , n; j = 1, . . . , d) we assume that
(A1) f_i^{(j)} is continuously differentiable, i.e., f_i^{(j)} ∈ C¹(R).
(A2) f_i^{(j)} has a finite number of critical points.
(A3) f_i^{(j)}(x) → ∞ as |x| → ∞.
For each family of functions {f_i^{(j)}}_{i=1}^{n}, the set of critical points is

    C^{(j)} = ∪_{i=1}^{n} { x ∈ R : (d/dx) f_i^{(j)}(x) = 0, f_i^{(j)} ≠ 0 } ,
where we define a and b to be the smallest and largest elements of C^{(j)}. With this notation, the general state space will be the product I = I^{(1)} × I^{(2)} × · · · × I^{(d)} of intervals I^{(j)} determined by C^{(j)}.
(A4) There is a constant K > 0 such that

    | (d/dx) f_i^{(j)}(x) − (d/dy) f_i^{(j)}(y) | ≤ K |x − y|   for x, y ∈ I^{(j)} and 1 ≤ i ≤ n .
(A5) (Inconsistent optimization) For each 1 ≤ j ≤ d the functions f_i^{(j)} share no common critical point, i.e.,

    ∩_{i=1}^{n} { xj ∈ R : (d/dx) f_i^{(j)}(xj ) = 0 } = ∅ .
The assumptions (A1)–(A4) (or minor variations of them) are standard in the gradient descent optimization literature. Together they ensure that when η < K^{-1} each φ_i^{(j)} is an increasing function, a property we use in the proofs. Condition (A2) is made for simplicity to rule out complications when fi admits an infinite number of critical points; the condition can, however, be relaxed. For instance, some of the theory we present generalizes to allow for some fi to be constant on an interval.
Assumption (A5) is sometimes referred to as inconsistent optimization since it implies the fi do not share a common minimizer. Condition (A5) also implies that for each j at least two functions in the set {f_i^{(j)}}_{i=1}^{n} are non-zero (otherwise (A5) fails trivially).
While it may appear that (A5) is overly simplifying, it is necessary to establish both con-
vergence in a (strong) metric, and uniform geometric convergence rates in Theorem 2.2. When
Assumption (A5) is removed, the convergence theory we build on from [6, 14] no longer applies as
stated.
The necessity of Assumption (A5) in the main result is highlighted by a simple example: set n = 1, d = 1 and F (x) = x². Taking η = 1/2, the SGD dynamics revert to deterministic gradient descent Xk+1 = (1/2)Xk . If X0 = 1, so that µ0 = δ1 is a Dirac mass, then µk → δ0 converges weakly as k → ∞, but does not converge in the metric used in the main result.
When Assumption (A5) is removed, additional techniques are required to establish convergence.
measure. If x∗ is a local minimizer for each fi , with fi being convex (not necessarily separable),
then an application of Hutchinson [28] can directly be used to show an initial measure µ0 converges
geometrically to δx∗ in the Wasserstein metric (as opposed to the metric used in the main result).
If x∗ is a local minimum for some fi and a saddle or local maximum for others, then general
sufficient conditions for convergence of µk to δx∗ are more subtle.
2.2. Main Result. In this section we outline our main result; some technical definitions are
deferred to §4.
Our starting point is to define formulas for disjoint closed rectangles Tm that, as we will show,
are absorbing sets. With positive probability the SGD dynamics Xk will reach one of these sets
within a finite number of steps. Each Tm will contain exactly one invariant measure for the SGD
Markov chain.
The first step is to define disjoint closed intervals Tm in dimension d = 1 in terms of sets L and
R characterizing left- and right-moving dynamics. For notational brevity we drop the superscripts in f_i^{(1)} and φ_i^{(1)} in this d = 1 setting.
When d = 1, the maps φi have the property that, for all η > 0,

    φi (x) > x if fi′ (x) < 0   and   φi (x) < x if fi′ (x) > 0 .

Hence, φi maps the point x to the left when fi′ (x) > 0 and to the right when fi′ (x) < 0; x is a fixed point of φi when fi′ (x) = 0. This motivates defining the following left and right sets as
(2.5)    L := ∪_{i=1}^{n} { x ∈ R : fi′ (x) > 0 } ,

and

(2.6)    R := ∪_{i=1}^{n} { x ∈ R : fi′ (x) < 0 } .
Note that if Xk ∈ L then there is a non-zero probability that the SGD iterate can move to the
left, e.g., Xk+1 < Xk with positive probability (an analogous result holds for Xk ∈ R). The sets
L and R also characterize when the SGD dynamics move to the left or right with probability one:
A point x ∉ L if and only if
and x ∉ R if and only if
Several properties of L and R are established in § 6.1. We now define the sets Tm .
Definition 2.1 (The sets Tm ). For a collection of functions {fi }ni=1 in dimension d = 1
with L and R given in (2.5)–(2.6), define the sets Tm to be closed intervals [l, r] (l < r) satisfying
(l, r) ⊂ L ∩ R
where l ∈ ∂L, r ∈ ∂R. Let MT denote the number of such sets, and enumerate them as

    Tm = [lm , rm ] ,   m = 1, . . . , MT .
Intuitively, the Tm ’s are constructed by first taking the intersection of L with R and keeping only the intervals for which l ∈ ∂L and r ∈ ∂R. In the subsequent theorem, the definition of Tm as the closure of (l, r) accounts for cases where fi′ may fail to change sign on either side of a critical point.
Proposition 6.2 in §6.1 will establish that the sets Tm exist (MT ≥ 1), are disjoint, and without
loss of generality can be ordered, e.g., rm < lm+1 for 1 ≤ m ≤ MT − 1.
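Definition 2.1 can be approximated numerically. The grid-based sketch below is ours; as an illustration we use a splitting with f1′ (x) = F ′ (x) + λ and f2′ (x) = F ′ (x) − λ, consistent with the examples of §3:

```python
import numpy as np

def absorbing_intervals(fprimes, lo, hi, num=20001):
    """Grid approximation of Definition 2.1: find intervals [l, r] with
    (l, r) inside L ∩ R, l on the boundary of L and r on the boundary of R,
    where L = {x : some f_i'(x) > 0} and R = {x : some f_i'(x) < 0}."""
    x = np.linspace(lo, hi, num)
    inL = np.any([fp(x) > 0 for fp in fprimes], axis=0)   # left-moving set L
    inR = np.any([fp(x) < 0 for fp in fprimes], axis=0)   # right-moving set R
    both = inL & inR
    intervals, i = [], 0
    while i < num:
        if both[i]:
            j = i
            while j + 1 < num and both[j + 1]:
                j += 1
            # keep the run only if it starts on ∂L and ends on ∂R
            if (i == 0 or not inL[i - 1]) and (j == num - 1 or not inR[j + 1]):
                intervals.append((x[i], x[j]))
            i = j + 1
        else:
            i += 1
    return intervals

# Double-well F(x) = (1 - x^2)^2 / 4 (so F'(x) = x^3 - x) with lam < lam_c.
lam = 0.2
f1p = lambda x: x**3 - x + lam
f2p = lambda x: x**3 - x - lam
print(absorbing_intervals([f1p, f2p], -2.0, 2.0))  # two disjoint intervals
```

For λ = 0.2 the scan returns two intervals, approximately [−1.088, −0.879] and [0.879, 1.088]; the middle run of L ∩ R around the origin is correctly discarded because its endpoints do not lie on ∂L and ∂R.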
Building on the one-dimensional setting, we extend the definition of the sets Tm to the multivariate case of separable functions as follows. For each 1 ≤ j ≤ d, define the sets

    T_1^{(j)}, T_2^{(j)}, . . . , T_{Mj}^{(j)} ,

by applying Definition 2.1 to the family of functions { f_i^{(j)} }_{i=1}^{n}, where now we denote the number of such sets by Mj (instead of MT ).
Let M := [M1 ] × [M2 ] × . . . × [Md ] be the corresponding set of integer tuples. For each m = (m1 , m2 , . . . , md ) ∈ M we then define the rectangles Tm (see Figure 2.1), the multivariate generalizations of the Tm , as

(2.9)    Tm := T_{m1}^{(1)} × T_{m2}^{(2)} × . . . × T_{md}^{(d)} ⊂ Rd .
Fig. 2.1. [Figure: in dimension d = 2, the rectangles Tm (T11 , T21 , T12 , T22 ) formed as products of the one-dimensional sets T_1^{(1)}, T_2^{(1)} on the x1 -axis and T_1^{(2)}, T_2^{(2)} on the x2 -axis.]
In the subsequent proofs, it will also be useful to define, for each variable xj , the following subsets of R:

    T^{(j)} := ∪_{m=1}^{Mj} T_m^{(j)} ⊂ R ,
where
(i) Tm (m ∈ M) is positive invariant/absorbing, and contains at least one local mini-
mizer of F . The number of Tm is at most the number of local minima of F .
(ii) T is non-empty and is an attractor in the sense that there exists ℓ0 = ℓ0 (η) such that
for every x ∈ I there is a path p⃗ of length ℓ0 satisfying φp⃗ (x) ∈ T .
Here the metrics dαm and dF are defined in section 4 and ⌊·⌋ is the greatest integer
(floor) function. The measures µ⋆m are the only invariant measures supported in I, and
the number MT is bounded by the number of local minima of F .
(c) For any probability measure µ0 ∈ P(I), µk := P k µ0 converges geometrically to an invariant measure of the form

    µ⋆ = ∑_{m∈M} c_m µ⋆_m   ( 0 ≤ c_m ≤ 1 ,  ∑_{m∈M} c_m = 1 ) ,
where ν|_A is the restriction of the finite measure ν to a Borel set A. Lastly, in the case where d = 1, (2.15) becomes

    dF (µk , µ⋆ ) ≤ 3 ( 1 − 1/n^ℓ )^{⌊k/ℓ⌋} ,   k > 0 .
• While the number of sets Tm (and invariant measures) is bounded in terms of F , the
definition of Tm depends on how F is split into the functions fi ;
• The number of invariant measures does not depend on η provided η < 1/K, and the
associated Markov chain for SGD exhibits no bifurcations as a function of η. The rate of
convergence to equilibrium does however depend on η;
• The SGD iterates Xk may traverse between two local minima of F only if both minima
are contained in the same Tm (somewhat akin to a “mountain pass” theorem);
• Somewhat surprisingly, the converse to Theorem 2.2(a)(i) need not hold: the local (and
even global) minima of F do not need to be in T (see §3). In other words, there are
instances of SGD where even if the iterates Xk are initialized to lie in a neighborhood of
the global minimum of F , they will eventually leave (with probability 1);
• The contraction estimate (2.14) is sometimes referred to as a spectral gap estimate (in
the metric dF );
• When d > 1, the invariant measures µ⋆m are not necessarily product measures. In addition,
the main result in dimension d > 1 does not follow as a corollary of the one dimensional
case;
• An explicit bound on ℓ in Theorem 2.2 determining the convergence rate can be obtained in terms of the f_i^{(j)}, e.g., see §3.
3. Examples in 1D. This section provides examples highlighting Theorem 2.2 in d = 1. In
each example, we analyze an objective function F with the specific splitting
(3.1)    F (x) = (1/2) ( f1 (x) + f2 (x) ) ,
where
Here λ > 0 is a free parameter that modifies the functions f1 , f2 in the decomposition. In this
case, the maps φ1 and φ2 become
Notice that for this splitting, the SGD update becomes gradient descent plus an additional random
walk with step-size λη, and is also the one dimensional analog of the algorithm proposed in [29].
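A minimal sketch of the resulting update (ours; consistent with the description above, we assume f1′ (x) = F ′ (x) + λ and f2′ (x) = F ′ (x) − λ, so that each step is a gradient step on F plus a ±λη kick):

```python
import random

def make_sgd_step(Fprime, lam, eta):
    """One SGD step for the splitting (3.1), choosing f_1 or f_2 with
    probability 1/2. Assuming f_1'(x) = F'(x) + lam and f_2'(x) = F'(x) - lam,
    the update is gradient descent on F plus a random kick of size lam*eta."""
    def step(x):
        kick = random.choice([-1.0, 1.0])
        return x - eta * Fprime(x) + eta * lam * kick
    return step

# Example: the double-well F(x) = (1 - x^2)^2 / 4 of §3.2, F'(x) = x^3 - x.
random.seed(0)
step = make_sgd_step(lambda x: x**3 - x, lam=0.2, eta=0.05)
x = 1.0
for _ in range(1000):
    x = step(x)   # iterates remain near the right well when lam < lam_c
```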
3.1. Diffusion Approximation Background. We collect here several basic facts regarding
the diffusion approximation in dimension d = 1 as we reference it in subsequent examples.
The diffusion approximation is a variable-coefficient advection-diffusion equation of the form

(3.3)    ∂ρ/∂t = ∂/∂x ( u(x) ρ ) + (η/2) ∂²/∂x² ( D(x) ρ )   in (x, t) ∈ R × (0, T ] ,
with initial data ρ(x, 0) = µ0 , where µ0 is the SGD initialization distribution.
The velocity u(x) and diffusion coefficient D(x) in (3.3) are given in terms of the SGD functions
fi (x) and F (x) as
    u(x) := (d/dx) Φ(x)   where   Φ(x) := F (x) + (η/4) (F ′ (x))² ,

and

    D(x) := (1/n) ∑_{i=1}^{n} ( fi′ (x) − F ′ (x) )² = (1/n) ∑_{i=1}^{n} ( fi′ (x) )² − ( F ′ (x) )² ≥ 0 .
Intuitively, the advective term in (3.3) evolves the probability ρ towards minimizers of F , while the diffusion term arises from the stochastic terms fi in SGD. For instance, formally setting η = 0
in (3.3) results in an advection equation for ρ, with characteristics defined by the gradient flow
ẋ = −F ′ (x).
The equation (3.3) arises as a formal asymptotic approximation to the (exact) discrete-in-time Markov evolution µj+1 = Pµj in the small parameter η ≪ 1 by matching terms up to order O(η). When η ≪ 1, ρ(x, t) (t = ηj) approximates the SGD probability evolution µj for finite times (cf. [32, 19, 18, 27]).
The stationary solutions of (3.3) satisfy

(3.4)    (d/dx) ( u(x) ρ ) + (η/2) (d²/dx²) ( D(x) ρ ) = 0 .
Note that (3.4) is a singular ordinary differential equation whenever the diffusion coefficient D(x)
vanishes at a point x∗ , namely
Under Assumption (A5), D(x) ≠ 0, since the expression on the right of (3.5) holds nowhere.
Equation (3.4) admits the following unique solution in the space of probability densities:

(3.6)    ρ∗ (x) = Z −1 exp( −(2/η) V (x) ) ,

where

    V (x) := ∫^x D^{-1}(x) ( (d/dx) Φ(x) + (η/2) (d/dx) D(x) ) dx ,

provided

    Z := ∫_{−∞}^{∞} exp( −(2/η) V (x) ) dx < ∞ .
For the splitting given in (3.1) and (3.2), the stationary density of the diffusion approximation given in (3.6) simplifies to

(3.7)    ρ⋆ (x) ∝ exp( −(2/(ηλ²)) Φ(x) ) = exp( −(2/(ηλ²)) [ F (x) + (η/4) (F ′ (x))² ] ) .
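For reference, (3.7) can be normalized numerically; the sketch below (ours) evaluates ρ⋆ on a grid for the double-well example of §3.2:

```python
import numpy as np

def diffusion_density(F, Fprime, lam, eta, xs):
    """Stationary density (3.7) of the diffusion approximation, normalized
    numerically on the grid xs with the trapezoidal rule."""
    Phi = F(xs) + (eta / 4.0) * Fprime(xs) ** 2
    w = np.exp(-2.0 * Phi / (eta * lam**2))
    Z = np.sum(0.5 * (w[1:] + w[:-1]) * np.diff(xs))
    return w / Z

# Double-well example of §3.2: F(x) = (1 - x^2)^2 / 4, F'(x) = x^3 - x.
xs = np.linspace(-2.0, 2.0, 4001)
rho = diffusion_density(lambda x: 0.25 * (1.0 - x**2) ** 2,
                        lambda x: x**3 - x, lam=0.55, eta=0.07, xs=xs)
# rho integrates to 1 and, since F is even, is symmetric about x = 0,
# with its mass concentrated near the wells x = ±1 (where Phi = 0).
```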
In the following examples, we will compare the approximation (3.7) to the true invariant measures
of the SGD Markov operator P. We will see that for larger values of λ the Markov operator
P has a unique invariant measure µ⋆ which is approximated by the density ρ⋆ (x). However, as
λ decreases, the Markov operator P may have multiple invariant measures that differ from the
diffusion approximation.
3.2. SGD on a One Dimensional Double Well. This first example demonstrates Theo-
rem 2.2 with SGD applied to the double-well objective function
(3.8)    F (x) := (1/4) (1 − x²)² ,
with the splitting given in (3.1) and (3.2). The free parameter λ > 0 modifies the functions f1 , f2
in the decomposition (see Figure 3.1) and will lead to a bifurcation in the invariant measures of
the associated Markov operator.
We first establish several observations to invoke Theorem 2.2. The critical points of f2 , and
by symmetry f1 (letting x 7→ −x), are solutions to the cubic equation
x3 − x − λ = 0 .
Fig. 3.1. Visualization of the SGD model problem given by (3.1)–(3.2) and (3.8) for values (Top) λ = .55 > λc ,
and (Bottom) λ = .2 < λc . When λ > λc , the SGD iterates can cross over the barrier of F and there is a unique
invariant measure. When λ < λc the SGD iterates cannot cross over the barrier of F and there are two invariant
measures.
Fig. 3.2. Model problem (3.8). Left: the critical points of the functions f1 (black) and f2 (blue) are plotted as
a function of λ. Solid curves represent local minima and dashed curves represent local maxima. Middle: for each
λ the vertical cross section of the filled in region represents the left moving set L. Right: for each λ the vertical
cross section of the filled in region represents the right moving set R.
Fig. 3.3. Model problem (3.8). Left: the bound on η for the validity of Theorem 2.2 given in (3.10) is plotted
as a function of λ. Right: for each λ the vertical cross section of the filled in region represents the absorbing set
T . For λ > λc we have one absorbing interval and for λ ≤ λc we have two absorbing intervals.
and a single T1 = I = [−x0 , x0 ] (see Figure 3.3) defined by the sets L and R presented in Figure 3.2.
By Theorem 2.2 the operator P has a unique invariant measure µ∗ with support contained in T1 .
Figure 3.4 visualizes the crude agreement between the diffusion approximation ρ⋆ (x) defined by
(3.7)–(3.8) and a numerical approximation to µ⋆ .
In addition, from Theorem 2.2 any initial µ0 supported on I = T1 converges via

(3.11)    dF (µk , µ⋆ ) ≤ ( 1 − 1/2^{ℓ1} )^{⌊k/ℓ1⌋} dF (µ0 , µ⋆ ) .
A bound on ℓ1 , and hence the convergence rate, can be obtained from Theorem 5.1 with the
two paths satisfying (5.4) taken to be ⃗i (resp. ⃗j) as the ℓ1 consecutive compositions of φ2 (resp.
φ1 ). For all x ∈ [−x0 , 0] the map φ2 moves the point x to the right by the amount
    φ2 (x) − x = η ( x − x³ + λ ) ≥ η ( λ − λc ) ,
by taking

(3.12)    ℓ1 = 1 + ⌊ x0 / ( η (λ − λc ) ) ⌋ ≤ 1 + ( 1 + √(3λ) ) / ( η (λ − λc ) ) .
Notice that the upper bound on ℓ1 given in (3.12) approaches ∞ as λ → λc . While (3.11) together with (3.12) gives only a lower bound on the spectral gap, the critical slowing down of the convergence rate as λ → λc is also observed numerically.
Case 2: λ ≤ λc . The functions f1 , f2 each have three critical points (see Figure 3.2) given by
and by symmetry f1′ (−xj ) = 0 for j = 0, 1, 2. When λ = λc the two critical points x1 and x2 coincide, while x2 < x1 when λ < λc . In addition, Figure 3.2 shows the construction of the sets L and R, which leads to the absorbing sets displayed in Figure 3.3. In this case, there are two absorbing intervals given by

    T1 = [−x0 , x2 ] ,   T2 = [−x2 , x0 ] .
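The trapping is easy to observe numerically. A sketch (ours, again assuming the splitting f1′ (x) = F ′ (x) + λ, f2′ (x) = F ′ (x) − λ): for λ < λc a chain initialized in the right well never crosses to the left well:

```python
import random

random.seed(0)
eta, lam = 0.05, 0.2              # lam < lam_c ≈ 0.385: two absorbing intervals
Fprime = lambda x: x**3 - x       # double-well F(x) = (1 - x^2)^2 / 4

x = 1.0                           # initialize inside the right well (in T2)
crossed = False
for _ in range(100_000):
    x = x - eta * Fprime(x) + eta * lam * random.choice([-1.0, 1.0])
    crossed = crossed or (x < 0.0)

print(crossed)  # False: the chain never reaches the left well
```

This is not merely a rare-event statement: since T2 is absorbing (and X0 ∈ T2 ), crossed remains False for every realization of the noise when η is admissible.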
Fig. 3.4. Model problem (3.8). A comparison of the exact unique invariant measure µ⋆ (red) and the diffusion approximation (3.7)–(3.8) (black) for parameter values λ = 2 > λc , η = .0698 (satisfying (3.10)). Here Ulam’s method [42] is used to numerically compute the exact invariant measure (red). For comparison, a time histogram of the iterates Xk is plotted (blue), showing good agreement with µ⋆ . Note that the lack of smoothness in µ⋆ (red) is a property of the invariant measure and not a result of under-resolved computations.
By Theorem 2.2 the operator P has two invariant measures µ⋆1 and µ⋆2 (see Figure 3.5) sup-
ported on T1 and T2 respectively. Furthermore, any measure µ0 supported on I converges to some
convex combination of µ∗1 and µ∗2 , i.e.,
    dF ( µk , c µ⋆1 + (1 − c) µ⋆2 ) ≤ 3 γ^{⌊k/ℓ⌋}
for some 0 ≤ c ≤ 1, 0 < γ < 1 and ℓ ∈ N. A bound on ℓ can be obtained via elementary means.
This example demonstrates the discrepancy between the exact invariant measures (e.g., for
which there are two, µ⋆1 and µ⋆2 ) and the single invariant measure ρ∗ (x) predicted by the diffusion
approximation. In particular, the exact SGD dynamics cannot escape the local minima of F , while
the diffusion approximation implies that iterates of SGD will eventually escape and travel between
local minima.
Lastly, the example also demonstrates that the number of invariant measures is at least one and at most two, since F (x) has two local minima. While the bounds on the spectral gap and the
invariant measures depend on η, the sets Tm (m = 1, 2) and the number of invariant measures
are independent of η (provided η < η0 ). As a result, there are no bifurcations in the dynamics
µk+1 = Pµk (i.e., creation or loss of new fixed points) in terms of η.
Fig. 3.5. Model problem (3.8). For λ = .38 < λc and η = .33 (satisfying (3.10)) there are two invariant mea-
sures µ⋆1 (left), µ⋆2 (middle), while the diffusion approximation incorrectly predicts a unique stationary distribution
ρ⋆ (right). Note that the lack of smoothness in µ⋆j (j = 1, 2) again is a property of the invariant measure and not
due to an underresolved computation.
3.3. An Example where SGD Does Not Sample the Global Minimum. Here we
provide an example where the global minimum of the objective function F is contained inside a
uniformly transient region and hence not contained in the support of an invariant measure. From
Theorem 2.2 this implies that as k → ∞ the iterates Xk do not sample the global minimum (even
for small η values and arbitrary initializations X0 ).
Fig. 3.6. Visualization of the SGD model problem given by (3.1)–(3.2) and (3.13) for values (Top) λ > λ2 , (Middle) λ1 < λ < λ2 , and (Bottom) λ < λ1 . When λ > λ2 , the SGD iterates can cross over the barriers of F and there is a unique invariant measure. When λ1 < λ < λ2 , for any x0 ∈ I, the SGD iterates get trapped in a sub-optimal local minimum and there are two invariant measures. Lastly, when λ < λ1 there are three invariant measures.
From the critical points of f1 and f2 we can construct the left set L and right set R shown
in Figure 3.7, along with the state space I as the interval from the smallest critical point of f1
to the largest critical point of f2 . The set T is then constructed via Definition 2.1 and shown in
Figure 3.8.
For all values

(3.14)    η < ( max { Lip f1′ , Lip f2′ } )^{−1}
(see Figure 3.8) we can apply Theorem 2.2 to obtain the following results.
Case 1: λ > λ2 . The functions f1 , f2 have a unique critical point (see Figure 3.7) and a single
absorbing interval T1 (see Figure 3.8). By Theorem 2.2 the Markov operator P has a unique
invariant measure µ⋆ supported on T1 . Figure 3.9 compares µ⋆ (computed numerically using
Ulam’s method [42]) to a time histogram of the SGD iterates Xk and diffusion approximation ρ⋆
given by (3.7) and (3.13).
In addition, from Theorem 2.2 there exists an ℓ1 > 0 such that for any initial µ0 supported
on I = T1 we have
    dF (µk , µ⋆ ) ≤ ( 1 − 1/2^{ℓ1} )^{⌊k/ℓ1⌋} dF (µ0 , µ⋆ ) .
Fig. 3.7. Model problem from (3.13). Left: the critical points of the functions f1 (black) and f2 (blue) are plotted as a function of λ. Solid curves represent local minima and dashed curves represent local maxima. Middle: for each λ the vertical cross section of the filled in region represents the left moving set L. Right: for each λ the vertical cross section of the filled in region represents the right moving set R.
Fig. 3.8. Model problem from (3.13). Left: the bound on η for the validity of Theorem 2.2 given in (3.14)
is plotted as a function of λ. Right: for each λ the vertical cross section of the filled in region represents the
absorbing set T . For λ > λ2 we have one absorbing interval, for λ1 < λ ≤ λ2 we have two absorbing intervals,
and for λ ≤ λ1 we have three absorbing intervals.
Case 2: λ2 ≥ λ > λ1 . The functions f1 , f2 each have three (distinct) critical points (see Figure 3.6
and Figure 3.7). Figure 3.7 shows the construction of the sets L and R, which leads to the sets T
displayed in Figure 3.8. In this case, there are two absorbing intervals T1 and T2 .
By Theorem 2.2 the Markov operator P has two invariant measures µ⋆1 and µ⋆2 (see Figure 3.10)
supported on T1 and T2 respectively. Furthermore, any measure µ0 supported on I converges to
some convex combination of µ⋆1 and µ⋆2 , i.e.,
(3.15)    dF ( µk , c µ⋆1 + (1 − c) µ⋆2 ) ≤ 3 γ^{⌊k/ℓ⌋} .
Case 3: λ ≤ λ1 . The functions f1 , f2 each have five (distinct) critical points (see Figure 3.6 and
Figure 3.7). In this case, there are three absorbing intervals T1 , T2 , and T3 (see Figure 3.8). The
set T2 contains the global minimum x = 0, while T1 and T3 contain the sub-optimal local minima x = −1.35 and x = 1.35, respectively.
Thus, from Theorem 2.2 the Markov operator P has three invariant measures µ⋆1 , µ⋆2 , and µ⋆3 , each shown in Figure 3.11. From Figure 3.11 we can see that µ⋆1 and µ⋆3 do not agree with the diffusion approximation, while µ⋆2 , whose support contains the global minimum of F (x), does agree with the diffusion approximation.
Fig. 3.9. Model problem from (3.13). For λ = 7 > λ2 and η = .007 (satisfying (3.14)), the probability density function of the unique invariant measure µ⋆ is plotted. Ulam’s method [42] is used to estimate the exact invariant measure (red). A histogram of the iterates Xk is plotted (blue). The stationary density of the diffusion approximation given in (3.7) and (3.13) is plotted (black).
Fig. 3.10. Model problem from (3.13). Bottom panel shows the relation of the exact invariant measures µ⋆j (j = 1, 2) (vertical scale units on left) and diffusion approximation (vertical scale units on right) to the objective function (dashed line, arbitrary units). This example demonstrates that the global minimizer of F is not contained inside the support of either invariant measure µ⋆j (j = 1, 2), which are plotted in red (time histograms of the SGD iterates Xk are in blue). The supports of the µ⋆j are each contained in a neighborhood of the sub-optimal minimizers of F . In contrast, the diffusion approximation ρ⋆ (black) is localized to a neighborhood of the global minimum of F and fails to approximate the invariant measures. The top row plots an enlarged image of each measure.
Fig. 3.11. Model problem from (3.13). For λ = .5 < λ1 and η = .015 (satisfying (3.14)), the probability
density functions of µ⋆1 (top left), µ⋆3 (top right), µ⋆2 (bottom), and the diffusion approximation (black, bottom) are
plotted. Ulam’s method [42] is used to estimate µ⋆1 , µ⋆2 , and µ⋆3 (red). In addition, a histogram of the iterates Xk
is plotted (blue). Note that ρ⋆ approximates µ⋆2 but fails to approximate the other two invariant measures.
For a general state space, a Borel set T is absorbing [36, Section 4.2.2] if the Markov transition
kernel p(x, T ) = 1 for all x ∈ T . Identity (4.1) ensures that if the initial measure µ0 is supported
on T , then all successive iterations µk defined by (1.4) will be supported on T . Hence, for the
SGD dynamics, a Borel set T is absorbing if and only if it is positive invariant.
A Borel set B is uniformly transient [36, Chapter 8] if the function
U (x) := Σ_{n=0}^∞ (P^n δx)(B)
is bounded on B, i.e., sup_{x∈B} U (x) < ∞. The function U (x) measures the expected number of
times the Markov chain initialized at X0 = x ∈ B visits B; a chain started in a uniformly transient
set is therefore expected to spend only a finite time in B. The inequality (2.12) implies U (x) is
bounded by a geometric series and hence B is uniformly transient.
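The finiteness of U (x) can be checked by direct simulation. The cubic pair fi′ below is a hypothetical example (not from the paper) producing two absorbing intervals near x = ±1 with a transient region B in between:

```python
import random

eta = 0.1
# Hypothetical separable pair (not from the paper): f_i'(x) = x^3 - x +/- 0.2,
# giving two absorbing intervals near x = +/-1 and a transient middle region B.
maps = [lambda x: x - eta * (x**3 - x + 0.2),
        lambda x: x - eta * (x**3 - x - 0.2)]

def expected_visits(x0, B=(-0.87, 0.87), horizon=400, trials=300, seed=0):
    """Monte Carlo estimate of U(x0): expected number of visits to B."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x = x0
        for _ in range(horizon):
            if B[0] < x < B[1]:
                total += 1
            x = rng.choice(maps)(x)
    return total / trials

print(expected_visits(0.0))   # finite: the chain leaves B and never returns
```

Once a trajectory enters one of the absorbing intervals it cannot re-enter B, so the visit count stabilizes even as the horizon grows.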
Let α = (α1 , α2 , . . . , αd ) ∈ {−1, +1}^d denote a d-tuple of signed unit coefficients, and
associate to each of the 2^d vectors α the closed orthant of R^d,
R^d_α := { c1 α1 e1 + c2 α2 e2 + · · · + cd αd ed : cj ≥ 0 } .
Here ej denotes the usual basis vectors in R^d. Each orthant R^d_α then defines a partial ordering ⪯α
on R^d given by x ⪯α y if y − x ∈ R^d_α. A Borel set A belongs to the class Aα if
A = { y ∈ R^d : γ(y) ⪯α c } ,
for some constant vector c ∈ R^d and continuous function γ : R^d → R^d that is monotone with
respect to R^d_α. When α = (+1, +1, . . . , +1), the class Aα includes all (semi-infinite) rectangles of
the form (−∞, c1] × (−∞, c2] × · · · × (−∞, cd] for c ∈ R^d, but also includes other sets as well.
For a closed and bounded Borel set I ⊂ R^d (which, in practice, we will take to be a rectangle),
and two (Borel) probability measures µ, ν ∈ P(I), the metric of Bhattacharya and Lee is
(4.2) dα(µ, ν) := sup_{A ∈ Aα} |µ(A) − ν(A)| .
For closed and bounded I, the metric space (P(I), dα) is complete [6, 7].
In dimension d = 1, dα(µ, ν) reduces to the Kolmogorov distance (which we write as dF):
dF(µ, ν) = ∥Fµ − Fν∥∞ ,
where Fµ(x) := µ ([a, x]) is the cumulative distribution function (CDF) and
∥f∥∞ = sup_{x∈[a,b]} |f(x)| .
The metric dα is stronger than the Wasserstein metric and than weak convergence: convergence
in dα implies weak convergence, but the converse is not true; e.g., in d = 1, µk = δ1/k converges
weakly to δ0 but not with respect to dF . On the other hand, dα is weaker than the total variation
metric dTV (µ, ν), which is defined as in (4.3) with the supremum taken over all sets in the Borel
σ-algebra.
The metric dα generalizes naturally to the space M (I) of all finite non-negative measures,
where |µ| denotes the total mass of µ.
The following properties also hold:
The identities (4.4)–(4.7) follow directly from the definition of dα and the triangle inequality.
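The counterexample above is easy to verify numerically: for µk = δ1/k the Kolmogorov distance to δ0 stays equal to 1 even though the Wasserstein-1 distance 1/k vanishes. A small sketch:

```python
import numpy as np

def kolmogorov_distance(samples_mu, samples_nu, grid):
    """d_F(mu, nu) = sup_x |F_mu(x) - F_nu(x)|, with CDFs evaluated on a grid."""
    F_mu = np.searchsorted(np.sort(samples_mu), grid, side="right") / len(samples_mu)
    F_nu = np.searchsorted(np.sort(samples_nu), grid, side="right") / len(samples_nu)
    return np.max(np.abs(F_mu - F_nu))

grid = np.linspace(-1.0, 1.0, 10001)
for k in (1, 10, 100):
    dF = kolmogorov_distance(np.array([1.0 / k]), np.array([0.0]), grid)
    w1 = 1.0 / k          # Wasserstein-1 distance between the two point masses
    print(k, dF, w1)      # d_F stays 1 while W_1 -> 0
```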
To represent trajectories taken by iterates of SGD, let
⃗i = ( i1 , i2 , . . . , im ) ∈ [n]^m ,
where each ik ∈ [n] (1 ≤ k ≤ m) and |⃗i| = m is the path length. The path defined by ⃗i is the
composition
φ⃗i (x) := φim ( φim−1 ( · · · φi1 (x) · · · ) ) = φim ◦ φim−1 ◦ · · · ◦ φi1 (x) .
Two paths are distinct, ⃗i ≠ ⃗j, if they differ in at least one entry. For two paths ⃗j and ⃗i with
lengths |⃗j| = m1 and |⃗i| = m2 we write
⃗i ◦ ⃗j = ( j1 , . . . , jm1 , i1 , . . . , im2 )
for the concatenated path of length m1 + m2 in which the path ⃗j is traversed first.
It may be that no pair of i, j satisfy (5.3). In this case, Theorem 5.1 can be applied to powers
of the operator (P)ℓ where (5.3) (equivalently (5.1)) can be replaced with a pair of distinct paths
of the same length satisfying
(5.4) φ⃗i (b) ≤ φ⃗j (a) (⃗i ≠ ⃗j) , |⃗i| = |⃗j| = ℓ .
Inequality (5.4) is exactly (5.3) with {φi : i ∈ [n]} replaced by {φ⃗i : ⃗i ∈ [n]ℓ }. Applying
Theorem 5.1 with condition (5.4) in lieu of (5.1) modifies the geometric factor appearing in the
convergence rate (5.2) to (1 − n−ℓ ).
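The splitting condition (5.4) can be checked numerically by enumerating paths of a given length and testing each distinct pair. The maps below come from the hypothetical quadratic pair f1 = (x − 1)², f2 = (x + 1)² (not from the paper); for these maps the search first succeeds at path length ℓ = 4:

```python
from itertools import product

def find_splitting_pair(maps, a, b, max_len=6):
    """Search for distinct paths of equal length ell with
    phi_i(b) <= phi_j(a), i.e., condition (5.4)."""
    for ell in range(1, max_len + 1):
        paths = list(product(range(len(maps)), repeat=ell))
        vals = {}
        for p in paths:
            xa, xb = a, b
            for i in p:                       # phi_{i_1} is applied first
                xa, xb = maps[i](xa), maps[i](xb)
            vals[p] = (xa, xb)                # (phi_p(a), phi_p(b))
        for p in paths:
            for q in paths:
                if p != q and vals[p][1] <= vals[q][0]:
                    return ell, p, q          # phi_p(b) <= phi_q(a)
    return None

eta = 0.1
maps = [lambda x: x - eta * 2 * (x - 1), lambda x: x - eta * 2 * (x + 1)]
print(find_splitting_pair(maps, -1.0, 1.0))   # smallest such length is ell = 4 here
```

For these affine maps one can verify the ℓ = 4 threshold by hand: the extreme values over length-ℓ paths are 2(0.8)^ℓ − 1 and 1 − 2(0.8)^ℓ, which cross only once (0.8)^ℓ ≤ 1/2.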
The result of Bhattacharya and Lee extends Theorem 5.1 to dimensions d > 1 where the
splitting and monotonicity conditions are with respect to a cone.
Theorem 5.2 (Bhattacharya and Lee [6, Theorem 2.1 & Corollary 2.4]). Set I = [0, 1]^d and
let α be fixed. Suppose {φi }ni=1 : I → I are each continuous and monotone with respect to R^d_α. If
there are two paths p⃗1 and p⃗2 of length ℓ satisfying
(5.5) φp⃗1 (I) ⪯α x0 and x0 ⪯α φp⃗2 (I)
for some x0 ∈ I, then the Markov operator P has exactly one invariant measure µ⋆ with support
contained in I. In addition, for any µ supported on I,
(5.6) dα( P^k µ , µ⋆ ) ≤ ( 1 − 1/n^ℓ )^⌊k/ℓ⌋ , k > 0 .
There are a few minor differences between the exact theorem statement [6, Theorem 2.1 & Corol-
lary 2.4] and the version we state in Theorem 5.2. In [6], (5.6) is stated with µ replaced by δx (the
upper bound in (5.6) being uniform in x ∈ I); this is equivalent to (5.6) as written above.
The formulation in [6] also allows the maps φi to be drawn from an infinite index set. In
this case, the convergence rate (5.6) (or (5.2)) is given in terms of the probability that the event
(5.5) holds. If κ pairs of distinct paths satisfy (5.5), then [6] yields the stronger geometric rate of
1 − κ/n^ℓ in (5.6).
Lastly, [6, Theorem 2.1 & Corollary 2.4] is stated only for α being the positive orthant.
However, the theorem extends directly to the version stated in Theorem 5.2, defined over any
orthant α, by the change of variables φ(x) → Λφ(Λx), where Λ = diag(α).
6. A Few Lemmas for One Dimensional SGD. Throughout this section we will assume
that d = 1 and build up a series of results for SGD. For notational convenience we refrain from
writing superscripts on the functions fi , the maps φi , and the sets Tm , e.g., we replace
fi^(1) → fi , φi^(1) → φi (1 ≤ i ≤ n) and Tm^(1) → Tm (1 ≤ m ≤ MT ) .
Thus, in assumptions (A1)–(A5) each fi^(1) is simply fi .
6.1. Properties of the Sets L, R, and Tm . Here our goal is to collect and prove basic
properties of the sets L, R and Tm (see (2.5), (2.6), and Definition 2.1).
Proposition 6.1 (Basic properties of L and R). Let d = 1. Given assumptions (A1)–(A5)
and a, b defined by I in (2.3), the sets L and R defined in (2.5) and (2.6) are finite unions of
open intervals with ∂L, ∂R ⊂ I,
(6.1) (−∞, a) ⊂ R and (b, ∞) ⊂ L ,
and
(6.2) ∂L ∩ ∂R = ∅ and L ∪ R = R (the whole real line).
Proof. The sets L and R are finite unions of open intervals since each fi′ is continuous with a
finite number of roots. The boundaries ∂L, ∂R are contained in the set of critical points C of the
fi′ ’s and confined to I.
The assumptions (A1)–(A3), with d = 1, imply (6.1): for x < a we have fi′ (x) < 0 for all i,
and similarly fi′ (x) > 0 for x > b. Lastly, x ∈ ∂L ∩ ∂R would imply fi′ (x) ≤ 0 and fi′ (x) ≥ 0,
hence fi′ (x) = 0, for all i, violating (A5). Moreover, for every real x, (A5) gives either fi′ (x) < 0
or fi′ (x) > 0 for some i; thus, if x ∉ L, then x ∈ R and vice versa, so that L ∪ R covers the real
line.
The next proposition in this section establishes basic properties of the sets Tm .
Proposition 6.2 (Properties of Tm ). Let d = 1. Under assumptions (A1)–(A5), the sets
Tm = [lm , rm ] in Definition 2.1 satisfy the following:
(a) There is at least one Tm , i.e., MT ≥ 1.
(b) The endpoints satisfy
(6.3) rm ∈ L , rm ∉ R and lm ∈ R , lm ∉ L for all 1 ≤ m ≤ MT .
(c) F ′ (lm ) < 0 and F ′ (rm ) > 0; in particular, each Tm contains a local minimizer of F in its
interior.
(d) The sets Tm are pairwise disjoint and contained in I.
Proof. (a) We show MT ≥ 1 by direct construction. From identity (6.1) in Proposition 6.1,
the set L contains an interval of the form (l, ∞) where l ∈ ∂L is finite (in fact l ≤ b). Since
equation (6.2) implies L and R cover the real line, R contains an interval of the form (r̄, r) with
r̄ < l < r, where r ∈ ∂R is finite. Thus (l, r) satisfies Definition 2.1.
(b) Since L and R are open (see Proposition 6.1) they do not contain their boundary points; hence
rm ∉ R and lm ∉ L. Since L ∪ R covers the real line, condition (6.3) holds.
(c) We show that F ′ (lm ) < 0 and F ′ (rm ) > 0. This implies that argminx∈Tm F (x) lies in the
interior of Tm , and hence must be a local minimizer.
Since lm ∉ L and rm ∉ R from part (b), we have
(6.4) fi′ (lm ) ≤ 0 and fi′ (rm ) ≥ 0 for all 1 ≤ i ≤ n ,
where each of the inequalities in (6.4) is strict for at least one i due to Assumption (A5). The
sign of F ′ at lm and rm then follows from the definition of F .
(d) First note the endpoints lm , rm are confined to the critical points of the fi′ ’s and hence must
lie in I. To show the sets Tm are pairwise disjoint, suppose by contradiction that Tm ≠ Tj and
Tm ∩ Tj ≠ ∅. Since the sets are distinct closed intervals, an endpoint of one interval must lie in
the other while differing from the corresponding endpoint of that interval; without loss of generality,
the right endpoint of Tm differs from the right endpoint of Tj and intersects Tj , so that
rm ∈ [lj , rj ) .
The definition of Tj implies (lj , rj ) ⊂ R, while part (b) implies lj ∈ R. Thus, rm ∈ [lj , rj ) ⊂ R;
however, this contradicts (b).
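The sets L, R and the absorbing intervals Tm of Definition 2.1 can be located numerically from the signs of the fi′, using the endpoint characterization (6.3). The cubic pair below is a hypothetical example (not from the paper):

```python
import numpy as np

# Hypothetical separable pair (not from the paper):
#   f1'(x) = x^3 - x + 0.2 ,  f2'(x) = x^3 - x - 0.2
dfs = [lambda x: x**3 - x + 0.2, lambda x: x**3 - x - 0.2]

def absorbing_intervals(dfs, lo=-1.5, hi=1.5, n=120001, tol=1e-3):
    """Grid search for the absorbing intervals T_m = [l_m, r_m]: maximal runs
    of L (some f_i' > 0) intersected with R (some f_i' < 0) whose endpoints
    satisfy (6.3) up to the grid/tolerance error."""
    x = np.linspace(lo, hi, n)
    in_L = np.any([df(x) > 0 for df in dfs], axis=0)   # SGD can move left
    in_R = np.any([df(x) < 0 for df in dfs], axis=0)   # SGD can move right
    both = in_L & in_R
    Ts, start = [], None
    for k in range(n):
        if both[k] and start is None:
            start = k
        if start is not None and (not both[k] or k == n - 1):
            l, r = x[start], x[k]
            # endpoint test: l not in L (all f_i'(l) <= 0), r not in R
            if all(df(l) <= tol for df in dfs) and all(df(r) >= -tol for df in dfs):
                Ts.append((l, r))
            start = None
    return Ts

print(absorbing_intervals(dfs))   # two intervals near [-1.09, -0.88] and [0.88, 1.09]
```

The third maximal run of L ∩ R (around the origin) is correctly rejected because its endpoints violate (6.3): the chain can exit there in both directions.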
6.2. Properties Related to the SGD Dynamics. The next proposition shows that the
intervals of L and R bound the regions in which SGD can move to the left and right,
respectively.
Proposition 6.3 (Dynamics related to L and R). Let d = 1. Assume (A1)–(A4) hold with
L and R defined in (2.5)–(2.6), and let 0 < η < 1/K. Then the maps φi are monotone increasing
on I, i.e., x < y implies φi (x) < φi (y) for x, y ∈ I.
Furthermore, if (l, r] ⊂ L ∩ I with l ∈ ∂L, there exist paths that map r arbitrarily close to l
(but no further):
(6.5) φp⃗ (x) > l for every path p⃗ and x ∈ (l, r] , while inf over paths p⃗ of φp⃗ (r) equals l .
An analogous statement holds for intervals [l, r) ⊂ R ∩ I with r ∈ ∂R:
(6.6) φp⃗ (x) < r for every path p⃗ and x ∈ [l, r) , while sup over paths p⃗ of φp⃗ (l) equals r .
Proof. Monotonicity follows since, for x < y in I,
φi (y) − φi (x) = (y − x) − η ( fi′ (y) − fi′ (x) ) ≥ (1 − ηK)(y − x) > 0 ,
combined with the Lipschitz bound on fi′ in (A4) and η < 1/K.
We prove (6.5); (6.6) follows by an identical argument. Since l ∉ L, we have fi′ (l) ≤ 0, and
hence φi (l) ≥ l, for all i. Combining this with the fact that the maps φp⃗ are monotone on (l, r]
(which is in I) implies φp⃗ (x) > φp⃗ (l) ≥ l for every path p⃗ and x ∈ (l, r].
Now suppose, for contradiction, that α, the infimum of φp⃗ (r) over all finite paths p⃗, satisfies
α > l, and define ∆(x) := max1≤i≤n fi′ (x). The function ∆(x) is continuous (since it is the
pointwise maximum of a finite collection of continuous functions) and strictly bounded by ∆(x) > 0
on [α, r] (since the interval is contained in L). By compactness, ∆(x) achieves its minimum
∆0 > 0 on [α, r]. However, this yields a contradiction by the definition of α: for every x ∈ [α, r]
there is a map φi satisfying
φi (x) ≤ x − η∆0 ,
which implies there is a path p⃗ of finite length for which φp⃗ (r) < α.
We show next that the sets Tm are positive invariant, or equivalently, absorbing.
Proposition 6.4 (Tm are positive invariant). Let d = 1. Assume (A1)–(A5) hold and
0 < η < 1/K. Then I and each Tm (1 ≤ m ≤ MT ) is positive invariant.
Proof. Write Tm = [lm , rm ]. Combining the facts that lm ∉ L and rm ∉ R (by Proposition 6.2),
φi is monotone on Tm (by Proposition 6.3), and (2.7) and (2.8) yields
φi (lm ) ≥ lm and φi (rm ) ≤ rm for all 1 ≤ i ≤ n .
Hence,
(6.7) φi (Tm ) ⊂ Tm
holds for all i and m. By the same argument I = [a, b] is also positive invariant.
We conclude this section with a proof that there is a fixed path length for which every point
x ∈ I outside T can be mapped into T . This will provide the basis for establishing that the set B
is uniformly transient in the main result.
Lemma 6.5 (Uniform path length in one dimension). Let d = 1. Assume (A1)–(A5) and
0 < η < 1/K hold and define T as in (2.10). Then there exists a uniform path length ℓ0 such that
for every x ∈ I there is a path p⃗ of length ℓ0 satisfying φp⃗ (x) ∈ T .
Proof. Step 1: We first show that for every x ∈ I there exists a path ⃗tx , whose length may
depend on x, such that
(6.8) φ⃗tx (x) ∈ int T .
It suffices to establish that
(6.9) every x ∈ I \ T lies in an interval of the form (lm , c) ⊂ L or (c, rm ) ⊂ R for some m .
Condition (6.9) combined with (6.5)–(6.6) from Proposition 6.3 implies (6.8).
The remaining proof of (6.9) for Step 1 is visualized in Figure 6.1. Assume without loss
of generality that x ∈ (β1 , β2 ) ⊂ L, where β1 and β2 define the largest open sub-interval of L
containing x (i.e., both β1 , β2 lie on ∂L). If x ∉ L, then x ∈ R and an identical argument applies
with the roles of L and R exchanged.
If β1 = lm for some m, then we are done (see Figure 6.1(a)).
If β1 ≠ lm (for every m), then β1 ∈ R by (6.3) (see Figure 6.1(b)). Let (β1 , β3 ) ⊂ R be the
largest sub-interval of R with β3 ∈ ∂R. Then β2 < β3 , otherwise [β1 , β3 ] would define a Tm and
(6.9) would hold.
SGD MARKOV CHAINS WITH SEPARABLE FUNCTIONS 23
Fig. 6.1. Visualization of two sub-cases for the proof in Step 1 of Lemma 6.5.
By construction, the interval [x, β3 ) ⊂ R. We now claim that β3 = rm for some m, in which
case we are done. Indeed, since R ∪ L covers I (e.g., (6.2)) and β3 ∈ ∂R, there must
be an interval of the form (β4 , β3 ] ⊂ L with β4 ∈ ∂L. Lastly, [β4 , β3 ] satisfies Definition 2.1, since
β3 > β2 and β2 ∉ L imply β4 ≥ β2 > β1 , which implies that (β4 , β3 ) ⊂ R.
Step 2: To complete the proof of the Lemma, it is sufficient to show that there exists an
ℓ = ℓ(η) for which I ⊂ Uℓ , where
(6.10) Uk := { x ∈ R : φ⃗t (x) ∈ int T for some path ⃗t with |⃗t| = k }
is the set of points that map into the interior of T after k steps. Let
U := ∪_{k=0}^∞ Uk
be the set of all points that can reach the interior of T (with an arbitrary path length). Note that
Uk is equivalently
Uk = ∪ { φ⃗t^{-1}(int T ) : |⃗t| = k } .
Since each φi is continuous and int T is open, each Uk , and hence U , is open. By (6.8), the
collection {Uk : k ≥ 0} is an open cover of I and by compactness has a finite sub-cover, i.e.,
I ⊂ ∪_{k=0}^ℓ Uk ,
for some ℓ. Since φi is (strictly) monotone on I when η < 1/K, and T is positive invariant, each
φi maps open subsets of T into open subsets of T , whence φi (int T ) ⊂ int T for every 1 ≤ i ≤ n.
Consequently, if x ∈ Uk , appending any map to a path realizing φ⃗t (x) ∈ int T shows x ∈ Uk+1 ;
the sets I ∩ Uk ⊂ I ∩ Uk+1 are therefore nested, and hence I ⊂ Uℓ .
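Lemma 6.5 suggests a simple numerical experiment: bound the minimal path length into T greedily by always choosing the map that moves closest to T. Since T is positive invariant, shorter entry paths can be padded to a common length ℓ0. The maps and intervals below are the hypothetical cubic example used earlier (not from the paper):

```python
import numpy as np

eta = 0.1
# Hypothetical pair: f_i'(x) = x^3 - x +/- 0.2 (same toy example as before)
maps = [lambda x: x - eta * (x**3 - x + 0.2),
        lambda x: x - eta * (x**3 - x - 0.2)]
T = [(-1.0881, -0.8794), (0.8794, 1.0881)]    # absorbing intervals (precomputed)

def dist_T(y):
    """Distance from y to the set T (zero inside T)."""
    if any(l < y < r for (l, r) in T):
        return 0.0
    return min(min(abs(y - l), abs(y - r)) for (l, r) in T)

def steps_into_T(x, max_steps=2000):
    """Greedy upper bound on the minimal path length mapping x into int T."""
    for k in range(max_steps):
        if dist_T(x) == 0.0:
            return k
        x = min((phi(x) for phi in maps), key=dist_T)   # one valid path choice
    return None

grid = np.linspace(-1.3, 1.3, 261)
steps = [steps_into_T(float(x)) for x in grid]
ell0 = max(steps)
print(ell0)   # finite: shorter paths can be padded since T is positive invariant
```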
7. Proof of the Main Result. Building on the one dimensional results, we provide the
proofs for each part of the main result.
7.1. Main Result Proof of Part (a). The proof of Theorem 2.2(a) makes use of the follow-
ing lemma — which is a general statement regarding uniformly transient sets for Markov operators
arising from iterated function systems. The lemma makes no assumptions on the regularity or
monotonicity of the maps φi .
Lemma 7.1 (Uniformly transient sets for iterated function systems). Consider the dynamics
(1.2) for a family of maps φi : I → I (1 ≤ i ≤ n), where I ⊂ R^d is non-empty and P is defined
in (1.7). Suppose the set T ⊂ I (B := I \ T ) is positive invariant and that for each x ∈ B there
exists an index i (depending on x) with φi (x) ∈ T .
Then for any finite measure µ ∈ M (I),
(Pµ)(B) ≤ ( 1 − 1/n ) µ(B) .
The intuition behind Lemma 7.1 is that a fraction of the mass µ(B) travels into T at every
iteration. The proof is provided here for completeness.
Proof. For each i = 1, . . . , n let Bi := {x ∈ B : φi (x) ∈ T }, i.e., Bi ⊂ B with φi (Bi ) ⊂ T .
The sets Bi then satisfy the identity
(7.1) φi^{-1}(B) ∩ I = B \ Bi ,
since T is positive invariant (so that φi^{-1}(B) ∩ T = ∅) and, for x ∈ B, φi (x) ∈ T exactly when
x ∈ Bi . By hypothesis every x ∈ B lies in at least one Bi , so that ∪_{i=1}^n Bi = B and hence
Σ_{i=1}^n µ(Bi ) ≥ µ(B). Combining these facts with the definition (1.7) of P yields
(Pµ)(B) = (1/n) Σ_{i=1}^n µ( φi^{-1}(B) ∩ I ) = µ(B) − (1/n) Σ_{i=1}^n µ(Bi ) ≤ ( 1 − 1/n ) µ(B) .
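The contraction of Lemma 7.1 can be observed on a toy instance satisfying its hypothesis (a hypothetical choice, not from the paper): on I = [0, 1] take T = [0, 1/2], φ1(x) = x/2 (which maps every point into T) and φ2(x) = x² (which keeps T positive invariant); here n = 2, so the mass in B should shrink by at least 1/2 per step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of Lemma 7.1 on I = [0, 1] with T = [0, 1/2], B = (1/2, 1]:
#   phi_1(x) = x/2 maps every point into T (the hypothesis holds with i = 1),
#   phi_2(x) = x^2 keeps T positive invariant.
N = 200_000
x = rng.uniform(0.0, 1.0, N)                # samples from mu_0 = uniform on I
fracs = []
for _ in range(6):
    fracs.append(float(np.mean(x > 0.5)))   # Monte Carlo estimate of mu_k(B)
    use_phi1 = rng.integers(0, 2, N) == 0   # each map chosen with prob 1/2
    x = np.where(use_phi1, x / 2.0, x ** 2)
print(fracs)   # successive entries drop by at least the factor (1 - 1/n) = 1/2
```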
Proof of Main Result Theorem 2.2(a). (i) For each 1 ≤ j ≤ d the family of functions {fi^(j) }ni=1
satisfies (A1)–(A5). Proposition 6.2(a) implies, for each j, that Mj ≥ 1, so that MT ≥ 1 and T is
non-empty. Secondly, Proposition 6.2(c) implies that for each m = 1, . . . , Mj at least one local
minimum of
g^(j) (xj ) := (1/n) Σ_{i=1}^n fi^(j) (xj )
is contained in each Tm^(j). Since we can write F (x) = Σ_{j=1}^d g^(j) (xj ) and Tm is a rectangle of the
form (2.9), it follows that each Tm (m ∈ M) contains a local minimum of F .
Thirdly, Proposition 6.4 implies that Tm^(j) (1 ≤ m ≤ Mj ) is positive invariant with respect to
the maps {φi^(j) }ni=1. Combining this fact with the form (2.2) of φi for separable functions implies
that both I and Tm (m ∈ M) are positive invariant with respect to the d dimensional maps
{φi }ni=1.
Additional positive invariant sets can also be constructed via intersections and unions, for
instance, rectangles of the form I (1) × . . . × I (j−1) × T (j) × I (j+1) × . . . × I (d) are also positive
invariant. This will be used in the proof of Part (ii).
(ii) Apply Lemma 6.5 separately to each dimension. Then there exists an ℓ(j) such that for every
xj ∈ I^(j) there exists a path ⃗tj (depending on xj ) of length ℓ(j) satisfying φ⃗tj^(j) (xj ) ∈ T^(j). Now
let
(7.3) ℓ0 = Σ_{j=1}^d ℓ(j) .
Then for every x = (x1 , x2 , . . . , xd ) ∈ I define the composition path p⃗ = p⃗1 ◦ p⃗2 ◦ · · · ◦ p⃗d as follows.
Set p⃗d (of length ℓ(d)) so that
(7.4) φp⃗d (x) ∈ I^(1) × · · · × I^(d−1) × T^(d) .
Pick p⃗d−1 with length ℓ(d−1) to map the (d − 1)st component of φp⃗d (x) into T^(d−1). The positive
invariance of the set on the right hand side of (7.4) yields
φp⃗d−1 ◦ φp⃗d (x) ∈ I^(1) × · · · × I^(d−2) × T^(d−1) × T^(d) .
Continuing to define p⃗j with length ℓ(j) to map the jth component of φp⃗j+1 ◦ · · · ◦ φp⃗d (x) into
T^(j), one obtains
φp⃗ (x) = φp⃗1 ◦ · · · ◦ φp⃗d (x) ∈ T^(1) × T^(2) × · · · × T^(d) = T .
(iii) Take ℓ0 as defined in (7.3) and apply Lemma 7.1 to the family of n^ℓ0 maps {φ⃗j : |⃗j| = ℓ0 }.
7.2. Main Result Proof of Part (b). Theorem 2.2(b) follows directly from Theorem 5.2,
where the key technical statement to prove is that the SGD dynamics satisfy the splitting condition
(5.5).
To prove (5.5), we proceed by induction on the dimension. In particular, the idea is to consider
the restriction of the SGD dynamics (for separable functions) to the first j coordinates. We then
show that if (5.5) holds for the first j coordinates, then (5.5) holds for the first j + 1 coordinates
as well.
There is a subtlety in the proof in that the argument relies on the dynamics (1.3) to construct
α defining the orthant Rdα . The following example demonstrates that (5.5) does not necessarily
hold for every choice of α, but rather does hold for at least one α.
Example 1. Set d = 2, f1 (x1 , x2 ) = (x1 )^2 + (x2 − 1)^2 and f2 (x1 , x2 ) = (x1 − 1)^2 + (x2 )^2 so
that
f1^(1) (x1 ) = x1^2 , f1^(2) (x2 ) = (x2 − 1)^2 , f2^(1) (x1 ) = (x1 − 1)^2 , f2^(2) (x2 ) = (x2 )^2 .
In this example, there is only a single set T = I = [0, 1]^2 as defined in (2.9). Provided η < 1/K,
the following subsets of T are positive invariant: (i) the line segment y = 1 − x; (ii) the region
y > 1 − x; and (iii) the region y < 1 − x. As a result, the condition (5.5) with respect to
α = (+1, +1) is never satisfied (for any pair of paths of any length), since mappings of (0, 0) and
(1, 1) stay separated by the line segment y = 1 − x; no pair of maps will flip the ordering of
(0, 0) and (1, 1).
The condition (5.5) is, however, readily verified for α = (+1, −1). Despite the fact that T has
three positive invariant sets, the main result implies any initial measure µ0 ∈ P(I) converges
to a unique invariant measure (which in this case is supported on the line segment y = 1 − x,
0 ≤ x ≤ 1).
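Example 1 can be simulated directly. For both maps the sum s = x + y obeys the same affine recursion s′ = (1 − 2η)s + 2η, so every trajectory is attracted to the segment y = 1 − x regardless of the random map choices (a short sketch; η = 0.05 is an arbitrary admissible choice):

```python
import random

eta = 0.05   # any 0 < eta < 1/K works; here K = 2 for these quadratics

def phi1(p):   # gradient step on f1 = x^2 + (y - 1)^2
    x, y = p
    return (x - 2 * eta * x, y - 2 * eta * (y - 1))

def phi2(p):   # gradient step on f2 = (x - 1)^2 + y^2
    x, y = p
    return (x - 2 * eta * (x - 1), y - 2 * eta * y)

# For BOTH maps, s = x + y obeys s' = (1 - 2*eta) s + 2*eta, so s -> 1
# deterministically: every trajectory approaches the segment y = 1 - x.
rng = random.Random(1)
p = (0.9, 0.8)                 # start off the line x + y = 1
for _ in range(500):
    p = rng.choice([phi1, phi2])(p)
print(p[0] + p[1])             # x + y is (numerically) 1
```

The limit point along the segment is random, but the sum of coordinates contracts to 1 at the deterministic rate (1 − 2η) per step.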
Proof of Main Result Theorem 2.2(b). It is sufficient to show that {φi }ni=1 and the set Tm
satisfy the hypothesis in Theorem 5.2 (or equivalently Theorem 5.1 when d = 1).
Under the restriction η < 1/K, every map φi is monotone on I with respect to every orthant
R^d_α. This is because each component φi^(j) of φi (see (2.2)) is an increasing function on I^(j) via
Proposition 6.3.
It remains to show {φi }ni=1 satisfy (5.5). For notational simplicity, we assume without loss of
generality (e.g., after translation and scaling of Xk ), that Tm has the form
Tm = [0, 1]d .
Assume that (5.5) holds in R^{d−1}. That is, for the family of maps {φ̂i }ni=1 given as
φ̂i (x1 , x2 , . . . , xd−1 ) = ( φi^(1) (x1 ), φi^(2) (x2 ), . . . , φi^(d−1) (xd−1 ) ) ,
where φ̂i is the restriction of φi to the first d − 1 coordinates, there exists an α̂ ∈ {+1, −1}^{d−1},
x̂0 ∈ R^{d−1} and two paths p⃗1 and p⃗2 (each having the same length ℓ̂) satisfying
(7.5) φ̂p⃗1 ( [0, 1]^{d−1} ) ⪯α̂ x̂0 and x̂0 ⪯α̂ φ̂p⃗2 ( [0, 1]^{d−1} ) .
By hypothesis {φ̂i }ni=1 satisfies (7.5) for some α̂, x̂0 and paths p⃗1 and p⃗2 . We now construct α,
x0 , and two paths for which φi satisfies (5.5).
Turning attention to the dth coordinate, either
φp⃗2^(d) (0) ≥ φp⃗1^(d) (1) or φp⃗2^(d) (0) < φp⃗1^(d) (1) .
If φp⃗2^(d) (0) ≥ φp⃗1^(d) (1), then the dth coordinate satisfies the splitting condition in one dimension,
φp⃗1^(d) ( [0, 1] ) ≤ m̂ ≤ φp⃗2^(d) ( [0, 1] ) ,
where
(7.6) m̂ := (1/2) ( φp⃗2^(d) (0) + φp⃗1^(d) (1) ) .
Then (5.5) holds in dimension d for {φi }ni=1 with paths p⃗1 , p⃗2 , orthant α = (α̂, +1) and
(7.7) x0 := ( x̂0 , m̂ ) ∈ R^d .
If φp⃗2^(d) (0) < φp⃗1^(d) (1), then we construct two new paths (via concatenation) for which (5.5)
holds. Let K0 := (1 + ηK)^ℓ̂ and set
ε := (1/(2K0)) ( φp⃗1^(d) (1) − φp⃗2^(d) (0) ) > 0 .
The choice of K0 with Assumption (A4) then yields the following Lipschitz condition:
(7.8) | φp⃗j^(d) (x) − φp⃗j^(d) (y) | ≤ K0 |x − y| , j = 1, 2 and x, y ∈ [0, 1] .
Next, from Proposition 6.3 there exist two paths ⃗q1 and ⃗q2 for which
(7.9) φq⃗1^(d) ( [0, 1] ) ⊂ [1 − ε, 1] ,
and
(7.10) φq⃗2^(d) ( [0, 1] ) ⊂ [0, ε] .
Applying φp⃗1^(d) to (7.9) yields
(7.11) φp⃗1◦⃗q1^(d) ( [0, 1] ) ⊂ φp⃗1^(d) ( [1 − ε, 1] ) .
Using the fact that φp⃗1^(d) is increasing, together with (7.8), implies
(7.12) φp⃗1^(d) (1) − φp⃗1^(d) (1 − ε) ≤ K0 ε .
Substituting the definition of ε into (7.12) yields φp⃗1^(d) (1 − ε) ≥ m̂. Together with φp⃗1^(d) (1) ≤ 1,
(7.11) implies
φp⃗1◦⃗q1^(d) ( [0, 1] ) ⊂ [ m̂, 1 ] .
By a similar argument, applying φp⃗2^(d) and (7.8) to (7.10) yields
φp⃗2◦⃗q2^(d) ( [0, 1] ) ⊂ [ 0, m̂ ] .
Subsequently, the maps {φi }ni=1 satisfy condition (5.5) in dimension d with paths p⃗1 ◦ ⃗q1 and p⃗2 ◦ ⃗q2 ,
orthant α = (α̂, −1) and point x0 defined in (7.7).
7.3. Main Result Proof of Part (c). The proof of part (c) utilizes the metric d˜ defined
in (2.16) which satisfies the following: for any finite non-negative Borel measures µ1 , µ2 , ν1 , ν2
supported on I,
(7.13) d̃( µ1 + µ2 , ν1 + ν2 ) ≤ d̃( µ1 , ν1 ) + d̃( µ2 , ν2 ) ,
(7.14) d̃( µ1 + µ2 , ν1 + ν2 ) ≤ ν1 (I) + µ1 (I) + d̃( µ2 , ν2 ) .
Here (7.13) and (7.14) follow from (4.5) and (4.6) together with the fact that for m ∈ M, Tm ⊂ I
and B ⊂ I are pairwise disjoint.
Proof of Main Result Theorem 2.2(c). Let µ0 ∈ P(I). Then for each m ∈ M, µk (Tm ) is a
bounded increasing sequence since Tm is positive invariant; see Theorem 2.2(a). Thus, the limits
cm := lim_{k→∞} µk (Tm ) , m ∈ M ,
exist, and we define µ⋆ := Σ_{m∈M} cm µ⋆m . Note that µ⋆ is a convex combination of invariant
measures and hence is invariant.
We now show the result
(7.15) d̃( P^{2ℓk} µ0 , µ⋆ ) ≤ 3 γ^k ,
where
ℓ := ℓ0 ∨ max_{m∈M} ℓm and γ := 1 − 1/n^ℓ .
Set Q := P^ℓ and write µ̃k := Q^k µ0 .
With these notations, and using the fact that Q^{2k} µ0 = Q^k µ̃k , we can decompose the measure
(7.16) Q^{2k} µ0 = Q^k ( µ̃k |B ) + Σ_{m∈M} Q^k ( µ̃k |Tm ) ,
so that, by (7.13)–(7.14),
d̃( Q^{2k} µ0 , µ⋆ ) ≤ I1 + I2 + I3 ,
where
(7.17) I1 = ( Q^k ( µ̃k |B ) )(I) ,
I2 = Σ_{m∈M} ( cm − µ̃k (Tm ) ) µ⋆m (I) ,
I3 = d̃( Σ_{m∈M} Q^k ( µ̃k |Tm ) , Σ_{m∈M} µ̃k (Tm ) µ⋆m ) .
The first term satisfies
(7.18) I1 = ( µ̃k |B )(I) = µ̃k (B) ≤ γ^k .
The first equality in (7.18) follows since I is positive invariant and Q is a Markov operator; the
last inequality follows from (2.12).
Similarly, since µ⋆m (I) = 1 and the cm ’s sum to one, the second term is
I2 = Σ_{m∈M} ( cm − µ̃k (Tm ) ) = 1 − µ̃k (T ) = µ̃k (B) ≤ γ^k .
Estimating the third term I3 follows from Theorem 2.2(b), which quantifies the convergence of the
(normalized) restriction µ̃k |Tm to µ⋆m . In particular, for any non-negative measure ν supported
on Tm , from Theorem 2.2(b) we obtain for all k ≥ 0 and m ∈ M:
(7.19) d̃( Q^k ν , |ν| µ⋆m ) = dαm( P^{kℓm + k(ℓ−ℓm)} ν , |ν| µ⋆m )
≤ |ν| γm^k dαm( P^{k(ℓ−ℓm)} ( ν/|ν| ) , µ⋆m ) (by (4.7) and (2.13))
≤ |ν| γ^k .
REFERENCES
[1] C. Bandt, Finite orbits in multivalued maps and Bernoulli convolutions, Adv. Math., 324 (2018), pp. 437–
485.
[2] A. Batsis, Ergodic theory methods in Bernoulli convolutions for algebraic parameters and self-affine measures, PhD thesis, University of Manchester, 2021.
[3] G. Ben Arous, R. Gheissari, and A. Jagannath, High-dimensional limit theorems for SGD: Ef-
fective dynamics and critical scaling, in Advances in Neural Information Processing Systems,
S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, eds., vol. 35, Curran
Associates, Inc., 2022, pp. 25349–25362, https://proceedings.neurips.cc/paper_files/paper/2022/file/
a224ff18cc99a71751aa2b79118604da-Paper-Conference.pdf.
[4] M. Benaı̈m, Dynamics of stochastic approximation algorithms, in Séminaire de Probabilités XXXIII,
J. Azéma, M. Émery, M. Ledoux, and M. Yor, eds., Berlin, Heidelberg, 1999, Springer Berlin Heidel-
berg, pp. 1–68.
[5] I. Benjamini and B. Solomyak, Spacings and pair correlations for finite Bernoulli convolutions, Nonlinearity,
22 (2009), pp. 381–393.
[6] R. N. Bhattacharya and O. Lee, Asymptotics of a class of Markov processes which are not in general
irreducible, Ann. Probab., 16 (1988), pp. 1333–1347.
[7] R. N. Bhattacharya and O. Lee, Correction: Asymptotics of a class of Markov processes which are not in
general irreducible, Ann. of Probab., 25 (1997), pp. 1541–1543.
[8] R. N. Bhattacharya and M. Majumdar, On a theorem of Dubins and Freedman, J. Theor. Probab., 12
(1999), pp. 1067–1087.
[9] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, Hindustan Book Agency, Gurgaon,
2008.
[10] P. Chaudhari, A. Oberman, S. Osher, S. Soatto, and G. Carlier, Deep relaxation: partial differential
equations for optimizing deep neural networks, Research in the Mathematical Sciences, 5 (2018), pp. 1–30.
[11] E. Counterman and S. Lawley, What should patients do if they miss a dose of medication? A theoretical
approach, Journal of Pharmacokinetics and Pharmacodynamics, 48 (2021), pp. 873–892.
[12] S. Dereich and S. Kassing, Convergence of stochastic gradient descent schemes for Lojasiewicz-landscapes,
2024.
[13] P. Diaconis and D. Freedman, Iterated random functions, SIAM Review, 41 (1999), pp. 45–76.
[14] L. E. Dubins and D. A. Freedman, Invariant probabilities for certain Markov processes, Ann. Math. Stat.,
37 (1966), pp. 837–848.
[15] M. Duflo, Algorithmes stochastiques, vol. 23 of Mathématiques & Applications, Springer-Verlag, Berlin,
1996.
[16] P. Erdős, On a family of symmetric Bernoulli convolutions, Am. J. Math., 61 (1939), pp. 974–976.
[17] D. J. Feng and E. Olivier, Multifractal analysis of weak Gibbs measures and phase transition—application
to some Bernoulli convolutions, Ergod. Theory Dyn. Syst., 23 (2003), pp. 1751–1784.
[18] Y. Feng, T. Gao, L. Li, J.-G. Liu, and Y. Lu, Uniform-in-time weak error analysis for stochastic gradient
descent algorithms via diffusion approximation, Communications in Mathematical Sciences, 18 (2020),
pp. 163–188.
[19] Y. Feng, L. Li, and J.-G. Liu, Semi-groups of stochastic gradient descent and online principal component
analysis: properties and diffusion approximations, Commun. Math. Sci., 16 (2018), pp. 777–789.
30 D. SHIROKOFF, P. ZALESKI
[20] K. Gelfert and G. R. Salcedo, Contracting on average iterated function systems by metric change, Non-
linearity, 36 (2023), p. 6879.
[21] A. Gupta, H. Chen, J. Pi, and G. Tendolkar, Some limit properties of Markov chains induced by recursive
stochastic algorithms, SIAM Journal on Mathematics of Data Science, 2 (2020), pp. 967–1003.
[22] A. Gupta and W. B. Haskell, Convergence of recursive stochastic algorithms using Wasserstein divergence,
SIAM Journal on Mathematics of Data Science, 3 (2021), pp. 1141–1167.
[23] M. Hairer, Convergence of Markov processes, Lecture notes, Imperial College London, (2021).
[24] U. Herkenrath and M. Iosifescu, On a contractibility condition for iterated random functions, Revue.
Roumaine Math. Pures Appl., 52 (2007), pp. 563–571.
[25] H. A. Hopenhayn and E. C. Prescott, Invariant distributions for monotone Markov processes, Discussion
Paper, Center for Economic Research, Department of Economics, University of Minnesota, 242 (1987),
pp. 1–33.
[26] H. A. Hopenhayn and E. C. Prescott, Stochastic monotonicity and stationary distributions for dynamic
economies, Econometrica, 60 (1992), pp. 1387–1406.
[27] W. Hu, C. J. Li, L. Li, and J. Liu, On the diffusion approximation of nonconvex stochastic gradient descent,
Annals of Mathematical Sciences and Applications, 4 (2019), pp. 3–32.
[28] J. E. Hutchinson, Fractals and self similarity, Indiana Univ. Math. J., 30 (1981), pp. 713–747.
[29] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, How to escape saddle points efficiently,
Proc. Int. Conf. Mach. Learn., (2017), pp. 1724–1732.
[30] T. Jordan, P. Shmerkin, and B. Solomyak, Multifractal structure of Bernoulli convolutions, Math. Proc.
Camb. Philos. Soc., 151 (2011), pp. 521–539.
[31] T. Kempton and T. Persson, Bernoulli convolutions and 1D dynamics, Nonlinearity, 28 (2015), pp. 3921–
3934.
[32] Q. Li, C. Tai, and W. E, Stochastic modified equations and adaptive stochastic gradient algorithms, in
Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh, eds.,
vol. 70 of Proceedings of Machine Learning Research, PMLR, 06–11 Aug 2017, pp. 2101–2110.
[33] S. Mandt, M. D. Hoffman, and D. M. Blei, Continuous-time limit of stochastic gradient descent revisited,
NIPS-2015, (2015).
[34] W. J. McCann, Stationary probability distributions of stochastic gradient descent and the success and failure
of the diffusion approximation, master’s thesis, New Jersey Institute of Technology, Newark, NJ, 2021.
[35] S. Meyn and R. Tweedie, The Doeblin decomposition, Doeblin and Modern Probability, 149 (1993), p. 211.
[36] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, Springer, London, 1993.
[37] J. Myjak and T. Szarek, Attractors of iterated function systems and Markov operators, Abstr. Appl. Anal.,
2003 (2003), pp. 479–502.
[38] D. Needell, R. Ward, and N. Srebro, Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm, in Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, eds., vol. 27, Curran Associates, Inc., 2014, https://proceedings.neurips.cc/paper_files/paper/2014/file/f29c21d4897f78948b91f03172341b7b-Paper.pdf.
[39] H. Robbins and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, 22
(1951), pp. 400–407.
[40] D. Steinsaltz, Locally contractive iterated function systems, Ann. Probab., 27 (1999), pp. 1952–1979.
[41] O. Stenflo, A survey of average contractive iterated function systems, J. Differ. Equ. Appl., 18 (2012),
pp. 1355–1380.
[42] S. M. Ulam, A collection of mathematical problems, New York: Interscience, 1964.
[43] S. Wojtowytsch, Stochastic gradient descent with noise of machine learning type. Part I: Discrete time
analysis, Journal of Nonlinear Science, 33 (2023), pp. 1–52.
[44] S. Wojtowytsch, Stochastic gradient descent with noise of machine learning type. Part II: Continuous time
analysis, Journal of Nonlinear Science, 34 (2023), pp. 1–45.
[45] W. B. Wu and X. Shao, Limit theorems for iterated random functions, J. Appl. Probab., 41 (2004), pp. 425–
436.
[46] J. A. Yahav, On a fixed point theorem and its stochastic equivalent, J. Appl. Prob., 12 (1975), pp. 605–611.
[47] L. Yu, K. Balasubramanian, S. Volgushev, and M. A. Erdogdu, An analysis of constant step size
SGD in the non-convex regime: Asymptotic normality and bias, in Advances in Neural Information
Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, eds.,
vol. 34, Curran Associates, Inc., 2021, pp. 4234–4248, https://proceedings.neurips.cc/paper_files/paper/
2021/file/21ce689121e39821d07d04faab328370-Paper.pdf.