0% found this document useful (0 votes)

76 views30 pages

Convergence of Markov Chains For Constant Step-Size Stochastic Gradient Descent With Separable Functions

Interesting to learn

Uploaded by

Savvas Nesseris

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

76 views30 pages

Convergence of Markov Chains For Constant Step-Size Stochastic Gradient Descent With Separable Functions

Interesting to learn

Uploaded by

Savvas Nesseris

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 30

CONVERGENCE OF MARKOV CHAINS FOR CONSTANT STEP-SIZE

STOCHASTIC GRADIENT DESCENT WITH SEPARABLE FUNCTIONS∗

DAVID SHIROKOFF† AND PHILIP ZALESKI‡

Abstract. Stochastic gradient descent (SGD) is a popular algorithm for minimizing objective functions that
arise in machine learning. For constant step-sized SGD, the iterates form a Markov chain on a general state space.
Focusing on a class of separable (non-convex) objective functions, we establish a “Doeblin-type decomposition,” in
that the state space decomposes into a uniformly transient set and a disjoint union of absorbing sets. Each of the
arXiv:2409.12243v1 [math.OC] 18 Sep 2024

absorbing sets contains a unique invariant measure, with the set of all invariant measures being the convex hull.
Moreover the set of invariant measures are shown to be global attractors to the Markov chain with a geometric
convergence rate. The theory is highlighted with examples that show: (1) the failure of the diffusion approximation
to characterize the long-time dynamics of SGD; (2) the global minimum of an objective function may lie outside
the support of the invariant measures (i.e., even if initialized at the global minimum, SGD iterates will leave); and
(3) bifurcations may enable the SGD iterates to transition between two local minima. Key ingredients in the theory
involve viewing the SGD dynamics as a monotone iterated function system and establishing a “splitting condition”
of Dubins and Freedman 1966 and Bhattacharya and Lee 1988.

Key words. Stochastic gradient descent, Diffusion approximation, Doeblin-type decomposition, Markov
chains, Spectral gap, Constant step-size, Bifurcations, Iterated function systems

AMS subject classifications. 68W20, 68W40, 37A30, 60J20

1. Introduction. In recent years, stochastic gradient descent (SGD) [39] has become an
immensely popular algorithm for minimizing objective functions F : Rd →
− R of the form
n
1X
(1.1) F (x) = fi (x) , where f i : Rd →
− R (fi ̸= 0) .
n i=1

Define the maps φi : Rd →

− Rd as
(1.2) φi (x) := x − η∇fi (x) (1 ≤ i ≤ n) ,
where the parameter η > 0 is the step-size (or learning rate).
In its simplest form, the constant step-size SGD generates a random sequence of iterates
{X0 , X1 , X2 , . . .} via the following dynamics:
Given X0 ∈ Rd , sampled from an initial distribution µ0 , update
(1.3) Xk+1 = φik (Xk ) where ik ∈ {1, 2, 3, . . . , n} =: [n]
is drawn independently and identically from a uniform distribution (i.i.d.); each map φi has
probability 1/n of being applied in (1.3).
Since the values of ik are i.i.d., the random variable Xk+1 satisfies the Markov property and
the sequence {Xk }k≥0 forms a Markov chain taking on values in the general state-space Rd (see
[23, 36] for background on Markov chains in general state spaces).
Associated to each Xk is the corresponding law of probability µk ∈ P(Rd ) characterizing the
probability distribution of Xk
µk (A) := Prob(Xk ∈ A) (A ∈ B(Rd )) .

Here P(Rd ) is the space of probability measures over the Borel σ-algebra B(Rd ).
The probability laws for a Markov chain then evolve via deterministic linear dynamics accord-
ing to a Markov operator
(1.4) µk+1 = Pµk .
∗ Submitted to the editors September 20, 2024.
† Department of Mathematical Sciences, New Jersey Institute of Technology, Newark, NJ (shirokof@njit.edu).
‡ Corresponding author, Department of Mathematical Sciences, New Jersey Institute of Technology, Newark, NJ

(pz85@njit.edu).
1
2 D. SHIROKOFF, P. ZALESKI

For general state spaces, the Markov operator is defined as

Z
(1.5) (Pµ)(A) := p(x, A) dµ(x) (A ∈ B(Rd ))
Rd

where p : Rd × B(Rd ) →
− [0, 1] is the transitional kernel

p(x, A) := Prob (Xk+1 ∈ A | Xk = x) .

For each x ∈ Rd , p(x, ·) is a probability measure, while for each Borel set A, p(·, A) is a measureable
function so that P is well defined to act on probability measures (or more generally finite measures).
Intuitively, the value of p(x, A) measures the probability that the Markov chain transitions
from a point x into the set A (which is the infinite dimensional analogue to matrix elements of a
Markov matrix). Hence, for the SGD Markov chain (1.3), p is the fraction of maps φi that map
x into A
n
1X
(1.6) p(x, A) = χA φi (x) ,
n i=1

where χA is the characteristic function of the set A.

For SGD (1.3), P then takes the form
n
1X
µ φ−1

(1.7) (Pµ)(A) = i (A) ,
n i=1

where φi (x) are defined in (1.2) and φ−1

i (A) is the preimage of A.
A probability measure µ⋆ is invariant (or stationary) with respect to the operator P if

(1.8) Pµ⋆ = µ⋆ .

The purpose of this work is to establish the convergence of the probability measures µk for
constant step-size SGD. We restrict our attention to separable (but non-convex) objective functions
F and fi . This has the advantage of enabling a simple, yet relatively complete, characterization
of the how the probability measures µk converge to a convex combination of the extreme points
of the invariant measures. Our results make use of techniques from the theory of iterated function
systems for monotone maps. A technical part of the proofs involves verifying a splitting condition
[14, 6] associated to Markov operators for these maps.
Analyzing the exact dynamics defined by the operator P then enables a rigorous study of
bifurcations. In particular, we provide an exact bifurcation study of whether iterates of SGD
escape a local minima of F — contradicting predictions inferred by the diffusion approximation.
1.1. Background on SGD with Adaptive ηk → 0 . An intuitive motivation for the
dynamics (1.3) is that the expectation of Xk+1 is a gradient step of F evaluated at Xk , i.e.,

(1.9) E(Xk+1 ) = Xk − η∇F (Xk ) .

Equation (1.9) implies that on average, one expects the iterates Xk to move towards local minima
of F (x). This observation can be made precise and has lead to a significant body of work on SGD
in the small η ≪ 1 step-size limit. When the step-size η in (1.3) is adaptive, that is, η is replaced
with ηk with ηk → 0 as k → ∞, it is well-established that the iterates Xk converge the dynamics
of the ordinary differential equation

(1.10) ẋ = −∇F (x) .

For instance, see [15], [4, Proposition 4.1–4.2] and [9, Chapter 2] (and references within). Closely
related to the asymptotic trajectory (1.10) are a range of results establishing that Xk converges
(almost surely) to minimizers of F , e.g. [43, 12] for functions F satisfying a Lojasiewicz inequality
in lieu of convexity.
SGD MARKOV CHAINS WITH SEPARABLE FUNCTIONS 3

1.2. Background for SGD with Constant η. While much is known about SGD with
adaptive step-sizes, far less is known in the constant step-size setting when F is non-convex. In
particular, regarding (1.3): Does the Markov chain Xk remain trapped in an energy well of F ?
Or explore all minima of F ? And if so, over what time-scales? Answers to these questions may
help to provide insight into the initial phase of adaptive step-size SGD.
The difficulty in establishing a general theory for either the Markov chain Xk , or the associated
probability laws µk , is highlighted by the generality of SGD. For instance, the dynamics (1.3)
include, as special cases:
• All continuous 1-dimensional deterministic iterative maps. This includes both the Logistic
map and Tent map. For instance, take n = 1 and F (x) = − 32 x2 + 43 x3 so that Xk+1 =
4Xk (1 − Xk ) when η = 1. Varying η ∈ (0, 1], the iterates Xk exhibit the classic behavior
of period doubling and the emergence of chaos;
• Random walks, e.g., set F (x) = 0 with f1 (x) = x and f2 (x) = −x;
• Infinite Bernoulli convolutions, which up to a linear change of variables, have the form
F (x) = x2 , with f1 (x) = (x−1)2 , f2 (x) = (x+1)2 . Erdős [16] showed that in this quadratic
setting the corresponding invariant measures may be singular. Quadratic models have
also found recent applications in biological settings [11], and remain an area of research
in dynamical systems [17, 5, 30, 31, 1, 2].
In light of the (extremely) broad class of dynamics given by (1.3), we focus on separable fi .
Continuous time partial differential equations (PDEs) provide one approach for approximating
the evolution (1.4). Treating η ≪ 1 as a small parameter, (1.10) can be viewed as a leading
order approximation to the dynamics for Xk ; this also yields a corresponding advection PDE
approximation for (1.4). Truncating formal expansions of the operator P at progressively higher
orders in the asymptotic parameter η [32] then yield successive improvements. For instance, a
second order expansions in η to P yields a diffusion approximation to (1.4) (see §3.1). The diffusion
approximation is known to accurately describes the evolution of µk for finite time [32, 27, 19], and
infinite time when F is convex [18]. The diffusion approximation has also been used to gain
insight into SGD dynamics and (in the regime for which it is valid) can estimate the probability
distribution of Xk near the minimum of F (cf. [38] for estimates and convergence rates when fi
are convex). As we discuss below, the diffusion approximation can fail significantly to capture the
correct long-time dynamics such as the number of, and regularity of, invariant measures (cf. [34]).
Variants of the diffusion equation have also been used to model SGD dynamics [33, 10].
Rigorous PDE theory, such as the existence and geometric convergence to the invariant measure,
have also been established for diffusion equation models [44]. Partial differential equation models
can also arise from SGD as limiting dynamics where the asymptotic parameter is the dimension
(not η!) [3].
One approach to establish convergence of P k µ to a unique invariant measure (on a state space
X), is through a Doeblin condition of the form

(1.11) p(x, A) ≥ ϵν(A) for all A ∈ B(X), x ∈ X

holds for some ν ∈ P(X), ϵ > 0. Some algorithmic variants of SGD, such as those that add random
noise at each step (e.g., stochastic gradient Langevin dynamics), or make strong assumptions on fi
which mimic random noise (e.g., [47]), closely resemble a stochastic ODE. In these cases, conditions
such as (1.11) may be verified to prove convergence to a unique invariant measure. The version of
SGD (1.3), in general, does not satisfy (1.11) (even when X is restricted to the absorbing sets).
Our main result establishes the convergence of the exact Markov chain dynamics (1.4). We
emphasize that we do not make use of concepts that often arise in Markov chains with random
noise, e.g., φ-irreducibility (which is a notion of irreducibility for general state spaces), detailed
balance/reversibility, or (1.11). We also make no continuous time approximation. Rather, the
starting point is to view (1.3) as a random iterated function system (IFS). We then show that
the SGD dynamics (1.3) satisfy the splitting conditions [14, 6] that guarantees convergence to an
invariant measure for IFS with monotone maps.
4 D. SHIROKOFF, P. ZALESKI

1.3. Background on Iterated Function Systems. There is a long history of iterated

function systems theory in dynamical systems — notably as means to construct fractals. Two
broad approaches to establish existence and convergence to invariant measures in IFS are: (i) to
show that the maps φi satisfy a contraction condition, or (ii), which is the approach we adopt,
to show that the maps are monotone and satisfy a splitting condition. For example, when φi are
contractions (which can occur when fi is convex and η is constant but small enough), a result
by Hutchinson [28, Section 4.4, Theorem 1] shows that P is also a contraction on the space of
probabilities and µk converges geometrically to a unique invariant measure.
Convergence results for IFS have also been generalized to allow for weaker notions of con-
tractivity [13, 20, 24, 37, 40, 41, 45]. Contraction conditions have recently been used to establish
convergence in specific instances (e.g., when fi are convex) of SGD [22, 21]. Unfortunately, con-
tractivity conditions can be difficult to verify (or possibly fail) in practical settings. For instance,
when fi is non-convex, the map φi (for small η) is never contractive in the standard Euclidean
norm. Since we are particularly interested in settings where fi and F are non-convex we avoid
the use of contractions and instead make use of monotone maps [14, 6, 46, 8].
1.4. Contributions and Organization of the Paper. For constant step-size SGD with
separable (non-convex) functions, the main contributions of the paper are:
• A decomposition of the Markov chain state space for (1.4) into a uniformly transient set
and disjoint union of absorbing sets — each supporting a unique invariant measure (these
are the extreme points in the set of invariant measures);
• The set of invariant measures are attractors with P k µ converging geometrically to a convex
combination of the extreme points;
• Bounds on the number and support of the invariant measures;
• Examples demonstrating the failure of the diffusion approximation; that the invariant
measures may be supported outside a neighborhood of the global minima of F ; and a
rigorous bifurcation study of the dynamics (1.4).
The manuscript starts in §2 by introducing the assumptions on fi and presenting the main result.
Examples are then provided in §3. The remainder of the paper is devoted to establishing the main
result: §4 introduces the required mathematical notation while §5 reviews convergence theory for
Markov operators with monotone maps φi . Several theoretical results for SGD in one dimension
are established in §6, which are then expanded on in a non-trivial way to prove the main result in
§7.
2. Main Result. In this section we introduce the assumptions and main result.
2.1. Assumptions and Problem Setting. Throughout, we assume that each of the func-
tions fi (and consequently F ) are separable, that is, they are the sum of single variable functions
(j)
fi : R → R
d
(j)
X
(2.1) fi (x) = fi (xj ) where x = (x1 , x2 , . . . , xd ) ∈ Rd .
j=1

(j)
With this convention, fi is allowed to be 0, however to avoid trivialities fi ̸= 0.
For functions of the form (2.1), the maps φi : Rd → Rd become

(2.2) φi (x) = φ(1) (2) (d)
i (x1 ), φi (x2 ), . . . , φi (xd )

where each of the components of φi is a single variable function of xj

(j) d (j)
φi (s) = s − η f (s) (1 ≤ j ≤ d) .
ds i
(j)
For all non-zero fi (i = 1, . . . , n; j = 1, . . . , d) we assume that
(j) (j)
(A1) fi is continuously differentiable, i.e., fi ∈ C 1 (R).
SGD MARKOV CHAINS WITH SEPARABLE FUNCTIONS 5
(j)
(A2) fi has a finite number of critical points.
(j)
(A3) fi (x) → ∞ as |x| → ∞.
(j)
For each family of functions {fi }ni=1 the set of critical points
n
[ d (j) (j)
C (j) = x∈R : fi (x) = 0, fi ̸= 0
i=1
dx

is then finite and contained in an interval of the form

(2.3) I (j) := [a, b] ,

where we define a and b to be the smallest and largest elements of C (j) . With this notation, the
general state space will be

(2.4) I := I (1) × I (2) × . . . × I (d) ⊂ Rd .

We further assume that

(j)
(A4) The derivatives of fi are K-Lipschitz on I (j) , i.e., for some K > 0

d (j) d (j)
f (x) − f (y) ≤ K |x − y| for x, y ∈ I (j) and 1 ≤ i ≤ n .
dx i dy i
(j)
(A5) (Inconsistent optimization) For each 1 ≤ j ≤ d the functions fi share no common critical
point, i.e.,
n
\ d (j)
xj ∈ R : f (xj ) = 0 = ϕ.
i=1
dx i

The assumptions (A1)–(A4) (or minor variations of) are standard in the gradient descent
(j)
optimization literature. Together they ensure that when η < K −1 each φi is an increasing
function — a property we use in the proofs. Condition (A2) is made for simplicity to rule out
complications when fi admits an infinite number of critical points, however the condition can be
relaxed, for instance some of the theory we present generalizes to allow for some fi to be constant
on an interval.
Assumption (A5) is sometimes referred to as inconsistent optimization since it implies the
fi do not share a common minimizer. Condition (A5) also implies that for each j at least two
(j)
functions in the set {fi }ni=1 are non-zero (otherwise (A5) fails trivially).
While it may appear that (A5) is overly simplifying, it is necessary to establish both con-
vergence in a (strong) metric, and uniform geometric convergence rates in Theorem 2.2. When
Assumption (A5) is removed, the convergence theory we build on from [6, 14] no longer applies as
stated.
The necessity of Assumption (A5) in the main result is highlighted with the simple example:
set n = 1, d = 1 and F (x) = x2 . Taking η = 12 , the SGD dynamics revert to deterministic gradient
descent Xk+1 = 12 Xk . If X0 = 1, so that µ0 = δ1 is a Dirac mass, then µk → δ0 converges weakly
as k → 0, but does not converge in the metric used in the main result.
When Assumption (A5) is removed, additional techniques are required to establish conver-
gence. For instance, suppose x∗ is a simultaneous critical point of each fi so that δx∗ is an invariant
measure. If x∗ is a local minimizer for each fi , with fi being convex (not necessarily separable),
then an application of Hutchinson [28] can directly be used to show an initial measure µ0 converges
geometrically to δx∗ in the Wasserstein metric (as opposed to the metric used in the main result).
If x∗ is a local minimum for some fi and a saddle or local maximum for others, then general
sufficient conditions for convergence of µk to δx∗ are more subtle.
6 D. SHIROKOFF, P. ZALESKI

2.2. Main Result. In this section we outline our main result; some technical definitions are
deferred to §4.
Our starting point is to define formulas for disjoint closed rectangles Tm that, as we will show,
are absorbing sets. With positive probability the SGD dynamics Xk will reach one of these sets
within a finite number of steps. Each Tm will contain exactly one invariant measure for the SGD
Markov chain.
The first step is to define disjoint closed intervals Tm in dimension d = 1 in terms of sets L and
R characterizing left and right moving dynamics. For notational brevity we drop the superscripts
(1) (1)
in fi and φi in this d = 1 setting.
When d = 1, the maps φi have the property that, for all η > 0,

φi (x) > x if fi′ (x) < 0 and φi (x) < x if fi′ (x) > 0 .

Hence, φi maps the point x to the left when fi′ (x) > 0 and to the right when fi′ (x) < 0; φi (x) = x
is a fixed point when fi′ (x) = 0. This motivates defining the following left and right sets as
n
[
(2.5) L := {x ∈ R : fi′ (x) > 0} ,
i=1

and
n
[
(2.6) R := {x ∈ R : fi′ (x) < 0} .
i=1

Note that if Xk ∈ L then there is a non-zero probability that the SGD iterate can move to the
left, e.g., Xk+1 < Xk with positive probability (an analogous result holds for Xk ∈ R). The sets
L and R also characterize when the SGD dynamics move to the left or right with probability one:
A point x ∈ / L if and only if

(2.7) φi (x) ≥ x , for all 1 ≤ i ≤ n ,

and x ∈
/ R, if and only if

(2.8) φi (x) ≤ x , for all 1 ≤ i ≤ n .

Several properties of L and R are established in § 6.1. We now define the sets Tm .
Definition 2.1 (The sets Tm ). For a collection of functions {fi }ni=1 in dimension d = 1
with L and R given in (2.5)–(2.6), define the sets Tm to be closed intervals [l, r] (l < r) satisfying

(l, r) ⊂ L ∩ R

where l ∈ ∂L, r ∈ ∂R. Let MT be the number of such sets and denumerate them as

Tm = [lm , rm ] m = 1, . . . , MT .

Intuitively, the Tm ’s are constructed first by taking the intersection L with R and keeping
only the intervals for which l ∈ ∂L and r ∈ ∂R. In the subsequent theorem, the definition of
Tm as the closure of (l, r) accounts for cases where fi′ may fail to change sign on either side of a
critical point.
Proposition 6.2 in §6.1 will establish that the sets Tm exist (MT ≥ 1), are disjoint, and without
loss of generality can be ordered, e.g., rm < lm+1 for 1 ≤ m ≤ MT − 1.
Building on the one-dimensional setting, we extend the definition of the sets Tm to the mul-
tivariate case of separable functions as follows. For each 1 ≤ j ≤ d, define the sets
(j) (j) (j)
T1 , T2 , . . . , TMj ,
SGD MARKOV CHAINS WITH SEPARABLE FUNCTIONS 7
n on
(j)
by applying the Definition 2.1 to the family of functions fi , where now we denote the
i=1
number of such sets by Mj (instead of MT ).
Let M := [M1 ] × [M2 ] . . . × [Md ] be a subset of integer tuples. For each
m = (m1 , m2 , . . . , md ) ∈ M we then define the rectangles Tm (see Figure 2.1) which are the
multivariate generalizations of Tm as
(1) (2) (d)
(2.9) Tm := Tm 1
× Tm 2
. . . × Tm d
⊂ Rd .

The union of all such sets is

[
(2.10) T := Tm ⊂ Rd .
m∈M

(2)
T2 T12 T22

(2)
T1 T11 T21

x1
(1) (1)
T1 T2

Fig. 2.1. Sketch of the rectangles Tm .

In the subsequent proofs, it will also be useful to define the following subsets in R for each
variable xj
Mj
[
T (j) := (j)
Tm ⊂ R,
m=1

With this notation, the set T ∈ Rd admits an alternative form

(2.11) T = T (1) × T (2) . . . × T (d) .

We can now state the main result.

Theorem 2.2 (Main Result). Given the Markov chain (1.4) and (1.7) corresponding to the
SGD dynamics (1.1)–(1.3), assume (A1)–(A5) hold. Let I and Tm be defined as in (2.4) and
(2.9), and let η be any value 0 < η < 1/K. Then,
(a) The statespace I is positive invariant and decomposes into the disjoint union
!
[
I =B∪ Tm (B := I \ T ) ,
m∈M

where
(i) Tm (m ∈ M) is positive invariant/absorbing, and contains at least one local mini-
mizer of F . The number of Tm is at most the number of local minima of F .
(ii) T is non-empty and is an attractor in the sense that there exists ℓ0 = ℓ0 (η) such that
for every x ∈ I there is a path p⃗ of length ℓ0 satisfying φp⃗ (x) ∈ T .
8 D. SHIROKOFF, P. ZALESKI

(iii) B is uniformly transient with

1
(2.12) (P ℓ0 µ)(B) ≤ 1− µ(B) ,
n ℓ0

for any probability measure µ ∈ P(I).

(b) For each m ∈ M, there exists a unique invariant measure µ⋆m with support contained in
Tm . Moreover, there exist an ℓm ∈ N and αm ∈ {+1, −1}d such that for any µ with
support contained in Tm , we have
⌊k/ℓm ⌋
k 1
(2.13) µ, µ⋆m

dαm P ≤ 1 − ℓm k > 0.
n

In the case of d = 1, the convergence (2.13) is a (stronger) contraction

1
ℓm
dF P µ, µm ≤ 1 − ℓm dF (µ, µ⋆m ) .
⋆

(2.14)
n

Here the metrics dαm and dF are defined in section 4 and ⌊·⌋ is the greatest integer
(floor) function. The measures µ⋆m are the only invariant measures supported in I, and
the number MT is bounded by the number of local minima of F .
(c) For any probably measure µ0 ∈ P(I), µk := P k µ0 converges geometrically to an invariant
measure of the form
!
X X
⋆ ⋆
µ = cm µm 0 ≤ cm ≤ 1, cm = 1 ,
m∈M m∈M

in the sense that

⌊k/ℓ⌋
(2.15) ˜ k , µ⋆ ) ≤ 3 1 − 1
d(µ k > 0.
nℓ

Here ℓ := 2 (ℓ0 ∨ maxm∈M ℓm ), and the metric d˜ is defined on probability measures µ, ν ∈

P(I) as

˜ ν) := dTV µ|B , ν|B +

X
(2.16) d(µ, dαm µ|Tm , ν|Tm ,
m∈M

where ν A is the restriction of the finite measure ν to a Borel set A. Lastly, in the case
where d = 1, (2.15) becomes
⌊k/ℓ⌋
⋆ 1
dF (µk , µ ) ≤ 3 1 − ℓ k > 0.
n

A few remarks are in order:

• The decomposition of I in Theorem 2.2(a) is in the spirit of a “Doeblin decomposition”
(cf. [35]);
• The theorem bounds the support of µ⋆m to lie in Tm and does not imply the support is
equal to Tm . For instance, when F is quadratic the invariant measure may be supported
on a set of Lebesgue measure zero (cf. infinite Bernoulli convolutions §1.2);
• If we let W be the set of invariant measures of P with support in I, then the extreme
points of W are exactly µ⋆m (m ∈ M);
• Since the Tm are disjoint closed rectangles, Theorem 2.2 implies the supports of every
pair of distinct invariant measures may be separated by a hyperplane, i.e., the supports
of any pair of invariant measures are not “interlaced;”
SGD MARKOV CHAINS WITH SEPARABLE FUNCTIONS 9

• While the number of sets Tm (and invariant measures) is bounded in terms of F , the
definition of Tm depends on how F is split into the functions fi ;
• The number of invariant measures does not depend on η provided η < 1/K, and the
associated Markov chain for SGD exhibits no bifurcations as a function of η. The rate of
convergence to equilibrium does however depend on η;
• The SGD iterates Xk may traverse between two local minima of F only if both minima
are contained in the same Tm (somewhat akin to a “mountain pass” theorem);
• Somewhat surprisingly, the converse to Theorem 2.2(a)(i) need not hold: the local (and
even global) minima of F do not need to be in T (see §3). In other words, there are
instances of SGD where even if the iterates Xk are initialized to lie in a neighborhood of
the global minimum of F , they will eventually leave (with probability 1);
• The contraction estimate (2.14) is sometimes referred to as a spectral gap estimate (in
the metric dF );
• When d > 1, the invariant measures µ⋆m are not necessarily product measures. In addition,
the main result in dimension d > 1 does not follow as a corollary of the one dimensional
case;
• An explicit bound on ℓ in Theorem 2.2 determining the convergence rate can be obtained
(j)
in terms of fi , e.g., see §3.
3. Examples in 1D. This section provides examples highlighting Theorem 2.2 in d = 1. In
each example, we analyze an objective function F with the specific splitting
1
(3.1) F (x) = f1 (x) + f2 (x) ,
2
where

(3.2) f1 (x) := F (x) + λx , and f2 (x) := F (x) − λx .

Here λ > 0 is a free parameter that modifies the functions f1 , f2 in the decomposition. In this
case, the maps φ1 and φ2 become

φ1 (x) = x − ηF ′ (x) − λη , and φ2 (x) := x − ηF ′ (x) + λη .

Notice that for this splitting, the SGD update becomes gradient descent plus an additional random
walk with step-size λη, and is also the one dimensional analog of the algorithm proposed in [29].
3.1. Diffusion Approximation Background. We collect here several basic facts regarding
the diffusion approximation in dimension d = 1 as we reference it in subsequent examples.
The diffusion approximation is a variable coefficient advection-diffusion equation of the form

∂ρ ∂ η ∂2
(3.3) = u(x)ρ + D(x)ρ in (x, t) ∈ R × (0, T ] ,
∂t ∂x 2 ∂x2
with initial data ρ(x, 0) = µ0 , where µ0 is the SGD initialization distribution.
The velocity u(x) and diffusion coefficient D(x) in (3.3) are given in terms of the SGD functions
fi (x) and F (x) as
d η
u(x) := Φ(x) where Φ(x) := F (x) + (F ′ (x))2 ,
dx 4
and
n 2 n
1 X ′ 1 X ′ 2 2
D(x) := fi (x) − F ′ (x) = fi (x) − F ′ (x) ≥ 0 .
n i=1 n i=1

Intuitively, the advective term in (3.3) evolves the probability ρ towards minizers of F , while the
diffusion term arises from the stochastic terms fi in SGD. For instance, formally setting η = 0
10 D. SHIROKOFF, P. ZALESKI

in (3.3) results in an advection equation for ρ, with characteristics defined by the gradient flow
ẋ = −F ′ (x).
The equation (3.3) arises as a formal asymptotic approximation to the (exact) discrete-in-time
Markov evolution µj+1 = Pµj in the small parameter η ≪ 1 by matching terms up order O(η).
When η ≪ 1, ρ(x, t) (t = ηj) approximates the SGD probability evolution µj for finite times (cf.
[32, 19, 18, 27]).
The stationary solutions of (3.3) satisfy

d η d2
(3.4) u(x)ρ + D(x)ρ = 0.
dx 2 dx2
Note that (3.4) is a singular ordinary differential equation whenever the diffusion coefficient D(x)
vanishes at a point x∗ , namely

(3.5) D(x∗ ) = 0 ⇐⇒ f1′ (x∗ ) = f2′ (x∗ ) = . . . = fn′ (x∗ ) = F ′ (x∗ ) .

Under Assumption (A5), D(x) ̸= 0 since the expression in the right of (3.5) holds nowhere.
Equation (3.4) admits the following unique solution in the space of probabilities densities
2
(3.6) ρ∗ (x) = Z −1 exp − V (x) ,
η
where
x
d 2
Z
V (x) := D−1 (x) Φ(x) + D(x) dx ,
dx η
provided
Z ∞ 2
Z := exp − V (x) dx < ∞ .
−∞ η

For the splitting given in (3.1) and (3.2), the stationary density of the diffusion approximation
given in (3.6) is simplifies to

2 2 η
(3.7) ρ⋆ (x) ∝ exp − 2 Φ(x) = exp − 2 F (x) + (F ′ (x))2 .
ηλ ηλ 4

In the following examples, we will compare the approximation (3.7) to the true invariant measures
of the SGD Markov operator P. We will see that for larger values of λ the Markov operator
P has a unique invariant measure µ⋆ which is approximated by the density ρ⋆ (x). However, as
λ decreases, the Markov operator P may have multiple invariant measures that differ from the
diffusion approximation.
3.2. SGD on a One Dimensional Double Well. This first example demonstrates Theo-
rem 2.2 with SGD applied to the double-well objective function
1
(3.8) F (x) := (1 − x2 )2 ,
4
with the splitting given in (3.1) and (3.2). The free parameter λ > 0 modifies the functions f1 , f2
in the decomposition (see Figure 3.1) and will lead to a bifurcation in the invariant measures of
the associated Markov operator.
We first establish several observations to invoke Theorem 2.2. The critical points of f2 , and
by symmetry f1 (letting x 7→ −x), are solutions to the cubic equation

x3 − x − λ = 0 .

Denote x0 = x0 (λ) > 0 as the largest solution; thus I = [−x0 , x0 ].

SGD MARKOV CHAINS WITH SEPARABLE FUNCTIONS 11

1
1 1
F (x) 2 f1 (x) 2 f2 (x)

y 0 = +

-1
-1 0 1 -1 0 1 -1 0 1
x x x
1
1 1
F (x) 2 f1 (x) 2 f2 (x)

y 0 = +

-1
-1 0 1 -1 0 1 -1 0 1
x x x

Fig. 3.1. Visualization of the SGD model problem given by (3.1)–(3.2) and (3.8) for values (Top) λ = .55 > λc ,
and (Bottom) λ = .2 < λc . When λ > λc , the SGD iterates can cross over the barrier of F and there is a unique
invariant measure. When λ < λc the SGD iterates cannot cross over the barrier of F and there are two invariant
measures.

The number of critical points in each of f1 , f2 change as a function of λ from 1 to 3 (see

Figure 3.2) with the critical value being
2
λc := √ .
3 3
Note also that for all λ > 0 the critical points of f1 and f2 are distinct.
The Lipschitz constant of fi′ on I is:
(3.9) Lip fi′ = 3x20 − 1 (i = 1, 2) ,
which sets the upper bound on η (see Figure 3.3) given in Theorem 2.2 as
−1
(3.10) η0 (λ) = 3x20 − 1 .
Thus, Theorem 2.2 applies for all λ > 0 and 0 < η < η0 (λ).

1 1 1

f1
x 0 f2 0 L 0 R

-1 -1 -1

0 λc .8 0 λc .8 0 λc .8
λ λ λ

Fig. 3.2. Model problem (3.8). Left: the critical points of the functions f1 (black) and f2 (blue) are plotted as
a function of λ. Solid curves represent local minima and dashed curves represent local maxima. Middle: for each
λ the vertical cross section of the filled in region represents the left moving set L. Right: for each λ the vertical
cross section of the filled in region represents the right moving set R.

Case 1: λ > λc . The functions f1 , f2 have a unique critical point given by

f1′ (−x0 ) = 0 f2′ (x0 ) = 0 ,

12 D. SHIROKOFF, P. ZALESKI

.5
1

η .25 x 0 T

-1
0
0 .4 .8 0 λc .8
λ λ

Fig. 3.3. Model problem (3.8). Left: the bound on η for the validity of Theorem 2.2 given in (3.10) is plotted
as a function of λ. Right: for each λ the vertical cross section of the filled in region represents the absorbing set
T . For λ > λc we have one absorbing interval and for λ ≤ λc we have two absorbing intervals.

and a single T1 = I = [−x0 , x0 ] (see Figure 3.3) defined by the sets L and R presented in Figure 3.2.
By Theorem 2.2 the operator P has a unique invariant measure µ∗ with support contained in T1 .
Figure 3.4 visualizes the crude agreement between the diffusion approximation ρ⋆ (x) defined by
(3.7)–(3.8) and a numerical approximation to µ⋆ .
In addition, from Theorem 2.2 any initial µ0 supported on I = T1 converges via
⌊k/ℓ1 ⌋
1
(3.11) dF (µk , µ⋆ ) ≤ 1− dF (µ0 , µ⋆ ) .
2ℓ1

A bound on ℓ1 , and hence the convergence rate, can be obtained from Theorem 5.1 with the
two paths satisfying (5.4) taken to be ⃗i (resp. ⃗j) as the ℓ1 consecutive compositions of φ2 (resp.
φ1 ). For all x ∈ [−x0 , 0] the map φ2 moves the point x to the right by the amount

φ2 (x) − x = η x − x3 + λ

≥ η λ − λc ,

which follows by substituting the minimum x = − √13 .

Hence, by symmetry of the maps, condition (5.4) is satisfied as

φ⃗i (−x0 ) ≥ 0 ≥ φ⃗j (x0 ) ,

by taking
$ √ %
1+ 3λ

x0
(3.12) ℓ1 = 1 + ≤1+ .
η(λ − λc ) η(λ − λc )

Notice the upper bound on ℓ1 given in (3.12) approaches ∞ as λ → λc . While (3.11) together
with (3.12) is only a lower bound on the spectral gap, the critical slow down in convergence rate
as λ → λc is also observed numerically.
Case 2: λ ≤ λc . The functions f1 , f2 each have three critical points (see Figure 3.2) given by

f2′ (xj ) = 0 j = 0, 1, 2 where x2 ≤ x1 < 0 < x0 ,

and by symmetry f1′ (−xj ) = 0 for j = 0, 1, 2. When λ = λc the two critical points x1 = x2 , while
x2 < x1 when λ < λc . In addition, Figure 3.2 shows the construction of the sets L and R, which
leads to the absorbing sets displayed in Figure 3.3. In this case, there are two absorbing intervals
given by

T1 = [−x0 , x2 ] T2 = [−x2 , x0 ] .
SGD MARKOV CHAINS WITH SEPARABLE FUNCTIONS 13

1
µ⋆ Xk
Diffusion approx.
µ⋆
.5

0
-1 0 1
x

Fig. 3.4. Model problem (3.8). A comparison of the exact unique invariant measure µ⋆ (red) and diffusion
approximation (3.7)–(3.8) (black) for parameter values λ = 2 > λc , η = .0698 (satisfying (3.10)). Here Ulam’s
method [42] is used to numerically compute the exact invariant measure (red). For comparison a time histogram
of the iterates Xk are plotted (blue) showing good agreement with µ⋆ . Note that the lack of smoothness in µ⋆ (red)
is a property of the invariant measure and not a result of under resolved computations.

By Theorem 2.2 the operator P has two invariant measures µ⋆1 and µ⋆2 (see Figure 3.5) sup-
ported on T1 and T2 respectively. Furthermore, any measure µ0 supported on I converges to some
convex combination of µ∗1 and µ∗2 , i.e.,

dF µk , cµ⋆1 + (1 − c)µ⋆2 ≤ 3γ ⌊k/ℓ⌋

for some 0 ≤ c ≤ 1, 0 < γ < 1 and ℓ ∈ N. A bound on ℓ can be obtained via elementary means.
This example demonstrates the discrepancy between the exact invariant measures (e.g., for
which there are two, µ⋆1 and µ⋆2 ) and the single invariant measure ρ∗ (x) predicted by the diffusion
approximation. In particular, the exact SGD dynamics cannot escape the local minima of F , while
the diffusion approximation implies that iterates of SGD will eventually escape and travel between
local minima.
Lastly, the example also demonstrates that the number of invariant measures is at least one
and most two, since F (x) has two local minima. While the bounds on the spectral gap and the
invariant measures depend on η, the sets Tm (m = 1, 2) and the number of invariant measures
are independent of η (provided η < η0 ). As a result, there are no bifurcations in the dynamics
µk+1 = Pµk (i.e., creation or loss of new fixed points) in terms of η.

µ⋆1 µ⋆2 Diffusion approx.

4 Xk
µ⋆1 , µ⋆2

0
-1 0 1 -1 0 1 -1 0 1
x x
x

Fig. 3.5. Model problem (3.8). For λ = .38 < λc and η = .33 (satisfying (3.10)) there are two invariant mea-
sures µ⋆1 (left), µ⋆2 (middle), while the diffusion approximation incorrectly predicts a unique stationary distribution
ρ⋆ (right). Note that the lack of smoothness in µ⋆j (j = 1, 2) again is a property of the invariant measure and not
due to an underresolved computation.

3.3. An Example where SGD Does Not Sample the Global Minimum. Here we
provide an example where the global minimum of the objective function F is contained inside a
uniformly transient region and hence not contained in the support of an invariant measure. From
Theorem 2.2 this implies that as k → ∞ the iterates Xk do not sample the global minimum (even
for small η values and arbitrary initializations X0 ).
14 D. SHIROKOFF, P. ZALESKI

Let the objective function F (x) be the eighth order polynomial,

(3.13) F (x) = c4 x4 + c6 x6 + c8 x8 ,
where c4 ≈ 2.84, c6 ≈ −2.94, and c8 ≈ 0.78 are chosen so that F (x) has two local minima at
x = ±1.35 and two local maxima at x = ±1. The global minimum lies at x = 0, which locally is
a quartic and is also the “flatest” minima, e.g., see Figure 3.6.
We again split F as in (3.1) and (3.2) depending on the paramter λ. From Figure 3.7 there
are two critical values of λ:

λ1 ≈ 1.47, and λ2 ≈ 1.85 ,

for which the critical points of f1 and f2 have a bifurcation.

1
F (x) 1 2 f2 (x)
2 f1 (x)
y = +

x x x

1 1
F (x) 2 f1 (x) 2 f2 (x)
y = +

x x x

1 1
F (x) 2 f1 (x) 2 f2 (x)
y = +

x x x

Fig. 3.6. Visualization of the SGD model problem given by (3.1)–(3.2) and (3.13) for values (Top) λ > λ1 ,
(Middle) λ1 < λ < λ2 , and (Bottom) λ < λ1 . When λ > λ2 , the SGD iterates can cross over the barriers of F
and there is a unique invariant measure. When λ1 < λ < λ2 , for any x0 ∈ I, the SGD iterates get trapped in a
sub-optimal local minimum and there are two invariant measures. Lastly, when λ < λ1 there are three invariant
measures.

From the critical points of f1 and f2 we can construct the left set L and right set R shown
in Figure 3.7, along with the state space I as the interval from the smallest critical point of f1
to the largest critical point of f2 . The set T is then constructed via Definition 2.1 and shown in
Figure 3.8.
For all values
−1
(3.14) η < (max {Lip f1′ , Lip f2′ })
(see Figure 3.8) we can apply Theorem 2.2 to obtain the following results.

Case 1: λ > λ2 . The functions f1 , f2 have a unique critical point (see Figure 3.7) and a single
absorbing interval T1 (see Figure 3.8). By Theorem 2.2 the Markov operator P has a unique
invariant measure µ⋆ supported on T1 . Figure 3.9 compares µ⋆ (computed numerically using
Ulam’s method [42]) to a time histogram of the SGD iterates Xk and diffusion approximation ρ⋆
given by (3.7) and (3.13).
In addition, from Theorem 2.2 there exists an ℓ1 > 0 such that for any initial µ0 supported
on I = T1 we have
⌊k/ℓ1 ⌋
1
dF (µk , µ⋆ ) ≤ 1− dF (µ0 , µ⋆ ) .
2ℓ1
SGD MARKOV CHAINS WITH SEPARABLE FUNCTIONS 15

1 1 1

f1 L R
x 0 f2 0 0

-1 -1 -1

0 1 λ1 λ2 2.5 0 1 λ1 λ2 2.5 0 1 λ1 λ2 2.5

λ λ λ

Fig. 3.7. Model problem from (3.13). Left: the critical points of the functions f1 (black) and f2 (blue) are
plotted as a function of λ. Solid curves represent local minimum and dashed curve represent local maximum.
Middle: for each λ the vertical cross section of the filled in region represents the left moving set L. Right: for each
λ the vertical cross section of the filled in region represents the right moving set R.

.03
1

η .015 x 0 T

-1
0
0 1 2 λ1 λ2
λ λ

Fig. 3.8. Model problem from (3.13). Left: the bound on η for the validity of Theorem 2.2 given in (3.14)
is plotted as a function of λ. Right: for each λ the vertical cross section of the filled in region represents the
absorbing set T . For λ > λ2 we have one absorbing interval, for λ1 < λ ≤ λ2 we have two absorbing intervals,
and for λ ≤ λ1 we have three absorbing intervals.

Case 2: λ2 ≥ λ > λ1 . The functions f1 , f2 each have three (distinct) critical points (see Figure 3.6
and Figure 3.7). Figure 3.7 shows the construction of the sets L and R, which leads to the sets T
displayed in Figure 3.8. In this case, there are two absorbing intervals T1 and T2 .
By Theorem 2.2 the Markov operator P has two invariant measures µ⋆1 and µ⋆2 (see Figure 3.10)
supported on T1 and T2 respectively. Furthermore, any measure µ0 supported on I converges to
some convex combination of µ⋆1 and µ⋆2 , i.e.,

(3.15) dF µk , cµ⋆1 + (1 − c)µ⋆2 ≤ 3γ ⌊k/ℓ⌋

for some 0 ≤ c ≤ 1, 0 < γ < 1, and ℓ ∈ N.

From Figure 3.10 we see that both µ⋆1 and µ⋆2 are supported around the sub-optimal local
minimum of F (x) at x = −1.35 and x = 1.35 respectively, and hence both do no agree with the
stationary density of the diffusion approximation given in (3.7). In addition, both T1 and T2 do
not contain x = 0 (see Figure 3.8), which is the global minimum of F (x). Thus, from Theorem 2.2
the global minimum x = 0 is not contained in the support of invariant measures µ⋆1 and µ⋆2 , and
from (3.15) we have that the random iterates Xk do not sample the global minimum as k → ∞.

Case 3: λ ≤ λ1 . The functions f1 , f2 each have five (distinct) critical points (see Figure 3.6 and
Figure 3.7). In this case, there are three absorbing intervals T1 , T2 , and T3 (see Figure 3.8). The
set T2 contains the global minimum x = 0, while T1 and T3 each contain the sub-optimal local
minimum x = −1.35 and x = 1.35 respectively.
16 D. SHIROKOFF, P. ZALESKI

Thus, from Theorem 2.2 the Markov operator P has three invariant measures µ⋆1 , µ⋆2 , and
µ⋆3each shown in Figure 3.11. From Figure 3.11 we can see that µ⋆1 and µ⋆2 do not agree with
the diffusion approximation, while µ⋆2 , whose support contains the global minimum of F (x), does
agree with the diffusion approximation.

1
µ⋆ Xk
Diffusion approx.
µ⋆
.5

0
-1 0 1
x

Fig. 3.9. Model problem from (3.13). For λ = 7 > λc and η = .007 (satisfying (3.14)), the probability
density function of the unique invariant measure µ⋆ is plotted. Ulam’s method [42] is used to estimate the exact
invariant measure (red). A histogram of the iterates Xk is plotted (blue). The stationary density of the diffusion
approximation given in (3.7) and (3.13) is plotted (black).

50 50
Xk Diffusion approx.
µ⋆1 , µ⋆2 2

25 25
µ⋆1 µ⋆2 1

0 0 0
-1.4 -1.3 1.3 1.4 -1 0 1
x x x
50 4
F
Diffusion approx.
µ⋆1 µ⋆1 , µ⋆2 µ⋆2
25 2

0 0
-1 0 1
x

Fig. 3.10. Model problem from (3.13). Bottom panel shows the relation of the exact invariant measures µ⋆j
(j = 1, 2) (vertical scale units on left) and diffusion approximation (vertical scale units on right) to the objective
function (dashed line, arbitrary units). This example demonstrates that the global minimizer of F is not contained
inside the support of either invariant measure µ⋆j (j = 1, 2) which are plotted in red (time-histograms of the SGD
iterates Xk are in blue). The supports of µ⋆j are each on a neighborhood of the sub-optimal minimizers of F . In
contrast, the diffusion approximation ρ⋆ (black) is localized to the neighborhood of the global minima of F and fails
to approximate the invariant measures. The top row plots an enlarged image of each measure.

4. Mathematical Background. This section collects the mathematical background used

throughout the paper.
We say that a non-empty set T ⊂ Rd is positive invariant for the SGD dynamics, if

(4.1) φi (T ) ⊂ T , for all 1 ≤ i ≤ n .

SGD MARKOV CHAINS WITH SEPARABLE FUNCTIONS 17

µ⋆1 µ⋆3
50 50

25 25

0 0
-1.36 -1.34 1.34 1.36
x x
4
Xk
µ⋆2 Diffusion approx.
µ⋆1 , µ⋆2 , µ⋆3
2

0
-.3 0 .3
x

Fig. 3.11. Model problem from (3.13). For λ = .5 < λ1 and η = .015 (satisfying (3.14)), the probability
density function of µ⋆1 (top left), µ⋆3 (top right), µ⋆2 (bottom), and the diffusion approximation (black bottom) is
plotted. Ulam’s method [42] is used to estimate µ⋆1 , µ⋆2 , and µ⋆3 (red). In addition, a histogram of the iterates Xk
is also plotted (blue). Note that ρ⋆ approximates µ⋆2 but fails to approximate the other two invariant measures.

For a general state space, a Borel set T is absorbing [36, Section 4.2.2] if the Markov transition
kernel p(x, T ) = 1 for all x ∈ T . Identity (4.1) ensures that if the initial measure µ0 is supported
on T , then all successive iterations µk defined by (1.4) will be supported on T . Hence, for the
SGD dynamics, a Borel set T is absorbing if and only if it is positive invariant.
A Borel set B is uniformly transient [36, Chapter 8] if the function
∞
X
U (x) := (P n δx )(B) ,
n=0

is bounded on B, i.e., supx∈B U (x) < ∞. The function U (x) measures the expected number of
times the Markov chain initialized to X0 = x ∈ B visits B; uniformly transient sets are expected
to spend only a finite time in B. The inequality (2.12) implies U (x) is bounded by a geometric
series and hence B is uniformly transient.
Let α = (α1 , α2 , . . . , αd ) ∈ {−1, +1}d denote the d-tuple of signed unit coefficients, and
associate to each of the 2d vectors α the closed orthont of Rd

Rdα := {c1 (α1 e1 ) + c2 (α2 e2 ) . . . + cd (αd ed ) : cj ≥ 0 for 1 ≤ j ≤ d} .

Here ej denotes the usual basis vectors in Rd . Each orthant Rdα then defines a partial ordering ⪯α
on Rd given by

x ⪯α y if and only if y − x ∈ Rdα .

For two (non-empty) sets A, B ⊂ Rd , we write x ∈ A \ B if x ∈ A and x ∈

/ B, and also A ⪯α B if
a ⪯α b for all a ∈ A and b ∈ B.
A map γ : S → S (S ⊂ Rd ) is said to be monotone with respect to the cone Rdα on a set S if
for any x, y ∈ S

(4.2) x ⪯α y implies γ(x) ⪯α γ(y) .

18 D. SHIROKOFF, P. ZALESKI

Note that if x ⪯α y then y ⪯−α x. To avoid this redundancy, we take α1 = +1.

The metric used in the convergence result of Bhattacharya and Lee makes use of the following
restricted class of sets. Define Aα to be the family of sublevel sets of continuous monotone maps

A = y ∈ Rd : γ(y) ⪯α c ,

A ∈ Aα if

for some constant vector c ∈ Rd and continuous function γ : Rd → Rd that is monotone with
respect to Rdα . When α = (+1, +1, . . . , +1), the set Aα includes all (semi-infinite) rectangles of
the form (−∞, c1 ] × (−∞, c2 ] × . . . × (−∞, cd ], for c ∈ Rd , but also includes other sets as well.
For a closed and bounded Borel set I ⊂ Rd (which, in practice, we will take as a rectangle),
and two (Borel) probability measures µ, ν ∈ P(I), the metric of Bhattacharya and Lee is

(4.3) dα (µ, ν) := sup |µ(A ∩ I) − ν(A ∩ I)| .

A∈Aα

For closed and bounded I, the metric space (P(I), dα ) is complete [6, 7].
In dimension d = 1, dα (µ, ν) reduces to the Komolgorov distance (which we write as dF ):

dF (µ, ν) = ∥Fµ (x) − Fν (x)∥∞ µ, ν ∈ P(I) (I = [a, b]) ,

where Fµ (x) := µ ([a, x]) is the cumulative distributions function (CDF) and

∥f ∥∞ = sup |f (x)| .
x∈[a,b]

The metric dα is stronger than the Wasserstein metric and weak convergence. The converse
is not true, e.g., in d = 1, µk = δ k1 converges weakly to δ0 but not with respect to dF . On the
other hand, dα is weaker than the total variation metric dTV (µ, ν) which is defined as (4.3) where
the supremum is taken over all sets in the Borel σ-algebra.
The metric dα generalizes naturally to all finite non-negative measures M (I) with mass |µ|.
The following properties also hold:

(4.4) dα (µ, ν) ≤ max{|µ|, |ν|} , (µ, ν ∈ M (I)) ;

and, for all µ1 , µ2 , ν1 , ν2 ∈ M (I) and c > 0,

(4.5) dα (µ1 + µ2 , ν1 + ν2 ) ≤ |µ1 | + |ν1 | + dα (µ2 , ν2 ) ,

(4.6) dα (µ1 + µ2 , ν1 + ν2 ) ≤ dα (µ1 , ν1 ) + dα (µ2 , ν2 ) ,
(4.7) dα (c µ1 , c ν2 ) = c dα (µ1 , ν1 ) .

The identities (4.4)–(4.7) follow directly from the definition of dα and the triangle inequality.
To represent trajectories taken by iterates of SGD, let
⃗i = i1 , i2 , . . . , im−1 , im ∈ [n]m ,

where each ik ∈ [n] (1 ≤ k ≤ m) and |⃗i| = m is the path length. The path defined by ⃗i is the
composition

φ⃗i (x) := φim · · · φ2 φ1 (x) · · · = φim ◦ φim−1 ◦ · · · ◦ φi1 (x) .

Two paths are distinct, ⃗i ̸= ⃗j, if they differ in at least one entry. For two paths ⃗j and ⃗i with
lengths |⃗j| = m1 and |⃗i| = m2 we write
⃗i ◦ ⃗j = j1 , . . . , jm1 , i1 , . . . , im2

to be the concatenation of the paths, so that

φ⃗i◦⃗j (x) := φ⃗i φ ⃗j (x) = φ⃗i ◦ φ ⃗j (x) .
SGD MARKOV CHAINS WITH SEPARABLE FUNCTIONS 19

5. Markov Operators for Iterated Function Systems with Monotone Maps. We

state here a result of Dubins and Freedman [14] and the higher dimensional extension by Bhat-
tacharya and Lee [6, 7]; note that a similar result was independently proven by Hopenhayn and
Prescott [25] (cf. [26]). The theorems yield sufficient conditions for convergence of Markov op-
erators arising from iterated function systems with monotone maps. The original theorems are
stated in greater generality than we need, and so we state them as they would be applied to the
notation and dynamics in (1.3).
Theorem 5.1 (Dubins and Freedman [14, Theorem 5.10]). Assume φi : [a, b] → R are
continuous and strictly monotone (increasing or decreasing) on [a, b] and that [a, b] is positive
invariant. If there exists distinct pairs i ̸= j, 1 ≤ i, j ≤ n, and a point x ∈ [a, b] for which

(5.1) φi [a, b] ⊂ [a, x] and φj [a, b] ⊂ [x, b] ,
then there exists a unique invariant measure µ⋆ to the operator P given in (1.7).
Furthermore, for any measure µ supported on [a, b] we have,

⋆ 1
(5.2) dF (Pµ, µ ) ≤ 1 − dF (µ, µ⋆ ) ,
n
where (1 − n1 ) is the geometric rate of convergence.
Dubins and Freedman referred to (5.1) as a splitting condition since φi and φj map [a, b] into
disjoint subintervals — split by x. If the maps φi are monotone increasing, we can re-state (5.1)
as
(5.3) φi (b) ≤ φj (a) (i ̸= j) ,
for some distinct pairs 1 ≤ i, j ≤ n (see Figure 5.1).

a φi (b) φj (a) b

Fig. 5.1. Visualization of condition (5.1) and (5.3).

It may be that no pair of i, j satisfy (5.3). In this case, Theorem 5.1 can be applied to powers
of the operator (P)ℓ where (5.3) (equivalently (5.1)) can be replaced with a pair of distinct paths
of the same length satisfying
(5.4) φ⃗i (b) ≤ φ ⃗j (a) (⃗i ̸= ⃗j) |⃗i| = |⃗j| = ℓ .

Inequality (5.4) is exactly (5.3) with {φi : i ∈ [n]} replaced by {φ⃗i : ⃗i ∈ [n]ℓ }. Applying
Theorem 5.1 with condition (5.4) in lieu of (5.1) modifies the geometric factor appearing in the
convergence rate (5.2) to (1 − n−ℓ ).
The result of Bhattacharya and Lee extends Theorem 5.1 to dimensions d > 1 where the
splitting and monotonicity conditions are with respect to a cone.
Theorem 5.2 (Bhattacharya and Lee [6, Theorem 2.1 & Corollary 2.4] ). Set I = [0, 1]d and
let α be fixed. Suppose {φi }ni=1 : I → I are each continuous and monotone with respect to Rdα . If
there are two paths p⃗1 and p⃗2 of length ℓ satisfying
(5.5) φp⃗1 (I) ⪯α x0 and x0 ⪯α φp⃗1 (I)
for some x0 ∈ I, then the Markov operator P has exactly one invariant measure µ⋆ with support
contained in I. In addition, for any µ supported on I
⌊k/ℓ⌋
1
(5.6) dα (P k µ, µ⋆ ) ≤ 1 − ℓ k > 0.
n
20 D. SHIROKOFF, P. ZALESKI

There are a few minor differences from the exact theorem statement [6, Theorem 2.1 & Corol-
lary 2.4] and the version we state in Theorem 5.2. In [6], (5.6) is stated with µ replaced by δx (the
upper bound in (5.6) being uniform in x ∈ I). This is equivalent to (5.6) as written above.
The formulation in [6] also allows for the maps φi to be drawn from an infinite index set. In
this case, the convergence rate (5.6) (or (5.2)) is given in terms of the probability that the event
(5.5) holds. If κ pairs of distinct paths satisfy (5.6), then [6] yields the stronger geometric rate of
1 − κ/nℓ in (5.6).
Lastly, [6, Theorem 2.1 & Corollary 2.4] is stated only for α being the positive orthont.
However, the theorem extends trivially to the version stated in Theorem 5.2 defined over any cone
α by the change of variables: φ(x) → Λφ(Λx) where Λ = diag(α).
6. A Few Lemmas for One Dimensional SGD. Throughout this section we will assume
that d = 1 and build up a series of results for SGD. For notational convenience we refrain from
writing superscripts on the functions fi , the maps φi , and the sets Tm , e.g., we replace
(1) (1) (1)
fi → fi φi → φi (1 ≤ i ≤ n) and Tm → Tm (1 ≤ m ≤ MT ) .

(1)
Thus, in assumptions (A1)–(A5) each fi is simply fi .
6.1. Properties of the Sets L, R, and Tm . Here our goal is to collect and prove basic
properties of the sets L, R and Tm (see (2.5), (2.6), and Definition 2.1).
Proposition 6.1 (Basic properties of L and R). Let d = 1. Given assumptions (A1)–(A4)
and a, b defined by I in (2.3), then sets L and R defined in (2.5) and (2.6) are finite unions of
open intervals with ∂L, ∂R ⊂ I and

(6.1) (−∞, a) ⊂ R \ L and (b, ∞) ⊂ L \ R .

If in addition (A5) holds, then

(6.2) ∂L ∩ ∂R = ϕ and L ∪ R = R.

Proof. The sets L and R are finite unions of open intervals since each fi′ is continuous with a
finite number of roots. The boundaries ∂L, ∂R are contained in the set of critical points C of the
fi′ ’s and confined to I.
The assumptions (A1)–(A3), with d = 1, imply (6.1) since

fi′ (x) < 0 if x < a for all 1 ≤ i ≤ n,

and similarly fi′ (x) > 0 for x > b. Lastly, x ∈ ∂L ∩ ∂R implies fi′ (x) = 0 for all i violating (A5).
Since for every x ∈ R from (A5), either fi′ (x) < 0 or fi′ (x) > 0 for some i. Thus, if x ∈ / L, then
x ∈ R and vice versa.
The next proposition in this section establishes basic properties of the sets Tm .
Proposition 6.2 (Properties of Tm ). Let d = 1. Under assumptions (A1)–(A5), the sets
Tm = [lm , rm ] in Definition 2.1 satisfy the following:
(a) There is at least one Tm , i.e., MT ≥ 1.
(b) The endpoints satisfy

(6.3) rm ∈ L , rm ∈
/R and lm ∈ R , lm ∈
/L for all 1 ≤ m ≤ MT .

(c) Each Tm contains at least one local minimum of F . In particular, MT ≤ MF where MF

is the number of local minima of F .
(d) The sets Tm are pairwise disjoint and contained in I.

Proof. (a) Through direct construction, we show MT ≥ 1. From identity (6.1) in Proposi-
tion 6.1, the set L contains an interval of the form (l, ∞) where l ∈ ∂L is finite (in fact l ≤ b). Since
SGD MARKOV CHAINS WITH SEPARABLE FUNCTIONS 21

equation (6.2) implies R and L cover R, R contains an interval of the form (r̄, r) with r̄ < l < r,
where r ∈ ∂R is finite. Thus (l, r) satisfies Definition 2.1.
(b) Since L and R are open (see Proposition 6.1) they do not contain their boundary points, hence
rm ∈/ R and lm ∈/ L. Since L ∪ R = R condition (6.3) holds.
(c) We show that F ′ (lm ) < 0 and F ′ (rm ) > 0. This implies that argminx∈Tm F (x) lies in the
interior of Tm , and hence must be a local minimizer.
Since lm ∈/ L and rm ∈ / R from part (b), we have

(6.4) fi′ (lm ) ≤ 0 and fi′ (rm ) ≥ 0 for all 1 ≤ i ≤ n ,

where each of the inequalities in (6.4) are strict for at least one i due to Assumption (A5). The
sign of F ′ at lm and rm then follows from the definition of F .
(d) First note the endpoints lm , rm are confined to the critical points of the fi′ ’s and hence must
lie in I. To show the sets Tm are pairwise disjoint, suppose by contradiction that Tm ̸= Tj and
Tm ∩ Tj ̸= ϕ. Since the sets are distinct closed intervals, an endpoint of one interval must lie in
the other and also differ from the same endpoint, i.e., without loss of generality the right endpoint
of Tm must differ from the right endpoint of Tj and also intersect Tj so that

rm ∈ [lj , rj ) .

The definition of Tj implies (lj , rj ) ⊂ R while part (b) implies lj ∈ R. Thus, rm ∈ [lj , rj ) ⊂ R,
however this contradicts (b).
6.2. Properties Related to the SGD Dynamics. The next proposition proves the fact
that intervals of L and R bound the regions for which SGD can move to the left and right
respectively.
Proposition 6.3 (Dynamics related to L and R). Let d = 1. Assume (A1)–(A3) hold with
L and R defined in (2.5)–(2.6), and let 0 < η < 1/K. Then the maps φi are monotone increasing
on I, i.e.,

φi (x) < φi (y) for all x<y∈I and 1 ≤ i ≤ n.

Furthermore, if (l, r] ⊂ L ∩ I with l ∈ ∂L, there exists paths that map r arbitrarily close to l
(but no further):

(6.5) inf φp⃗ (r) = l .

p
⃗∈Q

Analogously, [l, r) ⊂ R ∩ I with r ∈ ∂R then

(6.6) sup φp⃗ (l) = r .

p
⃗∈Q

Here Q is the set of all paths of arbitrary length.

Proof. The monotonicity of φi follows from

φi (x) < φi (y) ⇐⇒ fi′ (y) − fi′ (x) < η −1 (y − x) ,

combined with the Lipschitz bound on fi′ in (A4) and η < 1/K.
We prove (6.5) as (6.6) follows by an identical argument. Since l ∈
/ L, we have

φi (l) ≥ l for all 1 ≤ i ≤ n.

Combining this with the fact that the maps φp⃗ are monotone on (l, r] (which is in I) implies

α := inf φp⃗ (r) ≥ inf φp⃗ (l) ≥ l .

p
⃗∈Q p
⃗∈Q
22 D. SHIROKOFF, P. ZALESKI

Suppose now by contradiction that α > l. Introduce the function

∆(x) := max fi′ (x) .

1≤i≤n

The function ∆(x) is continuous (since it is the pointwise maximum of a finite collection of con-
tinuous functions) and strictly bounded by ∆(x) > 0 on [α, r] (since the interval is contained in
L). By compactness, ∆(x) achieves its minimum ∆0 on [α, r] and satisfies

∆(x) ≥ ∆0 > 0 for x ∈ [α, r] .

However, this yields a contraction by the definition of α: For every x ∈ [α, r] there is a map φi
satisfying

φi (x) ≤ x − η∆0 ,

which implies there is a path p⃗ of finite length for which φp⃗ (r) < α.
We show next that the sets Tm are positive invariant, or equivalently, absorbing.
Proposition 6.4 (Tm are positive invariant). Let d = 1. Assume (A1)–(A5) hold and
0 < η < 1/K. Then I and each Tm (1 ≤ m ≤ MT ) is positive invariant.
Proof. Write Tm = [lm , rm ]. Combining the facts that lm ∈
/ L and rm ∈
/ R (by Proposition 6.2),
φi is monotone on Tm (by Proposition 6.3), and (2.7) and (2.8) yields

lm ≤ φi (lm ) ≤ φi (x) ≤ φi (rm ) ≤ rm for all 1≤i≤n and x ∈ Tm .

Hence,

(6.7) φi (Tm ) ⊂ Tm

holds for all i and m. By the same argument I = [a, b] is also positive invariant.
We conclude this section with a proof that there is a fixed path length for which every point
x ∈ I outside T can be mapped into T . This will provide the basis for establishing that the set B
is uniformly transient in the main result.
Lemma 6.5 (Uniform path length in one dimension). Let d = 1. Assume (A1)–(A5) and
0 < η < 1/K hold and define T as in (2.10). Then there exists a uniform path length ℓ0 such that
for every x ∈ I there is a path p⃗ of length ℓ0 satisfying φp⃗ (x) ∈ T .
Proof. Step 1: We first show that for every x ∈ I there exists a path ⃗tx , whose length may
depend on x, such that

(6.8) φ ⃗tx (x) ∈ int T ,

where int T is the interior of T .

To prove (6.8) we show for each x ∈ I, there exists a Tm = [lm , rm ] satisfying Definition 2.1
such that either

(6.9) [x, rm ) ⊂ R or (lm , x] ⊂ L .

Condition (6.9) combined with (6.5)–(6.6) from Proposition 6.3 imply (6.8).
The remaining proof of (6.9) for Step 1 is visualized in Figure 6.1. Assume without loss
of generality that x ∈ (β1 , β2 ) ⊂ L where β1 and β2 defines the largest open sub-interval in L
containing x (i.e., both β1 , β2 lie on ∂L). If x ∈
/ L an identical assumption may be made regarding
x ∈ R.
If β1 = lm for some m, then we are done (see Figure 6.1(a)).
If β1 ̸= lm (for every m), then β1 ∈ R by (6.3) (see Figure 6.1(b)). Let (β1 , β3 ) ⊂ R be the
largest sub-interval of R with β3 ∈ ∂R. Then β2 < β3 , otherwise [β1 , β3 ] would define a Tm and
(6.9) would hold.
SGD MARKOV CHAINS WITH SEPARABLE FUNCTIONS 23

a) β1 = lm x β2
L
Tm R
rm
b)
β1 x β2 β4
Tm β3

Fig. 6.1. Visualization of two sub-cases for the proof in Step 1 of Lemma 6.5.

By construction, the interval [x, β3 ) ⊂ R. We now claim that β3 = rm for some m, in which
case we are done. The reason is that since R ∪ L covers I (e.g., (6.2)) with β3 ∈ ∂R, so there must
be an interval of the form (β4 , β3 ] ⊂ L with β4 ∈ ∂L. Lastly, [β4 , β3 ] satisfies Definition 2.1, since
β3 > β2 and β2 ∈ / L implies β4 ≥ β2 > β1 , which implies that (β4 , β3 ) ⊂ R.

Step 2: To complete the proof of the Lemma, it is sufficient to show that there exists an
ℓ = ℓ(η) for which I ⊂ Uℓ where
n o
(6.10) Uk := x ∈ R : φ ⃗t (x) ∈ int T and | ⃗t | = k ,

is the set of points that map into the interior of T after k steps. Let
∞
[
U= Uk ,
k=0

be the set of all points that can reach the interior of T (with an arbitrary path length). Note that
Uk is equivalently
n o
Uk = φ−1 ⃗
t
(int T ) : | ⃗t | = k .

Since φj is continuous and int T is open, Uk and hence U is open. By (6.8), the collection
{Uk : k ∈ Z} is an open cover of I and by compactness has a finite sub-cover, i.e.,
ℓ
[
I⊂ Uk ,
k=0

for some ℓ. Since φi is (strictly) monotone on I when η < 1/K, and T is positive invariant, we
have that each φi maps open subsets of T into open subsets of T , whence φi (int T ) ⊂ int T for
every 1 ≤ i ≤ n. Consequently, the sets I ∩ Uk ⊂ I ∩ Uk+1 are nested, and hence I ⊂ Uℓ .
7. Proof of the Main Result. Building on the one dimensional results, we provide the
proofs for each part of the main result.
7.1. Main Result Proof of Part (a). The proof of Theorem 2.2(a) makes use of the follow-
ing lemma — which is a general statement regarding uniformly transient sets for Markov operators
arising from iterated function systems. The lemma makes no assumptions on the regularity or
monotonicity of the maps φi .
Lemma 7.1 (Uniformly transient sets for iterated function systems). Consider the dynamics
(1.2) for a family of maps φi : I → I (1 ≤ i ≤ n, I) where I ⊂ Rd is non-empty and P is defined
in (1.7). Suppose the set T ⊂ I (B := I \ T ) is positive invariant and for each x ∈ B, there exists
a φi (x) ∈ T .
Then for any finite measure µ ∈ M (I)

1
(Pµ)(B) ≤ 1 − µ(B) .
n
24 D. SHIROKOFF, P. ZALESKI

The intuition behind Lemma 7.1 is that a fraction of the mass µ(B) travels into T every iteration.
The proof is provided here for completion.
Proof. For each i = 1, . . . , n let Bi := {x ∈ B : φi (x) ∈ T }, i.e., Bi ⊂ B with φi (Bi ) ⊂ T .
The sets Bi then satisfy the following identities:

(7.1) φ−1
i (B) ∩ I = B \ Bi ,

holds since T and I are positive invariant, and

n
[
(7.2) Bi = B ,
i=1

holds since for each x ∈ B one of the maps φi (x) ∈ T .

By direct calculation, since µ is supported on I we then have
n n
1X 1X
µ φ−1

(Pµ)(B) = i (B) ∩ I = µ(B) − µ(Bi ) , (Via (7.1)) ,
n i=1 n i=1
1
≤ µ(B) − µ (∪ni=1 Bi ) .
n
Substituting (7.2) into the last line yields the desired result.

(j)
Proof of Main Result Theorem 2.2(a). (i) For each 1 ≤ j ≤ d the family of functions {fi }ni=1
satisfies (A1)–(A5). Proposition 6.2(a) implies for each j, Mj ≥ 1 so that MT ≥ 1 and T is non-
empty. Secondly, Proposition 6.2(c) implies that for each m = 1, . . . , Mj at least one local minima
of
n
(j)
X
g (j) (xj ) := fi (xj ) ,
i=1

(j) Pd
is contained in each Tm . Since we can write F (x) = j=1 g (j) (xj ) and Tm is a rectangle of the
form (2.9), it follows that each Tm (m ∈ M) contains a local minima of F .
(j)
Thirdly, Proposition 6.4 implies that Tm (1 ≤ m ≤ Mj ) is positive invariant with respect to
(j) n
the maps {φi }i=1 . Combining this fact with the form (2.2) of φi for separable functions implies
that both I and Tm (m ∈ M) are positive invariant with respect to the d dimensional maps
{φi }ni=1 .
Additional positive invariant sets can also be constructed via intersections and unions, for
instance, rectangles of the form I (1) × . . . × I (j−1) × T (j) × I (j+1) × . . . × I (d) are also positive
invariant. This will be used in the proof of Part (ii).

(ii) Apply Lemma 6.5 separately to each dimension. Then there exists an ℓ(j) such that for every
(j)
xj ∈ I (j) there exists a path ⃗tj (depending on xj ) of length ℓ(j) satisfying φ⃗t (xj ) ∈ T (j) . Now
j
let
d
X
(7.3) ℓ0 = ℓ(j) .
j=1

Then for every x = (x1 , x2 , . . . , xd ) ∈ I define the composition path p⃗ = p⃗1 ◦ p⃗2 ◦· · ·◦ p⃗d as follows:
Set p⃗d (of length ℓ(d) ) so that

(7.4) φp⃗d (x) ⊂ I (1) × . . . × I (d−1) × T (d) .

SGD MARKOV CHAINS WITH SEPARABLE FUNCTIONS 25

Pick p⃗d−1 with length ℓ(d−1) to map the d − 1 component of φp⃗d (x) into T (d−1) . The positive
invariance of the set in the right hand side of (7.4) yields:

φp⃗d−1 ◦ φp⃗d (x) ⊂ I (1) × . . . × I (d−2) × T (d−1) × T (d) .

Continuing to define φp⃗j with length ℓ(j) to map the jth component of φp⃗j+1 ◦ · · · ◦ φp⃗d (x) into
T (j) , one obtains

φp⃗ (x) ⊂ T (1) × T (2) × · · · × T (d) = T .

(iii) Take ℓ0 as defined in (7.3) and apply Lemma 7.1 to the family of nℓ0 maps {φ⃗j : |⃗j| = ℓ0 }.

7.2. Main Result Proof of Part (b). Theorem 2.2(b) follows directly from Theorem 5.2,
where the key technical statement to prove is that the SGD dynamics satisfy the splitting condition
(5.5).
To prove (5.5), we proceed by induction on the dimension. In particular, the idea is to consider
the restriction of the SGD dynamics (for separable functions) into the first j coordinates. We then
show, that if (5.5) holds for the first j coordinates, then (5.5) holds for the first j + 1 coordinates
as well.
There is a subtlety in the proof in that the argument relies on the dynamics (1.3) to construct
α defining the orthant Rdα . The following example demonstrates that (5.5) does not necessarily
hold for every choice of α, but rather does hold for at least one α.
Example 1. Set d = 2, f1 (x1 , x2 ) = (x1 )2 + (x2 − 1)2 and f2 (x1 , x2 ) = (x1 − 1)2 + (x2 )2 so
that
(1) (2) (1) (2)
f1 (x1 ) = x21 f1 (x2 ) = (x2 − 1)2 f2 (x1 ) = (x1 − 1)2 f2 (x2 ) = (x2 )2 .

In this example, there is only a single set T = I = [0, 1]2 as defined in (2.9). Provided η < 1/K,
the following subsets of T are positive invariant: (i) the line segment y = 1 − x; (ii) the region
y > 1 − x; and the region (iii) y < 1 − x. As a result, the condition (5.5) with respect to
α = (+1, +1) is never satisfied (for any pair of paths of any length) since mappings of (0, 0) and
(1, 1) will stay separated by the line segment y = 1 − x; no pair of maps will flip the ordering of
(0, 0) and (1, 1).
The condition (5.5) is however readily verified for α = (+1, −1). Despite the fact that T has
three positive invariant sets, the main result implies any initial measure µ0 ∈ P(I) converges
to a unique invariant measure (which in this case is supported on the line segment y = 1 − x
(0 ≤ x ≤ 1).
Proof of Main Result Theorem 2.2(b). It is sufficient to show that {φi }ni=1 and the set Tm
satisfy the hypothesis in Theorem 5.2 (or equivalently Theorem 5.1 when d = 1).
Under the restriction η < 1/K, every map φi is monotone on I with respect to every orthont
(j)
Rdα . This is because each component φi of φi (see (2.2)) is an increasing function on I (j) via
Proposition 6.3.
It remains to show {φi }ni=1 satisfy (5.5). For notational simplicity, we assume without loss of
generality (e.g., after translation and scaling of Xk ), that Tm has the form

Tm = [0, 1]d .

We prove (5.5) by induction on the dimension d.

Suppose d = 1. By the definition of Tm and Proposition 6.2(b) we have [0, 1) ⊂ R and
(0, 1] ⊂ L. Thus, (6.5)–(6.6) in Proposition 6.3 imply there exists paths ⃗t1 and ⃗t2 (which can be
chosen to be the same length) that map φ⃗t1 (0) > x and φ⃗t2 (1) < x where x is any point in the
interior of Tm = [0, 1]. Thus, (5.5) holds with α = +1 and x0 = 21 .
26 D. SHIROKOFF, P. ZALESKI

Assume that (5.5) holds in Rd−1 . That is, for the family of maps {φ̂i }ni=1 given as

φ̂i (x1 , x2 , . . . , xd−1 ) = φ(1)
i (x 1 ), φ
(2)
i (x 2 ), . . . , φ
(d−1)
i (x d−1 ) ,

there exists an α̂ ∈ {+1, −1}d−1 , x̂0 ∈ Rd−1 and two paths p⃗1 and p⃗2 (each having the same length
ℓ)
b satisfying

φ̂p⃗1 [0, 1]d−1 ⪯α̂ x̂0 x̂0 ⪯α̂ φ̂p⃗2 [0, 1]d−1 .

(7.5) and

Next consider the case of dimension d. For each map φi write

φi (x) = φ̂i , φi(d) (xd ) ,

where φ̂i is the restriction of φi to the first d − 1 coordinates. By hypothesis {φ̂i }ni=1 satisfies (7.5)
for some α̂, x̂0 and paths p⃗1 and p⃗2 . We now construct α, x0 , and two paths for which φi satisfies
(5.5).
Turning attention to the dth coordinate, either
(d) (d) (d) (d)
φp⃗2 (0) ≥ φp⃗1 (1) or φp⃗2 (0) < φp⃗1 (1) .

(d) (d)
If φp⃗2 (0) ≥ φp⃗1 (1), then the dth coordinate satisfies the splitting condition in one dimension

(d) (d)
φp⃗1 [0, 1] ≤ m
b ≤ φp⃗2 [0, 1] ,

where
1 (d) (d)

(7.6) m
b := φp⃗2 (0) + φp⃗1 (1) .
2
Then (5.5) holds in dimension d for {φi }ni=1 with paths p⃗1 , p⃗2 , orthont α = (α̂, +1) and

b ∈ Rd .

(7.7) x0 := x̂0 , m

(d) (d)
If φp⃗2 (0) < φp⃗1 (1), then we construct two new paths (via concatenation) for which (5.5)
holds. Let K0 := (1 + ηK)ℓ and set
b

1 (d) (d)

ε := φp⃗1 (1) − φp⃗2 (0) > 0 .
2K0

The choice of K0 with Assumption (A4) then yields the following Lipschitz condition

(d) (d)
(7.8) φp⃗j (x) − φp⃗j (y) ≤ K0 |x − y| , j = 1, 2 and x, y ∈ [0, 1] .

Next, from Proposition 6.3 there exist two paths ⃗q1 and ⃗q2 for which
(d)
(7.9) φq⃗1 [0, 1] ⊂ [1 − ε, 1] ,

and
(d)
(7.10) φq⃗2 [0, 1] ⊂ [0, ε] .

(d)
Applying φp⃗1 to (7.9) yields

(d) (d)
(7.11) φp⃗1 ◦⃗q1 [0, 1] ⊂ φp⃗1 [1 − ε, 1] .
SGD MARKOV CHAINS WITH SEPARABLE FUNCTIONS 27
(d)
Using the fact that φp⃗1 is increasing, together with (7.8) implies

(d) (d)
(7.12) φp⃗1 (1) − φp⃗1 (1 − ε) ≤ K0 ε .

(d) (d)
Substituting the definition of ε into (7.12) yields φp⃗1 (1 − ε) ≥ m.
b Together with φp⃗1 (1) ≤ 1,
(7.11) implies
(d)
φp⃗1 ◦⃗q1 [0, 1] ⊂ [m,
b 1] .

(d)
By a similar argument, applying the φp⃗2 and (7.8) to (7.10) yields

(d)
φp⃗2 ◦⃗q2 [0, 1] ⊂ [0, m]
b .

Hence, the dth coordinate satisfies the splitting condition

(d) (d)
φp⃗2 ◦⃗q2 [0, 1] ≤ m
b ≤ φp⃗1 ◦⃗q1 [0, 1] .

Subsequently, the maps {φi }ni=1 satisfy condition (5.5) in dimension d with paths p⃗1 ◦⃗q1 and p⃗2 ◦⃗q2 ,
orthont α = (α̂, −1) and point x0 defined in (7.7).
7.3. Main Result Proof of Part (c). The proof of part (c) utilizes the metric d˜ defined
in (2.16) which satisfies the following: for any finite non-negative Borel measures µ1 , µ2 , ν1 , ν2
supported on I,

(7.13) ˜ 1 + µ2 , ν1 + ν2 ) ≤ d(µ
d(µ ˜ 1 , ν1 ) + d(µ
˜ 2 , ν2 ) ,
(7.14) ˜ 1 + µ2 , ν1 + ν2 ) ≤ ν1 (I) + µ1 (I) + d(µ
d(µ ˜ 2 , ν2 ).

Here (7.13) and (7.14) follow from (4.5) and (4.6) together with the fact that for m ∈ M, Tm ⊂ I
and B ⊂ I are pairwise disjoint.
Proof of Main Result Theorem 2.2(c). Let µ0 ∈ P(I). Then for each m ∈ M, µk (Tm ) is a
bounded increasing sequence since Tm is positive invariant; see Theorem 2.2(a). Thus,

µk (Tm ) → cm for some cm ∈ [0, 1] .

Since T is the disjoint union of the rectangles Tm , Theorem 2.2(a)(iii) implies

X
cm = 1 .
m∈M

Next introduce the probability measure

X
µ⋆ := cm µ⋆m .
m∈M

Note that µ⋆ is the convex combination of invariant measures and hence is invariant.
We now show the result

d˜ P 2ℓk µ0 , µ⋆ ≤ 3γ k .

(7.15)

where
1
ℓ := ℓ0 ∨ max {ℓm } and γ := 1 − ,
γm ∈M nℓ

with ℓ0 and ℓm defined in Theorem 2.2(a)-(b).

28 D. SHIROKOFF, P. ZALESKI

For notational brevity, also introduce

µ̃k := µℓk and Q := P ℓ .

With these notations, and using the fact that Q2k µ0 = Qk µ̃k we can decompose the measure
X
Q2k µ0 = Qk µ̃k B + Qk µ̃k T

(7.16) ,
m
m∈M

into its components in B and separately in Tm . We can also write

X X
µ⋆ = cm − µ̃k (Tm ) µ⋆m + µ̃k (Tm ) µ⋆m ,

(7.17)
m∈M m∈M

measuring the discrepancy between the mass on Tm and cm .

Substituting (7.16) and (7.17) into d˜ and using the inequality (7.14) yields:

d˜ Q2k µ0 , µ⋆ ≤ I1 + I2 + I3 ,

where

I1 = Qk µ̃k B (I) ,

X
cm − µ̃k (Tm ) µ⋆m (I) ,

I2 =
m∈M
!

I3 = d˜
X X
k
Q µ̃k Tm
, µ̃k (Tm ) µ⋆m .
m∈M m∈M

The result (7.15) is then proved provided Ii ≤ γ k for each i = 1, 2, 3.

The first two terms I1 , I2 measure the mass of µ0 that remains in the uniformly transient
region B after k iterations of Q. We have,

(7.18) I1 = µ̃k B
(I) = µ̃k (B) ≤ γ k .

The first equality in (7.18) follows since I is positive invariant and Q is a Markov operator; the
last inequality follows from (2.12).
Similarly, since µ⋆m (I) = 1 and the cm ’s sum to one, the second term is
X
cm − µ̃k (Tm ) = 1 − µ̃k (T ) = µ̃k (B) ≤ γ k .

I2 =
m∈M

Estimating the third term I3 follows from Theorem 2.2(b) which quantifies the convergence of the
(normalized) restriction µ̃k Tm to µ⋆m . In particular, for any non-negative measure ν supported
on Tm from Theorem 2.2(b) we obtain for all k ≥ 0 and m ∈ M:

d˜ Qk ν, |ν| µ⋆m = dαm P kℓm +k(ℓ−ℓm ) ν, |ν| µ⋆m

(7.19)
k
ν
≤ |ν| γm dαm P k(ℓ−ℓm ) , µ⋆m (By (4.7) and (2.13))
|ν|
≤ |ν| γ k .

Applying (7.19) to I3 gives

X
I3 ≤ d˜ Qk µ̃k Tm
, µ̃k (Tm ) µ⋆m (By (7.13)) ,
m∈M
X
≤ µ̃k (Tm ) γ k (By (7.19) with ν = µ̃k Tm
),
m∈M

= µ̃k (T ) γ k ≤ γ k . (Since µ̃k (T ) ≤ 1) ,

SGD MARKOV CHAINS WITH SEPARABLE FUNCTIONS 29

which proves (7.15) and hence (2.15).

Lastly, in the case where d = 1 and I = [a, b], note that for any two probability measures µ
and ν,

dF (µ, ν) = sup |µ([a, x]) − ν([a, x])|

x∈[a,b]
X
≤ sup |µ([a, x] ∩ B) − ν([a, x] ∩ B)| + |µ([a, x] ∩ Tm ) − ν([a, x] ∩ Tm )|
x∈[a,b] m∈M
˜ ν) ,
≤ d(µ,

since B ∪ (∪m∈M Tm ) = I and are disjoint.

Acknowledgments. The authors would like to thank William Joseph McCann for helpful
discussions, and Sean Lawley for highlighting the connection between Infinite Bernoulli convolu-
tions and SGD.
This material is based upon work partially supported by the National Science Foundation
under Grants No. DMS–2012268 and DMS–2309727. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the authors and do not necessarily reflect
the views of the National Science Foundation.

REFERENCES

[1] C. Bandt, Finite orbits in multivalued maps and Bernoulli convolutions, Adv. Math., 324 (2018), pp. 437–
485.
[2] A. Batsis, Ergodic theory methods in Bernoulli convolutions for algebraia parameters and self-affine mea-
sures, PhD thesis, University of Manchester, 2021.
[3] G. Ben Arous, R. Gheissari, and A. Jagannath, High-dimensional limit theorems for SGD: Ef-
fective dynamics and critical scaling, in Advances in Neural Information Processing Systems,
S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, eds., vol. 35, Curran
Associates, Inc., 2022, pp. 25349–25362, https://proceedings.neurips.cc/paper files/paper/2022/file/
a224ff18cc99a71751aa2b79118604da-Paper-Conference.pdf.
[4] M. Benaı̈m, Dynamics of stochastic approximation algorithms, in Séminaire de Probabilités XXXIII,
J. Azéma, M. Émery, M. Ledoux, and M. Yor, eds., Berlin, Heidelberg, 1999, Springer Berlin Heidel-
berg, pp. 1–68.
[5] I. Benjamini and B. Solomyak, Spacings and pair correlations for finite Bernoulli convolutions, Nonlinearity,
22 (2009), pp. 381–393.
[6] R. N. Bhattacharya and O. Lee, Asymptotics of a class of Markov processes which are not in general
irreducible, Ann. Probab., 16 (1988), pp. 1333–1347.
[7] R. N. Bhattacharya and O. Lee, Correction: Asymptotics of a class of Markov processes which are not in
general irreducible, Ann. of Probab., 25 (1997), pp. 1541–1543.
[8] R. N. Bhattacharya and M. Majumdar, On a theorem of Dubins and Freedman, J. Theor. Probab., 12
(1999), pp. 1067–1087.
[9] V. S. Borkar, Stochastic Approximation: A Dynamical systems viewpoint, Hindustan Book Agency Gurgaon,
2008.
[10] P. Chaudhari, A. Oberman, S. Osher, S. Soatto, and G. Carlier, Deep relaxation: partial differential
equations for optimizing deep neural networks, Research in the Mathematical Sciences, 5 (2018), pp. 1–30.
[11] E. Counterman and S. Lawley, What should patients do if they miss a dose of medication? A theoretical
approach, Journal of Pharmacokinetics and Pharmacodynamics, 48 (2021), pp. 873–892.
[12] S. Dereich and S. Kassing, Convergence of stochastic gradient descent schemes for Lojasiewicz-landscapes,
2024.
[13] P. Diaconis and D. Freedman, Iterated random functions, SIAM Review, 41 (1999), pp. 45–76.
[14] L. E. Dubins and D. A. Freedman, Invariant probabilities for certain Markov processes, Ann. Math. Stat.,
37 (1966), pp. 837–848.
[15] M. Duflo, Algorithmes stochastiques, vol. 23 of Mathématiques & Applications, Springer-Verlag, Berlin,
1996.
[16] P. Erdős, On a family of symmetric Bernoulli convolutions, Am. J. Math., 61 (1939), pp. 974–976.
[17] D. J. Feng and E. Olivier, Multifractal analysis of weak Gibbs measures and phase transition—application
to some Bernoulli convolutions, Ergod. Theory Dyn. Syst., 23 (2003), pp. 1751–1784.
[18] Y. Feng, T. Gao, L. Li, J.-G. Liu, and Y. Lu, Uniform-in-time weak error analysis for stochastic gradient
descent algorithms via diffusion approximation, Communications in Mathematical Sciences, 18 (2020),
pp. 163–188.
[19] Y. Feng, L. Li, and J.-G. Liu, Semi-groups of stochastic gradient descent and online principal component
analysis: properties and diffusion approximations, Commun. Math. Sci., 16 (2018), pp. 777–789.
30 D. SHIROKOFF, P. ZALESKI

[20] K. Gelfert and G. R. Salcedo, Contracting on average iterated function systems by metric change, Non-
linearity, 36 (2023), p. 6879.
[21] A. Gupta, H. Chen, J. Pi, and G. Tendolkar, Some limit properties of Markov chains induced by recursive
stochastic algorithms, SIAM Journal on Mathematics of Data Science, 2 (2020), pp. 967–1003.
[22] A. Gupta and W. B. Haskell, Convergence of recursive stochastic algorithms using Wasserstein divergence,
SIAM Journal on Mathematics of Data Science, 3 (2021), pp. 1141–1167.
[23] M. Hairer, Convergence of Markov processes, Lecture notes, Imperial College London, (2021).
[24] U. Herkenrath and M. Iosifescu, On a contractibility condition for iterated random functions, Revue.
Roumaine Math. Pures Appl., 52 (2007), pp. 563–571.
[25] H. A. Hopenhayn and E. C. Prescott, Invariant distributions for monotone Markov processes, Discussion
Paper, Center for Economic Research, Department of Economics, University of Minnesota, 242 (1987),
pp. 1–33.
[26] H. A. Hopenhayn and E. C. Prescott, Stochastic monotonicity and stationary distributions for dynamic
economies, Econometrica, 60 (1992), pp. 1387–1406.
[27] W. Hu, C. J. Li, L. Li, and J. Liu, On the diffusion approximation of nonconvex stochastic gradient descent,
Annals of Mathematical Sciences and Applications, 4 (2019), pp. 3–32.
[28] J. E. Hutchinson, Fractals and self similarity, Indiana Univ. Math. J., 30 (1981), pp. 713–747.
[29] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, How to escape saddle points efficiently,
Proc. Int. Conf. Mach. Learn., (2017), pp. 1724–1732.
[30] T. Jordan, P. Shmerkin, and B. Solomyak, Multifractal structure of Bernoulli convolutions, Math. Proc.
Camb. Philos. Soc., 151 (2011), pp. 521–539.
[31] T. Kempton and T. Persson, Bernoulli convolutions and 1D dynamics, Nonlinearity, 28 (2015), pp. 3921–
3934.
[32] Q. Li, C. Tai, and W. E, Stochastic modified equations and adaptive stochastic gradient algorithms, in
Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh, eds.,
vol. 70 of Proceedings of Machine Learning Research, PMLR, 06–11 Aug 2017, pp. 2101–2110.
[33] S. Mandt, M. D. Hoffman, and D. M. Blei, Continuous-time limit of stochastic gradient descent revisited,
NIPS-2015, (2015).
[34] W. J. McCann, Stationary probability distributions of stochastic gradient descent and the success and failure
of the diffusion approximation, master’s thesis, New Jersey Institute of Technology, Newark, NJ, 2021.
[35] S. Meyn and R. Tweedie, The Doeblin decomposition, Doeblin and Modern Probability, 149 (1993), p. 211.
[36] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, Springer, London, 1993.
[37] J. Myjak and T. Szarek, Attractors of iterated function systems and Markov operators, Abstr. Appl. Anal.,
2003 (2003), pp. 479–502.
[38] D. Needell, R. Ward, and N. Srebro, Stochastic gradient descent, weighted sampling, and the ran-
domized kaczmarz algorithm, in Advances in Neural Information Processing Systems, Z. Ghahramani,
M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, eds., vol. 27, Curran Associates, Inc., 2014, https:
//proceedings.neurips.cc/paper files/paper/2014/file/f29c21d4897f78948b91f03172341b7b-Paper.pdf.
[39] H. Robbins and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, 22
(1951), pp. 400–407.
[40] D. Steinsaltz, Locally contractive iterated function systems, Ann. Probab., 27 (1999), pp. 1952–1979.
[41] O. Stenflo, A survey of average contractive iterated function systems, J. Differ. Equ. Appl., 18 (2012),
pp. 1355–1380.
[42] S. M. Ulam, A collection of mathematical problems, New York: Interscience, 1964.
[43] S. Wojtowytsch, Stochastic gradient descent with noise of machine learning type. Part I: Discrete time
analysis, Journal of Nonlinear Science, 33 (2023), pp. 1–52.
[44] S. Wojtowytsch, Stochastic gradient descent with noise of machine learning type. Part II: Continuous time
analysis, Journal of Nonlinear Science, 34 (2023), pp. 1–45.
[45] W. B. Wu and X. Shao, Limit theorems for iterated random functions, J. Appl. Probab., 41 (2004), pp. 425–
436.
[46] J. A. Yahav, On a fixed point theorem and its stochastic equivalent, J. Appl. Prob., 12 (1975), pp. 605–611.
[47] L. Yu, K. Balasubramanian, S. Volgushev, and M. A. Erdogdu, An analysis of constant step size
SGD in the non-convex regime: Asymptotic normality and bias, in Advances in Neural Information
Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, eds.,
vol. 34, Curran Associates, Inc., 2021, pp. 4234–4248, https://proceedings.neurips.cc/paper files/paper/
2021/file/21ce689121e39821d07d04faab328370-Paper.pdf.

Numerical Methods For Stochastic Control Problems in Continuous Time (PDFDrive)
100% (1)
Numerical Methods For Stochastic Control Problems in Continuous Time (PDFDrive)
480 pages
Design Guides For Offshore Structures, Volume 1 - Welded Tubular Joints
No ratings yet
Design Guides For Offshore Structures, Volume 1 - Welded Tubular Joints
323 pages
Bridging The Gap Between Constant Step Size Stochastic Gradient Descent and Markov Chains
No ratings yet
Bridging The Gap Between Constant Step Size Stochastic Gradient Descent and Markov Chains
30 pages
Is Stochastic Gradient Descent Effective? A PDE Perspective On Machine Learning Processes
No ratings yet
Is Stochastic Gradient Descent Effective? A PDE Perspective On Machine Learning Processes
50 pages
Lecture Note SGD
No ratings yet
Lecture Note SGD
4 pages
Stochastic Gradient Descent On Nonconvex Functions With General Noise Models
No ratings yet
Stochastic Gradient Descent On Nonconvex Functions With General Noise Models
19 pages
Ca07 RgIto Text
No ratings yet
Ca07 RgIto Text
18 pages
SDE Book
No ratings yet
SDE Book
119 pages
Notes Co905 11
No ratings yet
Notes Co905 11
44 pages
Ergodic Properties of Markov Processes
No ratings yet
Ergodic Properties of Markov Processes
39 pages
Non-Convex Learning Via Stochastic Gradient Langevin Dynamics A Nonasymptotic Analysis
No ratings yet
Non-Convex Learning Via Stochastic Gradient Langevin Dynamics A Nonasymptotic Analysis
29 pages
Examples in Markov Decision Processes by A B Piunovskiy
No ratings yet
Examples in Markov Decision Processes by A B Piunovskiy
308 pages
Stochastic Hamiltonian Gradient Methods For Smooth Games
No ratings yet
Stochastic Hamiltonian Gradient Methods For Smooth Games
31 pages
Pavliotis Book
No ratings yet
Pavliotis Book
155 pages
Dynamics of Stochastic Approximation Algorithms
No ratings yet
Dynamics of Stochastic Approximation Algorithms
69 pages
STAT333 Lecture Notes Book Version
No ratings yet
STAT333 Lecture Notes Book Version
71 pages
Stochastic Approximations and Differential Inclusions, Part II: Applications
No ratings yet
Stochastic Approximations and Differential Inclusions, Part II: Applications
23 pages
SDE For SGD
No ratings yet
SDE For SGD
35 pages
Section 1 - What This Course Is About
No ratings yet
Section 1 - What This Course Is About
15 pages
Ojolibroorigen
No ratings yet
Ojolibroorigen
300 pages
SGOS Book
No ratings yet
SGOS Book
238 pages
Intro To Sdes
No ratings yet
Intro To Sdes
28 pages
Almost None
No ratings yet
Almost None
347 pages
Hybrid Switching Diffusions Properties and Applications
No ratings yet
Hybrid Switching Diffusions Properties and Applications
394 pages
Raghu Meka Notes
No ratings yet
Raghu Meka Notes
7 pages
Book All-In-One 2
No ratings yet
Book All-In-One 2
281 pages
Markov Chains and Monte Carlo Methods: Ioana A. Cosma and Ludger Evers
No ratings yet
Markov Chains and Monte Carlo Methods: Ioana A. Cosma and Ludger Evers
97 pages
Notes Co905
No ratings yet
Notes Co905
51 pages
Stat 333 Master
No ratings yet
Stat 333 Master
85 pages
Markov Decision Processes: Lecture Notes For STP 425: Jay Taylor
100% (1)
Markov Decision Processes: Lecture Notes For STP 425: Jay Taylor
86 pages
Langevin SGD
No ratings yet
Langevin SGD
70 pages
Applied Stochastic Process
No ratings yet
Applied Stochastic Process
132 pages
Softmax Policy Gradient Methods Can Take Exponenti
No ratings yet
Softmax Policy Gradient Methods Can Take Exponenti
65 pages
An Introduction To Stochastic Processes in Continuous Time
No ratings yet
An Introduction To Stochastic Processes in Continuous Time
145 pages
5 The Stochastic Approximation Algorithm: 5.1 Stochastic Processes - Some Basic Concepts
No ratings yet
5 The Stochastic Approximation Algorithm: 5.1 Stochastic Processes - Some Basic Concepts
14 pages
Stochasticprocess
100% (1)
Stochasticprocess
316 pages
MIT18 S096F13 Lecnote1
No ratings yet
MIT18 S096F13 Lecnote1
67 pages
Nonasymptotic Analysis of Stochastic Gradient Hamiltonian Monte Carlo Under Local Conditions For Nonconvex Optimization
No ratings yet
Nonasymptotic Analysis of Stochastic Gradient Hamiltonian Monte Carlo Under Local Conditions For Nonconvex Optimization
34 pages
5 Why Does SGD Prefer Flat Minim
No ratings yet
5 Why Does SGD Prefer Flat Minim
15 pages
Book All in One
No ratings yet
Book All in One
288 pages
1 - Table of Contents
No ratings yet
1 - Table of Contents
6 pages
Main
No ratings yet
Main
93 pages
Bellman Filtering For State Space Models
No ratings yet
Bellman Filtering For State Space Models
26 pages
An Adaptive Simulated Annealing Algorithm PDF
No ratings yet
An Adaptive Simulated Annealing Algorithm PDF
9 pages
Stoch Proc Notes PDF
No ratings yet
Stoch Proc Notes PDF
239 pages
Stat 220 Notes
No ratings yet
Stat 220 Notes
109 pages
1980 - 20 - Theory of Stochastic Processes
No ratings yet
1980 - 20 - Theory of Stochastic Processes
7 pages
MATH37012 Course Notes: Discrete Time: DR Jonathan Bagley
No ratings yet
MATH37012 Course Notes: Discrete Time: DR Jonathan Bagley
29 pages
Master LN
No ratings yet
Master LN
135 pages
Combes - An Introduction To Stochastic Approximation - 2013
No ratings yet
Combes - An Introduction To Stochastic Approximation - 2013
9 pages
Hidden Markov Models: Ramon Van Handel
No ratings yet
Hidden Markov Models: Ramon Van Handel
123 pages
Brownian Montion
No ratings yet
Brownian Montion
13 pages
Bellman Filtering and Smoothing For State-Space Models
No ratings yet
Bellman Filtering and Smoothing For State-Space Models
60 pages
181 - Vggray R. - Probability, Random Processes, and Ergodic Properties
No ratings yet
181 - Vggray R. - Probability, Random Processes, and Ergodic Properties
5 pages
Sampling Is As Easy As Learning The Score: Theory For Diffusion Models With Minimal Data Assumptions
No ratings yet
Sampling Is As Easy As Learning The Score: Theory For Diffusion Models With Minimal Data Assumptions
29 pages
Emp Proc Lecture Notes
No ratings yet
Emp Proc Lecture Notes
172 pages
Lectures On Stochastic Processes
No ratings yet
Lectures On Stochastic Processes
207 pages
Stochastic Dynamic Programming: 4.1 The Axiomatic Approach To Probability: Basic Con-Cepts of Measure Theory
No ratings yet
Stochastic Dynamic Programming: 4.1 The Axiomatic Approach To Probability: Basic Con-Cepts of Measure Theory
17 pages
Better Theory For SGD in The Nonconvex World
No ratings yet
Better Theory For SGD in The Nonconvex World
33 pages
Geometric functions in computer aided geometric design
From Everand
Geometric functions in computer aided geometric design
Oscar Ruiz
No ratings yet
A-level Maths Revision: Cheeky Revision Shortcuts
From Everand
A-level Maths Revision: Cheeky Revision Shortcuts
Scool Revision
3.5/5 (8)
Large Data Global Existence For Coupled Massive-Massless Wave-Type Systems
No ratings yet
Large Data Global Existence For Coupled Massive-Massless Wave-Type Systems
58 pages
On Equivalence of Gauge-Invariant Models For Massive Integer-Spin Fields
No ratings yet
On Equivalence of Gauge-Invariant Models For Massive Integer-Spin Fields
40 pages
Sin87 FSW90 FSW90 FV21
No ratings yet
Sin87 FSW90 FSW90 FV21
76 pages
Existence of Minimizers For The Dirac-Fock Model of Crystals
No ratings yet
Existence of Minimizers For The Dirac-Fock Model of Crystals
45 pages
EE3503 Control Systems Reg 2021 (2 Marks)
No ratings yet
EE3503 Control Systems Reg 2021 (2 Marks)
2 pages
Detailed African Math Heritage Presentation
No ratings yet
Detailed African Math Heritage Presentation
15 pages
For Cable Pulling
No ratings yet
For Cable Pulling
6 pages
Crude Vacuum Tower Wash Bed Optimization: Chemical Engineering
No ratings yet
Crude Vacuum Tower Wash Bed Optimization: Chemical Engineering
6 pages
11 Physics Imp Ch14 Marks 5
No ratings yet
11 Physics Imp Ch14 Marks 5
23 pages
DR Mousami Forensic Ballistics Online Class
No ratings yet
DR Mousami Forensic Ballistics Online Class
94 pages
11th Chemistry EM 1st Mid Term Exam 2023 Model Question Paper English Medium PDF Download
No ratings yet
11th Chemistry EM 1st Mid Term Exam 2023 Model Question Paper English Medium PDF Download
2 pages
RT Acceptance Criteria For Pressure Vessel
No ratings yet
RT Acceptance Criteria For Pressure Vessel
1 page
02 - MEM Construction
No ratings yet
02 - MEM Construction
22 pages
Introduction To Behavior of Concrete and Steel Structures
No ratings yet
Introduction To Behavior of Concrete and Steel Structures
11 pages
Exercises Part 1
No ratings yet
Exercises Part 1
4 pages
Introduction To Aerosol Modelling: From Theory To Code 1st Edition David L. Toppinginstant Download
100% (2)
Introduction To Aerosol Modelling: From Theory To Code 1st Edition David L. Toppinginstant Download
50 pages
Cambridge IGCSE ™: Physics 0625/43 October/November 2022
No ratings yet
Cambridge IGCSE ™: Physics 0625/43 October/November 2022
16 pages
Field Experience IV - Gases Unit Plan
No ratings yet
Field Experience IV - Gases Unit Plan
7 pages
75 Civil Engineering Interview Questions
No ratings yet
75 Civil Engineering Interview Questions
13 pages
PY
No ratings yet
PY
7 pages
Class 10 ICSE Math Question Bank Sample
No ratings yet
Class 10 ICSE Math Question Bank Sample
6 pages
Technical Proposal - 181120-114811-8 Rev.04
100% (1)
Technical Proposal - 181120-114811-8 Rev.04
41 pages
RW Ti Pds Prorox SL 950 en UM
No ratings yet
RW Ti Pds Prorox SL 950 en UM
1 page
Block 5
No ratings yet
Block 5
101 pages
Chapter 11
No ratings yet
Chapter 11
23 pages
Drill Bits: Supervisor
No ratings yet
Drill Bits: Supervisor
12 pages
Flight Control of A 1-DOF Helicopter System Using A Sliding Mode Controller For Disturbance Rejection
No ratings yet
Flight Control of A 1-DOF Helicopter System Using A Sliding Mode Controller For Disturbance Rejection
7 pages
CH 5
No ratings yet
CH 5
165 pages
CBR MR
No ratings yet
CBR MR
17 pages
Beams, Plates, and Shells
No ratings yet
Beams, Plates, and Shells
4 pages
Energy Work Power Document Document
No ratings yet
Energy Work Power Document Document
22 pages
FM Upload Question
No ratings yet
FM Upload Question
9 pages
Nrrel 2809 Blades
No ratings yet
Nrrel 2809 Blades
9 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.