Introduction to Stochastic Approximation Algorithms

(This version: October 31, 2009.)
Stochastic approximation algorithms are recursive update rules that can be
used, among other things, to solve optimization problems and fixed point equa-
tions (including standard linear systems) when the collected data is subject to
noise. In engineering, optimization problems are often of this type: you do not have a mathematical model of the system (which may be too complex to write down), but you would still like to optimize its behavior by adjusting certain parameters.
For this purpose, you can do experiments or run simulations to evaluate the
performance of the system at given values of the parameters. Stochastic ap-
proximation algorithms have also been used in the social sciences to describe
collective dynamics: fictitious play in learning theory and consensus algorithms
can be studied using their theory. In short, it is hard to overemphasize their
usefulness. In addition, the theory of stochastic approximation algorithms, at
least when approached using the ODE method as done here, is a beautiful mix
of dynamical systems theory and probability theory. We only have time to give
you a flavor of this theory but hopefully this will motivate you to explore fur-
ther on your own. For our purpose, essentially all approximate DP algorithms
encountered in the following chapters are stochastic approximation algorithms.
We will not have time to give formal convergence proofs for all of them, but this
chapter should give you a starting point to understand the basic mechanisms
involved. Most of the material discussed here is taken from [Bor08].
Suppose first that we want to find a root $\bar\theta$ of a known differentiable function $f : \mathbb{R} \to \mathbb{R}$. Newton's method iterates

$$ \theta_{n+1} = \theta_n - \frac{f(\theta_n)}{f'(\theta_n)}. $$
Suppose we also know a neighborhood of $\bar\theta$ where $f(\theta) < 0$ for $\theta < \bar\theta$, $f(\theta) > 0$ for $\theta > \bar\theta$, and $f$ is nondecreasing in this neighborhood. Then if we start at $\theta_0$ close enough to $\bar\theta$, the following simpler (but less efficient) scheme also converges to $\bar\theta$, and does not require the derivative of $f$:

$$ \theta_{n+1} = \theta_n - \alpha f(\theta_n), $$
for some fixed and sufficiently small α > 0. Note that if f is itself the derivative
of a function F , these schemes correspond to Newton’s method and a fixed-
step gradient descent procedure for minimizing F , respectively (more precisely,
finding a critical point of F or root of the gradient of F ).
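To make the comparison concrete, here is a minimal numerical sketch of the two deterministic schemes; the test function $f(\theta) = \theta^3 - 2$ (root $\bar\theta = 2^{1/3}$), the starting point, and the step size $\alpha$ are illustrative choices, not from the notes.

```python
import numpy as np

# Illustrative test problem: find the root theta_bar of f(theta) = theta^3 - 2.
f = lambda theta: theta**3 - 2.0
df = lambda theta: 3.0 * theta**2      # derivative, needed by Newton only

theta_newton, theta_fixed = 1.0, 1.0   # start close enough to theta_bar
alpha = 0.1                            # fixed, sufficiently small step

for n in range(50):
    theta_newton -= f(theta_newton) / df(theta_newton)  # Newton's method
    theta_fixed  -= alpha * f(theta_fixed)              # fixed-step scheme

print(theta_newton, theta_fixed, 2.0 ** (1 / 3))   # both approach the root
```

Newton converges in a handful of iterations; the fixed-step scheme takes longer but never evaluates the derivative.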
Very often in applications, we do not have access to the mathematical model
f , but we can do experiments or simulations to sample the function at particular
values of θ. These samples are typically noisy however, so that we can assume
that we have a black-box at our disposal (the simulator, the lab where we do
the experiments, etc.), which on input $\theta$ returns the value $y = f(\theta) + d$, where $d$ is a noise term, which will soon be assumed to be random. The point is that we
only have access to the value y, and we have no way of removing the noise from
it, i.e., of isolating the exact value of f (θ). Now suppose that we still want to
find a root of f as in the problem above, with access only to this noisy black
box.
Assume for now that we know that the noise is i.i.d. and zero-mean. A first
approach to the problem could be, for a given value of $\theta$, to sample sufficiently many times at the same point $\theta$ to get values $y_1, \ldots, y_N$, and then form an estimate of $f(\theta)$ using the empirical average

$$ f(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} y_i. \qquad (15.2) $$
(Schemes of this kind were used in applications such as smoothing radar returns even before the work of Robbins and Monro; however, there was apparently no general asymptotic theory.)
Suppose we have i.i.d. observations $\xi_1, \ldots, \xi_N$ of a random variable and wish to form their empirical average as in (15.2). A recursive alternative to (15.2), extremely useful in settings where the samples become available progressively with time (recall for example the Kalman filter), is to form

$$ \theta_{n+1} = \theta_n + \frac{1}{n+1} \big( \xi_{n+1} - \theta_n \big), \qquad \theta_0 = 0, $$

which produces at step $n$ exactly the empirical average of the first $n$ observations.
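A quick sketch checking this equivalence numerically (the distribution and sample size are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
xi = rng.normal(loc=3.0, scale=1.0, size=10_000)  # i.i.d. observations

theta = 0.0
for n, sample in enumerate(xi):
    theta += (sample - theta) / (n + 1)   # recursive empirical average

print(theta, xi.mean())   # identical up to floating point rounding
```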
This simple recursion is the prototype of a stochastic approximation scheme: at each step, a small step of size $\gamma_n = 1/(n+1)$ is taken in a direction driven by the latest noisy observation. More generally, for a scheme of the form $\theta_{n+1} = \theta_n + \gamma_n (f(\theta_n) + D_{n+1})$, with $D_{n+1}$ a zero-mean noise, the iterates asymptotically track the trajectories of the ordinary differential equation³

$$ \dot{\theta} = f(\theta). $$

We will give a more formal proof of this fact in the basic case in Section 15.3.
Typically, for the simplest proofs, $\gamma_n$ must be decreasing to 0 and satisfy

$$ \sum_n \gamma_n = \infty, \qquad \sum_n \gamma_n^2 < \infty. $$
However other choices are possible, including constant small step sizes in some
cases, and in practice the choice of step sizes requires experimentation because
it controls the convergence rate. Some theoretical results regarding convergence
rates are also available but will not be covered here. The ODE method is extremely useful in any case, even if another technique is chosen for formal convergence
proofs, in order to get a quick idea of the behavior of an algorithm. Moreover,
another big advantage of this method is that it can be used to easily create new
stochastic approximation algorithms from convergent ODEs. We now describe
a few more classes of problems where these algorithms arise.
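For instance, here is a minimal sketch of a scalar stochastic approximation iteration with $\gamma_n = 1/n$; the drift $h$, the noise level, and all constants are illustrative choices, not from the notes. Despite noise much larger than the drift, the iterates converge to the equilibrium of the ODE $\dot{\theta} = h(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(1)

h = lambda x: 2.0 - x    # illustrative drift; the ODE xdot = h(x) settles at 2

x = 10.0
for n in range(1, 100_001):
    gamma = 1.0 / n                # sum gamma_n = inf, sum gamma_n^2 < inf
    D = rng.normal(scale=5.0)      # zero-mean i.i.d. noise D_{n+1}
    x += gamma * (h(x) + D)        # stochastic approximation update

print(x)   # close to 2.0, the equilibrium of the ODE
```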
³ By definition, $\dot{x} := \frac{d}{dt} x(t)$.
Figure 15.1: Consider a flow on a circle that moves clockwise everywhere ex-
cept at a single rest point. This rest point is the unique ω-limit point of the
flow. Now suppose the flow represents the expected motion of some underlying
stochastic process. If the stochastic process reaches the rest point, its expected
motion is zero. Nevertheless, actual motion may occur with positive probability
and in particular the process can jump past the rest point and begin another
circuit. Therefore in the long run all regions of the circle are visited infinitely
often. The long run behavior is captured by the notion of chain recurrence, as
all points on the circle are chain recurrent under the flow.
Definition of a Lyapunov function for a continuous-time system $\dot{x} = f(x)$: a continuously differentiable $V : \mathbb{R}^d \to [0, \infty)$ such that $\frac{d}{dt} V(x(t)) = \langle \nabla V(x(t)), f(x(t)) \rangle \le 0$ along trajectories. LaSalle's invariance principle: every bounded trajectory converges to the largest invariant set contained in $\{x : \langle \nabla V(x), f(x) \rangle = 0\}$.
where $\delta > 0$ is a small scalar. An issue with this algorithm is that it requires $2d$ function evaluations, and using one-sided differences still requires $d+1$ function evaluations, which might still be too costly. A nice development in this context is the simultaneous perturbation stochastic approximation (SPSA) method due to Spall. A basic version of this method considers i.i.d. random variables $\Delta_n \in \mathbb{R}^d$, with $\Delta_n$ independent of $D_1, \ldots, D_{n+1}$ and $x_0, \ldots, x_n$, and $P(\Delta_n^i = 1) = P(\Delta_n^i = -1) = \frac{1}{2}$ for each component $i$. Then replace the algorithm above by
$$ x_{n+1}^i = x_n^i + \gamma_n \left( -\frac{f(x_n + \delta \Delta_n) - f(x_n)}{\delta \Delta_n^i} + D_{n+1}^i \right), $$
which requires only two function evaluations. By Taylor’s theorem, for each i,
$$ \frac{f(x_n + \delta \Delta_n) - f(x_n)}{\delta \Delta_n^i} \approx \frac{\partial f}{\partial x_i}(x_n) + \sum_{j \neq i} \frac{\partial f}{\partial x_j}(x_n) \frac{\Delta_n^j}{\Delta_n^i}. $$
Now the expected value of the second term above is zero, and so it acts just like
another noise term that can be included in Dn+1 for the purpose of analysis.
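Here is a minimal sketch of this basic SPSA iteration on a toy quadratic with noisy measurements; the objective, noise level, and step sizes are illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative objective: f(x) = ||x - c||^2, minimized at c.
c = np.array([1.0, -2.0, 0.5])

def measure(x):
    """Noisy black-box evaluation of f."""
    return float(np.sum((x - c) ** 2)) + rng.normal(scale=0.5)

x = np.zeros(3)
delta = 0.1
for n in range(1, 20_001):
    gamma = 1.0 / n
    Delta = rng.choice([-1.0, 1.0], size=3)   # P(+1) = P(-1) = 1/2 per component
    g_hat = (measure(x + delta * Delta) - measure(x)) / (delta * Delta)
    x = x - gamma * g_hat                     # descent step: 2 evaluations total

print(x)   # approaches c
```

Note that the cost per iteration is two function evaluations regardless of the dimension $d$, which is the whole point of SPSA.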
A type of application quite close to our subject considers the optimization of an expected performance measure

$$ J(\theta) = \mathbb{E}_\theta[f(X)], $$
where $X$ is a random variable with a distribution $F_\theta$ that depends on a parameter $\theta$ to be adjusted in order to minimize $J(\theta)$ (in our context, $\theta$ is a policy). Now it is typically difficult to compute $J(\theta)$, but if we fix $\theta = \theta_n$, we can generate samples $f(X)$ with $X$ distributed according to $F_{\theta_n}$. Suppose that the laws $\mu_\theta$ corresponding to $F_\theta$ (i.e., $\mu_\theta((-\infty, x]) = F_\theta(x)$ for real-valued random variables) are all absolutely continuous with respect to a probability measure $\mu$, i.e., $d\mu_\theta(x) = \Lambda_\theta(x)\, d\mu(x)$, where the likelihood ratio $\Lambda_\theta(x)$ (or Radon-Nikodym derivative) is continuously differentiable in $\theta$. Then
$$ J(\theta) = \int f(x)\, d\mu_\theta(x) = \int f(x)\, \Lambda_\theta(x)\, d\mu(x). $$
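The point of this representation is that one can differentiate under the integral sign, $\nabla J(\theta) = \int f(x)\, \nabla_\theta \Lambda_\theta(x)\, d\mu(x)$, and estimate the gradient from samples; taking $\mu = \mu_{\theta_n}$ gives the familiar score-function form $\nabla J(\theta_n) = \mathbb{E}_{\theta_n}[f(X)\, \nabla_\theta \log p_\theta(X)|_{\theta_n}]$. A minimal sketch of the resulting estimator, under the illustrative assumption that $X \sim N(\theta, 1)$ and $f(x) = x^2$, so that $J(\theta) = \theta^2 + 1$ and $J'(\theta) = 2\theta$:

```python
import numpy as np

rng = np.random.default_rng(3)

theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=200_000)  # samples from F_theta
f = x ** 2                                          # J(theta) = E[X^2]

score = x - theta              # d/dtheta log p_theta(x) for N(theta, 1)
grad_estimate = np.mean(f * score)

print(grad_estimate, 2 * theta)   # estimator vs. exact gradient 2*theta
```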
Stochastic Fixed Point Iterations
$$ \|x\|_{p,w} := \Big( \sum_{i=1}^d w_i |x_i|^p \Big)^{1/p}, \qquad \text{or} \qquad \|x\|_{\infty,w} := \max_i\, w_i |x_i|, $$
where $w = [w_1, \ldots, w_d]^T$ with $w_i > 0$ for all $i$. Recall the Banach fixed point theorem 6.4.1, which says that a contraction has a unique fixed point. To analyze the behavior of the ODE (15.8), $\dot{x}(t) = F(x(t)) - x(t)$, where $F$ is an $\alpha$-contraction with fixed point $x^*$, we consider the Lyapunov function $V(x) = \|x - x^*\|_{p,w}$ for $x \in \mathbb{R}^d$ (the notation includes the case $p = \infty$). Note that the only equilibrium of (15.8) is $x^*$ and the only constant trajectory is $x(\cdot) \equiv x^*$.
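Before the proof, here is a minimal sketch of the natural stochastic approximation scheme for finding $x^*$, namely $x_{n+1} = x_n + \gamma_n (F(x_n) - x_n + D_{n+1})$, whose mean dynamics is the ODE (15.8); the affine contraction $F$ and the noise level are illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative alpha-contraction: F(x) = A x + b with ||A||_inf = 0.5 < 1,
# so F is a 0.5-contraction for the sup norm (weights w_i = 1).
A = np.array([[0.25, 0.25],
              [0.10, 0.40]])
b = np.array([1.0, -1.0])
F = lambda x: A @ x + b
x_star = np.linalg.solve(np.eye(2) - A, b)   # the unique fixed point

x = np.zeros(2)
for n in range(1, 50_001):
    gamma = 1.0 / n
    D = rng.normal(scale=1.0, size=2)        # zero-mean noise
    x = x + gamma * (F(x) - x + D)           # stochastic fixed point iteration

print(x, x_star)   # the iterates approach the fixed point
```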
Proof of the theorem. We start with the case $1 < p < \infty$. Define $\mathrm{sgn}(x) = +1$, $-1$, or $0$ depending on whether $x > 0$, $x < 0$, or $x = 0$. For $x(t) \neq x^*$, we have
$$
\begin{aligned}
\frac{d}{dt} V(x(t)) &= \frac{1}{p} \Big( \sum_{i=1}^d w_i |x_i(t) - x_i^*|^p \Big)^{(1-p)/p} \Big( p \sum_{i=1}^d w_i\, \mathrm{sgn}(x_i(t) - x_i^*)\, |x_i(t) - x_i^*|^{p-1}\, \dot{x}_i(t) \Big) \\
&= \|x(t) - x^*\|_{p,w}^{1-p} \sum_{i=1}^d w_i\, \mathrm{sgn}(x_i(t) - x_i^*)\, |x_i(t) - x_i^*|^{p-1} \big( F_i(x(t)) - x_i(t) \big) \\
&= \|x(t) - x^*\|_{p,w}^{1-p} \sum_{i=1}^d w_i\, \mathrm{sgn}(x_i(t) - x_i^*)\, |x_i(t) - x_i^*|^{p-1} \big( F_i(x(t)) - F_i(x^*) \big) \\
&\qquad - \|x(t) - x^*\|_{p,w}^{1-p} \sum_{i=1}^d w_i\, |x_i(t) - x_i^*|^{p-1}\, \mathrm{sgn}(x_i(t) - x_i^*) \big( x_i(t) - x_i^* \big) \\
&= \|x(t) - x^*\|_{p,w}^{1-p} \sum_{i=1}^d w_i\, \mathrm{sgn}(x_i(t) - x_i^*)\, |x_i(t) - x_i^*|^{p-1} \big( F_i(x(t)) - F_i(x^*) \big) - \|x(t) - x^*\|_{p,w} \\
&\le \|x(t) - x^*\|_{p,w}^{1-p}\, \|x(t) - x^*\|_{p,w}^{p-1}\, \|F(x(t)) - F(x^*)\|_{p,w} - \|x(t) - x^*\|_{p,w} \\
&\le -(1 - \alpha)\, \|x(t) - x^*\|_{p,w},
\end{aligned}
$$
where the first inequality is obtained using Hölder's inequality, valid for $1 < p < \infty$, and the last step uses the contraction property $\|F(x(t)) - F(x^*)\|_{p,w} \le \alpha \|x(t) - x^*\|_{p,w}$. Hence the time derivative is strictly negative for $x(t) \neq x^*$, which proves the claim for $1 < p < \infty$. The inequality can be written, for $t > s \ge 0$, as

$$ \|x(t) - x^*\|_{p,w} \le \|x(s) - x^*\|_{p,w} - (1 - \alpha) \int_s^t \|x(\tau) - x^*\|_{p,w}\, d\tau. $$

The claim then follows for $p = 1$ and $p = \infty$ by continuity of $p \mapsto \|x\|_{p,w}$ on $[1, \infty]$.
Let

$$ \nu_i(n) = \frac{1}{n} \sum_{m=1}^n 1\{\xi_m^i = s_i\}, $$

i.e., $\nu_i(n)$ is the frequency with which player $i$ played strategy $s_i$ up to time $n$. In the fictitious play model, an agent records the empirical frequency of its opponent's play and plays at each stage the best response, assuming that the opponent chooses its strategy randomly according to its empirical frequency. This best response for player $i$ is a map $f_i : [0, 1] \to [0, 1]$ which, based on the one-stage game, prescribes the probability with which player $i$ should choose its strategy $s_i$ if the probability that its opponent chooses $s_{-i}$ is $p_{-i}$. In this model, the empirical frequencies then evolve according to

$$ \nu_i(n+1) = \nu_i(n) + \frac{1}{n+1} \big( 1\{\xi_{n+1}^i = s_i\} - \nu_i(n) \big), \qquad i = 1, 2, $$

and the corresponding ODE is

$$ \dot{\nu}_i(t) = f_i(\nu_{-i}(t)) - \nu_i(t), \qquad i = 1, 2. $$
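As an illustration, here is a minimal simulation sketch of discrete fictitious play; the one-stage game (matching pennies) and all numerical choices are illustrative assumptions, not from the notes. For this zero-sum game the empirical frequencies are known to converge to the mixed equilibrium $(1/2, 1/2)$.

```python
import numpy as np

rng = np.random.default_rng(5)

# Matching pennies: player 0 wants to match the opponent's coin,
# player 1 wants to mismatch. nu[i] = empirical frequency of "heads".
def best_response(i, p_opponent_heads):
    if i == 0:   # matcher: heads iff the opponent favors heads
        return 1.0 if p_opponent_heads > 0.5 else 0.0
    else:        # mismatcher: heads iff the opponent favors tails
        return 1.0 if p_opponent_heads < 0.5 else 0.0

nu = np.array([1.0, 0.0])    # initial empirical frequencies
for n in range(1, 100_000):
    plays = np.array([
        float(rng.random() < best_response(0, nu[1])),
        float(rng.random() < best_response(1, nu[0])),
    ])
    nu += (plays - nu) / (n + 1)   # recursive frequency update

print(nu)   # approaches the mixed equilibrium (0.5, 0.5)
```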
Now we can consider many variations of the basic averaging rule (15.9).
For example, suppose that at period k the communication link from j to i fails
with probability 1 − pij . This probability can be made dependent on the past
and time dependent without change, but for simplicity, let us assume here that
the failures are i.i.d. Moreover, let’s assume that the difference xj (k) − xi (k) in
(15.9) is also perturbed by a zero mean noise νij (k) (due say to quantization or
communication errors), independent of the random link failures. The perturbed
averaging rule then becomes

$$ x_i(k+1) = x_i(k) + \epsilon_k \sum_{j \in N_i} \delta_{ij}(k+1) \big( x_j(k) - x_i(k) + \nu_{ij}(k+1) \big), \qquad i = 1, \ldots, n, $$
where $\{\delta_{ij}(k)\}_k$ is i.i.d. Bernoulli, with $P(\delta_{ij}(k) = 1) = p_{ij}$, and we allow for a time-varying (typically diminishing) step size $\epsilon_k$. Under broad conditions, this stochastic approximation asymptotically tracks the corresponding ODE

$$ \dot{x}_i(t) = \sum_{j \in N_i} p_{ij} \big( x_j(t) - x_i(t) \big). $$
Note that the set of equilibria of this equation is the one-dimensional subspace
x1 = . . . = xn under reasonable conditions on the underlying connectivity
graph and failure probabilities, hence consensus is obtained asymptotically.
However, the choice of step sizes, as often in such simple stochastic approxi-
mation algorithms, has a strong influence on the practical (transient) behavior
of the trajectories, see Fig. 15.2. One can also study asynchronous versions of
the averaging algorithm using the ODE method, which is perhaps more useful
from an engineering point of view.
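A minimal simulation sketch of the perturbed averaging rule, under illustrative assumptions (a ring network, a uniform failure probability $p_{ij} = p$, Gaussian communication noise; none of these specifics come from the notes):

```python
import numpy as np

rng = np.random.default_rng(6)

n, p, steps = 20, 0.7, 20_000
x = rng.normal(size=n)                         # initial values
neighbors = [((i - 1) % n, (i + 1) % n) for i in range(n)]   # ring graph

for k in range(1, steps + 1):
    eps = 1.0 / (k + 10)                       # diminishing step size
    x_new = x.copy()
    for i in range(n):
        for j in neighbors[i]:
            if rng.random() < p:               # link j -> i is up
                noise = rng.normal(scale=0.1)  # communication noise nu_ij
                x_new[i] += eps * (x[j] - x[i] + noise)
    x = x_new

print(x.std())   # near 0: approximate consensus despite failures and noise
```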
Figure 15.2: Transient behavior of the local averaging algorithm for different choices of step sizes: (a) $\epsilon_k = 10^{-3}$, (b) $\epsilon_k = 10^{-2}$. If we choose a constant step size, increasing it improves the convergence speed, but the communication noise is not well filtered. Decreasing step sizes for larger values of $k$ improves the asymptotic filtering property of the algorithm, but can also reduce the convergence speed if the steps decrease too fast. In fact, for constant step sizes in this problem, we only obtain asymptotic convergence to a neighborhood of the limit set of the ODE.
Sometimes the analysis is done by artificially forcing the iterates in (15.10) to remain bounded (say by truncation), which can actually be a useful device in applications. This requires considering a limiting ODE with reflection terms on the domain boundary [KY03]. But for general unbounded state spaces this is a stability assumption that must be proved separately, perhaps via means other than the ODE method, e.g., a method based on stochastic Lyapunov functions [KY03]. Under the stability assumption, the iterates (15.10) are expected to asymptotically track the ODE

$$ \dot{x}(t) = f(x(t)), \qquad t \ge 0. \qquad (15.12) $$
Assumption 1 ensures that this ODE has a unique solution for each x(0), which
depends continuously on x(0). The martingale difference assumption (15.11) is
a more precise definition of our earlier assumption of zero-mean noise Dn . We
allow conditioning on the past iterates, so this is a quite general set-up. Any
deterministic trend in the noise should be captured in f or the bias terms bn
in (15.4).
To make more formal the idea that the stochastic approximation asymptotically tracks the trajectories of the ODE, first define the sequence of times

$$ t_0 = 0, \qquad t_n = \sum_{m=0}^{n-1} \gamma_m. $$
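To see the tracking numerically, one can place the iterates at the times $t_n$ and compare them with the ODE solution started from the same point; a minimal sketch with an illustrative linear drift $f$ (not from the notes):

```python
import numpy as np

rng = np.random.default_rng(7)

f = lambda x: -x + 1.0                    # illustrative drift; ODE settles at 1

N = 5_000
gamma = 1.0 / (1.0 + np.arange(N))        # step sizes gamma_n
t = np.concatenate(([0.0], np.cumsum(gamma)))   # t_0 = 0, t_n = sum_{m<n} gamma_m

x = np.empty(N + 1)
x[0] = 5.0
for n in range(N):
    D = rng.normal(scale=2.0)             # martingale difference noise
    x[n + 1] = x[n] + gamma[n] * (f(x[n]) + D)

ode = 1.0 + (x[0] - 1.0) * np.exp(-t)     # exact solution of xdot = -x + 1
print(x[-1], ode[-1])   # the iterates shadow the ODE trajectory
```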
Note that the chain transitive invariant set of the theorem can be much
larger than the ω-limit set of the ODE, because it must essentially be “stable
under small perturbations”, recall Fig. 15.1. In practice, Lyapunov functions
are useful for further narrowing down the potential candidates for the limit
set. Suppose that we have a Lyapunov function $V : \mathbb{R}^d \to [0, \infty)$, continuously differentiable, such that $\lim_{\|x\| \to \infty} V(x) = \infty$, $H = \{x \in \mathbb{R}^d : V(x) = 0\} \neq \emptyset$, and $\frac{d}{dt} V(x(t)) = \langle \nabla V(x), f(x) \rangle \le 0$, with equality if and only if $x \in H$. Then we have the following corollary, under the same assumptions as for the theorem.
Corollary 15.3.4. If the only internally chain transitive invariant sets for (15.12) are isolated equilibrium points, then $\{x_n\}$ a.s. converges to a possibly sample path dependent equilibrium point.
"∞
Remark on the assumption 2
n=0 γn <∞
Consider the cumulative noise term

$$ \zeta_n = \sum_{m=0}^{n-1} \gamma_m D_{m+1}, \qquad n \ge 1, $$

in (15.10). We want to show that the effect of the noise becomes negligible asymptotically, as this is a basic ingredient in the proof of Lemma 15.3.1. Note that $\zeta_n$ is a (zero-mean) martingale, i.e.,

$$ E[\zeta_{n+1} \mid \mathcal{F}_n] = \zeta_n, \qquad n \ge 1, $$
which follows immediately from assumption 3. The definition of a martingale also requires $\zeta_n$ to be $\mathcal{F}_n$-measurable, which is immediate, and integrable. In fact, in this case $\zeta_n$ is even square integrable, i.e., $E[\|\zeta_n\|^2] < \infty$ for all $n$, which is a consequence of assumptions 1 and 3. Moreover,

$$ \sum_{n \ge 0} E\big[ \|\zeta_{n+1} - \zeta_n\|^2 \mid \mathcal{F}_n \big] = \sum_{n \ge 0} \gamma_n^2\, E\big[ \|D_{n+1}\|^2 \mid \mathcal{F}_n \big] \le \sum_{n \ge 0} \gamma_n^2\, K \big( 1 + \|x_n\|^2 \big) \le K(1 + B^2) \sum_{n \ge 0} \gamma_n^2 < \infty, \quad \text{a.s.,} $$
where $B = \sup_n \|x_n\| < \infty$ by assumption 4. We can then apply the martingale convergence theorem to conclude that $\zeta_n$ converges almost surely as $n \to \infty$. In particular, the noise entering the iterations after time $K$, i.e., $\sum_{n=K}^{\infty} \gamma_n D_{n+1}$, vanishes as $K \to \infty$. This ensures that the effect of the noise indeed becomes asymptotically negligible. Note here that this property relies on the assumption $\sum_{n \ge 0} \gamma_n^2 < \infty$, which is important to obtain a general theorem⁵.
⁵ The case of constant step sizes, which does not satisfy this assumption, is also well studied.