Introduction to Stochastic Approximation Algorithms

(This version: October 31, 2009.)
Stochastic approximation algorithms are recursive update rules that can be
used, among other things, to solve optimization problems and fixed point equa-
tions (including standard linear systems) when the collected data is subject to
noise. In engineering, optimization problems are often of this type: you do not have a mathematical model of the system (which may be too complex to write down), but you would still like to optimize its behavior by adjusting certain parameters.
For this purpose, you can do experiments or run simulations to evaluate the
performance of the system at given values of the parameters. Stochastic ap-
proximation algorithms have also been used in the social sciences to describe
collective dynamics: fictitious play in learning theory and consensus algorithms
can be studied using their theory. In short, it is hard to overemphasize their
usefulness. In addition, the theory of stochastic approximation algorithms, at
least when approached using the ODE method as done here, is a beautiful mix
of dynamical systems theory and probability theory. We only have time to give
you a flavor of this theory but hopefully this will motivate you to explore fur-
ther on your own. For our purpose, essentially all approximate DP algorithms
encountered in the following chapters are stochastic approximation algorithms.
We will not have time to give formal convergence proofs for all of them, but this
chapter should give you a starting point to understand the basic mechanisms
involved. Most of the material discussed here is taken from [Bor08].
Suppose first that we want to find a root $\bar\theta$ of a known differentiable function $f : \mathbb{R} \to \mathbb{R}$. Newton's method iterates

$$ \theta_{n+1} = \theta_n - \frac{f(\theta_n)}{f'(\theta_n)}. $$
Suppose we also know a neighborhood of $\bar\theta$ where $f(\theta) < 0$ for $\theta < \bar\theta$, $f(\theta) > 0$ for $\theta > \bar\theta$, and $f$ is nondecreasing in this neighborhood. Then if we start at $\theta_0$ close enough to $\bar\theta$, the following simpler (but less efficient) scheme also converges to $\bar\theta$, and does not require the derivative of $f$:

$$ \theta_{n+1} = \theta_n - \alpha f(\theta_n), $$
for some fixed and sufficiently small α > 0. Note that if f is itself the derivative
of a function F , these schemes correspond to Newton’s method and a fixed-
step gradient descent procedure for minimizing F , respectively (more precisely,
finding a critical point of F or root of the gradient of F ).
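To make the comparison concrete, here is a minimal numerical sketch of the two deterministic schemes; the test function $f(\theta) = \theta^3 - 2$ (root $\bar\theta = 2^{1/3}$), the starting point, and the step size $\alpha$ are illustrative choices, not from the notes.

```python
import numpy as np

# Illustrative test problem: find the root theta_bar of f(theta) = theta^3 - 2.
f = lambda theta: theta**3 - 2.0
df = lambda theta: 3.0 * theta**2      # derivative, needed by Newton only

theta_newton, theta_fixed = 1.0, 1.0   # start close enough to theta_bar
alpha = 0.1                            # fixed, sufficiently small step

for n in range(50):
    theta_newton -= f(theta_newton) / df(theta_newton)  # Newton's method
    theta_fixed  -= alpha * f(theta_fixed)              # fixed-step scheme

print(theta_newton, theta_fixed, 2.0 ** (1 / 3))   # both approach the root
```

Newton converges in a handful of iterations; the fixed-step scheme takes longer but never evaluates the derivative.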
Very often in applications, we do not have access to the mathematical model
f , but we can do experiments or simulations to sample the function at particular
values of θ. These samples are typically noisy however, so that we can assume
that we have a black-box at our disposal (the simulator, the lab where we do
the experiments, etc.), which on input $\theta$ returns the value $y = f(\theta) + d$, where $d$ is a noise term, which will soon be assumed to be random. The point is that we
only have access to the value y, and we have no way of removing the noise from
it, i.e., of isolating the exact value of f (θ). Now suppose that we still want to
find a root of f as in the problem above, with access only to this noisy black
box.
Assume for now that we know that the noise is i.i.d. and zero-mean. A first
approach to the problem could be, for a given value of $\theta$, to sample sufficiently many times at the same point $\theta$ to get values $y_1, \ldots, y_N$, and then form an estimate of $f(\theta)$ using the empirical average

$$ f(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} y_i. \qquad (15.2) $$
(Schemes of this kind were used in applications such as smoothing radar returns even before the work of Robbins and Monro; however, there was apparently no general asymptotic theory.)
Suppose we have i.i.d. observations $\xi_1, \ldots, \xi_N$ of a random variable and wish to form their empirical average as in (15.2). A recursive alternative to (15.2), extremely useful in settings where the samples become available progressively with time (recall for example the Kalman filter), is to form

$$ \theta_{n+1} = \theta_n + \frac{1}{n+1} \big( \xi_{n+1} - \theta_n \big), \qquad \theta_0 = 0, $$

which produces at step $n$ exactly the empirical average of the first $n$ observations.
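A quick sketch checking this equivalence numerically (the distribution and sample size are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
xi = rng.normal(loc=3.0, scale=1.0, size=10_000)  # i.i.d. observations

theta = 0.0
for n, sample in enumerate(xi):
    theta += (sample - theta) / (n + 1)   # recursive empirical average

print(theta, xi.mean())   # identical up to floating point rounding
```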
This simple recursion is the prototype of a stochastic approximation scheme: at each step, a small step of size $\gamma_n = 1/(n+1)$ is taken in a direction driven by the latest noisy observation. More generally, for a scheme of the form $\theta_{n+1} = \theta_n + \gamma_n (f(\theta_n) + D_{n+1})$, with $D_{n+1}$ a zero-mean noise, the iterates asymptotically track the trajectories of the ordinary differential equation³

$$ \dot{\theta} = f(\theta). $$

We will give a more formal proof of this fact in the basic case in Section 15.3.
Typically, for the simplest proofs, $\gamma_n$ must be decreasing to 0 and satisfy

$$ \sum_n \gamma_n = \infty, \qquad \sum_n \gamma_n^2 < \infty. $$
However other choices are possible, including constant small step sizes in some
cases, and in practice the choice of step sizes requires experimentation because
it controls the convergence rate. Some theoretical results regarding convergence
rates are also available but will not be covered here. The ODE method is extremely useful in any case, even if another technique is chosen for formal convergence
proofs, in order to get a quick idea of the behavior of an algorithm. Moreover,
another big advantage of this method is that it can be used to easily create new
stochastic approximation algorithms from convergent ODEs. We now describe
a few more classes of problems where these algorithms arise.
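For instance, here is a minimal sketch of a scalar stochastic approximation iteration with $\gamma_n = 1/n$; the drift $h$, the noise level, and all constants are illustrative choices, not from the notes. Despite noise much larger than the drift, the iterates converge to the equilibrium of the ODE $\dot{\theta} = h(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(1)

h = lambda x: 2.0 - x    # illustrative drift; the ODE xdot = h(x) settles at 2

x = 10.0
for n in range(1, 100_001):
    gamma = 1.0 / n                # sum gamma_n = inf, sum gamma_n^2 < inf
    D = rng.normal(scale=5.0)      # zero-mean i.i.d. noise D_{n+1}
    x += gamma * (h(x) + D)        # stochastic approximation update

print(x)   # close to 2.0, the equilibrium of the ODE
```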
³ By definition, $\dot{x} := \frac{d}{dt} x(t)$.
Figure 15.1: Consider a flow on a circle that moves clockwise everywhere ex-
cept at a single rest point. This rest point is the unique ω-limit point of the
flow. Now suppose the flow represents the expected motion of some underlying
stochastic process. If the stochastic process reaches the rest point, its expected
motion is zero. Nevertheless, actual motion may occur with positive probability
and in particular the process can jump past the rest point and begin another
circuit. Therefore in the long run all regions of the circle are visited infinitely
often. The long run behavior is captured by the notion of chain recurrence, as
all points on the circle are chain recurrent under the flow.
Definition of a Lyapunov function for a continuous-time system $\dot{x} = f(x)$: a continuously differentiable $V : \mathbb{R}^d \to [0, \infty)$ such that $\frac{d}{dt} V(x(t)) = \langle \nabla V(x(t)), f(x(t)) \rangle \le 0$ along trajectories. LaSalle's invariance principle: every bounded trajectory converges to the largest invariant set contained in $\{x : \langle \nabla V(x), f(x) \rangle = 0\}$.
where $\delta > 0$ is a small scalar. An issue with this algorithm is that it requires $2d$ function evaluations, and using one-sided differences still requires $d+1$ function evaluations, which might still be too costly. A nice development in this context is the simultaneous perturbation stochastic approximation (SPSA) method due to Spall. A basic version of this method considers i.i.d. random variables $\Delta_n \in \mathbb{R}^d$, with $\Delta_n$ independent of $D_1, \ldots, D_{n+1}$ and $x_0, \ldots, x_n$, and $P(\Delta_n^i = 1) = P(\Delta_n^i = -1) = \frac{1}{2}$ for each component $i$. Then replace the algorithm above by
$$ x_{n+1}^i = x_n^i + \gamma_n \left( -\frac{f(x_n + \delta \Delta_n) - f(x_n)}{\delta \Delta_n^i} + D_{n+1}^i \right), $$
which requires only two function evaluations. By Taylor’s theorem, for each i,
$$ \frac{f(x_n + \delta \Delta_n) - f(x_n)}{\delta \Delta_n^i} \approx \frac{\partial f}{\partial x_i}(x_n) + \sum_{j \neq i} \frac{\partial f}{\partial x_j}(x_n) \frac{\Delta_n^j}{\Delta_n^i}. $$
Now the expected value of the second term above is zero, and so it acts just like
another noise term that can be included in Dn+1 for the purpose of analysis.
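Here is a minimal sketch of this basic SPSA iteration on a toy quadratic with noisy measurements; the objective, noise level, and step sizes are illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative objective: f(x) = ||x - c||^2, minimized at c.
c = np.array([1.0, -2.0, 0.5])

def measure(x):
    """Noisy black-box evaluation of f."""
    return float(np.sum((x - c) ** 2)) + rng.normal(scale=0.5)

x = np.zeros(3)
delta = 0.1
for n in range(1, 20_001):
    gamma = 1.0 / n
    Delta = rng.choice([-1.0, 1.0], size=3)   # P(+1) = P(-1) = 1/2 per component
    g_hat = (measure(x + delta * Delta) - measure(x)) / (delta * Delta)
    x = x - gamma * g_hat                     # descent step: 2 evaluations total

print(x)   # approaches c
```

Note that the cost per iteration is two function evaluations regardless of the dimension $d$, which is the whole point of SPSA.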
A type of application quite close to our subject considers the optimization of an expected performance measure

$$ J(\theta) = \mathbb{E}_\theta[f(X)], $$
where $X$ is a random variable with a distribution $F_\theta$ that depends on a parameter $\theta$ to be adjusted in order to minimize $J(\theta)$ (in our context, $\theta$ is a policy). Now it is typically difficult to compute $J(\theta)$, but if we fix $\theta = \theta_n$, we can generate samples $f(X)$ with $X$ distributed according to $F_{\theta_n}$. Suppose that the laws $\mu_\theta$ corresponding to $F_\theta$ (i.e., $\mu_\theta((-\infty, x]) = F_\theta(x)$ for real-valued random variables) are all absolutely continuous with respect to a probability measure $\mu$, i.e., $d\mu_\theta(x) = \Lambda_\theta(x)\, d\mu(x)$, where the likelihood ratio $\Lambda_\theta(x)$ (or Radon-Nikodym derivative) is continuously differentiable in $\theta$. Then
$$ J(\theta) = \int f(x)\, d\mu_\theta(x) = \int f(x)\, \Lambda_\theta(x)\, d\mu(x). $$
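The point of this representation is that one can differentiate under the integral sign, $\nabla J(\theta) = \int f(x)\, \nabla_\theta \Lambda_\theta(x)\, d\mu(x)$, and estimate the gradient from samples; taking $\mu = \mu_{\theta_n}$ gives the familiar score-function form $\nabla J(\theta_n) = \mathbb{E}_{\theta_n}[f(X)\, \nabla_\theta \log p_\theta(X)|_{\theta_n}]$. A minimal sketch of the resulting estimator, under the illustrative assumption that $X \sim N(\theta, 1)$ and $f(x) = x^2$, so that $J(\theta) = \theta^2 + 1$ and $J'(\theta) = 2\theta$:

```python
import numpy as np

rng = np.random.default_rng(3)

theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=200_000)  # samples from F_theta
f = x ** 2                                          # J(theta) = E[X^2]

score = x - theta              # d/dtheta log p_theta(x) for N(theta, 1)
grad_estimate = np.mean(f * score)

print(grad_estimate, 2 * theta)   # estimator vs. exact gradient 2*theta
```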
Stochastic Fixed Point Iterations
$$ \|x\|_{p,w} := \Big( \sum_{i=1}^d w_i |x_i|^p \Big)^{1/p}, \qquad \text{or} \qquad \|x\|_{\infty,w} := \max_i\, w_i |x_i|, $$
where $w = [w_1, \ldots, w_d]^T$ with $w_i > 0$ for all $i$. Recall the Banach fixed point theorem 6.4.1, which says that a contraction has a unique fixed point. To analyze the behavior of the ODE (15.8), $\dot{x}(t) = F(x(t)) - x(t)$, where $F$ is an $\alpha$-contraction with fixed point $x^*$, we consider the Lyapunov function $V(x) = \|x - x^*\|_{p,w}$ for $x \in \mathbb{R}^d$ (the notation includes the case $p = \infty$). Note that the only equilibrium of (15.8) is $x^*$ and the only constant trajectory is $x(\cdot) \equiv x^*$.
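Before the proof, here is a minimal sketch of the natural stochastic approximation scheme for finding $x^*$, namely $x_{n+1} = x_n + \gamma_n (F(x_n) - x_n + D_{n+1})$, whose mean dynamics is the ODE (15.8); the affine contraction $F$ and the noise level are illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative alpha-contraction: F(x) = A x + b with ||A||_inf = 0.5 < 1,
# so F is a 0.5-contraction for the sup norm (weights w_i = 1).
A = np.array([[0.25, 0.25],
              [0.10, 0.40]])
b = np.array([1.0, -1.0])
F = lambda x: A @ x + b
x_star = np.linalg.solve(np.eye(2) - A, b)   # the unique fixed point

x = np.zeros(2)
for n in range(1, 50_001):
    gamma = 1.0 / n
    D = rng.normal(scale=1.0, size=2)        # zero-mean noise
    x = x + gamma * (F(x) - x + D)           # stochastic fixed point iteration

print(x, x_star)   # the iterates approach the fixed point
```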
Proof of the theorem. We start with the case $1 < p < \infty$. Define $\mathrm{sgn}(x) = +1$, $-1$, or $0$ depending on whether $x > 0$, $x < 0$, or $x = 0$. For $x(t) \neq x^*$, we have
$$
\begin{aligned}
\frac{d}{dt} V(x(t)) &= \frac{1}{p} \Big( \sum_{i=1}^d w_i |x_i(t) - x_i^*|^p \Big)^{(1-p)/p} \Big( p \sum_{i=1}^d w_i\, \mathrm{sgn}(x_i(t) - x_i^*)\, |x_i(t) - x_i^*|^{p-1}\, \dot{x}_i(t) \Big) \\
&= \|x(t) - x^*\|_{p,w}^{1-p} \sum_{i=1}^d w_i\, \mathrm{sgn}(x_i(t) - x_i^*)\, |x_i(t) - x_i^*|^{p-1} \big( F_i(x(t)) - x_i(t) \big) \\
&= \|x(t) - x^*\|_{p,w}^{1-p} \sum_{i=1}^d w_i\, \mathrm{sgn}(x_i(t) - x_i^*)\, |x_i(t) - x_i^*|^{p-1} \big( F_i(x(t)) - F_i(x^*) \big) \\
&\qquad - \|x(t) - x^*\|_{p,w}^{1-p} \sum_{i=1}^d w_i\, |x_i(t) - x_i^*|^{p-1}\, \mathrm{sgn}(x_i(t) - x_i^*) \big( x_i(t) - x_i^* \big) \\
&= \|x(t) - x^*\|_{p,w}^{1-p} \sum_{i=1}^d w_i\, \mathrm{sgn}(x_i(t) - x_i^*)\, |x_i(t) - x_i^*|^{p-1} \big( F_i(x(t)) - F_i(x^*) \big) - \|x(t) - x^*\|_{p,w} \\
&\le \|x(t) - x^*\|_{p,w}^{1-p}\, \|x(t) - x^*\|_{p,w}^{p-1}\, \|F(x(t)) - F(x^*)\|_{p,w} - \|x(t) - x^*\|_{p,w} \\
&\le -(1 - \alpha)\, \|x(t) - x^*\|_{p,w},
\end{aligned}
$$
where the first inequality is obtained using Hölder's inequality, valid for $1 < p < \infty$, and the last step uses the contraction property $\|F(x(t)) - F(x^*)\|_{p,w} \le \alpha \|x(t) - x^*\|_{p,w}$. Hence the time derivative is strictly negative for $x(t) \neq x^*$, which proves the claim for $1 < p < \infty$. The inequality can be written, for $t > s \ge 0$, as

$$ \|x(t) - x^*\|_{p,w} \le \|x(s) - x^*\|_{p,w} - (1 - \alpha) \int_s^t \|x(\tau) - x^*\|_{p,w}\, d\tau. $$

The claim then follows for $p = 1$ and $p = \infty$ by continuity of $p \mapsto \|x\|_{p,w}$ on $[1, \infty]$.
Let

$$ \nu_i(n) = \frac{1}{n} \sum_{m=1}^n 1\{\xi_m^i = s_i\}, $$

i.e., $\nu_i(n)$ is the frequency with which player $i$ played strategy $s_i$ up to time $n$. In the fictitious play model, an agent records the empirical frequency of its opponent's play and plays at each stage the best response, assuming that the opponent chooses its strategy randomly according to its empirical frequency. This best response for player $i$ is a map $f_i : [0, 1] \to [0, 1]$ which, based on the one-stage game, prescribes the probability with which player $i$ should choose its strategy $s_i$ if the probability that its opponent chooses $s_{-i}$ is $p_{-i}$. In this model, the empirical frequencies then evolve according to

$$ \nu_i(n+1) = \nu_i(n) + \frac{1}{n+1} \big( 1\{\xi_{n+1}^i = s_i\} - \nu_i(n) \big), \qquad i = 1, 2, $$

and the corresponding ODE is

$$ \dot{\nu}_i(t) = f_i(\nu_{-i}(t)) - \nu_i(t), \qquad i = 1, 2. $$
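As an illustration, here is a minimal simulation sketch of discrete fictitious play; the one-stage game (matching pennies) and all numerical choices are illustrative assumptions, not from the notes. For this zero-sum game the empirical frequencies are known to converge to the mixed equilibrium $(1/2, 1/2)$.

```python
import numpy as np

rng = np.random.default_rng(5)

# Matching pennies: player 0 wants to match the opponent's coin,
# player 1 wants to mismatch. nu[i] = empirical frequency of "heads".
def best_response(i, p_opponent_heads):
    if i == 0:   # matcher: heads iff the opponent favors heads
        return 1.0 if p_opponent_heads > 0.5 else 0.0
    else:        # mismatcher: heads iff the opponent favors tails
        return 1.0 if p_opponent_heads < 0.5 else 0.0

nu = np.array([1.0, 0.0])    # initial empirical frequencies
for n in range(1, 100_000):
    plays = np.array([
        float(rng.random() < best_response(0, nu[1])),
        float(rng.random() < best_response(1, nu[0])),
    ])
    nu += (plays - nu) / (n + 1)   # recursive frequency update

print(nu)   # approaches the mixed equilibrium (0.5, 0.5)
```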
Now we can consider many variations of the basic averaging rule (15.9).
For example, suppose that at period k the communication link from j to i fails
with probability 1 − pij . This probability can be made dependent on the past
and time dependent without change, but for simplicity, let us assume here that
the failures are i.i.d. Moreover, let’s assume that the difference xj (k) − xi (k) in
(15.9) is also perturbed by a zero mean noise νij (k) (due say to quantization or
communication errors), independent of the random link failures. The perturbed
averaging rule then becomes

$$ x_i(k+1) = x_i(k) + \epsilon_k \sum_{j \in N_i} \delta_{ij}(k+1) \big( x_j(k) - x_i(k) + \nu_{ij}(k+1) \big), \qquad i = 1, \ldots, n, $$
where $\{\delta_{ij}(k)\}_k$ is i.i.d. Bernoulli, with $P(\delta_{ij}(k) = 1) = p_{ij}$, and we allow for a time-varying (typically diminishing) step size $\epsilon_k$. Under broad conditions, this stochastic approximation asymptotically tracks the corresponding ODE

$$ \dot{x}_i(t) = \sum_{j \in N_i} p_{ij} \big( x_j(t) - x_i(t) \big). $$
Note that the set of equilibria of this equation is the one-dimensional subspace
x1 = . . . = xn under reasonable conditions on the underlying connectivity
graph and failure probabilities, hence consensus is obtained asymptotically.
However, the choice of step sizes, as often in such simple stochastic approxi-
mation algorithms, has a strong influence on the practical (transient) behavior
of the trajectories, see Fig. 15.2. One can also study asynchronous versions of
the averaging algorithm using the ODE method, which is perhaps more useful
from an engineering point of view.
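A minimal simulation sketch of the perturbed averaging rule, under illustrative assumptions (a ring network, a uniform failure probability $p_{ij} = p$, Gaussian communication noise; none of these specifics come from the notes):

```python
import numpy as np

rng = np.random.default_rng(6)

n, p, steps = 20, 0.7, 20_000
x = rng.normal(size=n)                         # initial values
neighbors = [((i - 1) % n, (i + 1) % n) for i in range(n)]   # ring graph

for k in range(1, steps + 1):
    eps = 1.0 / (k + 10)                       # diminishing step size
    x_new = x.copy()
    for i in range(n):
        for j in neighbors[i]:
            if rng.random() < p:               # link j -> i is up
                noise = rng.normal(scale=0.1)  # communication noise nu_ij
                x_new[i] += eps * (x[j] - x[i] + noise)
    x = x_new

print(x.std())   # near 0: approximate consensus despite failures and noise
```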
Figure 15.2: Transient behavior of the local averaging algorithm for different choices of step sizes: (a) $\epsilon_k = 10^{-3}$, (b) $\epsilon_k = 10^{-2}$. If we choose a constant step size, increasing it improves the convergence speed, but the communication noise is not well filtered. Decreasing step sizes for larger values of $k$ improves the asymptotic filtering property of the algorithm, but can also reduce the convergence speed if the steps decrease too fast. In fact, for constant step sizes in this problem, we only obtain asymptotic convergence to a neighborhood of the limit set of the ODE.
Sometimes the analysis is done by artificially forcing the iterates in (15.10) to remain bounded (say by truncation), which can actually be a useful device in applications. This requires considering a limiting ODE with reflection terms on the domain boundary [KY03]. But for general unbounded state spaces this is a stability assumption that must be proved separately, perhaps via means other than the ODE method, e.g., a method based on stochastic Lyapunov functions [KY03]. Under the stability assumption, the iterates (15.10) are expected to asymptotically track the ODE

$$ \dot{x}(t) = f(x(t)), \qquad t \ge 0. \qquad (15.12) $$
Assumption 1 ensures that this ODE has a unique solution for each x(0), which
depends continuously on x(0). The martingale difference assumption (15.11) is
a more precise definition of our earlier assumption of zero-mean noise Dn . We
allow conditioning on the past iterates, so this is a quite general set-up. Any
deterministic trend in the noise should be captured in f or the bias terms bn
in (15.4).
To make more formal the idea that the stochastic approximation asymptotically tracks the trajectories of the ODE, first define the sequence of times

$$ t_0 = 0, \qquad t_n = \sum_{m=0}^{n-1} \gamma_m. $$
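To see the tracking numerically, one can place the iterates at the times $t_n$ and compare them with the ODE solution started from the same point; a minimal sketch with an illustrative linear drift $f$ (not from the notes):

```python
import numpy as np

rng = np.random.default_rng(7)

f = lambda x: -x + 1.0                    # illustrative drift; ODE settles at 1

N = 5_000
gamma = 1.0 / (1.0 + np.arange(N))        # step sizes gamma_n
t = np.concatenate(([0.0], np.cumsum(gamma)))   # t_0 = 0, t_n = sum_{m<n} gamma_m

x = np.empty(N + 1)
x[0] = 5.0
for n in range(N):
    D = rng.normal(scale=2.0)             # martingale difference noise
    x[n + 1] = x[n] + gamma[n] * (f(x[n]) + D)

ode = 1.0 + (x[0] - 1.0) * np.exp(-t)     # exact solution of xdot = -x + 1
print(x[-1], ode[-1])   # the iterates shadow the ODE trajectory
```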
Note that the chain transitive invariant set of the theorem can be much
larger than the ω-limit set of the ODE, because it must essentially be “stable
under small perturbations”, recall Fig. 15.1. In practice, Lyapunov functions
are useful for further narrowing down the potential candidates for the limit
set. Suppose that we have a Lyapunov function $V : \mathbb{R}^d \to [0, \infty)$, continuously differentiable, such that $\lim_{\|x\| \to \infty} V(x) = \infty$, $H = \{x \in \mathbb{R}^d : V(x) = 0\} \neq \emptyset$, and $\frac{d}{dt} V(x(t)) = \langle \nabla V(x), f(x) \rangle \le 0$, with equality if and only if $x \in H$. Then we have the following corollary, under the same assumptions as for the theorem.
Corollary 15.3.4. If the only internally chain transitive invariant sets for (15.12) are isolated equilibrium points, then $\{x_n\}$ a.s. converges to a possibly sample path dependent equilibrium point.
"∞
Remark on the assumption 2
n=0 γn <∞
Consider the cumulative noise term

$$ \zeta_n = \sum_{m=0}^{n-1} \gamma_m D_{m+1}, \qquad n \ge 1, $$

in (15.10). We want to show that the effect of the noise becomes negligible asymptotically, as this is a basic ingredient in the proof of Lemma 15.3.1. Note that $\zeta_n$ is a (zero-mean) martingale, i.e.,

$$ E[\zeta_{n+1} \mid \mathcal{F}_n] = \zeta_n, \qquad n \ge 1, $$
which follows immediately from assumption 3. The definition of a martingale also requires $\zeta_n$ to be $\mathcal{F}_n$-measurable, which is immediate, and integrable. In fact, in this case $\zeta_n$ is even square integrable, i.e., $E[\|\zeta_n\|^2] < \infty$ for all $n$, which is a consequence of assumptions 1 and 3. Moreover,

$$ \sum_{n \ge 0} E\big[ \|\zeta_{n+1} - \zeta_n\|^2 \mid \mathcal{F}_n \big] = \sum_{n \ge 0} \gamma_n^2\, E\big[ \|D_{n+1}\|^2 \mid \mathcal{F}_n \big] \le \sum_{n \ge 0} \gamma_n^2\, K \big( 1 + \|x_n\|^2 \big) \le K(1 + B^2) \sum_{n \ge 0} \gamma_n^2 < \infty, \quad \text{a.s.,} $$
where $B = \sup_n \|x_n\| < \infty$ by assumption 4. We can then apply the martingale convergence theorem to conclude that $\zeta_n$ converges almost surely as $n \to \infty$. In particular, the noise entering the iterations after time $K$, i.e., $\sum_{n=K}^{\infty} \gamma_n D_{n+1}$, vanishes as $K \to \infty$. This ensures that the effect of the noise indeed becomes asymptotically negligible. Note here that this property relies on the assumption $\sum_{n \ge 0} \gamma_n^2 < \infty$, which is important to obtain a general theorem⁵.
⁵ The case of constant step sizes, which does not satisfy this assumption, is also well studied.