Dynamics of stochastic approximation algorithms
MICHEL BENAÏM
Séminaire de probabilités (Strasbourg), tome 33 (1999), p. 1-68
<http://www.numdam.org/item?id=SPS_1999__33__1_0>
Abstract
These notes were written for a D.E.A course given at Ecole Normale
Supérieure de Cachan during the 1996-97 and 1997-98 academic years and
at University Toulouse III during the 1997-98 academic year. Their aim
is to introduce the reader to the dynamical system aspects of the theory
of stochastic approximations.
Contents
1 Introduction  3
  1.1 Outline of contents  4
2 Some Examples  6
  2.1 Stochastic Gradients and Learning Processes  6
  2.2 Polya's Urns and Reinforced Random Walks  6
  2.3 Stochastic Fictitious Play in Game Theory  8
3 Asymptotic Pseudotrajectories  9
  3.1 Characterization of Asymptotic Pseudotrajectories  10
4 Asymptotic Pseudotrajectories and Stochastic Approximation Processes  11
  4.1 Notation and Preliminary Result  11
  4.2 Robbins-Monro Algorithms  14
  4.3 Continuous Time Processes  18
8 Shadowing Properties  35
  8.1 λ-Pseudotrajectories  36
  8.2 Expansion Rate and Shadowing  40
  8.3 Properties of the Expansion Rate  43
1 Introduction
Stochastic approximation algorithms are discrete time stochastic processes whose
general form can be written as
x_{n+1} − x_n = γ_{n+1} V_{n+1}   (1)
where x_n takes its values in some Euclidean space, V_{n+1} is a random variable
and γ_{n+1} > 0 is a "small" step-size.
Typically x_n represents the parameter of a system which is adapted over time,
with V_{n+1} = f(x_n, ξ_{n+1}). At each time step the system receives a new piece of information ξ_{n+1}
that causes x_n to be updated according to a rule or algorithm characterized
by the function f. Depending on the context, f can be a function designed by a
user so that some goal (estimation, identification, ...) is achieved, or a model
of adaptive behavior.
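As a concrete illustration (a minimal sketch, not taken from the text, with an arbitrary choice of step sizes γ_n = 1/(n+1) and a hypothetical update map f), the recursion (1) can be coded as follows.

```python
import numpy as np

def stochastic_approximation(f, x0, xi_sampler, n_steps, rng):
    """Iterate x_{n+1} = x_n + gamma_{n+1} * f(x_n, xi_{n+1})  -- cf. (1)."""
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for n in range(n_steps):
        gamma = 1.0 / (n + 1)      # "small" step size, gamma_n -> 0
        xi = xi_sampler(rng)       # new information received at time n + 1
        x = x + gamma * f(x, xi)   # update rule characterized by f
        path.append(x.copy())
    return np.array(path)
```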
The theory of stochastic approximations was born in the early 50s through
the works of Robbins and Monro (1951) and Kiefer and Wolfowitz (1952) and has
been extensively used in problems of signal processing, adaptive control (Ljung,
1986; Ljung and Soderstrom, 1983; Kushner and Yin, 1997) and recursive es-
timation (Nevelson and Khaminski, 1974). With the renewed and increased
interest in the learning paradigm for artificial and natural systems, the theory
has found new challenging applications in a variety of domains such as neural
networks (White, 1992; Fort and Pages, 1994) or game theory (Fudenberg and
Levine, 1998).
To analyse the long term behavior of (1), it is often convenient to rewrite the
noise term as
V_{n+1} = F(x_n) + U_{n+1}   (2)
where F : R^m → R^m is a deterministic vector field obtained by suitable averaging.
The examples given in Section 2 will illustrate this procedure. A natural
approach to the asymptotic behavior of the sequences {x_n} is then to consider
them as approximations to solutions of the ordinary differential equation (ODE)
dx/dt = F(x).   (3)
The sequence {x_n} can then be seen as a Cauchy-Euler scheme for numerically solving (3) with step size γ_n. It is natural to expect that, owing to
the fact that γ_n is small, the noise washes out and that the asymptotic behavior
of {x_n} is closely related to the asymptotic behavior of the ODE. This method,
called the ODE method, was introduced by Ljung (1977) and extensively studied
thereafter. It has inspired a number of important works, such as the book by
Kushner and Clark (1978), numerous articles by Kushner and coworkers, and
more recently the books by Benveniste, Metivier and Priouret (1990), Duflo (1996) and Kushner and Yin (1997).
The aim of this set of notes is to show how dynamical system ideas can be
fully integrated with probabilistic techniques to provide a rigorous foundation
to the ODE method beyond gradients or other dynamically simple systems.
However it is not intended to be a comprehensive presentation of the theory of
stochastic approximations. It is principally focused on the almost sure dynamics
of stochastic approximation processes with decreasing step sizes. Questions of
weak convergence, large deviation, or rate of convergence, are not considered
here. The assumptions on the "noise" process are chosen for simplicity and
clarity of the presentation.
These notes are partially based on a DEA course given at Ecole Normale
Superieure de Cachan during the 1996-1997 and 1997-1998 academic years and
at University Paul Sabatier during the 1997-1998 academic year. I would like to
especially thank Robert Azencott for asking me to teach this course and Michel
Ledoux for inviting me to write these notes for Le Séminaire de Probabilités.
An important part of the material presented here results from a collabora-
tion with Morris W. Hirsch and it is a pleasure to acknowledge the fundamental
influence of Moe on this work. I have also greatly benefited from numerous dis-
cussions with Marie Duflo over the last months which have notably influenced
the presentation of these notes. Finally I would like to thank Odile Brandiere,
Philippe Carmona, Laurent Miclo, Gilles Pages and Sebastian Schreiber for valuable insights and information.
Although most of the material presented here has already been published,
some results appear here for the first time and several points have been improved.
The main result of Section 5 admits several proofs (see Benaim, 1996), but I have chosen to present here the proof of Benaim and Hirsch (1996).
I find this proof conceptually attractive and it is somehow more directly related
to the original ideas of Kushner and Clark (1978).
Section 6 applies the abstract results of section 5 in various situations. It
is shown how assumptions on the deterministic dynamics can help to identify
the possible limit sets of stochastic approximation processes with a great deal of
generality. This section generalizes and unifies many of the results which appear
in the literature on stochastic approximation.
Section 7 establishes simple sufficient conditions ensuring that a given attrac-
tor of the ODE has a positive probability to host the limit set of the stochastic
approximation process. It also provides lower bound estimates of this probabil-
ity. This section is based on unpublished works by Duflo (1997) and myself.
Section 8 considers the question of shadowing. The main result of the sec-
tion asserts that when the step size of the algorithm goes to zero at a suitable
rate (depending on the expansion rate of the ODE) trajectories of (1) are al-
most surely asymptotic to forward trajectories of (3). This section represents
a synthesis of the works of Hirsch (1994), Benaim (1996), Benaim and Hirsch
(1996) and Duflo (1996) on the question of shadowing. Several properties and
estimates of the expansion rate due to Hirsch (1994) and Schreiber (1997) are
presented. In particular, Schreiber’s ergodic characterization of the expansion
rate is proved.
Section 9 pursues the qualitative analysis of section 7. The focus is on the
behavior of stochastic approximation processes near "unstable" sets. The cen-
terpiece of this section is a theorem which shows that stochastic approximation
processes have zero probability to converge toward certain repelling sets includ-
ing linearly unstable equilibria and periodic orbits as well as normally hyper-
bolic manifolds. For unstable equilibria this problem has often been considered
in the literature but, to my knowledge, only the works by Pemantle (1990) and
Brandiere and Duflo (1996) are fully satisfactory. I have chosen here to follow
Pemantle's arguments. The geometric part contains new ideas which allow us to
cover the general case of normally hyperbolic manifolds, but the probabilistic part
owes much to Pemantle.
2 Some Examples
2.1 Stochastic Gradients and Learning Processes
Let {ξ_i}_{i≥1}, ξ_i ∈ E, be a sequence of independent identically distributed random
inputs to a system and let x_n ∈ R^m denote a parameter to be updated, n ≥ 0.
We suppose the updating to be defined by a given map f : R^m × E → R^m and
the following stochastic algorithm:
x_{n+1} − x_n = γ_{n+1} f(x_n, ξ_{n+1}).   (4)
Let μ be the common probability law of the ξ_n. Introduce the average vector
field
F(x) = ∫_E f(x, ξ) μ(dξ)
and set
U_{n+1} = f(x_n, ξ_{n+1}) − F(x_n).
It is clear that this algorithm has the form given by (1), (2). Such processes are
classical models of adaptive algorithms.
A situation often encountered in "machine learning" or "neural networks" is
the following: Let I and O be Euclidean spaces and M : R^m × I → O a smooth
function representing a system (e.g. a neural network). Given an input y ∈ I and
a parameter x, the system produces the output M(x, y).
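A typical concrete instance (a sketch under assumptions that go beyond the text above: pairs ξ = (y, z) of inputs and desired outputs are observed, and f is chosen as the stochastic gradient of a quadratic error for a toy linear model M(x, y) = ⟨x, y⟩):

```python
import numpy as np

def f(x, xi):
    """Stochastic (negative) gradient of the error 0.5*(M(x, y) - z)^2
    for the toy linear model M(x, y) = <x, y>, with xi = (y, z)."""
    y, z = xi
    return -(x @ y - z) * y

rng = np.random.default_rng(0)
x_true = np.array([1.0, -2.0])
x = np.zeros(2)
for n in range(1, 20001):
    y = rng.normal(size=2)                 # random input
    z = x_true @ y + 0.1 * rng.normal()    # noisy desired output
    x += (1.0 / n) * f(x, (y, z))          # update of the form (4)
# x ends up close to x_true, a zero of the averaged vector field F
```

Here F(x) = E[f(x, ξ)] is minus the gradient of the mean squared error, so the associated ODE (3) is a gradient flow.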
2.2 Polya's Urns and Reinforced Random Walks
Let Δ^m denote the unit simplex of R^{m+1}:
Δ^m = { v ∈ R^{m+1} : v_i ≥ 0, Σ_i v_i = 1 }.
We consider Δ^m as a differentiable manifold, identifying its tangent space at any
point with the linear subspace TΔ^m = { v ∈ R^{m+1} : Σ_i v_i = 0 }.
An urn initially (i.e., at time n = 0) contains n_0 > 0 balls of colors 1, ..., m+1.
At each time step a new ball is added to the urn and its color is randomly chosen
as follows:
Let x_{n,i} be the proportion of balls having color i at time n and denote by
x_n ∈ Δ^m the vector of proportions x_n = (x_{n,1}, ..., x_{n,m+1}). The color of the
ball added at time n+1 is chosen to be i with probability f_i(x_n), where the f_i
are the coordinates of a function f : Δ^m → Δ^m.
Such processes, known as generalized Polya urns, have been considered by
Hill, Lane and Sudderth (1980) for m = 1; Arthur, Ermol’ev and Kaniovskii
(1983); Pemantle (1990). Arthur (1988) used this kind of model to describe
competing technologies in economics.
An urn model is determined by the initial urn composition (x_0, n_0) and the
urn function f : Δ^m → Δ^m. We assume that the initial composition (x_0, n_0) is
fixed once and for all. The σ-field F_n is the σ-field generated by the random variables
x_0, ..., x_n. One easily verifies that the sequence {x_n} satisfies the recursion
x_{n+1} − x_n = (1/(n_0 + n + 1)) ( f(x_n) − x_n + U_{n+1} ),
where U_{n+1} is a bounded random variable with E(U_{n+1} | F_n) = 0, so that {x_n} is of the form (1), (2) with γ_{n+1} = 1/(n_0 + n + 1) and F = −Id + f.
Observe that f being arbitrary, the dynamics of F = -I d + f can be arbi-
trarily complicated.
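A short simulation sketch (with a hypothetical urn function f on the one-dimensional simplex) makes the recursion concrete: on average the proportions move in the direction f(x_n) − x_n, with step size 1/(n_0 + n + 1).

```python
import numpy as np

def polya_urn(f, counts0, n_steps, rng):
    """Generalized Polya urn: one ball is added per step, its color i being
    drawn with probability f_i(x_n), where x_n is the vector of proportions."""
    counts = np.asarray(counts0, dtype=float)
    for _ in range(n_steps):
        x = counts / counts.sum()
        color = rng.choice(len(counts), p=f(x))
        counts[color] += 1.0
    return counts / counts.sum()

# hypothetical urn function with two colors (m = 1)
f = lambda x: np.array([x[0] ** 2, 1.0 - x[0] ** 2])
print(polya_urn(f, [1.0, 1.0], 20000, np.random.default_rng(0)))
# the proportions approach a zero of F(x) = f(x) - x
```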
Reinforced random walks give another family of examples. Let {X_n} be a random
process taking values in a finite set {1, ..., m+1} and, for each site j, let
S_j(n) = S_j(0) + Σ_{k=1}^{n} 1_{{X_k = j}},   n ≥ 0,
denote the (initially weighted) number of visits to j up to time n. The walk is
reinforced in the sense that the transition probability P(X_{n+1} = j | F_n) depends
on the visit counts S(n); the vector of normalized counts then satisfies a recursion
of the form (1), (2) (see Benaim, 1997).
The original idea of these processes is due to Diaconis, who introduced the
process defined by
P(X_{n+1} = j | F_n, X_n = i) = R_{i,j} S_j(n) / Σ_k R_{i,k} S_k(n),
with R_{i,j} ≥ 0. For this process, called a Vertex Reinforced Random Walk, the
probability of transition to site j increases each time j is visited. The long term
behavior of {x_n} has been analyzed by Pemantle (1992) for R_{i,j} = R_{j,i} and by
Benaim (1997) in the non-symmetric case. With a non-symmetric R the ODE
may have nonconvergent dynamics and the behavior of the process becomes
highly complicated (Benaim, 1997).
2.3 Stochastic Fictitious Play in Game Theory
Consider a game repeatedly played by two players, 1 and 2, each of whom has two possible actions, denoted {0, 1}.
Let {ξ_n} be a sequence of identically distributed random variables describing
the states of nature. The payoff to player i at time n is a function
U^i(·, ·, ξ_n) : {0,1}² → R. We extend U^i(·, ·, ξ_n) to a function U^i(·, ·, ξ_n) : [0,1]² → R
defined as the multilinear extension of U^i(·, ·, ξ_n) (the expected payoff when each
player j plays action 1 with probability x_j).
Consider now the repeated play of the game. At round n player i chooses an
action s^i_n ∈ {0,1} independently of the other player. As a result of these choices
player i receives the payoff U^i(s^1_n, s^2_n, ξ_n). The basic assumption is that U^i(·, ·, ξ_n)
is known to player i at time n but the strategy chosen by her opponent is not.
At the end of the round, both players observe the strategies played.
Fictitious play produces the following adaptive process: at time n+1, player 1
(respectively 2), knowing her own payoff function U^1(·, ·, ξ_{n+1}) and the strategies
played by her opponent up to time n, computes and plays the action which
maximizes her expected payoff under the assumption that her opponent will
play an action whose probability distribution is given by the historical frequency of
past plays. That is,
s^1_{n+1} = argmax_{s ∈ {0,1}} U^1(s, x^2_n, ξ_{n+1}),
where
x^i_n = (1/n) Σ_{k=1}^{n} s^i_k.
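For concreteness, here is a sketch of the classical (deterministic-payoff) version of this process, ignoring the random states of nature ξ_n; the payoff matrices below are an assumed example (a matching-pennies game), for which the empirical frequencies are known to converge to the mixed equilibrium (1/2, 1/2).

```python
import numpy as np

def fictitious_play(payoff1, payoff2, n_rounds, rng):
    """Each round, player i best-responds to the empirical frequency of the
    opponent's past actions; payoff_i[s1, s2] is the payoff to player i."""
    counts = np.zeros((2, 2))                 # counts[i, s]: times player i played s
    s = rng.integers(0, 2, size=2)            # arbitrary first round
    counts[0, s[0]] += 1; counts[1, s[1]] += 1
    for _ in range(1, n_rounds):
        x = counts / counts.sum(axis=1, keepdims=True)   # empirical frequencies x^i_n
        s1 = int(np.argmax(payoff1 @ x[1]))   # best reply of player 1 to x^2_n
        s2 = int(np.argmax(x[0] @ payoff2))   # best reply of player 2 to x^1_n
        counts[0, s1] += 1; counts[1, s2] += 1
    return counts / counts.sum(axis=1, keepdims=True)

A = np.array([[1.0, -1.0], [-1.0, 1.0]])      # matching pennies (assumed example)
print(fictitious_play(A, -A, 5000, np.random.default_rng(0)))
```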
3 Asymptotic Pseudotrajectories
A semiflow Φ on a metric space (M, d) is a continuous map
Φ : R_+ × M → M,  (t, x) ↦ Φ(t, x) = Φ_t(x),
such that
Φ_0 = Identity,  Φ_{t+s} = Φ_t ∘ Φ_s
for all t, s ≥ 0.
A continuous function X : R_+ → M is an asymptotic pseudotrajectory of Φ if
lim_{t→∞} sup_{0 ≤ h ≤ T} d( X(t + h), Φ_h(X(t)) ) = 0
for any T > 0. Thus, for each fixed T > 0, the curve h ↦ X(t + h), 0 ≤ h ≤ T,
shadows the Φ-trajectory of the point X(t) over the interval [0, T] with arbitrary
accuracy for sufficiently large t. By abuse of language we call X precompact if
its image has compact closure in M.
The notion of asymptotic pseudotrajectories has been introduced in Benaim
and Hirsch ( 1996) and is particularly useful for analyzing the long term behavior
of stochastic approximation processes.
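Numerically, the defining quantity of an asymptotic pseudotrajectory can be evaluated on a sampled path; the sketch below assumes that the path X and a numerical integrator for Φ are supplied by the user.

```python
import numpy as np

def apt_deviation(X, flow, t, T, dt=0.01):
    """Evaluate sup_{0 <= h <= T} d(X(t+h), Phi_h(X(t))) on a grid of h values.
    `X` is a callable returning a point of R^m and `flow(x, h)` approximates
    Phi_h(x); both are assumed given."""
    hs = np.arange(0.0, T + dt, dt)
    return max(np.linalg.norm(np.asarray(X(t + h)) - np.asarray(flow(X(t), h)))
               for h in hs)
```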
On C^0(R, M), consider the metric of uniform convergence on compact intervals:
d(f, g) = Σ_{k=1}^{∞} (1/2^k) min(1, d_k(f, g)),
where d_k(f, g) = sup_{t ∈ [−k, k]} d(f(t), g(t)).
On the other hand, Lemma 3.1 shows that any limit point of {Θ_t(X)} belongs to S_Φ.
This proves that (i) implies (ii).
Suppose now that (ii) holds. Since {X(t) : t ≥ 0} is relatively compact and
X is uniformly continuous, {Θ_t(X) : t ≥ 0} is equicontinuous and for each s ≥ 0
the set {Θ_t(X)(s) : t ≥ 0} is relatively compact in M. Hence by the Ascoli theorem (see
e.g. Munkres 1975, Theorem 6.1), {Θ_t(X) : t ≥ 0} is relatively compact in C^0(R, M).
Therefore lim_{t→∞} d(Θ_t(X), S_Φ) = 0, which
by Lemma 3.1 implies (i).
The above discussion also shows that (ii) implies (iii). QED
Remark 3.3 Let D(R, M) be the space of functions which are right continuous
and have left-hand limits (càdlàg functions). The definition of asymptotic pseudotrajectories
can be extended to elements of D(R, M). Since the convergence
of a sequence {f_n} ⊂ D(R, M) toward a continuous function f is equivalent to the uniform
convergence of {f_n} toward f on compact intervals, Lemma 3.1 continues
to hold and Theorem 3.2 remains valid provided that we replace the statement
that X is uniformly continuous by the weaker statement:
∀ε > 0 there exists α > 0 such that
limsup_{t→∞} sup_{0 ≤ h ≤ α} d( X(t + h), X(t) ) ≤ ε.
4 Asymptotic Pseudotrajectories and Stochastic Approximation Processes
4.1 Notation and Preliminary Result
We consider a stochastic approximation process given by the recursion
x_{n+1} − x_n = γ_{n+1} ( F(x_n) + U_{n+1} ),   (7)
where {γ_n}_{n≥1} is a given sequence of nonnegative numbers such that
Σ_k γ_k = ∞,  lim_{n→∞} γ_n = 0.
Set τ_0 = 0, τ_n = Σ_{i=1}^{n} γ_i, and define the continuous time affine and piecewise constant interpolated processes X, X̄ : R_+ → R^m by
X(τ_n + s) = x_n + s (x_{n+1} − x_n)/(τ_{n+1} − τ_n)  and  X̄(τ_n + s) = x_n
for 0 ≤ s < γ_{n+1}.
By construction, for all t ≥ 0,
X(t) − X(0) = ∫_0^t [ F(X̄(s)) + Ū(s) ] ds,   (9)
where Ū : R_+ → R^m denotes the piecewise constant process equal to U_{n+1} on [τ_n, τ_{n+1}).
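The following sketch (with an assumed linear vector field F(x) = −x, step sizes γ_n = 1/n and centered Gaussian noise) builds the iterates and the interpolation times τ_n; np.interp can then be used to evaluate the affine interpolated process X.

```python
import numpy as np

def robbins_monro(F, x0, noise, gammas, rng):
    """Run x_{n+1} = x_n + gamma_{n+1} (F(x_n) + U_{n+1})  -- cf. (7) -- and
    return the interpolation times tau_n = gamma_1 + ... + gamma_n and the iterates."""
    x = np.asarray(x0, dtype=float)
    taus, xs = [0.0], [x.copy()]
    for g in gammas:
        x = x + g * (F(x) + noise(rng))
        taus.append(taus[-1] + g)
        xs.append(x.copy())
    return np.array(taus), np.array(xs)

rng = np.random.default_rng(1)
taus, xs = robbins_monro(lambda x: -x, [2.0],
                         lambda r: r.normal(scale=0.5, size=1),
                         1.0 / np.arange(1, 5001), rng)
# X(t) = np.interp(t, taus, xs[:, 0]) tracks the solution of dx/dt = -x for large t
```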
The vector field F is said to be globally integrable if it has unique integral
curves. For instance a bounded locally Lipschitz vector field is always globally
integrable. We then have
Proposition 4.1 Let F be a continuous globally integrable vector field. Assume
that
A1. For all T > 0,
lim_{t→∞} Δ(t, T) = 0,
with
Δ(t, T) = sup_{0 ≤ h ≤ T} ‖ ∫_t^{t+h} Ū(s) ds ‖;   (10)
A2. sup_n ‖x_n‖ < ∞, or
A2'. F is Lipschitz and bounded.
Then the interpolated process X is an asymptotic pseudotrajectory of the flow Φ induced by F; that is, for all T > 0,
lim_{t→∞} sup_{0 ≤ h ≤ T} ‖ X(t + h) − Φ_h(X(t)) ‖ = 0.
LF(X)(S) =
X(0) +
13 F(X(u))du.
and
At(s) = [F(X(u)) -
Bt(s) U(u)du.
=
/t /
By assumption Al, limt~~ Bt 0 in =
X(u)1I = 1 F(X(s)) + l
Ky-(u) + ~u03C4m(u)U s d ~
For t large enough y(u) 1, therefore
)))) U(s)ds) [ ))
.It1
U(s)ds) ) + ) )
t
U
1 2A(t I , T I) . -
Thus
tut-~T tut+T
X* =
-1,T+ 1) + sup
-1~T+ 1)~
T) -1, T + 1))
and by equation (12)
s
IIX(t s) - 03A6s(X(t))~ ~
+ L F ))x( + u) - + + .
(iii) Σ_n γ_n^{1+q/2} < ∞.
Then assumption A1 of Proposition 4.1 holds with probability 1.
Proof For any T > 0, Burkholder's inequality (see e.g. Stroock, 1993) implies
E{ sup_{n ≤ k ≤ m(τ_n+T)} ‖ Σ_{i=n}^{k−1} γ_{i+1} U_{i+1} ‖^q } ≤ C_q E{ [ Σ_{i=n}^{m(τ_n+T)−1} γ_{i+1}² ‖U_{i+1}‖² ]^{q/2} }.   (13)
By Hölder's inequality (14) and the moment assumption on {U_n},
E{ [ Σ_{i=n}^{m(τ_n+T)−1} γ_{i+1}² ‖U_{i+1}‖² ]^{q/2} } ≤ C(q, T) Σ_{i=n}^{m(τ_n+T)−1} γ_{i+1}^{1+q/2} ≤ C(q, T) ∫_{τ_n}^{τ_n+T} γ̄(s)^{q/2} ds   (15)
for some constant C(q, T) > 0.
From the preceding inequality we get that
E( Δ(t, T)^q ) ≤ C(q, T) ∫_t^{t+T} γ̄(s)^{q/2} ds.   (16)
Hence
Σ_{k≥0} E( Δ(kT, T)^q ) ≤ C(q, T) ∫_0^∞ γ̄(s)^{q/2} ds < ∞,   (17)
so that
lim_{k→∞} Δ(kT, T) = 0
with probability one. On the other hand, for kT ≤ t ≤ (k+1)T,
Δ(t, T) ≤ 2 Δ(kT, T) + Δ((k+1)T, T).
Hence assumption A1 is satisfied. QED
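As a quick worked check (an illustration assuming polynomially decreasing step sizes, which the text itself does not single out), the summability condition used above holds as soon as the moment order q is large enough:

```latex
% assumption: \gamma_n = n^{-\alpha} with 0 < \alpha \le 1
\sum_n \gamma_n^{1+q/2} = \sum_n n^{-\alpha(1+q/2)} < \infty
\iff \alpha\Bigl(1+\tfrac{q}{2}\Bigr) > 1
\iff q > \frac{2(1-\alpha)}{\alpha}.
% e.g. for \alpha = 1 any q >= 2 works; for \alpha = 2/3 one needs q > 1.
```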
Remark 4.3 Suppose that is a sequence of random variables such that
is F_n-measurable. Then the conclusion of Corollary 4.2 remains valid
provided that we strengthen the assumption on to
sup C
n
~n
DD
for each c > 0. Then assumption A1 of Proposition 4.1 is satisfied with probability 1. Therefore, if A2 and A2' hold almost surely and F has unique integral
curves, the interpolated process X is almost surely an asymptotic pseudotrajectory
of the flow induced by F.
Proof Let
i=l 2 i-1
By the assumption on {Un}, is a supermartingale. Thus for any
~3>0
k-1
nkm(T,,+T) i=n
h
P( sup Zk(~8) > Zn(9) exp(,~ _ -~~B~~Z ~
03B32i+1 - 03B2).
~ exp(039 2~03B8~2
i=n
R =
~
i=n
(18)
P( Δ(t, T) ≥ α ) ≤ C exp( −α² / ( C' ∫_t^{t+T} γ̄(s) ds ) ) ≤ C exp( −α² / ( C' T γ̄(t) ) ).
L_t f(x) = Σ_i F_i(x) ∂f/∂x_i (x) + (ε(t)/2) Σ_{i,j} a_{i,j}(x) ∂²f/∂x_i ∂x_j (x)   (19)
Ljt(x) = 1 ~(t)Rm
(f(x + ~(t)v) -
that
(ii) a = (a_{i,j}) is an m × m matrix-valued continuous bounded function such that
a(x) is symmetric and nonnegative definite for each x ∈ R^m.
(iii) a family of positive measures on I~m such that
Ls f (xs )ds
(~~>
~0 exp(-c~(t )dt o0
~c~X~t))~~ > a)
Proof Set
Δ(t, T) = sup_{0 ≤ h ≤ T} ‖ X(t + h) − X(t) − ∫_t^{t+h} F(X(s)) ds ‖.
Let f(x) = exp(⟨θ, x⟩) and τ_n = inf{ t ≥ 0 : X([0, t]) ∩ B(0, n)^c ≠ ∅ }, where
B(0, n) = { x ∈ R^m : ‖x‖ ≤ n }. Since the measure μ_t has uniformly bounded
support, there exists r > 0 such that f(X(t ∧ τ_n)) = f_n(X(t ∧ τ_n)), where f_n
is a C^∞ function with compact support which equals f on B(0, n + r). Since
f_n(X(t)) − ∫_0^t L_s f_n(X(s)) ds is a martingale and τ_n a stopping time,
f(X(t ∧ τ_n)) − ∫_0^{t∧τ_n} L_s f(X(s)) ds is a martingale.
Therefore
P( Δ(t, T) ≥ α ) ≤ C exp( −α² / ( C' T ε(t) ) ).
The rest of the proof is now exactly as in Proposition 4.4. Details are left to the
reader. QED
Further we set Eq(Φ) for the set of equilibria, Per(Φ) for the closure of the set of
periodic orbits, L_+(Φ) = ∪_{x∈M} ω(x), L_−(Φ) = ∪_{x∈M} α(x) and
L(Φ) = L_+(Φ) ∪ L_−(Φ).
Chain Recurrence and Attractors
Equilibria, periodic and omega limit points are clearly "recurrent" points. In
general, we may say that a point is recurrent if it somehow returns near where
it was under time evolution.
A notion of recurrence related to slightly perturbed orbits and well suited
to analyse stochastic approximation processes is the notion of chain recurrence
introduced by Bowen ( 1975) and Conley (1978).
Let δ > 0, T > 0. A (δ, T)-pseudo-orbit from a ∈ M to b ∈ M is a finite
sequence of partial trajectories
{ Φ_t(y_i) : 0 ≤ t ≤ t_i },  i = 0, ..., k − 1,  t_i ≥ T,
such that
d(y_0, a) < δ,  d(Φ_{t_i}(y_i), y_{i+1}) < δ,  i = 0, ..., k − 1,  y_k = b.
Figure 1: θ' = f(θ)
f^{-1}(0) = { kπ : k ∈ Z }.
We have
Eq(Φ) = {0, π} = L_+(Φ) = L(Φ)
and
R(Φ) = S¹.
Internally chain recurrent sets are {0}, {π} and S¹. Remark that the set X = {0, π}
is a compact invariant set consisting of chain recurrent points. However,
X is not internally chain recurrent.
uniformly in x E W.
Lemma 5.2 Let U ⊂ M be an open set with compact closure. Suppose that
Φ_T(Ū) ⊂ U for some T > 0. Then there exists an attractor A ⊂ U whose basin
contains U.
The following proposition originally due to Bowen (1975) makes precise the
relation between the different notions we have introduced.
inf{~ OA C
> C This distance makes the space of closed subsets of M a
~. It follows that
Thus we have constructed an (ε, T) pseudo-orbit from p to p which lies entirely
in C ⊂ R(Φ). To conclude the proof it remains to show that R(Φ) is invariant. It
is clearly positively invariant. Let p ∈ R(Φ) and C_n as above. By extracting convergent
subsequences from {t_{k_n−1}} and {p_{k_n−1}} we obtain points τ ∈ [T, 2T] and p* ∈
R(Φ) such that Φ_τ(p*) = p. Hence p ∈ Φ_t(R(Φ)) for all 0 ≤ t ≤ T and, since p
is arbitrary, R(Φ) ⊂ Φ_t(R(Φ)) for all 0 ≤ t ≤ T. By the semiflow property this
implies R(Φ) ⊂ Φ_t(R(Φ)) for all t ≥ 0. QED
Corollary 5.6 Let x ∈ M (M not necessarily compact). If the forward orbit of x has compact closure, then
ω(x) is internally chain transitive.
Proof Let T = [0, 1] × cl(γ^+(x)) and let Ψ be the semiflow on T defined by Ψ_t(u, y) =
(e^{−t}u, Φ_t(y)). Clearly {0} × ω(x) is a global attractor for Ψ and points of {0} ×
ω(x) are chain recurrent for Ψ. Therefore R(Ψ) = {0} × ω(x). By Theorem 5.5,
R(Ψ) is invariant and internally chain recurrent. This implies the same for
ω(x) and, ω(x) being connected,
it is internally chain transitive by Proposition 5.3. QED
Given an asymptotic pseudotrajectory X, define its limit set
L(X) = ∩_{t ≥ 0} cl{ X(s) : s ≥ t }.
Theorem 5.7 Let X be a precompact asymptotic pseudotrajectory of Φ. Then
(i) L(X) is a nonempty, compact, connected, invariant and internally chain transitive set.
Proof We only give the proof of (i). We refer the reader to Benaim and Hirsch
(1996) for a proof of (ii) and further results. Since {X(t) : t ≥ 0} is relatively
compact, Theorem 3.2 shows that {Θ_t(X) : t ≥ 0} is relatively compact in
C^0(R, M) and lim_{t→∞} d(Θ_t(X), S_Φ) = 0. Therefore by Corollary 5.6 the omega
limit set of X for Θ, denoted ω_Θ(X), is internally chain transitive for the
semiflow Θ | S_Φ.
where t > 0 for a semiflow ~, and t E 1R for a flow. Since the property of being
chain transitive is (obviously) preserved by conjugacy it suffices to verify that
H(L(X)) =
Remark 5.8 Our proof of Theorem 5.7 follows from Benaim and Hirsch ( 1996) .
It has the nice interpretation that the limit set L(X) can be seen as an omega
limit set for an extension of the flow to some larger space. A more direct proof in
the spirit of Theorem 5.5 can be found in Benaim (1996) (see also Duflo 1996).
simple flow, then every non-stationary point of L belongs to a cyclic orbit chain
in L.
c A =
U A~
j=1
where A1, ..., An are compact invariant subsets of L. Then for every point p E L
either p ~ or there exists a finite sequence xi, ..., xk E L 1 A and indices
ii , ik such that
...
v* = inf{ V(x) : x ∈ L ∩ A }.
Let x ∈ L. The function t ↦ V(Φ_t(x)) being non-increasing and bounded, the
limit v(x) = lim_{t→∞} V(Φ_t(x)) exists. Therefore V(p) = v(x) for all
p ∈ ω(x). By invariance of ω(x), V is constant along trajectories in ω(x). Hence
ω(x) ⊂ A. This proves the claim.
By continuity of V and compactness of L ∩ A, v* ∈ V(L ∩ A). Since V(A)
has empty interior there exists a sequence {v_n}, v_n ∈ R \ V(A), decreasing
to v*. For n ≥ 1 let L_n = { x ∈ L : V(x) ≤ v_n }. Because V is a Lyapounov
function for A, Φ_t(L_n) ⊂ L_n for any t > 0. Hence by Lemma 5.2 and Proposition
5.3, L ⊂ L_n. Then L = ∩_{n≥1} L_n = { x ∈ L : V(x) = v* }. This implies L ⊂ A and
V(L) = {v*}. QED
Remark 6.5 The following example shows that the assumption that V(A) has
empty interior is essential in Proposition 6.4.
Consider the flow on the unit circle S¹ = R/2πZ induced by the differential
equation
dθ/dt = f(θ),
where f is a smooth, nonnegative, 2π-periodic function.
F(x) =
Assume
Proof Let A = Eq(Φ). By Sard's theorem (Hirsch, 1976, chapter 3) V(A) has
Lebesgue measure zero in R and the result follows from Proposition 6.4 applied
with the strict Lyapounov function V. QED
6.3 Attractors
Let X : R_+ → M be an asymptotic pseudotrajectory of Φ. For any T > 0 define
d_X(T) = sup_{k ∈ N} d( Φ_T(X(kT)), X(kT + T) ).   (23)
(i) p is an equilibrium.
(ii) p is periodic (i.e. Φ_T(p) = p for some T > 0).
(iii) There exists a cyclic orbit chain Γ ⊂ L which contains p.
Notice that this rules out trajectories in L which spiral toward a periodic orbit,
or even toward a cyclic orbit chain.
In view of Theorem 5.7 we obtain:
Theorem 6.15 Let ~ be a flow in an open set in the plane, and assume that
~t decreases area for t > 0. Then:
(a) L(X) is a connected set of equilibria which is nowhere dense and which does
not separate the plane.
(b) If Φ has at most countably many stationary points, then L(X) consists of
a single stationary point.
Example 6.16 Consider the learning process described in section 2.3. Assume
that the probability law of ξ_n is such that the functions h_1, h_2 are smooth. Then
the divergence of the vector field (6) at every point (x_1, x_2) is strictly negative.
This implies that Φ_t decreases area for t > 0. Since the interpolated process of
{x_n} is almost surely an asymptotic pseudotrajectory of Φ (use Proposition 4.4),
the results of Theorem 6.15 apply almost surely to the limit set of the sequence
{x_n}.
For more details and examples of nonconvergence with more than two players
see (Benaim and Hirsch, 1994; Fudenberg and Levine, 1998).
Throughout this section, X denotes a process defined on some probability space (Ω, F, P) with continuous (or càdlàg) paths taking values
in M.
We suppose that X(·) is adapted to a non-decreasing family of sub-σ-algebras
{F_t : t ≥ 0} and that for all δ > 0 and T > 0
P( sup_{0 ≤ h ≤ T} d( X(t + h), Φ_h(X(t)) ) ≥ δ | F_t ) ≤ r(t, δ, T)   (25)
with
lim_{t→∞} r(t, δ, T) = 0  and  ∫_0^∞ r(t, δ, T) dt < ∞.
This last condition is satisfied by most examples of stochastic approximation
processes (see section (4) and section (7.2) below).
Our goal is to give simple conditions ensuring that X converges with pos-
itive probability toward a given attractor. We develop here some ideas which
originally appeared in Benaim (1997) and Duflo (1997).
Hence it is clear that M B Att(X) is an open set almost surely disjoint from
L(X).
It remains to prove that given any p E Att(X) and T > 0, T(p) E Att (X ) .
Fix f > 0. By continuity of ~T there exists a > 0 such that ~T (Ba ( p} ) C
P(L(X) C A) > 0.
By Lemma (6.8)
{Tn oo} n { sup d(X(s + T) ~~(X (s)) - ~} C {L(X ) C A}.
Hence
p(L(X) C A) >
~ E[P( sup d(X (s+T), -
k>(2"tJ-~1
>
~, (1-’~(tn(k)~~~T’))p(Tn =
tn(k)) >_ (1- ~).
k>(2nt~+1
Since oo) =
P(3s > t: X (s) E U) we obtain
7.2 Examples
Proposition 7.4 Let F : R^m → R^m be a Lipschitz vector field. Consider the
diffusion process
dX = F(X) dt + ε(t) dB_t
where ε is a positive decreasing function such that for all c > 0
∫_0^∞ exp( −c / ε(t) ) dt < ∞.
Then
Then
QA =
lim d(X(t), A)
{ t-~~ =
0} _ {L(X) C A}
33
has positive probability and for each open set U relatively compact with
UC B(A)
> P(3s > t : X(s) E
with ~ and T given by Lemma 6.8 and C, C(T ) are positive constant (de-
pending on F.)
(ii) On S2A L(X) is almost surely internally chain transitive.
(iii) If F is a dissipative vector field with global attractor A
-1 ~X(t)~ =
~) > 0.
Proof (i) follows from the fact that the law of X(t) has positive density with
respect to the Lebesgue measure. Hence Att(X) = R^m and Theorem 7.3 applies.
The lower bound for P(Ω_A) follows from Theorem 7.3 combined with Proposition
4.6, (iii). Statement (iii) follows from Theorems 7.3 and 6.11. QED
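The next sketch simulates the diffusion of Proposition 7.4 by an Euler-Maruyama scheme, for an assumed double-well drift F(x) = x − x³ (attractors {−1} and {+1} for the ODE) and ε(t) = (1 + t)^{-1/2}, which satisfies ∫_0^∞ exp(−c/ε(t)) dt < ∞ for every c > 0.

```python
import numpy as np

def decreasing_noise_diffusion(F, x0, T, dt, rng):
    """Euler-Maruyama discretization of dX = F(X) dt + eps(t) dB_t with
    eps(t) = (1 + t) ** -0.5."""
    x = np.asarray(x0, dtype=float)
    for k in range(int(T / dt)):
        t = k * dt
        eps = (1.0 + t) ** -0.5
        x = x + dt * F(x) + eps * np.sqrt(dt) * rng.normal(size=x.shape)
    return x

rng = np.random.default_rng(2)
print(decreasing_noise_diffusion(lambda x: x - x ** 3, [0.1], T=200.0, dt=0.01, rng=rng))
# the output is typically close to one of the attractors -1 or +1
```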
Similarly we have
Proposition 7.5 Let F : R^m → R^m be a Lipschitz bounded vector field. Consider
a Robbins-Monro algorithm (7) satisfying the assumptions of Proposition
4.2 or 4.4. Then
(i) For each attractor A ⊂ R^m whose basin has nonempty intersection with
Att(X), the event
Ω_A = { lim_{t→∞} d(X(t), A) = 0 } = { L(X) ⊂ A }
has positive probability, and for each open set U relatively compact such
that Ū ⊂ B(A), the same lower bound holds with
r(δ, T, s) = C'(T, q) γ̄(s)^{q/2} / δ^q
under the weaker assumptions given by Proposition 4.2, with δ and T given
by Lemma 6.8. Here C, C(T), C'(T, q) denote positive constants.
(ii) On Ω_A, L(X) is almost surely internally chain transitive.
(iii) If F is a dissipative vector field with global attractor A,
P(Ω_A) = P( limsup_{n→∞} ‖x_n‖ < ∞ ) > 0.
7.3 Stabilization
Most of the results given in the preceding sections assume a precompact asymp-
totic pseudotrajectory X for a semiflow ~. Actually when X is not precompact
the long term behavior of X usually presents little interest (See Corollary 6.11). .
Let
x_{n+1} − x_n = γ_{n+1} ( F(x_n) + U_{n+1} )
be a Robbins-Monro algorithm (section 4.2). Suppose that there exists a C²
+ + .
Wn - +
s>n
and Vn =
V(xn) + Wn Vn is nonnegative and
E(Vn+1 - +
8 Shadowing Properties
In this section we consider the following question:
Given a stochastic approximation process such as (7) (or more generally an
asymptotic pseudotrajectory for a flow ~) does there exist a point x such that
the omega limit set of the trajectory {~t(x) : t > 0} is L(X) ?
The answer is generally negative and L(X) can be an arbitrary chain tran-
sitive set. However it is useful to understand what kind of conditions ensure a
positive answer to this question. A case of particular interest in applications
is given by the following problem: Assume that each ~- trajectory converges
toward an equilibrium. Does X converge also toward an equilibrium ?
The material presented in this section is based on the works of Hirsch ( 1994),
Benaim (1996), Benaim and Hirsch ( 1996), Duflo (1996) and Schreiber (1997).
We begin with an illustrative example borrowed from Benaim (1996) and Duflo
(1996).
Example 8.1 Consider the Robbins-Monro algorithm given in polar coordinates
(ρ, θ) by the system
ρ_{n+1} − ρ_n = γ_{n+1} ρ_n h(ρ_n²),
θ_{n+1} − θ_n = γ_{n+1} ( g(ρ_n) sin²(θ_n) + ξ_{n+1} ),
where {ξ_n} is a sequence of i.i.d. random variables uniformly distributed on
[−1, 1] and {γ_n} satisfies the condition of Proposition 4.4. The function h is a
smooth function such that h(1) = 0 and −4 ≤ h(u) ≤ −3 for
u ≥ 4, g(ρ) = ρ², and γ_n ≤ 1/4 for all n. These choices ensure that the
algorithm is well defined (i.e. ρ_0 > 0 implies ρ_n > 0 for all n ≥ 0).
We suppose given ρ_0 > 0. It is then not hard to verify that there exist some
constants 0 < k(ρ_0) ≤ K(ρ_0) such that k(ρ_0) ≤ ρ_n ≤ K(ρ_0) for all n ≥ 0.
Let F : R² → R² be the vector field defined by
F(x, y) = ( x h(x² + y²) − y³,  y h(x² + y²) + x y² ).   (26)
Then X_n = (x_n, y_n) = (ρ_n cos θ_n, ρ_n sin θ_n) satisfies a recursion of the form
X_{n+1} − X_n = γ_{n+1} ( F(X_n) + U_{n+1} ) + O(γ_{n+1}²),
where {U_n} is a sequence of bounded random variables such that E(U_{n+1} | F_n) = 0.
Let Φ be the flow induced by F (see Figure 2). Equilibria of Φ are the points
a = (−1, 0), b = (0, 0), c = (1, 0), and every trajectory of Φ converges toward one
of these equilibria. Internally chain transitive sets are the equilibria {a}, {b}, {c}
and the unit circle S¹ = {ρ = 1}, which is a cyclic orbit chain.
Since {(x_n, y_n)} lives in some compact set disjoint from the origin, Theorem
5.7 combined with Proposition 4.4 and Remark 4.5 imply that the limit set of
{(x_n, y_n)} is almost surely one of the sets {a}, {c} or S¹. We claim that if Σ_n γ_n² =
∞ then this limit set is almost surely S¹. Suppose on the contrary that {(x_n, y_n)}
converges toward one of the points a or c. Then lim_{n→∞} d(θ_n, πZ) = 0 and, since
θ_{n+1} = θ_n + γ_{n+1}( g(ρ_n) sin²(θ_n) + ξ_{n+1} ), the sequence Σ_{i=1}^n γ_i ξ_i must converge. On the other hand, by
the law of the iterated logarithm for martingales, limsup_{n→∞} Σ_{i=1}^n γ_i ξ_i = +∞.
A contradiction.
This example shows that the limiting behavior of a stochastic approximation
process can be quite different from the limiting behavior of the associated ODE.
We will show later (see Example 8.16) that {(x_n, y_n)} actually converges toward
one of the points a or c provided that γ_n goes to zero "fast enough".
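A simulation sketch of this example (with the assumed choices h(u) = 1 − u and γ_n = 1/√(n + 16), so that Σ_n γ_n² = ∞ and γ_n ≤ 1/4) illustrates the claim: the radius stabilizes near 1 while the angle keeps drifting, so the iterates circulate along the unit circle instead of converging to a or c.

```python
import numpy as np

def polar_robbins_monro(n_steps, gamma, h, rng):
    """Sketch of Example 8.1: rho_{n+1} = rho_n + gamma_{n+1} rho_n h(rho_n^2),
    theta_{n+1} = theta_n + gamma_{n+1} (rho_n^2 sin(theta_n)^2 + xi_{n+1}),
    with xi_{n+1} uniform on [-1, 1]."""
    rho, theta = 2.0, 0.3
    for n in range(1, n_steps + 1):
        g, xi = gamma(n), rng.uniform(-1.0, 1.0)
        rho, theta = (rho + g * rho * h(rho ** 2),
                      theta + g * ((rho * np.sin(theta)) ** 2 + xi))
    return rho * np.cos(theta), rho * np.sin(theta)

rng = np.random.default_rng(3)
print(polar_robbins_monro(10 ** 5, lambda n: 1.0 / np.sqrt(n + 16.0),
                          lambda u: 1.0 - u, rng))
```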
8.1 λ-Pseudotrajectories
Let X denote an asymptotic pseudotrajectory for a semiflow Φ on the metric
space M. For T > 0 let
e(X, T) = limsup_{t→∞} (1/t) log ( sup_{0 ≤ h ≤ T} d( X(t + h), Φ_h(X(t)) ) )
and define the asymptotic error rate of X to be
e(X) = sup_{T > 0} e(X, T).
Lemma 8.2 If the {Φ_t} are Lipschitz, locally uniformly in t ≥ 0, then e(X, T) = e(X) for every T > 0.
Main Examples
Our main example of 03BB-pseudotrajectories is given by stochastic approximation
processes whose step sizes go to zero at a "fast" rate:
Assume that F is a Lipschitz and bounded vector field and that (U_n) satisfies the
almost surely. Since α can be chosen arbitrarily close to −λ/2, this proves that
almost surely, and we conclude the proof by using inequality (11). QED
Remark 8.4 If yn =
f (n) for some positive decreasing function with f (s)ds =
oo, then
Similar to Proposition 8.3 is the next proposition whose proof is left to the
reader: .
d(~t(x)~ K) - K)
for all x ~ B and t ~ 0.
B for all t > 0 and let Y(t) E Ii be a point nearest to X (t). Then .
(i)
lim sup t logg d( X()~t Y ())
t .
(ii) If the {03A6t} are Lipschitz, locally uniformly in t > 0 then Y is 03B103B2-pseudotrajectory
for ~~Ii
Proof Choose 0 E -~i and choose T> 0 large enough such that
d(~T(x)~ K) k.)
for all x E B. Thus there exists to such that for t > to
d(X(t+T), ~T(X (t))+d(~T(X (t)), l~ ) K). .
Let vk =
d(X (kT), li’), p = and ko = + l. Then vk+1 pk + pvk
for k > ko. Hence
Vko+m ~ + Vko)
for m > 1. It follows that
kT - T
03B2 + f.
Also for kT t (k + I)T and k > ko
d(X(t), K) X (t)) + K)
+ +
Thus
E(Φ, K) = liminf_{t→∞} (1/t) log inf_{x∈K} m(DΦ_t(x)),
where
m(DΦ_t(x)) = inf{ ‖DΦ_t(x) v‖ : ‖v‖ = 1 }
denotes the minimal norm of DΦ_t(x). Observe that since Φ is a flow,
m(DΦ_t(x)) = ‖ (DΦ_t(x))^{-1} ‖^{-1}.
We now state a shadowing result due to Benaim and Hirsch (1996) whose proof
is an (easy) adaptation of Hirsch’s shadowing theorem (Hirsch,1994). .
A.
t-cn t
>
XEK
Set f = &T, yk =
,I(kT) and fix w such that e03BBT w minxEK m(D f(z)).
Thus for k large enough
d(Yk+if(Yk)) W~ .
(27)
By continuity of D f and compactness of Ii there exists a neighborhood U of Ii
such that
minm(D
XEU
f(z)) =
p > w..
(28)
Claim: There exists a neighborhood N c U of I and p* > 0 such that
z E C f (B(yk-i,
where the last inclusion follows from the claim. This proves (29). .
i>0
(d) /3 =
sup(a, A) min~0, E(~, K)}. .
limsup
t
t-+oo
g ( ~) +( ))_ ~3.
dX =
F(X)dt +
then Corollary 8.10 applied with A = K = r and Proposition 8.5 imply that for
almost every w E Qr there exists x(w) E r such that
MC(Φ) = ∪_{μ ∈ M(Φ)} supp(μ)
and
BC(Φ) = closure of { x ∈ M : x ∈ ω(x) },
where M(Φ) denotes the set of Φ-invariant Borel probability measures. By the
Poincaré recurrence theorem (Mañé, 1987, chapter 1)
MC(Φ) ⊂ BC(Φ).   (30)
3 the one a functional analyst would call weak*
Let μ ∈ M(Φ). By the celebrated Oseledec theorem (see e.g. Mañé 1987,
chapter 11) there exists a Borel set R ⊂ M of full measure (μ(R) = 1) such that
for all x ∈ R, there exist numbers λ_1(x) ≥ ... ≥ λ_{r(x)}(x) and a decomposition of
T_xM into
T_xM = E_1(x) ⊕ ... ⊕ E_{r(x)}(x)
where the infimum is taken over all ergodic measures with support in K.
Proof Let f = Φ_1 and let X̂ = { (x, v) : x ∈ K, v ∈ T_xM, ‖v‖ = 1 }.
= =
= =
’ ’
n-~oo n n-m n
Define a map G : X̂ → X̂ by
G(x, v) = ( f(x), Df(x)v / ‖Df(x)v‖ ).
i=O
Therefore
/
x
hd03B8 = lim
n-m
/
x hd03B8n
= lim
n-oJ n
=
Now, by the ergodic decomposition theorem (see Mañé, 1987 chapter 6 Theorem
6.4)
p
inf . Ai (p)
ergodic
S(&, K) .
Ai(p) * 1 n log(~Dfn(x)v~)
n-co n
> S(&, I)
QED
Corollary 8.13
E(Φ, K) = E(Φ, MC(Φ|K)) = E(Φ, BC(Φ|K)).
Proof The first equality follows from Theorem 8.12 and the second from the
Poincaré recurrence theorem (equation (30)). QED
In particular, when BC(Φ|K) consists of equilibria,
E(Φ, K) = min{ α_1(p) : p ∈ Eq(Φ) ∩ K },
where α_1(p) denotes the smallest real part of the eigenvalues of the Jacobian matrix
DF(p).
Proof Under the assumption that BC(Φ|K) ⊂ Eq(Φ), every ergodic measure
with support in K has to be a Dirac measure at an equilibrium point. Let δ_p
be such a measure. Then λ_1(δ_p) = α_1(p) and the result follows from Theorem
8.12. QED
with 03B2 =
infx~MC(03A6|K) p(x). Therefore
Example 8.16 Let (x_n, y_n) ∈ R² be the Robbins-Monro algorithm described
in Example 8.1. It is convenient here to express the dynamics of the vector field
(26) in polar coordinates. That is
dρ/dt = ρ h(ρ²),   dθ/dt = (ρ sin θ)².
Let BE = ~ ~ 1- E} . For E « 1 and (p, 9) E BE
~~~ ~~~ _2(1 " P)P(1- P ) _ 2(1 " P) (2 E)(1- e)
i = -
03C0. Thus
0 is the eigenvalue of the linearized ODE at equilibria
= 0. Suppose now that
Then Corollary 8.10 and Proposition 8.3 imply that {(x_n, y_n)} converges almost
surely toward one of the points a or c of Figure 2.
~t(UnS) C S
(iii) There exist λ > 0 and C > 0 such that for all p ∈ Γ, w ∈ E^u_p and t ≥ 0,
‖DΦ_t(p) w‖ ≥ C e^{λt} ‖w‖.
Examples
Linearly Unstable Equilibria: Suppose Γ = {p} where p ∈ R^m is a linearly
unstable equilibrium of F. Then R^m = E^s_p ⊕ E^c_p ⊕ E^u_p where E^s_p, E^c_p and E^u_p
are the generalized eigenspaces of DF(p) corresponding to eigenvalues with real
parts negative, zero and positive, respectively.
(I ~ (33)
il D~_t (p) I Ep >
E^c_p = span(F(p)),
where {p} × E^u_p denotes the fibre of E^u(Γ) over p, and similar notation applies to
E^s_p and E^c_p.
Because Γ is linearly unstable, the dimension of E^u_p is at least 1.
Each E^u_p is a linear subspace of R^m. The map p ↦ E^u_p is a continuous
map from Γ into the Grassmann manifold of linear subspaces of the appropriate
dimension (it is actually C^k due to the fact that TΦ_t maps {p} × E^u_p to {Φ_t(p)} ×
E^u_{Φ_t(p)} and that TΦ is a C^k flow).
For p ∈ Γ and sufficiently small ε > 0, the local stable manifold of p is defined
to be the set
W^s_ε(p) = { x : d(Φ_t(x), Φ_t(p)) ≤ ε for all t ≥ 0 and lim_{t→∞} d(Φ_t(x), Φ_t(p)) = 0 }.
Using stable manifold theory, we take ε small enough so that W^s_ε(p) is a C^k
(iii) There exists a neighborhood N(Γ) of Γ and b > 0 such that for all unit
vectors v ∈ R^m,
E( ⟨U_{n+1}, v⟩^+ | F_n ) ≥ b.
(iv) There exists 0 a 1 such that:
Then
03B 1n+lim~03A=n+1B3 2i
= 0.
P( lim_{n→∞} d(x_n, Γ) = 0 ) = 0.
Step 1 The first step of the construction is to replace the continuous invariant
splitting T_Γ R^m = T_Γ S ⊕ E^u by a smooth (noninvariant) splitting
T_Γ S ⊕ Ẽ^u close enough to the first one to control the expansion of DΦ_t along
the fibers of Ẽ^u.
Choose T > 0 large enough so that for f = Φ_T, p ∈ Γ, and w ∈ E^u_p:
‖Df(p) w‖ ≥ 5 ‖w‖.
By the Whitney embedding theorem (Hirsch, 1976, chapter 1) we can embed
G(d, m) into R^D for some D ∈ N large enough, so that we can see p ↦ E^u_p as
a map from Γ into R^D. Thus by the Tietze extension theorem (Munkres, 1975) we
can extend this map to a continuous map from R^m into R^D. Let r denote a C^∞
retraction from a neighborhood of G(d, m) ⊂ R^D onto G(d, m), whose existence
follows from a classical result in differential topology (Hirsch, 1976, chapter 4).
By composing the extension of p ↦ E^u_p with r we obtain a continuous map
defined on a neighborhood N of Γ, taking values in G(d, m), and which extends
p ↦ E^u_p. To shorten notation, we keep the notation p ↦ E^u_p ∈ G(d, m) to
denote this new map.
By replacing N by a smaller neighborhood if necessary, we can further assume
that
‖Df(p) w‖ ≥ 4 ‖w‖
for all
Now, by standard approximation procedure, we can approximate
a
and let
p : TpS ~ Ep ~ Tp S,
u+v--~u.
Fix E > 0 small enough so for all pEN, f(p) ~+4) 1 and f(p)I ( I I Pp I I+
~.+ 1 ) ) 1 (this choice will be clarified in the next lemma). From now on,
we will assume that the map Eup
E G(d, m) is chosen such that for all
p~N~S:
.
(1) =
TpS ® Eup
( ii ) The projector Pp TpS ® Eup ~ TpS, satisfies
:
~Pp-Pp~ ~ ~.
Let Eu{(p, v) E S n N x
= v E
Ep }.
Since S is C1+a, Eu is a vector
bundle over S n N. Let H -~ 1~m be the map defined by
H(p, v) = p + v.
It is easy to see that the tangent map of H at a point
(p, 0) is invertible. The
inverse function theorem implies that H is a local diffeomorphism at each
point of the zero section of E. Since H maps the zero section to S n N by the
diffeomorphism (p, o) - p, it follows that H restricts to a diffeomorphism
H : N~ --~ No between open neighborhoods No or the zero section and No ~ N
of S n N. We now define the maps
11: s,
x ~
DII(p) =
Pp.
Step 2 The second step consists in the construction of a Lyapounov function
which is zero on S and increases exponentially along trajectories outside S. This
function (see Proposition 9.5) is obtained from V by some averaging procedure.
Lemma 9.3 There exists a neighborhood of r, Nl ~ No, and p > 1 such that
for all x ~ N1
>
(i) PII _ E~
(ii) > 03B1~u~ for all u E
Then
= )
~ ~~A~(~Id-~+~P~) .
QED
Proof of Lemma 9.3
let x E No n f-1(No) and set p =
II(x).
? =
- .
Lemma 9.4 (i) applied to D f (p), Pp, Pp and our choice for E imply
-P)~~ ~ =
3V(x)
Also, by Lemma 9.4 (ii)
IIPI(n)Df(p)(Id-Pa)-Pf(p)D.f(p)(Id-Pr)II1 _ 1.
Pp)~~
1. This implies
~D03A0(f(p))Df(p)(x p)~ ~x - -
p~ =
V(x).
It follows that V(f(x)) > 3V(x) -
Dη(x)·h = lim_{t↓0} ( η(x + t h) − η(x) ) / t
exists. If η is differentiable at x, then Dη(x)·h = ⟨∇η(x), h⟩, where ∇η(x) ∈ R^m
is the usual gradient.
η(x) = ∫_0^l V(Φ_{−t}(x)) dt
enjoys the following properties:
(i) η is C^r on N(Γ) \ S.
(ii) For all x ∈ N(Γ) ∩ S, η admits a right derivative Dη(x) : R^m → R which
is Lipschitz, convex and positively homogeneous.
clllv - .
+ v) =
= -
I_ ~
(34)
We first fix 1 > 2T and assume that N(r) C Nt. We will see below (in proving
(vi)) how to choose l.
(i) is obvious.
(ii) follows from the fact that II is Cl, and .c -~ admits a right derivative
at the origin of Rm given as
Before passing to the proof of (iii) let us compute For x E Nt let
Gt(x) =
~_t(x) -
B(t, z) - -
(Id -
for x ~ Nl B S and
=
J0 (36)
for x ~ Nl ~ S.
(iii) If r ~ 1 + a, ~ and II are C~ with a Holder derivatives. Hence there
exits k > 0 such that
~_t(x) - =
(B(t, ~_t(z)-1I(~_~(x))) _
(B(t, °
Thus, if we set h =
b(x) we get
>
Co this implies
(B(t, > co/2
for all v E Now the claim together with (36) imply that _
Dn(x)vi I.
(vi) For x E N~ and 0 t 1 we can write t = kT + r for kEN and
0 r T. Thus, by Lemma 9.3 and equation (34),
v(~t(x)) _
>_ PkV (~r(~)) >_ >
where C 1( T=
For s > 0,
PC1(T) and a =
It follows that
D~(x).F(x)~03B2~(x)
that ~ ~"
n
oo and let an =
L
s=n.f.1
~. .
(i~ I=
(it )
(iii)
(iv)
Then ~ =
0) = 0.
This lemma is stated and proved in (Pemantle, 1992) in a particular case, but the
proof adapts without difficulty to the present situation.
Proof Assume without loss of generality that No = 0, | Xn |~ b103B1n and
~n ~
where
2(6i+62)ai.
Given n ~ N let T be the stopping time defined as
T =
5’,~
Claim:
~
ai
(37)
k
cr =
5’, .
where the last term is nonnegative by condition (it) . Therefore by Doob’s de-
composition Lemma there exist a martingale {Mi}i~n and a previsible process
{~}~n such that =
M,+7,, In = 0 and 7,+i ~ 7,. The fact that 5’,Ay ~ M,
implies
P(r =
oo~) ~ P(V. ~ : M, ~ ~~/62~!~).
Thus
P(~ =
oo)~)l~ ~ M, - -~~/~’~~)lE. (39)
Our next goal is to estimate the right hand term of (39). Set Af~ =
At, 2014
M~.
For ~ n ::
t-i
E((M,+i - =
F((~ - (~ - /,)~ ~ 02~+1
by condition (t~). Therefore for s > 0, ~ ~ n and > 0
~ E(M~)~)+~ -
(S+~ (s+~22
where the last two inequalities follow from Doob’s inequality combined with (40).
With s =
-v~2~n and =
20142014"- we get that
Thus
P(~i ~ n : Mi - Mn ~ -1 2b203B1n|Fn) ~ 1 - 4a2 4a2 + b2.
t>n t>n
exists al > 0 and some integer No such that for all n > No
a 103B32n + 1.
Then P(limn~~ Sn =
0) = 0.
n
nAT
Xn+l = +
n
So -
Sn =
So +
~=1
kr~’n+1 ) .
By convexity of the right derivative of ~ (Proposition 9.5, (it)) and the condi-
tional Jensen inequality we have
= 0.
Thus
(42)
If n > T,Xn+i = so
0 (43) .
E(~+i - =
If Sn > ~n, the right hand term is nonnegative by condition (ii), .previously
proved. If Sn fn, (42) and (43) imply =
-
Thus
E(’Sn+1 -
Therefore, to prove condition (iii) of Lemma 9.6, it suffices to show that
>
o (44)
Using Proposition 9.5, (iv) and assumption (iii) of Theorem 9.1 we see that
c1b1A. (46)
Putting (44), (45), (46) together and (43) give
proof given here also shows that conditions of Lemma 9.7 are satisfied.
Now suppose T = ∞. Then S_{n∧T} = S_n and {x_n} remains in N(Γ). Therefore
(by Theorem 5.7) L({x_n}), the limit set of {x_n}, is a nonempty compact
invariant set contained in N(Γ), so that Φ_t(y) ∈ N(Γ) for all y ∈ L({x_n}) and t ∈ R.
By condition (vi) of Proposition 9.5 this implies that η(Φ_t(y)) ≥ e^{βt} η(y) for all
t ≥ 0, forcing η(y) to be zero. Thus L({x_n}) ⊂ S. This implies S_n = η(x_n) → 0.
~n =
0(n-"), a 1
(Proposition 4.2).
If the step-sizes go to zero at a slower rate, we cannot expect to characterize
precisely the limit sets of the process⁴. However it is always possible to describe
the "ergodic" or statistical behavior of the process in terms of the corresponding
behavior of the associated deterministic system. This is the goal of this section,
which is mainly based on Benaim and Schreiber (1997). It is worth mentioning
4For instance, with a step-size of the order of it is easy to construct examples for
which the process never converges even though the chain recurrent set of the ODE consists of
isolated equilibria.
that Fort and Pages (1997) in a recent paper largely generalize the results of this
section and address several interesting questions which are not considered here.
Let (Ω, F, P) be a probability space and {F_t : t ≥ 0} a nondecreasing family
of sub-σ-algebras. Let (M, d) be a separable metric space equipped with its
Borel σ-algebra.
A process
(t, ω) ↦ X(t, ω)
defined on (Ω, F, P) and taking values in M is said to be a weak asymptotic pseudotrajectory of the semiflow Φ if
(i) It is progressively measurable: X|[0, T] × Ω is B_{[0,T]} ⊗ F_T measurable for all
T > 0, where B_{[0,T]} denotes the Borel σ-field over [0, T].
(ii) For every α > 0 and T > 0,
lim_{t→∞} P{ sup_{0 ≤ h ≤ T} d( X(t + h), Φ_h(X(t)) ) ≥ α } = 0.
c ~t (~).
Proof
Let f : M → [0, 1] be a uniformly continuous function and T > 0. For n ≥ 1
set
U_n(f, T) = (1/T) ∫_{(n−1)T}^{nT} f(X(s)) ds
and
N_n(f, T) = Σ_{i=2}^{n} (1/i) [ E( U_i(f, T) | F_{(i−1)T} ) − E( U_i(f, T) | F_{(i−2)T} ) ].
Doob's convergence theorem implies that {N_n(f, T)}_{n≥1} converges almost surely.
-
( f, T ) ~~(s-1)T)~ _ ~ (50)
i=1
almost surely.
we claim that
lim E(ui+1(f,T) -
> + TE.
Since X is a weak asymptotic pseudotrajectory of 03A6 the first term in the right
of the inequality goes to Zero almost surely as i -~ oo and since f is arbitrary,
this proves the claim.
Now, write
(f~ ~’) -
~
~T) =
~’) -
(f~
~ ’~ (f~ ~) ~T)
- ~,(/0~)].
Then use equations (50), (51), and equation (47) with f o ~T in lieu of f. It.
follows that there exists a set ~ ( fT) C ~ of full measure such that for all
w E T)
lim 1 n Ui+1 (f, T)-1 n Ui(f 03A6T, T) = 0. (52)
Statement (a) follows for example from the construction given in Lemma 3.1.4
of Stroock (1993) while (b) follows from Theorem 3.1.5 of Stroock (1993).
Let
Ω̃ = ∩_{k∈N, T∈Q_+} Ω(f_k, T).
Given ω ∈ Ω̃ and μ ∈ M(X, ω) there exists a sequence t_j → ∞ (depending on
ω and μ) such that
∫_M (f_k ∘ Φ_T) dμ = ∫_M f_k dμ
for all f_k ∈ H and T ∈ Q_+. This proves that μ is Φ_T invariant for all T ∈ Q_+.
By continuity of Φ this implies that μ is Φ_T invariant for all T ≥ 0 and, since Φ
is a semiflow, μ is Φ invariant. QED
supp( X,~ w) =
U
(i) w)) =1 and for any other closed set A C M such that r (w)(:-1) _
1 it follows that w) C A.
(ii) ____________
C =
{x E M : x E
Proof The proof of part (i) is an easy consequence of Theorem 10.1, and (ii)
follows from Theorem 10.1 and the Poincaré recurrence theorem (equation (30)). QED
This last corollary has the interpretation that the fraction of time spent
by a weak asymptotic pseudotrajectory in an arbitrary neighborhood of BC(Φ)
goes to one with probability one.
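As a crude numerical counterpart of this statement, one can measure the fraction of (discretized) time a sampled trajectory spends near a given finite set of points (for instance near the equilibria making up BC(Φ) in the examples above); the helper below is a sketch and assumes the trajectory is supplied as arrays of times and positions.

```python
import numpy as np

def fraction_of_time_near(times, path, targets, radius):
    """Fraction of elapsed time the sampled path spends within `radius`
    of at least one of the target points."""
    path, targets = np.asarray(path), np.asarray(targets)
    dt = np.diff(np.asarray(times, dtype=float))
    dists = np.linalg.norm(path[:-1, None, :] - targets[None, :, :], axis=2).min(axis=1)
    return float(dt[dists <= radius].sum() / dt.sum())
```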
(R) =
and
=
Then
k
P( sup
i~n
>_
i=n
Hence
References
Akin, E. (1993). The General Topology of Dynamical Systems. American Math-
ematical Society, Providence.
Arthur, B., Ermol’ev, Y., and Kaniovskii, Y. (1983). A generalized urn problem
and its applications. Cybernetics, 19:61-71.
Conley, C. C. (1978). Isolated invariant sets and the Morse index. CBMS
Regional conference series in mathematics. American Mathematical Society,
Providence.
Fort, J. C. and Pages, G. (1997). Stochastic algorithm with non constant step:
a.s. weak convergence of empirical measures. Preprint.
Fudenberg, D. and Kreps, K. (1993). Learning mixed equilibria. Games and
Econom. Behav., 5:320-367.
Ljung, L. (1986). System Identification Theory for the User. Prentice Hall,
Englewood Cliffs, NJ.
Ljung, L. and Söderström, T. (1983). Theory and Practice of Recursive Identi-
fication. MIT Press, Cambridge, MA.
Mañé, R. (1987). Ergodic Theory and Differentiable Dynamics. Springer-Verlag,
New York.