CouplingLectures-Den Hollander
ABSTRACT
Coupling is a powerful method in probability theory through which random variables can
be compared with each other. Coupling has been applied in a broad variety of contexts, e.g.
to prove limit theorems, to derive inequalities, or to obtain approximations.
The present course is intended for master students and PhD students. A basic knowledge
of probability theory is required, as well as some familiarity with measure theory. The course
first explains what coupling is and what general framework it fits into. After that a number of
applications are described. These applications illustrate the power of coupling and at the same
time serve as a guided tour through some key areas of modern probability theory. Examples
include: random walks, card shuffling, Poisson approximation, Markov chains, correlation
inequalities, percolation, interacting particle systems, and diffusions.
PRELUDE 1: A game with random digits.
Draw 100 digits randomly and independently from the set of numbers {1, 2, . . . , 9, 0}. Consider
two players who each do the following:
3. Repeat.
It turns out that the probability that the two players record the same last digit is approxi-
mately 0.974.
Why is this probability so close to 1? What if N digits are drawn randomly instead of 100
digits? Can you find a formula for the probability that the two players record the same last
digit before moving beyond digit N ?
Contents
1 Introduction
1.1 Markov chains
1.2 Birth-Death processes
1.3 Poisson approximation
3 Random walks
3.1 Random walks in dimension 1
3.1.1 Simple random walk
3.1.2 Beyond simple random walk
3.2 Random walks in dimension d
3.2.1 Simple random walk
3.2.2 Beyond simple random walk
3.3 Random walks and the discrete Laplacian
4 Card shuffling
4.1 Random shuffles
4.2 Top-to-random shuffle
5 Poisson approximation
5.1 Coupling
5.2 Stein-Chen method
5.2.1 Sums of dependent Bernoulli random variables
5.2.2 Bound on total variation distance
5.3 Two applications
6 Markov Chains
6.1 Case 1: Positive recurrent
6.2 Case 2: Null recurrent
6.3 Case 3: Transient
6.4 Perfect simulation
7 Probabilistic inequalities
7.1 Fully ordered state spaces
7.2 Partially ordered state spaces
7.2.1 Ordering for probability measures
7.2.2 Ordering for Markov chains
7.3 The FKG inequality
7.4 The Holley inequality
8 Percolation
8.1 Ordinary percolation
8.2 Invasion percolation
8.3 Invasion percolation on regular trees
10 Diffusions
10.1 Diffusions in dimension 1
10.1.1 General properties
10.1.2 Coupling on the half-line
10.1.3 Coupling on the full-line
10.2 Diffusions in dimension d
1 Introduction
In Sections 1.1–1.3 we describe three examples of coupling illustrating both the method and
its usefulness. Each of these examples will be worked out in more detail later. The symbol N0
is used for the set N ∪ {0} with N = {1, 2, . . .}. The symbol tv is used for the total variation
distance, which is defined at the beginning of Chapter 2. The symbols P and E are used to
denote probability and expectation.
Lindvall [10] explains how coupling was invented in the late 1930’s by Wolfgang Doeblin,
and provides some historical context. Standard references for coupling are Lindvall [11] and
Thorisson [15].
This is the standard Markov Chain Convergence Theorem (MCCT) (see e.g. Häggström [5],
Chapter 5, or Kraaikamp [7], Section 2.2).
A coupling proof of (1.1) goes as follows. Let X ′ = (Xn′ )n∈N0 be an independent copy of
the same Markov chain, but starting from π. Since πP n = π for all n, X ′ is stationary. Run
X and X ′ together, and let
T = inf{k ∈ N0 : Xk = Xk′ }
be their first meeting time. Note that T is a stopping time, i.e., for each n ∈ N0 the event
{T = n} is an element of the sigma-algebra generated by (Xk )0≤k≤n and (Xk′ )0≤k≤n . For
n ∈ N0 , define
$$X''_n = \begin{cases} X_n, & \text{if } n < T, \\ X'_n, & \text{if } n \geq T. \end{cases}$$
By the strong Markov property (which says that, for any stopping time T , (Xk )k>T depends
on (Xk )k≤T only through XT ), we have that X ′′ = (Xn′′ )n∈N0 is a copy of X. Now write, for
i ∈ S,
where we use P as the generic symbol for probability (in later Chapters we will be more careful
with the notation). Hence
$$\|\lambda P^n - \pi\|_{tv} = \sum_{i \in S} |(\lambda P^n)_i - \pi_i| \leq \sum_{i \in S} \big[\mathbb{P}(X''_n = i, T > n) + \mathbb{P}(X'_n = i, T > n)\big] = 2\,\mathbb{P}(T > n).$$
The left-hand side is the total variation norm of λP n − π. The conditions in the MCCT
guarantee that P(T < ∞) = 1 (as will be explained in Chapter 6). The latter is expressed by
saying that the coupling is successful. Hence the claim in (1.1) follows by letting n → ∞.
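The bound ‖λP^n − π‖tv ≤ 2P(T > n) can be checked numerically on a toy chain. The following Python sketch is only an illustration: the 3-state transition matrix, the initial distribution and the helper names (`step`, `tv_norm`, `prob_not_met`, `check`) are assumptions made for this example and are not taken from the text. It computes both sides (up to floating-point error) for two independent copies, one started in λ and one in π.

```python
# Toy check of ||lambda P^n - pi||_tv <= 2 P(T > n) for two independent copies.
# The matrix P, the initial law lam and all function names are illustrative assumptions.

P = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]        # doubly stochastic, so pi is uniform
pi = [1 / 3, 1 / 3, 1 / 3]
lam = [1.0, 0.0, 0.0]           # X starts in state 0


def step(dist, P):
    """One step of the chain: returns the row vector dist * P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def tv_norm(mu, nu):
    """Total variation norm sum_i |mu_i - nu_i| (no factor 1/2, as in the text)."""
    return sum(abs(a - b) for a, b in zip(mu, nu))


def prob_not_met(lam, pi, P, n):
    """P(T > n) when X ~ lam and X' ~ pi run independently."""
    S = range(len(P))
    # mass[i][j] = P(X_k = i, X'_k = j, T > k), starting with k = 0
    mass = [[lam[i] * pi[j] if i != j else 0.0 for j in S] for i in S]
    for _ in range(n):
        new = [[0.0] * len(P) for _ in S]
        for i in S:
            for j in S:
                for a in S:
                    for b in S:
                        if a != b:  # keep only paths that have still not met
                            new[a][b] += mass[i][j] * P[i][a] * P[j][b]
        mass = new
    return sum(sum(row) for row in mass)


def check(n_max=10):
    """Return rows (n, tv norm, 2 P(T > n)) for n = 0..n_max."""
    rows, dist = [], lam[:]
    for n in range(n_max + 1):
        rows.append((n, tv_norm(dist, pi), 2 * prob_not_met(lam, pi, P, n)))
        dist = step(dist, P)
    return rows


for n, lhs, rhs in check():
    print(n, round(lhs, 6), round(rhs, 6))
```

In this example both columns decay geometrically, with the coupling bound decaying more slowly than the true distance, which is what one expects from a coupling that is not optimal.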
Let X = (Xt )t≥0 be the Markov process with state space N0 , birth rates b = (bi )i∈N0 , death
rates d = (di )i∈N0 (d0 = 0), and initial distribution λ = (λi )i∈N0 . Suppose that b and d are such
that X is recurrent (see Kraaikamp [7], Section 3.6, for conditions on b and d that guarantee
recurrence). Let X ′ = (Xt′ )t≥0 be an independent copy of the same Markov process, but
starting from a different initial distribution µ = (µi )i∈N0 . Run X and X ′ together, and let
T = inf{t ≥ 0 : Xt = Xt′ }
where Pt is the transition matrix at time t, i.e., (λPt )i = P(Xt = i), i ∈ N0 . Since transitions
can occur between neighboring elements of N0 only, X and X ′ cannot cross without meeting.
Hence we have
T ≤ max{τ0 , τ0′ }
with
τ0 = inf{t ≥ 0 : Xt = 0}, τ0′ = inf{t ≥ 0 : Xt′ = 0},
the first hitting times of 0 for X and X ′ , respectively. By the assumption of recurrence, we
have P(τ0 < ∞) = P(τ0′ < ∞) = 1. This in turn implies that P(T < ∞) = 1, i.e., the coupling
is successful, and so we get
$$\lim_{t \to \infty} \|\lambda P_t - \mu P_t\|_{tv} = 0.$$
If X is positive recurrent (see Kraaikamp [7], Section 3.6, for conditions on b and d that
guarantee positive recurrence), then X has a unique stationary distribution π, solving the
equation πPt = π for all t ≥ 0. In that case, by picking µ = π we get
$$\lim_{t \to \infty} \|\lambda P_t - \pi\|_{tv} = 0. \tag{1.2}$$
Remark: The fact that transitions can occur between neighboring elements of N0 only allows
us to deduce, straightaway from the recurrence property, that the coupling is successful. In
Section 1.1 this argument was not available, and we had to defer this part of the proof to
Chapter 6. In Chapter 6 we will show that the coupling is successful under the stronger
assumption of positive recurrence.
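The pathwise inequality T ≤ max{τ0, τ0′} can be observed in a small simulation. The sketch below is an illustration under stated assumptions: the rates b_i = 1, d_i = 2 for i ≥ 1 (d_0 = 0), the starting points and the function names are choices made for this example, not part of the text. Two independent copies are run by drawing exponential waiting times for the pair, so that only one coordinate moves at a time and it moves by ±1.

```python
import random

# Illustrative rates (assumptions, not from the text): b_i = 1, d_i = 2 for i >= 1, d_0 = 0.
def birth(i):
    return 1.0


def death(i):
    return 2.0 if i > 0 else 0.0


def simulate_pair(x0, y0, rng):
    """Run two independent birth-death processes started at x0 and y0.

    Returns (T, tau0, tau0_prime): the first meeting time and the first
    hitting times of 0 for the two copies.
    """
    t, x, y = 0.0, x0, y0
    T = 0.0 if x == y else None
    tau = 0.0 if x == 0 else None
    tau_p = 0.0 if y == 0 else None
    while T is None or tau is None or tau_p is None:
        rates = [birth(x), death(x), birth(y), death(y)]
        t += rng.expovariate(sum(rates))            # waiting time until the next jump
        event = rng.choices(range(4), weights=rates)[0]
        if event == 0:
            x += 1
        elif event == 1:
            x -= 1
        elif event == 2:
            y += 1
        else:
            y -= 1
        if T is None and x == y:
            T = t
        if tau is None and x == 0:
            tau = t
        if tau_p is None and y == 0:
            tau_p = t
    return T, tau, tau_p


T, tau, tau_p = simulate_pair(3, 8, random.Random(0))
print(T <= max(tau, tau_p))
```

Because jumps are nearest-neighbor and the two copies never jump at the same instant, the inequality holds on every simulated path, not just on average.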
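By summing out over i′, respectively, i we see that
$$\mathbb{P}(Y_m = i) = \begin{cases} 1 - p_m, & \text{if } i = 0, \\ p_m, & \text{if } i = 1, \end{cases} \qquad \mathbb{P}(Y'_m = i') = e^{-p_m} \frac{p_m^{i'}}{i'!}, \quad i' \in \mathbb{N}_0,$$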
so that the marginal distributions are indeed correct and we have a proper coupling. Now
estimate
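$$
\begin{aligned}
\mathbb{P}(X \neq X') &= \mathbb{P}\Big(\sum_{m=1}^n Y_m \neq \sum_{m=1}^n Y'_m\Big) \\
&\leq \mathbb{P}\big(\exists\, m \in \{1, \ldots, n\} : Y_m \neq Y'_m\big) \\
&\leq \sum_{m=1}^n \mathbb{P}(Y_m \neq Y'_m) \\
&= \sum_{m=1}^n \Big[ e^{-p_m} - (1 - p_m) + \sum_{i'=2}^{\infty} e^{-p_m} \frac{p_m^{i'}}{i'!} \Big] \\
&= \sum_{m=1}^n p_m (1 - e^{-p_m}) \\
&\leq \sum_{m=1}^n p_m^2.
\end{aligned}
$$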
Hence, for $\lambda = \sum_{m=1}^n p_m$, we have proved that
with $M = \max_{m=1,\ldots,n} p_m$. This quantifies the extent to which the approximation is good
when M is small. Both λ and M will in general depend on n. Typical applications will have
λ of order 1 and M tending to zero as n → ∞.
Remark: The coupling produced above will turn out to be the best possible: it is a maximal
coupling (see Section 2.5). The crux is that (Ym , Ym′ ) = (0, 0) and (1, 1) are given the largest
possible probabilities. More details will be given in Chapter 5.
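The estimate P(Ym ≠ Ym′) = pm(1 − e^{−pm}) ≤ pm² can be verified numerically for a single pair. The sketch below assumes one specific joint law that is consistent with the marginals and with the bracketed term in the estimate above: mass 1 − p on (0, 0), mass e^{−p} − (1 − p) on (1, 0), and mass e^{−p} p^k / k! on (1, k) for k ≥ 1. This joint law, the truncation level `kmax` and the function names are assumptions of the example.

```python
import math


def joint_pmf(p, kmax=60):
    """Assumed joint law of (Y, Y') as a dict {(i, k): prob}, truncated at k = kmax."""
    pois = [math.exp(-p) * p**k / math.factorial(k) for k in range(kmax + 1)]
    joint = {(0, 0): 1 - p, (1, 0): pois[0] - (1 - p)}
    for k in range(1, kmax + 1):
        joint[(1, k)] = pois[k]
    return joint, pois


def mismatch_probability(joint):
    """P(Y != Y') under the joint law."""
    return sum(q for (i, k), q in joint.items() if i != k)


def bernoulli_marginal(joint):
    """Marginal law of Y as [P(Y = 0), P(Y = 1)]."""
    return [sum(q for (i, _), q in joint.items() if i == v) for v in (0, 1)]


p = 0.3
joint, pois = joint_pmf(p)
print(mismatch_probability(joint), p * (1 - math.exp(-p)), p**2)
```

All entries of the joint law are non-negative because e^{−p} ≥ 1 − p, so this is a genuine probability measure with the required marginals.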
2 Basic theory of coupling
Chapters 2 and 7 provide the theoretical basis for the theory of coupling and consequently are
technical in nature. It is here that we arm ourselves with a number of basic facts about coupling
that are needed to deal with the applications described in Chapters 3–6 and Chapters 8–10. In
Section 2.1 we give the definition of a coupling of two probability measures, in Section 2.2 we
state and derive the basic coupling inequality, bounding the total variation distance between
two probability measures in terms of their coupling time, in Section 2.3 we look at bounds
on the coupling time of two random sequences, in Section 2.4 we introduce the notion of
distributional coupling, while in Section 2.5 we prove the existence of a maximal coupling for
which the coupling inequality is optimal.
In what follows we use some elementary ingredients from measure theory, for which we
refer the reader to standard textbooks.
Definition 2.1 Given a bounded signed measure M on a measurable space (E, E) such that
M(E) = 0, the total variation norm of M is defined as
$$\|\mathbb{M}\|_{tv} = 2 \sup_{A \in \mathcal{E}} \mathbb{M}(A).$$
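where the supremum runs over all functions f : E → R that are bounded and measurable w.r.t.
$\mathcal{E}$, and $\|f\|_\infty = \sup_{x \in E} |f(x)|$ is the supremum norm. By the Jordan-Hahn decomposition
theorem, there exists a set $D \in \mathcal{E}$ such that $\mathbb{M}^+(\cdot) = \mathbb{M}(\,\cdot \cap D)$ and $\mathbb{M}^-(\cdot) = -\mathbb{M}(\,\cdot \cap D^c)$ are
both non-negative measures on $(E, \mathcal{E})$. Clearly, $\mathbb{M} = \mathbb{M}^+ - \mathbb{M}^-$ and $\sup_{A \in \mathcal{E}} \mathbb{M}(A) = \mathbb{M}(D) = \mathbb{M}^+(E)$.
It therefore follows that $\|\mathbb{M}\|_{tv} = \int_E (1_D - 1_{D^c}) \, d\mathbb{M} = \mathbb{M}^+(E) + \mathbb{M}^-(E)$ (note that
the absolute value sign disappears). If $\mathbb{M}(E) = 0$, then $\mathbb{M}^+(E) = \mathbb{M}^-(E)$, in which case
$\|\mathbb{M}\|_{tv} = 2\mathbb{M}^+(E) = 2 \sup_{A \in \mathcal{E}} \mathbb{M}(A)$.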
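On a finite space the identity ‖M‖tv = M⁺(E) + M⁻(E) = 2 sup_A M(A) for a signed measure with M(E) = 0 can be checked by brute force. The following sketch is only an illustration: the two probability mass functions and the function names are assumptions chosen for the example, and exact rational arithmetic is used so that the two expressions can be compared with `==`.

```python
from fractions import Fraction
from itertools import combinations


def tv_norm_sup(M):
    """2 * sup_A M(A), by brute force over all subsets A of the finite space."""
    points = list(M)
    best = Fraction(0)                      # value for A = empty set
    for r in range(1, len(points) + 1):
        for A in combinations(points, r):
            best = max(best, sum(M[x] for x in A))
    return 2 * best


def tv_norm_sum(M):
    """M^+(E) + M^-(E), which on a finite space equals sum_x |M({x})|."""
    return sum(abs(m) for m in M.values())


# Illustrative example: M = P - Q for two probability mass functions on {0, 1, 2}.
P = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}
Q = {0: Fraction(1, 5), 1: Fraction(2, 5), 2: Fraction(2, 5)}
M = {x: P[x] - Q[x] for x in P}

print(tv_norm_sup(M), tv_norm_sum(M))
```

The supremum is attained at the set D = {x : M({x}) ≥ 0}, in line with the Jordan-Hahn decomposition used above.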
Definition 2.2 A coupling of two probability measures P and P′ on the same measurable space
(E, E) is any (!) probability measure P̂ on the product measurable space (E × E, E ⊗ E) (where
E ⊗ E is the smallest sigma-algebra containing E × E) whose marginals are P and P′ , i.e.,
$$\mathbb{P} = \hat{\mathbb{P}} \circ \pi^{-1}, \qquad \mathbb{P}' = \hat{\mathbb{P}} \circ \pi'^{-1},$$
where π is the left-projection and π ′ is the right-projection, defined by
π(x, x′ ) = x, π ′ (x, x′ ) = x′ , (x, x′ ) ∈ E × E.
A similar definition holds for random variables. Given a probability space (Ω, F, Q), a
random variable X is a measurable mapping from (Ω, F) to (E, E). The image of Q under
X is P, the probability measure of X on (E, E). When we are interested in X only, we may
forget about (Ω, F, Q) and work with (E, E, P) only.
Definition 2.3 A coupling of two random variables X and X ′ taking values in (E, E) is any
(!) pair of random variables (X̂, X̂ ′ ) taking values in (E × E, E ⊗ E) whose marginals have
the same distribution as X and X ′ , i.e.,
$$\hat{X} \overset{D}{=} X, \qquad \hat{X}' \overset{D}{=} X',$$
with $\overset{D}{=}$ denoting equality in distribution.
Theorem 2.4 Given two random variables X, X ′ with probability distributions P, P′ , any (!)
coupling P̂ of P, P′ satisfies
$$\|\mathbb{P} - \mathbb{P}'\|_{tv} \leq 2\,\hat{\mathbb{P}}(\hat{X} \neq \hat{X}').$$
Proof. Pick any A ∈ E and write
$$\leq 2 \sup_{A \in \mathcal{E}} \hat{\mathbb{P}}(\hat{X} \in A, \hat{X} \neq \hat{X}') = 2\,\hat{\mathbb{P}}(\hat{X} \neq \hat{X}'),$$
Exercise 2.5 Let U, V be random variables on N0 with probability mass functions
$$f_U(x) = \tfrac{1}{2}\, 1_{\{0,1\}}(x), \qquad f_V(x) = \tfrac{1}{3}\, 1_{\{0,1,2\}}(x), \qquad x \in \mathbb{N}_0,$$
where 1S is the indicator function of the set S. (a) Compute the total variation distance
between the distributions of U and V . (b) Give two different couplings of U and V . (c) Give a
coupling of U and V under which {U ≤ V } holds with probability 1.
Exercise 2.6 Let U, V be random variables on [0, ∞) with probability density functions
which is the coupling time of X̂ and X̂ ′ , i.e., the first time from which the two sequences agree
onwards (possibly T = ∞).
Theorem 2.7 For two sequences of random variables X = (Xn )n∈N0 and X ′ = (Xn′ )n∈N0
taking values in (E N0 , E ⊗N0 ), let (X̂, X̂ ′ ) be a coupling of X and X ′ , and let T be the coupling
time. Then
$$\|\mathbb{P}(X_n \in \cdot\,) - \mathbb{P}'(X'_n \in \cdot\,)\|_{tv} \leq 2\,\hat{\mathbb{P}}(T > n).$$
Proof. This follows from Theorem 2.4 because {X̂n ≠ X̂n′ } ⊆ {T > n}.
Remark: In Section 1.2 we already saw an example of sequence coupling, namely, X and
X ′ were two copies of a Birth-Death process starting from different initial distributions. The
Markov property implies that T is equal in distribution to the first time X̂ and X̂ ′ meet each
other.
A stronger form of sequence coupling can be obtained by introducing the left-shift θ on
$E^{\mathbb{N}_0}$, defined by
$$\theta(x_0, x_1, \ldots) = (x_1, x_2, \ldots),$$
i.e., θ drops the first element of the sequence.
Proof. Because
$$\{\hat{X}_m \neq \hat{X}'_m \text{ for some } m \geq n\} \subseteq \{T > n\},$$
the claim again follows from Theorem 2.4.
Remark: Similar inequalities hold for continuous-time random processes X = (Xt )t≥0 and
X ′ = (Xt′ )t≥0 .
2.2.3 Mappings
Since total variation distance never increases under a mapping, we have the following corollary.
where the inequality comes from the fact that $\mathcal{E}$ may be larger than $\psi^{-1}(\mathcal{E}^*)$. Use Theorem 2.4
to get the bound.
Ê(φ(T )) < ∞.
Proof. Estimate
$$\phi(n)\, \hat{\mathbb{P}}(T > n) \leq \hat{\mathbb{E}}\big(\phi(T)\, 1_{\{T > n\}}\big).$$
Note that the right-hand side tends to zero as n → ∞ by dominated convergence, because
Ê(φ(T )) < ∞. Use Theorem 2.8 to get the claim.
Typical examples are:
For instance, for finite-state irreducible aperiodic Markov chains, there exists an M < ∞ such
that $\hat{\mathbb{P}}(T > 2M \mid T > M) \leq \tfrac{1}{2}$ (see Häggström [5], Chapter 5), which implies that there exists
a β > 0 such that $\hat{\mathbb{E}}(e^{\beta T}) < \infty$. In Chapter 3 we will see that for random walks we typically
have $\hat{\mathbb{E}}(T^{\alpha}) < \infty$ for all $0 < \alpha < \tfrac{1}{2}$.
2.4 Distributional coupling
Suppose that a coupling (X̂, X̂ ′ ) of two random sequences X = (Xn )n∈N0 and X ′ = (Xn′ )n∈N0
comes with two random times T and T ′ such that not only
$$\hat{X} \overset{D}{=} X, \qquad \hat{X}' \overset{D}{=} X',$$
but also
$$(\theta^{T} \hat{X}, T) \overset{D}{=} (\theta^{T'} \hat{X}', T').$$
Here we compare the two sequences shifted over different random times, rather than the same
random time.
It follows that
and hence
Remark: A restrictive feature of distributional coupling is that $T \overset{D}{=} T'$, i.e., the two random
times must have the same distribution. Therefore distributional coupling is more of a
theoretical than a practical tool. We will see in Chapter 4 that it plays a role in card shuffling.
Remark: In Section 6.2 we will encounter yet another form of coupling, called shift-coupling.
This requires the existence of random times T, T ′ such that
$$\theta^{T} X = \theta^{T'} X'$$
(which is stronger than $\theta^{T} X \overset{D}{=} \theta^{T'} X'$), but does not require that $T \overset{D}{=} T'$. This form of coupling
is useful for dealing with time averages. Thorisson [15] contains a critical analysis of how
different forms of coupling compare with each other.
2.5 Maximal coupling
Does there exist a “best possible” coupling, one that gives the sharpest estimate on the total
variation distance, in the sense that the inequality in Theorem 2.4 becomes an equality? The
answer is yes!
Theorem 2.12 For any two probability measures P and P′ on a measurable space (E, E) there
exists a coupling P̂ such that
(i) ‖P − P′‖tv = 2P̂(X̂ ≠ X̂ ′ ).
(ii) X̂ and X̂ ′ are independent conditional on {X̂ ≠ X̂ ′ }, provided the latter event has
positive probability.
Proof. We give an abstract construction of a maximal coupling. Let ∆ = {(x, x) : x ∈ E} be
the diagonal of E × E. Let ψ : E → E × E be the map defined by ψ(x) = (x, x).
Here, the first equality uses the Jordan-Hahn decomposition of signed measures into a differ-
ence of non-negative measures, the second equality uses the identity |g − g ′ | = g + g′ − 2(g ∧ g′ ),
the third equality uses the definition of Q, the fourth equality uses that Q(E) = Q̂(∆) = γ,
the fifth equality uses that $\hat{Q}(\Delta^c) = 0$ and $(\nu \times \nu')(\Delta^c) = \nu(E)\nu'(E) = (1 - \gamma)^2$, while the
sixth equality uses the definition of ∆.
Exercise 2.14 Prove the first equality. Hint: use a splitting as in the remark below Defini-
tion 2.1 with M = P − P′ and D = {x ∈ E : g(x) ≥ g ′ (x)}.
To get (ii), note that
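$$\hat{\mathbb{P}}(\,\cdot \mid \hat{X} \neq \hat{X}') = \hat{\mathbb{P}}(\,\cdot \mid \Delta^c) = \Big( \frac{\nu}{1 - \gamma} \times \frac{\nu'}{1 - \gamma} \Big)(\cdot).$$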
Remark: What Theorem 2.12 says is that we can in principle find a coupling that gives the
correct value for the total variation. Such a coupling is called a maximal coupling. However, in
practice it is often difficult to write out this coupling explicitly (the above is only an abstract
construction), and we have to content ourselves with good estimates or approximations. We
will encounter examples in Chapter 9.
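For probability mass functions on a finite set the abstract construction can be written out explicitly. The sketch below is a minimal illustration under stated assumptions: it takes Q = g ∧ g′ on the diagonal with total mass γ, puts ν = P − Q and ν′ = P′ − Q, and uses the joint law Q̂ + (ν × ν′)/(1 − γ), which is the form suggested by the identities quoted in the proof. The example distributions and the function names are choices made for this example.

```python
from fractions import Fraction


def maximal_coupling(P, Pp):
    """Joint pmf {(x, x'): prob} of a maximal coupling of the pmfs P and Pp (dicts)."""
    points = sorted(set(P) | set(Pp))
    g = {x: P.get(x, Fraction(0)) for x in points}
    gp = {x: Pp.get(x, Fraction(0)) for x in points}
    Q = {x: min(g[x], gp[x]) for x in points}        # common part, placed on the diagonal
    gamma = sum(Q.values())
    joint = {(x, x): Q[x] for x in points if Q[x] > 0}
    if gamma < 1:
        nu = {x: g[x] - Q[x] for x in points}        # excess of P over the common part
        nup = {x: gp[x] - Q[x] for x in points}      # excess of P' over the common part
        for x in points:
            for y in points:
                w = nu[x] * nup[y] / (1 - gamma)
                if w > 0:
                    joint[(x, y)] = joint.get((x, y), Fraction(0)) + w
    return joint


def marginals(joint):
    """Return the two marginal pmfs of a joint pmf."""
    left, right = {}, {}
    for (x, y), q in joint.items():
        left[x] = left.get(x, Fraction(0)) + q
        right[y] = right.get(y, Fraction(0)) + q
    return left, right


def mismatch(joint):
    """Probability that the two coordinates differ."""
    return sum(q for (x, y), q in joint.items() if x != y)


P = {0: Fraction(1, 2), 1: Fraction(1, 2)}
Pp = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 2)}
joint = maximal_coupling(P, Pp)
tv = sum(abs(P.get(x, 0) - Pp.get(x, 0)) for x in set(P) | set(Pp))
print(tv, 2 * mismatch(joint))
```

Since ν and ν′ have disjoint supports, the product term never puts mass on the diagonal, so the mismatch probability is exactly 1 − γ.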
Exercise 2.17 Is the coupling of the two coins in PRELUDE 2 a maximal coupling?
3 Random walks
Random walks on Zd , d ≥ 1, are special cases of Markov chains: the transition probability to
go from site x to site y only depends on the difference vector y − x. Because of this translation
invariance, random walks can be analyzed in great detail. A standard reference is Spitzer [14].
One key fact we will use below is that any irreducible random walk whose step distribution
has zero mean and finite variance is recurrent in d = 1, 2 and transient in d ≥ 3. In d = 1 any
random walk whose step distribution has zero mean and finite first moment is recurrent.
In Section 3.1 we look at random walks in dimension 1, in Section 3.2 at random walks
in dimension d. In Section 3.3 we use random walks in dimension d to show that bounded
harmonic functions on Zd are constant. This result has an interesting interpretation in physics:
a system in thermal equilibrium has a constant temperature.
The following theorem says that, modulo period 2, the distribution of Sn becomes flat for
large n.
Theorem 3.1 Let S be a simple random walk. Then, for every k ∈ Z even,
Proof. Let S ′ denote an independent copy of S starting at S0′ = k. Write P̂ for the joint
probability distribution of (S, S ′ ), and let
T = min{n ∈ N0 : Sn = Sn′ }.
Then
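$$\|\mathbb{P}(S_n \in \cdot\,) - \mathbb{P}(S_n + k \in \cdot\,)\|_{tv} = \|\mathbb{P}(S_n \in \cdot\,) - \mathbb{P}(S'_n \in \cdot\,)\|_{tv} \leq 2\,\hat{\mathbb{P}}(T > n).$$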
Now, S̃ = (S̃n )n∈N0 defined by S̃n = Sn′ − Sn is a random walk on Z starting at S̃0 = k with
i.i.d. increments Ỹ = (Ỹi )i∈N given by
This is a simple random walk on 2Z with a “random time delay”, namely, it steps only half
of the time. Since
T = τ̃0 = inf{n ∈ N0 : S̃n = 0}
and k is even, it follows from the recurrence of S̃ that P̂(T < ∞) = 1. Let n → ∞ to get the
claim.
In analytical terms, Theorem 3.1 says the following. Let p(·, ·) denote the transition kernel
of the simple random walk, let pn (·, ·), n ∈ N, denote the n-fold composition of p(·, ·), and
let δk (·), k ∈ Z, denote the vector whose components are 1 at k and 0 elsewhere. Then
Theorem 3.1 says that for k even
(This short-hand notation comes from the fact that δk pn (·) = P(Sn ∈ · | S0 = k).) It
is possible to prove the latter statement by hand, i.e., by computing δk pn (·) and δ0 pn (·),
evaluating their total variation distance and letting n → ∞. However, this computation turns
out to be somewhat involved.
The result in Theorem 3.1 cannot be extended to k odd. In fact, because the simple
random walk has period 2, the laws of Sn and Sn + k have disjoint support when k is odd,
irrespective of n, and so
Proof. We try to use the same coupling as in the proof of Theorem 3.1. Namely, we put
S̃n = Sn′ − Sn , n ∈ N0 , we note that S̃ = (S̃n )n∈N0 is a random walk starting at S̃0 = k whose
i.i.d. increments Ỹ = (Ỹi )i∈N are given by
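$$\tilde{\mathbb{P}}(\tilde{Y}_1 = \tilde{z}) = \sum_{\substack{z, z' \in \mathbb{Z} \\ z' - z = \tilde{z}}} \mathbb{P}(Y_1 = z)\,\mathbb{P}(Y_1 = z'), \qquad \tilde{z} \in \mathbb{Z},$$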
so that S̃ is an aperiodic random walk, and finally we argue that S̃ is recurrent, i.e.,
P̃(τ̃0 < ∞) = 1,
to complete the proof. However, there is a problem: recurrence may fail! Indeed, even though
S̃ is a symmetric random walk (because P̃(Ỹ1 = z̃) = P̃(Ỹ1 = −z̃), z̃ ∈ Z), the distribution of
Ỹ1 may have a thick tail resulting in Ẽ(|Ỹ1 |) = ∞, in which case S̃ is not necessarily recurrent
(see Spitzer [14], Section 3).
The lack of recurrence may be circumvented by slightly adapting the coupling. Namely,
instead of letting the two copies of the random walk S and S ′ step independently, we let
them make independent small steps, but dependent large steps. Formally, we let Y ′′ be an
independent copy of Y , and we define Y ′ by putting
$$Y'_i = \begin{cases} Y''_i & \text{if } |Y_i - Y''_i| \leq N, \\ Y_i & \text{if } |Y_i - Y''_i| > N, \end{cases} \tag{3.3}$$
i.e., S ′ copies the jumps of S ′′ when they differ from the jumps of S by at most N , otherwise
S ′ copies the jumps of S. The value of N ∈ N is arbitrary and will later be taken large enough.
First, we check that S ′ is a copy of S. This is so because, for every z ∈ Z,
P′ (Y1′ = z) = P̂(Y1′ = z, |Y1 − Y1′′ | ≤ N ) + P̂(Y1′ = z, |Y1 − Y1′′ | > N )
= P̂(Y1′′ = z, |Y1 − Y1′′ | ≤ N ) + P̂(Y1 = z, |Y1 − Y1′′ | > N ),
and the first term in the right-hand side equals P̂(Y1 = z, |Y1 − Y1′′ | ≤ N ) by symmetry (use
that Y and Y ′′ are independent), so that we get P′ (Y1′ = z) = P(Y1 = z).
Next, we note from (3.3) that the difference random walk S̃ = S − S ′ has increments
$$\tilde{Y}_i = Y_i - Y'_i = \begin{cases} Y_i - Y''_i & \text{if } |Y_i - Y''_i| \leq N, \\ 0 & \text{if } |Y_i - Y''_i| > N, \end{cases}$$
i.e., no jumps larger than N can occur. Moreover, by picking N large enough we also have
that
P̃(Ỹ1 ≠ 0) > 0 and (3.2) holds.
Remark: The coupling in (3.3) is called the Ornstein coupling. The idea is that S ′ manages
to stay close to S by copying its large jumps.
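The two properties of (3.3) checked above, namely that Y′ has the same law as Y and that the difference increments are bounded by N, can be confirmed by exact enumeration for a step distribution with finite support. The step law, the cut-off N = 3 and the function name below are illustrative assumptions, not taken from the text.

```python
from collections import defaultdict
from fractions import Fraction


def ornstein_laws(step_law, N):
    """Enumerate independent (Y, Y'') with law step_law and build Y' as in (3.3).

    Returns (law of Y', law of Y - Y') as dicts with exact probabilities.
    """
    law_Yp = defaultdict(Fraction)
    law_diff = defaultdict(Fraction)
    for y, py in step_law.items():
        for ypp, pypp in step_law.items():
            # Y' copies Y'' when the two jumps are close, otherwise it copies Y.
            yp = ypp if abs(y - ypp) <= N else y
            law_Yp[yp] += py * pypp
            law_diff[y - yp] += py * pypp
    return dict(law_Yp), dict(law_diff)


# Illustrative step distribution on Z (an assumption of this example).
step_law = {
    -7: Fraction(1, 10),
    -1: Fraction(3, 10),
    0: Fraction(1, 5),
    2: Fraction(3, 10),
    5: Fraction(1, 10),
}
law_Yp, law_diff = ornstein_laws(step_law, N=3)
print(law_Yp == step_law, sorted(law_diff))
```

The enumeration also shows that the law of the difference increment is symmetric, which is what makes the difference walk a symmetric random walk with bounded steps.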
Remark: Theorem 3.1 may be sharpened by noting that
$$\hat{\mathbb{P}}(T > n) = O\Big( \frac{1}{\sqrt{n}} \Big).$$
Indeed, this follows from a classical result for random walks in d = 1 with zero mean and
finite variance, namely $\mathbb{P}(\tau_z > n) = O(1/\sqrt{n})$ for all z ≠ 0 with τz the first hitting time of z (see
Spitzer [14], Section 3). Consequently,
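$$\|\mathbb{P}(S_n \in \cdot\,) - \mathbb{P}(S_n + k \in \cdot\,)\|_{tv} = O\Big( \frac{1}{\sqrt{n}} \Big) \qquad \forall\, k \in \mathbb{Z} \text{ even}.$$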
A direct proof of this estimate without coupling turns out to be rather hard, especially for
an arbitrary random walk in d = 1 with zero mean and finite variance. Even a well-trained
analyst typically does not manage to cook up a proof in a day! Exercise 3.2 shows how to
proceed for simple random walk.
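For simple random walk the total variation norm between the laws of Sn and Sn + k can be computed exactly from the binomial distribution, which makes the 1/√n decay for even k and the lack of convergence for odd k visible. The sketch below is an illustration; the function names are assumptions of the example, and the norm is taken as a sum of absolute differences (no factor 1/2), as in the text.

```python
from fractions import Fraction
from math import comb, pi, sqrt


def srw_law(n):
    """Law of S_n for simple random walk started at 0: {site: probability}."""
    return {2 * j - n: Fraction(comb(n, j), 2**n) for j in range(n + 1)}


def tv_shift(n, k):
    """|| P(S_n in .) - P(S_n + k in .) ||_tv as a sum of absolute differences."""
    law = srw_law(n)
    shifted = {x + k: q for x, q in law.items()}
    sites = set(law) | set(shifted)
    return sum(abs(law.get(x, 0) - shifted.get(x, 0)) for x in sites)


for n in (25, 100, 400):
    # Even shift: the norm times sqrt(n) stabilises. Odd shift: the norm stays 2.
    print(n, float(tv_shift(n, 2)) * sqrt(n), tv_shift(n, 1))
```

For k = 2 the differences telescope around the mode of the binomial distribution, so the product of the norm and √n settles near 2√(2/π) ≈ 1.596 for large even n.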
Exercise 3.5 Show that, without (3.1), Theorem 3.3 holds if and only if k is a multiple of
the period.
3.2 Random walks in dimension d
Question: What about random walks on Zd , d ≥ 2? We know that an arbitrary irreducible
random walk in d ≥ 3 is transient, and so the Ornstein coupling does not work to bring the
two coupled random walks together with probability 1.
Answer: It still works, provided we do the Ornstein coupling componentwise.
Start at S̃0 = z̃ ∈ Zd with all components z̃ 1 , . . . , z̃ d even, and use that S̃ is recurrent in
direction 1, to get that
τ1 = inf{n ∈ N0 : S̃n1 = 0}
satisfies P̃(τ1 < ∞) = 1. At time τ1 change the coupling to direction 2, i.e., do the same but now
identify the steps in all directions different from 2 and allow for independent steps only in
direction 2. Put
τ2 = inf{n ≥ τ1 : S̃n2 = 0}
and note that P̃(τ2 − τ1 < ∞) = 1. Continue until all d directions are exhausted. At time
$$\tau_d = \inf\{n \geq \tau_{d-1} : \tilde{S}^d_n = 0\},$$
for which P̃(τd − τd−1 < ∞) = 1, the two walks meet. Since P̃(τd < ∞) = 1, the coupling is
successful and the proof is complete.
To get the same result when z̃1 + · · · + z̃d is even (rather than all z̃ 1 , . . . , z̃ d being even),
we argue as follows. There is an even number of directions i for which z̃ i is odd. Pair
these directions in an arbitrary manner, say, (i1 , j1 ), . . . , (il , jl ) for some 1 ≤ l ≤ d. Do
a componentwise coupling in the directions (i1 , j1 ), i.e., the jumps of S in direction i1 are
independent of the jumps of S ′ in direction j1 , while the jumps in all directions other than i1
and j1 are copied. Wait until S ′ − S is even in directions i1 and j1 , switch to the pair (i2 , j2 ),
etc., until all components of S ′ − S are even. After that do the componentwise coupling as
before.
Theorem 3.7 Subject to (3.4),
Proof. Combine the componentwise coupling with the “cut out large steps” in the Ornstein
coupling (3.3).
Exercise 3.8 Write out the details of the proof. Warning: The argument is easy when the
random walk can move in only one direction at a time (like simple random walk). For other
random walks a projection argument is needed.
Exercise 3.9 Show that, without (3.4), Theorem 3.7 holds if and only if z is an element of
the minimal sublattice containing $\{z' - z : z, z' \in \mathbb{Z}^d,\ \mathbb{P}(Y_1 = z)\,\mathbb{P}(Y_1 = z') > 0\}$.
A function f is called harmonic when ∆f ≡ 0, i.e., f is at every site equal to the average of
its values at neighboring sites.
with $M = \sup_{z \in \mathbb{Z}^d} |f(z)| < \infty$. Let n → ∞ and use Theorem 3.7 to get f (x) = f (y). Extend
this equality to $x, y \in \mathbb{Z}^d$ with $\|x - y\|$ even by first doing the coupling in paired directions, as
in Section 3.2. Hence we conclude that f is constant on the even and on the odd sublattice of
Zd , say, f ≡ ceven and f ≡ codd . But codd = E(f (S1 )) = f (0) = ceven , and so f is constant.
Remark: Theorem 3.10 has an interesting interpretation. Simple random walk can be used to
describe the flow of heat in a physical system. Space is discretized to Zd and time is discretized
to N0 . Each site has a temperature that evolves with time according to the Laplace operator.
Indeed, if $x \mapsto f(x)$ is the temperature profile at time n, then
$$x \mapsto \frac{1}{2d} \sum_{\substack{y \in \mathbb{Z}^d \\ \|y - x\| = 1}} f(y) = f(x) + (\Delta f)(x)$$
is the temperature profile at time n + 1: heat flows to neighboring sites proportionally to tem-
perature differences. A temperature profile that is in equilibrium must therefore be harmonic,
i.e., ∆f ≡ 0. Theorem 3.10 shows that on Zd the only temperature profile in equilibrium that
is bounded is the one where the temperature is constant.
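The relation (1/2d) Σ f(y) = f(x) + (∆f)(x) over nearest neighbours gives a direct way to evaluate the discrete Laplacian. The sketch below is an illustration with assumed function names and test functions; it shows that non-constant harmonic functions such as linear functions exist on Z² but are unbounded, which is consistent with Theorem 3.10.

```python
from itertools import product


def laplacian(f, x):
    """(Delta f)(x) = (1/2d) * (sum of f over the 2d nearest neighbours of x) - f(x)."""
    d = len(x)
    total = 0
    for i in range(d):
        for s in (-1, 1):
            y = list(x)
            y[i] += s
            total += f(tuple(y))
    return total / (2 * d) - f(x)


def linear(x):          # harmonic, unbounded
    return 3 * x[0] - 2 * x[1] + 5


def saddle(x):          # harmonic in d = 2, unbounded
    return x[0] ** 2 - x[1] ** 2


def parabola(x):        # not harmonic
    return x[0] ** 2


grid = list(product(range(-3, 4), repeat=2))
print(all(laplacian(linear, x) == 0 for x in grid),
      all(laplacian(saddle, x) == 0 for x in grid),
      laplacian(parabola, (0, 0)))
```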
4 Card shuffling
Card shuffling is a topic that combines coupling, algebra and combinatorics. Diaconis [3] gives
key ideas. Levin, Peres and Wilmer [8] provides a broad panorama on mixing properties of
Markov chains, with Chapter 8 devoted to card shuffling. Two examples of random shuffles
are described in the MSc-thesis by H. Nooitgedagt [12].
In Section 4.1 we present a general theory of random shuffles. In Section 4.2 we look at
a specific random shuffle, called the “top-to-random shuffle”, for which we carry out explicit
computations.
Definition 4.1 A shuffle of the deck is a permutation drawn from PN and applied to the
deck. A random shuffle is a shuffle drawn according to some probability distribution on PN .
Applying independent random shuffles to the deck, we get a Markov chain X = (Xn )n∈N0 on
PN . If each shuffle uses the same probability distribution on PN , then X is time-homogeneous.
In typical cases, X is irreducible and aperiodic, with a unique invariant distribution π that
is uniform on PN . (The latter corresponds to a random shuffle that leads to a “random
deck” after it is applied many times.) Since PN is finite, we know that the distribution of Xn
converges to π exponentially fast as n → ∞, i.e.,
Definition 4.2 (tN )N ∈N is called a sequence of threshold times if limN →∞ tN = ∞ and, for
all ǫ > 0 small enough,
It turns out that for card shuffling threshold times typically grow with N in a polynomial
fashion.
To capture the phenomenon of threshold time, we need the notion of strong uniform time.
3. XT and T are independent.
Remark: Think of T = TN as the random time at which the random shuffling of the deck
is stopped such that the arrangement of the deck is “completely random” (this is a form of
distributional coupling defined in Section 2.4). In typical cases the threshold times (tN )N ∈N
are such that
$$\lim_{N \to \infty} \mathbb{E}(T_N)/t_N = 1, \qquad \lim_{N \to \infty} \mathbb{P}(1 - \delta < T_N/t_N < 1 + \delta) = 1 \quad \forall\, \delta > 0. \tag{4.1}$$
In Section 4.2 we will construct TN for a special example of a random shuffle.
Remark: Note that T really is the coupling time to a parallel deck that starts in π, even
though this deck is not made explicit.
Theorem 4.5 For the top-to-random shuffle the sequence (tN )N ∈N with tN = N log N is a
sequence of threshold times.
Proof. Let T = τ∗ + 1, with
τ∗ = the first time that the original bottom card comes on top.
Exercise 4.6 Show that T is a strong uniform time. Hint: The +1 represents the insertion
of the original bottom card at a random position in the deck after it has come on top.
For the proof it is convenient to view T differently, namely,
$$T \overset{D}{=} V \tag{4.2}$$
with V the number of random draws with replacement from an urn with N balls until each
ball has been drawn at least once. To see why this holds, for i = 0, 1, . . . , N put
Ti = the first time there are i cards below the original bottom card,
Vi = the number of draws necessary to draw i distinct balls.
Then
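$$T_{i+1} - T_i \overset{D}{=} V_{N-i} - V_{N-(i+1)} \overset{D}{=} \mathrm{GEO}\Big( \frac{i+1}{N} \Big), \quad i = 0, 1, \ldots, N-1, \ \text{are independent}, \tag{4.3}$$
where $\mathrm{GEO}(p) = \{p(1-p)^{k-1} : k \in \mathbb{N}\}$ denotes the geometric distribution with parameter
p ∈ [0, 1].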
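The identification of T with a coupon-collector time can be explored by simulating the top-to-random shuffle directly. The sketch below is an illustration under stated assumptions: the deck is a Python list with the top card at index 0, the top card is reinserted at one of the N positions chosen uniformly, and the function names, the deck size N = 10 and the sample size are choices made for this example.

```python
import random


def strong_uniform_time(N, rng):
    """Simulate top-to-random shuffles of N >= 2 cards and return T = tau_* + 1."""
    deck = list(range(1, N + 1))      # deck[0] is the top card, deck[-1] the bottom card
    bottom = deck[-1]
    steps = 0
    while deck[0] != bottom:          # tau_* = first time the original bottom card is on top
        card = deck.pop(0)
        deck.insert(rng.randrange(N), card)   # uniform over the N possible positions
        steps += 1
    return steps + 1                  # one more shuffle reinserts the bottom card


def expected_coupon_time(N):
    """E(V) = sum_{i=1}^{N} N / i for the coupon-collector time V."""
    return sum(N / i for i in range(1, N + 1))


rng = random.Random(1)
samples = [strong_uniform_time(10, rng) for _ in range(4000)]
print(sum(samples) / len(samples), expected_coupon_time(10))
```

The empirical mean should be close to N(1 + 1/2 + ... + 1/N), and no sample can be smaller than N because the original bottom card has to rise through N − 1 positions first.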
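$$
\begin{aligned}
\mathbb{P}\big(T > (1+\epsilon) N \log N\big) &= \mathbb{P}\big(V > (1+\epsilon) N \log N\big) \\
&= \mathbb{P}\big(\cup_{i=1}^{N} A_i\big) \leq \sum_{i=1}^{N} \mathbb{P}(A_i) \\
&= N \Big( 1 - \frac{1}{N} \Big)^{(1+\epsilon) N \log N} \\
&= N e^{-(1+\epsilon) \log N + O(\frac{\log N}{N})} \\
&\sim N^{-\epsilon}, \qquad N \to \infty,
\end{aligned}
$$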
which yields the second line of Definition 4.2 via Theorem 4.4.
To get the first line of Definition 4.2, pick δ > 0, pick j = j(δ) so large that $1/j! < \tfrac{1}{2}\delta$, and
define
$$B_N = \{\sigma \in P_N : \sigma_{N-j+1} < \sigma_{N-j+2} < \cdots < \sigma_N\} = \text{set of permutations whose last } j \text{ terms are ordered upwards}, \quad N \geq j.$$
Then π(BN ) = 1/j!, and {Xn ∈ BN } is the event that the order of the original j bottom
cards is retained at time n. Since the first time the card with label N − j + 1 comes to the
top is distributed like VN −j+1 , we have
$$\mathbb{P}\big(X_{(1-\epsilon) N \log N} \in B_N\big) \geq \mathbb{P}\big(V_{N-j+1} > (1-\epsilon) N \log N\big). \tag{4.4}$$
Indeed, for the upward ordering to be destroyed, the card with label N − j + 1 must come to
the top and must subsequently be inserted below the card with label N − j + 2. We will show
that, for N ≥ N (δ),
$$\mathbb{P}\big(V_{N-j+1} \leq (1-\epsilon) N \log N\big) < \tfrac{1}{2}\delta. \tag{4.5}$$
From this it follows that
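$$
\begin{aligned}
\big\|\mathbb{P}(X_{(1-\epsilon) N \log N} \in \cdot\,) - \pi(\cdot)\big\|_{tv} &\geq 2 \big[ \mathbb{P}(X_{(1-\epsilon) N \log N} \in B_N) - \pi(B_N) \big] \\
&\geq 2 \big[ 1 - \mathbb{P}\big(V_{N-j+1} \leq (1-\epsilon) N \log N\big) - \pi(B_N) \big] \\
&\geq 2 \big[ 1 - \tfrac{1}{2}\delta - \tfrac{1}{2}\delta \big] = 2(1 - \delta).
\end{aligned}
$$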
The first inequality uses the definition of total variation, the third inequality uses (4.4) and
(4.5). By letting N → ∞ followed by δ ↓ 0, we get the first line of Definition 4.2.
To prove (4.5), we compute
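$$
\begin{aligned}
\mathbb{E}(V_{N-j+1}) &= \sum_{i=j-1}^{N-1} \mathbb{E}(V_{N-i} - V_{N-i-1}) = \sum_{i=j-1}^{N-1} \frac{N}{i+1} \sim N \log \frac{N}{j} \sim N \log N, \\
\mathrm{Var}(V_{N-j+1}) &= \sum_{i=j-1}^{N-1} \mathrm{Var}(V_{N-i} - V_{N-i-1}) = \sum_{i=j-1}^{N-1} \Big( \frac{N}{i+1} \Big)^2 \Big( 1 - \frac{i+1}{N} \Big) \sim c_j N^2, \qquad c_j = \sum_{k \geq j} k^{-2}.
\end{aligned}
$$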
Here we use that E(GEO(p)) = 1/p and Var(GEO(p)) = (1 − p)/p2 . Chebyshev’s inequality
therefore gives
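$$
\begin{aligned}
\mathbb{P}\big(V_{N-j+1} \leq (1-\epsilon) N \log N\big) &= \mathbb{P}\big(V_{N-j+1} - \mathbb{E}(V_{N-j+1}) \leq -\epsilon N \log N\, [1 + o(1)]\big) \\
&\leq \mathbb{P}\big([V_{N-j+1} - \mathbb{E}(V_{N-j+1})]^2 \geq \epsilon^2 N^2 \log^2 N\, [1 + o(1)]\big) \\
&\leq \frac{\mathrm{Var}(V_{N-j+1})}{\epsilon^2\, \mathbb{E}(V_{N-j+1})^2}\, [1 + o(1)] \\
&\sim \frac{c_j N^2}{\epsilon^2 N^2 \log^2 N} = O\Big( \frac{1}{\log^2 N} \Big).
\end{aligned}
$$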
This proves (4.5).
Remark: We have shown that $\mathbb{E}(T_N) = 1 + \sum_{i=1}^{N-1} (N/i) \sim N \log N$ and Var(TN /E(TN )) → 0
as N → ∞. This in turn implies that tN /TN → 1 in probability as N → ∞ and identifies the
scaling of the threshold time as tN ∼ E(TN ), in accordance with the prediction made in (4.1).
5 Poisson approximation
In Section 1.3 we already briefly described coupling in the context of Poisson approximation.
We now return to this topic. Let $\mathrm{BINOM}(n, p) = \{\binom{n}{k} p^k (1-p)^{n-k} : k = 0, \ldots, n\}$ be the
binomial distribution with parameters n ∈ N and p ∈ [0, 1]. A classical result from probability
theory is that, for every c ∈ (0, ∞), BINOM(n, c/n) is close to POISSON(c) when n is large.
In this section we will quantify how close, by developing a general theory for approximations
to the Poisson distribution called the Stein-Chen method. After suitable modification, the
same method also works for approximation to other types of distributions, e.g. the Gaussian
distribution, but this will not be pursued.
In Section 5.1 we derive a crude bound for sums of independent {0, 1}-valued random
variables. In Section 5.2 we describe the Stein-Chen method, which not only leads to a better
bound, but also applies to dependent random variables. In Section 5.3 we look at two specific
applications.
5.1 Coupling
Fix n ∈ N and p_1, . . . , p_n ∈ [0, 1). Let
\[ Y_i \stackrel{D}{=} \mathrm{BER}(p_i), \qquad i = 1,\dots,n, \]
be independent, i.e., P(Y_i = 1) = p_i and P(Y_i = 0) = 1 − p_i, and put $X = \sum_{i=1}^n Y_i$.
where the first line uses that e−λi = 1 − pi and the second line uses that the independent sum
of Poisson random variables with given parameters is again Poisson, with parameter equal to
the sum of the constituent parameters. It follows that
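\[
\begin{aligned}
P(X \ne X') &\le \sum_{i=1}^n P(Y_i \ne Y_i') = \sum_{i=1}^n P(Y_i' \ge 2), \\
P(Y_i' \ge 2) &= e^{-\lambda_i}\sum_{k=2}^{\infty} \frac{\lambda_i^k}{k!}
\le \tfrac12\lambda_i^2\, e^{-\lambda_i}\sum_{l=0}^{\infty}\frac{\lambda_i^l}{l!} = \tfrac12\lambda_i^2,
\end{aligned}
\]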
where the second inequality uses that k! ≥ 2(k − 2)! for k ≥ 2. Since
Remark: The interest in Theorem 5.1 is when n is large, p_1, . . . , p_n are small and λ is of order
1. (Note that $\sum_{i=1}^n \lambda_i^2 \le M\lambda$ with $M = \max\{\lambda_1,\dots,\lambda_n\}$.) A typical example is $p_i \equiv c/n$, in
which case $\sum_{i=1}^n \lambda_i^2 = n[-\log(1-c/n)]^2 \sim c^2/n$ as n → ∞.
Remark: In Section 1.3 we derived a bound similar to Theorem 5.1 but with λi = pi . For
pi ↓ 0 we have λi ∼ pi , and so the difference between the two bounds is minor.
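As a numerical sanity check of the coupling estimate for p_i ≡ c/n, the sketch below compares sup_A |P(X ∈ A) − P(X′ ∈ A)| (half the ℓ1-distance between the two mass functions) with ½ Σ λ_i², the bound on P(X ≠ X′) obtained from the coupling. The function name is hypothetical. X′ is taken to be POISSON(λ) with λ = Σ λ_i and λ_i = −log(1 − p_i), as in the proof. Under the convention used in these notes, where the total variation norm carries a factor 2, both quantities are doubled.

```python
import math


def coupling_bound_check(n, c):
    """Return (sup_A |P(X in A) - P(X' in A)|, 0.5 * sum_i lambda_i**2) for p_i = c/n."""
    p = c / n
    lam_i = -math.log(1.0 - p)  # chosen so that exp(-lam_i) = 1 - p
    lam = n * lam_i             # parameter of the Poisson variable X'

    binom = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

    # Poisson mass function on {0, ..., n}, built iteratively to avoid large factorials.
    poisson = [math.exp(-lam)]
    for k in range(1, n + 1):
        poisson.append(poisson[-1] * lam / k)
    tail = max(0.0, 1.0 - sum(poisson))  # Poisson mass above n, where X has none

    distance = 0.5 * (sum(abs(b - q) for b, q in zip(binom, poisson)) + tail)
    bound = 0.5 * n * lam_i**2
    return distance, bound


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, coupling_bound_check(n, c=2.0))
```

Both numbers should shrink like 1/n, with the distance staying below the bound.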
and hence
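\[
\begin{aligned}
E\bigl(\lambda f(Z+1)\bigr) &= \sum_{k\in\mathbb{N}_0} \lambda\, p_\lambda(k) f(k+1) \\
&= \sum_{k\in\mathbb{N}_0} (k+1)\, p_\lambda(k+1) f(k+1) \\
&= \sum_{l\in\mathbb{N}} p_\lambda(l)\, l f(l) \\
&= E\bigl(Z f(Z)\bigr).
\end{aligned}
\]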
Lemma 5.3 For λ ∈ (0, ∞) and A ⊂ N0 , let gλ,A : N0 → R be the solution of the recursive
equation
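\[
\begin{aligned}
\lambda g_{\lambda,A}(k+1) - k\,g_{\lambda,A}(k) &= 1_A(k) - p_\lambda(A), \qquad k \in \mathbb{N}_0, \\
g_{\lambda,A}(0) &= 0.
\end{aligned}
\tag{5.5}
\]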
Then, uniformly in A,
Proof. For k ∈ N0 , let Uk = {0, 1, . . . , k}. Then the solution of the recursive equation is given
by gλ,A (0) = 0 and
\[ g_{\lambda,A}(k+1) = \frac{1}{\lambda p_\lambda(k)}\bigl[ p_\lambda(A\cap U_k) - p_\lambda(A)\,p_\lambda(U_k) \bigr], \qquad k\in\mathbb{N}_0, \tag{5.6} \]
with Ac = N0 \ A.
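The closed form (5.6) can be compared with a direct forward solution of the recursion (5.5). The Python sketch below does this for a finite set A; the function names are hypothetical, and the restriction to finite A and small k is an assumption made only to keep the computation exact enough (the forward recursion amplifies rounding errors by a factor k/λ per step).

```python
import math


def poisson_pmf(lam, k):
    """Mass p_lambda(k) of the Poisson distribution with parameter lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def stein_solution_recursive(lam, A, kmax):
    """Solve (5.5) forward from g(0) = 0; returns [g(0), ..., g(kmax)]."""
    p_A = sum(poisson_pmf(lam, k) for k in A)
    g = [0.0]
    for k in range(kmax):
        rhs = (1.0 if k in A else 0.0) - p_A
        g.append((rhs + k * g[k]) / lam)
    return g


def stein_solution_closed(lam, A, kmax):
    """Evaluate the closed form (5.6); returns [g(0), ..., g(kmax)]."""
    p_A = sum(poisson_pmf(lam, k) for k in A)
    g = [0.0]
    for k in range(kmax):
        p_Uk = sum(poisson_pmf(lam, l) for l in range(k + 1))
        p_A_Uk = sum(poisson_pmf(lam, l) for l in A if l <= k)
        g.append((p_A_Uk - p_A * p_Uk) / (lam * poisson_pmf(lam, k)))
    return g


if __name__ == "__main__":
    rec = stein_solution_recursive(1.5, {0, 2, 5}, 12)
    closed = stein_solution_closed(1.5, {0, 2, 5}, 12)
    print(max(abs(a - b) for a, b in zip(rec, closed)))
```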
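Hence $g_{\lambda,\{j\}}(k+1) - g_{\lambda,\{j\}}(k) \le 0$ for k ≠ j, while for k = j
\[
\begin{aligned}
g_{\lambda,\{j\}}(j+1) - g_{\lambda,\{j\}}(j)
&= \frac{1}{\lambda}\Bigl[ \frac{p_\lambda(j)}{p_\lambda(j)}\sum_{l=j+1}^{\infty} p_\lambda(l) + \frac{p_\lambda(j)}{p_\lambda(j-1)}\sum_{l=0}^{j-1} p_\lambda(l) \Bigr] \\
&= \frac{1}{\lambda}\Bigl[ \sum_{l=j+1}^{\infty} p_\lambda(l) + \frac{\lambda}{j}\sum_{l=0}^{j-1} p_\lambda(l) \Bigr] \\
&= \frac{1}{\lambda}\Bigl[ \sum_{l=j+1}^{\infty} p_\lambda(l) + \sum_{l=1}^{j} \frac{l}{j}\, p_\lambda(l) \Bigr] \\
&\le \frac{1}{\lambda}\sum_{l=1}^{\infty} p_\lambda(l) = \frac{1}{\lambda}(1-e^{-\lambda}) \le 1 \wedge \lambda^{-1},
\end{aligned}
\]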
where the second and third equality use (5.4). It follows from (5.7) that
where we use that the jumps from negative to positive in (5.9) occur at disjoint positions as
j runs through A. Combine the latter inequality with (5.8) to get
Proof. Pick any A ⊂ N0 and write
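\[
\begin{aligned}
P(W\in A) - p_\lambda(A) &= E\bigl(1_A(W) - p_\lambda(A)\bigr) \\
&= E\bigl(\lambda g_{\lambda,A}(W+1) - W g_{\lambda,A}(W)\bigr) \\
&= \sum_{j=1}^n \bigl[ p_j\,E\bigl(g_{\lambda,A}(W+1)\bigr) - E\bigl(Y_j\, g_{\lambda,A}(W)\bigr) \bigr] \\
&= \sum_{j=1}^n p_j \bigl[ E\bigl(g_{\lambda,A}(W+1)\bigr) - E\bigl(g_{\lambda,A}(W) \mid Y_j = 1\bigr) \bigr] \\
&= \sum_{j=1}^n p_j\, E\bigl(g_{\lambda,A}(U_j+1) - g_{\lambda,A}(V_j+1)\bigr),
\end{aligned}
\]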
where the second equality uses (5.5), the third equality uses (5.1), while the fifth equality uses
(5.2). Applying (5.5) once more, we get
\[ |P(W\in A) - p_\lambda(A)| \le (1\wedge\lambda^{-1}) \sum_{j=1}^n p_j\,E(|U_j - V_j|), \]
Definition 5.6 The above random variables Y1 , . . . , Yn are said to be negatively related if
there exist arrays of random variables
\[
\begin{array}{l}
Y_{j1},\dots,Y_{jn} \\
Y'_{j1},\dots,Y'_{jn}
\end{array}
\qquad j = 1,\dots,n,
\]
\[ Y'_{ji} \le Y_{ji} \quad \forall\, i \ne j, \]
while, for each j with P(Y_j = 1) = 0, $Y'_{ji} = 0$ for j ≠ i and $Y'_{jj} = 1$.
What negative relation means is that the condition Yj = 1 has a tendency to force Yi = 0
for i ≠ j. Thus, negative relation is like negative correlation (although the notion is in fact
stronger).
An important consequence of negative relation is that there exists a coupling such that
Uj ≥ Vj for all j. Indeed, we may pick
\[ U_j = \sum_{i=1}^n Y_{ji}, \qquad V_j = -1 + \sum_{i=1}^n Y'_{ji}, \]
Theorem 5.7 If Y1 , . . . , Yn are negatively related, then
Proof. The ordering Uj ≥ Vj allows us to compute the sum that appears in the bound in
Theorem 5.5:
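\[
\begin{aligned}
\sum_{j=1}^n p_j E(|U_j - V_j|) &= \sum_{j=1}^n p_j E(U_j - V_j) \\
&= \sum_{j=1}^n p_j E(W) - \sum_{j=1}^n p_j E(W \mid Y_j = 1) + \sum_{j=1}^n p_j \\
&= E(W)^2 - \sum_{j=1}^n E(Y_j W) + \lambda \\
&= E(W)^2 - E(W^2) + \lambda \\
&= -\mathrm{Var}(W) + \lambda,
\end{aligned}
\]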
Remark: The upper bound in Theorem 5.7 only contains the unknown quantity Var(W ). It
turns out that in many examples this quantity can be either computed or estimated.
\[ 2(1\wedge\lambda^{-1}) \sum_{i=1}^n p_i^2, \]
Exercise 5.8 Check that the right-hand side is a probability distribution. Show that
\[ E(W) = n\,\frac{m}{N} = \lambda, \qquad \mathrm{Var}(W) = n\,\frac{m}{N}\Bigl(1-\frac{m}{N}\Bigr)\frac{N-n}{N-1}. \]
It is intuitively clear that Y1 , . . . , Yn are negatively related: if we condition on urn j to
contain a ball, then urn i with i ≠ j is less likely to contain a ball. More formally, recall
Definition 5.6 and, for j = 1, . . . , n, define $Y_{j1},\dots,Y_{jn}$ and $Y'_{j1},\dots,Y'_{jn}$ as follows:
• Place a ball in urn j.
• Place the remaining m − 1 balls randomly in the other N − 1 urns.
• Put Yji′ = 1{urn i contains a ball} .
• Toss a coin that produces head with probability m/N.
• If head comes up, then put $(Y_{j1},\dots,Y_{jn}) = (Y'_{j1},\dots,Y'_{jn})$.
• If tail comes up, then pick the ball in urn j, place it randomly in one of the N − m
urns that are empty, and put Yji = 1{urn i contains a ball} .
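The construction in the list above can be simulated directly. The following Python sketch uses a hypothetical function name and 0-based urn labels, and returns indicators for all N urns, so the arrays of Definition 5.6 are obtained by restricting to the first n urns. On tail the ball from urn j is moved to a uniformly chosen urn among those that are empty after the first two steps.

```python
import random


def urn_arrays(N, m, j, rng):
    """Return (y, y_prime): occupation indicators of all N urns for a fixed j."""
    # Place a ball in urn j and the remaining m - 1 balls in the other urns.
    others = [u for u in range(N) if u != j]
    occupied = set(rng.sample(others, m - 1)) | {j}
    y_prime = [int(u in occupied) for u in range(N)]

    if rng.random() < m / N:
        # Head: keep the configuration.
        y = list(y_prime)
    else:
        # Tail: move the ball from urn j to a uniformly chosen empty urn.
        empty = [u for u in range(N) if u not in occupied]
        moved = (occupied - {j}) | {rng.choice(empty)}
        y = [int(u in moved) for u in range(N)]
    return y, y_prime


if __name__ == "__main__":
    rng = random.Random(0)
    y, y_prime = urn_arrays(N=10, m=4, j=2, rng=rng)
    print(y, y_prime)
```

In every realisation y_prime[j] = 1 and y_prime[i] ≤ y[i] for i ≠ j, which is the ordering required for negative relation, and urn j is occupied in y with frequency close to m/N.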
Exercise 5.9 Check that the above construction produces arrays with the properties required
by Definition 5.6.
We expect that if m/N, n/N ≪ 1, then W is approximately Poisson distributed. The
formal computation goes as follows. Using Theorem 5.7 and Exercise 5.9, we get
6 Markov Chains
In Section 1.1 we already briefly described coupling for Markov chains. We now return to this
topic. We recall that X = (Xn )n∈N0 is a Markov chain on a countable state space S, with an
initial distribution λ = (λi )i∈S and with a transition matrix P = (Pij )i,j∈S that is irreducible
and aperiodic.
There are three cases, which will be treated in Sections 6.1–6.3:
1. positive recurrent,
2. null recurrent,
3. transient.
In case 1 there exists a unique stationary distribution π, solving the equation π = πP and
satisfying π > 0, such that limn→∞ λP n = π componentwise on S. The latter is the standard
Markov Chain Convergence Theorem, and we want to investigate the rate of convergence. In
cases 2 and 3 there is no stationary distribution, and limn→∞ λP n = 0 componentwise. We
want to investigate the rate of convergence as well, and see what the role is of the initial
distribution λ.
In Section 6.4 we take a brief look at “perfect simulation”, where coupling of Markov chains
is used to simulate random variables with no error.
where P̂λ,µ denotes any probability measure that couples X and X ′ . We will choose the
independent coupling P̂λ,µ = Pλ ⊗ Pµ , and instead of T focus on
their first meeting time at ∗ (where ∗ is any chosen state in S). Since T ∗ ≥ T , we have
P̂λ,µ (T ∗ < ∞) = 1 ∀ λ, µ.
Proof. The successive visits to ∗ by X and X ′ , given by the {0, 1}-valued random sequences
constitute a renewal process: each time ∗ is hit the process of returns to ∗ starts from scratch.
Define
Ŷk = Yk Yk′ , k ∈ N0 .
Then also Ŷ = (Ŷk )k∈N0 is a renewal process. Let
P̂λ,µ (I) = 1,
where ⌊·⌋ denotes the lower integer part. Via the standard coupling inequality this shows that
6.2 Case 2: Null recurrent
Null recurrent Markov chains do not have a stationary distribution. Consequently,
It suffices to show that there exists a coupling P̂λ,µ such that P̂λ,µ (T ∗ < ∞) = 1. The proof
of Theorem 6.1 for positive recurrent Markov chains does not carry over because there is no
stationary distribution. However, it is enough to show that there exists a coupling P̂λ,µ such
that P̂λ,µ (T < ∞) = 1, which seems easier because the two copies of the Markov chain only
need to meet somewhere, not necessarily at ∗.
P̂λ,µ (T < ∞) = 1 ∀ λ, µ.
Proof. A proof of this theorem and hence of (6.3) is beyond the scope of the present course.
We refer to Lindvall [11], Section III.21, for more details. As a weak substitute we prove the
“Cesaro average” version of (6.3):
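\[ X \text{ recurrent} \;\Longrightarrow\; \lim_{N\to\infty} \Bigl\| \frac1N\sum_{n=0}^{N-1}\lambda P^n - \frac1N\sum_{n=0}^{N-1}\mu P^n \Bigr\|_{tv} = 0 \quad \forall\,\lambda,\mu. \]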
The proof uses the notion of shift-coupling, i.e., coupling with a random time shift. Let X
and X ′ be two independent copies of the Markov chain starting from λ and µ. Write 0 instead
of ∗, and let τ0 and τ0′ denote the first hitting times of 0. Couple X and X ′ by letting their
paths coincide after τ0 , respectively, τ0′ :
\[ X_{k+\tau_0} = X'_{k+\tau_0'} \quad \forall\, k\in\mathbb{N}_0. \]
This definition makes sense because P(τ0 < ∞) = P(τ0′ < ∞) = 1 by recurrence.
Fix any event A. Write
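\[
\begin{aligned}
&\frac1N\sum_{n=0}^{N-1}(\lambda P^n)(A) - \frac1N\sum_{n=0}^{N-1}(\mu P^n)(A) \\
&\quad= \frac1N\Bigl[ \sum_{n=0}^{N-1}\hat P_{\lambda,\mu}(X_n\in A) - \sum_{n=0}^{N-1}\hat P_{\lambda,\mu}(X_n'\in A) \Bigr] \\
&\quad= \frac1N \sum_{m,m'\in\mathbb{N}_0} \hat P_{\lambda,\mu}\bigl((\tau_0,\tau_0')=(m,m')\bigr) \\
&\qquad\times \Bigl[ \sum_{n=0}^{N-1}\hat P_{\lambda,\mu}\bigl(X_n\in A \mid (\tau_0,\tau_0')=(m,m')\bigr) - \sum_{n=0}^{N-1}\hat P_{\lambda,\mu}\bigl(X_n'\in A \mid (\tau_0,\tau_0')=(m,m')\bigr) \Bigr] \\
&\quad\le \hat P_{\lambda,\mu}(\tau_0\vee\tau_0'\ge M) + \frac1N \sum_{\substack{m,m'\in\mathbb{N}_0 \\ m\vee m'<M}} \hat P_{\lambda,\mu}\bigl((\tau_0,\tau_0')=(m,m')\bigr) \\
&\qquad\times \Bigl\{ 2(m\vee m') + \sum_{k=0}^{(N-m-1)\wedge(N-m'-1)} \Bigl[ \hat P_{\lambda,\mu}\bigl(X_{m+k}\in A \mid (\tau_0,\tau_0')=(m,m')\bigr) \\
&\qquad\qquad\qquad - \hat P_{\lambda,\mu}\bigl(X'_{m'+k}\in A \mid (\tau_0,\tau_0')=(m,m')\bigr) \Bigr] \Bigr\} \\
&\quad\le \hat P_{\lambda,\mu}(\tau_0\vee\tau_0'\ge M) + \frac2N\, E\bigl((\tau_0\vee\tau_0')\,1_{\{\tau_0\vee\tau_0'<M\}}\bigr).
\end{aligned}
\]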
In the first inequality we take M ≤ N and note that m + m′ + |m − m′ | = 2(m ∨ m′ ) is the
number of summands that are lost by letting the sums start at n = m, respectively, n = m′ ,
shifting them by m, respectively, m′ , and afterwards cutting them at (N −m−1)∧(N −m′ −1).
In the second inequality the sum over k is zero by the shift-coupling.
Since the bound is uniform in A, we get the claim by taking the supremum over A and
letting N → ∞ followed by M → ∞.
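The Cesaro statement can be seen at work on a toy chain. The sketch below (hypothetical names) computes the distance between the Cesaro averages for the deterministic cycle on three states. This chain is periodic, so it falls outside the standing aperiodicity assumption of this chapter; it is chosen only because λP^n and µP^n themselves never get close, while their Cesaro averages do. The distance is returned as the ℓ1-distance of the averaged mass functions, which matches the convention with a factor 2 in the total variation norm.

```python
def cesaro_tv_distance(lam, mu, P, N):
    """l1-distance between (1/N) sum_{n<N} lam P^n and (1/N) sum_{n<N} mu P^n."""
    size = len(lam)

    def step(dist):
        # One step of the chain: row vector times transition matrix.
        return [sum(dist[i] * P[i][j] for i in range(size)) for j in range(size)]

    avg_lam, avg_mu = [0.0] * size, [0.0] * size
    cur_lam, cur_mu = list(lam), list(mu)
    for _ in range(N):
        avg_lam = [a + x / N for a, x in zip(avg_lam, cur_lam)]
        avg_mu = [a + x / N for a, x in zip(avg_mu, cur_mu)]
        cur_lam, cur_mu = step(cur_lam), step(cur_mu)
    return sum(abs(a - b) for a, b in zip(avg_lam, avg_mu))


# Deterministic cycle 0 -> 1 -> 2 -> 0 (an illustrative, periodic example).
CYCLE = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

if __name__ == "__main__":
    for N in (1, 10, 100, 1000):
        print(N, cesaro_tv_distance([1, 0, 0], [0, 1, 0], CYCLE, N))
```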
the rate of the componentwise coupling. Here is an example of a Markov chain for which (6.3)
fails:
At site x the random walk has:
zero drift with pausing for x = 0,
positive drift for x > 0,
negative drift for x < 0.
This Markov chain is irreducible and aperiodic, with limx→∞ Px (τ0 = ∞) = limx→−∞ Px (τ0 =
∞) = 1. As a result, we have
2. A rate of convergence estimate that provides an upper bound on the total variation
distance $n \mapsto \|\delta_{i^*}P^n - \pi\|_{tv}$ for a given i∗ ∈ S, so that any desired accuracy of the
approximation can be achieved by running the Markov chain long enough.
Both these ingredients give rise to a theory of simulation, for which an extensive literature
exists (see e.g. Levin, Peres and Wilmer [8]).
The drawback is that the simulation is at best approximate: no matter how long we run
the Markov chain, its distribution is never perfectly equal to ρ (at least in typical situations).
Häggström [5], Chapters 10–12, contains an outline of a different approach, through which it
is possible to achieve a perfect simulation, i.e., to obtain a random sample whose distribution
is equal to ρ with no error (!). In this approach, independent copies of the Markov chain are
started from each site of S “far back in the past”, and the simulation is stopped at time zero
when all the copies “have collided prior to time zero”. The observation of the Markov chain
at time zero provides the perfect sample.
The details of the construction are somewhat delicate and we refer the reader to the relevant
literature. Concrete examples are discussed in [5].
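To make the idea concrete, here is a minimal Python sketch of coupling from the past in the Propp-Wilson style, in which the copies started from all states are driven by shared randomness. This is an illustration under stated assumptions and not necessarily the exact construction of [5]: the function names, the update-function interface and the example chain (a random walk on {0, 1, 2, 3} that holds at the endpoints, whose stationary law is uniform) are all hypothetical choices.

```python
import random


def coupling_from_the_past(states, update, rng):
    """Return one sample from the stationary law of the chain x -> update(x, u).

    randomness[k] drives the step from time -(k + 1) to time -k, and is reused
    when the starting time is pushed further back into the past.
    """
    randomness = []
    horizon = 1
    while True:
        while len(randomness) < horizon:
            randomness.append(rng.random())
        # Run a copy from every state, from time -horizon up to time 0.
        current = {s: s for s in states}
        for k in range(horizon - 1, -1, -1):
            u = randomness[k]
            current = {s: update(x, u) for s, x in current.items()}
        values = set(current.values())
        if len(values) == 1:  # all copies have coalesced by time 0
            return values.pop()
        horizon *= 2          # otherwise start further back in the past


def reflecting_walk_update(x, u, top=3):
    """Move down if u < 1/2, otherwise up, holding at the endpoints 0 and top."""
    return max(x - 1, 0) if u < 0.5 else min(x + 1, top)


if __name__ == "__main__":
    rng = random.Random(1)
    samples = [coupling_from_the_past(range(4), reflecting_walk_update, rng)
               for _ in range(4000)]
    print([samples.count(s) / len(samples) for s in range(4)])
```

The essential design point is that the random inputs for times close to zero are kept fixed while the starting time moves back; redrawing them would bias the sample.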
7 Probabilistic inequalities
In Chapters 1 and 3–6 we have seen coupling at work in a number of different situations. We
now return to the basic theory that was started in Chapter 2. Like the latter, the present
chapter is somewhat technical.
We will show that the existence of an ordered coupling between random variables or ran-
dom processes is equivalent to the respective probability measures being ordered themselves.
In Section 7.1 we look at fully ordered state spaces, in Section 7.2 at partially ordered state
spaces. In Section 7.3 we state and derive the Fortuin-Kasteleyn-Ginibre inequality, in Sec-
tion 7.4 the Holley inequality. Both are inequalities for expectations of functions on partially
ordered state spaces.
We say that P′ stochastically dominates P, and write P ⪯ P′. In terms of the respective cumulative
distribution functions F, F′, defined by F(x) = P((−∞, x]) and F′(x) = P′((−∞, x]),
x ∈ R, this property is the same as
F ′ (x) ≤ F (x) ∀ x ∈ R,
i.e., F ′ ≤ F pointwise.
P̂(X̂ ≤ X̂ ′ ) = 1.
Proof. The proof provides an explicit coupling of X and X ′ . Let F ∗ , F ′∗ denote the generalized
inverse of F, F ′ defined by
X̂ = F ∗ (U ), X̂ ′ = F ′∗ (U ).
Then $\hat X \stackrel{D}{=} X$, $\hat X' \stackrel{D}{=} X'$, and $\hat X \le \hat X'$ because F′ ≤ F implies F∗ ≤ F′∗. This construction, via
a common U, provides the desired coupling.
If F has a point mass (k2 − k1 )δx0 for some k2 > k1 and x0 ∈ R, then this point mass gives
rise to a flat piece in F ∗ over the interval (k1 , k2 ] at height x1 that solves F (x1 ) = k2 .
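The coupling via a common U is easy to implement for discrete distributions. The Python sketch below assumes the usual definition of the generalized inverse, F∗(u) = inf{x : F(x) ≥ u}; the function names and the two example distributions on {0, 1, 2, 3} are hypothetical and chosen so that F′ ≤ F.

```python
import bisect
import itertools
import random


def generalized_inverse(values, probs):
    """Return the map u -> min{x : F(x) >= u} for a law on the sorted list `values`."""
    cdf = list(itertools.accumulate(probs))

    def inverse(u):
        idx = bisect.bisect_left(cdf, u)  # first index with cdf[idx] >= u
        return values[min(idx, len(values) - 1)]  # guard against rounding in cdf[-1]

    return inverse


def quantile_coupling(values, probs, probs_prime, rng):
    """Draw (X_hat, X_hat_prime) from the two laws using one common uniform U."""
    u = rng.random()
    return generalized_inverse(values, probs)(u), generalized_inverse(values, probs_prime)(u)


if __name__ == "__main__":
    rng = random.Random(0)
    values = [0, 1, 2, 3]
    pairs = [quantile_coupling(values, [0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4], rng)
             for _ in range(10)]
    print(pairs)
```

Every printed pair is ordered, and each coordinate on its own has the prescribed law.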
Exercise 7.2 (Examples 2.5–2.6 repeated) Let U, V be the random variables in Exercises
2.5–2.6. Give a coupling of U and V such that U ≤ V with probability 1.
Actually, the converses of Theorems 7.1 and 7.3 are also true, as is easily seen by picking
sets [x, ∞) and functions $x \mapsto 1_{[x,\infty)}$ for x ∈ R. Therefore the following equivalence holds:
where x, y, z are generic elements of E.
Definition 7.7 Given two probability measures P, P′ on E, we say that P′ stochastically
dominates P, and write P ⪯ P′, if
x ∈ A =⇒ A ⊃ {y ∈ E : x ⪯ y},
or equivalently if
\[ \int_E f\,dP \le \int_E f\,dP' \quad \text{for all } f\colon E\to\mathbb{R} \text{ measurable, bounded and non-decreasing,} \]
x ⪯ y =⇒ f (x) ≤ f (y).
Examples:
• $E = \{0,1\}^{\mathbb{Z}}$, $x = (x_i)_{i\in\mathbb{Z}} \in E$, x ⪯ y if and only if $x_i \le y_i$ for all i ∈ Z. For p ∈ [0, 1],
let $P_p$ denote the probability measure on E under which $X = (X_i)_{i\in\mathbb{Z}}$ has i.i.d. BER(p)
components. Then $P_p \preceq P_{p'}$ if and only if p ≤ p′.
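One way to see the "if" direction in this example is to build both configurations from the same uniform random variables. The Python sketch below does this on a finite window of Z; the function name and the restriction to a finite set of sites are assumptions made for illustration.

```python
import random


def coupled_bernoulli_fields(p, p_prime, sites, rng):
    """Return (x, x_prime) with i.i.d. BER(p) and BER(p_prime) components,
    built from one uniform variable per site so that x <= x_prime when p <= p_prime."""
    u = {i: rng.random() for i in sites}
    x = {i: int(u[i] <= p) for i in sites}
    x_prime = {i: int(u[i] <= p_prime) for i in sites}
    return x, x_prime


if __name__ == "__main__":
    rng = random.Random(0)
    x, x_prime = coupled_bernoulli_fields(0.3, 0.6, range(-5, 6), rng)
    print([x[i] for i in range(-5, 6)])
    print([x_prime[i] for i in range(-5, 6)])
```

Each field has the correct marginal law, and the componentwise ordering holds in every realisation rather than only in distribution.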
Exercise 7.11 Does ⪯ in Definition 7.7 define a partial ordering on the space of probability
measures?
Definition 7.12 Given two transition kernels K and K ′ on E × E, we say that K ′ stochas-
tically dominates K if
K(x, ·) ⪯ K′(x′, ·) for all x ⪯ x′.
If K = K ′ and the latter condition holds, then we say that K is monotone.
Remark: Not all transition kernels are monotone, which is why we cannot write K ⪯ K′
for the property in Definition 7.12, i.e., there is no partial ordering on the set of transition
kernels.
\[ \lambda K^n \preceq \mu K'^n \quad \text{for all } n\in\mathbb{N}_0. \]
Proof. The proof is by induction on n. The ordering holds for n = 0. Suppose that the
ordering holds for n. Let f be an arbitrary bounded and non-decreasing function on $E^{n+2}$.
Then
\[
\begin{aligned}
&\int_{E^{n+2}} f(x_0,\dots,x_n,x_{n+1})\,(\lambda K^{n+1})(dx_0,\dots,dx_n,dx_{n+1}) \\
&\qquad= \int_{E^{n+1}} (\lambda K^n)(dx_0,\dots,dx_n) \int_E f(x_0,\dots,x_n,x_{n+1})\,K(x_n,dx_{n+1}),
\end{aligned}
\tag{7.1}
\]
where $(\lambda K^n)(dx_0,\dots,dx_n)$ is an abbreviation for $\lambda(dx_0)K(x_0,dx_1)\times\cdots\times K(x_{n-1},dx_n)$.
The last integral is a function of x0 , . . . , xn . Since f is non-decreasing and K ′ stochastically
dominates K, this integral is bounded from above by
\[ \int_E f(x_0,\dots,x_n,x_{n+1})\,K'(x_n,dx_{n+1}), \tag{7.2} \]
where we use Definitions 7.7 and 7.12.
Theorem 7.15 If λ ⪯ µ and K′ stochastically dominates K, then there exist E-valued random
processes
\[ Z = (Z_n)_{n\in\mathbb{N}_0}, \qquad Z' = (Z'_n)_{n\in\mathbb{N}_0}, \]
such that
\[
\begin{aligned}
(Z_0,\dots,Z_n) &\stackrel{D}{=} \lambda K^n, \\
(Z'_0,\dots,Z'_n) &\stackrel{D}{=} \mu K'^n,
\end{aligned}
\qquad \forall\, n\in\mathbb{N}_0,
\]
and $Z_0 \preceq Z'_0$, $Z_1 \preceq Z'_1$, . . . a.s. w.r.t. the joint law of (Z, Z′).
Remark: The last ordering is denoted by $Z \preceq^{\infty} Z'$. All components are ordered w.r.t. ⪯.
Examples:
1. E = R, ⪯ becomes ≤. The result says that if $\lambda \stackrel{D}{\le} \mu$ and $K(x,\cdot) \stackrel{D}{\le} K'(x',\cdot)$ for all x ≤ x′,
then the two Markov chains on R can be coupled so that they are ordered for all times.
2. E = {0, 1}Z . Think of an infinite sequence of lamps, labelled by Z, that can be either
“off” or “on”. The initial distributions are λ = Pp and µ = Pp′ with p < p′ . The transition
kernels K and K ′ are such that the lamps change their state independently at rates
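\[
\begin{aligned}
K &:\quad 0 \xrightarrow{\;u\;} 1, \qquad 1 \xrightarrow{\;v\;} 0, \\
K' &:\quad 0 \xrightarrow{\;u'\;} 1, \qquad 1 \xrightarrow{\;v'\;} 0,
\end{aligned}
\]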
with u′ > u and v ′ < v, i.e., K ′ flips more rapidly on and less rapidly off compared to K.
Exercise 7.16 Give an example where the flip rate of a lamp depends on the states of the
two neighboring lamps.
i.e., µ′ is the marginal of µ on S ′ , and f ′ and g ′ are the conditional expectations with respect
to µ given the value on S ′ . To proceed with the proof we need the following lemma.
\[ s_1 s_2 \ge t_1 t_2, \qquad s_3 s_4 \ge t_3 t_4, \qquad s_2 s_3 \ge t_1 t_4 \vee t_2 t_3. \]
Use (7.3) and Lemma 7.18 with a, b ∈ P(S ′ ) and
\[
\begin{array}{ll}
s_1 = \mu(a\cup b) & t_1 = \mu(a) \\
s_2 = \mu(a\cap b) & t_2 = \mu(b) \\
s_3 = \mu([a\cup b]\cup\{s\}) & t_3 = \mu(a\cup\{s\}) \\
s_4 = \mu([a\cap b]\cup\{s\}) & t_4 = \mu(b\cup\{s\})
\end{array}
\]
The right-hand side is a sum of products of non-negative terms (use (7.3–7.4)), and so f ′ (b) ≥
f ′ (a).
Step 3: µ[f g] ≥ µ′ [f ′ g′ ]:
Write
\[ \mu[fg] = \sum_{a\in\mathcal{P}(S)} (fg)(a)\mu(a) = \sum_{a\in\mathcal{P}(S')} (fg)'(a)\mu'(a), \]
Hence
\[ \mu[fg] \ge \sum_{a\in\mathcal{P}(S')} f'(a)g'(a)\mu'(a). \]
µ′ [f ′ g ′ ] ≥ µ′ [f ′ ]µ′ [g′ ].
But µ′ [f ′ ] = µ[f ] and µ′ [g′ ] = µ[g], and so with Step 3 we are done when µ > 0 on P(S).
Exercise 7.21 Explain how to remove the restriction that µ > 0 on P(S).
The two factors in the integrand are either both ≥ 0 or both ≤ 0, and hence µ[f g] ≥ µ[f ]µ[g].
Remark: The intuition behind log-convexity is the following. First, note that the inequality
in (7.3) holds for all a, b ∈ P(S) if and only if
Next, let X ∈ P(S) be the random variable with distribution P(X = a) = µ(a), a ∈ P(S).
Define
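\[ p(a,\{s\}) = P\bigl(s\in X \mid X\cap S\backslash\{s\} = a\bigr), \qquad \forall\, a\in\mathcal{P}(S),\ s\in S\backslash a, \tag{7.6} \]
and note that
\[ p(a,\{s\}) = \Bigl(1 + \Bigl(\frac{\mu(a\cup\{s\})}{\mu(a)}\Bigr)^{-1}\Bigr)^{-1}. \]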
In view of (7.6), the latter says: “larger X are more likely to contain an extra point than
smaller X”, a property referred to as “attractiveness”.
Example: [Percolation model]
Take S to be a finite set in Zd , P(S) = {0, 1}S ,
Example: [Ising model]
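Take S to be a finite torus in $\mathbb{Z}^d$ (with periodic boundary conditions), $\mathcal{P}(S) = \{0,1\}^S$,
\[ \mu(a) = \frac{1}{Z_\beta}\exp\bigl[\beta\,|\{x,y\in a\colon \|x-y\|=1\}|\bigr] \quad\text{with } \beta\in(0,\infty), \tag{7.9} \]
where $Z_\beta$ is the normalizing constant,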
A, B ⊂ S and f (·) = 1{·⊃A} , g(·) = 1{·⊃B} .
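For a very small torus the inequality µ[fg] ≥ µ[f]µ[g] with f = 1{·⊃A} and g = 1{·⊃B} can be verified by brute force. The Python sketch below takes S to be a cycle of four sites (a one-dimensional torus) and counts unordered nearest-neighbour pairs in (7.9); both choices, like the function names, are assumptions made for illustration, and counting ordered pairs instead would only rescale β.

```python
import itertools
import math


def ising_measure(n_sites, beta):
    """Measure (7.9) on subsets of the cycle {0, ..., n_sites - 1} (needs n_sites >= 3)."""
    sites = range(n_sites)
    bonds = [(x, (x + 1) % n_sites) for x in sites]
    subsets = [frozenset(c) for r in range(n_sites + 1)
               for c in itertools.combinations(sites, r)]
    weights = {a: math.exp(beta * sum(1 for x, y in bonds if x in a and y in a))
               for a in subsets}
    z = sum(weights.values())  # normalizing constant Z_beta
    return {a: w / z for a, w in weights.items()}


def fkg_gap(mu, A, B):
    """Return mu[fg] - mu[f] mu[g] for f = 1{. contains A}, g = 1{. contains B}."""
    mean_f = sum(p for a, p in mu.items() if A <= a)
    mean_g = sum(p for a, p in mu.items() if B <= a)
    mean_fg = sum(p for a, p in mu.items() if A <= a and B <= a)
    return mean_fg - mean_f * mean_g


if __name__ == "__main__":
    mu = ising_measure(4, beta=0.8)
    subsets = list(mu)
    print(min(fkg_gap(mu, A, B) for A in subsets for B in subsets))
```

The printed minimum should be non-negative up to rounding, consistent with the positive correlation of increasing events.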
Exercise 7.25 Check property (3) by showing that the allowed transitions preserve the order-
ing of the Markov chains, i.e., if η ⊆ ζ, then the same is true after every allowed transition.
Consequently,
η0 ⊆ ζ0 =⇒ ηt ⊆ ζt ∀ t > 0. (7.11)
Check properties (1) and (2). Condition (7.10) is needed to ensure that
\[ \frac{\mu_2(\eta^s)}{\mu_2(\eta)} \ge \frac{\mu_1(\zeta^s)}{\mu_1(\zeta)} \quad\text{when } \eta\subseteq\zeta \text{ with } (\eta(s),\zeta(s)) = (1,1). \]
From (7.11) we get Eη0 (f (ηt )) ≤ Eζ0 (f (ζt )) for all t ≥ 0 when η0 ⊂ ζ0 , and the Holley
inequality follows because Eη0 (f (ηt )) → µ2 [f ] and Eζ0 (f (ζt )) → µ1 [f ] as t → ∞. Pick η0 = ∅
and ζ0 = S to make sure that η0 ⊆ ζ0 .
Remark: The coupling used in the above proof is a maximal coupling in the sense of Sec-
tion 2.5.
Remark: By viewing the above rates locally, we can extend the Holley inequality to countable
sets S via a “projective limit” argument. The inequality in (7.10) must then be assumed for
arbitrary cylinder sets. It is even possible to extend to uncountable sets S.
What is important about Theorem 7.24 is that it provides an explicit criterion on µ1 , µ2
such that µ2 ⪯ µ1, as is evident from Theorem 7.9. Note that “log-convex with respect to” is
not a partial ordering: as noted above, µ is log-convex with respect to itself if and only if it
is log-convex. In particular, the reverse of Theorem 7.24 is false.
Exercise 7.26 Return to the example of the Ising model at the end of Section 7.3. Pick
β1 > β2 , and let µi = µβi , i = 1, 2, with µβ the probability measure in (7.9). Show that µβ1 is
log-convex with respect to µβ2 .
8 Percolation
In Section 8.1 we look at ordinary percolation on Zd , in Section 8.2 at invasion percolation
on Zd . In Section 8.3 we take a closer look at invasion percolation on regular trees, where
explicit computations can be carried through.
A standard reference for percolation theory is Grimmett [4].
Pick p ∈ [0, 1], and partition Zd into p-clusters by connecting all sites that are connected by
edges whose weight is ≤ p, i.e.,
\[ x \stackrel{p}{\longleftrightarrow} y \]
if and only if there is a path π connecting x and y such that w(e) ≤ p for all e ∈ π. (A
path is a collection of neighboring sites connected by edges.) Let Cp (0) denote the p-cluster
containing the origin, and define
so that
θ(0) = 0, θ(1) = 1, p ↦ θ(p) is non-decreasing.
Define
pc = sup{p ∈ [0, 1] : θ(p) = 0}.
It is known that pc ∈ (0, 1) (for d ≥ 2), and that p ↦ θ(p) is continuous for all p ≠ pc.
Continuity is expected to hold also at p = pc, but this has only been proved for d = 2 and
d ≥ 19. It is further known that pc = 1/2 for d = 2, while no explicit expression for pc is known
for d ≥ 3. There are good numerical approximations available for pc, as well as expansions in
powers of 1/(2d) for d large.
At p = pc a phase transition occurs:
Remark: Note that the Cp (0)’s for different p’s are coupled because we use the same w for
all of them. Indeed, we have
In this way we obtain a sequence of growing sets I = (I(n))n∈N0 with I(n) ⊂ Zd and |I(n)| ≤
n+1. (The reason for the inequality is that the vertex at the other end may have been invaded
before. The set of invaded edges at time n has cardinality n.) The invasion percolation cluster
is defined as
\[ C_{IPC} = \lim_{n\to\infty} I(n). \]
This is an infinite subset of Zd , which is random because w is random. Note that the sequence
I is uniquely determined by w (because no two edges have the same weight).
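The growth of I can be simulated on Z² with a priority queue. The Python sketch below assumes the usual rule that at each step the not yet invaded edge of smallest weight touching the current set is invaded; the function name is hypothetical, and the UNIF(0, 1) weights are generated lazily so that no finite box has to be fixed in advance.

```python
import heapq
import random


def invasion_percolation(steps, seed=0):
    """Run `steps` invasion steps on Z^2 from the origin.

    Returns (sizes, traversed), where sizes[n] = |I(n)| and traversed[n - 1]
    is the weight W_n of the edge invaded in step n.
    """
    rng = random.Random(seed)
    weights = {}

    def weight(edge):
        # Edge weights are drawn on first use and then kept fixed.
        if edge not in weights:
            weights[edge] = rng.random()
        return weights[edge]

    invaded = {(0, 0)}
    queued = set()
    heap = []

    def push_edges(v):
        x, y = v
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            edge = (min(v, nb), max(v, nb))  # canonical form of the undirected edge
            if edge not in queued:
                queued.add(edge)
                heapq.heappush(heap, (weight(edge), edge))

    push_edges((0, 0))
    sizes, traversed = [1], []
    for _ in range(steps):
        w, (a, b) = heapq.heappop(heap)  # weakest edge on the boundary
        traversed.append(w)
        for v in (a, b):
            if v not in invaded:
                invaded.add(v)
                push_edges(v)
        sizes.append(len(invaded))
    return sizes, traversed


if __name__ == "__main__":
    sizes, traversed = invasion_percolation(2000, seed=1)
    print(sizes[-1], max(traversed[-500:]))
```

The output illustrates |I(n)| ≤ n + 1, with strict inequality as soon as an invaded edge joins two vertices that were already in the set.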
Remark: Invasion percolation may serve as a model for the spread of a virus through a
computer network: the virus is “greedy” and invades the network along the weakest links.
The first question we may ask is whether CIPC = Zd . The answer is no:
\[ C_{IPC} \subsetneq \mathbb{Z}^d \quad \text{a.s.} \]
A key result for invasion percolation is the following. Let Wn denote the weight of the edge
that is traversed in the n-th step of the growth of CIPC , i.e., in going from I(n − 1) to I(n).
Lemma 8.3 P(τp < ∞) = 1 for all p > pc .
Proof. Each time I “breaks out” of the box with center 0 it is currently contained in, it sees a
“never-before-explored” region containing a half-space. There is an independent probability
θ(p) > 0 that it hits Cp at such a break out time. Therefore it will eventually hit Cp with
probability 1. (This observation tacitly assumes that pc (Zd ) = pc (halfspace).)
In fact, it is trivial to see that equality must hold. Indeed, suppose that Wn ≤ p̃ for all n
large enough for some p̃ < pc . Then
with
\[ \tau(\tilde p) = \inf\bigl\{ m\in\mathbb{N}_0 \colon W_n \le \tilde p \ \ \forall\, n \ge m \bigr\}. \]
But |Cp̃ | < ∞ and |I(τ (p̃))| < ∞ a.s., and this contradicts |CIPC | = ∞. Note that
\[ \limsup_{N\to\infty} \frac{1}{|B_N|}\,|B_N\cap C_{IPC}| \le \theta(p) \quad \text{a.s.} \quad \forall\, p>p_c, \]
We again assign independent UNIF(0, 1) weights w = (w(e))e∈(Tσ )∗ to the edges (Tσ )∗ of
the tree, and use this to define ordinary percolation and invasion percolation. We will compare
CIPC with the incipient infinite cluster, written CIIC and defined informally as
CIIC = Cpc | {|Cpc | = ∞},
i.e., take the critical cluster of ordinary percolation and condition it to be infinite. A more
formal construction is
\[ P(C_{IIC}\in\cdot\,) = \lim_{n\to\infty} P(C_{p_c}\in\cdot \mid 0\leftrightarrow H_n) \]
with Hn ⊂ Tσ the set of vertices at height n below the origin. The existence of the limit is
far from trivial.
Theorem 8.5 There exists a coupling of CIPC and CIIC such that CIPC ⊆ CIIC a.s.
Proof. We begin by noting that both CIPC and CIIC consist of a random backbone with
random finite branches hanging off.
IPC: Suppose that with positive probability there is a vertex in CIPC from which there are
two disjoint paths to infinity. Conditioned on this event, let M1 and M2 denote the
maximal weight along these paths. It is not possible that M1 > M2 , since this would
cause the entire second path to be invaded before the piece of the first path above its
maximum weight is invaded. For the same reason M1 < M2 is not possible either. But
M1 = M2 has probability zero, and so there is a single path to infinity.
IIC: The backbone guarantees the connection to infinity. The cluster is a critical branching
process with offspring distribution BIN(σ, 1/σ) conditioned on each generation having
at least one child.
We next give structural representations of CIPC and CIIC :
Lemma 8.6
CIIC : The branches hanging off the backbone are critical percolation clusters.
CIPC : The branches hanging off the backbone at height k are supercritical percolation clusters
with parameter Wk > pc conditioned to be finite, where
Proof. We give the proof for CIPC only. By symmetry, all possible backbones are equally likely.
Condition on the backbone, abbreviated BB. Conditional on W = (Wk )k∈N0 , the following is
true for every vertex x ∈ Tσ :
x ∈ CIPC ⇐⇒ every edge on the path between xBB and x has weight < Wk ,
where xBB = xBB (x) is the unique vertex where the path upwards from x to 0 hits BB, and
k = k(x) is the height of xBB .
Therefore, the event {BB = bb, W = w} is the same as the event that for all k ∈ N0 there is
no percolation below level Wk (i.e., for p-percolation with p < Wk ) in each of the branches off
BB at height k, and the forward maximal weights along bb are equal to w = (wk )k∈N0 .
On the tree, there is a nice duality relation between subcritical and supercritical percola-
tion.
Lemma 8.7 A supercritical percolation cluster with parameter p > pc conditioned to stay
finite has the same law as a subcritical percolation cluster with dual parameter p̂ < pc given
by
\[ \hat p = p\,\zeta(p)^{\sigma-1} \]
with ζ(p) ∈ (0, 1) the probability that the cluster along a particular branch from 0 is finite.
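The dual parameter is easy to evaluate numerically. The sketch below assumes that ζ(p) is the smallest solution in [0, 1] of ζ = 1 − p + pζ^σ (the edge of the branch is closed, or it is open and all σ branches below it are finite), which is how the definition of ζ(p) reads for the tree with σ offspring per vertex and pc = 1/σ. The function names are hypothetical.

```python
def branch_extinction_probability(p, sigma, iterations=500):
    """Smallest fixed point of z -> 1 - p + p * z**sigma, by iteration from 0."""
    zeta = 0.0
    for _ in range(iterations):
        zeta = 1.0 - p + p * zeta**sigma
    return zeta


def dual_parameter(p, sigma):
    """Dual parameter p_hat = p * zeta(p)**(sigma - 1)."""
    return p * branch_extinction_probability(p, sigma) ** (sigma - 1)


if __name__ == "__main__":
    for p in (0.6, 0.7, 0.9):
        print(p, dual_parameter(p, sigma=2))
```

For σ = 2 the fixed point is ζ(p) = (1 − p)/p when p > 1/2, so the dual parameter reduces to 1 − p, which is below pc = 1/2 as the lemma asserts.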
Proof. For v ∈ Tσ , let Cp (v) denote the forward cluster of v for p-percolation. Let U be any
finite subtree of Tσ with, say, m edges, and hence with (σ − 1)m + σ boundary edges. Then
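\[
\begin{aligned}
P\bigl(U\subset C_p(v) \mid |C_p(v)|<\infty\bigr) &= \frac{P(U\subset C_p(v),\, |C_p(v)|<\infty)}{P(|C_p(v)|<\infty)} \\
&= \frac{p^m\,\zeta(p)^{(\sigma-1)m+\sigma}}{\zeta(p)^{\sigma}},
\end{aligned}
\]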
the numerator being the probability of the event that the edges of U are open and there is no
percolation from any of the sites in U . The right-hand side equals
which proves the duality. To see that p > pc implies p̂ < pc , note that
We can now complete the proof of CIPC ⊆ CIIC : since CIPC has subcritical clusters hanging
off its backbone, these branches are all stochastically smaller than the critical clusters hanging
off the backbone of CIIC . The subcritical clusters can be coupled to the critical clusters so
that they are contained in them.
9 Interacting particle systems
In Section 9.1 we define what an interacting particle system is. In Sections 9.2–9.3 we focus on
shift-invariant spin-flip systems, which constitute a particularly tractable class of interacting
particle systems, and look at their convergence to equilibrium. In Section 9.4 we give three
examples in this class: Stochastic Ising Model, Contact Process, Voter Model. In Section 9.5
we take a closer look at the Contact Process.
The standard reference for interacting particle systems is Liggett [9].
9.1 Definitions
An Interacting Particle System (IPS) is a Markov process ξ = (ξt )t≥0 on the state space
$\Omega = \{0,1\}^{\mathbb{Z}^d}$ (or $\Omega = \{-1,1\}^{\mathbb{Z}^d}$), d ≥ 1, where
ξt = {ξt (x) : x ∈ Zd }
denotes the configuration at time t, with ξt (x) = 1 or 0 meaning that there is a “particle” or
“hole” at site x at time t, respectively. Alternative interpretations are
1 = infected/spin-up/democrat
0 = healthy/spin-down/republican.
The configuration changes with time and this models how a virus spreads through a popula-
tion, how magnetic atoms in iron flip up and down as a result of noise due to temperature, or
how the popularity of two political parties evolves in an election campaign.
The evolution is modeled by specifying a set of local transition rates
playing the role of the rate at which the state at site x changes in the configuration η, i.e.,
\[ \eta \to \eta^x \]
with η x the configuration obtained from η by changing the state at site x (either 0 → 1 or
1 → 0). Since there are only two possible states at each site, such systems are called spin-flip
systems.
Remark: It is possible to allow more than two states, e.g. {−1, 0, 1} or N0 . It is also possible
to allow more than one site to change state at a time, e.g. swapping of states 01 → 10 or
10 → 01. In what follows we focus entirely on spin-flip systems.
If c(x, η) depends on η only via η(x), the value of the spin at x, then ξ consists of indepen-
dent spin-flips. In general, however, the rate to flip the spin at x may depend on the spins in
the neighborhood of x (possibly even on all spins). This dependence models an “interaction”
between the spins at different sites. In order for ξ to be well-defined, some restrictions must
be placed on the family in (9.1), e.g. c(x, η) must depend only “weakly” on the states at “far
away” sites (formally, η ↦ c(x, η) is continuous in the product topology), and must not be
“too large” (formally, bounded away from infinity in some appropriate sense).
9.2 Shift-invariant attractive spin-flip systems
Typically it is assumed that
with τy the shift of space over y, i.e., (τy η)(x) = η(x − y), x ∈ Zd . Property (9.2) says that the
flip rate at x only depends on the configuration η as seen relative to x, which is natural when
the interaction between spins is “homogeneous in space”. Another useful and frequently used
assumption is that the interaction favors spins that are alike, i.e.,
\[
\eta \preceq \eta' \;\Longrightarrow\;
\begin{cases}
c(x,\eta) \le c(x,\eta') & \text{if } \eta(x)=\eta'(x)=0, \\
c(x,\eta) \ge c(x,\eta') & \text{if } \eta(x)=\eta'(x)=1.
\end{cases}
\tag{9.3}
\]
Property (9.3) says that the spin at x flips up faster in η ′ than in η when η ′ is everywhere
larger than η, but flips down slower. In other words, the dynamics preserves the order ⪯.
Spin-flip systems with this property are called attractive.
Exercise 9.1 Give the proof of the above statement with the help of maximal coupling.
We next give three examples of systems satisfying properties (9.2) and (9.3).
which means that spins prefer to align with the majority of the neighboring spins.
which means that sites choose a random neighbor at rate 1 and adopt the opinion of
that neighbor.
Exercise 9.2 Check that these three examples indeed satisfy properties (9.2) and (9.3).
In the sequel we will discuss each model in some detail, with coupling techniques playing
a central role. We will see that properties (9.2) and (9.3) allow for a number of interesting
conclusions about the equilibrium behavior of these systems, as well as the convergence to
equilibrium.
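Order preservation is particularly transparent for the voter dynamics described above, where a site adopts the opinion of a randomly chosen neighbor. The Python sketch below runs two copies on a finite ring with the same choices of site and neighbor and tracks whether the componentwise order survives. It follows only the embedded jump chain of the rate-1 clocks, and the ring geometry, function name and step count are illustrative assumptions.

```python
import random


def coupled_voter_models(eta, zeta, steps, rng):
    """Evolve two voter configurations on a ring using the same updates.

    Returns (eta_final, zeta_final, ordered), where `ordered` records whether
    eta(x) <= zeta(x) held at all sites at all times.
    """
    size = len(eta)
    eta, zeta = list(eta), list(zeta)
    ordered = all(a <= b for a, b in zip(eta, zeta))
    for _ in range(steps):
        x = rng.randrange(size)                 # site whose clock rings
        y = (x + rng.choice((-1, 1))) % size    # uniformly chosen neighbor on the ring
        eta[x], zeta[x] = eta[y], zeta[y]       # both copies adopt the neighbor's opinion
        ordered = ordered and eta[x] <= zeta[x]
    return eta, zeta, ordered


if __name__ == "__main__":
    rng = random.Random(0)
    eta0 = [0, 1, 0, 0, 1, 0, 0, 0]
    zeta0 = [1, 1, 0, 1, 1, 0, 1, 0]
    print(coupled_voter_models(eta0, zeta0, 1000, rng))
```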
9.3 Convergence to equilibrium
Write [0] and [1] to denote the configurations η ≡ 0 and η ≡ 1, respectively. These are the
smallest, respectively, the largest configurations in the partial order, and hence
\[ [0] \preceq \eta \preceq [1], \qquad \forall\, \eta\in\Omega. \]
Since the dynamics preserves the partial order, we can obtain information about what happens
when the system starts from any η ∈ Ω by comparing with what happens when it starts from
[0] or [1].
Interacting particles can be described by semigroups of transition kernels P = (Pt )t≥0 .
Formally, Pt is an operator acting on Cb (Ω), the space of bounded continuous functions on Ω,
as
(Pt f )(η) = Eη [f (ξt )], η ∈ Ω, f ∈ Cb (Ω).
If this definition holds on a dense subset of Cb (Ω), then it uniquely determines Pt .
Exercise 9.3 Check that P0 is the identity and that Ps+t = Pt ◦ Ps for all s, t ≥ 0 (where ◦
denotes composition). For the latter, use the Markov property of ξ at time s.
Alternatively, the semigroup can be viewed as acting on the space of probability measures µ
on Ω via the duality relation
\[ \int_\Omega f\,d(\mu P_t) = \int_\Omega (P_t f)\,d\mu, \qquad f\in C_b(\Omega). \]
Lemma 9.4 Let P = (Pt )t≥0 denote the semigroup of transition kernels associated with ξ.
Write δη Pt to denote the law of ξt conditional on ξ0 = η (which is a probability distribution
on Ω). Then
$t \mapsto \delta_{[0]}P_t$ is stochastically increasing,
$t \mapsto \delta_{[1]}P_t$ is stochastically decreasing.
Proof. For t, h ≥ 0,
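\[
\begin{aligned}
\delta_{[0]}P_{t+h} &= (\delta_{[0]}P_h)P_t \succeq \delta_{[0]}P_t, \\
\delta_{[1]}P_{t+h} &= (\delta_{[1]}P_h)P_t \preceq \delta_{[1]}P_t,
\end{aligned}
\]
where we use that $\delta_{[0]}P_h \succeq \delta_{[0]}$ and $\delta_{[1]}P_h \preceq \delta_{[1]}$ for any h ≥ 0, and also use Strassen’s theorem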
(Theorem 7.8) to take advantage of the coupling representation that goes with the partial
order.
exist as probability distributions on Ω and are equilibria for the dynamics. Any other equilibrium
π satisfies $\underline{\nu} \preceq \pi \preceq \overline{\nu}$.
Proof. This is an immediate consequence of Lemma 9.4 and the sandwich $\delta_{[0]}P_t \preceq \delta_\eta P_t \preceq \delta_{[1]}P_t$
for η ∈ Ω and t ≥ 0.
The class of all equilibria for the dynamics is a convex set in the space of signed bounded
measures on Ω. An element of this set is called extremal if it is not a proper linear combination
of any two distinct elements in the set, i.e., not of the form pν1 + (1 − p)ν2 for some p ∈ (0, 1)
and ν1 ≠ ν2 .
and p ∈ (0, 1), it follows that both inequalities must be equalities. Since the integrals of
increasing functions determine the measure with respect to which one integrates, it follows
that ν1 = ν = ν2 .
Exercise 9.7 Prove that integrals of increasing functions determine the measure.
Corollary 9.8 The following three properties are equivalent (for shift-invariant spin-flip sys-
tems):
1. ξ is ergodic (i.e., δη Pt converges to the same limit distribution as t → ∞ for all η),
2. there is a unique stationary distribution,
3. ν̲ = ν̄ (the lower and upper equilibria coincide).
Proof. Obvious because of the sandwiching of all the configurations between [0] and [1].
β and is denoted by νβ . In the second case (“low temperature”), there are two extremal
equilibria, both of which depend on β and are denoted by
νβ+ = “plus state” with ∫_Ω η(0) νβ+ (dη) > 0,
νβ− = “minus state” with ∫_Ω η(0) νβ− (dη) < 0,
which are called the magnetized states. Note that νβ+ and νβ− are images of each other under
the swapping of +1’s and −1’s. It can be shown that in d = 2 all equilibria are a convex
combination of νβ+ and νβ− , while in d ≥ 3 also other equilibria are possible (e.g. not
shift-invariant) when β is large enough. It turns out that β1 = ∞, i.e., in d = 1 the SIM is ergodic
for all β > 0.
9.5 A closer look at the Contact Process
We will next prove (i-iii) in Lemma 9.9. This will take up some space, organized into Sec-
tions 9.5.1–9.5.4. In the proof we need a property of the CP called self-duality. We will not
explain in detail what this is, but only say that it means the following:
CP locally dies out (in the sense of weak convergence) starting from δ[1] if and only if
CP fully dies out when starting from a configuration with finitely many infections, e.g.,
{0}.
For details we refer to Liggett [9].
Pick A0 finite and consider the CP in dimension d with parameter λ starting from the set A0
as the set of infected sites. Let A = (At )t≥0 with At the set of infected sites at time t. Then
where the latter holds because each site in At has at most 2d non-infected neighbors. Now
consider the two random processes X = (Xt )t≥0 with Xt = |At | and Y = (Yt )t≥0 given by the
birth-death process on N0 that moves at rate n from n to n − 1 (death) and at rate (2dλ)n
from n to n + 1 (birth), both starting from n0 = |A0 |. Then X and Y can be coupled such
that
P̂(Xt ≤ Yt ∀ t ≥ 0) = 1,
where P̂ denotes the coupling measure. Note that n = 0 is a trap for both X and Y . If
2dλ < 1, then this trap is hit with probability 1 by Y , i.e., limt→∞ Yt = 0 a.s., and hence also
by X, i.e., limt→∞ Xt = 0 a.s. Therefore νλ = δ[0] when 2dλ < 1. Consequently, 2dλd ≥ 1.
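The extinction of the dominating process Y can be illustrated by simulating its jump chain. This is a rough sketch under stated assumptions: the function names, the number of runs, the cap on the number of jumps and the parameter values are all illustrative. From n > 0 the chain moves to n + 1 with probability 2dλ/(1 + 2dλ) and to n − 1 otherwise, since the birth and death rates are (2dλ)n and n.

```python
import random

def dies_out(b, n0, rng, max_jumps=2000):
    """Follow the jump chain of Y with per-capita birth rate b and death rate 1.

    Returns True if the trap 0 is reached within max_jumps jumps.
    """
    n = n0
    for _ in range(max_jumps):
        if n == 0:
            return True
        n += 1 if rng.random() < b / (1.0 + b) else -1
    return n == 0

def extinct_fraction(d, lam, n0, runs=200, seed=0):
    """Fraction of simulated runs of Y (with b = 2 d lam) that hit 0."""
    rng = random.Random(seed)
    b = 2 * d * lam
    return sum(dies_out(b, n0, rng) for _ in range(runs)) / runs

subcritical = extinct_fraction(d=1, lam=0.25, n0=5)   # 2 d lam = 0.5 < 1
supercritical = extinct_fraction(d=1, lam=1.0, n0=5)  # 2 d lam = 2 > 1
print(subcritical, supercritical)
```

For 2dλ < 1 every run should die out. For 2dλ > 1 most runs of Y survive, which by itself says nothing about X, because the comparison X ≤ Y only goes one way.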
πd (x1 , . . . , xd ) = x1 + · · · + xd .
P̂(Bt ⊆ πd (At ) ∀ t ≥ 0) = 1.
which implies that if A dies out then also B dies out. In other words, if λ ≤ λd , then dλ ≤ λ1
(where dλ is the infection parameter of B), which implies that dλd ≤ λ1 as claimed.
The construction of the coupling is as follows. Fix t ≥ 0. Suppose that At = A and Bt = B
with B ⊂ πd (A). For each y ∈ B there is at least one x ∈ A with y = πd (x). Pick one such x
for every y (e.g. choose the closest up or the closest down). Now couple:
– If x becomes healthy, then y becomes healthy too.
– If x infects any of the d sites x − ei with i = 1, . . . , d, then y infects y − 1.
– If x infects any of the d sites x + ei with i = 1, . . . , d, then y infects y + 1.
– Anything that occurs at other x′ ’s such that πd (x′ ) = y, has no effect on y.
(This is a mapping that defines how Bt evolves given how At evolves.)
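The reason the projected process infects each neighbour at d times the original rate can be seen by counting. The small sketch below (the helper names are illustrative, not from the notes) checks that of the 2d neighbours of a site x in Z^d, exactly d project onto πd (x) + 1 and exactly d onto πd (x) − 1.

```python
def pi_d(x):
    """Projection from Z^d to Z used in the coupling: the sum of the coordinates."""
    return sum(x)

def neighbour_projections(x):
    """Count the neighbours of x that project to pi_d(x) + 1 and to pi_d(x) - 1."""
    d, y = len(x), pi_d(x)
    up = down = 0
    for i in range(d):
        for sign in (+1, -1):
            z = list(x)
            z[i] += sign          # the neighbour x + sign * e_i
            if pi_d(z) == y + 1:
                up += 1
            elif pi_d(z) == y - 1:
                down += 1
    return up, down

counts = neighbour_projections((2, -1, 5))
print(counts)  # → (3, 3)
```

Since x infects each neighbour at rate λ, the d neighbours x + ei together produce an infection of y + 1 at rate dλ, and likewise for y − 1.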
Exercise 9.10 Check that this coupling has the right marginals and preserves the inclusion
Bt ⊆ πd (At ).
Since A0 = B0 = {0} and {0} ⊂ πd ({0}), the proof is complete.
9.5.4 Finite critical value in dimension 1
The proof proceeds via comparison with directed site percolation on Z2 . We first make a
digression into this part of percolation theory.
Each site is open with probability p and closed with probability 1 − p, independently of all
other sites, with p ∈ [0, 1]. The associated probability law on configuration space is denoted
by Pp . We say that y is connected to x, written as x ⇝ y, if there is a path from x to y such
that
C0 = {x ∈ H : 0 ⇝ x}
is called the cluster of the origin (C0 = ∅ if 0 is closed). The percolation function is
The uniqueness of pc follows from the monotonicity of p ↦ θ(p) proved in Section 8.1.
Lemma 9.11 pc ≤ 80/81.
CN = ∪_{i=0}^{N} {x ∈ H : (−2i, 0) ⇝ x}
= all sites connected to the lower left boundary of H (including the origin).
We want to lay a contour around CN . To do so, we consider the oriented lattice that is
obtained by shifting all sites and bonds downward by 1. We call this the dual lattice, because
the two lattices together make up Z2 (with upward orientation). Now define
ΓN = the exterior boundary of the set of all faces in the dual lattice
containing a site of CN or one of the boundary sites (−2i + 1, −1),
with i = 1, . . . , N.
Think of ΓN as a path from (0, −1) to (−2N, −1) in the dual lattice, enclosing CN and
being allowed to cross bonds in both directions. We call ΓN the contour of CN (this contour
may be infinite). We need the following observations:
(i) There are at most 4 · 3^(n−2) contours of length n.
(ii) Any contour of length n has at least n/4 closed sites adjacent to it on the outside.
If p > 80/81, then 3(1 − p)^(1/4) < 1 and the sum is < 1 for N sufficiently large, i.e., Pp (|CN | =
∞) > 0 for N ≥ N0 (p). Using the translation invariance, we have
Hence, if p > 80/81, then Pp (|C0 | = ∞) > 0, which implies that pc ≤ 80/81.
The contour argument above is referred to as a Peierls argument. A similar argument
works for many other models as well (such as SIM).
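The role of the threshold 80/81 can be made concrete numerically. The sketch below combines observations (i) and (ii): a contour of length n occurs with probability at most (1 − p)^(n/4), and there are at most 4 · 3^(n−2) of them. The exact form of the sum, and in particular the assumption that contours around CN have length at least 2N, are assumptions made here for illustration, since the corresponding display is not reproduced in this text.

```python
def peierls_ratio(p):
    """Ratio r = 3 (1 - p)^(1/4) of the geometric series built from (i) and (ii)."""
    return 3.0 * (1.0 - p) ** 0.25

def contour_sum(p, n_min):
    """Sum over n >= n_min of 4 * 3^(n-2) * (1-p)^(n/4); infinite if r >= 1."""
    r = peierls_ratio(p)
    if r >= 1.0:
        return float("inf")
    return (4.0 / 9.0) * r ** n_min / (1.0 - r)

def smallest_N(p):
    """Smallest N with contour_sum(p, 2N) < 1 (assumes minimal contour length 2N)."""
    if peierls_ratio(p) >= 1.0:
        raise ValueError("the series diverges unless p > 80/81")
    N = 1
    while contour_sum(p, 2 * N) >= 1.0:
        N += 1
    return N

print(peierls_ratio(80.0 / 81.0))   # equals 1 up to rounding: the borderline case
N0 = smallest_N(0.99)
print(N0, contour_sum(0.99, 2 * N0))
```

The ratio drops below 1 exactly when 1 − p < 3^(−4), i.e. p > 80/81, which is where the constant in Lemma 9.11 comes from.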
Lemma 9.11 is the key to proving that λ1 < ∞, as we next show. The proof uses a
coupling argument showing that the one-dimensional CP observed at times 0, δ, 2δ, . . . with
δ = (1/λ) log(λ + 1) dominates oriented percolation with p = p(λ) given by
p(λ) = (λ/(λ + 1))^2 (1/(λ + 1))^(2/λ).
Since limλ→∞ p(λ) = 1 and pc ≤ 80/81 < 1, the infection (locally) survives for λ large enough.
Exercise 9.14 Show that A = (At )t≥0 is the CP with parameter λ starting from A0 = {0}.
(ii) between time nδ and (n + 1)δ there are both an i+ and an i− .
Define
Bnδ = the set of x ∈ Z such that 0 ⇝ (x, nδ).
Exercise 9.15 Show that Bnδ = {x ∈ Z : (x, nδ) ∈ C0 }, where C0 is the cluster of the origin
in oriented percolation with p = e^(−2δ) (1 − e^(−δλ))^2 .
Anδ ⊃ Bnδ ∀ n ∈ N0 .
we obtain, with the help of Lemma 9.11, that the one-dimensional CP with parameter λ survives
if
sup_{δ>0} e^(−2δ) (1 − e^(−δλ))^2 > 80/81,
where in the left-hand side we optimize over δ, which is allowed because the previous estimates
hold for all δ > 0. The supremum is attained at
δ = (1/λ) log(λ + 1),
which yields the claim in Lemma 9.13.
Since limλ→∞ p(λ) = 1, it follows from Lemma 9.13 that λ1 < ∞.
Remark: The bound in Lemma 9.13 yields λ1 ≤ 1318. This is a large number because the
estimates that were made are crude. The true value is λ1 ≈ 1.6494, based on simulations and
approximation techniques.
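The numerical bound in the remark can be reproduced with a short computation. The sketch below evaluates p = e^(−2δ)(1 − e^(−δλ))^2 at the optimal δ = (1/λ) log(λ + 1) and searches over integer values of λ for the first one with p(λ) > 80/81. The function names are illustrative, and restricting the search to integers is an assumption made to match the figure quoted above.

```python
import math

def delta_star(lam):
    """Optimal time step: delta = (1/lam) * log(lam + 1)."""
    return math.log(lam + 1.0) / lam

def percolation_p(lam, delta):
    """p = e^{-2 delta} (1 - e^{-delta lam})^2, as in Exercise 9.15."""
    return math.exp(-2.0 * delta) * (1.0 - math.exp(-delta * lam)) ** 2

def p_of_lambda(lam):
    """Closed form at delta_star: (lam/(lam+1))^2 * (1/(lam+1))^(2/lam)."""
    return (lam / (lam + 1.0)) ** 2 * (lam + 1.0) ** (-2.0 / lam)

def smallest_integer_lambda(threshold=80.0 / 81.0):
    """First integer lam with p(lam) > threshold (p(lam) tends to 1, so this ends)."""
    lam = 1
    while p_of_lambda(lam) <= threshold:
        lam += 1
    return lam

lam = 5.0
d = delta_star(lam)
# delta_star should beat nearby values of delta, and agree with the closed form.
print(percolation_p(lam, d), percolation_p(lam, d - 0.05), percolation_p(lam, d + 0.05))
print(smallest_integer_lambda())
```

The search should stop at λ = 1318, the value quoted in the remark. The large gap to the simulated value reflects how much is lost in the Peierls estimate.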
10 Diffusions
In Section 10.1 we couple diffusions in dimension 1, in Section 10.2 diffusions in dimension d.
with ⌈·⌉ the upper integer part. Here, ⟹ denotes convergence in path space endowed with a
metric that is “a kind of flexible supremum norm”, called the Skorohod norm.
Brownian motion B = (Bt )t≥0 is a Markov process taking values in R and having continuous
paths. The law of B is called the Wiener measure, a probability measure on the set
of continuous paths such that increments over disjoint time intervals are independent and
normally distributed. To define B properly requires a formal construction that is part of
stochastic analysis, a subarea of probability theory that uses functional analytic machinery to
study continuous-time random processes taking values in R. B is an example of a diffusion.
Definition 10.1 A diffusion X = (Xt )t≥0 is a Markov process on R with continuous paths
having the strong Markov property.
We write Px to denote the law of X given X0 = x ∈ R. The sample space Ω is the space
of continuous functions with values in R, written CR [0, ∞), endowed with the Borel σ-algebra
𝒞R [0, ∞) of subsets of CR [0, ∞) generated by the Skorohod topology.
Remark: The time interval need not be [0, ∞). It can also be (−∞, ∞), [0, 1], etc., depending
on what X describes. It is also possible that X takes values in Rd , d ≥ 1, etc.
An example of a diffusion is X solving the stochastic differential equation
dXt = b(Xt ) dt + σ(Xt ) dBt , (10.1)
where b(Xt ) denotes the local drift function and σ(Xt ) the dispersion function. The integral
form of (10.1) reads
Xt = X0 + ∫_0^t b(Xs ) ds + ∫_0^t σ(Xs ) dBs ,
where the last integral is a so-called “Itô-integral”. Equation (10.1) is short-hand for the
statement:
The increment of X over the infinitesimal time interval [t, t + dt) is a sum of two
parts, b(Xt )dt and σ(Xt )dBt , with dBt the increment of B over the same time
interval.
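The infinitesimal description above suggests a simple discretisation: replace dt by a small step and dBt by a centred Gaussian with variance equal to that step. The following is a minimal sketch of this Euler-Maruyama scheme, which is a standard numerical method and not something taken from these notes. The function name, step count and example coefficients are illustrative assumptions.

```python
import math
import random

def euler_maruyama(b, sigma, x0, T, n_steps, seed=0):
    """Approximate a path of dX = b(X) dt + sigma(X) dB on [0, T].

    Each step adds b(x) * dt plus sigma(x) times a N(0, dt) Brownian increment.
    """
    rng = random.Random(seed)
    dt = T / n_steps
    path = [x0]
    for _ in range(n_steps):
        x = path[-1]
        dB = rng.gauss(0.0, math.sqrt(dt))   # increment of B over [t, t + dt)
        path.append(x + b(x) * dt + sigma(x) * dB)
    return path

# b = 0, sigma = 1: an approximation of Brownian motion itself.
brownian = euler_maruyama(lambda x: 0.0, lambda x: 1.0, x0=0.0, T=1.0, n_steps=1000)

# A mean-reverting drift b(x) = -x with sigma = 1 (illustrative choice).
mean_reverting = euler_maruyama(lambda x: -x, lambda x: 1.0, x0=2.0, T=1.0, n_steps=1000)

# With sigma = 0 the scheme reduces to the Euler method for dx/dt = b(x).
drift_only = euler_maruyama(lambda x: 1.0, lambda x: 0.0, x0=0.5, T=2.0, n_steps=200)
print(brownian[-1], mean_reverting[-1], drift_only[-1])
```

The scheme only approximates the law of X. Making the convergence precise as the step tends to zero requires the regularity conditions on b and σ.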
Again, a formal definition of (10.1) requires functional analytic machinery. The functions
b : R → R and σ : R → R need to satisfy mild regularity properties, e.g. locally Lipschitz
continuous and modest growth at infinity. The solution of (10.1) is called an Itô-diffusion.
The special case with b ≡ 0, σ ≡ 1 is Brownian motion itself. The interpretation of X is:
Px (τb < τa ) = (s(x) − s(a)) / (s(b) − s(a)) ∀ a, b ∈ R, a < x < b,
for some s : R → R continuous and strictly increasing. This s is called the scale function for
X. A diffusion is “in natural scale” when s is the identity. An example of such a diffusion is
Brownian motion B. More generally, Y = (Yt )t≥0 with Yt = s(Xt ) is in natural scale, and is
an Itô-diffusion with b ≡ 0.
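The exit formula is easy to evaluate once s is known. The sketch below treats two cases: Brownian motion, where s is the identity, and Brownian motion with constant drift µ, for which s(x) = −exp(−2µx) is a scale function (a standard fact quoted here as an assumption, since the notes do not treat this example). It also checks that Y = s(X) is in natural scale: the exit probability computed for X with s agrees with the one computed for Y with the identity.

```python
import math

def prob_hit_b_first(s, a, x, b):
    """P_x(tau_b < tau_a) = (s(x) - s(a)) / (s(b) - s(a)) for a < x < b."""
    if not a < x < b:
        raise ValueError("need a < x < b")
    return (s(x) - s(a)) / (s(b) - s(a))

def identity(x):
    """Scale function of a diffusion in natural scale, e.g. Brownian motion."""
    return x

MU = 0.5  # illustrative drift

def drift_scale(x):
    """Scale function of Brownian motion with drift MU (assumed standard form)."""
    return -math.exp(-2.0 * MU * x)

p_bm = prob_hit_b_first(identity, 0.0, 1.0, 4.0)
p_drift = prob_hit_b_first(drift_scale, 0.0, 1.0, 4.0)

# Y = s(X) started at s(1) exits the interval (s(0), s(4)); its scale is the identity.
p_y = prob_hit_b_first(identity, drift_scale(0.0), drift_scale(1.0), drift_scale(4.0))
print(p_bm, p_drift, p_y)
```

As expected, a positive drift makes it more likely that the upper endpoint is hit first than for Brownian motion.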
Px (τy < ∞) = 1 ∀ x, y ∈ R.
and so recurrence implies that P̂xx′ (T < ∞) = 1 for all x, x′ ∈ R, with P̂xx′ = Px ⊗ Px′ the
independent coupling.
For recurrent diffusions on the full-line a similar result holds. The existence of a successful
coupling is proved as follows. Without loss of generality we assume that X is in natural scale.
Fix x < y and pick 0 < N1 < N2 < · · · such that
|Pz (τ_{A_k} = N_k ) − 1/2| ≤ 1/4, z ∈ A_{k−1} , k ∈ N,
with Ak = {−Nk , Nk } and A0 = {x, y}. Then, by the skip-freeness, we have
P̂xy (X_{τ_{A_k}} ≤ X′_{τ′_{A_k}} for 1 ≤ k ≤ l) ≤ [1 − (1/4)^2 ]^l , l ∈ N,
Theorem 10.5 Regular diffusions X have the strong Feller property, i.e., for any bounded
f : R → R and any t > 0, the function Pt f defined by
Exercise 10.6 Prove the latter statement by using an argument of the type given for the
successful coupling on the full-line, but now with shrinking rather than growing intervals.
The Feller property is important because it says that the space of bounded continuous
functions is preserved by the semigroup P = (Pt )t≥0 . Since this set is dense in the space of
continuous functions, the Feller property allows us to control very large sets of functionals of
diffusions.
Theorem 10.7 Let P = (Pt )t≥0 be the semigroup of a regular diffusion. Then
λ ≤ µ ⟹ λPt ≤ µPt ∀ t ≥ 0.
Proof. This is immediate from the skip-freeness, by which λ ≤ µ allows X0 ≤ X0′ , and hence
Xt ≤ Xt′ for all t ≥ 0, when X0 , X0′ start from λ, µ.
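The mechanism behind this proof can be seen in a discrete analogue. The sketch below is an illustration under stated assumptions, not the diffusion coupling itself: two simple random walks on Z, started from x0 ≤ y0 with y0 − x0 even, move independently until they meet and together afterwards. Because the gap changes by −2, 0 or +2 per step, the walks cannot jump over each other, which is the discrete counterpart of skip-freeness.

```python
import random

def coupled_walks(x0, y0, n_steps, seed=0):
    """Run X and X' independently until they meet, then let X' follow X."""
    if x0 > y0 or (y0 - x0) % 2:
        raise ValueError("need x0 <= y0 with y0 - x0 even")
    rng = random.Random(seed)
    xs, ys = [x0], [y0]
    for _ in range(n_steps):
        dx = rng.choice((-1, 1))
        # Before meeting: independent step. After meeting: copy the step of X.
        dy = dx if xs[-1] == ys[-1] else rng.choice((-1, 1))
        xs.append(xs[-1] + dx)
        ys.append(ys[-1] + dy)
    return xs, ys

xs, ys = coupled_walks(0, 6, 2000)
order_kept = all(x <= y for x, y in zip(xs, ys))
meeting_times = [t for t in range(len(xs)) if xs[t] == ys[t]]
print(order_kept, meeting_times[0] if meeting_times else None)
```

The ordering holds at all times, whether or not the walks have met within the simulated horizon.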
and recurrence becomes
i.e., points are replaced by small balls around points in all statements about hitting times.
Itô-diffusions are defined by
where b : Rd → Rd and σ : Rd → Rd × Rd are the vector local drift function and the matrix local
dispersion function, both subject to regularity properties.
Diffusions in Rd , d ≥ 2, are more difficult to analyze than in R. A lot is known for special
classes of diffusions (e.g. with certain symmetry properties). Stochastic analysis has developed
a vast arsenal of ideas, results and techniques. The stochastic differential equation in (10.2) is
very important because it has a wide range of applications, e.g. in transport, finance, filtering,
coding, statistics, genetics, etc.
References
[1] O. Angel, J. Goodman, F. den Hollander and G. Slade, Invasion percolation on regular
trees, Annals of Probability 36 (2008) 420–466.
[2] A.D. Barbour, L. Holst and S. Janson, Poisson Approximation, Oxford Studies in Prob-
ability 2, Clarendon Press, Oxford, 1992.
[3] P. Diaconis, The cutoff phenomenon in finite Markov chains, Proc. Natl. Acad. Sci. USA
93 (1996) 1659–1664.
[5] O. Häggström, Finite Markov Chains and Algorithmic Applications, London Mathemat-
ical Society Student Texts 52, Cambridge University Press, Cambridge, 2002.
[6] F. den Hollander and M.S. Keane, Inequalities of FKG type, Physica 138A (1986) 167–
182.
[8] D.A. Levin, Y. Peres and E.L. Wilmer, Markov Chains and Mixing Times, American
Mathematical Society, Providence RI, 2009.
[9] T.M. Liggett, Interacting Particle Systems, Grundlehren der mathematischen
Wissenschaften 276, Springer, New York, 1985.
[11] T. Lindvall, Lectures on the Coupling Method, John Wiley & Sons, New York, 1992.
Reprint: Dover paperback edition, 2002.
[12] H. Nooitgedagt, Two convergence limits of Markov chains: Cut-off and Metastability,
MSc thesis, Mathematical Institute, Leiden University, 31 August 2010.
[13] J.A. Rice, Mathematical Statistics and Data Analysis (3rd edition), Duxbury Advanced
Series, Thomson Brooks/Cole, Belmont, California, 2007.
[15] H. Thorisson, Coupling, Stationarity and Regeneration, Springer, New York, 2000.