
Introduction to Probability and Stochastic Processes

Matthew Lorig¹

This version: July 3, 2019

¹ Department of Applied Mathematics, University of Washington, Seattle, WA, USA. e-mail: mlorig@uw.edu
Contents

Preface vii

1 Review of probability 1
1.1 Events as sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Infinite probability spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Uniform Lebesgue measure on (0, 1) . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Infinite sequence of coin tosses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Random variables and distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Stochastic Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6.1 Integration in the Lebesgue sense . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6.2 Computing expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.7 Change of measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2 Information and conditioning 19


2.1 Information and σ-algebras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Conditional expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Stopping times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3 Generating and Characteristic Functions 35


3.1 Generating functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Branching processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Characteristic functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39


3.4 LLN and the CLT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42


3.5 Large Deviations Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4 Discrete time Markov chains 49


4.1 Overview of discrete time Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Classification of states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Classification of chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4 Stationary distributions and the limit theorem . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5 Reversibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.6 Chains with finitely many states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5 Continuous time Markov chains 63


5.1 The Poisson process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Overview of continuous time Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Reversibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3.1 Potential and entropy production . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

6 Convergence of random variables 81


6.1 A motivating example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2 Modes of convergence defined . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.3 Convergence in one mode implies convergence in another? . . . . . . . . . . . . . . . . . . 84
6.4 Continuity of probability measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.5 lim sup and lim inf of sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.6 Borel-Cantelli Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.7 Martingale convergence theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.8 Uniform Integrability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

7 Brownian motion 103


7.1 Scaled random walks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.1.1 Symmetric random walk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.1.2 Scaled symmetric random walk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.2 Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

7.3 Quadratic variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107


7.4 Markov property of Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.5 First hitting time of Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.6 Reflection principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

8 Stochastic calculus 115


8.1 Itô Integrals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.2 Itô formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.3 Multivariate stochastic calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8.4 Brownian bridge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
8.5 Girsanov’s Theorem for a single Brownian motion . . . . . . . . . . . . . . . . . . . . . . 132
8.6 Girsanov’s Theorem for d-dimensional Brownian motion . . . . . . . . . . . . . . . . . . . 135
8.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

9 SDEs and PDEs 139


9.1 Stochastic differential equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
9.2 Connection to partial differential equations . . . . . . . . . . . . . . . . . . . . . . . . . . 142
9.3 Kolmogorov forward and backward Equations . . . . . . . . . . . . . . . . . . . . . . . . . 146
9.4 Poisson equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
9.5 In depth look: scalar time-homogeneous diffusions . . . . . . . . . . . . . . . . . . . . 152
9.6 Extensions to higher dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
9.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

10 Jump diffusions 169


10.1 Basic definitions and results on Lévy processes . . . . . . . . . . . . . . . . . . . . . . . . 169
10.2 The Itô formula for Lévy-Itô processes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
10.3 Lévy-Itô SDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
10.4 Change of measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
10.5 Hawkes processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
10.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

11 Stochastic Control 197


11.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
11.2 The Dynamic Programming Principle and the HJB PDE . . . . . . . . . . . . . . . . . . . 199
11.3 Solving the HJB PDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

11.4 HJB equations associated with other cost functionals . . . . . . . . . . . . . . . . . . . . . 204


11.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

12 Optimal Stopping 207


12.1 Strong Markov processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
12.2 Optimal Stopping Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
12.3 Variational inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
12.4 Example: Optimal stopping of resource extraction . . . . . . . . . . . . . . . . . . . . . . 211
12.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

13 Monte Carlo Methods 215


13.1 Overview of Monte Carlo methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
13.2 Generating Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
13.2.1 Inverse Transform Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
13.2.2 Acceptance-Rejection method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
13.2.3 Generating Gaussian RVs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
13.3 Simulating Discrete-Time Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
13.4 Simulating Continuous-Time Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . 222
13.5 Simulating Diffusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
13.5.1 Non-random coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
13.5.2 Ornstein-Uhlenbeck and other Gaussian Processes . . . . . . . . . . . . . . . . . . 224
13.5.3 Euler Discretization of an SDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
13.6 Simulating Jump-Diffusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
13.6.1 Euler Discretization on a fixed time grid . . . . . . . . . . . . . . . . . . . . . . . . 226
13.6.2 Euler Discretization on a random time grid . . . . . . . . . . . . . . . . . . . . . . 227
13.7 Variance Reduction Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
13.7.1 Control Variates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
13.7.2 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
13.7.3 Antithetic Variates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
13.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
Preface

These notes are intended to give first-year PhD students in applied mathematics a broad introduction
to measure-theoretic probability and stochastic processes. Because the focus of this course is on the
applied aspects of these topics, we will sometimes forgo mathematical rigor, favoring instead a heuristic
development. The mathematical statements in these notes should be taken as “true in spirit,” but perhaps
not always rigorously true in the mathematical sense. The hope is that what the notes lack in rigor,
they make up for in clarity. Each chapter begins with a list of references, which the interested student can
go to for rigorously true statements.

It should be noted that these notes are a work in progress. Students are encouraged to e-mail the
professor if they find errors.

Acknowledgments
The author of these notes wishes to express his sincere thanks to Weston Barger and Yu-Chen Cheng for
checking and writing homework solutions as well as making corrections and improvements to the text.

Chapter 1

Review of probability

The notes from this chapter are taken primarily from (Shreve, 2004, Chapter 1) and (Grimmett and
Stirzaker, 2001, Chapters 1–5).

1.1 Events as sets


Definition 1.1.1. The set of all possible outcomes of an experiment is called the sample space. We
denote the sample space as Ω.

We will typically denote by ω a generic element of Ω.

Definition 1.1.2. An event is a subset of the sample space. We usually denote events by capital roman
letters A, B, C, . . ..

Example 1.1.3 (Toss a coin once). When one tosses a coin, there are two possible outcomes: heads
(H) and tails (T). Thus, we have Ω = {H, T}. One element of Ω is, e.g., ω = H. Possible event: “toss a
head,” A = {H}.

Example 1.1.4 (Toss two distinguishable coins). Ω = {(HH), (HT), (TH), (TT)}. One element of
Ω is, e.g., ω = (HT). Possible event: “second toss a tail,” A = {(HT), (TT)}.

Example 1.1.5 (Toss two indistinguishable coins). Ω = {{HH}, {HT}, {TT}}. One element of Ω
is, e.g., ω = {HH}. Possible event: “coins match,” A = {{HH}, {TT}}.

Example 1.1.6 (Roll a die). Ω = {1, 2, 3, 4, 5, 6}. One element of Ω is, e.g., ω = 2. Possible event:
“roll an odd number.” A = {1, 3, 5}.


Here, we use (·) to denote an ordered sequence and we use {·} to denote an unordered set. Thus,
(HT) ≠ (TH) but {HT} = {TH}.

If A and B are subsets of Ω, we can reasonably concern ourselves with events such as “not A” (Ac), “A
or B” (A ∪ B), “A and B” (A ∩ B), etc. A σ-algebra is a mathematical way to describe all possible sets of
interest for a given sample space Ω.

Definition 1.1.7. A collection F of subsets of Ω is called a σ-algebra if it satisfies

1. contains the empty set: ∅ ∈ F;


2. is closed under countable unions: A1 , A2 , A3 , . . . ∈ F ⇒ ∪i Ai ∈ F;
3. is closed under complements: A ∈ F ⇒ Ac ∈ F;

Note that if F is a σ-algebra then Ω ∈ F by items 1 and 3. Note also that F is closed under countable
intersections since

A1, A2, A3, . . . ∈ F ⇒ A1^c, A2^c, A3^c, . . . ∈ F ⇒ ∪i Ai^c ∈ F ⇒ (∪i Ai^c)^c ∈ F ⇒ ∩i Ai ∈ F.

Alternatively, one can define a σ-algebra F as a set of subsets of Ω that contains at least the empty set
∅ and is closed under countable set operations (though not necessarily closed under uncountable set
operations).

Example 1.1.8 (Trivial σ-algebra). The set of subsets F0 := {∅, Ω} of Ω is commonly referred to as
the trivial σ-algebra.

Example 1.1.9. If A is a subset of Ω then FA := {∅, Ω, A, Ac } is a σ-algebra.

Example 1.1.10. The power set of Ω, written 2Ω is the collection of all subsets of Ω. The power set
F = 2Ω is a σ-algebra.

Definition 1.1.11. Let G be a collection of subsets of Ω. The σ-algebra generated by G, written σ(G),
is the smallest σ-algebra that contains G.

By “smallest” σ-algebra we mean the σ-algebra with the fewest sets. One can show (although we will not
do so in these notes) that σ(G) is equal to the intersection of all σ-algebras that contain G.

Example 1.1.12. The collection of sets G = {∅, A, Ω} is not a σ-algebra because it does not contain
Ac . However, we could create a σ-algebra from G by simply adding the set Ac . Thus, we have
σ(G) = {∅, Ω, A, Ac }.
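Since Ω here is finite, σ(G) can even be computed by brute force: start from G and repeatedly add complements and unions until nothing new appears. The following Python sketch (ours, not part of the original notes; the function name generate_sigma_algebra is our own) does exactly this for the die-roll event of Example 1.1.6:

from itertools import combinations

def generate_sigma_algebra(omega, G):
    # Close G under complements and (finite) unions; on a finite Omega
    # this yields the sigma-algebra generated by G.
    sets = {frozenset(), frozenset(omega)} | {frozenset(g) for g in G}
    changed = True
    while changed:
        changed = False
        for a in list(sets):
            comp = frozenset(omega) - a            # closure under complements
            if comp not in sets:
                sets.add(comp); changed = True
        for a, b in combinations(list(sets), 2):   # closure under pairwise unions
            u = a | b
            if u not in sets:
                sets.add(u); changed = True
    return sets

omega = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}  # the event "roll an odd number" from Example 1.1.6
print(sorted(map(sorted, generate_sigma_algebra(omega, [A]))))
# prints [[], [1, 2, 3, 4, 5, 6], [1, 3, 5], [2, 4, 6]], i.e. {∅, Ω, A, A^c}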
Definition 1.1.13. The pair (Ω, F) where Ω is a sample space and F is a σ-algebra of subsets of Ω is
called a measurable space.

1.2 Probability
So far, we have not yet talked about probabilities at all – only outcomes of a random experiment (elements
ω ∈ Ω) and events (subsets A ⊆ Ω). A probability measure assigns probabilities to events.
Definition 1.2.1. A probability measure defined on (Ω, F) is a function P : F → [0, 1] that satisfies
1. P(Ω) = 1;
2. if Ai ∩ Aj = ∅ for i ≠ j, then P(∪i Ai) = Σi P(Ai). (countable additivity)

As we shall see in Section 1.3.2, it is very important to recognize that Item 2 holds only for countable
unions. For an uncountable union it is not true that P(∪α Aα) = Σα P(Aα).

How can we see the well-known result P(Ac ) = 1 – P(A) from the above definition? Simply note that

1 = P(Ω) = P(A ∪ Ac ) = P(A) + P(Ac ).

A probability measure P does not need to correspond to empirically observed probabilities! For example,
from experience, we know that if we toss a fair coin we have P(H) = P(T) = 1/2. However, we can
always define a measure P̃ that assigns different probabilities P̃(H) = p and P̃(T) = 1 – p. As long as
p ∈ [0, 1], the measure P̃ is a probability measure on (Ω, F), where Ω = {H, T} and F = {∅, Ω, {H}, {T}}.

The triple (Ω, F, P) is often referred to as a probability space or probability triple. To review, the sample
space Ω is the collection of all possible outcomes of an experiment. The σ-algebra F is all sets of interest
of an experiment. And the probability measure P assigns probabilities to these sets.

When a sample space is finite Ω = {ω1 , ω2 , . . . , ωn }, we can always take the σ-algebra as the power set
F = 2Ω and construct a probability measure P on (Ω, F) by specifying the probabilities of each individual
outcome P(ωi ) = pi . However, when the sample space Ω is infinite, choosing an appropriate σ-algebra F,
and constructing a probability measure P on (Ω, F) is a more delicate procedure.

1.3 Infinite probability spaces


In this section, we construct probability measures on two infinite sample spaces.

1.3.1 Uniform Lebesgue measure on (0, 1)


In our first example, we will construct a mathematical model for choosing a number at random on the
open interval (0, 1). Thus, we take Ω = (0, 1). A particular outcome is, e.g., ω = 0.5 or ω = 1/√2.
Clearly, the number of possible outcomes is infinite. We would like to construct a probability P so that
all numbers in the interval (0, 1) are equally likely to be chosen. It makes sense, then, to choose

P({ω : ω ∈ (a, b)}) = µ((a, b)) := b – a, 0 < a ≤ b < 1, (1.1)

where we have defined the Lebesgue measure µ. Equation (1.1) tells us how to determine the probability
that ω falls within an open interval. But, in fact, (1.1) tells us more than that. If P is to be a probability
measure, then it must satisfy the countable additivity property given in Definition 1.2.1. Thus, we also
know, for example, that

P({ω : ω ∈ (a, b) ∪ (c, d)}) = P({ω : ω ∈ (a, b)}) + P({ω : ω ∈ (c, d)})
= (b – a) + (d – c), 0 < a < b < c < d < 1.

It is natural to ask: what are all of the subsets of (0, 1) whose probabilities are determined by (1.1)
and the properties of probability measures given in Definition 1.2.1? Surprisingly, the answer is not
the power set 2^(0,1). It turns out that the power set 2^(0,1) has sets whose probabilities are not determined
by (1.1). The sets whose probabilities are uniquely determined by (1.1) and Definition 1.2.1 are the sets
in the σ-algebra generated by the open intervals

B((0, 1)) := σ(O),  where O := {A ⊆ (0, 1) : A = (a, b), 0 ≤ a < b ≤ 1}.

We call B((0, 1)) the Borel σ-algebra on (0, 1). Thus, the appropriate probability space for our experiment
is (Ω, F, P) = ((0, 1), B((0, 1)), µ).

We can generalize the notion of Borel sets to all topological spaces.

Definition 1.3.1. Let Ω be some topological space and let O(Ω) be the set of open sets in Ω. We define
the Borel σ-algebra on Ω, denoted B(Ω), by B(Ω) := σ(O(Ω)).

Remark 1.3.2. Do not worry too much about what exactly Borel σ-algebras are. Just think of them as
“reasonable” sets. In fact, you would have to think very hard to come up with a set that is not a Borel set.

1.3.2 Infinite sequence of coin tosses


In this section we consider an infinite sequence of coin tosses. We define

Ω := the set of infinite sequences of Hs and Ts.



Note that this set is not only infinite but uncountably infinite because there is a one-to-one correspondence
between Ω and the set of reals in [0, 1]. We will denote a generic element of Ω as follows:

ω = ω1 ω2 ω3 . . .

where ωi is the result of the i-th coin toss. We want to construct a σ-algebra for this experiment.

Let us define some σ-algebras. First, consider the trivial σ-algebra

F0 = {∅, Ω}.

Given no information, I can tell if ω is in the sets in F0 because we know ω ∈ Ω and ω ∉ ∅. Next, define
two sets

AH = {ω ∈ Ω : ω1 = H}, AT = {ω ∈ Ω : ω1 = T}.

Noting that AH = AT^c, we see that

F1 := {∅, Ω, AH, AT},

satisfies the conditions of a σ-algebra. Given ω1 it is possible to say whether or not ω is in each of the sets
in F1. For example, if ω1 = H then ω ∈ AH and ω ∈ Ω, but ω ∉ AT and ω ∉ ∅. Next define four sets

AHH := {ω ∈ Ω : ω1 = H, ω2 = H}, AHT := {ω ∈ Ω : ω1 = H, ω2 = T},


ATT := {ω ∈ Ω : ω1 = T, ω2 = T}, ATH := {ω ∈ Ω : ω1 = T, ω2 = H}.

We wish to construct a σ-algebra that contains these sets and the sets in F1 . The smallest such σ-algebra
is
 
F2 = {∅, Ω, AH, AT, AHH, AHT, ATT, ATH, AHH^c, AHT^c, ATT^c, ATH^c,
      AHH ∪ ATH, AHH ∪ ATT, AHT ∪ ATH, AHT ∪ ATT}.

Given ω1 and ω2 , we can say if ω belongs to each of the sets in F2 . Continuing in this way, we can define
a σ-algebra Fn for every n ∈ N. Finally, we take

F := σ(F∞ ), F∞ = ∪n Fn .

One might ask: could we have simply taken F = F∞? Well, F∞ contains every set that can be described
in terms of finitely many coin tosses. However, we may be interested in sets such as “sequences for
which x percent of coin tosses are heads,” and these sets are not in F∞ . It turns out such sets are in F.

Now, we want to construct a probability measure on F. Let us assume the coin tosses are independent (a

term we will describe rigorously later on) and that the probability of a head is p. Setting q = 1 – p, it
should be obvious that

P(∅) = 0, P(Ω) = 1, P(AH ) = p, P(AT ) = q,


P(AHH) = p^2,  P(AHT) = pq,  P(ATH) = pq,  P(ATT) = q^2,  . . .

Continuing in this way, we can define P(A) for every A ∈ F∞ . What about the sets that are in F but
not in F∞ ? It turns out that once we have defined P for sets in F∞ there is only one way to assign
probabilities to those sets that are in F but not in F∞ . We refer the interested reader to Carathéodory’s
Extension Theorem for details.

Now, let us define


A = {ω : lim_{n→∞} (#H in first n coin tosses)/n = 1/2}.

The strong law of large numbers (SLLN) tells us that P(A) = 1 if p = 1/2 and P(A) = 0 if p ≠ 1/2
(if you have not yet seen the SLLN, you should be able to see this from intuition). Now it should be
clear why uncountable additivity does not hold for probability measures. The probability of any given
infinite sequence of coin tosses is zero: P(ω) = 0. If we were to attempt to compute P(A) by adding up
the probabilities P(ω) of all elements ω ∈ A we would find

Σ_{ω∈A} P(ω) = Σ_{ω∈A} 0 = 0 ≠ 1 = P(A),  (when p = 1/2).

Thus, uncountable additivity clearly does not hold.
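To see the SLLN at work numerically, here is a small simulation sketch (ours, not from the notes; it uses numpy, and the seed is arbitrary): for p = 1/2 the fraction of heads among the first n tosses settles near 1/2 as n grows, which is exactly the event A above.

import numpy as np

rng = np.random.default_rng(0)
p = 0.5
for n in [10, 1_000, 100_000]:
    tosses = rng.random(n) < p     # True = head, each with probability p
    print(n, tosses.mean())        # fraction of heads in the first n tosses

With p ≠ 1/2 the printed fractions approach p instead of 1/2, consistent with P(A) = 0 in that case.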

We finish this example (we will come back to it!) with the following definition

Definition 1.3.3. Let (Ω, F, P) be a probability space. If a set A ∈ F satisfies P(A) = 1, we say that
the event A occurs P almost surely (written, P-a.s.).

Note in the example above that, when p = 1/2, we have P(A) = 1 and thus A occurs almost surely. But
it is important to recognize that A ≠ Ω and Ac ≠ ∅. The elements of Ac are part of the sample space Ω,
but they have zero probability of occurring.

1.4 Random variables and distributions


A random variable maps the outcome of an experiment to R. We capture this idea with the following
definition.

Definition 1.4.1. A random variable defined on (Ω, F) is a function X : Ω → R with the property that

{X ∈ A} := {ω ∈ Ω : X(ω) ∈ A} ∈ F,

for all A ∈ B(R).

Observe that any random variable must be defined on a measurable space (Ω, F), as these appear in the
definition. Note, however, that the probability measure P does not appear in the definition. Random
variables are defined independently of a probability measure P.

What does Definition 1.4.1 mean? Recall that a probability measure P defined on (Ω, F) maps F → [0, 1].
In order for us to answer the question “what is the probability that X ∈ A?” we need the set
{X ∈ A} to be in F. And this is precisely what Definition 1.4.1 requires. Why do we only consider sets A ∈ B(R)
rather than any set A ⊂ R? The answer is rather technical and, frankly, not worth exploring at the
moment.

A word on notation: the standard convention is to use capital Roman letters (typically, X, Y, Z) for
random variables and lower case Roman letters (x , y, z ) for real numbers.

Let us look at some random variables.

Example 1.4.2 (Discrete time model for stock prices). Consider the infinite sequence of coin
tosses in Section 1.3.2. We define a sequence of random variables (Sn)n≥0 via

S0(ω) = 1,   Sn+1(ω) = u·Sn(ω) if ωn+1 = H,  d·Sn(ω) if ωn+1 = T.   (1.2)

Here, Sn represents the value of a stock at time n. Note that P(S1 = u) = P(AH ) = p. Likewise
P(S2 = ud) = P(AHT ∪ ATH ) = 2pq. More generally, one can show that
P(Sn = u^k d^(n–k)) = (n choose k) p^k q^(n–k).   (1.3)
Note that if we had simply defined the random variables (Sn)n≥1 as having probabilities given by (1.3), we
would have no information about how, e.g., Sn relates to Sn–1. From the above construction (1.2),
however, we know that if Sn = u^n then Sn–1 = u^(n–1). Thus, the structure of a given probability space,
not just the probabilities of events, is very important.
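As a quick numerical check of (1.3), here is a simulation sketch (ours, not part of the notes; the parameter values are arbitrary): we simulate many coin-toss paths of the model (1.2) and compare the empirical frequency of each value of Sn with the binomial formula.

import numpy as np
from math import comb

rng = np.random.default_rng(1)
p, u, d, n, paths = 0.5, 2.0, 0.5, 5, 100_000
q = 1 - p
heads = rng.random((paths, n)) < p             # tosses omega_1, ..., omega_n per path
k = heads.sum(axis=1)                          # number of up-moves, so S_n = u^k d^(n-k)
for kk in range(n + 1):
    empirical = np.mean(k == kk)
    exact = comb(n, kk) * p**kk * q**(n - kk)  # formula (1.3)
    print(kk, round(float(empirical), 4), round(exact, 4))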

Example 1.4.3. Let (Ω, F) = ((0, 1), B((0, 1))). Define random variables X(ω) = ω and Y(ω) = 1 – ω.
Clearly, we have X = 1 – Y. Now, suppose we define P(dω) := dω. Then X and Y have the same
distribution. For x ∈ [0, 1] we have

P(X ≤ x) = P(ω ≤ x) = ∫_0^x P(dω) = ∫_0^x dω = x,
P(Y ≤ x) = P(1 – ω ≤ x) = ∫_{1–x}^1 P(dω) = ∫_{1–x}^1 dω = x.

However, if we define a new probability measure via P̃(dω) := 2ω dω, then X and Y have different
distributions. For x ∈ [0, 1] we have

P̃(X ≤ x) = P̃(ω ≤ x) = ∫_0^x P̃(dω) = ∫_0^x 2ω dω = x^2,
P̃(Y ≤ x) = P̃(1 – ω ≤ x) = ∫_{1–x}^1 P̃(dω) = ∫_{1–x}^1 2ω dω = 1 – (1 – x)^2.

The distribution of a random variable X is most easily described through its cumulative distribution
function.

Definition 1.4.4. The distribution function FX : R → [0, 1] of a random variable X defined on a


probability space (Ω, F, P) is given by

FX (x ) := P(X ≤ x ).

Observe that, while a random variable X is defined with respect to (Ω, F) (with no reference to P), the
distribution FX is specific to a probability measure P.

Note that we put the random variable X in the subscript of FX to remind us that FX is the distribution
function corresponding to the random variable X (and not, e.g., Y). It is a good idea to do this.

We give here some obvious properties of FX .

1. We have the following limits: lim_{x→–∞} FX(x) = 0 and lim_{x→∞} FX(x) = 1.

2. FX is non-decreasing: x < y implies FX(x) ≤ FX(y).
3. FX is right-continuous and has left limits: FX(x+) := lim_{h↓0} FX(x + h) = FX(x).

Many (but not all) random variables fall into one of two categories: discrete and continuous. We describe
these two categories below.

Definition 1.4.5. A random variable X is called discrete if it takes values in some countable set
A := {x1, x2, . . .} ⊂ R. We associate with a discrete random variable a probability mass function
fX : A → R, defined by fX(xi) := P(X = xi).

Definition 1.4.6. A random variable X is called continuous if its distribution function FX can be
written as

FX(x) = ∫_{–∞}^x du fX(u),  x ∈ R,

for some fX : R → [0, ∞) called the probability density function.



It may help to think of the density function fX as fX(x)dx = P(X ∈ dx).

Note that for a continuous random variable X we have fX = FX′.

If X is either discrete or continuous, it is easy to compute P(X ∈ A) for any A ∈ B(R). We have

discrete : P(X ∈ A) = Σ_{i : xi ∈ A} fX(xi),
continuous : P(X ∈ A) = ∫_A dx fX(x).

Remark 1.4.7. Although we have defined FX : R → [0, 1] by FX(x) := P(X ≤ x), it is common also
to use FX as a set function FX : B(R) → [0, 1], which means FX(B) := P(X ∈ B). It should always be
clear from the argument of FX which of the two meanings we intend.

Examples of discrete random variables


The following discrete random variables frequently arise in applications in the natural and social sciences.

Example 1.4.8. If X is distributed as a Bernoulli random variable with parameter p ∈ [0, 1], written
X ∼ Ber(p), then

X ∈ {0, 1},  fX(0) = 1 – p,  fX(1) = p.

Example 1.4.9. If X is distributed as a Binomial random variable with parameters n ∈ N and
p ∈ [0, 1], written X ∼ Bin(n, p), then

X ∈ {0, 1, 2, . . . , n},  fX(k) = (n choose k) p^k (1 – p)^(n–k).

Note that if X1, . . . , Xn ∼ Ber(p) are independent of each other, then Y := Σ_{i=1}^n Xi ∼ Bin(n, p).

Example 1.4.10. If X is distributed as a Geometric random variable with parameter p ∈ [0, 1], written
X ∼ Geo(p), then

X ∈ N,  fX(k) = p(1 – p)^(k–1).

Note that if Xi ∼ Ber(p), i = 1, 2, . . ., are independent of each other, then Y := inf{i : Xi = 1} ∼ Geo(p).
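The construction Y := inf{i : Xi = 1} can be checked by simulation. Below is a sketch (ours, not from the notes; the truncation horizon is an assumption that discards the negligible all-failure paths): we generate iid Ber(p) trials and compare the empirical law of the first success time with the Geo(p) mass function.

import numpy as np

rng = np.random.default_rng(2)
p, trials, horizon = 0.3, 100_000, 60
X = rng.random((trials, horizon)) < p    # X_i ~ Ber(p), i = 1, ..., horizon
hit = X.any(axis=1)                      # keep paths with at least one success
Y = X[hit].argmax(axis=1) + 1            # Y = inf{i : X_i = 1} (1-based index)
for k in range(1, 6):
    print(k, round(float(np.mean(Y == k)), 4), round(p * (1 - p)**(k - 1), 4))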

Example 1.4.11. If X is distributed as a Poisson random variable with parameter λ ∈ R+, written
X ∼ Poi(λ), then

X ∈ {0} ∪ N,  fX(k) = (λ^k / k!) e^(–λ).

Examples of continuous random variables


Before introducing some common continuous random variables, let us introduce a useful function.

Definition 1.4.12. Let A be a set in some topological space Ω (e.g., Ω = R^d). The indicator function
1A : Ω → {0, 1} is defined as follows:

1A(x) := 1 if x ∈ A,  0 if x ∉ A.
Notation: We will sometimes write 1A (x ) = 1{x ∈A} .

We now introduce some continuous random variables that frequently arise in applications.

Example 1.4.13. If X is distributed as a Uniform random variable on the interval [a, b] ⊂ R, written
X ∼ U([a, b]), then

X ∈ [a, b],  fX(x) = 1[a,b](x) · 1/(b – a).
Example 1.4.14. If X is distributed as an Exponential random variable with rate λ > 0 (its mean is
1/λ), written X ∼ E(λ), then

X ∈ [0, ∞),  fX(x) = 1[0,∞)(x) λe^(–λx).

Note that fX(x) = 0 if x < 0 due to the presence of the indicator function.

Example 1.4.15. If X is distributed as a Gaussian or Normal random variable with mean µ ∈ R and
variance σ^2 > 0 (we will give a meaning for “mean” and “variance” below), written X ∼ N(µ, σ^2), then

X ∈ R,  fX(x) = (1/√(2πσ^2)) exp(–(x – µ)^2 / (2σ^2)).

A random variable Z ∼ N(0, 1) is referred to as standard normal.

1.5 Stochastic Processes


Intuitively, we think of a stochastic process as a process that evolves randomly in time. Now that we
understand what a random variable is, we can define rigorously what we mean when we say stochastic
process.

Definition 1.5.1. A stochastic process is a collection of random variables X = (Xt)t∈T where T is
some index set. If the index set T is countable (e.g., T = N0) we say that X is a discrete time process.
If the index set T is uncountable (e.g., T = R+) we say that X is a continuous time process. The state
space S of a stochastic process X is the union of the state spaces of (Xt)t∈T.

We can think of a stochastic process X : T × Ω → R in (at least) two ways. First, for any t ∈ T we have
that Xt : Ω → R is a random variable. Second, for any ω ∈ Ω, we have that X·(ω) : T → R is a function of
time. Both interpretations can be useful.

1.6 Expectation
When we think of averaging we think of weighting outcomes by their probabilities. The mathematical
way to encode this is via the expectation.

Definition 1.6.1. Let X be a random variable defined on (Ω, F, P). The expectation of X, written EX,
is defined as

EX := ∫_Ω X(ω) P(dω),

where the integral is understood in the Lebesgue sense.

1.6.1 Integration in the Lebesgue sense


For those who have not previously encountered Lebesgue integration, we now give a brief (very brief!)
overview of this concept.

Definition 1.6.2. Fix a probability space (Ω, F, P). Let A ∈ F. The indicator random variable,
denoted 1A, is defined by

1A(ω) := 1 if ω ∈ A,  0 if ω ∉ A.

Observe that 1A ∼ Ber(p) with p = P(A). For disjoint sets A and B we have

1A∪B = 1A + 1B , A ∩ B = ∅.

And, for any two sets A and B we have

1A∩B = 1A 1B .

Definition 1.6.3. A collection of non-empty sets (Ai) is said to be a partition of Ω if Ai ∩ Aj = ∅ for
all i ≠ j and ∪i Ai = Ω.

Definition 1.6.4. Let (Ai) be a finite partition of Ω. A non-negative random variable X, defined on a
probability space (Ω, F, P), which is of the form

X(ω) = Σ_{i=1}^n xi 1Ai(ω),  xi ≥ 0,  Ai ∈ F,

is called simple.

Let X be a simple random variable. We define the expectation of X as follows:

EX := Σ_{i=1}^n xi P(Ai).  (if X is simple)

Note that, from this definition, we have

E1A = P(A).

Thus, we can always represent probabilities of sets as expectations of indicator random variables.

Now, consider a non-negative random variable X, which is not necessarily simple. Let (Xn)n≥0 be an
increasing sequence of simple random variables that converges almost surely to X. That is,

Xi ≤ Xi+1,  lim_{i→∞} Xi = X,  P-a.s.

We define the expectation of a non-negative random variable X as the following limit:

EX := lim_{i→∞} EXi,  (if X is non-negative)  (1.4)

where each of the expectations on the right-hand side is well-defined because all of the Xi are simple by
construction. Finally, consider a general random variable X that could take either positive or negative
values. Define

X+ = max{X, 0},  X– = max{–X, 0}.

Note that X+ and X– are non-negative and X = X+ – X– . With this in mind, we define

EX := EX+ – EX– ,

where the expectations of X+ and X– are defined via (1.4).

Definition 1.6.1 of EX makes sense if E|X| < ∞ or if EX± = ∞ and EX∓ < ∞. In the latter case, we
have EX = ±∞. If both EX+ = ∞ and EX– = ∞, then we find ourselves in an ∞ – ∞ situation and, in
this case, EX is undefined.

1.6.2 Computing expectations

If X is either discrete or continuous, Definition 1.6.1 reduces to the formulas one learns as an undergraduate:

discrete : EX = Σ_i xi fX(xi),
continuous : EX = ∫_R dx x fX(x).

In the discrete case, the sum runs over all possible values of x. More generally, we can express the
expected value of X as

EX = ∫_R x FX(dx) := lim_{‖Π‖→0} Σ_i ((xi + xi+1)/2) (FX(xi+1) – FX(xi)),  (1.5)

where Π is a partition of R, meaning

Π = {x1, x2, . . . , xn},  xi < xi+1 ∀ i,  ‖Π‖ := sup_i (xi+1 – xi).

The expression on the right-hand side of (1.5) is known as a Stieltjes integral. The advantage of using
the Stieltjes integral ∫ x FX(dx) to compute an expectation is that every random variable X – whether it
be discrete or continuous – has a distribution FX. Thus, by using the Stieltjes integral, we avoid having
to treat discrete and continuous cases separately.
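To make (1.5) concrete, here is a numerical sketch (ours, not from the notes) that approximates EX by the Stieltjes sum for X ∼ E(1), whose distribution function FX(x) = 1 – e^(–x) we can write down explicitly; truncating R+ at 30 is an assumption (the tail contributes a negligible amount).

import numpy as np

F = lambda x: 1.0 - np.exp(-x)             # F_X for X ~ E(1)
x = np.linspace(0.0, 30.0, 200_001)        # a fine partition Pi of [0, 30]
mid = 0.5 * (x[:-1] + x[1:])               # midpoints (x_i + x_{i+1}) / 2
EX = np.sum(mid * (F(x[1:]) - F(x[:-1])))  # the Stieltjes sum in (1.5)
print(EX)                                  # ~ 1.0, the mean of E(1)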

Note that E is a linear operator. If X and Y are random variables and a and b are constants, then

E(aX + bY) = aEX + bEY.

How does one compute Eg(X) where g : R → R? Although we have not stated it explicitly, it should be
obvious that if X is a random variable, then Y := g(X) is also a random variable.¹ Thus, we have

EY = Eg(X) = ∫_Ω g(X(ω)) P(dω),

which in the discrete and continuous cases becomes

discrete : Eg(X) = Σ_i g(xi) fX(xi),
continuous : Eg(X) = ∫_R dx g(x) fX(x).
¹ Rigorously, g should be a measurable function, meaning g^(–1)(A) ∈ B(R) for all A ∈ B(R). Do not concern yourself too
much with this.
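As a sanity check on the continuous formula for Eg(X), here is a short sketch (ours, not from the notes) computing E[X^2] for X ∼ N(0, 1) two ways: as a Riemann sum of g(x)fX(x) over a truncated domain (an assumption; the tails beyond ±10 are negligible), and as a Monte Carlo average of g(X).

import numpy as np

g = lambda x: x**2
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard normal density
x = np.linspace(-10.0, 10.0, 100_001)
dx = x[1] - x[0]
integral = np.sum(g(x) * f(x)) * dx                   # Riemann sum for the integral of g f_X
mc = g(np.random.default_rng(3).standard_normal(1_000_000)).mean()
print(integral, mc)                                   # both ~ 1 = E[X^2]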

1.7 Change of measure


Consider two probability measures P and P̃ defined on a measurable space (Ω, F). What is the relation
between P and P̃? The following theorem answers this question.

Theorem 1.7.1. Fix a probability space (Ω, F, P) and let Z ≥ 0 be a random variable satisfying
EZ = 1. Define P̃ : F → [0, 1] by

P̃(A) := E[Z 1A].  (1.6)

Then P̃ is a probability measure on (Ω, F). Denote by Ẽ the expectation taken with respect to P̃.
Then

ẼX = E[ZX],  and if Z > 0, then  EX = Ẽ[(1/Z) X],  (1.7)

where X is a random variable defined on (Ω, F).

Definition 1.7.2. We call the random variable Z in Theorem 1.7.1 the Radon-Nikodym derivative of
P̃ with respect to P.

Proof of Theorem 1.7.1. First, we show that P̃ satisfies the properties of a probability measure given
in Definition 1.2.1. Let us check that P̃(Ω) = 1. We have

P̃(Ω) = E[Z 1Ω] = EZ = 1.

Next, we check that P̃ satisfies countable additivity. Let (Ai)i≥0 be a sequence of disjoint sets. We have

P̃(∪i Ai) = E[1_{∪i Ai} Z] = Σi E[1Ai Z] = Σi P̃(Ai).

Note that interchanging the sum with the expectation (E Σi = Σi E) is allowed by Tonelli's Theorem.
Finally, to show equation (1.7) holds, it is enough to check that it holds for simple random variables
X = Σi xi 1Ai. We have

E[ZX] = Σi xi E[Z 1Ai] = Σi xi P̃(Ai) = ẼX,

where the last equality is the definition of expectation under P̃ for simple random variables. Finally, if Z > 0, we have

Ẽ[(1/Z) X] = E[Z (1/Z) X] = EX.

Definition 1.7.3. A probability measure P̃ defined on (Ω, F) is absolutely continuous with respect to
another probability measure P, written P̃ ≪ P, if

P(A) = 0 ⇒ P̃(A) = 0.

In order for there to exist a Radon-Nikodym derivative Z = dP̃/dP, we must have P̃ ≪ P. The reason for
this is that the relationship between P and P̃ is multiplicative: P̃(dω) = Z(ω)P(dω). If P(A) = 0, then
there is no random variable Z that would result in P̃(A) > 0.

Definition 1.7.4. Two probability measures P and P̃ on (Ω, F) are equivalent, written P ∼ P̃, if

P(A) = 0 ⇔ P̃(A) = 0.

Two probability measures are equivalent, P ∼ P̃, if and only if the Radon-Nikodym derivative that relates
them is strictly positive, Z > 0. Equivalent measures agree on which events will happen with probability
zero (and thus, they agree on which events will happen with probability one).

Example 1.7.5. Let us return to Example 1.4.3. We set (Ω, F) = ((0, 1), B((0, 1))). On this measurable
space, we define two probability measures P(dω) = dω and P̃(dω) = 2ω dω. Note that we have

P̃(A) = Ẽ1A = ∫_Ω 1A(ω) P̃(dω) = ∫_Ω 1A(ω) 2ω dω = ∫_Ω 1A(ω) 2ω P(dω) = E[1A Z],  Z(ω) := 2ω.

One can easily check that EZ = 1 and Z > 0. Defining P̃ by (1.6), one can easily check that (1.7) holds
true.

It is quite common to use the notation

Z(ω) = (dP̃/dP)(ω),  P̃(dω) = (dP̃/dP)(ω) P(dω),

as a reminder of how the Radon-Nikodym derivative Z relates P̃ to P. For a finite probability space, it is
true that Z(ω) = P̃(ω)/P(ω). However, for an infinite probability space, it makes no sense in general to
define Z(ω) = P̃(ω)/P(ω) since it may be that P(ω) = 0. Nevertheless, the heuristic Z(ω) = P̃(ω)/P(ω)
gives the correct intuition. In particular, for the special case of an infinite probability space in which
P(dω) = p(ω)dω and P̃(dω) = p̃(ω)dω and P̃ ≪ P, we have Z(ω) = p̃(ω)/p(ω).

Example 1.7.6 (Change of measure for a Normal random variable). On (Ω, F, P) let X ∼ N(0, 1) and
define Y = X + θ. Clearly, we have Y ∼ N(θ, 1). Now, define a random variable Z by

Z = e^(–θX – θ^2/2).

Clearly Z > 0. We also have EZ = 1. To see this, simply compute

EZ = ∫_R dx e^(–θx – θ^2/2) fX(x)
   = ∫_R dx e^(–θx – θ^2/2) (1/√(2π)) e^(–x^2/2)
   = ∫_R dx (1/√(2π)) e^(–(x+θ)^2/2) = 1.

Since Z > 0 and EZ = 1, we can define a new probability measure P̃ with Z = dP̃/dP as the Radon-
Nikodym derivative. Let us compute the distribution of Y under P̃. We have

P̃(Y ≤ b) = E[Z 1{Y≤b}] = E[e^(–θX – θ^2/2) 1{X≤b–θ}]
   = ∫_{–∞}^{b–θ} dx e^(–θx – θ^2/2) fX(x)
   = ∫_{–∞}^{b–θ} dx (1/√(2π)) e^(–(x+θ)^2/2)
   = ∫_{–∞}^{b} dz (1/√(2π)) e^(–z^2/2).

Thus, under P̃ we see that Y ∼ N(0, 1). The Radon-Nikodym derivative Z changes the mean of Y from
θ to 0, but it does not affect the variance of Y.
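Example 1.7.6 is easy to verify by Monte Carlo. The sketch below (ours, not part of the notes; theta and b are arbitrary) draws X ∼ N(0, 1) under P, forms Y = X + θ and the weight Z, and checks that the Z-weighted probability P̃(Y ≤ b) = E[Z 1{Y≤b}] matches the standard normal distribution function at b.

import numpy as np

rng = np.random.default_rng(4)
theta, b, n = 1.5, 0.3, 1_000_000
X = rng.standard_normal(n)
Y = X + theta                              # Y ~ N(theta, 1) under P
Z = np.exp(-theta * X - 0.5 * theta**2)    # the Radon-Nikodym derivative
print(Z.mean())                            # ~ 1 = EZ
print(np.mean(Z * (Y <= b)))               # ~ Phi(0.3) ~ 0.6179, i.e. Y ~ N(0, 1) under P-tilde
print(np.mean(Y <= b))                     # ~ Phi(b - theta), the P-probability, for contrast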

Example 1.7.7 (Change of measure for an Exponential random variable). On (Ω, F, P), let X ∼ E(λ)
and define

Z = (µ/λ) e^(–(µ–λ)X).

Clearly, we have Z ≥ 0. We also have

EZ = E[(µ/λ) e^(–(µ–λ)X)] = ∫_0^∞ dx (µ/λ) e^(–(µ–λ)x) fX(x)
   = ∫_0^∞ dx (µ/λ) e^(–(µ–λ)x) λe^(–λx) = ∫_0^∞ dx µe^(–µx) = 1.

Thus, we can define a new probability measure P̃ with Z = dP̃/dP as the Radon-Nikodym derivative.
Let us compute the distribution of X under P̃. We have

P̃(X ≤ b) = E[Z 1{X≤b}] = E[(µ/λ) e^(–(µ–λ)X) 1{X≤b}]
   = ∫_0^b dx (µ/λ) e^(–(µ–λ)x) fX(x) = ∫_0^b dx (µ/λ) e^(–(µ–λ)x) λe^(–λx) = ∫_0^b dx µe^(–µx).

Thus, under P̃, we have X ∼ E(µ).

1.8 Exercises
Exercise 1.1. Let F be a σ-algebra of Ω. Suppose B ∈ F. Show that G := {A ∩ B : A ∈ F} is a σ-algebra
of B.

Exercise 1.2. Let F and G be σ-algebras of Ω. (a) Show that F ∩ G is a σ-algebra of Ω. (b) Show that
F ∪ G is not necessarily a σ-algebra of Ω.

Exercise 1.3. Describe the probability space (Ω, F, P) for the following three experiments: (a) a biased
coin is tossed three times; (b) two balls are drawn without replacement from an urn which originally
contained two blue and two red balls; (c) a biased coin is tossed repeatedly until a head turns up.

Exercise 1.4. Suppose X is a continuous random variable with distribution FX. Let g be a strictly
increasing continuous function. Define Y = g(X). (a) What is FY, the distribution of Y? (b) What is fY,
the density of Y?

Exercise 1.5. Suppose X is a continuous random variable with distribution FX. Find FY where Y is
given by (a) X^2, (b) √|X|, (c) sin X, (d) FX(X).

Exercise 1.6. Suppose X is a continuous random variable defined on a probability space (Ω, F, P). Let
f be the density of X under P and assume f > 0. Let g be the density function of a random variable.
Define Z := g(X)/f(X). (a) Show that Z ≡ dP̃/dP defines a Radon-Nikodym derivative. (b) What is the
density of X under P̃?

Exercise 1.7. Let X be uniformly distributed on [0, 1]. For what function g is the random variable g(X)
exponentially distributed with parameter 1 (i.e. g(X) ∼ E(1))?
Chapter 2

Information and conditioning

The notes from this chapter are taken primarily from (Shreve, 2004, Chapter 2).

2.1 Information and σ-algebras


Let us return to the coin-toss example of Section 1.3.2. If we are given no information about ω what can
we say about ω? In other words, what are the subsets of Ω for which we can say: “ω is in this set” or “ω
is not in this set”? The answer is ∅ and Ω, which, together, form the trivial σ-algebra F0 = {∅, Ω}.

Now suppose we are given the value of ω1 . What are the subsets of Ω for which we can say: “ω is in this
set” or “ω is not in this set”? The answer is the sets in F0 as well as AH and AT . Together, these sets
form the σ-algebra F1 = {∅, Ω, AH , AT }. We say the sets in F1 are resolved by the first coin toss.

Now suppose we are given the value of ω1 and ω2 . What are the subsets of Ω for which we can say: “ω is
in this set” or “ω is not in this set”? The answer is the sets in F2 , given by
 
F2 = {∅, Ω, AH, AT, AHH, AHT, ATT, ATH, AHH^c, AHT^c, ATT^c, ATH^c,
      AHH ∪ ATH, AHH ∪ ATT, AHT ∪ ATH, AHT ∪ ATT}.
 

The sets in F2 are resolved by the first two coin tosses.

Continuing in this way, for each n ∈ N we can define Fn as the σ-algebra containing the sets that are
resolved by the first n coin tosses. Note that if a set A ∈ Fn then A ∈ Fn+1 . Thus, Fn ⊂ Fn+1 . In
other words, Fn+1 contains more “information” than Fn . This kind of structure is encapsulated in the
following definition.

Definition 2.1.1. Let Ω be a nonempty set. Let T be a fixed positive number, and assume that for
each t ∈ [0, T] there is a σ-algebra Ft . Assume further that if 0 ≤ s ≤ t ≤ T, then Fs ⊆ Ft . Then we


call the sequence of σ-algebras F = (Ft )t ∈[0,T] a continuous time filtration.

A discrete time filtration is a sequence of σ-algebras F = (Fn )n∈N0 that satisfies Fn ⊆ Fn+1 for all n.

Example 2.1.2. Let Ω = C0[0, T], the set of continuous functions defined on [0, T], starting from zero.
We denote by ω = (ωt)t∈[0,T] an element of Ω. Let Ft be the σ-algebra generated by observing ω over
the interval [0, t]. Mathematically, we write this as

Ft := σ(ωs, 0 ≤ s ≤ t).

It should be obvious that the sequence of σ-algebras F = (Ft )t ∈[0,T] forms a filtration. Below, we define
two sets, one of which is in Ft , one of which is not.

A := {ω : sup_{0≤s≤t} ωs ≤ 1} ∈ Ft,  B := {ω : ωT ≤ 1} ∉ Ft.

The set A is an element of Ft because, given the path of ω over the interval [0, t ] one can answer the
question: is the maximum of ω over the interval [0, t ] less than 1? The set B is not an element of Ft
because one needs to know ωT in order to answer the question: is ωT ≤ 1?

In the above example we generated a sequence of σ-algebras by observing directly an element ω ∈ Ω.


Suppose that, instead of observing ω we can observe only a random variable X(ω). We can use this
information to generate a σ-algebra as well.

Definition 2.1.3. Let X be a random variable defined on a nonempty sample space Ω. The σ-algebra
generated by X, denoted σ(X), is the collection of all subsets of Ω of the form {X ∈ A} where A ∈ B(R).

Example 2.1.4. Let us return to Example 1.4.2. What is σ(S2 )? From the definition, we need to ask,
which sets are of the form {S2 ∈ A}? Since S2 can only take three values, u^2, ud and d^2, we check the
following sets:

{S2 = u^2} = AHH,  {S2 = ud} = AHT ∪ ATH,  {S2 = d^2} = ATT.

We add to these sets the sets that are necessary to form a σ-algebra (i.e., ∅, Ω and unions and complements
of the above sets) to obtain

σ(S2 ) = σ({AHH , ATT , AHT ∪ ATH }).

Note that σ(S2) ⊂ F2 since AHT, ATH ∈ F2 but AHT, ATH ∉ σ(S2). The reason is that, if S2 = ud, we
cannot say whether ω1 = T or ω1 = H.

Definition 2.1.5. Let X be a random variable defined on a nonempty sample space Ω. Let G be a
σ-algebra of subsets of Ω. If σ(X) ⊂ G we say that X is G-measurable, and we write X ∈ G.

A random variable X is G-measurable if and only if the information in G is sufficient to determine the
value of X. Obviously, if X ∈ G then g(X) ∈ G (assuming g is a measurable map from (R, B(R)) to
(R, B(R))).

Eventually, we will want to consider stochastic processes X = (Xt)t∈[0,T] and we will want to know at
each time t whether Xt is measurable with respect to the σ-algebra Ft.

Definition 2.1.6. Let Ω be a nonempty sample space equipped with a filtration F = (Ft )t ∈[0,T] . Let
X = (Xt )t ∈[0,T] be a collection of random variables indexed by t ∈ [0, T]. We say this collection of
random variables is F-adapted if Xt ∈ Ft for all t ∈ [0, T].

2.2 Independence
When X ∈ G, this means that the information in G is sufficient to determine the value of X. On the other
extreme, if X is independent (a term we will define soon) of G, this means that the information in G tells
us nothing about the value of X.

Definition 2.2.1. Let (Ω, F, P) be a probability space. We say that two sets A and B in F are
independent, written A ⊥⊥ B, if P(A ∩ B) = P(A) · P(B).

Example 2.2.2. Let us return to the coin-toss example of Section 1.3.2. Consider two sets and their
intersection:

{ω1 = H} = AH,  {ω2 = H} = AHH ∪ ATH,  {ω1 = H} ∩ {ω2 = H} = AHH.

Since the coin tosses are independent, we should have {ω1 = H} ⊥⊥ {ω2 = H}. Let us verify that these
events are independent according to Definition 2.2.1. We have

P(AH) = p,  P(AHH ∪ ATH) = p^2 + qp = p,  P(AHH) = p^2.

Thus, P(AH) · P(AHH ∪ ATH) = P(AHH), as expected.

Example 2.2.3. Can a set be independent of itself? Surprisingly, the answer is “yes.” Suppose
A⊥
⊥ A. Then, by the definition of independent sets, we have P(A ∩ A) = P(A) · P(A). We also have
P(A ∩ A) = P(A). Combining these equations, we obtain P(A) · P(A) = P(A). This equation has two
solutions P(A) = 1 and P(A) = 0. Thus, a set is independent of itself if the probability of that set is
zero or one.

Having defined independent sets, we can now extend to independent σ-algebras and random variables.

Definition 2.2.4. Let (Ω, F, P) be a probability space, and let G and H be sub-σ-algebras of F (i.e.,
G, H ⊆ F). We say these two σ-algebras are independent, written G ⊥⊥ H, if

P(A ∩ B) = P(A) · P(B), ∀ A ∈ G, ∀ B ∈ H.

Let X and Y be random variables on (Ω, F, P). We say these two random variables are independent,
written X ⊥⊥ Y, if σ(X) ⊥⊥ σ(Y). Lastly, we say the random variable X is independent of the σ-algebra
G, written X ⊥⊥ G, if σ(X) ⊥⊥ G.

Recall from Definition 2.1.3 that σ(X) contains all sets of the form {X ∈ A}, where A ∈ B(R). Combining
this with Definition 2.2.4 we see that

X ⊥⊥ Y ⇔ P(X ∈ A, Y ∈ B) = P(X ∈ A) · P(Y ∈ B), ∀ A, B ∈ B(R). (2.1)

It follows from (2.1) that

X ⊥⊥ Y ⇒ EXY = EX · EY.

Note that EXY = EX · EY does not imply X ⊥⊥ Y.

The above notion of independence is called pairwise independence. If X ⊥⊥ Y and Y ⊥⊥ Z, this notion of
independence does not imply X ⊥⊥ Z (for example, what if Z = X?). Thus, at times, we may need a stronger
notion of independence.

Definition 2.2.5. Let (Ω, F, P) be a probability space, and let G1, G2, . . . , Gn be sub-σ-algebras of F.
We say the sequence of σ-algebras is independent if

P(∩_{i=1}^n Ai) = Π_{i=1}^n P(Ai),  ∀ A1 ∈ G1, ∀ A2 ∈ G2, . . . , ∀ An ∈ Gn.

Let X1, X2, . . . , Xn be a sequence of random variables on (Ω, F, P). We say the sequence of random
variables is independent if the σ-algebras σ(X1), σ(X2), . . . , σ(Xn) are independent.

As with a pair of random variables, a sequence of random variables (Xi)i≥1 is independent if and
only if

P(∩_{i=1}^n {Xi ∈ Ai}) = Π_{i=1}^n P(Xi ∈ Ai),  ∀ A1 ∈ B(R), ∀ A2 ∈ B(R), . . . , ∀ An ∈ B(R).

We will often say that a sequence of random variables (Xi)i≥0 is independent and identically distributed
(iid), by which we mean all Xi have the same distribution and (Xi)1≤i≤n are independent for every
n ∈ N.

Example 2.2.6. Let us return to the coin-toss example of Section 1.3.2. Let us define a sequence of
random variables (Xi)i≥1 via

X1(ω) = 1 if ω1 = H, 0 if ω1 = T;   X2(ω) = 1 if ω2 = H, 0 if ω2 = T;   Xi(ω) = X1 if i is odd, X2 if i is even.

Clearly, since the coin tosses are independent, we have X1 ⊥⊥ X2 and Xi ⊥⊥ Xj if i is even and j is odd.
But the sequence (Xi)1≤i≤n is not independent for any n ≥ 3 since Xi = Xi+2n for any i, n ∈ N.

It is not easy to verify if two random variables X and Y are independent using Expression (2.1), since
the equation must be verified for all Borel sets A, B ∈ B(R). In fact, there is an easier way to check
independence.

Definition 2.2.7. The joint distribution function FX,Y : R2 → [0, 1] of two random variables X and Y
defined on a probability space (Ω, F, P) is given by

FX,Y (x , y) := P(X ≤ x , Y ≤ y).

Again, we have two special cases for jointly discrete and jointly continuous random variables.

Definition 2.2.8. Two random variables X and Y are called jointly discrete if the pair (X, Y) takes
values in some countable set A = {x1, x2, . . .} × {y1, y2, . . .} ⊂ R^2. We associate with jointly discrete
random variables a joint probability mass function fX,Y : A → R, defined by fX,Y(xi, yj) := P(X = xi, Y = yj).

Definition 2.2.9. A pair of random variables X and Y is called jointly continuous if its joint distribution
function FX,Y can be written as

FX,Y(x, y) = ∫_{–∞}^x ∫_{–∞}^y du dv fX,Y(u, v),  (x, y) ∈ R^2,

for some fX,Y : R^2 → [0, ∞) called the joint probability density function.

As in the one-dimensional case, it may help to think of the joint density function fX,Y as fX,Y (x , y)dx dy =
P(X ∈ dx , Y ∈ dy).

Note that for jointly continuous random variables X and Y we have fX,Y(x, y) = ∂x∂y FX,Y(x, y).

If the pair (X, Y) is either jointly discrete or jointly continuous, it is easy to compute P((X, Y) ∈ A) for
any A ∈ B(R2 ). We have
discrete : P((X, Y) ∈ A) = Σ_{i,j : (xi, yj) ∈ A} fX,Y(xi, yj),
continuous : P((X, Y) ∈ A) = ∫_A dx dy fX,Y(x, y).
To recover the marginal distribution FX from FX,Y , simply note that

FX (x ) = P(X ≤ x ) = P(X ≤ x , Y ≤ ∞) = FX,Y (x , ∞).

It follows that for the discrete and continuous cases, we have, respectively,

discrete : fX(xi) = Σ_j fX,Y(xi, yj),
continuous : fX(x) = ∫_R dy fX,Y(x, y).
The following theorem gives some easy-to-check conditions for independence.

Theorem 2.2.10. Let X and Y be random variables defined on a probability space (Ω, F, P). The
following conditions are equivalent (that is, if one of them holds, all of them hold):

1. X ⊥⊥ Y.
2. FX,Y(x, y) = FX(x)FY(y) for every (x, y) ∈ R^2.
3. Discrete case: fX,Y(x, y) = fX(x)fY(y) for every (x, y) ∈ R^2.
   Continuous case: fX,Y(x, y) = fX(x)fY(y) for ‘almost’ every (x, y) ∈ R^2.
4. E[e^(iuX + ivY)] = E[e^(iuX)] · E[e^(ivY)] for all (u, v) ∈ R^2.
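For instance, condition 3 is easy to check by direct enumeration in the two-coin setting of Section 1.3.2. The Python sketch below (ours, not from the notes; p is arbitrary) builds the joint mass function of X = 1{ω1 = H} and Y = 1{ω2 = H} and confirms that it factors into the product of the marginals.

from itertools import product

p = 0.3
prob = {"H": p, "T": 1 - p}
fXY = {}
for w1, w2 in product("HT", repeat=2):      # enumerate the four outcomes (w1, w2)
    x, y = int(w1 == "H"), int(w2 == "H")
    fXY[(x, y)] = fXY.get((x, y), 0.0) + prob[w1] * prob[w2]
fX = {x: sum(v for (a, _), v in fXY.items() if a == x) for x in (0, 1)}
fY = {y: sum(v for (_, b), v in fXY.items() if b == y) for y in (0, 1)}
print(all(abs(fXY[(x, y)] - fX[x] * fY[y]) < 1e-12 for (x, y) in fXY))  # True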

Together with expectation, the most important statistical properties of a random variable (or pair) are
the variance and co-variance.

Definition 2.2.11. The variance of a random variable X, written VX, is defined by

VX = E(X – EX)^2 = EX^2 – (EX)^2,

whenever the expectation exists.

Definition 2.2.12. The co-variance of two random variables X and Y, written CoV[X, Y], is defined by

CoV[X, Y] = E(X – EX)(Y – EY) = EXY – EX · EY,

whenever the expectation exists.

Note that CoV[X, X] = VX.

Note V· is not a linear operator, since

V[aX + bY] = a^2 VX + b^2 VY + 2ab CoV[X, Y],

where a and b are constants.

Definition 2.2.13. We say two random variables X and Y are uncorrelated if CoV[X, Y] = 0.

Note that X ⊥⊥ Y implies X and Y are uncorrelated. However, the converse is not true.

2.3 Conditional expectation


Let X be a random variable defined on (Ω, F, P) and let G be a sub-σ-algebra of F. When X ∈ G this
means that the information in G is sufficient to determine the value of X. When X ⊥⊥ G, this means that
the information in G gives us no information at all about X. Usually, however, the information in G gives
us some information about X, but not enough to determine X exactly. And this brings us to the notion
of conditioning.

Presumably, you have run across the following formula for the conditional probability of a set A given B:

P(A|B) = P(A ∩ B)/P(B),  P(B) > 0.

When (X, Y) are jointly discrete or jointly continuous, this readily leads to the conditional probability
mass and density functions:

discrete : fX|Y(xi, yj) := P(X = xi | Y = yj) = P(X = xi, Y = yj)/P(Y = yj) = fX,Y(xi, yj)/fY(yj),
continuous : fX|Y(x, y) dx := P(X ∈ dx | Y = y) = P(X ∈ dx, Y ∈ dy)/P(Y ∈ dy) = (fX,Y(x, y)/fY(y)) dx.

And from this, we can define E[X|Y = y], the conditional expectation of X given Y = y:

discrete : E[X|Y = yj] := Σ_i xi fX|Y(xi, yj),
continuous : E[X|Y = y] := ∫_R dx x fX|Y(x, y).

Note that E[X|Y = y] is simply a function of y – there is nothing random about it.

Unfortunately, there are cases for which the pair (X, Y) is neither jointly discrete nor jointly continuous.
And, for these cases we need a more general notion of conditional expectation. Here we will make two
conceptual leaps:

1. We will condition with respect to a σ-algebra rather than conditioning on an event.


2. The conditional expectation will be a random variable.

We will just hop in with our new definition of conditional expectation and then we will see, through an
example, that this new definition makes sense.

Definition 2.3.1. Let (Ω, F, P) be a probability space, let G be a sub-σ-algebra of F, and let X be a
random variable that is either nonnegative or integrable. The conditional expectation of X given G,
denoted E[X|G], is any random variable that satisfies

1. Measurability: E[X|G] ∈ G.
2. Partial averaging: E[1A E[X|G]] = E[1A X] for all A ∈ G.
Alternatively, E[ZE[X|G]] = E[ZX] for all Z ∈ G.

When G = σ(Y) we shall often use the short-hand notation E[X|Y] := E[X|σ(Y)].

Conditional probabilities are defined from conditional expectations using

P(A|G) = E[1A |G].

Admittedly, Definition 2.3.1 is rather abstract (and, for the purposes of computation, useless). In fact, it
is not at all clear from Definition 2.3.1 that E[X|G] even exists! It does exist, though we will not prove
this here.

Conditional expectation has an interesting L² interpretation. Consider a probability space (Ω, F, P). Let
G be a sub-σ-algebra of F (i.e., G ⊂ F). Define

L²(Ω, F, P) := {X : Ω → R s.t. X^{–1}(B) ∈ F ∀ B ∈ B(R) and E|X|² < ∞},

and likewise for L²(Ω, G, P). Clearly, since G ⊂ F we have L²(Ω, G, P) ⊂ L²(Ω, F, P). Next, define an
inner product

⟨X, Y⟩ := E[XY],    X, Y ∈ L²(Ω, F, P).

From the definition of conditional expectation, we have

0 = ⟨X – E[X|G], Z⟩ = E[(X – E[X|G])Z],    ∀ Z ∈ L²(Ω, G, P).

Thus, E[X|G] is the projection of X ∈ L²(Ω, F, P) onto the subspace L²(Ω, G, P).

When conditioning on the σ-algebra generated by a random variable, it is easiest to use the following
formula

E[X|Y] = ψ(Y), ψ(y) := E[X|Y = y]. (2.2)

The following example should help to build some intuition for conditional expectation.

Example 2.3.2. Let Ω = {a, b, c, d, e, f}, F = 2^Ω and P(ω) = 1/6 for ω = a, b, . . . , f. Define two
random variables X and Y on (Ω, F, P) as follows:

ω     :  a  b  c  d  e  f
X(ω)  :  1  3  3  3  5  7
Y(ω)  :  2  2  1  1  7  7

Let G = σ(Y). Next, let us compute E[X|Y] using (2.2) and check if it agrees with Definition 2.3.1. We
have

ω          :  a  b  c  d  e  f
Y(ω)       :  2  2  1  1  7  7
E[X|Y](ω)  :  2  2  3  3  6  6

Let us check measurability: is E[X|Y] ∈ σ(Y)? In other words, is σ(E[X|Y]) ⊆ σ(Y)? We have

σ(Y) = σ({a, b}, {c, d}, {e, f }),


σ(E[X|Y]) = σ({a, b}, {c, d}, {e, f }).

So, yes, E[X|Y] ∈ σ(Y). Another way to think of measurability is to simply ask: given the value of Y,
can one determine the value of E[X|Y]? Clearly, in this case the answer is “yes.” Next, let us check the
partial averaging property: does E[1_A E[X|Y]] = E[1_A X] for all A ∈ σ(Y)? Rather than check this for
every A ∈ σ(Y), let us just check that this holds for the sets {a, b}, {c, d} and {e, f}. We have

A = {a, b} :   E[1_A E[X|Y]] = P(a)·2 + P(b)·2 = 4/6,    E[1_A X] = P(a)·1 + P(b)·3 = 4/6,
A = {c, d} :   E[1_A E[X|Y]] = P(c)·3 + P(d)·3 = 6/6,    E[1_A X] = P(c)·3 + P(d)·3 = 6/6,
A = {e, f} :   E[1_A E[X|Y]] = P(e)·6 + P(f)·6 = 12/6,   E[1_A X] = P(e)·5 + P(f)·7 = 12/6.
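Since Ω is finite, everything in this example can be verified by brute force. The following Python sketch (not part of the original notes) recomputes E[X|Y] as the P-weighted average of X on each level set of Y and asserts the partial averaging property.

```python
P = {w: 1/6 for w in "abcdef"}
X = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 5, "f": 7}
Y = {"a": 2, "b": 2, "c": 1, "d": 1, "e": 7, "f": 7}

# E[X|Y] is constant on each event {Y = y}: the P-weighted average of X there
E_X_given_Y = {}
for y in set(Y.values()):
    A = [w for w in P if Y[w] == y]
    avg = sum(P[w] * X[w] for w in A) / sum(P[w] for w in A)
    for w in A:
        E_X_given_Y[w] = avg
print(E_X_given_Y)  # {a,b} -> 2, {c,d} -> 3, {e,f} -> 6, as computed above

# Partial averaging on the generating sets of sigma(Y)
for A in [{"a", "b"}, {"c", "d"}, {"e", "f"}]:
    lhs = sum(P[w] * E_X_given_Y[w] for w in A)
    rhs = sum(P[w] * X[w] for w in A)
    assert abs(lhs - rhs) < 1e-12
```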

The following properties are arguably more important to remember than the definition of conditional
expectation. Memorize them!

Theorem 2.3.3. Let (Ω, F, P) be a probability space and let G be a sub-σ-algebra of F. Conditional
expectations satisfy the following properties.

1. Linearity: E[aX + bY|G] = aE[X|G] + bE[Y|G].


2. Taking out what is known: if X ∈ G then E[XY|G] = XE[Y|G].
3. Iterated conditioning: if H is a sub-σ-algebra of G then E[E[X|G]|H] = E[X|H].
4. Independence: if X ⊥⊥ G then E[X|G] = EX.

Theorem 2.3.3 can be proved directly from Definition 2.3.1, though we will not do so here. In addition to
the above properties, the following theorem, which we state without proof, is often useful:

Theorem 2.3.4 (Jensen’s inequality). Let X be a random variable defined on (Ω, F, P) and let G
be a sub-σ-algebra of F. Suppose φ : R → R is a convex function. Then we have

φ(E[X|G]) ≤ E[φ(X)|G],    P – a.s. (2.3)



In order to keep straight which direction the inequality in (2.3) goes, it is helpful to remember that
φ(x) = x² is a convex function and that the conditional variance of a random variable satisfies

V[X|G] = E[X²|G] – E[X|G]² ≥ 0.

Now that we have defined conditional expectation and established some of its key properties, we can
define “Markov process” and “martingale” – two seemingly similar, but very distinct concepts.

Definition 2.3.5. Let (Ω, F, P) be a probability space, let T be a fixed positive number, and let
F = (Ft )t ∈[0,T] be a filtration of sub-σ-algebras of F. Consider an F-adapted stochastic process
M = (Mt )t ∈[0,T] . We say that M is

a martingale if E[Mt |Fs ] = Ms , ∀ 0 ≤ s ≤ t ≤ T,


a sub-martingale if E[Mt |Fs ] ≥ Ms , ∀ 0 ≤ s ≤ t ≤ T,
a super-martingale if E[Mt |Fs ] ≤ Ms , ∀ 0 ≤ s ≤ t ≤ T.

We have given above the definition of a continuous time martingale (resp. sub-, super-). We can also
define discrete-time martingales by making the obvious modifications.

Admittedly, the definition of sub- and super-martingales seems backwards; sub-martingales tend to rise
in expectation, whereas super-martingales tend to fall in expectation.

Note: when we say that a process M is a martingale (or sub- or super-martingale), this is with respect to
a fixed probability measure and filtration. If P and P̃ are two probability measures and F and G are two
filtrations, it is entirely possible that a process M may be a martingale with respect to (P, F) and may
not be a martingale with respect to (P̃, F), (P, G) or (P̃, G).

Definition 2.3.6. Let (Ω, F, P) be a probability space, let T be a fixed positive number, and let
F = (Ft )t ∈[0,T] be a filtration of sub-σ-algebras of F. Consider an F-adapted stochastic process
X = (Xt )t ∈[0,T] . Assume that for all 0 ≤ s ≤ t ≤ T and for every nonnegative, Borel-measurable function
f , there is another Borel-measurable function g (which depends on s, t , and f ) such that

E[f (Xt )|Fs ] = g(Xs ).

Then we say X is a Markov process or simply “X is Markov.”

Identifying g(X_s) ≡ E[f(X_t)|X_s], we can write the Markov property as follows:

E[f(X_t)|F_s] = E[f(X_t)|X_s].



A Markov process is a process for which the following holds: given the present (i.e., Xs ), the future
(i.e, Xt , t ≥ s) is independent of the past (i.e, Fs ). What this means in practice is that

P(Xt ∈ A|Fs ) = P(Xt ∈ A|Xs ), ∀s ≤ t , ∀ A ∈ B(R). (if X is Markov) (2.4)

If Xt is a discrete or continuous random variable for every t then we have a transition kernel, written as
P in the discrete case and Γ in the continuous case.

discrete :     P(s, x; t, y) := P(X_t = y|X_s = x),
continuous :   Γ(s, x; t, y) dy := P(X_t ∈ dy|X_s = x).

If you can write the transition kernel of a process explicitly, then you have essentially proved that the
process is Markov.

Note that any process that has independent increments is Markov since, if (X_t – X_s) ⊥⊥ F_s for t ≥ s, then

P(X_t ∈ A|F_s) = P(X_t – X_s + X_s ∈ A|X_s),

where F_s is the σ-algebra generated by observing X up to time s.

Markov processes and martingales are entirely separate concepts. A process X can be both a martingale
and a Markov process, it can be a martingale but not a Markov process, it can be a Markov process but
not a martingale, and it can be neither a Markov process nor a martingale. We illustrate the difference
with an example.

Example 2.3.7. Let us return to the stock price Example 1.4.2. Let us show that S = (S_n)_{0≤n} is a
Markov process. Recall that F_m is the σ-algebra generated by observing ω_1, ω_2, . . . , ω_m. Observe that
S_m ∈ F_m. Next, note that

P(S_{n+m} = S_m u^k d^{n–k} | S_m) = C(n, k) p^k q^{n–k},

where C(n, k) = n!/(k!(n – k)!) denotes the binomial coefficient. Since we have written the transition
kernel explicitly, we have established that S is Markov. Let us also find the function g in Definition
2.3.6. For any f : R → R we have

E[f(S_{n+m})|F_m] = Σ_{k=0}^n f(S_m u^k d^{n–k}) · C(n, k) p^k q^{n–k} =: g(S_m).

Thus, we have found g. Now, to see if S is a martingale note that

E[Sn+1 |Fn ] = E[Sn+1 |Sn ] = p · uSn + q · dSn = (p · u + q · d)Sn .



Thus, if (p · u + q · d) = 1, then E[S_{n+1}|F_n] = S_n. Let us assume that (p · u + q · d) = 1. Then we have

E[S_{n+m}|F_n] = E[E[S_{n+m}|F_{n+m–1}]|F_n] = E[S_{n+m–1}|F_n]
             = E[E[S_{n+m–1}|F_{n+m–2}]|F_n] = E[S_{n+m–2}|F_n]
             = . . .
             = E[E[S_{n+1}|F_n]|F_n] = E[S_n|F_n] = S_n.

Therefore, the process S is a martingale. Similarly, if (p · u + q · d) ≥ 1, then S is a sub-martingale and if


(p · u + q · d) ≤ 1, then S is a super-martingale.
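To make the martingale condition (p · u + q · d) = 1 concrete, here is a short Monte Carlo sketch in Python (the parameter values u, d, p are illustrative choices, not taken from Example 1.4.2).

```python
import numpy as np

rng = np.random.default_rng(1)
u, d, p = 2.0, 0.5, 1/3          # illustrative; note p*u + (1-p)*d = 1
q, S0, n_steps, n_paths = 1 - p, 1.0, 20, 200_000

moves = np.where(rng.random((n_paths, n_steps)) < p, u, d)
S = S0 * np.cumprod(moves, axis=1)  # S_n = S_0 * product of up/down factors

# For a martingale, E[S_n] = S_0 for every n
print(S[:, -1].mean())  # ~ 1.0 up to Monte Carlo error
```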

2.4 Stopping times


Roughly speaking, a stopping time is a particular kind of random time τ that marks the moment at which a
stochastic process X = (X_t)_{t≥0} exhibits a particular kind of behavior. A stopping time is often defined
by a stopping rule, e.g., when the process X does this, then stop. Mathematically, a stopping time is
defined as follows:

Definition 2.4.1. Fix a probability space (Ω, F, P) and a filtration F = (F_t)_{t≥0}. A random time
τ : Ω → [0, ∞] is called an F-stopping time if it satisfies

{τ ≤ t } ∈ Ft , ∀ t ∈ [0, ∞). (2.5)

Above, we have focused on the continuous-time setting. If we are working with a discrete-time filtration
F = (Fn )n∈N0 , then a stopping time is a random time τ : Ω → N0 ∪ {∞} that satisfies

{τ ≤ n} ∈ Fn , ∀ n ∈ N0 .

Observe that stopping times, like martingales, are defined with respect to a specific filtration F. Also
note that a stopping time may be infinite. The meaning of a stopping time should be fairly clear from
(2.5). If τ is an F-stopping time, then for any t ≥ 0 we should be able to say whether or not τ has
occurred given the information in Ft . Below, we give two examples of random times, one of which is a
stopping time, one of which is not.

Example 2.4.2. Let X = (Xt )t ≥0 be a continuous time stochastic process on (Ω, F, P) and let F = (Ft )t ≥0
be the filtration generated by observing the path of X, that is, Ft = σ(Xs , 0 ≤ s ≤ t ). Define

τ := inf{t ≥ 0 : X_t = a},    τ̄ := sup{t ≥ 0 : X_t = a},    a ∈ R.



The first random time τ is clearly an F-stopping time. At any time t ≥ 0, using the information in F_t, we
can clearly answer the question: has X hit a yet? On the other hand, the second random time τ̄ is not a
stopping time: at time t, the information in F_t does not tell us whether X will return to a at some
later time.

The following theorem will prove to be incredibly useful.

Theorem 2.4.3. Suppose a process M = (Mt )t ≥0 is a martingale with respect to a filtration F =


(F_t)_{t≥0}. Suppose further that τ is an F-stopping time. Then the stopped process M^τ := (M_{t∧τ})_{t≥0}
is a martingale. That is, for any 0 ≤ t ≤ T < ∞, we have

E[MT∧τ |Ft ] = Mt ∧τ .

In particular, M0 = E[Mt ∧τ ] for all t ≥ 0.

We will not prove Theorem 2.4.3 rigorously. However, we will provide a bit of intuition for why the
theorem is true. For any t < τ we have M^τ_t = M_t and M is a martingale. Likewise, for any t ≥ τ we
have M^τ_t = M_τ, which is a constant (and thus trivially a martingale). As the process M^τ is a martingale
both prior to and after τ, it is reasonable to expect that M^τ is in fact a martingale.

Just as we can construct a σ-algebra Ft of information up to a fixed time t , we can construct a σ-algebra
Fτ of information up to a stopping time τ .

Definition 2.4.4. Fix a probability space (Ω, F, P) and a filtration F = (Ft )t ≥0 . Suppose τ is an
F-stopping time. Then we define the σ-algebra Fτ at the stopping time τ as follows

Fτ := {A ∈ F∞ : A ∩ {τ ≤ t } ∈ Ft ∀ t ≥ 0}.

The idea underlying Definition 2.4.4 is that if a set A is observable at time τ, then for any time t, its
restriction to the set {τ ≤ t} should be in F_t. The restriction to sets A ∈ F_∞ takes account of the
possibility that the stopping time may be infinite and ensures that A ∩ {τ ≤ ∞} ∈ F_∞. If A ∉ F_∞, then
we could have A ∩ {τ ≤ ∞} ∉ F_∞. From the above definition, a random variable X is F_τ-measurable if
and only if 1_{τ≤t} X ∈ F_t for all t ∈ [0, ∞].

Now, suppose a process M = (Mt )t ≥0 is a martingale with respect to a filtration F = (Ft )t ≥0 . Suppose
further that M has a well-defined limit M∞ := limt →∞ Mt . As M is a martingale we have (by definition)
that Mt1 = E[Mt2 |Ft1 ] for any 0 ≤ t1 ≤ t2 ≤ ∞. Now, consider two F-stopping times τ1 and τ2 satisfying
0 ≤ τ1 ≤ τ2 ≤ ∞. One may ask: is it true that Mτ1 = E[Mτ2 |Fτ1 ]? Unfortunately, the answer to this
question is “no, not in general.” However, under certain conditions, provided in the following theorem,
we will have Mτ1 = E[Mτ2 |Fτ1 ].

Theorem 2.4.5 (Doob’s Optional Stopping). Let M = (M_t)_{t≥0} be a martingale with respect to a
filtration F = (F_t)_{t≥0}. Suppose that for any ε > 0 there exists a constant K_ε ∈ [0, ∞) such that¹

sup_{t≥0} E[1_{|M_t| > K_ε} |M_t|] < ε. (2.6)

Suppose that τ1 and τ2 are F-stopping times and that 0 ≤ τ1 ≤ τ2 . Then we have

Mτ1 = E[Mτ2 |Fτ1 ]. (2.7)

In particular, we have M0 = EMτi for i = 1, 2.

We will not prove Theorem 2.4.5. Rather, we will demonstrate its usefulness through an example.

Example 2.4.6. Consider a symmetric random walk

S_n = x + Σ_{i=1}^n X_i,    S_0 = x ∈ Z,

where the (X_i) are i.i.d. random variables with P(X_i = 1) = P(X_i = –1) = 1/2. Let F = (F_n)_{n∈N_0}
where F_n := σ(S_i, 0 ≤ i ≤ n). For any a ∈ Z define the first hitting time to a as follows

τ_a := inf{n ∈ N_0 : S_n = a}.

Now, suppose a < x < b. We wish to find P(τ_a < τ_b). To this end, we define

τ := τ_a ∧ τ_b.

Observe that τ is an F-stopping time and S is an F-martingale; moreover, τ < ∞ almost surely. The
stopped process S^τ := (S_{n∧τ})_{n∈N_0} is a martingale by Theorem 2.4.3. Moreover, because
a ≤ S^τ_n ≤ b, we see that S^τ satisfies (2.6). Thus, we have by Theorem 2.4.5 that

S^τ_0 = x = E[S_τ] = aP(τ_a < τ_b) + bP(τ_b < τ_a),

where we have used the fact that S_τ = a when τ_a < τ_b and S_τ = b when τ_b < τ_a. We also have

1 = P(τ_a < τ_b) + P(τ_b < τ_a).

Thus, using the above two equations, we find

P(τ_a < τ_b) = (b – x)/(b – a),    P(τ_b < τ_a) = (x – a)/(b – a).
1 We shall see in Definition 6.8.2 that this condition is called uniform integrability.
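The hitting probabilities derived in Example 2.4.6 are easy to check by simulation. The following Python sketch (with illustrative values of a, x, b) estimates P(τ_a < τ_b) and compares it to (b – x)/(b – a).

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, x, n_paths = -3, 5, 1, 100_000
hits_a = 0
for _ in range(n_paths):
    s = x
    while a < s < b:                       # run the walk until it hits a or b
        s += 1 if rng.random() < 0.5 else -1
    hits_a += (s == a)
print(hits_a / n_paths, "vs", (b - x) / (b - a))  # both ~ 0.5 here
```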

The condition given in (2.6) is essential in order for (2.7) to hold, as we shall see in the next example.

Example 2.4.7. Let B = (B_i)_{i∈N} be an iid sequence of Bernoulli random variables: B_i ∼ Ber(p) with
p = 1/2. Construct a sequence X = (X_i) of random variables as follows

X_0 = 1,    X_{n+1} = 2B_{n+1}X_n.

Define a filtration F = (F_n)_{n∈N_0} where F_n = σ(B_i, 1 ≤ i ≤ n). Observe that both B and X are
F-adapted. It is easy to see that X is a martingale with respect to F because

E[X_{n+1}|F_n] = 2X_n E[B_{n+1}] = X_n.

Note, however, that condition (2.6) does not hold. Now, let us define the first hitting time of X to zero:
τ := inf{n ≥ 0 : X_n = 0}, which is finite almost surely. Then we have

E[X_τ] = E[0] = 0 ≠ X_0 = 1.

Thus, we see that, without condition (2.6), we cannot expect (2.7) to hold.

2.5 Exercises
Exercise 2.1. Let Ω = {a, b, c, d} and let F = 2Ω (the set of all subsets of Ω). We define a probability
measure P as follows

P(a) = 1/6, P(b) = 1/3, P(c) = 1/4, P(d) = 1/4.

Next, define three random variables

X(a) = 1, X(b) = 1, X(c) = –1, X(d) = –1,


Y(a) = 1, Y(b) = –1, Y(c) = 1, Y(d) = –1,

and Z = X + Y. (a) List the sets in σ(X). (b) What are the values of E[Y|X] for {a, b, c, d}? Verify the
partial averaging property: E[1A E[Y|X]] = E[1A Y] for all A ∈ σ(X). (c) What are the values of E[Z|X]
for {a, b, c, d}? Verify the partial averaging property.

Exercise 2.2. Fix a probability space (Ω, F, P). Let Y be a square integrable random variable: EY2 < ∞
and let G be a sub-σ-algebra of F. Show that

V(Y – E[Y|G]) ≤ V(Y – X), ∀X ∈ G.



Exercise 2.3. Give an example of a probability space (Ω, F, P), a random variable X and a function
f such that σ(f(X)) is strictly smaller than σ(X) but σ(f(X)) ≠ {∅, Ω}. Give a function g such that
σ(g(X)) = {∅, Ω}.

Exercise 2.4. On a probability space (Ω, F, P) define random variables X and Y0 , Y1 , Y2 , . . . and suppose
E|X| < ∞. Define Fn := σ(Y0 , Y1 , . . . , Yn ) and Xn = E[X|Fn ]. Show that the sequence X0 , X1 , X2 , . . .
is a martingale under P with respect to the filtration (Fn )n≥0 .

Exercise 2.5. Let X_0, X_1, . . . be i.i.d. Bernoulli random variables with parameter p (i.e., P(X_i = 1) = p).
Define S_n = Σ_{i=1}^n X_i where S_0 = 0. Define

Z_n := ((1 – p)/p)^{2S_n – n},    n = 0, 1, 2, . . . .

Let F_n := σ(X_0, X_1, . . . , X_n). Show that Z_n is a martingale with respect to this filtration.


Chapter 3

Generating and Characteristic Functions

The notes from this chapter are taken primarily from (Grimmett and Stirzaker, 2001, Chapter 5).

3.1 Generating functions


The generating function is a powerful tool for analyzing sums of random variables.

Definition 3.1.1. Suppose X is a discrete random variable taking values in {0} ∪ N. We define the
probability generating function of X, written G_X(s), by

G_X(s) := E[s^X] = Σ_k s^k f_X(k),

where f_X(k) is the probability mass function of X.

The generating function exists at least when |s| ≤ 1 (since Σ_k f_X(k) = 1) and may exist in a larger
interval. As we will see, we usually only care about the value of G_X and its derivatives at the point s = 1.
You can always integrate and differentiate a convergent power series term-by-term within its radius of
convergence.

Why is G_X called the “probability generating function”? The reason is that the coefficient of the s^k
term in the series expansion of G_X(s) is precisely f_X(k) = P(X = k). So, one can expand G_X as a power
series and obtain the probability mass function f_X.

Example 3.1.2. Suppose X ∼ Ber(p) as in Example 1.4.8. Then

G_X(s) = q s^0 + p s^1 = q + ps.


Example 3.1.3. Suppose X ∼ Geo(p) as in Example 1.4.10. Then

G_X(s) = Σ_{k=1}^∞ s^k p q^{k–1} = sp Σ_{k=1}^∞ (sq)^{k–1} = sp/(1 – sq),    |s| < 1/q.
Theorem 3.1.4. Let G_X^{(n)} denote the nth derivative of G_X. Then

G_X^{(n)}(1) = E[X!/(X – n)!]. (3.1)

Proof. We have

G_X^{(n)}(s) = (d^n/ds^n) G_X(s) = E[(d^n/ds^n) s^X] = E[(X!/(X – n)!) s^{X–n}]. (3.2)

Taking s = 1 in (3.2) yields (3.1).

Example 3.1.5. How would you find VX using G_X? Note that

G_X'(1) = EX,    G_X''(1) = E[X(X – 1)] = EX² – EX. (3.3)

Using the above, we compute

VX = EX² – (EX)² = G_X''(1) – (G_X'(1))² + G_X'(1).

Given two independent random variables X and Y taking values in {0} ∪ N, one can compute P(X + Y = n)
using

P(X + Y = n) = Σ_{k=0}^n P(X = k)P(Y = n – k).

Alternatively, one can compute the probability generating function of X + Y and then expand the
generating function to obtain the probability P(X + Y = n).

Theorem 3.1.6. Suppose X ⊥⊥ Y. Then G_{X+Y}(s) = G_X(s)G_Y(s).

Proof. We have

G_{X+Y}(s) = E[s^{X+Y}] = E[s^X] · E[s^Y] = G_X(s)G_Y(s).

The real use of generating functions arises when one wants to compute P(Σ_{i=1}^n X_i = k) where (X_i)_{i≥1}
is an iid sequence of random variables. In this case, computing the generating function G_{S_n}(s) with
S_n = Σ_{i=1}^n X_i is relatively easy. Upon computing the generating function G_{S_n}, one can compute
probabilities of the form P(S_n = k) by expanding G_{S_n}. By contrast, computing P(S_n = k) directly from
the probability mass function of X_i is difficult.
Theorem 3.1.7. Suppose (X_i)_{i≥1} are iid with common distribution X. Define S_n := Σ_{i=1}^n X_i. Then

G_{S_n}(s) = [G_X(s)]^n.

Proof. We have

G_{S_n}(s) = E[s^{X_1+X_2+...+X_n}] = (E[s^X])^n = [G_X(s)]^n.

Example 3.1.8. Let (X_i)_{i≥1} be iid with each X_i ∼ Ber(p). Define S_n := Σ_{i=1}^n X_i. Recall from Example
1.4.9 that S_n ∼ Bin(n, p). We have

G_{S_n}(s) = [G_X(s)]^n = (q + ps)^n.

Thus, we have obtained the probability generating function of the binomial random variable S_n. Although
we could have computed G_{S_n} using the probability mass function f_{S_n} directly, this approach would
have been much more work.
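The claim that expanding G_{S_n}(s) = (q + ps)^n recovers the binomial probability mass function can be checked mechanically: multiplying polynomials is the same as convolving their coefficient sequences. A minimal Python sketch (with illustrative p and n):

```python
import numpy as np
from math import comb

p, n = 0.3, 6
q = 1 - p
coeffs = np.array([1.0])                   # the constant polynomial 1
for _ in range(n):
    coeffs = np.convolve(coeffs, [q, p])   # multiply by (q + p*s)

# coefficient of s^k in (q + p s)^n equals P(S_n = k) = C(n, k) p^k q^(n-k)
for k, c in enumerate(coeffs):
    print(k, c, comb(n, k) * p**k * q**(n - k))
```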
Pn
Theorem 3.1.9. Suppose (Xi )i ≥1 are iid with common distribution X. Define Sn := i =1 Xi . Let
N be independent of (Xi )i ≥1 . Then

GSN (s) = GN (GX (s)).

Proof. We have

GSN (s) = Es X1 +X2 +...+XN = EE[s X1 +X2 +...+XN |N]


= E(Es X )N = E(GX (s))N = GN (GX (s)).

One can extend the notion of a probability generating function to multiple random variables in the
obvious way.

Definition 3.1.10. Suppose X and Y are discrete random variables taking values in {0} ∪ N. We define
the joint probability generating function of (X, Y), written G_{X,Y}(s, t), as

G_{X,Y}(s, t) = E[s^X t^Y] = Σ_k Σ_m s^k t^m f_{X,Y}(k, m),

where f_{X,Y}(k, m) is the joint probability mass function of (X, Y).



The coefficient f_{X,Y}(k, m) of the s^k t^m term in the power series expansion of G_{X,Y}(s, t) about (0, 0)
gives P(X = k, Y = m).

Theorem 3.1.11. Let G_{X,Y}^{(n,m)} denote the (n, m)th partial derivative of G_{X,Y} with respect to the first
and second arguments. Then

G_{X,Y}^{(n,m)}(1, 1) = E[(X!/(X – n)!)(Y!/(Y – m)!)].

Proof. The proof follows that of Theorem 3.1.4.

3.2 Branching processes


Suppose a population evolves in generations. Let Z_n be the size of the nth generation and assume
Z_0 = 1. We assume that the ith member of the nth generation gives birth to a random number X_{n,i}
of individuals. Then, clearly, the number of individuals in the (n + 1)th generation is given by

Z_{n+1} = X_{n,1} + X_{n,2} + . . . + X_{n,Z_n}. (3.4)

We would like to know what the probability mass function, mean and variance of Zn are. Our route for
obtaining these will be through the generating function GZn .

Theorem 3.2.1. Assume Z_{n+1} is given by (3.4) and that the collection of random variables (X_{n,i})
are iid. Define G_n(s) := E[s^{Z_n}] and G(s) := E[s^X]. Then

G_{n+m}(s) = G_n(G_m(s)) = G_m(G_n(s)),    and thus    G_n(s) = G(G(. . . G(s) . . .))    (n-fold iteration).

Proof. Let Y_{m,i} denote the number of offspring in the (m + n)th generation that descend from the
ith member of the mth generation. Clearly, the (Y_{m,i}) are iid and Y_{m,i} ∼ Z_n. The number of members
of the (m + n)th generation is given by

Z_{m+n} = Y_{m,1} + Y_{m,2} + . . . + Y_{m,Z_m}.

We have a random sum of iid random variables. By Theorem 3.1.9 it follows that G_{m+n}(s) = G_m(G_n(s)).
Obviously, we can interchange m ↔ n and obtain G_{m+n}(s) = G_n(G_m(s)). Finally, we have

G_n(s) = G(G_{n–1}(s)) = G(G(G_{n–2}(s))) = . . . = G(G(. . . G(s) . . .))    (n-fold iteration),

as claimed.

In principle, one can obtain the probability mass function f_n of the nth generation Z_n by expanding the
generating function G_n(s) as a power series about s = 0. In practice, this may be difficult to carry out.
However, moments of Z_n can usually be computed with relative ease.

Theorem 3.2.2. Suppose EZ_1 = µ and VZ_1 = σ². Then

EZ_n = µ^n,    VZ_n = nσ² if µ = 1,    VZ_n = σ²(µ^n – 1)µ^{n–1}/(µ – 1) if µ ≠ 1. (3.5)

Proof. We have

EZ_n = (d/ds) G_n(s)|_{s=1} = G_{n–1}'(G(1)) · G'(1) = G_{n–1}'(1) · µ = EZ_{n–1} · µ,

where we have used G(1) = E[1^{Z_1}] = 1. Iterating, we obtain EZ_n = µ^n. Next, from (3.3), we compute

EZ_n² – EZ_n = (d²/ds²) G_n(s)|_{s=1}
= (d/ds)[G_{n–1}'(G(s)) · G'(s)]|_{s=1}
= [G_{n–1}''(G(s)) · (G'(s))² + G_{n–1}'(G(s)) · G''(s)]|_{s=1}
= G_{n–1}''(1) · (G'(1))² + G_{n–1}'(1) · G''(1)
= (EZ_{n–1}² – EZ_{n–1}) · (EZ_1)² + EZ_{n–1} · (EZ_1² – EZ_1). (3.6)

Thus, using (3.6), as well as EZ_k = µ^k and EZ_k² = VZ_k + (µ^k)², we obtain

VZ_n = EZ_n² – (EZ_n)²
= (EZ_{n–1}² – EZ_{n–1}) · (EZ_1)² + EZ_{n–1} · (EZ_1² – EZ_1) – (EZ_n)² + EZ_n
= (VZ_{n–1} + (EZ_{n–1})² – EZ_{n–1}) · (EZ_1)² + EZ_{n–1} · (VZ_1 + (EZ_1)² – EZ_1) – (EZ_n)² + EZ_n
= (VZ_{n–1} + µ^{2(n–1)} – µ^{n–1}) · µ² + µ^{n–1} · (σ² + µ² – µ) – µ^{2n} + µ^n.

Thus, we have obtained an expression for VZ_n in terms of VZ_{n–1}, µ and σ². Solving this recursion
explicitly yields (3.5).
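Formula (3.5) is easy to test by simulation. The sketch below (Python, with an illustrative Poisson(µ) offspring distribution, for which σ² = µ) compares the sample mean and variance of Z_n against µ^n and (3.5).

```python
import numpy as np

rng = np.random.default_rng(3)
mu, n, n_paths = 1.5, 6, 100_000
Z = np.ones(n_paths, dtype=np.int64)
for _ in range(n):
    # sum of Z iid Poisson(mu) offspring counts is Poisson(mu * Z)
    Z = rng.poisson(mu * Z)

sigma2 = mu  # for Poisson offspring, VZ_1 = mu
print("E[Z_n]:", Z.mean(), "vs", mu**n)
print("V[Z_n]:", Z.var(), "vs", sigma2 * (mu**n - 1) * mu**(n - 1) / (mu - 1))
```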

3.3 Characteristic functions


Suppose a random variable X has a state space that is not contained in the non-negative integers.
Then we cannot use the probability generating function G_X(s) to analyze the properties of X. A
natural candidate to replace the probability generating function is the moment generating function
M_X(t) := E[e^{tX}]. Moment generating functions, however, are limited in their use because, for many
random variables, E[e^{tX}] = ∞ for all t ≠ 0. However, there is a simple fix to this. We consider instead
the characteristic function.

Definition 3.3.1. The characteristic function of a random variable X is the function φ_X : R → C
defined by

φ_X(t) := E[e^{itX}],    i = √(–1).

The characteristic function (unlike the moment generating function) always exists since E|e^{itX}| = 1.
Clearly, if G_X and φ_X both exist, then we have

φ_X(t) = G_X(e^{it})    since    E[e^{itX}] = E[s^X],    s = e^{it}.

We have the following obvious properties.

Theorem 3.3.2. Characteristic functions have the following properties.

1. Let a, b be constants. Then φ_{aX+b}(t) = e^{ibt} φ_X(at).

2. If X and Y are independent then φ_{X+Y}(t) = φ_X(t)φ_Y(t).

3. If (X_i)_{1≤i≤n} are iid and S_n = Σ_{i=1}^n X_i, then φ_{S_n}(t) = (φ_X(t))^n.

Example 3.3.3. Let X ∼ E(λ) as in Example 1.4.14. Then

φ_X(t) = ∫_0^∞ dx e^{itx} λe^{–λx} = λ/(λ – it).

Example 3.3.4. Let X ∼ N(µ, σ²) as in Example 1.4.15. Then

φ_X(t) = ∫_{–∞}^∞ dx e^{itx} (1/√(2πσ²)) exp(–(x – µ)²/(2σ²)) = exp(iµt – σ²t²/2).

Characteristic functions have several uses. First, they can be used to capture the moments of a random
variable when they exist.

Theorem 3.3.5. Let φ_X(t) be the characteristic function of a random variable X. Then

φ_X^{(n)}(0) = i^n EX^n,    if E|X|^n < ∞.

Proof. We have

(d^n/dt^n) φ_X(t) = E[(d^n/dt^n) e^{itX}] = i^n E[X^n e^{itX}].

Now set t = 0 to complete the proof.

The characteristic function uniquely determines the distribution of a random variable. In other words,
there is a one-to-one correspondence between FX and φX . We show this for a continuous random variable.

Theorem 3.3.6 (Inversion). Suppose X is a continuous random variable with density f_X. Then

f_X(x) = (1/2π) ∫_R dt e^{–itx} φ_X(t),

for all x where f_X is differentiable. To obtain F_X from f_X simply use F_X(x) = ∫_{–∞}^x dy f_X(y).

Proof. The proof of Theorem 3.3.6 follows from standard Fourier results. We have

f_X(x) = (1/2π) ∫_R dt e^{–itx} ∫_R dy e^{ity} f_X(y) = (1/2π) ∫_R dt e^{–itx} E[e^{itX}] = (1/2π) ∫_R dt e^{–itx} φ_X(t).

An inversion theorem for random variables that are not continuous also exists, though it is not
particularly useful for the purposes of computation.

Perhaps the most important property of characteristic functions is they can be used to prove the
convergence of a sequence of random variables to a limiting distribution.

Definition 3.3.7. We say that a sequence of distribution functions (Fn )n≥1 converges to a distribution
F, written Fn → F, if limn→∞ Fn (x ) = F(x ) at all points x where F is continuous.

Theorem 3.3.8 (Continuity Theorem). Let (Fn )n≥1 be a sequence of distributions functions with
corresponding characteristic functions (φn )n≥1 .

1. If Fn → F where F is a distribution function with corresponding characteristic function φ,


then φn → φ (pointwise).
2. Conversely, if φ(t ) := limn→∞ φn (t ) exists and is continuous at t = 0, then φ is the charac-
teristic function of a distribution F and Fn → F.

Item 2 in Theorem 3.3.8 is particularly powerful. If F_n and φ_n are, respectively, the distribution and
characteristic function of a sum of n independent random variables, it is often easier to compute φ_n
than F_n. If we can compute φ_n and find its limit, then we can obtain F.

Example 3.3.9. In Example 3.1.3 we found that

X ∼ Geo(p)  ⇒  G_X(s) = sp/(1 – sq).

From this, one can easily show that EX = 1/p. Now consider a sequence of random variables (Y_n)_{n≥0}
defined by

Y_n = (1/n) X_n,    X_n ∼ Geo(λ/n),

where λ > 0 is a fixed constant. For n large enough, λ/n < 1. Note that EY_n = (1/n) EX_n = (1/n)(n/λ) = 1/λ
for all n. We would like to know what the limiting distribution of the sequence (Y_n)_{n≥0} is; we will use the
Continuity Theorem 3.3.8 to do this. We have

φ_{Y_n}(t) = φ_{X_n}(t/n) = e^{it/n}(λ/n) / (1 – e^{it/n}(1 – λ/n)) = e^{it/n}λ / (n – e^{it/n}(n – λ)) = e^{it/n}λ / (n(1 – e^{it/n}) + e^{it/n}λ)

           = (1 + it/n + . . .)λ / (n(–it/n – . . .) + (1 + it/n + . . .)λ) → λ/(λ – it) = φ_Z(t),

where Z ∼ E(λ) (see Example 3.3.3). By the Continuity Theorem 3.3.8, since φ_{Y_n} → φ_Z it follows that
F_{Y_n} → F_Z.

3.4 LLN and the CLT


In this section we will use Theorem 3.3.8 to prove the Law of Large Numbers (LLN) and the Central
Limit Theorem (CLT). First, we must define the notion of convergence in distribution.

Definition 3.4.1. We say that a sequence of random variables (X_n)_{n≥1} converges in distribution to a
random variable X, written X_n →^D X, if F_{X_n} → F_X.

Theorem 3.4.2 (Law of Large Numbers). Let (X_n)_{n≥1} be a sequence of iid random variables
with EX_n = µ. Define a sequence of random variables (S_n)_{n≥1} by S_n := (1/n) Σ_{i=1}^n X_i. Then S_n →^D µ.

Proof. From Theorem 3.3.2 we have

φ_{S_n}(t) = (φ_X(t/n))^n = (1 + iµt/n + O((t/n)²))^n → e^{iµt},    as n → ∞.

Note that e^{iµt} = φ_µ(t), where φ_µ is the characteristic function of a constant random variable µ. Thus,
we have shown that φ_{S_n} → φ_µ. By Theorem 3.3.8 we have F_{S_n} → F_µ and thus, from Definition 3.4.1, we
have S_n →^D µ.

In the proof of Theorem 3.4.2 we actually assumed that E[X_n²] < ∞ when we wrote the O((t/n)²) term. In
fact, the Law of Large Numbers (LLN) holds even when E[X_n²] = ∞.

Theorem 3.4.3 (Central Limit Theorem). Let (X_n)_{n≥1} be a sequence of iid random variables
with EX_n = µ and VX_n = σ². Define two sequences of random variables (S_n)_{n≥1} and (U_n)_{n≥1} by

S_n = Σ_{i=1}^n X_i,    U_n = (S_n – nµ)/√(nσ²).

Then U_n →^D Z where Z ∼ N(0, 1).

Proof. First, we re-write U_n as follows

U_n = (1/√n) Σ_{i=1}^n Y_i,    Y_i := (X_i – µ)/σ.

Note that EY_n = 0 and VY_n = 1. Next, we compute the characteristic function of U_n. From Theorem
3.3.2 we have

φ_{U_n}(t) = (φ_Y(t/√n))^n = (1 – t²/(2n) + O((t/√n)³))^n → exp(–t²/2),    n → ∞.

From Example 3.3.4, we see that exp(–t²/2) = φ_Z(t) where Z ∼ N(0, 1). Thus, we have shown that
φ_{U_n} → φ_Z. By Theorem 3.3.8 we have F_{U_n} → F_Z and thus, from Definition 3.4.1, we have U_n →^D Z, as
claimed.

In the proof of Theorem 3.4.3 we assumed E|X_i|³ < ∞ when we wrote the O((t/√n)³) term. The Central
Limit Theorem (CLT) holds even when E|X_i|³ = ∞.
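A quick numerical illustration of Theorem 3.4.3 (not part of the original notes; the exponential distribution and sample sizes are illustrative choices): standardized sums of iid Exp(1) variables should have tail probabilities close to those of N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_paths = 500, 20_000
X = rng.exponential(1.0, size=(n_paths, n))       # Exp(1): mu = sigma = 1
U = (X.sum(axis=1) - n * 1.0) / np.sqrt(n * 1.0)  # U_n = (S_n - n*mu)/sqrt(n*sigma^2)
for c in (0.5, 1.0, 2.0):
    print(c, (U > c).mean())  # compare with 1 - Phi(c): 0.309, 0.159, 0.023
```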

3.5 Large Deviations Principle


Let S_n be the sum of n iid random variables (X_i) with common distribution X, whose mean and variance
are µ and σ², respectively:

S_n = Σ_{i=1}^n X_i,    EX = µ,    VX = σ².

The SLLN asserts that, as n → ∞, we roughly have S_n ∼ nµ. Similarly, the CLT asserts that, as n → ∞,
we roughly have |S_n – nµ| ∼ √n. But, of course, S_n is a random variable, not a constant. And it is quite
possible that |S_n – nµ| is O(n^α) where α > 1/2. The theory of large deviations studies the asymptotic
behavior of P(|S_n – nµ| > n^α) as n → ∞ for α > 1/2.

Let us introduce Λ_X, the cumulant generating function of X, which is given by

Λ_X(t) = log M_X(t),    M_X(t) = E[e^{tX}]. (3.7)

We note that Λ_X'(0) = µ because

Λ_X'(0) = M_X'(0)/M_X(0) = [E[X e^{tX}]/E[e^{tX}]]_{t=0} = EX = µ.

We also note that Λ_X is convex because

Λ_X''(t) = (M_X(t)M_X''(t) – (M_X'(t))²)/M_X²(t) = (E[e^{tX}]E[X²e^{tX}] – (E[Xe^{tX}])²)/M_X²(t) ≥ 0, (3.8)

where we have used the Cauchy-Schwarz inequality

(E[YZ])² ≤ E[Y²]E[Z²],    with Y = Xe^{tX/2}, Z = e^{tX/2}.

Next, we define Λ*_X, the Fenchel-Legendre transform of Λ_X, as follows

Λ*_X(a) = sup_{t∈R} {at – Λ_X(t)},    a ∈ R. (3.9)
We are now in a position to state the main theorem of this section.

Theorem 3.5.1 (Large Deviations Principle). Let (X_i) be a sequence of iid random variables
with common distribution X. Define µ := EX and suppose the moment generating function
M_X(t) := E[e^{tX}] is finite in some neighborhood of t = 0. Let Λ_X and Λ*_X be given by (3.7) and (3.9),
respectively. Suppose a > µ and P(X > a) > 0. Then Λ*_X(a) > 0 and

lim_{n→∞} (1/n) log P(S_n > na) = –Λ*_X(a),    S_n = Σ_{i=1}^n X_i. (3.10)

Theorem 3.5.1 asserts that, under appropriate conditions, we have P(Sn > na) ∼ exp(–nΛ∗X (a)).
Although the theorem appears to deal only with deviations of Sn in excess of its mean, the corresponding
result for deviations of Sn below the mean can be obtained by considering the sequence of iid random
variables (–Xi ).

Proof of Theorem 3.5.1. Without loss of generality, we may assume that µ = 0 (if µ ≠ 0, we can
define Y_i = X_i – µ and translate the result accordingly). We begin by proving that Λ*_X(a) > 0. We have

at – Λ_X(t) = log(e^{at}/M_X(t)) = log((1 + at + O(t²))/(1 + σ²t²/2 + O(t³))),    where σ² := VX.

For t > 0 sufficiently small we have

(1 + at + O(t²))/(1 + σ²t²/2 + O(t³)) > 1.

Hence, we have

Λ*_X(a) = sup_{t∈R} {at – Λ_X(t)} ≥ sup_{t∈R_+} log((1 + at + O(t²))/(1 + σ²t²/2 + O(t³))) > 0.

We make two notes for future use. First, by assumption we have a > µ = 0. As Λ_X is convex with
Λ_X'(0) = µ = 0 and Λ_X(0) = 0, it follows that

Λ*_X(a) = sup_{t>0} {at – Λ_X(t)},    a > 0. (3.11)

To see (3.11), it may help to draw a picture. Second,

Λ_X is strictly convex at points where Λ_X'' exists. (3.12)

To see this, note that VX > 0 under the hypotheses of the theorem, implying by (3.8) and the Cauchy-
Schwarz inequality that Λ_X''(t) > 0.

We now proceed to derive an upper bound for P(S_n > na). Using the fact that e^{tS_n} > e^{nat} 1_{S_n>na} for
all t > 0 we obtain

P(S_n > na) = E[1_{S_n>na}] ≤ e^{–nat} E[e^{tS_n}] = (e^{–at} M_X(t))^n = e^{–n(at – Λ_X(t))},    ∀ t > 0. (3.13)

Note that (3.13) is valid for all t > 0 and thus

(1/n) log P(S_n > na) ≤ –sup_{t>0} {at – Λ_X(t)} = –Λ*_X(a), (3.14)

where the equality follows from (3.11).

Before obtaining a lower bound for P(S_n > na) let us define

T := sup{t : M_X(t) < ∞}.

Throughout this proof we will assume that

t* := argmax_t {at – Λ_X(t)} ∈ (0, T). (3.15)

This assumption is not necessary, but simplifies the proof considerably. As at – Λ_X(t) has a maximum at
t*, and as Λ_X is C^∞ on (0, T), the derivative of at – Λ_X(t) equals zero at t = t* and hence

Λ_X'(t*) = a.

Let us denote by F_X the distribution of X. We can define an exponentially tilted distribution F̃_X as
follows

F̃_X(x) = (1/M_X(t*)) ∫_{–∞}^x e^{t*y} dF_X(y). (3.16)

Let (X̃_i) be a sequence of iid random variables with common distribution F̃_X and define S̃_n = Σ_{i=1}^n X̃_i.
Observe that

M_{X̃}(t) := ∫_R e^{tx} dF̃_X(x) = ∫_R (e^{(t+t*)x}/M_X(t*)) dF_X(x) = M_X(t + t*)/M_X(t*).

The first two moments of X̃ can be obtained by direct computation:

E[X̃] = M_{X̃}'(0) = M_X'(t*)/M_X(t*) = Λ_X'(t*) = a,

V[X̃] = E[X̃²] – (E[X̃])² = M_{X̃}''(0) – (M_{X̃}'(0))² = Λ_X''(t*) ∈ (0, ∞),

where the fact that 0 < Λ_X''(t*) < ∞ follows from (3.12) and the assumption (3.15). Noting that S_n and
S̃_n are sums of iid random variables, we have

M_{S_n}(t) = M_X^n(t),    M_{S̃_n}(t) = M_{X̃}^n(t).

Using the above and (3.16), it is easy to show that F_{S_n} and F_{S̃_n} are related as follows

F_{S̃_n}(x) = (1/M_X^n(t*)) ∫_{–∞}^x e^{t*y} dF_{S_n}(y).
Now, let b > a. We have

P(S_n > na) = ∫_{na}^∞ dF_{S_n}(x)
            = M_X^n(t*) ∫_{na}^∞ e^{–t*x} dF_{S̃_n}(x)
            ≥ M_X^n(t*) e^{–nt*b} ∫_{na}^{nb} dF_{S̃_n}(x)
            ≥ e^{–n(t*b – Λ_X(t*))} P(na < S̃_n < nb).

As E[X̃] = a and V[X̃] > 0 we have by the CLT that P(S̃_n > na) → 1/2 as n → ∞, and by the SLLN we
have that P(S̃_n < nb) → 1. Therefore, we have

(1/n) log P(S_n > na) ≥ –(t*b – Λ_X(t*)) + (1/n) log P(na < S̃_n < nb)
                      → –(t*b – Λ_X(t*))    as n → ∞
                      → –(t*a – Λ_X(t*)) = –Λ*_X(a)    as b → a. (3.17)

Equation (3.10) follows from (3.14) and (3.17).

Example 3.5.2. Let (X_i) be a sequence of iid random variables with P(X_i = 1) = P(X_i = –1) = 1/2.
We claim that

P(S_n > an)^{1/n} → 1/√((1 + a)^{1+a}(1 – a)^{1–a}),    0 < a < 1,

as n → ∞. To see this, we have from (3.10) that

P(S_n > na)^{1/n} → e^{–Λ*_X(a)}.

We must compute Λ*_X(a). First, we have

Λ_X(t) = log E[e^{tX}] = log((1/2)e^{–t} + (1/2)e^t) = log cosh t.

Next, to find t* := argmax_t {at – Λ_X(t)} we compute

0 = ∂_t(at – Λ_X(t))|_{t=t*} = a – tanh t*    ⇒    t* = tanh^{–1} a.

Therefore, we have

Λ*_X(a) = sup_t {at – Λ_X(t)} = at* – Λ_X(t*) = a tanh^{–1} a – log cosh tanh^{–1} a.

Lastly, using 2 tanh^{–1} a = log(1 + a) – log(1 – a) and cosh tanh^{–1} a = 1/√(1 – a²), we have

e^{–Λ*_X(a)} = e^{–a tanh^{–1} a + log cosh tanh^{–1} a} = e^{–a tanh^{–1} a}/√(1 – a²)
             = √((1 + a)^{–a}(1 – a)^a)/√((1 + a)(1 – a)) = 1/√((1 + a)^{1+a}(1 – a)^{1–a}),

as claimed. Note that, if a ≥ 1, then we have P(S_n > na) = 0 for all n.
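Because P(S_n > na) can be computed exactly for the ±1 walk (the number of up-steps is Bin(n, 1/2)), the limit in Example 3.5.2 can be checked numerically. A short Python sketch (a and n are illustrative):

```python
from math import comb, sqrt

a, n = 0.5, 500
# S_n > na  <=>  the number k of +1 steps satisfies 2k - n > na
tail = sum(comb(n, k) for k in range(n + 1) if 2 * k - n > a * n) / 2**n
limit = 1.0 / sqrt((1 + a)**(1 + a) * (1 - a)**(1 - a))
print(tail**(1 / n), "vs", limit)  # both ~ 0.877
```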

3.6 Exercises
Exercise 3.1. Let X ∼ Bin(n, U) where U ∼ U((0, 1)). What is the probability generating function
G_X(s) of X? What is P(X = k) where k ∈ {0, 1, 2, . . . , n}?

Exercise 3.2. Let Z_n be the size of the nth generation in an ordinary branching process with Z_0 = 1,
EZ_1 = µ and VZ_1 > 0. Show that E[Z_n Z_m] = µ^{n–m} E[Z_m²] for m ≤ n. Use this to find the correlation
coefficient ρ(Z_m, Z_n) in terms of µ, n, and m. Consider the case µ = 1 and the case µ ≠ 1.

Exercise 3.3. Consider a branching process with generation sizes Z_n satisfying Z_0 = 1 and P(Z_1 = 0) = 0.
Pick two individuals at random with replacement from the nth generation and let L be the index of the
generation which contains their most recent common ancestor. Show that P(L = r) = E[Z_r^{–1}] – E[Z_{r+1}^{–1}] for
0 ≤ r < n.

Exercise 3.4. Consider a branching process with immigration

Z_0 = 1,    Z_{n+1} = Σ_{i=1}^{Z_n} X_{n,i} + Y_n,

where the (X_{n,i}) are iid with common distribution X, the (Y_n) are iid with common distribution Y and
the (X_{n,i}) and (Y_n) are independent. What is G_{Z_{n+1}}(s) in terms of G_{Z_n}(s), G_X(s) and G_Y(s)? Write
G_{Z_2}(s) explicitly in terms of G_X(s) and G_Y(s).

Exercise 3.5. Find φ_{X²}(t) := E[e^{itX²}] where X ∼ N(µ, σ²).

Exercise 3.6. Let X_n have cumulative distribution function

F_{X_n}(x) = (x – sin(2nπx)/(2nπ)) 1_{0≤x≤1} + 1_{x>1}.

(a) Show that F_{X_n} is a distribution function and find the corresponding density function f_{X_n}. (b) Show
that F_{X_n} converges to the uniform distribution function F_U as n → ∞, but that the density functions
f_{X_n} do not converge to f_U. Here, U ∼ U((0, 1)).

Exercise 3.7. A coin is tossed repeatedly, with heads turning up with probability p on each toss. Let
N be the minimum number of tosses required to obtain k heads. Show that, as p → 0, the distribution
function of 2Np converges to that of a gamma distribution. Note that, if X ∼ Γ(λ, r) then

f_X(x) = (1/Γ(r)) λ^r x^{r–1} e^{–λx} 1_{x≥0}.

Exercise 3.8. Show that, as n → ∞, we have

2^{–n} Σ_{k∈A_n} C(n, k) → ∫_{–x}^x dy (1/√(2π)) e^{–y²/2},

where A_n := {k : |k – n/2| ≤ x√n/2} and C(n, k) denotes the binomial coefficient.
Chapter 4

Discrete time Markov chains

The notes from this chapter are primarily taken from (Grimmett and Stirzaker, 2001, Chapter 6).

4.1 Overview of discrete time Markov chains


We have already defined what we mean by Markov process in Definition 2.3.6. In this chapter, we will
derive some results that are specific to discrete time Markov processes that have a countable state space.

Definition 4.1.1. A discrete time Markov chain X = (Xn )n∈N0 is a discrete-time Markov process
with a countable state space S.

Recall from (2.4) that a Markov process defined on a probability space (Ω, F, P), which is equipped with a
filtration F = (F_n)_{n∈N_0}, is a process X = (X_n)_{n∈N_0} that satisfies P(X_{n+k} ∈ A|F_n) = P(X_{n+k} ∈ A|X_n).
What the Markov property means in practice is that the evolution of a Markov chain is described by
its one-step transition probabilities P(Xn+1 = j |Xn = i ). In general, such probabilities may depend on
the time n. Throughout this Chapter, we will restrict our attention to Markov chains whose transition
probabilities do not depend on n.

Definition 4.1.2. A discrete time Markov chain X is called homogeneous if

P(Xn+1 = j |Xn = i ) = P(X1 = j |X0 = i ) =: p(i , j ) ∀ n ∈ N0 , ∀ i , j ∈ S.

We call the |S| × |S| matrix P = (p(i , j )) the one-step transition matrix.

The one-step transition matrix P satisfies some obvious properties:

1. Σ_j p(i, j) = 1 for all i ∈ S.

2. p(i, j) ≥ 0 for all i, j ∈ S.

Since P(Xn+1 = j |Xn = i ) = P(X1 = j |X0 = i ), it follows that P(Xn+m = j |Xm = i ) = P(Xn =
j |X0 = i ).

Definition 4.1.3. Let X be a discrete time Markov chain and let

pn (i , j ) := P(Xn = j |X0 = i ) ∀ n ∈ N0 , ∀ i , j ∈ S.

We call the |S| × |S| matrix Pn = (pn (i , j )) the n-step transition matrix.

From the one-step transition matrix P, one can easily derive the n-step transition matrix Pn .

Theorem 4.1.4 (Chapman-Kolmogorov Equation). Let P and P_n be the one-step and n-step
transition matrices of a homogeneous discrete-time Markov chain. Then

P_{m+n} = P_m P_n,    P_n = P^n,

where P^n is the nth power of P.

Proof. For any i, j ∈ S we have

p_{m+n}(i, j) = P(X_{m+n} = j|X_0 = i)
             = Σ_k P(X_{n+m} = j|X_m = k)P(X_m = k|X_0 = i)
             = Σ_k P(X_n = j|X_0 = k)P(X_m = k|X_0 = i) = Σ_k p_m(i, k)p_n(k, j).

Thus, we have established that P_{m+n} = P_m P_n. Next, note that

P_n = P_{n–1}P = P_{n–2}P² = . . . = P^n,

as claimed.
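The Chapman-Kolmogorov equation is straightforward to verify numerically for any stochastic matrix; the 3-state matrix in the Python sketch below is an arbitrary illustrative choice.

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
P2 = np.linalg.matrix_power(P, 2)
P5 = np.linalg.matrix_power(P, 5)
# P_{2+5} = P_2 P_5 = P^7
print(np.allclose(P2 @ P5, np.linalg.matrix_power(P, 7)))  # True
```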

Lemma 4.1.5. Let X be a homogeneous discrete time Markov chain. Denote the probability mass
function of X_n by

µ_n(i) := P(X_n = i),    µ_n = (µ_n(1), µ_n(2), . . . , µ_n(|S|)).

Then µ_{n+m} = µ_n P_m and thus µ_m = µ_0 P^m.

Proof. For any j ∈ S we have

µ_{n+m}(j) = P(X_{n+m} = j) = Σ_i P(X_{n+m} = j|X_n = i)P(X_n = i) = Σ_i µ_n(i)p_m(i, j).

Thus, we have established that µ_{n+m} = µ_n P_m. To finish the proof, take n = 0 and use P_m = P^m.

Example 4.1.6. Consider a random walk on a circle with n nodes. At each step a particle moves
clockwise with probability p and counter-clockwise with probability q. Let us write the one-step transition
matrix. We have

P =
[ 0  p  0  0  ...  0  0  q ]
[ q  0  p  0  ...  0  0  0 ]
[ 0  q  0  p  ...  0  0  0 ]
[ .  .  .  .  ...  .  .  . ]
[ 0  0  0  0  ...  q  0  p ]
[ p  0  0  0  ...  0  q  0 ]

Example 4.1.7. Consider a random walk on N_0. At each step a particle moves up one unit with
probability p and returns to the origin with probability q. Let us write the one-step transition matrix.
We have

P =
[ q  p  0  0  0  ... ]
[ q  0  p  0  0  ... ]
[ q  0  0  p  0  ... ]
[ .  .  .  .  .  ... ]

4.2 Classification of states


Definition 4.2.1. Let X be a Markov chain. We say that state i is persistent or recurrent if

P(X_n = i for some n ≥ 1 | X_0 = i) = 1.

Otherwise, we say that state i is transient.

If a Markov chain visits a persistent state once, then it is guaranteed to return to that state. In fact, it is
guaranteed to return to that state infinitely often.

We would like to find conditions on the transition matrix P that enable us to classify a state as either
persistent or transient.

Definition 4.2.2. Let X be a Markov chain. We define the first passage time to state j by

τ_j := inf{n ≥ 1 : X_n = j},

and we denote by f_ij the probability mass function of τ_j given X_0 = i. That is,

f_ij(n) = P(τ_j = n|X_0 = i).



Theorem 4.2.3. Define the following generating functions

P_ij(s) := Σ_{n=0}^∞ s^n p_n(i, j),    p_0(i, j) := δ_ij,
F_ij(s) := Σ_{n=0}^∞ s^n f_ij(n),    f_ij(0) := 0.

Then, for any |s| < 1, we have

P_ii(s) = 1 + F_ii(s)P_ii(s),    P_ij(s) = F_ij(s)P_jj(s),    i ≠ j. (4.1)

Note that we must restrict |s| < 1 because P_ij is not a probability generating function. When we need
s = 1 we can take a limit as s ↑ 1 and use Abel’s Theorem, which states that

lim_{s↑1} P_ij(s) = Σ_{n=0}^∞ p_n(i, j). (4.2)
Proof. Fix i, j ∈ S and define

A_m = {X_m = j},    B_m = {τ_j = m}.

Since the sets (B_m)_{m≥1} are disjoint we have

p_m(i, j) = P(A_m|X_0 = i) = Σ_{k=1}^m P(A_m ∩ B_k|X_0 = i). (4.3)

Note that

P(A_m ∩ B_k|X_0 = i) = P(A_m|B_k, X_0 = i) · P(B_k|X_0 = i)
                     = P(A_m|X_k = j) · P(B_k|X_0 = i) = p_{m–k}(j, j) · f_ij(k). (4.4)

Inserting (4.4) into (4.3) we obtain

p_m(i, j) = Σ_{k=1}^m f_ij(k)p_{m–k}(j, j).

Thus

P_ij(s) = Σ_{m=0}^∞ s^m p_m(i, j)
        = δ_ij + Σ_{m=1}^∞ s^m Σ_{k=1}^m f_ij(k)p_{m–k}(j, j)
        = δ_ij + Σ_{m=1}^∞ Σ_{k=1}^m s^k f_ij(k) s^{m–k} p_{m–k}(j, j)
        = δ_ij + Σ_{k=1}^∞ s^k f_ij(k) · Σ_{m=k}^∞ s^{m–k} p_{m–k}(j, j)
        = δ_ij + F_ij(s)P_jj(s),

which is precisely the claimed result.



Corollary 4.2.4. The following holds:

1. State j is persistent if Σ_n p_n(j, j) = ∞, and if this holds then Σ_n p_n(i, j) = ∞ for all i such
that F_ij(1) > 0.

2. State j is transient if Σ_n p_n(j, j) < ∞, and if this holds then Σ_n p_n(i, j) < ∞ for all i.

Proof. We will show that j is persistent if and only if Σ_n p_n(j, j) = ∞. First, by Definition 4.2.1, we
have

j is persistent  ⇔  F_jj(1) = Σ_{n=1}^∞ f_jj(n) = 1.

Next, from (4.1) and Abel’s Theorem (4.2), we have

Σ_n p_n(j, j) = lim_{s↑1} P_jj(s) = lim_{s↑1} 1/(1 – F_jj(s)) = 1/(1 – F_jj(1)).

Thus, j is persistent if and only if Σ_n p_n(j, j) = ∞. We also have from (4.1) and Abel’s Theorem (4.2)
that

Σ_n p_n(j, j) = lim_{s↑1} P_jj(s) = lim_{s↑1} P_ij(s)/F_ij(s) = Σ_n p_n(i, j)/F_ij(1).

Thus, if j is persistent then Σ_n p_n(i, j) = ∞ whenever F_ij(1) > 0.

Corollary 4.2.5. If j is transient then p_n(i, j) → 0 as n → ∞ for all i.

Proof. From Corollary 4.2.4, if j is transient, we have Σ_n p_n(i, j) < ∞ for all i. It follows that
p_n(i, j) → 0 as n → ∞ for all i.

There are a few other state classifications that are important.

Definition 4.2.6. The mean recurrence time of a state j is defined as

τ̄j = E[τj |X0 = j ],

where τj is defined in Definition 4.2.2.

Clearly, τ̄j = ∞ if a state is transient. However, τ̄j may be infinite even if j is recurrent.

Definition 4.2.7. A recurrent state j is said to be null if τ̄j = ∞ and non-null or positive if τ̄j < ∞.

The following is a simple condition which differentiates null recurrent from positive recurrent states.

Theorem 4.2.8. A persistent state j is null if and only if pn (j , j ) → 0 as n → ∞; if this holds then
pn (i , j ) → 0 as n → ∞ for all i .

We will not prove Theorem 4.2.8.

Definition 4.2.9. The period of a state i , denoted d(i ), is defined by d(i ) = gcd{n : pn (i , i ) > 0}. If
d(i ) = 1 we say that state i is aperiodic. Here gcd means “greatest common divisor.”

Example 4.2.10. Let X be a Markov Chain with one-step transition matrix

P =
[ 0  1  0 ]
[ 0  0  1 ]
[ 1  0  0 ]

One can easily see that

p_n(i, i) = 1 if n = 3, 6, 9, . . .,    and    p_n(i, i) = 0 otherwise.

Definition 4.2.11. A state is called ergodic if it is persistent, non-null, and aperiodic.

4.3 Classification of chains


Definition 4.3.1. We say state i communicates with state j , written i → j , if pn (i , j ) > 0 for
some n ≥ 1 (in other words, it is possible to reach state j from state i ). We say that states i and j
intercommunicate, written i ↔ j , if i → j and j → i .

Theorem 4.3.2. Suppose i ↔ j. Then

1. i is transient if and only if j is transient;

2. i is null persistent if and only if j is null persistent;

3. i and j have the same period: d(i) = d(j).

Proof. We will prove Item 1. If i ↔ j then there exist k, n such that α := p_k(i, j)p_n(j, i) > 0.
Therefore, for any m, we have

p_{k+m+n}(i, i) ≥ p_k(i, j)p_m(j, j)p_n(j, i) = α p_m(j, j).

Therefore, if i is transient, by Corollary 4.2.4, we have

∞ > Σ_m p_m(i, i) ≥ Σ_m p_{k+m+n}(i, i) ≥ α Σ_m p_m(j, j),

and thus j is transient as well. To show the converse simply switch i and j in the above argument.

Definition 4.3.3. Let C ⊂ S. We say that C is closed if p(i, j) = 0 for all i ∈ C and j ∉ C. We say
that C is irreducible if i ↔ j for all i, j ∈ C.

If a closed set C consists of a single state (e.g., C = {j }), then we call this state absorbing.

Theorem 4.3.4 (Markov chain decomposition). The state space S of a Markov chain can be
uniquely partitioned as follows

S = T ∪ C1 ∪ C2 ∪ . . . ,

where T is the set of transient states and C1 , C2 , . . . are closed sets of persistent states.

The importance of the decomposition theorem is as follows. Suppose X_0 ∈ C_k, where C_k is a closed
set of persistent states. Then X remains in C_k forever; as such, we may as well analyze only C_k rather
than the entire state space S. On the other hand, if X_0 ∈ T, then the chain X either remains in T
forever or it eventually visits one of the closed persistent sets C_k, where it remains forever. If |S| < ∞,
it turns out that it is impossible for X to remain in T forever.

Lemma 4.3.5. If S is finite, then at least one state is persistent and all persistent states are
non-null.

Proof. First, suppose all states are transient. Then, by Corollary 4.2.5, we have

0 = Σ_j lim_{n→∞} p_n(i, j) = lim_{n→∞} Σ_j p_n(i, j) = 1,

where the interchange of limit and sum is justified because S is finite. Thus, we have a contradiction
and we conclude that at least one state is persistent. Now, suppose j ∈ C_k, where C_k is a closed set of
persistent states, and suppose j is null. Then by Theorem 4.3.2, all states i ∈ C_k are null. Then, by
Theorem 4.2.8 we have

0 = Σ_j lim_{n→∞} p_n(i, j) = lim_{n→∞} Σ_j p_n(i, j) = 1.

Thus, j cannot be null.

Example 4.3.6. Let S = {1, 2, 3, 4, 5, 6} and

P =
[ p  q  0  0  0  0 ]
[ q  p  0  0  0  0 ]
[ r  s  t  u  0  0 ]
[ 0  0  u  t  s  r ]
[ 0  0  0  0  p  q ]
[ 0  0  0  0  q  p ],    q + p = 1,    r + s + t + u = 1.

The subsets C_1 = {1, 2} and C_2 = {5, 6} are closed, persistent, non-null sets since, once the chain
visits these sets, it cannot escape. The subset T = {3, 4} is transient. If X_0 ∈ T, the chain will eventually
move to either C_1 or C_2, where it will remain forever.

4.4 Stationary distributions and the limit theorem


In this section, we will be interested in studying the long-run behavior of a Markov chain X.

Definition 4.4.1. Let X be a Markov chain with one-step transition matrix P. We say that a row vector

π = (π(1), π(2), . . . , π(|S|))

is a stationary or invariant distribution of X if it satisfies

π = πP,    Σ_i π(i) = 1,    and π(i) ≥ 0 for all i.

The vector π is called a stationary distribution for the following reason: suppose the Markov chain X
has an initial distribution µ_0 = π. Then we have

µ_n = πP^n = (πP)P^{n–1} = πP^{n–1} = (πP)P^{n–2} = πP^{n–2} = . . . = πP = π.

Thus, if µ_0 = π then µ_n = π for all n.

Example 4.4.2. Consider a two-state Markov chain with state space S = {1, 2} and with one-step
transition matrix

P =
[ 1–p   p  ]
[  q   1–q ],    q, p ∈ (0, 1).

Let us find π. From π = πP we derive

π(1) = π(1)(1 – p) + π(2)q,
π(2) = π(1)p + π(2)(1 – q).

We also have π(1) + π(2) = 1. Solving for π(1) and π(2), we obtain

π(1) = q/(p + q),    π(2) = p/(p + q).
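Numerically, a stationary distribution is a left eigenvector of P with eigenvalue 1, normalized to sum to one. The Python sketch below (illustrative p, q) recovers the formula just derived.

```python
import numpy as np

p, q = 0.2, 0.5
P = np.array([[1 - p, p],
              [q, 1 - q]])

# left eigenvectors of P are the (right) eigenvectors of P transpose
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()                              # normalize to a probability vector
print(pi, "vs", [q / (p + q), p / (p + q)])
```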

For a given Markov chain X, a stationary distribution may not exist. And, if it does exist, it may not be
unique.

Theorem 4.4.3. An irreducible chain X has a stationary distribution π if and only if all the states
are non-null persistent; in this case π is unique and is given by

π(i ) = 1/τ̄i , ∀ i ∈ S,

where τ̄i is given in Definition 4.2.6.

It is typically much easier to compute π rather than the recurrence times τ̄i . Theorem 4.4.3 gives us a
method of computing τ̄i from π(i ).

Example 4.4.4. Consider a Markov chain with state space S = {0, 1, 2, . . .} and with one-step transition
matrix

P =
[ q  p  0  0  0  ... ]
[ q  0  p  0  0  ... ]
[ 0  q  0  p  0  ... ]
[ 0  0  q  0  p  ... ]
[ .  .  .  .  .  ... ],    p + q = 1.

Note that the chain is irreducible. Thus, if an invariant distribution π exists, it is unique. Let us find π
(if it exists). From π = πP we have

π(0) = qπ(0) + qπ(1)  ⇒  π(1) = ((1 – q)/q)π(0) = (p/q)π(0),
π(1) = pπ(0) + qπ(2)  ⇒  π(2) = (1/q)(π(1) – pπ(0)) = (p/q)²π(0),
π(2) = pπ(1) + qπ(3)  ⇒  π(3) = (1/q)(π(2) – pπ(1)) = (p/q)³π(0).

This suggests that π(n) = (p/q)^n π(0). Assume this is true for every i ≤ n. We show it holds for n + 1
(and thus, by induction, for every n). We have

π(n) = pπ(n – 1) + qπ(n + 1)  ⇒  π(n + 1) = (1/q)(π(n) – pπ(n – 1))
                                          = (1/q)((p/q)^n π(0) – p(p/q)^{n–1} π(0))
                                          = (p/q)^{n+1} π(0).

Next, we use Σ_n π(n) = 1 to find π(0). We have

1 = Σ_{n=0}^∞ π(n) = π(0) Σ_{n=0}^∞ (p/q)^n = π(0)/(1 – (p/q)),    if (p/q) < 1.

Thus, if (p/q) < 1 then π(0) = 1 – (p/q). Finally, let us find the mean recurrence time of the nth state.
We have

τ̄_n = 1/π(n) = (1/π(0))(q/p)^n = (1/(1 – (p/q)))(q/p)^n.

Note that if (p/q) > 1, then there is no invariant distribution π. In this case, the chain is transient and
the mean recurrence time for every state is infinite. What happens if p = q? It turns out that in this
case all states are null recurrent.

4.5 Reversibility
Definition 4.5.1. Let X = (Xn )0≤n≤N be a Markov chain. The time reversal of X is the process
Y = (Yn )0≤n≤N , where

Yn = XN–n .

Theorem 4.5.2. Let X = (Xn )0≤n≤N be an irreducible Markov chain with a one-step transition
matrix P and invariant distribution π. Suppose X0 ∼ π (so that µn = π for every n). Then Y, the
time reversal of X, is a Markov chain and its one-step transition matrix, denoted Q = (q(i , j ))ij ,
is given by

q(i, j) = (π(j)/π(i)) p(j, i). (4.5)

Proof. Let F_n = σ(Y_k, 0 ≤ k ≤ n). We cannot assume, a priori, that Y is Markov. Thus, in computing
P(Y_{n+1} = i_{n+1}|F_n) we must condition on the entire history F_n of Y rather than conditioning on Y_n
only. We have

P(Y_{n+1} = i_{n+1}|Y_k = i_k, 0 ≤ k ≤ n)
  = P(Y_k = i_k, 0 ≤ k ≤ n + 1)/P(Y_k = i_k, 0 ≤ k ≤ n)
  = P(X_{N–k} = i_k, 0 ≤ k ≤ n + 1)/P(X_{N–k} = i_k, 0 ≤ k ≤ n)
  = [P(X_{N–k} = i_k, 0 ≤ k ≤ n – 1|X_{N–n} = i_n)P(X_{N–n} = i_n|X_{N–(n+1)} = i_{n+1})P(X_{N–(n+1)} = i_{n+1})]
    / [P(X_{N–k} = i_k, 0 ≤ k ≤ n – 1|X_{N–n} = i_n)P(X_{N–n} = i_n)]
  = P(X_{N–n} = i_n|X_{N–(n+1)} = i_{n+1})P(X_{N–(n+1)} = i_{n+1})/P(X_{N–n} = i_n)
  = p(i_{n+1}, i_n)π(i_{n+1})/π(i_n).

Thus, we have shown that Y is Markov and its one-step transition matrix is given by (4.5).

Definition 4.5.3. Let X = (X_n)_{n≥0} be an irreducible Markov chain with a one-step transition matrix P
and invariant distribution π. Suppose X_0 ∼ π (so that µ_n = π for every n). Let Y be the time reversal
of X and denote by Q the one-step transition matrix of Y. We say that X is reversible if P = Q, or
equivalently (by Theorem 4.5.2), if

π(i )p(i , j ) = π(j )p(j , i ), ∀ i , j ∈ S.

Definition 4.5.4. Let P be an |S| × |S| one step transition matrix. We say that a distribution
λ = (λ(1), λ(2), . . . , λ(|S|)) is in detailed balance with P if

λ(i )p(i , j ) = λ(j )p(j , i ), ∀ i , j ∈ S. (4.6)

Theorem 4.5.5. Let P be the one step transition matrix of an irreducible Markov chain X. Suppose
there exists a distribution λ such that (4.6) holds. Then λ is a stationary distribution for X. That
is, λ = λP.

Proof. Assuming (4.6) holds, we have

Σ_i λ(i)p(i, j) = Σ_i λ(j)p(j, i) = λ(j) Σ_i p(j, i) = λ(j).

Thus, we have proved λ = λP, and thus λ is a stationary distribution for X.

4.6 Chains with finitely many states


In this section, we will give some important results that apply to Markov chains with finitely many
states. By Theorem 4.3.2 and Lemma 4.3.5, if the state space S is finite and the chain is irreducible,
then all states are necessarily non-null persistent and have the same period.

Theorem 4.6.1 (Perron-Frobenius). If P is the one-step transition matrix of a finite irreducible
Markov chain X with period d, then

1. λ_1 = 1 is an eigenvalue of P.

2. The d complex roots of unity

λ_n = ω^{n–1},    ω = e^{2πi/d},    n = 1, 2, . . . , d,

are eigenvalues of P.

3. The remaining eigenvalues λ_{d+1}, λ_{d+2}, . . . , λ_{|S|} satisfy |λ_j| < 1.

Suppose the eigenvalues of P are distinct. Then it is well-known that P has the decomposition

P = U^{–1}ΛU,

where the rows of U are the left eigenvectors (e_i)_{1≤i≤|S|} of P and Λ = diag(λ_1, λ_2, . . . , λ_{|S|}). It
follows that

P^n = U^{–1}ΛU · U^{–1}ΛU · . . . · U^{–1}ΛU = U^{–1}Λ^n U, (4.7)

where we have used the fact that UU^{–1} = I.

Expression (4.7) allows us to study the long-run behavior n → ∞ of a Markov chain X. In what follows,
assume the period of X is one so that the eigenvalues of P satisfy 1 = λ_1 > |λ_2| > |λ_3| > . . . > |λ_{|S|}|. Let
X_0 ∼ µ_0. As the left eigenvectors (e_i)_{1≤i≤|S|} form a basis for R^{|S|}, we can express µ_0 as

µ_0 = Σ_{i=1}^{|S|} c_i e_i, (4.8)

for some constants (c_i). Next, using (4.7) and (4.8), we obtain

µ_n = µ_0 P^n = Σ_{i=1}^{|S|} c_i e_i P^n = Σ_{i=1}^{|S|} c_i λ_i^n e_i = c_1 απ + Σ_{i=2}^{|S|} c_i λ_i^n e_i,

where, in the last step, we used λ_1 = 1 and e_1 = απ for some constant α. As n → ∞, the terms in the
sum go to zero at least as fast as |λ_2|^n = exp(n log |λ_2|). Thus, µ_n → π exponentially fast with a rate
determined by the second eigenvalue.
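The geometric convergence rate |λ_2|^n is easy to observe numerically. In the Python sketch below (the 3-state matrix is an arbitrary aperiodic, irreducible example), the distance between µ_n and π shrinks at the rate set by the second-largest eigenvalue modulus.

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])
w, V = np.linalg.eig(P.T)
order = np.argsort(-np.abs(w))             # eigenvalues by decreasing modulus
pi = np.real(V[:, order[0]]); pi /= pi.sum()
lam2 = np.abs(w[order[1]])

mu = np.array([1.0, 0.0, 0.0])             # start the chain in state 1
for n in range(1, 6):
    mu = mu @ P
    print(n, np.abs(mu - pi).sum(), lam2**n)  # both decay geometrically
```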

4.7 Exercises
Exercise 4.1. A six-sided die is rolled repeatedly. Which of the following are Markov chains? For those
that are, find the one-step transition matrix. (a) X_n is the largest number rolled up to the nth roll. (b)
X_n is the number of sixes rolled in the first n rolls. (c) At time n, X_n is the time since the last six
was rolled. (d) At time n, X_n is the time until the next six is rolled.

Exercise 4.2. Let Y_n = X_{2n}. Compute the transition matrix for Y when (a) X is a simple random
walk (i.e., X increases by one with probability p and decreases by one with probability q) and (b) X is a
branching process where G is the generating function of the number of offspring from each individual.

Exercise 4.3. Let X be a Markov chain with state space S and absorbing state k (i.e., p(k , j ) = 0 for
all j ∈ S). Suppose j → k for all j ∈ S. Show that all states other than k are transient.

Exercise 4.4. Suppose two distinct states i, j satisfy

    P(τj < τi | X0 = i) = P(τi < τj | X0 = j),

where τk := inf{n ≥ 1 : Xn = k}. Show that, if X0 = i, the expected number of visits to j prior to
re-visiting i is one.

Exercise 4.5. Let X be a Markov chain with transition matrix

    P = ( 1 – 2p    2p        0      )
        ( p         1 – 2p    p      ),    p ∈ (0, 1).
        ( 0         2p        1 – 2p )

Find Pⁿ, the invariant distribution π and the mean-recurrence times τ̄j for j = 1, 2, 3.

Exercise 4.6. Let Xn be the number of mistakes in the nth edition of a book. Between the nth and
the (n + 1)th edition, an editor corrects each mistake independently with probability p and introduces
Yn new mistakes, where the (Yn) are iid and Poisson distributed with parameter λ. Find the invariant
distribution π of the number of mistakes in the book.

Exercise 4.7. Give an example of a transition matrix P that admits multiple stationary distributions π.

Exercise 4.8. A Markov chain on S = {0, 1, 2, . . . , n} has transition probabilities p(0, 0) = 1 – λ0,
p(i, i + 1) = λi and p(i + 1, i) = µ_{i+1} for i = 0, 1, . . . , n – 1, and p(n, n) = 1 – µn. Show that the process
is reversible in equilibrium.
Chapter 5

Continuous time Markov chains

The notes from this chapter are primarily taken from (Grimmett and Stirzaker, 2001, Chapter 6).

5.1 The Poisson process


Many stochastic processes change value at random times rather than at fixed intervals. For example, the
value of a stock does not change once per second, every second. Rather, a stock price changes when there
is a buy or sell order, and these orders arrive at random times. Thus, it makes no sense to model a
stock price as a discrete time Markov chain; rather, we must move to continuous time. The simplest
continuous time process is a counting process.

Definition 5.1.1. A counting process is a stochastic process N = (Nt )t ≥0 taking values in S =


{0, 1, 2, . . .} such that the following hold:

1. N0 = 0.
2. If s < t then Ns ≤ Nt .

As its name suggests, a counting process simply counts the number of times an event occurs prior to a
given time. For example, we could define Nt := {number of buy orders prior to time t }. Of course, there
are many ways to model a counting process. And, depending on how we model N, the counting process
may or may not be Markov. However, because Markov processes allow for many analytically tractable
computations, we will focus on Markov counting processes; the most basic of these is the Poisson process.

Definition 5.1.2. A Poisson process with intensity λ is a stochastic process N = (Nt )t ≥0 taking values
in S = {0, 1, 2, . . .} such that the following hold.

1. N0 = 0.


2. If s < t then Ns ≤ Nt .
3. If s < t then (Nt – Ns ) ⊥⊥ Fs , where Fs = σ(Nr , 0 ≤ r ≤ s).
4. Lastly, as s → 0+,

    P(N_{t+s} = n + m | Nt = n) =  λs + O(s²)       if m = 1,
                                   O(s²)            if m ≥ 2,
                                   1 – λs + O(s²)   if m = 0.

Note that Item 3 implies that N is Markov. To see this, define Ft := σ(Ns, 0 ≤ s ≤ t) and observe that

    E[g(Nt) | Fs] = E[g(Nt – Ns + Ns) | Ns] = Σ_{n=0}^{∞} g(n + Ns) f_{Nt–Ns}(n).

Let us find the probability mass function of Nt .

Theorem 5.1.3. Let N = (Nt )t ≥0 be a Poisson process with parameter λ. Then, for all t ≥ 0 we
have Nt ∼ Poi(λt ). That is

    pt(j) := P(Nt = j) = ((λt)^j / j!) e^{–λt}.                             (5.1)
Proof. We have

    P(N_{t+s} = j) = Σ_{i=0}^{j} P(N_{t+s} = j | Nt = i) P(Nt = i)
                   = Σ_{i=0}^{j} P(N_{t+s} – Nt = j – i) P(Nt = i)
                   = P(N_{t+s} – Nt = 0) P(Nt = j) + P(N_{t+s} – Nt = 1) P(Nt = j – 1)
                     + Σ_{i=0}^{j–2} P(N_{t+s} – Nt = j – i) P(Nt = i)
                   = (1 – λs) P(Nt = j) + λs P(Nt = j – 1) + O(s²).

Subtracting P(Nt = j ) =: pt (j ) from both sides, dividing by s and taking a limit as s → 0 we obtain

∂t pt (0) = –λpt (0), p0 (0) = 1, (5.2)


∂t pt (j ) = –λpt (j ) + λpt (j – 1), p0 (j ) = 0, j ≥ 1, (5.3)

where we have included the obvious boundary conditions. Thus, we have obtained a sequence of nested
ODEs for (pt (j ))j ≥0 . One can easily verify that (5.1) satisfies (5.2)-(5.3).
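As a sanity check, one can also integrate the nested ODEs (5.2)-(5.3) numerically and compare with (5.1) (a minimal sketch; the rate λ, horizon T, step size and truncation level jmax are arbitrary choices):

    import numpy as np
    from math import exp, factorial

    lam, T, dt, jmax = 2.0, 1.0, 1e-4, 20
    p = np.zeros(jmax + 1)
    p[0] = 1.0                                   # p_0(0) = 1, p_0(j) = 0 for j >= 1

    t = 0.0
    while t < T:                                 # crude forward-Euler integration
        dp = np.empty_like(p)
        dp[0] = -lam * p[0]                      # equation (5.2)
        dp[1:] = -lam * p[1:] + lam * p[:-1]     # equation (5.3)
        p += dt * dp
        t += dt

    for j in range(5):                           # compare with the Poisson pmf (5.1)
        print(j, p[j], (lam * T)**j * exp(-lam * T) / factorial(j))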

For any counting process (whether or not it is Poisson), we may be interested to know the arrival
time of the nth event and the time between the nth and (n + 1)th events.

Definition 5.1.4. Let N = (Nt)t≥0 be a counting process. We define Sn, the nth arrival time, by

    S0 = 0,    Sn = inf{t ≥ 0 : Nt = n},    n ≥ 1,                          (5.4)

with inf{∅} = ∞. We define τn , the nth inter-arrival time, by

τn := Sn – Sn–1 , n ≥ 1. (5.5)

Given the full history of N, we can construct the sequence (τn )n≥1 using (5.4)–(5.5). Alternatively, given
the sequence (τn )n≥1 , we can construct the path of N using
    Sn = Σ_{i=1}^{n} τi,    n ≥ 1,    Nt = Σ_{n=1}^{∞} n 1{Sn ≤ t < S_{n+1}}.    (5.6)

We now undertake this alternative construction.

Theorem 5.1.5. Suppose the inter-arrival times τi (see Definition 5.1.4) of a counting process
N = (Nt )t ≥0 are iid and exponentially distributed with parameter λ. Then N is a Poisson process
with parameter λ.

Proof. We will show that if the inter-arrival times (τi)i≥1 are iid with τi ∼ E(λ), then Nt, given by
(5.6), has probability mass function (5.1). First, we show that Sn, as defined in (5.6), has a
Gamma density

    P(Sn ∈ dx) = fSn(x) dx,    fSn(x) = 1{x≥0} λ e^{–λx} (λx)^{n–1} / (n – 1)!.    (5.7)

Note that S1 = τ1 and fτ(x) = fS1(x) = 1{x≥0} λ e^{–λx}. Thus, (5.7) is correct for n = 1. We now assume
(5.7) holds for n and show that it holds for n + 1. As S_{n+1} = Sn + τ_{n+1}, we have

    fS_{n+1}(x) = ∫∫_{R²} dy dz fSn(y) · fτ(z) · δ(y + z – x)
                = ∫_R dy fSn(y) · fτ(x – y)
                = ∫_R dy 1{y≥0} λ e^{–λy} (λy)^{n–1}/(n – 1)! · 1{x–y≥0} λ e^{–λ(x–y)}
                = 1{x≥0} (λ^{n+1} e^{–λx}/(n – 1)!) ∫_0^x dy y^{n–1} = 1{x≥0} λ e^{–λx} (λx)^n / n!,

which agrees with (5.7). Thus, by induction, (5.7) holds for every n. Using the above result, we now
show that Nt, as defined in (5.6), has probability mass function (5.1). For n ≥ 1 we clearly have

    P(Nt ≥ n) = P(Sn ≤ t) = ∫_0^t dx fSn(x) = ∫_0^t dx λ e^{–λx} (λx)^{n–1}/(n – 1)!.

Using this result, we compute

    P(Nt ≥ n + 1) = ∫_0^t dx λ e^{–λx} (λx)^n / n! = –((λt)^n / n!) e^{–λt} + P(Nt ≥ n),

where we have used integration by parts to obtain the second equality. Thus, we have

    P(Nt = n) = P(Nt ≥ n) – P(Nt ≥ n + 1) = ((λt)^n / n!) e^{–λt},
which agrees with (5.1). To complete the proof, we observe that

P(Nt = 0) = P(S1 > t ) = P(τ1 > t ) = e–λt ,

which shows that (5.1) holds for n = 0.
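Theorem 5.1.5 also yields a simulation recipe: draw iid E(λ) inter-arrival times, form the arrival times Sn, and count arrivals up to time t. A minimal Monte Carlo sketch (the parameters and the cap of 50 draws per path are arbitrary, chosen so that the cap is essentially never binding):

    import numpy as np

    rng = np.random.default_rng(0)
    lam, t, trials = 3.0, 2.0, 100_000

    # For each trial, draw more exponentials than we could plausibly need,
    # form the arrival times S_n, and count those with S_n <= t.
    tau = rng.exponential(1.0 / lam, size=(trials, 50))   # numpy's scale = 1/lambda
    S = np.cumsum(tau, axis=1)
    N_t = (S <= t).sum(axis=1)

    print("empirical mean:", N_t.mean(), " (theory:", lam * t, ")")
    print("empirical var :", N_t.var(), " (theory:", lam * t, ")")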

The inter-arrival times (τi)i≥1 of a Poisson process N are memoryless in the sense that

    P(τi > s + t | τi > s) = P(τi > s + t, τi > s)/P(τi > s) = P(τi > s + t)/P(τi > s)
                           = e^{–λ(t+s)}/e^{–λs} = e^{–λt} = P(τi > t).
The memoryless property of the inter-arrival times is what gives the Poisson process its Markov property.
If we were to construct a counting process whose inter-arrival times were not exponentially distributed, we
would find that this process is not Markov. The reason is that if inter-arrival times are not exponentially
distributed, then the amount of time we have to wait for the next event arrival will depend on how much
time has passed since the last event arrival.

It should be clear from the construction of N that the Poisson process has stationary increments,

    Nt – Ns ∼ N_{t–s},

as well as independent increments, Nt – Ns ⊥⊥ Fs.

5.2 Overview of continuous time Markov chains


We have already defined what we mean by a Markov process in Definition 2.3.6. In this section, we will
derive some results that are specific to continuous time Markov processes that have a countable state
space.

Definition 5.2.1. A continuous time Markov chain X = (Xt )t ≥0 is a Markov process with a countable
state space S.

Recall from (2.4) that a Markov process satisfies P(Xt +s ∈ A|Fs ) = P(Xt +s ∈ A|Xs ). Here, we take
the filtration Fs to be the natural filtration for X. That is, Fs = σ(Xu , 0 ≤ u ≤ s). What the
Markov property means in practice is that the evolution of a Markov chain is described by its one-step
transition probabilities P(Xt +s = j |Xs = i ). In general, such probabilities may depend on the times t , s.
Throughout this Chapter, we will restrict our attention to Markov chains whose transition probabilities
do not depend on s.

Definition 5.2.2. A continuous time Markov chain X is called homogeneous if

P(Xt +s = j |Xs = i ) = P(Xt = j |X0 = i ) =: pt (i , j ) ∀ t , s ∈ R+ , ∀ i , j ∈ S.

Let Pt be the |S| × |S| matrix Pt = (pt (i , j )). We call the collection of matrices (Pt )t ≥0 the transition
semigroup.

Theorem 5.2.3. The transition semigroup (Pt )t ≥0 satisfies the following properties.

1. P0 = I.
2. Σ_j pt(i, j) = 1.
3. Pt Ps = P_{t+s}.

Proof. Item 1 follows from P(X0 = j | X0 = i) = δ_{i,j}. Item 2 follows from Σ_j P(Xt = j | X0 = i) = 1.
Lastly, to show Item 3, note that

    p_{t+s}(i, j) = P(X_{t+s} = j | X0 = i) = Σ_k P(X_{t+s} = j | Xt = k) · P(Xt = k | X0 = i)
                  = Σ_k P(Xs = j | X0 = k) · P(Xt = k | X0 = i) = Σ_k pt(i, k) ps(k, j).

We now construct a continuous-time Markov chain in very much the same manner we constructed the
Poisson process in Definition 5.1.2. The following derivation is purely formal (i.e., not rigorous). Some
of what we write is not true in general, but holds in many practical applications.

Definition 5.2.4. A continuous time Markov chain with generator G = (g(i , j ))i ,j is a stochastic
process satisfying

1. If s < t then Xt – Xs ⊥⊥ FsX := σ(Xu , 0 ≤ u ≤ s).



2. As s → 0 we have P(Xt +s = j |Xt = i ) = δij + g(i , j )s + O(s 2 ).

The essence of Definition 5.2.4 is as follows. In a small time interval [t , t + s) the process X either
remains at its current state Xt = i or it jumps to a new state j 6= i . The probability of remaining at i is
1 + g(i , i )s + O(s 2 ). The probability of jumping to state j is g(i , j )s + O(s 2 ). Since probabilities should
always fall in [0, 1] we must have

g(i , i ) ≤ 0, g(i , j ) ≥ 0, i 6= j .

Moreover, since X should be found somewhere with probability one, we must also have

    1 = Σ_j P(X_{t+s} = j | Xt = i) = Σ_j (δij + g(i, j)s) + O(s²) = 1 + s Σ_j g(i, j) + O(s²)
      ⇒ 0 = Σ_j g(i, j).

Note that item 2 in Definition 5.2.4 can be written more compactly as

    G = lim_{s→0+} (1/s)(Ps – I).

Thus, it is clear that we can obtain G from knowledge of Pt . We can also obtain Pt from G.

Theorem 5.2.5. Let X = (Xt )t ≥0 be a continuous time Markov chain with generator G. The
transition semigroup of X satisfies the following ODEs

    Kolmogorov forward equation:   (d/dt) Pt = Pt G,                        (5.8)
    Kolmogorov backward equation:  (d/dt) Pt = G Pt.                        (5.9)

Proof. To derive the forward equation (5.8) we compute P(X_{t+s} = j | X0 = i) by conditioning on the
value of Xt. We have

    p_{t+s}(i, j) = Σ_k P(X_{t+s} = j | Xt = k) P(Xt = k | X0 = i) = Σ_k pt(i, k) ps(k, j)
                  = pt(i, j)(1 + g(j, j)s) + Σ_{k≠j} pt(i, k) g(k, j)s + O(s²).

Subtracting pt(i, j) from both sides, dividing by s and taking a limit as s goes to zero, we obtain

    lim_{s↘0} (1/s)( p_{t+s}(i, j) – pt(i, j) ) = (d/dt) pt(i, j) = Σ_k pt(i, k) g(k, j),

which is the component-wise representation of (5.8). To derive the backward equation (5.9) we compute
P(X_{t+s} = j | X0 = i) by conditioning on the value of Xs. We have

    p_{t+s}(i, j) = Σ_k P(X_{t+s} = j | Xs = k) P(Xs = k | X0 = i) = Σ_k ps(i, k) pt(k, j)
                  = (1 + g(i, i)s) pt(i, j) + Σ_{k≠i} g(i, k)s pt(k, j) + O(s²).

Subtracting pt(i, j) from both sides, dividing by s and taking a limit as s goes to zero, we obtain

    lim_{s↘0} (1/s)( p_{t+s}(i, j) – pt(i, j) ) = (d/dt) pt(i, j) = Σ_k g(i, k) pt(k, j),

which is the component-wise representation of (5.9).

If Pt solves the forward equation (5.8) then it also solves the backward equation (5.9), and vice versa.
The solution to (5.8) and (5.9), subject to the initial condition P0 = I, is

    Pt = e^{tG} := Σ_{n=0}^{∞} (1/n!) tⁿ Gⁿ.

If G has an eigenvector decomposition G = U⁻¹ Λ U as in Section 4.6, then

    Pt = U⁻¹ ( Σ_{n=0}^{∞} (1/n!)(tΛ)ⁿ ) U = U⁻¹ e^{tΛ} U.                  (5.10)

Example 5.2.6. Let the generator G of a continuous time Markov chain X be given by

    G = ( –α    α )
        (  β   –β ),    α, β > 0.                                           (5.11)

We wish to find the semigroup Pt generated by G. We note that G has left eigenvectors and eigenvalues
given by

    λ0 = 0,  e0 = (β, α)/√(α² + β²),    λ1 = –α – β,  e1 = (1, –1)/√2.

Thus, from (5.10) we have Pt = U⁻¹ e^{tΛ} U, where U is the matrix with rows e0 and e1 and
Λ = diag(λ0, λ1). Carrying out the matrix multiplication yields

    Pt = (1/(α + β)) ( β + α e^{–(α+β)t}    α – α e^{–(α+β)t} )
                     ( β – β e^{–(α+β)t}    α + β e^{–(α+β)t} ).            (5.12)
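One can confirm (5.12) against a numerically computed matrix exponential (a minimal sketch; the values of α, β and t are arbitrary):

    import numpy as np
    from scipy.linalg import expm

    alpha, beta, t = 1.5, 0.5, 0.7
    G = np.array([[-alpha, alpha],
                  [beta, -beta]])

    # Closed form (5.12).
    e = np.exp(-(alpha + beta) * t)
    Pt_exact = np.array([[beta + alpha * e, alpha - alpha * e],
                         [beta - beta * e, alpha + beta * e]]) / (alpha + beta)

    assert np.allclose(expm(t * G), Pt_exact)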

For a Poisson process with intensity λ, we found that we could construct N from a sequence of iid
inter-arrival times (τi )i ≥1 that were exponentially distributed: τi ∼ E(λ). A continuous time Markov
chain can be constructed in a similar manner.

Let G = (g(i, j))_{i,j∈S} be the generator of a continuous time Markov chain. Consider a discrete time
Markov chain Y = (Yn)_{n∈N0} with one-step transition matrix P = (p(i, j))_{i,j∈S}. Suppose the entries of
P are given by

    p(i, j) = –g(i, j)/g(i, i),  i ≠ j,    p(i, i) = 0.

Observe that p(i, j) ≥ 0 for all i, j ∈ S and Σ_j p(i, j) = 1 for all i, as required. Now, let (τn)_{n∈N} be a sequence of
independent random variables satisfying

    τ_{n+1} | Yn ∼ E(–g(Yn, Yn)),    n ∈ N0,

so that the holding time after the nth jump is exponential with the rate attached to the current state.

Let (Sn)_{n∈N0} be given by

    Sn := Σ_{i=1}^{n} τi,    S0 := 0.

We can construct a continuous time process X = (Xt)t≥0 as follows:

    Xt = Σ_{n=0}^{∞} Yn 1{Sn ≤ t < S_{n+1}}.

We claim that the process X is a continuous time Markov chain with generator G. We call (Sn)_{n∈N0} the
jump times and (τn)_{n∈N} the inter-jump times. Compare this construction of a Markov chain to the
second construction we gave for a Poisson process (see Theorem 5.1.5 and the text preceding it).

The above construction makes clear what the dynamics of a CTMC X with generator G are. Specifically,
once X jumps into state i , the amount of time it remains in that state is exponentially distributed with
parameter –g(i , i ). When a jump occurs, the probability that X jumps from i to j is –g(i , j )/g(i , i ).

To see how this construction is equivalent to Definition 5.2.4, note that, for small s, the probability of
two jumps in the time interval [0, s) is O(s²) because

    P(S2 ≤ s) ≤ P(τ1 ≤ s) P(τ2 ≤ s) ≤ (1 – e^{–ḡs})(1 – e^{–ḡs}) = (ḡs + O(s²))² = O(s²),

where we have defined ḡ := sup_{i∈S} {–g(i, i)} (assumed finite). As the probability of two jumps is O(s²), for small s we have

    ps(i, i) = P(Xs = i | X0 = i) = P(τ1 > s | X0 = i) + O(s²)
             = exp(g(i, i)s) + O(s²) = 1 + g(i, i)s + O(s²),

    ps(i, j) = P(Xs = j | X0 = i) = P(τ1 ≤ s | X0 = i) · P(X_{τ1} = j | X0 = i) + O(s²)
             = (1 – exp(g(i, i)s)) · g(i, j)/(–g(i, i)) + O(s²)
             = –g(i, i)s · g(i, j)/(–g(i, i)) + O(s²) = g(i, j)s + O(s²),
which agrees with Item 2 in Definition 5.2.4. The independent increments property Xt – Xs ⊥⊥ Xs follows
from the memoryless property of the exponentially distributed waiting times.

For discrete time Markov chains, we found that µn+m = µn Pm . We defined an invariant distribution
π as any probability mass function that satisfied π = πP, which implied that π = πPn for all n. In
continuous time, we have the following analogs

µt +s = µt Ps , ∀ t , s ≥ 0,
π = πPt , ∀ t ≥ 0.

It is not always easy to find an invariant distribution π from Pt . However, π can also be obtained from
the generator G.

Theorem 5.2.7. Let X be a continuous time Markov chain with generator G and let Pt be the
semigroup generated by G. Then

    π = πPt for all t ≥ 0  ⇔  πG = 0.

Proof. Recalling that G⁰ ≡ I, we have

    πG = 0  ⇔  πGⁿ = 0, ∀ n ≥ 1,
            ⇔  Σ_{n=1}^{∞} (1/n!) tⁿ πGⁿ = 0
            ⇔  π Σ_{n=0}^{∞} (1/n!) tⁿ Gⁿ = π
            ⇔  πPt = π.

Example 5.2.8. Let us return to Example 5.2.6 and find the stationary distribution of the Markov
process X with generator G given by (5.11). We have

    0 = (π(0), π(1)) ( –α  α ; β  –β ),   1 = π(0) + π(1),   π(i) ≥ 0   ⇒   π = ( β/(α + β), α/(α + β) ).

Note from (5.12) that, for any µ0,

    lim_{t→∞} µt = µ0 lim_{t→∞} Pt = (µ0(0), µ0(1)) · (1/(α + β)) ( β  α ; β  α ) = π,

where we have used µ0(0) + µ0(1) = 1.

As in the discrete time case, for a given Markov chain X a stationary distribution π may not exist. If
a stationary distribution does exist, then it may not be unique. Given the existence of a stationary
distribution, the condition for uniqueness is that

for all i , j ∈ S, there exists t > 0 such that pt (i , j ) > 0. (5.13)

If (5.13) is satisfied, then we say the chain X is irreducible.

Example 5.2.9. Consider a continuous time Markov chain X with generator

    G = ( –λ0     λ0           0            0            0    . . . )
        (  µ1    –(µ1 + λ1)    λ1           0            0    . . . )
        (  0      µ2          –(µ2 + λ2)    λ2           0    . . . )
        (  0      0            µ3          –(µ3 + λ3)    λ3   . . . )
        (  ...                                                      ).

The chain X is called a birth-death process since, when Xt = i, it either jumps up to i + 1 (i.e., a birth)
or jumps down to i – 1 (i.e., a death). The waiting time τ in state i is exponentially distributed,
τ ∼ E(λi + µi). When the process does jump out of state i, the probability of an up-jump is λi/(λi + µi)
and the probability of a down-jump is µi/(λi + µi). If λ0 > 0 then the chain is irreducible. Let us see if
we can find the stationary distribution π. We seek π such that πG = 0, which leads to

0 = –λ0 π(0) + µ1 π(1),


0 = λn–1 π(n – 1) – (µn + λn )π(n) + µn+1 π(n + 1), n ≥ 1.

One can check by direct substitution that

    π(n) = ( λ0 λ1 · · · λ_{n–1} ) / ( µ1 µ2 · · · µn ) · π(0).
In order for π to be a distribution, the following normalization condition must be satisfied:

    1 = Σ_{n=0}^{∞} π(n) = π(0) + π(0) Σ_{n=1}^{∞} (λ0 λ1 · · · λ_{n–1})/(µ1 µ2 · · · µn)
      ⇒ π(0) = ( 1 + Σ_{n=1}^{∞} (λ0 λ1 · · · λ_{n–1})/(µ1 µ2 · · · µn) )⁻¹.

Clearly, we can only normalize if

    Σ_{n=1}^{∞} (λ0 λ1 · · · λ_{n–1})/(µ1 µ2 · · · µn) < ∞.
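The product formula and the normalization condition are straightforward to evaluate numerically once the rate sequences are given. A minimal sketch (the rates below, λn = 2 and µn = n, are an arbitrary example; the series is truncated at a level where the remaining terms are negligible):

    import numpy as np

    def birth_death_pi(lam, mu, nmax):
        """pi(n) = (lam_0...lam_{n-1})/(mu_1...mu_n) pi(0), truncated at nmax."""
        w = np.ones(nmax + 1)
        for n in range(1, nmax + 1):
            w[n] = w[n - 1] * lam(n - 1) / mu(n)
        return w / w.sum()                      # normalization fixes pi(0)

    # Hypothetical rates lam_n = 2, mu_n = n (an M/M/infinity-type chain);
    # the resulting pi is Poisson(2).
    pi = birth_death_pi(lambda n: 2.0, lambda n: float(n), nmax=100)
    print(pi[:5])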

While a stationary distribution π can often be computed with relative ease, it is almost always difficult
(or impossible) to compute µt explicitly. Nevertheless, we can still obtain moments EX_t^n if we can
compute the generating function G_{Xt}(s) := E s^{Xt}.

Example 5.2.10. Consider the birth-death process given in Example 5.2.9 and assume

    λn = nλ,    µn = nµ,    λ, µ > 0.

Note that the state Xt = 0 is an absorbing state, since λ0 = 0. Assume X0 = i ≥ 1. The Kolmogorov
forward equations (5.8) become

    (d/dt) pt(i, 0) = µ pt(i, 1),
    (d/dt) pt(i, j) = λ(j – 1) pt(i, j – 1) – (λ + µ)j pt(i, j) + µ(j + 1) pt(i, j + 1),    j ≥ 1.
Multiplying the jth equation by s^j and summing all of the equations, we obtain

    ∂G_{Xt}/∂t = λs² ∂G_{Xt}/∂s – (λ + µ)s ∂G_{Xt}/∂s + µ ∂G_{Xt}/∂s,       (5.14)

where we have used

    G_{Xt}(s) := E[s^{Xt} | X0 = i] = Σ_{j=0}^{∞} s^j pt(i, j),

from which it follows that

    ∂G_{Xt}(s)/∂t = Σ_{j=0}^{∞} s^j ∂pt(i, j)/∂t,    ∂G_{Xt}(s)/∂s = Σ_{j=1}^{∞} j s^{j–1} pt(i, j).
One can check by direct substitution that the solution to PDE (5.14) with initial condition G_{X0}(s) = s^i
is given by

    G_{Xt}(s) = ( ( µ(1 – s) – (µ – λs)e^{–(λ–µ)t} ) / ( λ(1 – s) – (µ – λs)e^{–(λ–µ)t} ) )^i,    µ ≠ λ.

From the generating function we can compute the first two moments of Xt easily. We have

    EXt = i e^{(λ–µ)t},    VXt = i ((λ + µ)/(λ – µ)) e^{(λ–µ)t} ( e^{(λ–µ)t} – 1 ),    µ ≠ λ.

5.3 Reversibility
In Section 4.5, we introduced a notion of reversibility for discrete time Markov chains. In this section,
we will discuss reversibility of continuous time Markov chains. As we shall see, most of the definitions
and theorems discussed below are direct analogs of the definitions and theorems given in Section 4.5.

Definition 5.3.1. Let X = (Xt)_{0≤t≤T} be a continuous time Markov chain. The time reversal of X is
the process Y = (Yt)_{0≤t≤T} defined as follows:

    Yt := lim_{δ↘0} X_{T–t+δ} = X_{(T–t)+}.

Compare Definition 5.3.1 for the time reversal of a continuous time Markov chain with Definition 4.5.1
for the time reversal of a discrete time Markov Chain. Notice the similarities.

Theorem 5.3.2. Let X = (Xt )0≤t ≤T be an irreducible continuous time Markov chain with generator
G, transition semigroup Pt = et G and invariant distribution π and suppose X0 ∼ π. Let
Y = (Yt )0≤t ≤T be the time reversal of X. Then Y is a continuous time Markov chain with
generator H and transition semigroup Qt = et H , which satisfy
    h(i, j) = (π(j)/π(i)) g(j, i),    qt(i, j) = (π(j)/π(i)) pt(j, i).      (5.15)
Rather than prove Theorem 5.3.2, we will explain intuitively why it is true. Fix a large N ∈ N, define
δ := T/N ≪ 1 and consider the discrete-time process X̂ = (X̂n)_{0≤n≤N}, where X̂n := X_{nδ}. The process
X̂ is a discrete time Markov chain with transition probabilities p̂(i, j) := P(X̂_{n+1} = j | X̂n = i) given by

    p̂(i, j) = δ_{i,j} + g(i, j)δ + O(δ²).                                   (5.16)

Now, let Ŷ = (Ŷn)_{0≤n≤N} be the time-reversal of X̂. That is, Ŷn := X̂_{N–n} = X_{T–nδ} = Y_{nδ}. We know from
Theorem 4.5.2 that Ŷ is a Markov process with transition probabilities q̂(i, j) := P(Ŷ_{n+1} = j | Ŷn = i)
given by

    q̂(i, j) = (π(j)/π(i)) p̂(j, i).                                         (5.17)

If we accept that Y, like its discrete-time counterpart Ŷ, is a Markov process, then we must have

    q̂(i, j) = δ_{i,j} + h(i, j)δ + O(δ²)                                    (5.18)

for some generator matrix H = (h(i, j))_{i,j∈S}. Inserting (5.16) and (5.18) into (5.17) we obtain

    δ_{i,j} + h(i, j)δ = (π(j)/π(i)) ( δ_{j,i} + g(j, i)δ ) + O(δ²).

The terms of order O(1) cancel (you can check that they are equal both for i = j and for i ≠ j). In order
for the O(δ) terms to match, we must have

    h(i, j) = (π(j)/π(i)) g(j, i),

which is the first of the two equations in (5.15). Now, note that

π(i )h(i , j ) = π(j )g(j , i ) ⇒ π(i )(Hn )ij = π(j )(Gn )ji , ∀n ∈ N0 ,

which implies

π(i )(et H )ij = π(j )(et G )ji , ∀ t ∈ [0, T].

The right-hand side of (5.15) follows from (et G )ij = (Pt )i ,j = pt (i , j ) and (et H )ij = (Qt )ij = qt (i , j ).

Definition 5.3.3. Let X = (Xt)_{0≤t≤T} be an irreducible continuous-time Markov chain with transition
semigroup Pt = e^{tG} and invariant distribution π. Suppose X0 ∼ π. Let Y be the time reversal of X
and denote by Qt = e^{tH} its transition semigroup. We say that X is reversible if Pt = Qt for all t ∈ [0, T], or
equivalently (by Theorem 5.3.2), if either of the following holds:

    π(i)g(i, j) = π(j)g(j, i),    or    π(i)pt(i, j) = π(j)pt(j, i), ∀ t ∈ [0, T].

Once again, it is instructive to compare Definition 5.3.3 with its discrete time analog Definition 4.5.3.

Definition 5.3.4. Let Pt = e^{tG} be the transition semigroup of a continuous time Markov process X with
state space S. We say that a distribution λ = (λ(1), λ(2), . . . , λ(|S|)) is in detailed balance with Pt or G
if

λ(i )g(i , j ) = λ(j )g(j , i ), or λ(i )pt (i , j ) = λ(j )pt (j , i ), ∀ t ∈ [0, T]. (5.19)

Theorem 5.3.5. Let Pt = e^{tG} be the transition semigroup of an irreducible Markov chain X. Suppose
there exists a distribution λ such that (5.19) holds. Then λ is a stationary distribution for X. That
is, λG = 0 and λ = λPt for all t.

Proof. Assuming (5.19) holds, we have

    Σ_i λ(i) pt(i, j) = Σ_i λ(j) pt(j, i) = λ(j) Σ_i pt(j, i) = λ(j).

Thus, we have proved λ = λPt, and thus λ is a stationary distribution for X.

5.3.1 Potential and entropy production


Definition 5.3.6. The generator G of a continuous time Markov chain X is said to satisfy the Kolmogorov
cycle condition if, for any finite and feasible cycle of states (s0 , s1 , . . . , sn , s0 ) we have
g(s0 , s1 )g(s1 , s2 ) . . . g(sn–1 , sn )g(sn , s0 )
= 1. (5.20)
g(s1 , s0 )g(s2 , s1 ) . . . g(sn , sn–1 )g(s0 , sn )
Here, “feasible” means that X can jump directly from sk to sk +1 for all k and also from sn to s0 .

Theorem 5.3.7. Let X = (Xt)_{0≤t≤T} be an irreducible continuous-time Markov chain with state
space S, transition semigroup Pt = e^{tG} and stationary distribution π. Then

    π is in detailed balance with G ⇔ G satisfies the Kolmogorov cycle condition.

Proof. (⇒) Suppose π is in detailed balance with G, so that π(i)g(i, j) = π(j)g(j, i). Thus

    1 = ( π(s0)g(s0, s1) π(s1)g(s1, s2) · · · π(s_{n–1})g(s_{n–1}, sn) π(sn)g(sn, s0) )
        / ( π(s1)g(s1, s0) π(s2)g(s2, s1) · · · π(sn)g(sn, s_{n–1}) π(s0)g(s0, sn) )
      = ( g(s0, s1) g(s1, s2) · · · g(s_{n–1}, sn) g(sn, s0) )
        / ( g(s1, s0) g(s2, s1) · · · g(sn, s_{n–1}) g(s0, sn) ).

(⇐) The Kolmogorov cycle condition (5.20) is equivalent to a conservative vector field:

    0 = Σ_{cycle {sk}} log( g(sk, s_{k+1}) / g(s_{k+1}, sk) ).

Therefore, we can define a potential function U : S → R relative to a reference state s0 ∈ S via

    U(sj) := Σ_{k=0}^{j–1} log( g(sk, s_{k+1}) / g(s_{k+1}, sk) ).

From the potential function U(s), we can construct a distribution λ = (λ(1), λ(2), . . . , λ(|S|)) as follows:

    λ(s) := A e^{U(s)},    A = ( Σ_{sk∈S} e^{U(sk)} )⁻¹.

Note that A is just a normalization factor. It is easy to check that λ(i) and λ(j) satisfy

    λ(i)g(i, j) = λ(j)g(j, i)

for any states i and j. Therefore, by Definition 5.3.4, λ is in detailed balance with G. Additionally,
by Theorem 5.3.5, λ is a stationary distribution for X. As X is irreducible, the stationary distribution
is unique, which implies λ = π. Therefore, π is in detailed balance with G.

For any distributions µ and ν on some countable state space S we can define

    Entropy:            H[µ] := – Σ_{i∈S} µ(i) log µ(i),
    Relative Entropy:   K[µ||ν] := Σ_{i∈S} µ(i) log( µ(i)/ν(i) ).

Relative Entropy is sometimes called Kullback–Leibler divergence.
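In code these definitions are one-liners (a minimal sketch, using the usual convention 0 log 0 = 0 and assuming ν(i) > 0 wherever µ(i) > 0):

    import numpy as np

    def entropy(mu):
        mu = np.asarray(mu, dtype=float)
        nz = mu > 0                              # convention: 0 log 0 = 0
        return -np.sum(mu[nz] * np.log(mu[nz]))  # H[mu]

    def relative_entropy(mu, nu):
        mu, nu = np.asarray(mu, float), np.asarray(nu, float)
        nz = mu > 0
        return np.sum(mu[nz] * np.log(mu[nz] / nu[nz]))  # K[mu || nu]

    print(entropy([0.5, 0.5]))                   # log 2
    print(relative_entropy([0.9, 0.1], [0.5, 0.5]))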



Consider a physical system with N ≫ 1 particles, each of which must be in some state s ∈ S. Let us
define

    µt(i) := #{particles in state i at time t} / N.

Then µt is a distribution on S. Suppose this system has an invariant distribution π. We can compute
the relative entropy of µt with respect to π. We have

    K[µt||π] = –( – Σ_{i∈S} µt(i) log µt(i) ) + ( – Σ_{i∈S} µt(i) log π(i) ),

where, in the language of physics, the first term in parentheses is the entropy of µt, the second term in
parentheses is the mean energy of µt, and K[µt||π] itself is the free energy of µt.

Theorem 5.3.8. Let X be an irreducible Markov chain with transition semigroup Pt = e^{tG} and
invariant distribution π. Suppose π is in detailed balance with G. Then we have

    (d/dt) K[µt||π] ≤ 0.

That is, the relative entropy K[µt||π] is non-increasing.

Proof. Suppose X0 ∼ δi (i.e., all mass at i). Then µt(j) = pt(i, j) and we have

    (d/dt) K[µt||π]
    = (d/dt) Σ_{j∈S} µt(j) log( µt(j)/π(j) )
    = Σ_{j∈S} ( 1 + log( µt(j)/π(j) ) ) (d/dt) µt(j)
    = Σ_{j∈S} ( 1 + log( pt(i, j)/π(j) ) ) (d/dt) pt(i, j)
    = (d/dt) Σ_{j∈S} pt(i, j) + Σ_{j∈S} log( pt(i, j)/π(j) ) (d/dt) pt(i, j)
    = Σ_{j∈S} log( pt(i, j)/π(j) ) (d/dt) pt(i, j)                          (as Σ_{j∈S} pt(i, j) = 1)
    = Σ_{j∈S} log( pt(i, j)/π(j) ) Σ_{k∈S} pt(i, k) g(k, j)                 (by the KFE)
    = Σ_{j∈S} Σ_{k∈S} log( pt(i, j)/π(j) ) pt(i, k) g(k, j)
      + Σ_{k∈S} log( π(k)/pt(i, k) ) pt(i, k) Σ_{j∈S} g(k, j)               (added term is zero, since Σ_{j∈S} g(k, j) = 0)
    = Σ_{j∈S} Σ_{k∈S} log( pt(i, j)π(k) / (π(j)pt(i, k)) ) pt(i, k) g(k, j)          (algebra)
    = Σ_{j∈S} Σ_{k∈S} log( pt(i, j)g(j, k) / (pt(i, k)g(k, j)) ) pt(i, k) g(k, j)    (π is in detailed balance with G)
    = (1/2) Σ_{j∈S} Σ_{k∈S} log( pt(i, j)g(j, k) / (pt(i, k)g(k, j)) )
            × ( pt(i, k)g(k, j) – pt(i, j)g(j, k) )                         (symmetrizing in j and k)
    = –(1/2) Σ_{j∈S} Σ_{k∈S} [ log( pt(i, j)g(j, k) / (pt(i, k)g(k, j)) ) ] [ pt(i, j)g(j, k) – pt(i, k)g(k, j) ]
             (call the first bracketed factor A and the second B)
    =: f(i) ≤ 0,                                                            (as A and B have the same sign)

Thus, for a general starting condition X0 ∼ µ0 we have

    (d/dt) K[µt||π] = Σ_{i∈S} µ0(i) f(i) ≤ 0.

In statistical physics, the term A is called the thermodynamic force, the term B is called the
thermodynamic flux, the product A·B is called the entropy production or free energy dissipation, and the
result (d/dt) K[µt||π] ≤ 0 is the Second Law of Thermodynamics.
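Theorem 5.3.8 is easy to observe numerically for the two-state chain of Example 5.2.6, whose stationary distribution is in detailed balance with G (a minimal sketch; the rates and time grid are arbitrary):

    import numpy as np
    from scipy.linalg import expm

    alpha, beta = 1.5, 0.5
    G = np.array([[-alpha, alpha], [beta, -beta]])
    pi = np.array([beta, alpha]) / (alpha + beta)    # stationary distribution
    mu0 = np.array([1.0, 0.0])                       # start with all mass in state 0

    def kl(mu, nu):
        nz = mu > 0
        return np.sum(mu[nz] * np.log(mu[nz] / nu[nz]))

    prev = np.inf
    for t in np.linspace(0.0, 5.0, 21):
        k = kl(mu0 @ expm(t * G), pi)                # K[mu_t || pi]
        assert k <= prev + 1e-12                     # non-increasing in t
        prev = k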

5.4 Exercises
Exercise 5.1. Patients arrive at an emergency room as a Poisson process with intensity λ. The time to
treat each patient is an independent exponential random variable with parameter µ. Let X = (Xt )t ≥0 be
the number of patients in the system (either being treated or waiting). Write down the generator of X.

Show that X has an invariant distribution π if and only if λ < µ. Find π. What is the total expected
time (waiting + treatment) a patient waits when the system is in its invariant distribution?

Exercise 5.2. Let X = (Xt )t ≥0 be a Markov chain with stationary distribution π. Let N be an
independent Poisson process with intensity λ and denote by τn the time of the nth arrival of N. Define
Yn := Xτn + (i.e., Yn is the value of X immediately after the nth jump). Show that Y is a discrete time
Markov chain with the same stationary distribution as X.

Exercise 5.3. Let X = (Xt)t≥0 be a Markov chain with state space S = {0, 1, 2, . . .} and with a generator
G whose ith row has entries

    g_{i,i–1} = iµ,    g_{i,i} = –iµ – λ,    g_{i,i+1} = λ,

with all other entries being zero (the zeroth row has only two entries: g_{0,0} and g_{0,1}). Assume X0 = j.
Find G_{Xt}(s) := E s^{Xt}. What is the distribution of Xt as t → ∞?

Exercise 5.4. Let N be a time-inhomogeneous Poisson process with intensity function λ(t). That is,
the probability of a jump of size one in the time interval (t, t + dt) is λ(t)dt and the probability of two
jumps in that interval of time is O(dt²). Write down the Kolmogorov forward and backward equations of
N and solve them. Let N0 = 0 and let τ1 be the time of the first jump of N. If λ(t) = c/(1 + t), show
that Eτ1 < ∞ if and only if c > 1.

Exercise 5.5. Let N be a Poisson process with a random intensity Λ, which is equal to λ1 with probability
p and λ2 with probability 1 – p. Find G_{Nt}(s) = E s^{Nt}. What is the mean and variance of Nt?

Exercise 5.6. Let X = (Xt)_{0≤t≤T} be an irreducible continuous-time Markov chain with state space S,
transition semigroup Pt = e^{tG} and stationary distribution π. Prove the following six statements are
equivalent:
(i) Its stationary distribution satisfies detailed balance: π(i)g(i, j) = π(j)g(j, i), ∀ i, j ∈ S.
(ii) Any path connecting states i and j: i ≡ s0, s1, s2, . . . , sn ≡ j, has a path-independent sum

    log( g(s0, s1)/g(s1, s0) ) + log( g(s1, s2)/g(s2, s1) ) + · · · + log( g(s_{n–1}, sn)/g(sn, s_{n–1}) )
    = log π(sn) – log π(s0).

(iii) It defines a time reversible stationary Markov process.
(iv) Its G matrix satisfies the Kolmogorov cycle condition for every sequence of states.
(v) There exists a positive diagonal matrix Π such that the matrix GΠ is symmetric.
(vi) Its stationary process has zero entropy production rate, where the entropy production rate ep(µt) is
defined by

    ep(µt) = (1/2) Σ_{i,j} ( µt(i)g(i, j) – µt(j)g(j, i) ) log( µt(i)g(i, j) / (µt(j)g(j, i)) ).
Chapter 6

Convergence of random variables

The notes from this chapter are primarily taken from (Grimmett and Stirzaker, 2001, Chapter 7). The
goals of this chapter are (i) to introduce various modes of convergence of random variables and (ii) to
establish when convergence of one mode implies convergence of another mode.

6.1 A motivating example


We begin this chapter with a motivating example.

Example 6.1.1. Consider the infinite sequence of coin tosses described in Section 1.3.2. Suppose you
have an initial wealth X0 . Just before the n th coin toss, you bet your entire wealth on the outcome of
the n th toss being a heads. It is easy to see that, at time n your wealth is given by

    Xn(ω) = 2ⁿ X0 if ωi = H for i = 1, 2, . . . , n, and Xn(ω) = 0 otherwise.

Suppose the probability of a heads is p ∈ (0, 1). As each of the coin tosses are independent, we deduce
that

P(Xn = 2n X0 ) = p n , P(Xn = 0) = 1 – p n .

As pⁿ → 0 when n → ∞, we see that

    lim_{n→∞} P(Xn = 0) = lim_{n→∞} (1 – pⁿ) = 1.

Thus, it seems reasonable to say Xn → X∞ where X∞(ω) := 0. Note that

    lim_{n→∞} P(|Xn – X∞| > ε) = 0, ∀ ε > 0.                                (6.1)


From this computation one might expect that E|Xn – X∞| → 0 as n → ∞. However, this is not correct,
as

    E|Xn – X∞| = EXn = 2ⁿ X0 · pⁿ  →  ∞ if p ∈ (1/2, 1);  X0 if p = 1/2;  0 if p ∈ (0, 1/2).

Thus, we see that (6.1) does not imply E|Xn – X∞ | → 0. This simple example forces us to consider more
carefully what we mean by “convergence” of a random variable.

6.2 Modes of convergence defined


There are many (many!) ways to define convergence of a sequence of random variables. The most
important of these modes are the four we define here

Definition 6.2.1. Consider a sequence of random variables (Xn)n≥0 defined on a probability space
(Ω, F, P). We say that Xn converges to X∞ almost surely, written Xn →a.s. X∞, if

    P( lim_{n→∞} |Xn – X∞| = 0 ) = 1.

We say that Xn converges to X∞ in Lp, written Xn →Lp X∞, if E|Xn|^p < ∞ for all n and

    lim_{n→∞} E|Xn – X∞|^p = 0.                                             (6.2)

We say that Xn converges to X∞ in probability, written Xn →P X∞, if

    lim_{n→∞} P(|Xn – X∞| > ε) = 0, ∀ ε > 0.

And we say that Xn converges to X∞ in distribution, written Xn →D X∞, if

    lim_{n→∞} |F_{Xn}(x) – F_{X∞}(x)| = 0

for all points x ∈ R at which F_{X∞} is continuous.

It will be helpful to comment a bit more on the above definitions. Observe that almost sure convergence
does not require that Xn(ω) → X∞(ω) for every ω ∈ Ω. It may be helpful to define

    A := {ω : lim_{n→∞} |Xn(ω) – X∞(ω)| = 0}.

Almost sure convergence simply asks that P(A) = 1; it does not ask that A = Ω. There may be infinitely
many ω ∈ A^c (uncountably infinitely many, in fact). But, if Xn →a.s. X∞ then P(A^c) = 0. At times, it is
helpful to write Xn → X∞ P-a.s. to indicate that Xn converges to X∞ almost surely under a
specific probability measure P. Note that Xn → X∞ P-a.s. does not imply that Xn → X∞ P̃-a.s.
However, if two measures are equivalent, P ∼ P̃, then almost sure convergence under P implies almost
sure convergence under P̃ and vice-versa, because P(A) = 1 ⇔ P̃(A) = 1. Finally, we mention that the
phrases convergence with probability one and convergence almost everywhere (a.e.) are sometimes
used to indicate almost sure convergence.

For any p ≥ 1 we can define Lp(Ω, F, P), the space of random variables X : Ω → R with a finite pth
moment: E|X|^p < ∞. We can define the p-norm ||·||p of a random variable X ∈ Lp(Ω, F, P) as

    ||X||p := ( E|X|^p )^{1/p}.

Then, from (6.2) we see that convergence in Lp is equivalent to convergence in the p-norm:

    lim_{n→∞} ||Xn – X∞||p = 0.

As with almost sure convergence, it is sometimes necessary to specify a particular probability measure
under which Lp convergence occurs by writing Xn → X∞ in Lp (P) or, even more specifically, Lp (Ω, F, P).
It is common to say Xn → X∞ in mean and in mean square when Xn → X∞ in L1 and L2 , respectively.

Convergence in probability is based on the following intuition: two random variables X and Y are "close
to each other" if there is a low probability that their difference is large: P(|X – Y| > ε) ≈ 0.
In some sense, the "distance" between X and Y is measured by dε(X, Y) := P(|X – Y| > ε). Convergence
in probability simply asks that this measure of distance goes to zero for any ε > 0 as n → ∞. That is,
lim_{n→∞} dε(Xn, X∞) = 0.

Note that we have already defined convergence in distribution (see Definition 3.4.1). We have simply
repeated the definition here for convenience. In a sense, convergence in distribution is the weakest of
the four modes of convergence, as the definition makes no reference to the underlying probability space
(Ω, F, P). Indeed, suppose Z ∼ N(0, 1) and Xn = –Z for all n. Then it is clear that F_{Xn} = F_Z for all n.
But, it is easy to see that Xn does not converge to Z in probability, in Lp or almost surely. Convergence
in distribution is sometimes alternatively called weak convergence, written Xn →w X∞, or convergence
in law, written Xn →L X∞.

Sometimes, we may wish to see if a sequence of random variables (Xn) converges (in some sense) to a
random variable X∞, but we do not know what the random variable X∞ is. In such cases, it can be
helpful to check for Cauchy convergence of some sort.

Definition 6.2.2. Consider a sequence of random variables (Xn)n≥0 defined on a probability space
(Ω, F, P). We say that (Xn) is Cauchy convergent almost surely if

    P( lim_{n,m→∞} |Xn – Xm| = 0 ) = 1.

We say that (Xn) is Cauchy convergent in Lp if E|Xn|^p < ∞ for all n and

    lim_{n,m→∞} E|Xn – Xm|^p = 0.

And we say that (Xn) is Cauchy convergent in probability if

    lim_{n,m→∞} P(|Xn – Xm| > ε) = 0, ∀ ε > 0.

We will not prove the following Theorem.

Theorem 6.2.3. Consider a sequence of random variables (Xn) defined on a probability space
(Ω, F, P). Then:

    ∃ X∞ s.t. Xn →a.s. X∞  ⇔  (Xn) is Cauchy convergent almost surely,
    ∃ X∞ s.t. Xn →Lp X∞  ⇔  (Xn) is Cauchy convergent in Lp,
    ∃ X∞ s.t. Xn →P X∞  ⇔  (Xn) is Cauchy convergent in probability,

where in each case X∞ is a random variable defined on the same probability space.

The usefulness of Theorem 6.2.3 is that it allows us to establish the existence of a limit X∞ of a sequence
of random variables (Xn ) (in some sense) without having knowledge of what X∞ is.
In Example 6.1.1 we saw that the wealth process (Xn) converged in probability to zero: Xn →P 0. Suppose
P was such that P(ωi = H) = p ∈ (0, 1/2). Then the wealth process (Xn) also converged in mean to
zero: Xn → 0 in L1(P). However, if P̃(ωi = H) = p̃ ∈ [1/2, 1), then (Xn) did not converge in L1(P̃). It
would be very helpful to know when one mode of convergence implies convergence in another mode. This
is the subject of the next section.

6.3 Convergence in one mode implies convergence in another?


Let us get right to it and state the main result.

Theorem 6.3.1. Let (Xn) be a sequence of random variables defined on (Ω, F, P) and let X∞ be a
random variable on the same probability space. Suppose 1 ≤ q ≤ p. Then the following implications
hold:

    Xn →a.s. X∞                     }
                                    }  ⇒  Xn →P X∞  ⇒  Xn →D X∞.
    Xn →Lp X∞  ⇒  Xn →Lq X∞         }

No other implications hold in general.

Note: the right bracket “}” indicates that convergence in probability holds if Xn converges to X∞ either
in Lq or P-a.s.. One does not need for Xn to converge to X∞ in both Lq and P-a.s. in order for
convergence in probability to hold. We will prove Theorem 6.3.1 in a series of steps. Along the way, we
will introduce some related (and useful!) inequalities.
Proof that Xn →P X∞ ⇒ Xn →D X∞. First we note that, if ε > 0, then

    F_{Xn}(x) = P(Xn ≤ x)
              = P(Xn ≤ x, X∞ ≤ x + ε) + P(Xn ≤ x, X∞ > x + ε)
              ≤ P(X∞ ≤ x + ε) + P(|Xn – X∞| > ε)
              = F_{X∞}(x + ε) + P(|Xn – X∞| > ε).                           (6.3)

If you find the above inequality difficult to derive, it may help to draw a picture of the regions in the Xn-X∞
plane to which the above probabilities correspond. Likewise, we have

FX∞ (x – ε) = P(X∞ ≤ x – ε)
= P(X∞ ≤ x – ε, Xn ≤ x ) + P(X∞ ≤ x – ε, Xn > x )
≤ P(Xn ≤ x ) + P(|Xn – X∞ | > ε)
= FXn (x ) + P(|Xn – X∞ | > ε). (6.4)

Combining (6.3) with (6.4) we obtain

FX∞ (x – ε) – P(|Xn – X∞ | > ε) ≤ FXn (x ) ≤ FX∞ (x + ε) + P(|Xn – X∞ | > ε).


As Xn →P X∞ by assumption, we have that lim_{n→∞} P(|Xn – X∞| > ε) = 0. Thus

    F_{X∞}(x – ε) ≤ liminf_{n→∞} F_{Xn}(x) ≤ limsup_{n→∞} F_{Xn}(x) ≤ F_{X∞}(x + ε).
Finally, if F_{X∞}(x) is continuous at x, then lim_{ε→0} F_{X∞}(x ± ε) = F_{X∞}(x). Therefore, letting ε → 0 we
obtain

    F_{X∞}(x) ≤ liminf_{n→∞} F_{Xn}(x) ≤ limsup_{n→∞} F_{Xn}(x) ≤ F_{X∞}(x),

which implies that limn→∞ FXn (x ) = FX∞ (x ) at any point x where FX∞ is continuous. And this
D
establishes that Xn → X∞ .

We shall need the following result, which we state without proof.

Theorem 6.3.2 (Hölder’s inequality). Let X and Y be random variables defined on (Ω, F, P) and
let p, q ≥ 1 satisfy 1/p + 1/q = 1. Then we have

E|XY| ≤ (E|X|p )1/p (E|Y|q )1/q . (6.5)

Corollary 6.3.3 (Lyapunov’s inequality). Let Z be a random variable defined on (Ω, F, P) and
suppose 1 ≤ q ≤ p. Then we have

(E|Z|q )1/q ≤ (E|Z|p )1/p . (6.6)

Proof. Let r, p ≥ 1. Applying Hölder's inequality (6.5) with Y = 1 and X = |Z|^r we obtain

    E|Z|^r ≤ ( E|Z|^{rp} )^{1/p}.

Defining q := rp, we have

    E|Z|^r ≤ ( E|Z|^q )^{r/q}   ⇒   ( E|Z|^r )^{1/r} ≤ ( E|Z|^q )^{1/q}.

Noting that q ≥ r, we have proved (6.6).


Lp Lq
Proof that Xn → X∞ ⇒ Xn → X∞ with p ≥ q ≥ 1. Apply (6.6) with Z = Xn – X∞ , take a limit as
n → ∞ and use the fact that E|Xn – X∞ |p → 0 by assumption.

Lemma 6.3.4 (Markov's inequality). Let Z be a random variable defined on (Ω, F, P). For any
a > 0 we have

    P(|Z| > a) ≤ (1/a) E|Z|.                                                (6.7)

Proof. By direct computation, we have

    P(|Z| > a) = E 1{|Z|>a} = (1/a) E a 1{|Z|>a} ≤ (1/a) E|Z|,

where we have used |Z| ≥ a 1{|Z|>a}.


Proof that Xn →Lp X∞ ⇒ Xn →P X∞ for p ≥ 1. Fix ε > 0. Applying Markov's inequality (6.7) with
Z = Xn – X∞ and a = ε and taking a limit as n → ∞ we obtain

    lim_{n→∞} P(|Xn – X∞| > ε) ≤ (1/ε) lim_{n→∞} E|Xn – X∞| = 0,

where, to deduce the last equality, we have used the fact that Xn →Lp X∞ ⇒ Xn →L1 X∞.

We will not prove the following result.

Theorem 6.3.5 (Lebesgue's dominated convergence theorem). Let (Xn) be a sequence of
random variables defined on (Ω, F, P) and suppose Xn →a.s. X∞. If there exists a non-negative random
variable Y such that (i) |Xn| ≤ Y for every n and (ii) EY < ∞, then

    lim_{n→∞} EXn = E lim_{n→∞} Xn = EX∞.                                   (6.8)

We can now prove the remaining part of Theorem 6.3.1.


a.s. P a.s.
Proof that Xn → X∞ ⇒ Xn → X∞ . Fix ε > 0 and define Yn := 1{|Xn –X∞ |>ε} . As Xn → X∞ it
a.s.
follows that Yn → 0. Moreover, Yn ≤ 1 for all n and E1 < ∞ (clearly!). Thus, using Lebesgue’s
dominated convergence Theorem (6.8) we obtain

lim P(|Xn – X∞ | > ε) = lim EYn = E lim Yn = 0,


n→∞ n→∞ n→∞
P
which implies Xn → X∞ .

6.4 Continuity of probability measures


Much like we can define the limit (if it exists) of a sequence of real numbers (xn ), we can define the limit
(if it exists) of a sequence of sets.

Definition 6.4.1. We say a sequence of sets (An) increases to A, written An ↑ A, if

    An ⊆ A_{n+1} ∀ n ∈ N,    and    A = ∪_{n=1}^{∞} An.

We say a sequence of sets (An) decreases to A, written An ↓ A, if

    An ⊇ A_{n+1} ∀ n ∈ N,    and    A = ∩_{n=1}^{∞} An.
A very important property of probability measures is that they are continuous in the sense described by
the following Theorem.

Theorem 6.4.2 (Continuity of Probability Measures). Fix a probability space (Ω, F, P). Suppose
(An) is a sequence of events in F. Then

    An ↑ A  ⇒  lim_{n→∞} P(An) = P( lim_{n→∞} ∪_{k=1}^{n} Ak ) = P(A),
    An ↓ A  ⇒  lim_{n→∞} P(An) = P( lim_{n→∞} ∩_{k=1}^{n} Ak ) = P(A).

Proof. Suppose An ↑ A. Define

    Bn := An \ A_{n–1}, ∀ n ∈ N,    A0 = ∅,

and observe that the sets (Bn) are disjoint and

    An = ∪_{k=1}^{n} Bk,    A := ∪_{n=1}^{∞} An = ∪_{n=1}^{∞} Bn.

Hence, using the countably additive property of P we have

    P(A) = Σ_{k=1}^{∞} P(Bk) = lim_{n→∞} Σ_{k=1}^{n} P(Bk) = lim_{n→∞} P( ∪_{k=1}^{n} Bk ) = lim_{n→∞} P(An).

Next, suppose An ↓ A. Then A^c_n ↑ A^c. Thus, we have

    P(A) = 1 – P(A^c) = 1 – lim_{n→∞} P(A^c_n) = lim_{n→∞} ( 1 – P(A^c_n) ) = lim_{n→∞} P(An).

6.5 lim sup and lim inf of sets


Much like we can define the lim sup and lim inf of a sequence of real numbers (xn ), we can define the
lim sup and lim inf of a sequence of sets.

Definition 6.5.1. Let (An) be a sequence of sets. We define

    lim sup_{n→∞} An := ∩_{m=1}^{∞} ∪_{n=m}^{∞} An,    lim inf_{n→∞} An := ∪_{m=1}^{∞} ∩_{n=m}^{∞} An.

To get a more intuitive understanding of the above definitions, it may be helpful to note the following:

    lim sup_{n→∞} An = {ω ∈ Ω : ∀ m ∃ n(ω) ≥ m s.t. ω ∈ A_{n(ω)}} = {ω ∈ Ω : ω ∈ An for infinitely many n},
    lim inf_{n→∞} An = {ω ∈ Ω : ∃ m(ω) s.t. ω ∈ An ∀ n ≥ m(ω)} = {ω ∈ Ω : ω ∈ An for all but finitely many n}.

For this reason, we sometimes write

    lim sup_{n→∞} An = {An, i.o.},    i.o. = "infinitely often,"
    lim inf_{n→∞} An = {An, a.b.f.m.},    a.b.f.m. = "all but finitely many."

Clearly, we have

    lim inf_{n→∞} An ⊆ lim sup_{n→∞} An.

It is also helpful to note that

(An , i.o)c = (Acn , a.b.f.m), (An , a.b.f.m.)c = (Acn , i.o.).

Note that the lim inf and lim sup of a sequence of sets always exist (just like the lim inf and lim sup of a
sequence of real numbers always exist). To see this, observe that, if we define

    Bm := ∪_{n=m}^{∞} An,    Cm := ∩_{n=m}^{∞} An,

then, for all n ∈ N, we have Bn ⊇ B_{n+1} and Cn ⊆ C_{n+1}. We further have

    lim sup_{n→∞} An = ∩_{m=1}^{∞} Bm,    lim inf_{n→∞} An = ∪_{m=1}^{∞} Cm.

Stated another way, we have

    Bm ↓ lim sup_{n→∞} An,    Cm ↑ lim inf_{n→∞} An.

It follows by the continuity of P that

    P( lim sup_{n→∞} An ) = lim_{n→∞} P(Bn),    P( lim inf_{n→∞} An ) = lim_{n→∞} P(Cn).

Example 6.5.2. Consider the following sequence of sets

A2n = B, A2n+1 = C, n = 0, 1, 2, . . . ,

where B and C are some arbitrary sets in a sample space Ω. To compute lim inf_n An, consider the
following question: for which ω ∈ Ω does the event ω ∈ An occur all but finitely many times? The
answer is all of the ω ∈ B ∩ C. To compute lim sup_n An, consider the following question: for which
ω ∈ Ω does the event ω ∈ An occur infinitely often? The answer is all of the ω ∈ B ∪ C. Thus, we have

lim inf An = B ∩ C, lim sup An = B ∪ C.


n→∞ n→∞

6.6 Borel-Cantelli Lemma


Lemma 6.6.1 (Borel-Cantelli). Let (An) be a sequence of sets. Then

    Σ_{n=1}^{∞} P(An) < ∞  ⇒  P(An, i.o.) = 0.                              (6.9)

If the events (An) are independent, then

    Σ_{n=1}^{∞} P(An) = ∞  ⇒  P(An, i.o.) = 1.                              (6.10)
Proof. To prove (6.9) we note that

    {An, i.o.} = ∩_{m=1}^{∞} ∪_{n=m}^{∞} An ⊆ ∪_{n=m}^{∞} An.

Therefore, we have

    P(An, i.o.) ≤ P( ∪_{n=m}^{∞} An ) ≤ Σ_{n=m}^{∞} P(An).

The right-hand side above goes to zero as m → ∞ if Σ_{n=1}^{∞} P(An) < ∞, which proves (6.9). To prove
(6.10) we note that

    P( ∩_{n=m}^{∞} A^c_n ) = lim_{k→∞} P( ∩_{n=m}^{k} A^c_n )
                           = Π_{n=m}^{∞} ( 1 – P(An) )                      (by independence)
                           ≤ Π_{n=m}^{∞} exp( –P(An) )                      (as 1 – x ≤ e^{–x} if x ≥ 0)
                           = exp( – Σ_{n=m}^{∞} P(An) ).

The last term equals zero if Σ_{n=1}^{∞} P(An) = ∞, and in this case we have

    P({An, i.o.}^c) = P( ∪_{m=1}^{∞} ∩_{n=m}^{∞} A^c_n ) = lim_{m→∞} P( ∩_{n=m}^{∞} A^c_n ) = 0,

which establishes that P(An, i.o.) = 1, as claimed.

We can use the Borel-Cantelli Lemma to establish that fast convergence in probability implies almost
sure convergence.

Theorem 6.6.2. Suppose Xn →P X∞ and

    Σ_{n=1}^{∞} P(|Xn – X∞| > ε) < ∞, ∀ ε > 0.

Then Xn →a.s. X∞.

Proof. Fix ε > 0 and take An = {|Xn – X∞| > ε}. We have by assumption that Σ_n P(An) < ∞. It
follows from the Borel-Cantelli Lemma 6.6.1 that

    P(An, i.o.) = P(|Xn – X∞| > ε, i.o.) = 0.

Thus, Xn →a.s. X∞.

6.7 Martingale convergence theorems


It is often the case that a sequence of random variables (Xn ) represents the value of a process at discrete
periods of time. This was the case in Example 6.1.1, in which Xn represented the wealth of a gambler
after the n th flip of a coin. A very special class of processes that we have previously encountered is
the class of martingale processes (see Definition 2.3.5). For the reader’s convenience, we restate the
definition of a martingale for a discrete-time process here.

Definition 6.7.1. Let X = (Xn )n∈N0 be a sequence of random variables defined on a probability space
(Ω, F, P). The sequence X is said to be a martingale with respect to a filtration (Fn )n∈N0 if the following
hold:
(i) E|Xn | < ∞ for all n ∈ N0 ,
(ii) E[X_{n+m} | Fn] = Xn for all n, m ∈ N0 or, equivalently, E[X_{n+1} | Fn] = Xn for all n ∈ N0.

The special structure of martingales will allow us to prove some very powerful theorems about the
behavior of a martingale X as n → ∞. Before stating and proving these theorems, however, let us take a
look at some examples of discrete-time martingales.

Example 6.7.2 (Martingales from branching processes). Suppose Zn is the size of the nth
generation of a branching process with Z0 = 1 (see Section 3.2). We have

    Z_{n+1} = Σ_{i=1}^{Zn} X_{n,i},

where the (X_{n,i}) are iid with common distribution X. Let Fn := σ(Z0, Z1, . . . , Zn). We have

    E[Z_{n+1} | Fn] = E[ Σ_{i=1}^{Zn} X_{n,i} | Zn ] = Zn EX = Zn µ,    µ := EX.

From this, one obtains by induction that EZn = µⁿ. Clearly the process Z is not a martingale when
µ ≠ 1. However, consider Wn := µ^{–n} Zn. We have

    E[W_{n+1} | Fn] = µ^{–(n+1)} E[Z_{n+1} | Fn] = µ^{–(n+1)} Zn µ = µ^{–n} Zn = Wn.

Thus, the process W = (Wn) is a martingale with respect to (Fn). Next, suppose η is the smallest
solution of

    G(η) = η,    where G(s) := E s^X.

Defining Vn := η^{Zn}, we find

    E[V_{n+1} | Fn] = E[η^{Z_{n+1}} | Fn] = E[ η^{Σ_{i=1}^{Zn} X_{n,i}} | Fn ] = (E η^X)^{Zn}
                    = (G(η))^{Zn} = η^{Zn} = Vn.

Thus, the process V = (Vn) is a martingale with respect to (Fn).

Example 6.7.3 (Martingales from Markov chains). Let X = (Xn) be a discrete time Markov chain
with state space S and one-step transition matrix P = (p(i, j)). Suppose ψ is a right eigenvector of P
with corresponding eigenvalue λ = 1. That is,

    Pψ = ψ,    or, component-wise,    Σ_{j∈S} p(i, j)ψ(j) = ψ(i), ∀ i ∈ S.

Define Yn := ψ(Xn). Letting Fn := σ(X0, X1, . . . , Xn), we have

    E[Y_{n+1} | Fn] = E[ψ(X_{n+1}) | Xn] = Σ_{j∈S} p(Xn, j)ψ(j) = ψ(Xn) = Yn.

Thus, the process Y = (Yn) is a martingale with respect to (Fn).

To prove the main result of this section (Theorem 6.7.5) we require the following result:

Theorem 6.7.4 (Doob-Kolmogorov inequality). Let M = (Mn) be a martingale with respect to
a filtration (Fn). Then

    P( sup_{0≤i≤n} |Mi| ≥ ε ) ≤ (1/ε²) E M_n²,    ∀ ε > 0.                  (6.11)

Proof. We begin by defining the following sets:

    A0 := Ω,    Ak := { sup_{0≤i≤k} |Mi| < ε },    Bk := A_{k–1} ∩ {|Mk| ≥ ε},    k ≥ 1.

Observe that Ak is the event that |M| does not reach ε prior to time k, and Bk is the event that |M|
reaches or exceeds ε for the first time at time k. Thus, for any k, the events Ak, B1, B2, . . . , Bk are a
partition of Ω. That is, for all k, we have

    Ω = Ak ∪ ( ∪_{i=1}^{k} Bi ),    Ak ∩ Bi = ∅, i ≤ k,    Bi ∩ Bj = ∅, i ≠ j.

It follows that

    E M_n² = E M_n² 1_{An} + Σ_{i=1}^{n} E M_n² 1_{Bi} ≥ Σ_{i=1}^{n} E M_n² 1_{Bi}.

Moreover, we have

    E M_n² 1_{Bi} = E (Mn – Mi + Mi)² 1_{Bi}
                  = E (Mn – Mi)² 1_{Bi} + 2 E (Mn – Mi) Mi 1_{Bi} + E M_i² 1_{Bi}
                  =: α + 2β + γ.

Let us look at these three terms one-by-one. Clearly, the first term satisfies α ≥ 0. Next, using the fact
that M is a martingale, the second term satisfies

    β = E (Mn – Mi) Mi 1_{Bi} = E Mi 1_{Bi} E[(Mn – Mi) | Fi] = 0.

And the last term satisfies

    γ = E M_i² 1_{Bi} ≥ E ε² 1_{Bi} = ε² P(Bi).

Using the above results, we have

    E M_n² 1_{Bi} = α + 2β + γ ≥ ε² P(Bi),

and hence

    E M_n² ≥ Σ_{i=1}^{n} E M_n² 1_{Bi} ≥ ε² Σ_{i=1}^{n} P(Bi) = ε² P( sup_{0≤i≤n} |Mi| ≥ ε ),

which proves (6.11).

We are now ready to prove the main result of this section:

Theorem 6.7.5. Let M = (Mn) be a martingale with respect to a filtration (Fn). Suppose that
sup_n E M_n² < ∞. Then there exists a random variable M∞ such that

    Mn →a.s. M∞    and    Mn →L2 M∞.

Proof. Observe that, for any n, m ∈ N0, we have

    E M²_{n+m} = E (M_{n+m} – Mn + Mn)²
               = E (M_{n+m} – Mn)² + E M_n² + 2 E Mn (M_{n+m} – Mn)
               = E (M_{n+m} – Mn)² + E M_n² + 2 E Mn E[(M_{n+m} – Mn) | Fn]
               = E (M_{n+m} – Mn)² + E M_n² ≥ E M_n².

Thus, the sequence (E M_n²) is non-decreasing and (by assumption) bounded. The sequence therefore
has a limit ℓ := lim_{n→∞} E M_n². We will now show that the sequence (Mn) is Cauchy convergent almost
surely (see Definition 6.2.2). By Theorem 6.2.3, this will establish that Mn →a.s. M∞. Define C as the set
of ω ∈ Ω on which the sequence (Mn) is Cauchy:

    C := { ∀ ε > 0 ∃ m s.t. |M_{m+i} – M_{m+j}| < ε ∀ i, j ∈ N }
       = ∩_{ε>0} ∪_{m=1}^{∞} { |M_{m+i} – Mm| < ε ∀ i ∈ N }.

The complement of C is given by

    C^c = ∪_{ε>0} ∩_{m=1}^{∞} { ∃ i ∈ N s.t. |M_{m+i} – Mm| ≥ ε } = ∪_{ε>0} ∩_{m=1}^{∞} Am(ε),
    Am(ε) := { ∃ i ∈ N s.t. |M_{m+i} – Mm| ≥ ε }.

Noting that Am(ε) ⊆ Am(ε′) if ε ≥ ε′, it follows that

    P(C^c) = lim_{ε↘0} P( ∩_{m=1}^{∞} Am(ε) ) ≤ lim_{ε↘0} lim_{m→∞} P(Am(ε)).

Thus, to prove that P(C^c) = 0, it is sufficient to show that P(Am(ε)) → 0 as m → ∞ for all ε > 0. To
this end, for a fixed m ∈ N define the sequence Y = (Yn) by Yn := M_{n+m} – Mm. The process (Yn) is a
martingale with respect to (Gn), where Gn = σ(Y1, Y2, . . . , Yn), since

    E[Y_{n+1} | Gn] = E[M_{n+1+m} – Mm | Gn] = E[ E[M_{n+1+m} – Mm | F_{n+m}] | Gn ]
                    = E[M_{n+m} – Mm | Gn] = E[Yn | Gn] = Yn.

As the process Y is a martingale, we have by the Doob-Kolmogorov inequality (Theorem 6.7.4) that

    P( sup_{0≤i≤n} |Yi| ≥ ε ) ≤ (1/ε²) E Y_n², ∀ ε > 0,
    ⇒ P( sup_{0≤i≤n} |M_{m+i} – Mm| ≥ ε ) ≤ (1/ε²) E(M_{m+n} – Mm)²
                                          = (1/ε²) ( E M²_{m+n} – E M²_m ), ∀ ε > 0.

Taking a limit as n → ∞ we obtain

    P(Am(ε)) ≤ (1/ε²) ( ℓ – E M²_m )   ⇒   lim_{m→∞} P(Am(ε)) ≤ (1/ε²) (ℓ – ℓ) = 0,

which establishes that (Mn) is Cauchy convergent almost surely. Hence, there exists M∞ such that
Mn →a.s. M∞. Next, using Fatou's Lemma ¹ we have

    E(Mn – M∞)² = E liminf_{m→∞} (Mn – Mm)² ≤ liminf_{m→∞} E(Mn – Mm)²
                = lim_{m→∞} E M²_m – E M_n² = ℓ – E M_n² → 0 as n → ∞,

which establishes that Mn →L2 M∞.

¹ Fatou's Lemma states that, for a sequence of non-negative random variables (Xn), the following holds:
E liminf_{n→∞} Xn ≤ liminf_{n→∞} EXn.

Let us see an example of how Theorem 6.7.5 can be applied.

Example 6.7.6. Suppose (Xn) is a Markov chain with one-step transition probabilities given by

    p(i, j) = P(X_{n+1} = j | Xn = i) = C(N, j) (i/N)^j (1 – i/N)^{N–j},    i, j ∈ {0, 1, . . . , N},

where C(N, j) denotes the binomial coefficient. Let Fn = σ(X1, X2, . . . , Xn). The process (Xn) is a
martingale with respect to the filtration (Fn) because, given Fn, the next state X_{n+1} is binomially
distributed with parameters N and Xn/N, so that

    E[X_{n+1} | Fn] = Σ_{j=0}^{N} j p(Xn, j) = N · (Xn/N) = Xn.
Clearly, as (Xn) is bounded, we have sup_n E X_n² < ∞. As such, by Theorem 6.7.5 there exists a random
variable X∞ such that Xn →a.s. X∞ and Xn →L2 X∞. To find the distribution of X∞, we note that both
{0} and {N} are absorbing states. As there are no other absorbing states, we conclude that (Xn) will
eventually end up in one of these two states. Thus, using the martingale property X0 = EXn, we find

    X0 = lim_{n→∞} EXn = E lim_{n→∞} Xn = EX∞ = N · P(X∞ = N) + 0 · P(X∞ = 0),

where passing the limit through the expectation is allowed by Lebesgue's dominated convergence theorem
(Theorem 6.3.5). From the above computation, we conclude that

    X0/N = P(X∞ = N) = 1 – P(X∞ = 0).
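A quick simulation check of the absorption probability (a minimal sketch; the values of N, X0 and the trial and step counts are arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)
    N, x0, trials = 20, 5, 50_000

    X = np.full(trials, x0)
    for _ in range(2000):                        # enough steps for absorption w.h.p.
        X = rng.binomial(N, X / N)               # X_{n+1} | X_n ~ Bin(N, X_n / N)

    print((X == N).mean(), " (theory:", x0 / N, ")")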

6.8 Uniform Integrability


We begin this section with a motivating example.

Example 6.8.1. Suppose (Ω, F, P) = ([0, 1], B([0, 1]), Leb) so that P(dω) = dω. Define a sequence of
random variables (Xn ) by

Xn (ω) := n1{[0,1/n]} (ω).

It is easy to see that Xn →P 0 as, for any ε > 0, we have

    lim_{n→∞} P(|Xn – 0| > ε) = lim_{n→∞} (1/n) = 0.

In fact, in this case, we have Xn →a.s. 0 as well. However, we clearly do not have L1 convergence of Xn → 0,
because

    lim_{n→∞} E|Xn – 0| = lim_{n→∞} 1 = 1.

It is natural to ask: are there conditions under which Xn →P X∞ implies Xn →L1 X∞? It turns out that
the answer is "yes." We introduce these conditions in this section.

Definition 6.8.2. A collection of random variables (Xn ) (possibly uncountable) is said to be uniformly
integrable (UI) if, for every ε > 0 there exists Kε ∈ [0, ∞) such that

sup E|Xn |1{|Xn |>Kε } < ε.


n

Example 6.8.3. Let us return to Example 6.8.1, discussed above. For every n we have E|Xn | = 1. Thus,
the collection of random variables (Xn ) is integrable. However, we have

E|Xn |1{|Xn |>K} = 1, ∀ n > K.

Thus, for a given ε there is no Kε ∈ [0, ∞) for which supn E|Xn |1{|Xn |>Kε } < ε. As such, we conclude
that the collection (Xn ) is not UI.

Let us provide some simple conditions which, if satisfied by a collection of random variables, ensure
uniform integrability.

Lemma 6.8.4. Suppose a collection of random variables (Xn ) satisfies

sup E|Xn |p < ∞


n

for some p > 1. Then the collection (Xn ) is UI.

Proof. Assuming K > 0 and p > 1, we have |Xn|^{1–p} 1{|Xn|>K} ≤ K^{1–p} 1{|Xn|>K} and thus
|Xn| 1{|Xn|>K} ≤ K^{1–p} |Xn|^p 1{|Xn|>K}. Hence

    E|Xn| 1{|Xn|>K} ≤ K^{1–p} E|Xn|^p 1{|Xn|>K} ≤ K^{1–p} E|Xn|^p.

Taking sup's on both sides, we have

    sup_n E|Xn| 1{|Xn|>K} ≤ K^{1–p} sup_n E|Xn|^p =: K^{1–p} c,    c := sup_n E|Xn|^p < ∞.

The right-hand side can be made smaller than ε by choosing K > Kε := (c/ε)^{1/(p–1)}. Thus, we have
that (Xn) is UI.

Lemma 6.8.5. Suppose a collection of random variables (Xn) satisfies

sup |Xn | ≤ Y, EY < ∞,


n

for some random variable Y. Then the collection (Xn ) is UI.

Proof. By assumption |X_n| ≤ Y, which implies 1_{|X_n|>K} ≤ 1_{Y>K}. Hence, we have

sup_n E|X_n| 1_{|X_n|>K} ≤ EY 1_{Y>K}.

Because EY < ∞ by assumption, for any ε > 0, there exists K_ε such that EY 1_{Y>K_ε} < ε. Thus, by
choosing K > K_ε above, we can ensure sup_n E|X_n| 1_{|X_n|>K} < ε. It follows that the collection of random
variables (X_n) is UI.

Theorem 6.8.6 (Bounded convergence). Let (X_n) be a sequence of random variables and let
X_∞ be a random variable. Suppose that X_n → X_∞ in probability and that, for some K < ∞, we have

sup_n |X_n| ≤ K.

Then X_n → X_∞ in L¹.

Proof. First, we check that P(|X_∞| ≤ K) = 1. For any k ∈ N we have

P(|X_∞| > K + 1/k) ≤ P(|X_∞ − X_n| > 1/k),

since |X_n| ≤ K. The right-hand side goes to zero as n → ∞ because (by assumption) X_n → X_∞ in
probability. Thus, we have P(|X_∞| > K + 1/k) = 0. Now, observe that

P(|X_∞| > K) = P(∪_k {|X_∞| > K + 1/k}) = lim_{k→∞} P(|X_∞| > K + 1/k) = 0.

Hence, P(|X_∞| ≤ K) = 1. Next, because X_n → X_∞ in probability, for any ε > 0 we can choose n_ε such that

P(|X_n − X_∞| > ε/3) < ε/(3K),     ∀ n ≥ n_ε.

Then, for n ≥ n_ε, we have

E|X_n − X_∞| = E|X_n − X_∞| 1_{|X_n−X_∞|>ε/3} + E|X_n − X_∞| 1_{|X_n−X_∞|≤ε/3}
            ≤ 2K P(|X_n − X_∞| > ε/3) + ε/3 ≤ ε,

where we have used |X_n − X_∞| ≤ 2K. As ε was arbitrary, we can make E|X_n − X_∞| as small as we like.
Hence, X_n → X_∞ in L¹.

We can now state and prove the main result of this section.

Theorem 6.8.7. Let (X_n) be a sequence of random variables and let X_∞ be a random variable.
Suppose that E|X_n| < ∞ for every n and E|X_∞| < ∞. Then

X_n → X_∞ in L¹  ⇔  X_n → X_∞ in probability and (X_n) is UI.

Proof. We have already proved the ⇒ part² (see Theorem 6.3.1). So, we focus now on the ⇐ part. To
this end, suppose X_n → X_∞ in probability and (X_n) is UI. For any K ∈ [0, ∞), define φ_K : R → [−K, K] as
follows:

φ_K(x) := −K if x < −K,     x if |x| ≤ K,     K if x > K.

Observe that

|φ_K(x) − x| = (|x| − K) 1_{|x|>K} ≤ |x| 1_{|x|>K}.     (6.12)

Fix ε > 0. From (6.12) and the fact that (X_n) is UI and X_∞ integrable, there exists a K_ε such that

sup_n E|φ_{K_ε}(X_n) − X_n| < ε/3,     E|φ_{K_ε}(X_∞) − X_∞| < ε/3.

Now, X_n → X_∞ in probability implies that φ_{K_ε}(X_n) → φ_{K_ε}(X_∞) in probability. And furthermore,
sup_n |φ_{K_ε}(X_n)| ≤ K_ε. Thus, it follows from Theorem 6.8.6 that φ_{K_ε}(X_n) → φ_{K_ε}(X_∞) in L¹. Thus, there
exists an n_ε such that

E|φ_{K_ε}(X_n) − φ_{K_ε}(X_∞)| < ε/3,     ∀ n ≥ n_ε.

Combining the above results, we obtain

E|X_n − X_∞| ≤ E|X_n − φ_{K_ε}(X_n)| + E|φ_{K_ε}(X_n) − φ_{K_ε}(X_∞)| + E|φ_{K_ε}(X_∞) − X_∞| < ε.

As ε was arbitrary, it follows that X_n → X_∞ in L¹, as claimed.
Remark 6.8.8. From Theorem 6.3.1, we know that X_n → X_∞ almost surely implies X_n → X_∞ in
probability. From this and Theorem 6.8.7, it follows that

X_n → X_∞ a.s. and (X_n) is UI  ⇒  X_n → X_∞ in L¹.
2 More specifically, we have proved that convergence in L¹ implies convergence in probability; it is also true that convergence in L¹ implies uniform integrability.

We will show below that the set of conditional expectations of an integrable random variable forms a
class of uniformly integrable random variables. To prove this result, we require the following lemma.

Lemma 6.8.9. Suppose E|X| < ∞. Then, for all ε > 0 there exists δε > 0 such that P(A) < δε
implies E|X|1A < ε.

Proof. Suppose, by contradiction, that E|X| < ∞ and, for a given ε > 0, there exists a sequence of sets
(A_n) such that

P(A_n) < 1/2ⁿ,  and  E|X| 1_{A_n} ≥ ε,     ∀ n.

Define A := {A_n, i.o.}. As Σ_{n=1}^∞ P(A_n) < ∞, we have by the Borel–Cantelli Lemma (see (6.9)) that
P(A) = 0 and thus E|X| 1_A = 0. On the other hand we have

ε ≤ lim sup_{n→∞} E|X| 1_{A_n} ≤ E lim sup_{n→∞} |X| 1_{A_n} = E|X| 1_A,

where the second inequality is the reverse Fatou lemma, justified because |X| 1_{A_n} ≤ |X| and E|X| < ∞.
This is a contradiction.

Theorem 6.8.10. Let X be a random variable on (Ω, F, P) and suppose E|X| < ∞. Then the
collection C of random variables defined by

C = {E(X|G) : G is a sub-σ-algebra of F},

is uniformly integrable.

Proof. Fix ε > 0. From Lemma 6.8.9 there exists δε > 0 such that P(A) < δε implies E|X|1A < ε. As
E|X| < ∞ we can choose Kε > 0 so that E|X|/Kε < δε . Now, for any G ⊆ F, we can define Y = E(X|G).
By Jensen’s inequality (see Theorem 2.3.4) we have

|Y| = |E(X|G)| ≤ E(|X| | G),     P-a.s.     (6.13)

Hence, we have E|Y| ≤ EE(|X||G) = E|X| and, by the Markov inequality

Kε P(|Y| > Kε ) ≤ E|Y| ≤ E|X|,

so that

P(|Y| > Kε ) ≤ E|X|/Kε < δε .

Note that {|Y| > Kε } ∈ G. Thus, from (6.13) we have

E|Y| 1_{|Y|>K_ε} ≤ E E(|X| | G) 1_{|Y|>K_ε} = E E(|X| 1_{|Y|>K_ε} | G) = E|X| 1_{|Y|>K_ε} < ε.

As the above inequality holds for any G ⊆ F, the class C is UI, as claimed.

Example 6.8.11 (Uniformly integrable martingale). Suppose X_∞ is a random variable defined
on (Ω, F, P) and E|X_∞| < ∞. Suppose (F_n) is a filtration and define a sequence of random variables
(X_n) by X_n := E[X_∞|F_n]. Then the process X = (X_n) is a martingale because

E[Xn+1 |Fn ] = E[E[X∞ |Fn+1 ]|Fn ] = E[X∞ |Fn ] = Xn ,

and, moreover, (Xn ) is uniformly integrable by Theorem 6.8.10.
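For a concrete instance of such a uniformly integrable martingale, take Ω = [0, 1] with Lebesgue measure, X_∞(ω) = ω², and F_n generated by the dyadic intervals of length 2⁻ⁿ; then X_n = E[X_∞|F_n] simply averages X_∞ over the dyadic interval containing ω. The sketch below is my own illustration (the choice X_∞(ω) = ω² is an arbitrary assumption) and shows X_n(ω) converging to X_∞(ω).

```python
import numpy as np

# X_infty(omega) = omega^2 on ([0,1], Lebesgue); F_n = dyadic intervals of length 2^-n.
# X_n = E[X_infty | F_n] averages X_infty over the dyadic interval containing omega.

def X_n(omega, n):
    k = np.floor(omega * 2**n)            # index of the dyadic interval containing omega
    a, b = k / 2**n, (k + 1) / 2**n       # endpoints of that interval
    return (b**3 - a**3) / (3 * (b - a))  # average of x^2 over [a, b]

omega = 0.3
for n in [1, 2, 5, 10, 20]:
    print(n, X_n(omega, n))               # approaches omega^2 = 0.09
```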

6.9 Exercises
Exercise 6.1. Let X1 , X2 , X3 , ... be a sequence of random variables such that

Xn ∼ Geo(λ/n), for n = 1, 2, 3, ...

where λ > 0 is a constant. Define a new sequence as

Y_n := (1/n) X_n,     for n = 1, 2, 3, ...

Show that Y_n converges in distribution to E(λ), the exponential distribution with rate λ.

Exercise 6.2. Consider the sample space Ω = [0, 1] with uniform probability distribution, i.e.,

P(ω ∈ [a, b]) = b – a, ∀ 0≤a ≤b≤1

Define the sequence {X_n, n = 1, 2, ...} as X_n(ω) = (n/(n+1)) ω + (1 − ω)ⁿ. Also, define the random variable X
on this sample space as X(ω) = ω. Show that X_n → X almost surely.

Exercise 6.3. Let {Xn , n = 1, 2, ...} and {Yn , n = 1, 2, ...} be two sequences of random variables, defined
on some probability space (Ω, F, P). Suppose that we know

X_n → X almost surely,     Y_n → Y almost surely.

Prove that X_n + Y_n → X + Y almost surely.

Exercise 6.4. Let {Xn , n = 1, 2, ...} and {Yn , n = 1, 2, ...} be two sequences of random variables, defined
on some probability space (Ω, F, P). Suppose that we know

X_n → X in probability,     Y_n → Y in probability.

Prove that X_n + Y_n → X + Y in probability.

Exercise 6.5. Show that if X_n is any sequence of random variables, there are constants c_n → ∞ so
that X_n/c_n → 0 almost surely.

Exercise 6.6. Let X_1, X_2, ..., be independent with P(X_n = 1) = p_n and P(X_n = 0) = 1 − p_n. Show that

(a) X_n → 0 in probability if and only if p_n → 0.

(b) X_n → 0 almost surely if and only if Σ_n p_n < ∞.

Exercise 6.7. Suppose that X_1, X_2, ..., are independent with P(X_n > x) = x⁻⁵ for all x ≥ 1 and
n = 1, 2, .... Show that lim sup_{n→∞} (log X_n)/log n = c almost surely for some number c, and find c.
Chapter 7

Brownian motion

The notes from this chapter are primarily taken from (Shreve, 2004, Chapter 3). The goals of this
chapter are (i) to define what we mean by “Brownian motion” and (ii) to develop important properties of
Brownian motion.

7.1 Scaled random walks

In order to construct Brownian motion, we will begin with a random walk. We return to the setting of
Section 1.3.2. Our sample space Ω and a generic element ω in this sample space are given by

Ω = the set of infinite sequences of Hs and Ts , ω = ω1 ω2 ω3 . . . ,

where ωi is the result of the i th coin toss. We take Fn to be the σ-algebra generated by observing the
first n coin tosses and we set F = σ(∪n Fn ). The coin tosses are assumed to be independent and we take
P(ωi = H) = P(ωi = T) = (1/2).

7.1.1 Symmetric random walk

We construct a symmetric random walk M = (M_i)_{i∈N₀} by

M_0 = 0,     M_k = Σ_{j=1}^{k} X_j,     X_i = +1 if ω_i = H, −1 if ω_i = T.

Let ki < ki +1 for all i ∈ N0 . As the (Xi )i ∈N are independent, we clearly have

independent increments : (Mk2 – Mk1 ) ⊥⊥ (Mk4 – Mk3 ).


Moreover, as EX_j = 0 and VX_j = 1 for all j, we also have

E(M_{k₂} − M_{k₁}) = Σ_{j=k₁+1}^{k₂} EX_j = 0,     V(M_{k₂} − M_{k₁}) = Σ_{j=k₁+1}^{k₂} VX_j = k₂ − k₁.     (7.1)

Thus, for the discrete time process M = (Mi )i ≥0 we see that variance accumulates at a rate of one per
unit time.

Next, we note that M is a martingale with respect to the filtration F = (Fn )n∈N0 . To see this, let k ≤ l
and note that

E[M_l | F_k] = E[M_l − M_k + M_k | F_k] = E[M_l − M_k | F_k] + E[M_k | F_k] = E(M_l − M_k) + M_k = M_k,

where we have used the independent increments property (Ml – Mk ) ⊥⊥ Fk , equation (7.1) and the fact
that Mk ∈ Fk .

The quadratic variation of the symmetric random walk M up to time k, denoted [M, M]_k, is defined as

[M, M]_k := Σ_{j=1}^{k} (M_j − M_{j−1})² = Σ_{j=1}^{k} X_j² = k,

where we have used X2j = 1. The astute reader will notice that [M, M]k = VMk . However, it is important
to note that the computation of variance and the computation of quadratic variation are different !
To see this, note that if P(ωi = H) = p 6= (1/2) then VXi 6= 1 and thus VMk = 6 k . However, since
X2j = 1 is unaffected by the value of p = P(ωi = H), the computation of [M, M]k is also unaffected by p.
Another way to see that VMk and [M, M]k are different is to note that VMk is a statistical quantity
(i.e., it is an average over all ω) whereas [M, M]k is computed ω-by-ω (it just turns out that, for each ω
we have [M, M]k (ω) = k .
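The distinction is easy to see numerically. In the sketch below (my own illustration; the bias p = 0.7 and horizon k = 1000 are arbitrary assumptions), the quadratic variation equals k on every simulated path, while the sample variance of M_k is close to 4p(1 − p)k ≠ k.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, paths = 0.7, 1000, 10_000

# steps X_j = +1 with probability p, -1 with probability 1-p
X = np.where(rng.random((paths, k)) < p, 1, -1)
M_k = X.sum(axis=1)

# quadratic variation is computed path by path: sum of squared steps
QV = (X**2).sum(axis=1)

print("quadratic variation on every path:", QV.min(), QV.max())  # both equal k = 1000
print("sample variance of M_k:", M_k.var())   # approx 4 p (1-p) k = 840, not k
```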

7.1.2 Scaled symmetric random walk

We now fix a positive integer n. We construct a scaled symmetric random walk W^{(n)} = (W_t^{(n)})_{t≥0} by

W_t^{(n)} = (1/√n) M_{nt},     if nt ∈ N₀.

If nt is not an integer, we define W_t^{(n)} to be the linear interpolation of M_{⌊nt⌋} and M_{⌈nt⌉}. Note that M
was a discrete time process whereas W^{(n)} is a continuous time process.

Let 0 = t₀ < t₁ < t₂ < . . . and suppose nt_j ∈ N for every j. Then we have

independent increments: (W_{t₂}^{(n)} − W_{t₁}^{(n)}) ⊥⊥ (W_{t₄}^{(n)} − W_{t₃}^{(n)}),

because non-overlapping increments depend on different coin tosses. For example,

W_{0.2}^{(100)} − W_{0.0}^{(100)} depends on coin tosses 1 through 20,
W_{0.7}^{(100)} − W_{0.2}^{(100)} depends on coin tosses 21 through 70.

Next, suppose 0 ≤ s ≤ t are such that ns ∈ N and nt ∈ N. Then

E(W_t^{(n)} − W_s^{(n)}) = (1/√n) E(M_{nt} − M_{ns}) = 0,     V(W_t^{(n)} − W_s^{(n)}) = (1/n) V(M_{nt} − M_{ns}) = t − s.     (7.2)

Now, let us define the scaled filtration F^{(n)} = (F_t^{(n)})_{t≥0} where

F_t^{(n)} := F_{nt},     if nt ∈ N₀.

The scaled random walk W^{(n)}, restricted to the set of t for which nt ∈ N₀, is a martingale with respect
to the filtration F^{(n)}. To see this, assume 0 ≤ s ≤ t are such that ns ∈ N and nt ∈ N. Then we have

E[W_t^{(n)} | F_s^{(n)}] = E[W_t^{(n)} − W_s^{(n)} + W_s^{(n)} | F_s^{(n)}] = E[W_t^{(n)} − W_s^{(n)} | F_s^{(n)}] + E[W_s^{(n)} | F_s^{(n)}]
                  = E(W_t^{(n)} − W_s^{(n)}) + W_s^{(n)} = W_s^{(n)},

where we have used (W_t^{(n)} − W_s^{(n)}) ⊥⊥ F_s^{(n)}, equation (7.2) and W_s^{(n)} ∈ F_s^{(n)}.

The quadratic variation of the scaled symmetric random walk W^{(n)} up to time t, denoted [W^{(n)}, W^{(n)}]_t,
is defined as

[W^{(n)}, W^{(n)}]_t := Σ_{j=1}^{nt} (W_{j/n}^{(n)} − W_{(j−1)/n}^{(n)})² = Σ_{j=1}^{nt} ((1/√n) X_j)² = Σ_{j=1}^{nt} (1/n) = t,

where we again assume nt ∈ N₀. Thus, for the scaled symmetric random walk, we see that VW_t^{(n)} =
[W^{(n)}, W^{(n)}]_t. However, we emphasize one more time that the computation of variance is a statistical
average over all ω and the computation of quadratic variation is done ω-by-ω.
Theorem 7.1.1. Fix t ≥ 0. Define a random variable W_t := lim_{n→∞} W_t^{(n)}. Then W_t ∼ N(0, t).

Proof. We will use the Continuity Theorem 3.3.8. We have

φ_{W_t^{(n)}}(u) = E exp(iu W_t^{(n)}) = E exp((iu/√n) Σ_{j=1}^{nt} X_j)
             = Π_{j=1}^{nt} E exp((iu/√n) X_j)
             = ((1/2) e^{iu/√n} + (1/2) e^{−iu/√n})^{nt} → e^{−tu²/2},     as n → ∞.

The limit as n → ∞ is in fact not easy to show. Nevertheless, the limit above is correct. From Example
3.3.4, we know that if a random variable Z is normally distributed Z ∼ N(0, t) then its characteristic
function φ_Z is given by

φ_Z(u) = e^{−tu²/2}.

As lim_{n→∞} φ_{W_t^{(n)}} = φ_Z, we have from the Continuity Theorem 3.3.8 that W_t^{(n)} → N(0, t) in distribution, as claimed.
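The convergence can be observed by simulation. Below is a minimal sketch (my own illustration; n, t and the path count are arbitrary assumptions), using the identity M_{nt} = 2·Binomial(nt, 1/2) − nt to sample the walk efficiently.

```python
import numpy as np

rng = np.random.default_rng(2)
n, t, paths = 10_000, 1.0, 100_000

# M_{nt} = (#heads) - (#tails) = 2*Binomial(nt, 1/2) - nt
nt = int(n * t)
M = 2 * rng.binomial(nt, 0.5, size=paths) - nt
W_nt = M / np.sqrt(n)                               # W_t^{(n)} = M_{nt} / sqrt(n)

print("sample mean:", W_nt.mean())                  # approx 0
print("sample variance:", W_nt.var())               # approx t = 1
print("P(W_t^{(n)} <= 1):", (W_nt <= 1.0).mean())   # approx Phi(1) = 0.8413
```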

7.2 Brownian motion

Definition 7.2.1. Let (Ω, F, P) be a probability space. A Brownian motion is a stochastic process
W = (Wt )t ≥0 that satisfies:

1. W0 = 0.
2. If 0 ≤ r < s < t < u < ∞ then (Wu – Wt ) ⊥⊥ (Ws – Wr ).
3. If 0 ≤ r < s then Ws – Wr ∼ N(0, s – r ).
4. The map t → Wt is continuous for every ω.

It is clear from the previous sections that we can construct a Brownian motion as a limit of a scaled
symmetric random walk. Had we simply given Definition 7.2.1 at the beginning of this chapter with
no further introduction, one might have legitimately asked if there exists a process that satisfies the
properties of a Brownian motion. There are other methods to prove the existence of Brownian motion.
But, the scaled random walk construction is perhaps the most intuitive.

What is Ω in Definition 7.2.1? It could be an infinite sequence of Hs and Ts, representing movements up
and down of a scaled random walk. Or, it could be Ω = C₀(R₊), the set of continuous functions on
R₊ starting from zero. In this case, an element of Ω is a continuous function t ↦ ω(t) and one could
simply take W_t(ω) = ω(t). Whatever the sample space, the probability of any single element ω is zero:
P({ω}) = 0, but probabilities such as P(W_t ≤ 0) are well-defined.

Let 0 < t₁ < t₂ < . . . < t_d < ∞. Note that the vector W := (W_{t₁}, W_{t₂}, . . . , W_{t_d}) is a d-dimensional
normally distributed random variable. The distribution of a normally distributed random vector is
uniquely determined by its mean vector and covariance matrix. We clearly have E(W_{t₁}, W_{t₂}, . . . , W_{t_d}) =
(0, 0, . . . , 0). The entries of the covariance matrix are of the following form: for T ≥ t we have

Cov[W_T, W_t] = EW_T W_t = E(W_T − W_t + W_t)W_t = E(W_T − W_t)W_t + EW_t² = EW_t E[W_T − W_t | W_t] + t = 0 + t = t.

Thus, Cov[W_s, W_t] = s ∧ t, and the covariance matrix for (W_{t₁}, W_{t₂}, . . . , W_{t_d}) is

        ⎛ t₁  t₁  · · ·  t₁  ⎞
        ⎜ t₁  t₂  · · ·  t₂  ⎟
C =     ⎜  ⋮    ⋮    ⋱    ⋮   ⎟     (7.3)
        ⎝ t₁  t₂  · · ·  t_d ⎠

Definition 7.2.2. Let (Ω, F, P) be a probability space on which a Brownian motion W = (W_t)_{t≥0} is
defined. A filtration for the Brownian motion W is a collection of σ-algebras F = (F_t)_{t≥0} satisfying:

1. Information accumulates: if 0 ≤ s < t then F_s ⊂ F_t.

2. Adaptivity: for all t ≥ 0, we have W_t ∈ F_t.

3. Independence of future increments: if u > t ≥ 0 then (W_u − W_t) ⊥⊥ F_t.

The most natural choice for this filtration F is the natural filtration for W. That is Ft = σ(Wu , 0 ≤ u ≤
t ). In principle the filtration (Ft )t ≥0 could contain more than the information obtained by observing W.
However, the information in the filtration is not allowed to destroy the independence of future increments
of Brownian motion.

Not surprisingly, if F = (F_t)_{t≥0} is a filtration for a Brownian motion W, then W is a martingale with
respect to this filtration. To see this, let 0 ≤ s < t and observe that

E[W_t | F_s] = E[W_t − W_s + W_s | F_s] = E[W_t − W_s | F_s] + E[W_s | F_s] = E[W_t − W_s] + W_s = W_s.

7.3 Quadratic variation

In Section 7.1, we computed the quadratic variation of a symmetric random walk M and a scaled symmetric
random walk W^{(n)}. However, we did not rigorously define the notion of quadratic variation. In this
section we will define what we mean by "quadratic variation" and we will compute this quantity for a
Brownian motion W. Before defining quadratic variation we first introduce the "first variation."

Let Π be a partition of [0, T] and define ‖Π‖ as

Π = {t₀, t₁, . . . , t_n},     0 = t₀ < t₁ < . . . < t_n = T,     ‖Π‖ = max_i (t_{i+1} − t_i).     (7.4)

Let f : [0, T] → R. We define the first variation of f up to time T, denoted FV_T(f), by

FV_T(f) := lim_{‖Π‖→0} Σ_{j=0}^{n−1} |f(t_{j+1}) − f(t_j)|.

Suppose that f ∈ C([0, T]) and f′(t) exists and is finite for all t ∈ (0, T). Then, by the Mean Value
Theorem, there exists t_j* ∈ [t_j, t_{j+1}] such that

f(t_{j+1}) − f(t_j) = f′(t_j*)(t_{j+1} − t_j).

Thus, if f ∈ C([0, T]) and f′(t) exists and is finite for all t ∈ (0, T), we have

FV_T(f) := lim_{‖Π‖→0} Σ_{j=0}^{n−1} |f′(t_j*)|(t_{j+1} − t_j) = ∫₀ᵀ |f′(t)| dt.

Definition 7.3.1. Let f : [0, T] → R. We define the quadratic variation of f up to time T, denoted
[f, f]_T, as

[f, f]_T := lim_{‖Π‖→0} Σ_{j=0}^{n−1} [f(t_{j+1}) − f(t_j)]²,

where Π and ‖Π‖ are as defined in (7.4).

Proposition 7.3.2. Suppose f : [0, T] → R has a continuous first derivative: f ∈ C¹((0, T)). Then
[f, f]_T = 0.

Proof. For any partition Π of [0, T] we have

Σ_{j=0}^{n−1} [f(t_{j+1}) − f(t_j)]² = Σ_{j=0}^{n−1} [f′(t_j*)]²(t_{j+1} − t_j)² ≤ ‖Π‖ Σ_{j=0}^{n−1} [f′(t_j*)]²(t_{j+1} − t_j).

Inserting the above inequality into the definition of [f, f]_T, we obtain

[f, f]_T ≤ lim_{‖Π‖→0} ‖Π‖ · lim_{‖Π‖→0} Σ_{j=0}^{n−1} [f′(t_j*)]²(t_{j+1} − t_j) = 0 · ∫₀ᵀ |f′(t)|² dt = 0,

where we have used ∫₀ᵀ |f′(t)|² dt < ∞ as, by assumption, f ∈ C¹((0, T)).

In ordinary calculus, we typically deal with functions f ∈ C¹, and hence [f, f]_T = 0. For this reason,
quadratic variation never arises in usual calculus. However, it turns out that for almost every ω ∈ Ω, we
have that t ↦ W_t(ω) is not differentiable. We can see this from the scaled random walk construction of
Brownian motion. The slope of the scaled random walk W^{(n)} at any t for which (d/dt)W_t^{(n)}(ω) is defined
is

lim_{ε→0} (W_{t+ε}^{(n)} − W_t^{(n)})/ε = ±√n → ±∞     as n → ∞.

Thus, Brownian motion, which we constructed as a limit of a scaled random walk W(ω) := lim_{n→∞} W^{(n)}(ω),
is not differentiable at any t, P-a.s. When a function f ∉ C¹((0, T)), then Proposition 7.3.2 can fail.
Indeed, as we will show, paths of BM have strictly positive quadratic variation. It is for this reason that
stochastic calculus is different from ordinary calculus.

Theorem 7.3.3. Let W be a Brownian motion. Then, for all T ≥ 0 we have [W, W]_T = T almost
surely.

Proof. For a fixed partition Π = {0 = t₀, t₁, . . . , t_n = T} we define the sampled quadratic variation of
W, denoted Q_Π, by

Q_Π := Σ_{j=0}^{n−1} (W_{t_{j+1}} − W_{t_j})².

Note that Q_Π → [W, W]_T as ‖Π‖ → 0. We will show that EQ_Π → T and VQ_Π → 0. Using the fact that
W_{t_{j+1}} − W_{t_j} ∼ N(0, t_{j+1} − t_j), we compute

E(W_{t_{j+1}} − W_{t_j})² = t_{j+1} − t_j,     (7.5)

V(W_{t_{j+1}} − W_{t_j})² = E(W_{t_{j+1}} − W_{t_j})⁴ − (E(W_{t_{j+1}} − W_{t_j})²)²
                       = 3(t_{j+1} − t_j)² − (t_{j+1} − t_j)² = 2(t_{j+1} − t_j)².     (7.6)

Using (7.5) and (7.6), we obtain

EQ_Π = Σ_{j=0}^{n−1} E(W_{t_{j+1}} − W_{t_j})² = Σ_{j=0}^{n−1} (t_{j+1} − t_j) = T,

VQ_Π = Σ_{j=0}^{n−1} V(W_{t_{j+1}} − W_{t_j})² = Σ_{j=0}^{n−1} 2(t_{j+1} − t_j)² ≤ ‖Π‖ Σ_{j=0}^{n−1} 2(t_{j+1} − t_j) = 2‖Π‖T.

Thus, EQ_Π → T and VQ_Π → 0 as ‖Π‖ → 0, which proves that [W, W]_T = lim_{‖Π‖→0} Q_Π = T.

Suppose dt ≪ 1 and define dW_t := W_{t+dt} − W_t. The above computations show that E(dW_t)² = dt and
V((dW_t)²) = 2dt². Since dt² is practically zero for dt ≪ 1, one can imagine that (dW_t)² is almost equal
to the constant dt. Informally, we write this as

dW_t dW_t = dt.     (7.7)

This informal statement, while not rigorously correct, captures the spirit of the quadratic variation
computation for W.
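A short simulation makes Theorem 7.3.3 concrete. The sketch below is my own illustration (T = 1 and the grid sizes are arbitrary assumptions): it samples one Brownian path on a fine uniform grid and shows the sampled quadratic variation Q_Π approaching T as the partition refines.

```python
import numpy as np

rng = np.random.default_rng(3)
T, n_fine = 1.0, 2**20

# one Brownian path on a fine uniform grid, via independent N(0, dt) increments
dt = T / n_fine
W = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), n_fine))])

for n in [2**4, 2**8, 2**12, 2**16]:
    idx = np.arange(0, n_fine + 1, n_fine // n)   # coarser partition of [0, T]
    Q = np.sum(np.diff(W[idx]) ** 2)              # sampled quadratic variation Q_Pi
    print(n, Q)                                   # approaches T = 1 as n grows
```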

Definition 7.3.4. Let f, g : [0, T] → R. We define the covariation of f and g up to time T, denoted
[f, g]_T, as

[f, g]_T := lim_{‖Π‖→0} Σ_{j=0}^{n−1} [f(t_{j+1}) − f(t_j)][g(t_{j+1}) − g(t_j)],

where Π and ‖Π‖ are as defined in (7.4).

Theorem 7.3.5. Let W be a Brownian motion and let Id be the identity function: Id(t) = t. Then,
for all T ≥ 0 we have [W, Id]_T = 0 almost surely and [Id, Id]_T = 0.

Proof. For a fixed partition Π = {0 = t₀, t₁, . . . , t_n = T} we have

[W, Id]_T = lim_{‖Π‖→0} Σ_{j=0}^{n−1} (W_{t_{j+1}} − W_{t_j}) · (t_{j+1} − t_j),

so that

|[W, Id]_T| ≤ lim_{‖Π‖→0} Σ_{j=0}^{n−1} |W_{t_{j+1}} − W_{t_j}| · (t_{j+1} − t_j)
           ≤ T · lim_{‖Π‖→0} max_{0≤j≤n−1} |W_{t_{j+1}} − W_{t_j}| = T · 0 = 0,

where, in the last step, we are using the fact that W is continuous (and hence uniformly continuous on
[0, T]). To see that [Id, Id]_T = 0, simply note that Id ∈ C¹([0, T]) and use Proposition 7.3.2.

Just as (7.7) captures the spirit of the computation of [W, W]T , the following equations

dWt dt = 0, dt dt = 0,

informally capture the spirit of the [W, Id]T and [Id, Id]T computations.

7.4 Markov property of Brownian motion

Theorem 7.4.1. Let W = (Wt )t ≥0 be a Brownian motion and let (Ft )t ≥0 be a filtration for this
Brownian motion. Then W is a Markov process.
Proof. According to Definition 2.3.6, we must show that there exists a function g such that

E[f(W_T) | F_t] = g(W_t),

where T ≥ t ≥ 0. Noting that W_t ∈ F_t and W_T − W_t ⊥⊥ F_t, we have

E[f(W_T) | F_t] = E[f(W_T − W_t + W_t) | F_t] = ∫_R f(y) Γ(t, W_t; T, y) dy =: g(W_t),

where Γ(t, x; T, ·) is the density of a normal random variable with mean x and variance T − t.

7.5 First hitting time of Brownian motion

Theorem 7.5.1. Let W = (W_t)_{t≥0} be a Brownian motion with a filtration (F_t)_{t≥0}. Define a process
Z = (Z_t)_{t≥0} by

Z_t = e^{−σ²t/2 + σW_t}.     (7.8)

Then Z is a martingale with respect to (F_t)_{t≥0}.


Proof. The proof is a simple computation. Fixing T ≥ t we have

E[Z_T | F_t] = E[e^{−σ²T/2 + σW_T} | F_t] = e^{−σ²T/2 + σW_t} E[e^{σ(W_T − W_t)} | F_t]
            = e^{−σ²T/2 + σW_t} e^{σ²(T−t)/2} = e^{−σ²t/2 + σW_t} = Z_t.

Thus, Z is a martingale.

The process Z is sometimes referred to as an exponential martingale. The exponential martingale will
be used to compute the distribution of

τm := inf{t ≥ 0 : Wt = m}. (7.9)

We call τm the first hitting time or first passage time of a Brownian motion W to level m.

Theorem 7.5.2. For any m ∈ R and α > 0 we have

Ee^{−ατ_m} = e^{−|m|√(2α)},     (7.10)

where τ_m is defined in (7.9).

Proof. First, assume m ≥ 0. Let Z^{(m)} = (Z_t^{(m)})_{t≥0} be given by

Z_t^{(m)} := Z_{t∧τ_m},

where Z is given by (7.8). We call Z^{(m)} a stopped process as it remains at Z_{τ_m} forever after W hits m.
As Z is a martingale, it follows that the stopped process Z^{(m)} is also a martingale. Thus, we have

1 = Z_0^{(m)} = EZ_t^{(m)} = Ee^{−σ²(t∧τ_m)/2 + σW_{t∧τ_m}}.

Now, it turns out that P(τ_m < ∞) = 1 (a fact that can be proved with relative ease). As a result, we
have lim_{t→∞} t ∧ τ_m = τ_m and W_{τ_m} = m. Thus, we obtain

1 = lim_{t→∞} Ee^{−σ²(t∧τ_m)/2 + σW_{t∧τ_m}} = E lim_{t→∞} e^{−σ²(t∧τ_m)/2 + σW_{t∧τ_m}} = Ee^{−σ²τ_m/2 + σm},

where passing the limit through the expectation is justified by dominated convergence, as the integrand is
bounded by e^{σm}. Setting σ = √(2α) we obtain Ee^{−ατ_m} = e^{−m√(2α)}, which agrees with (7.10) for m ≥ 0.
To obtain (7.10) for m < 0, simply note that, since Brownian motion is symmetric about zero, the
distribution of τ_m is the same as the distribution of τ_{−m}.

We can compute Eτ_m as follows:

Eτ_m = lim_{α↘0} −(d/dα) Ee^{−ατ_m} = lim_{α↘0} −(d/dα) e^{−|m|√(2α)} = lim_{α↘0} (|m|/√(2α)) e^{−|m|√(2α)} = ∞,     m ≠ 0.

Thus, while P(τ_m < ∞) = 1 we have Eτ_m = ∞.
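Both conclusions can be checked by simulation. Since 2P(W_t ≥ m) = P(τ_m ≤ t) = P(m²/Z² ≤ t) for Z ∼ N(0, 1) (this identity is derived in the next section), τ_m has the same law as m²/Z², which lets us sample it directly. The sketch below is my own illustration; m = 1 and α = 0.5 are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
m, alpha, paths = 1.0, 0.5, 1_000_000

# P(tau_m <= t) = 2 P(W_t >= m) = P(m^2/Z^2 <= t) with Z ~ N(0,1),
# so tau_m has the same distribution as m^2 / Z^2.
Z = rng.standard_normal(paths)
tau = m**2 / Z**2

print("Monte Carlo E[exp(-alpha tau_m)]:", np.exp(-alpha * tau).mean())
print("closed form exp(-|m| sqrt(2 alpha)):", np.exp(-abs(m) * np.sqrt(2 * alpha)))
```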


Figure 7.1: An illustration of the reflection principle for Brownian motion.

7.6 Reflection principle

For every sample path of Brownian motion that hits a level m > 0 prior to time t and finishes at level
Wt = w ≤ m there is an equally likely path that finishes at a level Wt = 2m – w . In Figure 7.1 these two
paths are represented by the solid and dashed lines, respectively. The dotted lines from top to bottom
represent the levels 2m – w , m and w . It follows that

P(τm ≤ t , Wt ≤ w ) = P(Wt ≥ 2m – w ), m > 0, w ≤ m. (7.11)

We will use this fact to find the density of τm .

Theorem 7.6.1. For all m ≠ 0, the first hitting time τ_m of Brownian motion to level m has a
density f_{τ_m}, which is given by

f_{τ_m}(t) = 1_{t≥0} (|m|/(t√(2πt))) e^{−m²/(2t)}.     (7.12)

Proof. Consider the case m > 0. Substitute w = m into (7.11) to obtain

P(τm ≤ t , Wt ≤ m) = P(Wt ≥ m).


Now, note that Wt ≥ m implies that τm ≤ t . And thus

P(τm ≤ t , Wt ≥ m) = P(Wt ≥ m).

Adding the above two equations, we obtain

2P(Wt ≥ m) = P(τm ≤ t , Wt ≤ m) + P(τm ≤ t , Wt ≥ m) = P(τm ≤ t ).

The density (7.12) is obtained using

f_{τ_m}(t) = (d/dt) P(τ_m ≤ t) = 2 (d/dt) P(W_t ≥ m) = 2 (d/dt) ∫_m^∞ (1/√(2πt)) exp(−x²/(2t)) dx
          = 2 (d/dt) ∫_{m/√t}^∞ (1/√(2π)) exp(−y²/2) dy = (m/(t√(2πt))) e^{−m²/(2t)}.

The case m ≤ 0 is proved in a similar manner.

Now, let us define the running maximum, denoted W̄ = (W̄_t)_{t≥0}, of Brownian motion:

W̄_t = max_{0≤s≤t} W_s.

Note that W̄_t ≥ m if and only if τ_m ≤ t. Thus, we have from (7.11) that

P(W̄_t ≥ m, W_t ≤ w) = P(W_t ≥ 2m − w),     m > 0, w ≤ m.     (7.13)

We can use (7.13) to obtain the joint density of (W̄_t, W_t).

Theorem 7.6.2. For any t > 0, the joint density of Brownian motion W_t and its running maximum
W̄_t is

f_{W̄_t, W_t}(w, m) = (2(2m − w)/(t√(2πt))) e^{−(2m−w)²/(2t)},     m > 0, w ≤ m.     (7.14)

Proof. Note that

P(W̄_t ≤ m, W_t ≤ w) = P(W_t ≤ w) − P(W̄_t ≥ m, W_t ≤ w)
                   = P(W_t ≤ w) − P(W_t ≥ 2m − w) = P(W_t ≤ w) − ∫_{2m−w}^∞ f_{W_t}(x) dx,

where f_{W_t} is the density of W_t, which is a N(0, t) random variable. To obtain the density (7.14) use

f_{W̄_t, W_t}(w, m) = ∂²/∂m∂w P(W̄_t ≤ m, W_t ≤ w) = −∂²/∂m∂w ∫_{2m−w}^∞ f_{W_t}(x) dx.

The rest of the computation is algebra.

Corollary 7.6.3. For any t > 0, the conditional density of W̄_t given W_t is given by

f_{W̄_t|W_t}(m, w) = (2(2m − w)/t) e^{−2m(m−w)/t},     m > 0, w ≤ m.     (7.15)

Proof. Expression (7.15) follows directly from

f_{W̄_t|W_t}(m, w) = f_{W̄_t, W_t}(w, m) / f_{W_t}(w),

where f_{W_t} is the density of W_t, which is N(0, t).
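These reflection-principle formulas lend themselves to a quick Monte Carlo sanity check. The sketch below is my own illustration (t = 1 and m = 0.8 are arbitrary assumptions); because paths are sampled on a discrete grid, the simulated maximum slightly undershoots the true one.

```python
import numpy as np

rng = np.random.default_rng(5)
t, m, n, paths = 1.0, 0.8, 2000, 10_000

dW = rng.normal(0.0, np.sqrt(t / n), size=(paths, n))
W = np.cumsum(dW, axis=1)
W_t = W[:, -1]                              # terminal value W_t
W_max = np.maximum(W.max(axis=1), 0.0)      # running maximum over [0, t]

print("P(max >= m):  ", (W_max >= m).mean())
print("2 P(W_t >= m):", 2 * (W_t >= m).mean())   # equal by the reflection principle
```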

7.7 Exercises
Exercise 7.1. Let W be a Brownian motion and let F be a filtration for W. Show that W_t² − t is a
martingale with respect to the filtration F.

Exercise 7.2. Compute the characteristic function of W_{N_t} where N is a Poisson process with intensity
λ and the Brownian motion W is independent of the Poisson process N.

Exercise 7.3. The m-th variation of a function f over the interval [0, T] is defined as

V_T(m, f) := lim_{‖Π‖→0} Σ_{j=0}^{n−1} |f(t_{j+1}) − f(t_j)|^m,     Π = {0 = t₀, t₁, . . . , t_n = T},     ‖Π‖ = max_j (t_{j+1} − t_j).

Show that V_T(1, W) = ∞ and V_T(3, W) = 0, where W is a Brownian motion.

Exercise 7.4. Define

X_t = µt + W_t,     τ_m := inf{t ≥ 0 : X_t = m},

where W = (W_t)_{t≥0} is a Brownian motion. Let F = (F_t)_{t≥0} be a filtration for W. Show that Z is a
martingale with respect to F where

Z_t = exp(σX_t − (σµ + σ²/2)t).

Assume µ > 0 and m ≥ 0. Assume further that τ_m < ∞ with probability one and the stopped process
Z_{t∧τ_m} is a martingale. Find the Laplace transform Ee^{−ατ_m}.
Chapter 8

Stochastic calculus

The notes from this chapter are taken primarily from (Shreve, 2004, Chapters 4 and 5).

8.1 Itô Integrals

If a function g : [0, T] → R satisfies g ∈ C¹([0, T]), then we can define

∫₀ᵀ f(t) dg(t) := ∫₀ᵀ f(t) g′(t) dt,

where the right-hand side is a Riemann integral with respect to t. However, as Brownian motion is not
differentiable, this method of integration will not work when we consider integrals of the form ∫₀ᵀ ∆_t dW_t.
Our goal is to make sense of ∫₀ᵀ ∆_t dW_t.

Assumption 8.1.1. In what follows W = (Wt )t ≥0 will always represent a Brownian motion and
F = (Ft )t ≥0 will always be a filtration for this Brownian motion. We shall assume the integrand
∆ = (∆t )t ≥0 is adapted to F, meaning ∆t ∈ Ft for all t .

Note that the process ∆ can and, in many cases, will be random. However, the information available in
F_t will always be sufficient to determine the value of ∆_t at time t. Also note, since (W_T − W_t) ⊥⊥ F_t
for T > t, it follows that (W_T − W_t) ⊥⊥ ∆_t. In other words, future increments of Brownian motion are
independent of the ∆ process.

Itô integrals for simple integrands

To begin let us assume that ∆ is a simple process, meaning ∆ is of the form

∆_t = Σ_{j=0}^{n−1} ∆_{t_j} 1_{[t_j, t_{j+1})}(t),     0 = t₀ < t₁ < . . . < t_n = T,     ∆_{t_j} ∈ F_{t_j}.


Since the process ∆ is constant over intervals of the form [t_j, t_{j+1}), it makes sense to define

I_T = ∫₀ᵀ ∆_t dW_t := Σ_{j=0}^{n−1} ∆_{t_j}(W_{t_{j+1}} − W_{t_j}).     (for ∆ a simple process)     (8.1)

Let us establish some properties of the process I = (It )t ≥0 .

Theorem 8.1.2. The process I = (I_t)_{t≥0} defined in (8.1) is a martingale with respect to the filtration
(F_t)_{t≥0}.

Proof. Without loss of generality assume T = t_n and t = t_i for some 0 ≤ i ≤ n − 1 (we can always
re-define our time grid so that this is true). Then we have

E[I_T | F_t] = Σ_{j=0}^{n−1} E[∆_{t_j}(W_{t_{j+1}} − W_{t_j}) | F_{t_i}]
            = Σ_{j=0}^{i−1} ∆_{t_j}(W_{t_{j+1}} − W_{t_j}) + Σ_{j=i}^{n−1} E[∆_{t_j}(W_{t_{j+1}} − W_{t_j}) | F_{t_i}]
            = Σ_{j=0}^{i−1} ∆_{t_j}(W_{t_{j+1}} − W_{t_j}) + Σ_{j=i}^{n−1} E[∆_{t_j} E[W_{t_{j+1}} − W_{t_j} | F_{t_j}] | F_{t_i}]
            = Σ_{j=0}^{i−1} ∆_{t_j}(W_{t_{j+1}} − W_{t_j}) = I_t.

Thus, the process I is a martingale, as claimed.

Note that I0 = 0 and therefore EIt = 0 for all t ≥ 0.

Theorem 8.1.3 (Itô Isometry). The process I = (I_t)_{t≥0} defined in (8.1) satisfies

VI_T = EI_T² = E ∫₀ᵀ ∆_t² dt.     (8.2)

Proof. We have

EI_T² = Σ_{j=0}^{n−1} Σ_{i=0}^{n−1} E ∆_{t_i}∆_{t_j}(W_{t_{i+1}} − W_{t_i})(W_{t_{j+1}} − W_{t_j})
     = Σ_{j=0}^{n−1} E ∆_{t_j}²(W_{t_{j+1}} − W_{t_j})² + 2 Σ_{j=0}^{n−1} Σ_{i=0}^{j−1} E ∆_{t_i}∆_{t_j}(W_{t_{i+1}} − W_{t_i})(W_{t_{j+1}} − W_{t_j})
     = Σ_{j=0}^{n−1} E ∆_{t_j}² E[(W_{t_{j+1}} − W_{t_j})² | F_{t_j}] + 2 Σ_{j=0}^{n−1} Σ_{i=0}^{j−1} E ∆_{t_i}∆_{t_j}(W_{t_{i+1}} − W_{t_i}) E[W_{t_{j+1}} − W_{t_j} | F_{t_j}]
     = Σ_{j=0}^{n−1} E ∆_{t_j}²(t_{j+1} − t_j) = E ∫₀ᵀ ∆_t² dt.

Theorem 8.1.4. The process I = (I_t)_{t≥0} defined in (8.1) satisfies

[I, I]_T = ∫₀ᵀ ∆_t² dt.     (8.3)

Proof. Fix a partition Π of [t_j, t_{j+1}]:

t_j = s₀ < s₁ < . . . < s_m = t_{j+1}.

From the definition of quadratic variation, we compute

[I, I]_{t_{j+1}} − [I, I]_{t_j} = lim_{‖Π‖→0} Σ_{i=0}^{m−1} (I_{s_{i+1}} − I_{s_i})² = ∆_{t_j}² lim_{‖Π‖→0} Σ_{i=0}^{m−1} (W_{s_{i+1}} − W_{s_i})²
                           = ∆_{t_j}² ([W, W]_{t_{j+1}} − [W, W]_{t_j}) = ∆_{t_j}² (t_{j+1} − t_j).

Thus, we obtain

[I, I]_T = Σ_{j=0}^{n−1} ([I, I]_{t_{j+1}} − [I, I]_{t_j}) = Σ_{j=0}^{n−1} ∆_{t_j}² (t_{j+1} − t_j) = ∫₀ᵀ ∆_t² dt,

as claimed.

When we computed VWT and [W, W]T we found that these two quantities were equal, even though the
computations for these quantities were completely different. From (8.2) and (8.3) we now see how the
variance and quadratic variation of a stochastic process can be different. Note that VIT is a non-random
constant, whereas [I, I]T is random.

Itô integrals for general integrands

Clearly, it is rather restrictive to limit ourselves to integrands ∆ that are simple processes. We now
allow the process ∆ to be any process that is adapted to F = (F_t)_{t≥0} and which satisfies the following
integrability condition:

E ∫₀ᵀ ∆_t² dt < ∞.     (8.4)

To construct an Itô integral with ∆ as the integrand, we first approximate ∆ by a simple process

∆_t ≈ ∆_t^{(n)} := Σ_{j=0}^{n−1} ∆_{t_j} 1_{[t_j, t_{j+1})}(t),     0 ≤ t₀ < t₁ < . . . < t_n = T.

As n → ∞ the process ∆^{(n)} converges to ∆ in the sense that

lim_{n→∞} E ∫₀ᵀ (∆_t^{(n)} − ∆_t)² dt = 0.     (8.5)

We now define the Itô integral for a general integrand ∆ by

I_T ≡ ∫₀ᵀ ∆_t dW_t := lim_{n→∞} ∫₀ᵀ ∆_t^{(n)} dW_t.     (8.6)

Note that the integrals ∫₀ᵀ ∆_t^{(n)} dW_t are well-defined for every n, since ∆^{(n)} is a simple process. Furthermore,
the condition (8.5) ensures that the limit exists in L²(Ω, F, P). The Itô integral for general integrands
inherits the properties we established for simple integrands.

Theorem 8.1.5. Let W be a Brownian motion and let F = (F_t)_{t≥0} be a filtration for this Brownian
motion. Let ∆ = (∆_t)_{0≤t≤T} be adapted to the filtration F and satisfy (8.4). Define I = (I_t)_{0≤t≤T}
by I_t = ∫₀ᵗ ∆_s dW_s, where the integral is defined as in (8.6). Then the process I has the following
properties.

1. The sample paths of I are continuous.

2. The process I is adapted to the filtration F. That is, I_t ∈ F_t for all t.

3. If Γ = (Γ_t)_{0≤t≤T} satisfies the same conditions as ∆, then

   ∫₀ᵀ (a∆_t + bΓ_t) dW_t = a ∫₀ᵀ ∆_t dW_t + b ∫₀ᵀ Γ_t dW_t,

   where a and b are constants.

4. The process I is a martingale with respect to the filtration F.

5. We have the Itô isometry EI_T² = E ∫₀ᵀ ∆_t² dt.

6. The quadratic variation of I is given by [I, I]_T = ∫₀ᵀ ∆_t² dt.

8.2 Itô formula

If f and g are differentiable functions then we can compute

(d/dt) f(g(t)) = f′(g(t)) · g′(t).

Thus, multiplying through by dt we obtain

df(g(t)) = f′(g(t)) · g′(t) dt = f′(g(t)) dg(t),     dg(t) = g′(t) dt.

This formula is correct in the sense that

f(g(T)) − f(g(0)) = ∫₀ᵀ df(g(t)) = ∫₀ᵀ f′(g(t)) · g′(t) dt = ∫₀ᵀ f′(g(t)) dg(t).

Although paths of a Brownian motion W are not differentiable, we might suspect that

f(W_T) − f(W_0) = ∫₀ᵀ df(W_t) = ∫₀ᵀ f′(W_t) dW_t.     (Not correct!)     (8.7)

Unfortunately, as indicated above, equation (8.7) is not correct.

Theorem 8.2.1. Let W = (W_t)_{t≥0} be a Brownian motion and suppose f : R → R satisfies f ∈ C²(R).
Then, for any T ≥ 0 we have

f(W_T) − f(W_0) = ∫₀ᵀ f′(W_t) dW_t + (1/2) ∫₀ᵀ f″(W_t) dt.     (8.8)

Proof. We shall simply sketch the proof of Theorem 8.2.1. Suppose for simplicity that f is analytic
(i.e., that f is equal to its power series expansion at every point). Let 0 = t₀ < t₁ < . . . < t_n = T be a
partition Π of [0, T]. Then

∫₀ᵀ df(W_t) = f(W_T) − f(W_0) = Σ_{j=0}^{n−1} (f(W_{t_{j+1}}) − f(W_{t_j})) = Σ_{j=0}^{n−1} (A_j + B_j + C_j),

where we have defined

A_j := f′(W_{t_j})(W_{t_{j+1}} − W_{t_j}),
B_j := (1/2) f″(W_{t_j})(W_{t_{j+1}} − W_{t_j})²,
C_j := (1/3!) f‴(W_{t_j})(W_{t_{j+1}} − W_{t_j})³ + . . . .

In the limit as ‖Π‖ → 0 we have

Σ_{j=0}^{n−1} A_j → ∫₀ᵀ f′(W_t) dW_t,     Σ_{j=0}^{n−1} B_j → (1/2) ∫₀ᵀ f″(W_t) dt,     Σ_{j=0}^{n−1} C_j → 0.

Example 8.2.2. What is ∫₀ᵀ W_t dW_t? To answer this question, consider f(W_t) with f(x) = x².
According to equation (8.8) we have

W_T² − W_0² = ∫₀ᵀ 2W_t dW_t + ∫₀ᵀ dt,

where we have used f′(x) = 2x and f″(x) = 2. Noting that W_0 = 0 and solving for ∫₀ᵀ W_t dW_t, we obtain

∫₀ᵀ W_t dW_t = (1/2) W_T² − (1/2) T.
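This identity is easy to verify numerically by forming the defining Itô sums, with the integrand sampled at the left endpoint of each subinterval. A minimal sketch (my own illustration; T and the grid size are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
T, n = 1.0, 100_000

dW = rng.normal(0.0, np.sqrt(T / n), n)
W = np.concatenate([[0.0], np.cumsum(dW)])

# Ito sum: integrand sampled at the LEFT endpoint of each subinterval
ito_sum = np.sum(W[:-1] * dW)

print("Riemann-Ito sum:  ", ito_sum)
print("0.5 W_T^2 - 0.5 T:", 0.5 * W[-1]**2 - 0.5 * T)
```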

Not surprisingly, there are stochastic processes that are not adequately described by Brownian motion
alone. However, a large class of stochastic processes can be constructed from Brownian motion.

Definition 8.2.3. Let W = (W_t)_{t≥0} be a Brownian motion and let F = (F_t)_{t≥0} be a filtration for this
Brownian motion. An Itô process is any process X = (X_t)_{t≥0} of the form

X_t = X_0 + ∫₀ᵗ Θ_s ds + ∫₀ᵗ ∆_s dW_s,     (8.9)

where Θ = (Θ_t)_{t≥0} and ∆ = (∆_t)_{t≥0} are adapted to the filtration F and satisfy

∫₀ᵀ |Θ_t| dt < ∞,     E ∫₀ᵀ ∆_t² dt < ∞,     ∀ T ≥ 0,

and X_0 is not random.

We sometimes write an Itô process in differential form

dX_t = Θ_t dt + ∆_t dW_t.     (8.10)

Expression (8.10) literally means that X satisfies (8.9). Informally, the differential form can be understood
as follows: in a small interval of time δt , the process X changes according to

Xt +δt – Xt ≈ Θt δt + ∆t (Wt +δt – Wt ) . (8.11)

In fact, noting that Wt +δt – Wt ∼ N(0, δt ) and Wt +δt – Wt ⊥⊥ Ft , one can use expression (8.11) to
simulate the increment Xt +δt – Xt . This way of simulating X is called the Euler scheme and is the
workhorse of many Monte Carlo methods.
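As an illustration (a sketch under assumed dynamics, not a recipe from the text), here is the Euler scheme (8.11) applied to dX_t = µX_t dt + σX_t dW_t with constant µ and σ; since the dW-term has zero mean, EX_T = X_0 e^{µT}, which the simulation reproduces. The parameter values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, X0, T = 0.05, 0.2, 1.0, 1.0
n, paths = 500, 50_000

dt = T / n
X = np.full(paths, X0)
for _ in range(n):
    dW = rng.normal(0.0, np.sqrt(dt), paths)
    X = X + mu * X * dt + sigma * X * dW   # Euler step (8.11): Theta_t = mu X_t, Delta_t = sigma X_t

print("Monte Carlo E[X_T]:        ", X.mean())
print("exact E[X_T] = X0 e^{mu T}:", X0 * np.exp(mu * T))
```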

Lemma 8.2.4. The quadratic variation [X, X]_T of an Itô process (8.9) is given by

[X, X]_T = ∫₀ᵀ ∆_t² dt.

Proof. We sketch the proof of Lemma 8.2.4. Let 0 = t₀ < t₁ < . . . < t_n = T be a partition Π of [0, T].
By definition we have

[X, X]_T = lim_{‖Π‖→0} Σ_{j=0}^{n−1} (X_{t_{j+1}} − X_{t_j})² = lim_{‖Π‖→0} Σ_{j=0}^{n−1} (A_j + B_j + C_j),

where we have defined

A_j := (I_{t_{j+1}} − I_{t_j})²,
B_j := (J_{t_{j+1}} − J_{t_j})²,
C_j := 2(I_{t_{j+1}} − I_{t_j})(J_{t_{j+1}} − J_{t_j}),

with

I_t = ∫₀ᵗ ∆_s dW_s,     J_t = ∫₀ᵗ Θ_s ds.

In the limit as ‖Π‖ → 0 we obtain

Σ_{j=0}^{n−1} A_j → [I, I]_T,     Σ_{j=0}^{n−1} B_j → 0,     Σ_{j=0}^{n−1} C_j → 0.

The proof of these limits is similar to the proof of Theorem 8.1.4.


Definition 8.2.5. Let X = (X_t)_{t≥0} be an Itô process, as described in Definition 8.2.3. Let Γ = (Γ_t)_{t≥0}
be adapted to the filtration of the Brownian motion F = (F_t)_{t≥0}. We define

∫₀ᵀ Γ_t dX_t := ∫₀ᵀ Γ_t Θ_t dt + ∫₀ᵀ Γ_t ∆_t dW_t,

where we assume

∫₀ᵀ |Γ_t Θ_t| dt < ∞,     E ∫₀ᵀ (Γ_t ∆_t)² dt < ∞,     ∀ T ≥ 0.

Theorem 8.2.6 (Itô formula in one dimension). Let X = (X_t)_{t≥0} be an Itô process and suppose
f : R → R satisfies f ∈ C²(R). Then, for any T ≥ 0 we have

f(X_T) − f(X_0) = ∫₀ᵀ f′(X_t) dX_t + (1/2) ∫₀ᵀ f″(X_t) d[X, X]_t.

Proof. The proof of Theorem 8.2.6 is very similar to the proof of Theorem 8.2.1. We outline the proof
here. Suppose for simplicity that f is analytic (i.e., that f is equal to its power series expansion at every
point). Let 0 = t₀ < t₁ < . . . < t_n = T be a partition Π of [0, T]. Then

∫₀ᵀ df(X_t) = f(X_T) − f(X_0) = Σ_{j=0}^{n−1} (f(X_{t_{j+1}}) − f(X_{t_j})) = Σ_{j=0}^{n−1} (A_j + B_j + C_j),

where we have defined

A_j := f′(X_{t_j})(X_{t_{j+1}} − X_{t_j}),
B_j := (1/2) f″(X_{t_j})(X_{t_{j+1}} − X_{t_j})²,
C_j := (1/3!) f‴(X_{t_j})(X_{t_{j+1}} − X_{t_j})³ + . . . .

In the limit as ‖Π‖ → 0 we have

Σ_{j=0}^{n−1} A_j → ∫₀ᵀ f′(X_t) dX_t,     Σ_{j=0}^{n−1} B_j → (1/2) ∫₀ᵀ f″(X_t) d[X, X]_t,     Σ_{j=0}^{n−1} C_j → 0.

In differential form, with X given by (8.10), Itô's formula becomes

df(X_t) = f′(X_t) dX_t + (1/2) f″(X_t) d[X, X]_t
        = f′(X_t)(Θ_t dt + ∆_t dW_t) + (1/2) f″(X_t) ∆_t² dt,     (8.12)

where we have used d[X, X]_t = ∆_t² dt. Perhaps the easiest way to remember (8.12) is to use the following
two-step procedure:

1. Expand f(X_t + dX_t) − f(X_t) to second order about the point X_t:

   df(X_t) = f(X_t + dX_t) − f(X_t) = f′(X_t) dX_t + (1/2) f″(X_t)(dX_t)².     (8.13)

2. Insert the differential dX_t = Θ_t dt + ∆_t dW_t into (8.13), expand (dX_t)² and use the rules

   dW_t dW_t = dt,     dW_t dt = 0,     dt dt = 0.

The resulting formula gives the correct expression for df(X_t).

Example 8.2.7. Let X be an Itô process with the following dynamics

dX_t = µ_t X_t dt + σ_t X_t dW_t.     (8.14)

Assuming µ = (µ_t)_{t≥0} and σ = (σ_t)_{t≥0} are bounded above and below and X_0 > 0, the process X remains
strictly positive. We call X a generalized geometric Brownian motion. The "geometric" part refers to
the fact that the relative step size dX_t/X_t has dynamics µ_t dt + σ_t dW_t. The "generalized" part refers to
the fact that the processes σ and µ are stochastic rather than constant. Define Y_t = X_t^p. What is dY_t?
Let f(x) = x^p. Then f′(x) = px^{p−1} and f″(x) = p(p − 1)x^{p−2}. Thus, we have

dY_t = df(X_t) = pX_t^{p−1} dX_t + (1/2) p(p − 1) X_t^{p−2} (dX_t)²
     = pX_t^{p−1}(µ_t X_t dt + σ_t X_t dW_t) + (1/2) p(p − 1) X_t^{p−2}(µ_t X_t dt + σ_t X_t dW_t)²
     = pX_t^{p−1}(µ_t X_t dt + σ_t X_t dW_t) + (1/2) p(p − 1) X_t^{p−2} σ_t² X_t² dt
     = (pµ_t + (1/2) p(p − 1)σ_t²) X_t^p dt + pσ_t X_t^p dW_t
     = (pµ_t + (1/2) p(p − 1)σ_t²) Y_t dt + pσ_t Y_t dW_t.

We see from the last line that Y = (Y_t)_{t≥0} is also a generalized geometric Brownian motion.

Example 8.2.8. Let X have generalized geometric Brownian motion dynamics as in (8.14). We would
like to find an explicit expression for X_t (i.e., an expression of the form X_t = . . . where . . . does not
contain X). To this end, we let Y_t = log X_t. With f(x) = log x we have f′(x) = 1/x and f″(x) = −1/x².
Thus, we have

dY_t = (1/X_t) dX_t + (1/2)(−1/X_t²)(dX_t)² = (µ_t − (1/2)σ_t²) dt + σ_t dW_t.

Thus, we have

X_T = exp(Y_T) = exp(Y_0 + ∫₀ᵀ (µ_t − (1/2)σ_t²) dt + ∫₀ᵀ σ_t dW_t)
    = X_0 exp(∫₀ᵀ (µ_t − (1/2)σ_t²) dt + ∫₀ᵀ σ_t dW_t),     (8.15)

where we have used Y_0 = log X_0.
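With constant coefficients, (8.15) reads X_T = X_0 exp((µ − σ²/2)T + σW_T), so log X_T ∼ N(log X_0 + (µ − σ²/2)T, σ²T). A quick check (my own sketch; the constant-coefficient setting and parameter values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
mu, sigma, X0, T, paths = 0.05, 0.2, 1.0, 1.0, 1_000_000

# exact simulation of X_T from (8.15) with constant mu, sigma
W_T = rng.normal(0.0, np.sqrt(T), paths)
X_T = X0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * W_T)

print("mean of log X_T:", np.log(X_T).mean())   # approx (mu - sigma^2/2) T = 0.03
print("var  of log X_T:", np.log(X_T).var())    # approx sigma^2 T = 0.04
```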


Proposition 8.2.9. Let W = (W_t)_{t≥0} be a Brownian motion. Suppose g : R₊ → R₊ is a deterministic
function. Then

I_T := ∫₀ᵀ g(t) dW_t ∼ N(0, v(T)),     v(T) = ∫₀ᵀ g²(t) dt.

Proof. Set µ_t = 0 and σ_t = ug(t) in (8.14), where u is a constant, and suppose X_0 = 1. Then we have

X_T = 1 + ∫₀ᵀ ug(t) X_t dW_t,

which is a martingale since Itô integrals are martingales. From (8.15), we know that X_T can be written
explicitly as

X_T = exp(−(1/2)u² ∫₀ᵀ g²(t) dt + ∫₀ᵀ ug(t) dW_t) = exp(−(1/2)u² ∫₀ᵀ g²(t) dt + uI_T).

Since X is a martingale, we have

1 = EX_T = E exp(−(1/2)u² ∫₀ᵀ g²(t) dt + uI_T)   ⇒   E exp(uI_T) = exp((1/2)u² ∫₀ᵀ g²(t) dt).

Note that E exp(uI_T) = M_{I_T}(u) is the moment generating function of I_T and exp((1/2)u² ∫₀ᵀ g²(t) dt) is the
moment generating function of a normal random variable with mean zero and variance v(T) = ∫₀ᵀ g²(t) dt.
Thus, I_T ∼ N(0, v(T)), as claimed.

8.3 Multivariate stochastic calculus

Definition 8.3.1. A d-dimensional Brownian motion is a process

W = (W_t¹, W_t², . . . , W_t^d)_{t≥0}

with the following properties.

1. Each W^i = (W_t^i)_{t≥0}, i = 1, 2, . . . , d, is a one-dimensional Brownian motion.

2. The processes (W^i)_{1≤i≤d} are independent.

A filtration for W is a collection of σ-algebras F = (F_t)_{t≥0} such that

1. Information accumulates: F_s ⊂ F_t for all 0 ≤ s < t.

2. Adaptivity: W_t ∈ F_t for all t ≥ 0.

3. Independence of future increments: for u > t ≥ 0 we have W_u − W_t ⊥⊥ F_t.

Theorem 8.3.2. Let W = (W_t¹, W_t², . . . , W_t^d)_{t≥0} be a d-dimensional Brownian motion. The covariation
of independent components of W is zero: [W^i, W^j]_T = 0 for all i ≠ j and T ≥ 0.

Proof. Let 0 = t₀ < t₁ < . . . < t_n = T be a partition Π of [0, T]. The sampled covariation C_Π of W^i
and W^j is given by

C_Π = Σ_{k=0}^{n−1} (W^i_{t_{k+1}} − W^i_{t_k})(W^j_{t_{k+1}} − W^j_{t_k}).

Since E(W^i_{t_{k+1}} − W^i_{t_k})(W^j_{t_{k+1}} − W^j_{t_k}) = 0, we clearly have EC_Π = 0. Next, we compute the variance of
C_Π. We have

VC_Π = EC_Π² = Σ_{k=0}^{n−1} Σ_{l=0}^{n−1} E(W^i_{t_{k+1}} − W^i_{t_k})(W^j_{t_{k+1}} − W^j_{t_k})(W^i_{t_{l+1}} − W^i_{t_l})(W^j_{t_{l+1}} − W^j_{t_l})
     = Σ_{k=0}^{n−1} E(W^i_{t_{k+1}} − W^i_{t_k})²(W^j_{t_{k+1}} − W^j_{t_k})²
       + 2 Σ_{k=0}^{n−1} Σ_{l=0}^{k−1} E(W^i_{t_{k+1}} − W^i_{t_k})(W^j_{t_{k+1}} − W^j_{t_k})(W^i_{t_{l+1}} − W^i_{t_l})(W^j_{t_{l+1}} − W^j_{t_l})
     = Σ_{k=0}^{n−1} (t_{k+1} − t_k)(t_{k+1} − t_k)
     ≤ ‖Π‖ Σ_{k=0}^{n−1} (t_{k+1} − t_k) = ‖Π‖ T.

Thus, EC_Π → 0 and VC_Π → 0 as ‖Π‖ → 0, which proves that [W^i, W^j]_T := lim_{‖Π‖→0} C_Π = 0.

Theorem 8.3.2 can be used to derive the covariation of two Itô processes X^i and X^j.

Theorem 8.3.3. Let X^i = (X_t^i)_{t≥0}, i = 1, 2, . . . , n, be the Itô processes given by

dX_t^i = Θ_t^i dt + Σ_{j=1}^{d} σ_t^{ij} dW_t^j,     i = 1, 2, . . . , n,     (8.16)

where W = (W_t¹, W_t², . . . , W_t^d)_{t≥0} is a d-dimensional Brownian motion. Then

d[X^i, X^j]_t = Σ_{k=1}^{d} σ_t^{ik} σ_t^{jk} dt.

We will not prove Theorem 8.3.3. Rather, we simply remark that it can be obtained informally by
writing

d[X^i, X^j]_t = dX_t^i dX_t^j,     (8.17)

inserting expression (8.16) into (8.17) and using the multiplication rules

dW_t^i dW_t^j = δ_{ij} dt,     δ_{ij} = 1 if i = j and 0 if i ≠ j,     dW_t^j dt = 0,     dt dt = 0.     (8.18)

Note that d[X^i, X^j]_t = 0 unless X^i and X^j are driven by at least one common one-dimensional Brownian
motion.

We can now give an n-dimensional version of Itô's Lemma. We present the formula in differential form, as
it is written more compactly in this way.

Theorem 8.3.4 (Itô formula in n dimensions). Let X = (X_t¹, X_t², . . . , X_t^n)_{t≥0} be an n-dimensional
Itô process and suppose f : Rⁿ → R satisfies f ∈ C²(Rⁿ). Then, for any T ≥ 0 we have

df(X_t) = Σ_{i=1}^{n} (∂f(X_t)/∂x_i) dX_t^i + (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} (∂²f(X_t)/∂x_i∂x_j) d[X^i, X^j]_t.

The proof of Theorem 8.3.4 is a straightforward extension of Theorem 8.2.6 to the n-dimensional case
and will not be presented here.

To obtain an explicit expression for df(X_t) in terms of dW_t¹, dW_t², . . . , dW_t^d and dt we can repeat the
same informal procedure we used in the one-dimensional case.

1. Expand df(X_t) = f(X_t + dX_t) − f(X_t) about the point X_t to second order:

   df(X_t) = Σ_{i=1}^{n} (∂f(X_t)/∂x_i) dX_t^i + (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} (∂²f(X_t)/∂x_i∂x_j) dX_t^i dX_t^j.     (8.19)

2. Insert the expression for dX_t^i into (8.19) and use the multiplication rules given in (8.18).

Example 8.3.5 (Product rule). To compute d(X_t Y_t), where X and Y are one-dimensional Itô
processes, we define f(x, y) = xy and use f_x = y, f_y = x, f_{xy} = 1 and f_{xx} = f_{yy} = 0 to compute

d(X_t Y_t) = Y_t dX_t + X_t dY_t + d[X, Y]_t.

Example 8.3.6 (OU process). An Ornstein–Uhlenbeck process (OU process, for short) is an Itô
process X = (X_t)_{t≥0} that satisfies

dX_t = κ(θ − X_t) dt + σ dW_t,     (8.20)

where W = (W_t)_{t≥0} is a one-dimensional Brownian motion and κ, θ > 0. The OU process is mean-
reverting in the following sense. If X_t > θ then κ(θ − X_t) < 0 and the deterministic part of (8.20) (i.e.,
the dt-term) pushes the process down towards θ. If X_t < θ then κ(θ − X_t) > 0 and the deterministic part
of (8.20) pushes the process up towards θ. Thus the OU process mean-reverts to the long-run mean θ.
We often call κ the rate of mean reversion, though this nomenclature is somewhat misleading since the
instantaneous rate of mean reversion is actually κ(θ − X_t).

We will find an explicit expression for X_t and also compute EX_t and VX_t. To this end, let us define
Y_t = X_t − θ so that

dY_t = −κY_t dt + σ dW_t.

Note that Y is an OU process that mean-reverts to zero. Next, we define Z_t = f(t, Y_t) = e^{κt} Y_t. We can
use the two-dimensional Itô formula to compute dZ_t. Using f_{yy} = 0 and the heuristic rules dt dW_t = 0
and dt dt = 0 we have

dZ_t = f_t dt + f_y dY_t + (1/2) f_{yy} d[Y, Y]_t
     = κe^{κt} Y_t dt + e^{κt} dY_t
     = κe^{κt} Y_t dt + e^{κt}(−κY_t dt + σ dW_t)
     = e^{κt} σ dW_t.

Thus, we have obtained an expression for Z_t:

Z_t = Z_0 + ∫₀ᵗ e^{κs} σ dW_s.

Next, we use Y_t = e^{−κt} Z_t and X_t = Y_t + θ to obtain

Y_t = e^{−κt} Y_0 + ∫₀ᵗ e^{−κ(t−s)} σ dW_s,
X_t = θ + e^{−κt}(X_0 − θ) + ∫₀ᵗ e^{−κ(t−s)} σ dW_s.

Note that X_t has a normal distribution at every time t > 0 because Itô integrals with deterministic
integrands are normally distributed random variables; see Proposition 8.2.9. Thus, the distribution of X_t
is completely determined by its mean and variance. We have

EX_t = θ + e^{−κt}(X_0 − θ) + E ∫₀ᵗ e^{−κ(t−s)} σ dW_s = θ + e^{−κt}(X_0 − θ),

VX_t = V ∫₀ᵗ e^{−κ(t−s)} σ dW_s = ∫₀ᵗ (e^{−κ(t−s)} σ)² ds = (σ²/2κ)(1 − e^{−2κt}),

where we have used Proposition 8.2.9.
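The mean and variance formulas provide a useful test of any numerical scheme for (8.20). The sketch below is my own illustration (the parameter values are arbitrary assumptions): it Euler-discretizes the OU process and compares the sample mean and variance of X_t with the formulas above.

```python
import numpy as np

rng = np.random.default_rng(9)
kappa, theta, sigma, X0 = 2.0, 1.0, 0.5, 0.0
t, n, paths = 3.0, 1000, 50_000

dt = t / n
X = np.full(paths, X0)
for _ in range(n):
    # Euler step for dX = kappa (theta - X) dt + sigma dW
    X = X + kappa * (theta - X) * dt + sigma * rng.normal(0.0, np.sqrt(dt), paths)

print("sample mean:", X.mean(), " formula:", theta + np.exp(-kappa * t) * (X0 - theta))
print("sample var: ", X.var(),  " formula:", sigma**2 / (2 * kappa) * (1 - np.exp(-2 * kappa * t)))
```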

In some cases, it will be useful to see if a process is in fact a Brownian motion.

Theorem 8.3.7 (Lévy characterization of Brownian motion). Let M = (M_t)_{t≥0} be a martingale
with respect to a filtration F = (F_t)_{t≥0}. Suppose M has continuous sample paths and satisfies
M_0 = 0 and [M, M]_t = t for all t ≥ 0. Then M is a Brownian motion.

Proof. Let f : R₊ × R → R be C^{1,2}. Note that

df(t, M_t) = ∂_t f(t, M_t) dt + ∂_m f(t, M_t) dM_t + (1/2) ∂²_m f(t, M_t) d[M, M]_t
           = (∂_t + (1/2)∂²_m) f(t, M_t) dt + ∂_m f(t, M_t) dM_t.

In integral form, we have

f(T, M_T) = f(t, M_t) + ∫ₜᵀ (∂_s + (1/2)∂²_m) f(s, M_s) ds + ∫ₜᵀ ∂_m f(s, M_s) dM_s.

Although we have not proved it, since M is a martingale (by assumption), it follows that any integral of
the form I_t := ∫₀ᵗ ∆_s dM_s, where ∆ = (∆_t)_{t≥0} is adapted to (F_t)_{t≥0}, is a martingale. Thus, using the
short-hand E_t[·] := E[· | F_t], we have

E_t f(T, M_T) = f(t, M_t) + E_t ∫ₜᵀ (∂_s + (1/2)∂²_m) f(s, M_s) ds + E_t ∫ₜᵀ ∂_m f(s, M_s) dM_s
             = f(t, M_t) + E_t ∫ₜᵀ (∂_s + (1/2)∂²_m) f(s, M_s) ds.

Now, let us fix a constant u and define

f(t, m) = e^{um − u²t/2}.

It is easy to verify that

(∂_t + (1/2)∂²_m) f = 0.

It follows that

E_t e^{uM_T − u²T/2} = f(t, M_t) = e^{uM_t − u²t/2}   ⇒   E_t e^{u(M_T − M_t)} = e^{u²(T−t)/2}.     (8.21)

By definition, E_t e^{u(M_T − M_t)} is the F_t-conditional moment generating function of M_T − M_t. And e^{u²(T−t)/2}
is the moment generating function of a N(0, T − t) random variable. It follows that M_T − M_t ∼ N(0, T − t)
for all 0 ≤ t ≤ T < ∞, just like a Brownian motion. Furthermore, we see that M_T − M_t ⊥⊥ F_t, as the
right-hand side of (8.21) does not depend on F_t. As M satisfies all properties of a Brownian motion, it
must be a Brownian motion.

Example 8.3.8 (Correlated Brownian motions). Let B = (B_t)_{t≥0} be given by

B_t = ρW_t¹ + ρ̄W_t²,     ρ̄ = √(1 − ρ²),     ρ ∈ [−1, 1],

where W = (W_t¹, W_t²) is a two-dimensional Brownian motion. We will use Theorem 8.3.7 to show that B
is a Brownian motion. It is clear that B_0 = 0 and B has sample paths that are continuous. It is also
clear that B is a martingale since W¹ and W² are martingales. What remains is to show that [B, B]_t = t.
We have

[B, B]_t = [ρW¹ + ρ̄W², ρW¹ + ρ̄W²]_t = ρ²[W¹, W¹]_t + 2ρρ̄[W¹, W²]_t + ρ̄²[W², W²]_t = ρ²t + 0 + ρ̄²t = t.

Alternatively, one can simply compute (dB_t)² = (ρ dW_t¹ + ρ̄ dW_t²)² and use dW_t^i dW_t^j = δ_{ij} dt in order
to show that (dB_t)² = dt. Note that

[B, W¹]_t = [ρW¹ + ρ̄W², W¹]_t = ρ[W¹, W¹]_t + ρ̄[W², W¹]_t = ρt,

where we have used [W^i, W^j]_t = δ_{ij} t.
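This is precisely how correlated Brownian increments are generated in practice: draw two independent normals and mix them. A minimal sketch (my own illustration; ρ = 0.6 is an arbitrary assumption) checks that V B_t = t and Corr(B_t, W_t¹) = ρ.

```python
import numpy as np

rng = np.random.default_rng(10)
rho, t, paths = 0.6, 1.0, 500_000

W1 = rng.normal(0.0, np.sqrt(t), paths)              # W^1_t
W2 = rng.normal(0.0, np.sqrt(t), paths)              # W^2_t, independent of W^1
B = rho * W1 + np.sqrt(1 - rho**2) * W2              # B_t = rho W^1_t + rho_bar W^2_t

print("Var(B_t):", B.var())                          # approx t = 1
print("Corr(B_t, W^1_t):", np.corrcoef(B, W1)[0, 1]) # approx rho = 0.6
```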

8.4 Brownian bridge

Definition 8.4.1. A Gaussian process X = (X_t)_{t≥0} is any process such that, for arbitrary times
t₁ < t₂ < . . . < t_n, the joint distribution of (X_{t₁}, X_{t₂}, . . . , X_{t_n}) is jointly normal.

As previously mentioned, the distribution of a normally distributed random vector is uniquely determined
by its mean vector m and covariance matrix. Thus, for a Gaussian process, we are interested in

m(t ) = EXt , c(s, t ) = E (Xt – m(t )) (Xs – m(s)) .

Example 8.4.2. A Brownian motion W is a Gaussian process. To see this, fix an arbitrary sequence of
times 0 = t₀ < t₁ < t₂ < . . . < t_n. Note that

W_{t_k} = Σ_{j=0}^{k−1} (W_{t_{j+1}} − W_{t_j}).

The increments are independent and normally distributed: W_{t_{j+1}} − W_{t_j} ∼ N(0, t_{j+1} − t_j). It follows
that the vector (W_{t₁}, W_{t₂}, . . . , W_{t_n}) is jointly normal. The mean vector is given by m =
(m(t₁), m(t₂), . . . , m(t_n)) = (0, 0, . . . , 0) and the covariance matrix C = (c(t_i, t_j))_{1≤i,j≤n} has entries
c(t_i, t_j) = t_i ∧ t_j; see equation (7.3).

Example 8.4.3. A stochastic integral with a deterministic integrand,

I_t = ∫₀ᵗ g(s) dW_s,

is a Gaussian process. To see this, fix an arbitrary sequence of times 0 = t₀ < t₁ < t₂ < . . . < t_n. Note that

I_{t_k} = Σ_{j=0}^{k−1} ∫_{t_j}^{t_{j+1}} g(s) dW_s.

The increments are independent and normally distributed with mean zero and variance

V ∫_{t_j}^{t_{j+1}} g(s) dW_s = ∫_{t_j}^{t_{j+1}} g²(s) ds.

It follows that the vector (I_{t₁}, I_{t₂}, . . . , I_{t_n}) is jointly normal. The mean vector is given by m =
(m(t₁), m(t₂), . . . , m(t_n)) = (0, 0, . . . , 0). To compute the covariance matrix C = (c(t_i, t_j))_{1≤i,j≤n},
assume without loss of generality that t_i < t_j. Then we have

c(t_i, t_j) = E ∫₀^{t_i} g(s) dW_s ∫₀^{t_j} g(u) dW_u
           = E (∫₀^{t_i} g(s) dW_s)² + E [∫₀^{t_i} g(s) dW_s · E(∫_{t_i}^{t_j} g(u) dW_u | F_{t_i})]
           = ∫₀^{t_i} g²(s) ds = ∫₀^{t_i∧t_j} g²(s) ds,

where we have used Proposition 8.2.9.

Definition 8.4.4 (Brownian bridge, version I). Let W = (W_t)_{t≥0} be a Brownian motion and fix
T > 0. We define X^{a→b} = (X_t^{a→b})_{0≤t≤T}, a Brownian bridge from a to b on [0, T], by

X_t^{a→b} = a + (t/T)(b − a) + W_t − (t/T)W_T,     0 ≤ t ≤ T.     (8.22)

Clearly, we have X_0^{a→b} = a and X_T^{a→b} = b.

Theorem 8.4.5. The Brownian bridge from a to b, defined in (8.22), is a Gaussian process. The
mean vector and covariance matrix have entries given by

m^{a→b}(t) = a + (t/T)(b − a),
c^{a→b}(t, s) = t ∧ s − st/T.

Proof. Since the sum of two (possibly correlated) normal random variables is again normal, and since
for every t ∈ [0, T], the value of X_t^{a→b} is a linear combination of W_t and W_T, both of which are normal,
it follows that (X_{t₁}^{a→b}, X_{t₂}^{a→b}, . . . , X_{t_n}^{a→b}) is jointly normal. To compute the mean and covariance, we
note that

m^{a→b}(t) = EX_t^{a→b} = a + (t/T)(b − a) + EW_t − (t/T)EW_T = a + (t/T)(b − a),

c^{a→b}(t, s) = E(X_t^{a→b} − m^{a→b}(t))(X_s^{a→b} − m^{a→b}(s))
            = E(W_t − (t/T)W_T)(W_s − (s/T)W_T)
            = EW_t W_s − (s/T)EW_t W_T − (t/T)EW_s W_T + (ts/T²)EW_T²
            = t ∧ s − st/T − st/T + st/T
            = t ∧ s − st/T.

Suppose F = (F_t)_{t≥0} is a filtration for a Brownian motion W. Then X^{a→b} is clearly not adapted to F.
Indeed, since X_t^{a→b} is expressed in terms of W_T, we require the information in F_T in order to write the
value of X_t^{a→b}. There is an alternative definition of a Brownian bridge, which is adapted to F.

Definition 8.4.6 (Brownian bridge, version II). Let W = (W_t)_{t≥0} be a Brownian motion and fix
T > 0. We define Y^{a→b} = (Y_t^{a→b})_{0≤t≤T}, a Brownian bridge from a to b on [0, T], by

Y_t^{a→b} = a + (t/T)(b − a) + (T − t) ∫₀ᵗ (1/(T − s)) dW_s,     0 ≤ t ≤ T.     (8.23)

Theorem 8.4.7. The process Y^{a→b} = (Y_t^{a→b})_{0≤t≤T} given in Definition 8.4.6 is a Gaussian process.
The mean vector and covariance matrix have entries given by

m^{a→b}(t) = a + (t/T)(b − a),
c^{a→b}(t, s) = t ∧ s − st/T,

just as X^{a→b} = (X_t^{a→b})_{0≤t≤T}.

Proof. The stochastic integral I_t := ∫₀ᵗ (1/(T − s)) dW_s in (8.23) has a deterministic integrand. Therefore,
I_t is Gaussian, as is (T − t)I_t. It follows that the process Y is a Gaussian process. Next, we compute

m^{a→b}(t) = EY_t^{a→b} = a + (t/T)(b − a) + (T − t) E ∫₀ᵗ (1/(T − s)) dW_s = a + (t/T)(b − a).

Finally, to compute c(t, s), assume without loss of generality that s ≤ t. Then we have

c^{a→b}(t, s) = E(Y_t^{a→b} − m(t))(Y_s^{a→b} − m(s))
            = (T − t)(T − s) E [∫₀ᵗ (1/(T − u)) dW_u · ∫₀ˢ (1/(T − r)) dW_r]
            = (T − t)(T − s) E (∫₀ˢ (1/(T − u)) dW_u)² + (T − t)(T − s) E [∫₀ˢ (1/(T − u)) dW_u · E(∫ₛᵗ (1/(T − r)) dW_r | F_s)]
            = (T − t)(T − s) ∫₀ˢ (1/(T − u))² du
            = (T − t)(T − s) (1/(T − s) − 1/T) = s − st/T = s ∧ t − st/T.

This completes the proof.

It is interesting to compute the differential of Y_t^{a→b}. For simplicity, consider Y^{0→0}. We have

dY_t^{0→0} = (∫₀ᵗ (1/(T − s)) dW_s) d(T − t) + (T − t) d(∫₀ᵗ (1/(T − s)) dW_s)
          = −(1/(T − t)) Y_t^{0→0} dt + (T − t)(1/(T − t)) dW_t
          = −(Y_t^{0→0}/(T − t)) dt + dW_t.

Note that Y^{0→0} mean-reverts to zero with an instantaneous rate |Y_t^{0→0}|/(T − t) that blows up as t → T.
This blow up guarantees that Y_T^{0→0} = 0.
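The SDE gives a simple forward simulation of the bridge. The sketch below is my own illustration (T = 1, the stopping time s = 0.9 and the grid are arbitrary assumptions): it Euler-discretizes dY_t = −(Y_t/(T − t))dt + dW_t and compares the sample variance of Y_s with c(s, s) = s − s²/T.

```python
import numpy as np

rng = np.random.default_rng(11)
T, s, n, paths = 1.0, 0.9, 900, 50_000

dt = s / n
Y = np.zeros(paths)
for i in range(n):
    t = i * dt
    # Euler step for dY = -(Y / (T - t)) dt + dW
    Y = Y - Y / (T - t) * dt + rng.normal(0.0, np.sqrt(dt), paths)

print("sample mean of Y_s:", Y.mean())        # approx 0
print("sample var of Y_s: ", Y.var())         # approx c(s,s) = s - s^2/T = 0.09
```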

Definition 8.4.8 (Brownian bridge, version III). Let W = (W_t)_{t≥0} be a Brownian motion and fix
T > 0. We define Z^{a→b} = (Z_t^{a→b})_{0≤t≤T}, a Brownian bridge from a to b on [0, T], by

Z_t^{a→b} = a + (W_t | W_T = b − a),     0 ≤ t ≤ T,

that is, a plus the process W conditioned to satisfy W_T = b − a.

Theorem 8.4.9. The process Z^{a→b} = (Z_t^{a→b})_{0≤t≤T} given in Definition 8.4.8 is a Gaussian process.
The mean vector and covariance matrix have entries given by

m^{a→b}(t) = a + (t/T)(b − a),
c^{a→b}(t, s) = t ∧ s − st/T,

just as X^{a→b} = (X_t^{a→b})_{0≤t≤T}.

Proof. We will compute the density of Z_t^{a→b} directly. For simplicity, let a = 0. We have

P(Z_t^{0→b} ∈ dx) = P(W_t ∈ dx | W_T = b) = P(W_t ∈ dx, W_T ∈ db)/P(W_T ∈ db)
                = P(W_t ∈ dx) P(W_T − W_t + x ∈ db)/P(W_T ∈ db).

Letting f_{µ,σ²} denote the density of a N(µ, σ²) random variable, we have

P(Z_t^{0→b} ∈ dx) = f_{0,t}(x) dx f_{x,T−t}(b) db / (f_{0,T}(b) db) = f_{m^{0→b}(t), c^{0→b}(t,t)}(x) dx,

where m^{a→b}(t) and c^{a→b}(t, s) are as given in the statement of the theorem. We can clearly see, then,
that Z_t^{0→b} is normally distributed at every t with mean m^{0→b}(t). We leave computation of the entries
of the covariance matrix c^{0→b}(t, s) as an exercise for the reader.

8.5 Girsanov’s Theorem for a single Brownian motion


We briefly recall some results from Section 1.7. Suppose that, on a probability space (Ω, F, P) we have a
random variable Z ≥ 0 that has expectation EZ = 1. Then we can define a new probability measure P
e

via

P(A)
e = EZ1A , A ∈ F,

and we call Z = ddP


P the Radon-Nikodym derivative of P
e with respect to P. If Z is strictly positive Z > 0,
e

then we also have

e 11 ,
P(A) = E A ∈ F,
Z A

and we call Z1 = dP
e the Radon-Nykodym derivative of P with respect to P.
dP
In Example 1.7.6, on a probability space (Ω, F, P), we defined X ∼ N(0, 1) and a Radon-Nikodym
1 2
derivative Z = e–θX– 2 θ . We showed that Y := X + θ was N(θ, 1) under P and N(0, 1) under P.
e Thus, Z

had the effect of changing the mean of Y.

We would like to extend this idea from a static to a dynamic setting. Specifically, we would like to find
a measure change that modifies the dynamics of a stochastic process X = (X_t)_{t≥0}.

Definition 8.5.1. Let (Ω, F, P) be a probability space and let F = (F_t)_{0≤t≤T} be a filtration on this
space. A Radon–Nikodým derivative process (Z_t)_{0≤t≤T} is any process of the form

Z_t := E[Z | F_t],

where Z is a random variable satisfying EZ = 1 and Z > 0.

Note that Z in Definition 8.5.1 satisfies the conditions of a Radon–Nikodým derivative. As such, one can
define a measure change dP̃/dP = Z.

Lemma 8.5.2. A Radon–Nikodým derivative process Z = (Z_t)_{0≤t≤T} is a martingale.

Proof. For 0 ≤ s ≤ t ≤ T we have

E[Z_t | F_s] = E[E[Z | F_t] | F_s] = E[Z | F_s] = Z_s.

Lemma 8.5.3. Let (Z_t)_{0≤t≤T} be a Radon–Nikodým derivative process and define dP̃/dP = Z. Suppose
Y ∈ F_s, where s ∈ [0, T]. Then

ẼY = EZ_sY.

Proof. The proof is a simple exercise in iterated conditioning. We have

ẼY = EZY = E[Y E[Z | F_s]] = EYZ_s.

Lemma 8.5.4. Let (Z_t)_{0≤t≤T} be a Radon–Nikodým derivative process and define dP̃/dP = Z. Suppose
Y ∈ F_t where 0 ≤ s ≤ t ≤ T. Then

Ẽ[Y | F_s] = (1/Z_s) E[Z_tY | F_s].

Proof. From Definition 2.3.1, we recall that a conditional expectation Ẽ[Y | F_s] must satisfy two
properties:

(i) Ẽ[Y | F_s] ∈ F_s.

(ii) Ẽ1_A Ẽ[Y | F_s] = Ẽ1_AY for all A ∈ F_s.

It is clear that property (i) is satisfied since

(1/Z_s) E[Z_tY | F_s] = (1/E[Z | F_s]) E[Z_tY | F_s] ∈ F_s.

We must check property (ii). Let A ∈ F_s. Then

Ẽ1_A Ẽ[Y | F_s] = Ẽ 1_A (1/Z_s) E[Z_tY | F_s]
              = E Z_s 1_A (1/Z_s) E[Z_tY | F_s]     (because 1_A (1/Z_s) E[Z_tY | F_s] ∈ F_s and Lemma 8.5.3)
              = E 1_A E[Z_tY | F_s]
              = E E[1_A Z_tY | F_s]
              = E 1_A Z_tY
              = Ẽ 1_AY.     (because 1_AY ∈ F_t and Lemma 8.5.3)

Theorem 8.5.5 (Girsanov). Let W = (W_t)_{0≤t≤T} be a Brownian motion on a probability space
(Ω, F, P) and let F = (F_t)_{0≤t≤T} be a filtration for W. Suppose Θ = (Θ_t)_{0≤t≤T} is adapted to the
filtration F. Define (Z_t)_{0≤t≤T} and W̃ = (W̃_t)_{0≤t≤T} by

Z_t = exp(−(1/2)∫₀ᵗ Θ_s² ds − ∫₀ᵗ Θ_s dW_s),     dW̃_t = Θ_t dt + dW_t,     W̃_0 = 0.

Assume that

E ∫₀ᵀ Θ_t² Z_t² dt < ∞.

Define a Radon–Nikodým derivative Z = dP̃/dP = Z_T. Then the process W̃ is a Brownian motion
under P̃.
e

Proof. We will use Lévy’s Theorem 8.3.7 to show that W


f is a Brownian motion under P.
e By definition,

W
f = 0. Also, we see that
0

d[W,
f W] f )2 = (dW + Θ dt )2 = (dW )2 + 2Θ dW dt + Θ2 (dt )2 = dt .
f = (dW
t t t t t t t t

Lastly, we must verify that W̃ is a martingale under P̃. To do this, we first show that (Z_t)_{t≥0} is a
martingale under P. Setting

Z_t = e^{X_t},     X_t = −(1/2)∫₀ᵗ Θ_s² ds − ∫₀ᵗ Θ_s dW_s,

we find (using the Itô formula) that

dZ_t = e^{X_t} dX_t + (1/2)e^{X_t} d[X, X]_t
     = Z_t(−(1/2)Θ_t² dt − Θ_t dW_t) + (1/2)Z_tΘ_t² dt = −Z_tΘ_t dW_t.

Since Itô integrals are martingales, it follows that (Z_t)_{0≤t≤T} is a martingale under P. In particular we
have EZ = EZ_T = Z_0 = 1. We also have

Z_t = E[Z_T | F_t] = E[Z | F_t],

for all 0 ≤ t ≤ T, which shows that (Zt )0≤t ≤T is a Radon-Nikodým derivative process. Next, we show
that (Wf Z )
t t 0≤t ≤T is a martingale under P. We have

d(W
f Z )=W
t t
f dZ + Z dW
t t t
f + d[W,
t
f Z]
t
f (–Z Θ dW ) + Z (Θ d + dW
=W f ) + (Θ dt + dW )(–Z Θ dW )
t t t t t t t t t t t t t
f (–Z Θ dW ) + Z (Θ d + dW
=W f ) – Z Θ dt
t t t t t t t t t t

           = (-W̃_t Θ_t + 1) Z_t dW_t.

Again, since Itô integrals are martingales, it follows that (W̃_t Z_t)_{0≤t≤T} is a martingale under P. Finally, we can show that W̃ is a martingale under P̃. Assuming 0 ≤ s ≤ t ≤ T, we have

Ẽ[W̃_t|F_s] = (1/Z_s) E[Z_t W̃_t|F_s]    (by Lemma 8.5.4)
           = (1/Z_s) Z_s W̃_s = W̃_s.

Thus, W̃ is a martingale, and therefore, a Brownian motion under P̃.
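The effect of the measure change can be checked by simulation. Below is a minimal sketch (Python with NumPy; the parameter values and variable names are illustrative assumptions, not part of the text) with a constant Θ_t = θ, so that Z_T = exp(-½θ²T - θW_T): under P the random variable W̃_T = θT + W_T has mean θT, but weighting paths by Z_T recovers the mean zero and variance T of a Brownian motion under P̃.

    import numpy as np

    rng = np.random.default_rng(0)
    n_paths, T, theta = 200_000, 1.0, 0.5

    W = rng.normal(0.0, np.sqrt(T), n_paths)       # W_T under P
    W_tilde = theta * T + W                        # tilde W_T = theta T + W_T
    Z = np.exp(-0.5 * theta**2 * T - theta * W)    # Z_T for constant Theta

    print(W_tilde.mean())             # approx theta*T = 0.5 under P
    print((Z * W_tilde).mean())       # approx 0: tilde-E[W~_T] = E[Z_T W~_T]
    print((Z * W_tilde**2).mean())    # approx T = 1: second moment under P~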

Theorem 8.5.6 (Martingale representation). Let W = (W_t)_{0≤t≤T} be a Brownian motion on a probability space (Ω, F, P) and let F = (F_t)_{0≤t≤T} be the filtration generated by W (that is, F_t = σ(W_s, 0 ≤ s ≤ t)). Let M = (M_t)_{0≤t≤T} be a martingale with respect to the filtration F. Then there exists a process Γ = (Γ_t)_{0≤t≤T} that is adapted to the filtration F such that

M_t = M_0 + ∫_0^t Γ_s dW_s,    t ∈ [0, T].

The proof of Theorem 8.5.6 is rather involved, and will not be given here. The theorem plays a fundamental role in mathematical finance, where a martingale M represents the value of an asset and Γ represents a trading strategy that replicates the asset's value. For more precise details, take a course in mathematical finance!

8.6 Girsanov’s Theorem for d-dimensional Brownian motion


We conclude this chapter by stating (without proof) Girsanov's Theorem and the martingale representation theorem for multi-dimensional Brownian motions.

Theorem 8.6.1 (Girsanov). Let W = (W_t¹, W_t², ..., W_t^d)_{0≤t≤T} be a d-dimensional Brownian motion on a probability space (Ω, F, P) and let F = (F_t)_{0≤t≤T} be a filtration for W. Suppose Θ = (Θ_t¹, Θ_t², ..., Θ_t^d)_{0≤t≤T} is adapted to the filtration F. Define (Z_t)_{0≤t≤T} and W̃ = (W̃¹, W̃², ..., W̃^d)_{0≤t≤T} by

Z_t = exp( -½ ∫_0^t ⟨Θ_s, Θ_s⟩ ds - ∫_0^t ⟨Θ_s, dW_s⟩ ),    dW̃_t = Θ_t dt + dW_t,    W̃_0 = 0,

where ⟨·,·⟩ denotes the d-dimensional Euclidean inner product. Assume that

E ∫_0^T ⟨Θ_t, Θ_t⟩ Z_t² dt < ∞.

Define a Radon-Nikodým derivative Z = dP̃/dP = Z_T. Then the process W̃ is a d-dimensional Brownian motion under P̃.

It is interesting to note that the components of W̃ in Theorem 8.6.1 could be co-dependent under P (as Θ^j could depend on any of W¹, W², ..., W^d). Nevertheless, under P̃, the components of W̃ are independent of each other.

Theorem 8.6.2 (Martingale representation). Let W = (W_t¹, W_t², ..., W_t^d)_{0≤t≤T} be a d-dimensional Brownian motion on a probability space (Ω, F, P) and let F = (F_t)_{0≤t≤T} be the filtration generated by W (that is, F_t = σ(W_s, 0 ≤ s ≤ t)). Let M = (M_t)_{0≤t≤T} be a martingale with respect to the filtration F. Then there exists a process Γ = (Γ_t¹, Γ_t², ..., Γ_t^d)_{0≤t≤T} that is adapted to the filtration F such that

M_t = M_0 + ∫_0^t ⟨Γ_s, dW_s⟩,    t ∈ [0, T].

8.7 Exercises
Exercise 8.1. Compute d(W_t⁴). Write W_T⁴ as an integral with respect to W plus an integral with respect to t. Use this representation of W_T⁴ to show that EW_T⁴ = 3T². Compute EW_T⁶ using the same technique.

Exercise 8.2. Find an explicit expression for Y_T, where

dY_t = r dt + αY_t dW_t.

Hint: compute d(Y_t Z_t), where Z_t := exp(-αW_t + ½α²t).

Exercise 8.3. Suppose X, ∆ and Π are given by

dX_t = σX_t dW_t,    ∆_t = (∂f/∂x)(t, X_t),    Π_t = X_t ∆_t,

where f is some smooth function. Show that if f satisfies

( ∂/∂t + ½σ²x² ∂²/∂x² ) f(t, x) = 0,

for all (t, x), then Π is a martingale with respect to a filtration F_t for W.

Exercise 8.4. Suppose X is given by

dX_t = µ(t, X_t)dt + σ(t, X_t)dW_t.

For any smooth function f define

M_t^f := f(t, X_t) - f(0, X_0) - ∫_0^t ( ∂/∂s + µ(s, X_s)∂/∂x + ½σ²(s, X_s)∂²/∂x² ) f(s, X_s) ds.

Show that M^f is a martingale with respect to a filtration F_t for W.

Exercise 8.5. Let X = (X_t)_{0≤t≤T} be an OU process on a probability space (Ω, F, P):

dX_t = K(θ - X_t)dt + σdW_t,

where W = (W_t)_{0≤t≤T} is a Brownian motion under the probability measure P. We can define a new probability measure P̃ such that the process W̃ = (W̃_t)_{0≤t≤T} is a Brownian motion under P̃ and the OU process X = (X_t)_{0≤t≤T} on the new probability space (Ω, F, P̃) satisfies

dX_t = K(θ* - X_t)dt + σdW̃_t.

Find the Radon-Nikodým derivative dP̃/dP.


Exercise 8.6. Let X be a Brownian bridge from zero to zero on the interval t = 0 to t = 1.
(1) Prove that the process X_{1-t}, 0 ≤ t ≤ 1, is also a Brownian bridge.
(2) Prove that if W is a Brownian motion, then the processes (1-t)W_{t/(1-t)} and tW_{(1/t)-1}, 0 ≤ t ≤ 1, are both Brownian bridges.
(3) If we write t̄ for t modulo 1, prove that the process Y_t = X_{\overline{t+s}} - X_s, 0 ≤ t ≤ 1, is a Brownian bridge for every fixed s ∈ (0, 1). This is called the cyclic invariance of the Brownian bridge.
Chapter 9

SDEs and PDEs

The notes from this chapter are taken primarily from (Shreve, 2004, Chapter 6), (Øksendal, 2005,
Chapters 5, 7, 8 and 9) and Linetsky (2007). Another good reference is (Karlin and Taylor, 1981, Chapter
15).

9.1 Stochastic differential equations


We will begin this chapter by focusing on scalar stochastic differential equations driven by a single
Brownian motion. The extension to d-dimensional stochastic differential equations driven by m Brownian
motions will be given in Section 9.6.

Definition 9.1.1. A stochastic differential equation (SDE) is an equation of the form

dX_s = µ(s, X_s)ds + σ(s, X_s)dW_s,    X_t = x.    (9.1)

We call the functions µ and σ the drift and diffusion, respectively, and we call X_t = x the initial condition. A (strong) solution of an SDE is a stochastic process X = (X_s)_{s≥t} such that

X_T = x + ∫_t^T µ(s, X_s)ds + ∫_t^T σ(s, X_s)dW_s,    (9.2)

for all T ≥ t.

One way to envision a strong solution of an SDE is as follows: think of a sample path W· (ω) : [t , ∞) → R
as input. From this input, we can construct a unique sample path X· (ω) : [t , ∞) → R.

Ideally, we would like to write XT as an explicit functional of the Brownian path (Ws )s≥t . Unfortunately,
this is typically not possible. Still, it will help to build intuition if we see some explicitly solvable examples.


Example 9.1.2 (Geometric Brownian motion). A geometric Brownian motion is a process Z = (Z_t)_{t≥0} that satisfies

dZ_t = µ(t)Z_t dt + σ(t)Z_t dW_t,    Z_0 = z,

where µ and σ are deterministic functions of t. To solve this SDE, we consider X_t = log Z_t. Using the Itô formula, we obtain

dX_t = d log Z_t = (1/Z_t) dZ_t + ½ (-1/Z_t²) d[Z, Z]_t = ( µ(t) - ½σ²(t) ) dt + σ(t)dW_t.

Integrating from 0 to T, we obtain

X_T = x + ∫_0^T ( µ(t) - ½σ²(t) ) dt + ∫_0^T σ(t)dW_t,    x = log z.

Finally, we obtain our expression for Z_T:

Z_T = exp(X_T) = z exp( ∫_0^T ( µ(t) - ½σ²(t) ) dt + ∫_0^T σ(t)dW_t ).
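As a sanity check, one can compare the exact solution above with a direct Euler-Maruyama discretization of the SDE. The sketch below (Python with NumPy; constant coefficients µ(t) = µ and σ(t) = σ are an illustrative special case, and all parameter values are assumptions) drives both with the same Brownian increments, so the two paths should agree up to discretization error.

    import numpy as np

    rng = np.random.default_rng(1)
    z0, mu, sigma, T, n = 1.0, 0.05, 0.2, 1.0, 2000
    dt = T / n
    t = np.linspace(0.0, T, n + 1)

    dW = rng.normal(0.0, np.sqrt(dt), n)
    W = np.concatenate(([0.0], dW.cumsum()))

    # Euler-Maruyama scheme for dZ = mu Z dt + sigma Z dW
    Z_em = np.empty(n + 1)
    Z_em[0] = z0
    for k in range(n):
        Z_em[k + 1] = Z_em[k] * (1.0 + mu * dt + sigma * dW[k])

    # Exact solution: Z_t = z exp((mu - sigma^2/2) t + sigma W_t)
    Z_exact = z0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * W)

    print(np.abs(Z_em - Z_exact).max())   # small; shrinks as n grows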

Example 9.1.3 (Linear SDE). Consider the following SDE with linear coefficients

dX_t = ( b(t) + µ(t)X_t ) dt + ( a(t) + σ(t)X_t ) dW_t,    X_0 = x.    (9.3)

Define two processes Z = (Z_t)_{t≥0} and Y = (Y_t)_{t≥0} by

dZ_t = µ(t)Z_t dt + σ(t)Z_t dW_t,    Z_0 = 1,
dY_t = ( (b(t) - σ(t)a(t))/Z_t ) dt + ( a(t)/Z_t ) dW_t,    Y_0 = x.

We will show that X_t = Y_t Z_t solves (9.3). Using the product rule, we have

dX_t = d(Y_t Z_t)
     = Y_t dZ_t + Z_t dY_t + d[Y, Z]_t
     = ( µ(t)Y_t Z_t dt + σ(t)Y_t Z_t dW_t ) + ( (b(t) - σ(t)a(t)) dt + a(t)dW_t ) + a(t)σ(t)dt
     = ( b(t) + µ(t)Y_t Z_t ) dt + ( a(t) + σ(t)Y_t Z_t ) dW_t
     = ( b(t) + µ(t)X_t ) dt + ( a(t) + σ(t)X_t ) dW_t.

We also have X_0 = Y_0 Z_0 = x · 1 = x. Thus, we have shown that YZ solves (9.3). Now, we note that

Y_T = x + ∫_0^T ( (b(t) - σ(t)a(t))/Z_t ) dt + ∫_0^T ( a(t)/Z_t ) dW_t,

and thus

X_T = Y_T Z_T = xZ_T + Z_T ∫_0^T ( (b(t) - σ(t)a(t))/Z_t ) dt + Z_T ∫_0^T ( a(t)/Z_t ) dW_t,

where, from Example 9.1.2, we have

Z_t = exp( ∫_0^t ( µ(s) - ½σ²(s) ) ds + ∫_0^t σ(s)dW_s ).

Although we have mostly thrown mathematical rigor out the window in these notes, in an effort to
be responsible mathematicians, we should at least state (even if we do not prove) an existence and
uniqueness result.

Theorem 9.1.4 (Existence and Uniqueness of SDEs). Consider the following SDE

dX_t = µ(t, X_t)dt + σ(t, X_t)dW_t.    (9.4)

Suppose the functions µ : R_+ × R → R and σ : R_+ × R → R_+ satisfy

At most linear growth:  |µ(t, x)| + |σ(t, x)| < C_1(1 + |x|),    ∀ t, x,    (9.5)
Lipschitz continuity:  |µ(t, x) - µ(t, y)| + |σ(t, x) - σ(t, y)| < C_2|x - y|,    ∀ t, x, y,    (9.6)

for some constants C_1, C_2 < ∞. Then SDE (9.4) has a unique solution, which is adapted to the filtration F = (F_t)_{t≥0} generated by W = (W_t)_{t≥0} and satisfies E ∫_0^T X_t² dt < ∞ for all T < ∞.

Remark 9.1.5. Theorem 9.1.4 actually refers to a strong solution of an SDE. There is another notion of
a solution of an SDE called a weak solution. We will not discuss weak solutions here.

We will not prove Theorem 9.1.4. We refer the interested reader, instead, to (Øksendal, 2005, Theorem
5.21). However, we will illustrate with two examples what can go wrong if equations (9.5) and (9.6) are
not satisfied.

Example 9.1.6. Let X satisfy

dX_t = X_t² dt,    X_0 = 1.    (9.7)

We identify µ(t, x) = x², which does not satisfy the linear growth condition (9.5). The unique solution to (9.7) is

X_t = 1/(1 - t),    0 ≤ t < 1.

Note that X_t blows up as t → 1 and that X_t is not defined for t ≥ 1. Thus, it is impossible to find a solution that is defined for all t ≥ 0.

Example 9.1.7. Let X satisfy

dX_t = 3X_t^{2/3} dt,    X_0 = 0.    (9.8)

We identify µ(t, x) = 3x^{2/3}, which does not satisfy the Lipschitz condition (9.6) at x = 0. One can check directly that any X^{(a)} of the form

X_t^{(a)} = 0 for t ≤ a,    X_t^{(a)} = (t - a)³ for t > a,

satisfies (9.8). Thus, solutions of (9.8) are not unique.

Theorem 9.1.8 (Markov property of solutions of an SDE). Let X = (X_t)_{t≥0} be the solution of an SDE of the form (9.1). Then X is a Markov process. That is, for t ≤ T and for some suitable function ϕ, there exists a function g (which depends on t, T and ϕ) such that

E[ϕ(X_T)|F_t] = g(X_t),

where F = (F_t)_{t≥0} is any filtration to which X is adapted.

The proof of Theorem 9.1.8 is somewhat technical and will not be given here. But the intuitive idea for why the theorem is true is rather simple. From (9.2), we see that the value of X_T depends only on the path of the Brownian motion over the interval [t, T] and the initial value X_t = x. The path that X took to arrive at X_t = x plays no role. In other words, given the present X_t = x, the future (X_T)_{T>t} is independent of the past F_t. With this in mind, the process X should admit a transition density

P(X_T ∈ dy|X_t = x) = Γ(t, x; T, y)dy,

and thus, the function g should be given by

g(X_t) = E[ϕ(X_T)|F_t] = E[ϕ(X_T)|X_t] = ∫ dy Γ(t, X_t; T, y)ϕ(y).

Of course, finding an explicit representation of the transition density Γ may not be possible.

9.2 Connection to partial differential equations


In Chapter 5, we showed that, when X = (Xt )t ≥0 is a continuous time Markov chain, its semigroup
P = (Pt )t ≥0 satisfies coupled ODEs (see Theorem 5.2.5). Similarly, when X is the solution of an SDE,
conditional expectations of the path of X can be related to certain PDEs. We discuss these relations
below.

Theorem 9.2.1 (Kolmogorov Backward equation). Let X be the solution of SDE (9.1). For some suitable function ϕ, define

u(t, X_t) := E[ϕ(X_T)|F_t].    (9.9)

If the function u ∈ C^{1,2}, then it satisfies the Kolmogorov Backward Equation (KBE), a linear PDE of the form

(∂_t + A(t))u(t, ·) = 0,    u(T, ·) = ϕ,    A(t) = µ(t, x)∂_x + ½σ²(t, x)∂_x².    (9.10)

Proof. First, we note that the process (u(t, X_t))_{0≤t≤T} is a martingale since, for any 0 ≤ s ≤ t ≤ T, we have

E[u(t, X_t)|F_s] = E[E[ϕ(X_T)|F_t]|F_s] = E[ϕ(X_T)|F_s] = u(s, X_s).

Next, we take the differential of u(t, X_t) and find, using the Itô formula, that

du(t, X_t) = ∂_t u(t, X_t)dt + ∂_x u(t, X_t)dX_t + ½∂_x² u(t, X_t)d[X, X]_t
           = (∂_t + A(t)) u(t, X_t)dt + σ(t, X_t)∂_x u(t, X_t)dW_t.

Integrating, we have

u(t, X_t) = u(s, X_s) + ∫_s^t (∂_r + A(r)) u(r, X_r)dr + ∫_s^t σ(r, X_r)∂_x u(r, X_r)dW_r.

Taking a conditional expectation and using the fact that Itô integrals are martingales, we find

E[u(t, X_t)|F_s] = u(s, X_s) + ∫_s^t E[(∂_r + A(r)) u(r, X_r)|F_s]dr.

Since (u(t, X_t))_{0≤t≤T} is a martingale, the integral above must be zero for every 0 ≤ s ≤ t ≤ T and for every possible value of X_s. The only way for this to be true is if the function u satisfies the PDE in (9.10). To see why u(T, x) = ϕ(x), simply use the fact that ϕ(X_T) ∈ F_T to write

u(T, X_T) = E[ϕ(X_T)|F_T] = ϕ(X_T),

which must hold for all possible values of X_T.

Theorem 9.2.1 tells us that the function u defined in (9.9) satisfies the PDE (9.10). Conversely, the Feynman-Kac formula says that the solution u of the PDE (9.10) has the stochastic representation u(t, x) = E[ϕ(X_T)|X_t = x].
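This correspondence is easy to test numerically. The sketch below (Python with NumPy; the example dX_t = σdW_t with ϕ(x) = x², for which E[ϕ(X_T)|X_0 = x] = x² + σ²T, is an illustrative choice, as are all parameter values) steps the KBE backward from the terminal condition with an explicit finite-difference scheme and compares with the known expectation.

    import numpy as np

    sigma, T = 0.3, 1.0
    x = np.linspace(-3.0, 3.0, 301)
    dx = x[1] - x[0]
    dt = 0.5 * dx**2 / sigma**2        # explicit scheme: stability-limited step
    u = x**2                           # terminal condition u(T, x) = phi(x)

    tt = T
    while tt > 1e-12:
        h = min(dt, tt)
        lap = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / dx**2
        u[1:-1] += h * 0.5 * sigma**2 * lap   # step (9.10) backward from T to 0
        tt -= h                               # (edges held fixed; the boundary
                                              #  error is negligible at the center)

    i = np.argmin(np.abs(x - 0.5))
    print(u[i])                        # approx u(0, 0.5)
    print(0.5**2 + sigma**2 * T)       # exact: x^2 + sigma^2 T = 0.34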

The methods outlined in the proof above can be applied more generally to find PDE representations for
more complicated functionals of the path of X. The basic steps are as follows

1. Find a martingale process.


2. Take the differential of it.
3. Set the dt term to zero.

We illustrate this method with the following theorem.

Theorem 9.2.2. Define

u(t, X_t) := E[ e^{-A(t,T)} ϕ(X_T) + ∫_t^T ds e^{-A(t,s)} g(s, X_s) | F_t ],    A(t, s) := ∫_t^s du γ(u, X_u).

Then u solves the following PDE

(∂_t + A)u + g = 0,    u(T, ·) = ϕ,    A = ½σ²(t, x)∂_x² + µ(t, x)∂_x - γ(t, x).    (9.11)

Proof. Note that the process (u(t, X_t))_{0≤t≤T} is not a martingale as, for 0 ≤ s ≤ t ≤ T, we have

E[u(t, X_t)|F_s] = E[E[e^{-A(t,T)} ϕ(X_T) + ∫_t^T dr e^{-A(t,r)} g(r, X_r)|F_t]|F_s]
                = E[e^{-A(t,T)} ϕ(X_T) + ∫_t^T dr e^{-A(t,r)} g(r, X_r)|F_s]
                ≠ E[e^{-A(s,T)} ϕ(X_T) + ∫_s^T dr e^{-A(s,r)} g(r, X_r)|F_s] = u(s, X_s).

However, the process M = (M_t)_{0≤t≤T}, defined by

M_t := e^{-A(0,t)} u(t, X_t) + ∫_0^t ds e^{-A(0,s)} g(s, X_s)
     = E[e^{-A(0,T)} ϕ(X_T) + ∫_0^T ds e^{-A(0,s)} g(s, X_s)|F_t],

is a martingale as, for 0 ≤ s ≤ t ≤ T, we have

E[M_t|F_s] = E[E[e^{-A(0,T)} ϕ(X_T) + ∫_0^T dr e^{-A(0,r)} g(r, X_r)|F_t]|F_s]
           = E[e^{-A(0,T)} ϕ(X_T) + ∫_0^T dr e^{-A(0,r)} g(r, X_r)|F_s] = M_s.

Now, we take the differential of M. We compute

dM_t = u(t, X_t)de^{-A(0,t)} + e^{-A(0,t)} du(t, X_t) + d ∫_0^t dr e^{-A(0,r)} g(r, X_r)
     = -e^{-A(0,t)} γ(t, X_t)u(t, X_t)dt + e^{-A(0,t)} ( ∂_t + µ(t, X_t)∂_x + ½σ²(t, X_t)∂_x² ) u(t, X_t)dt
       + e^{-A(0,t)} σ(t, X_t)∂_x u(t, X_t)dW_t + e^{-A(0,t)} g(t, X_t)dt
     = e^{-A(0,t)} ( (∂_t + A)u(t, X_t) + g(t, X_t) ) dt + e^{-A(0,t)} σ(t, X_t)∂_x u(t, X_t)dW_t.

Setting the dt term equal to zero, we obtain (9.11). The terminal condition is obtained using

u(T, X_T) = E[e^{-A(T,T)} ϕ(X_T) + ∫_T^T ds e^{-A(T,s)} g(s, X_s)|F_T] = E[ϕ(X_T)|F_T] = ϕ(X_T),

completing the proof.

Killing a diffusion

On a probability space (Ω, F, P), consider the following model for a diffusion X and a random time τ:

dX_t = µ(t, X_t)dt + σ(t, X_t)dW_t,    X_0 = x,    (9.12)
τ = inf{t ≥ 0 : ∫_0^t γ(s, X_s)ds ≥ E},    E ∼ E(1),    E ⊥⊥ X,    (9.13)

where E is exponentially distributed with unit mean and independent of X. Here, we assume γ : R_+ × R → R_+ so that the integral ∫_0^t γ(s, X_s)ds is non-decreasing. Note that τ is the first time the integral ∫_0^t γ(s, X_s)ds equals or exceeds E. The integral depends on the path of X, and therefore is stochastic. The exponentially distributed random variable E is also random (obviously). For these reasons, we say that τ is doubly stochastic. The random time τ is called the killing time, as it is sometimes used to model the lifetime of the process X.
Let F^X = (F_t^X)_{t≥0} be the filtration generated by observing the X process. Note that

1_{τ>t} ∉ F_t^X.

In order to keep track of the information obtained by observing τ, we introduce an auxiliary process D as follows:

D_t = 1_{τ≤t}.

Note that D_t = 0 if τ > t and D_t = 1 if τ ≤ t. As a result,

1_{τ>t} ∈ F_t^D,

where F^D = (F_t^D)_{t≥0} is the filtration generated by observing D. We denote by F = (F_t)_{t≥0} the filtration generated by observing both D and X; this is sometimes written F = F^X ∨ F^D, which means F_t = F_t^X ∨ F_t^D := σ(F_t^X ∪ F_t^D).

Theorem 9.2.3. Let X = (X_t)_{t≥0} and τ be as given in (9.12) and (9.13). Then

E[1_{τ>T} ϕ(X_T)|F_t] = 1_{τ>t} E[e^{-∫_t^T γ(s,X_s)ds} ϕ(X_T)|F_t],    (9.14)

where T ≥ t.

Proof. Noting that 1 = 1_{τ≤t} + 1_{τ>t} and 1_{τ≤t} 1_{τ>T} = 0, we have

E[1_{τ>T} ϕ(X_T)|F_t] = 1_{τ>t} E[1_{τ>T} ϕ(X_T)|F_t] + 1_{τ≤t} E[1_{τ>T} ϕ(X_T)|F_t]
                      = 1_{τ>t} E[ϕ(X_T) E[1_{τ>T}|F_T^X ∨ F_t]|F_t].    (9.15)

Noting that P(E > x) = e^{-x}, we compute

P(τ > t|F_t^X) = P( E > ∫_0^t γ(s, X_s)ds | F_t^X ) = e^{-∫_0^t γ(s,X_s)ds}.

From this, we find

1_{τ>t} E[1_{τ>T}|F_T^X ∨ F_t] = 1_{τ>t} P(τ > T|F_T^X, τ > t)
                              = 1_{τ>t} P(τ > T|F_T^X)/P(τ > t|F_T^X)
                              = 1_{τ>t} e^{-∫_t^T γ(s,X_s)ds}.    (9.16)

Inserting (9.16) into (9.15), we obtain (9.14).
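Taking ϕ = 1 and t = 0 in (9.14) gives P(τ > T) = E exp(-∫_0^T γ(s, X_s)ds), which is easy to verify by simulation. In the sketch below (Python with NumPy; the choices of a Brownian X and killing rate γ(x) = x² are illustrative assumptions), the two printed numbers are independent estimates of the same survival probability.

    import numpy as np

    rng = np.random.default_rng(2)
    n_paths, n_steps, T = 200_000, 200, 1.0
    dt = T / n_steps
    gamma = lambda x: x**2                    # killing rate gamma(x) = x^2

    X = np.zeros(n_paths)                     # X is a Brownian motion, X_0 = 0
    integral = np.zeros(n_paths)              # int_0^t gamma(X_s) ds (left-point rule)
    for _ in range(n_steps):
        integral += gamma(X) * dt
        X += rng.normal(0.0, np.sqrt(dt), n_paths)

    E = rng.exponential(1.0, n_paths)         # E ~ Exp(1), independent of X
    print((integral < E).mean())              # P(tau > T) by simulating the killing
    print(np.exp(-integral).mean())           # E[exp(-int_0^T gamma(X_s) ds)]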

9.3 Kolmogorov forward and backward equations


Let X = (X_t)_{t≥0} be the solution of the following SDE

dX_t = µ(t, X_t)dt + σ(t, X_t)dW_t.

Let Γ denote the transition density of the process X:

Γ(t, x; T, y)dy := P(X_T ∈ dy|X_t = x).

In this section, we will derive two PDEs satisfied by the transition density Γ: the Kolmogorov Backward Equation (KBE), which is a PDE in the backward variables (t, x), and the Kolmogorov Forward Equation (KFE), which is a PDE in the forward variables (T, y). Physicists and biologists sometimes call the KFE the Fokker-Planck Equation.

We have already seen the KFE and KBE in the continuous time Markov chain setting, discussed in
Section 5.2. The development of the KFE and KBE for diffusion processes is remarkably similar.

Definition 9.3.1. The two-parameter semigroup (P(t, T))_{0≤t≤T<∞} of a Markov diffusion X is defined as

P(t, T)ϕ(x) = E[ϕ(X_T)|X_t = x] = ∫ dy Γ(t, x; T, y)ϕ(y),

where ϕ is integrable with respect to the transition density Γ.



Theorem 9.3.2. The semigroup satisfies

P(t, t) = I,    P(t, s)P(s, T) = P(t, T),    0 ≤ t ≤ s ≤ T.

Proof. To see that P(t, t) = I, simply observe that

P(t, t)ϕ(x) = E[ϕ(X_t)|X_t = x] = ϕ(x) = Iϕ(x).

To see the semigroup property P(t, s)P(s, T) = P(t, T), note that

P(t, s)P(s, T)ϕ(x) = P(t, s)E[ϕ(X_T)|X_s = x] = E[E[ϕ(X_T)|X_s]|X_t = x] = E[ϕ(X_T)|X_t = x] = P(t, T)ϕ(x).

The semigroup property can, alternatively, be derived from the Chapman-Kolmogorov equations

Γ(t, x; T, y) = ∫ dz Γ(t, x; s, z)Γ(s, z; T, y),    0 ≤ t ≤ s ≤ T < ∞.

We have

P(t, s)P(s, T)ϕ(x) = ∫ dz Γ(t, x; s, z) P(s, T)ϕ(z)
                   = ∫ dz Γ(t, x; s, z) ∫ dy Γ(s, z; T, y)ϕ(y)
                   = ∫ dy Γ(t, x; T, y)ϕ(y) = P(t, T)ϕ(x).

We associate with a semigroup its generator.

Definition 9.3.3. The infinitesimal generator, or simply the generator, of a semigroup of operators (P(t, s))_{0≤t≤s<∞} is defined as

A(t)ϕ(x) := lim_{s↘t} (1/(s - t)) ( P(t, s)ϕ(x) - ϕ(x) )
          = lim_{s↘t} (1/(s - t)) ( E[ϕ(X_s)|X_t = x] - ϕ(x) ),

where ϕ is any function for which the limit exists.

Theorem 9.3.4. If ϕ ∈ C_0² (bounded and twice differentiable) and X is the solution of

dX_t = µ(t, X_t)dt + σ(t, X_t)dW_t,

then the generator A(t) of the semigroup (P(t, s))_{0≤t≤s<∞} of X is given by

A(t) = µ(t, x)∂_x + ½σ²(t, x)∂_x².



Proof. From the Itô formula we have

dϕ(X_t) = ∂_x ϕ(X_t)dX_t + ½∂_x² ϕ(X_t)d[X, X]_t
        = ( µ(t, X_t)∂_x ϕ(X_t) + ½σ²(t, X_t)∂_x² ϕ(X_t) ) dt + σ(t, X_t)∂_x ϕ(X_t)dW_t
        = A(t)ϕ(X_t)dt + σ(t, X_t)∂_x ϕ(X_t)dW_t.

Therefore, we have

ϕ(X_s) = ϕ(X_t) + ∫_t^s A(r)ϕ(X_r)dr + ∫_t^s σ(r, X_r)∂_x ϕ(X_r)dW_r,
E[ϕ(X_s)|X_t = x] = ϕ(x) + ∫_t^s E[A(r)ϕ(X_r)|X_t = x]dr.

Finally,

lim_{s↘t} (1/(s - t)) ( E[ϕ(X_s)|X_t = x] - ϕ(x) ) = lim_{s↘t} (1/(s - t)) ∫_t^s E[A(r)ϕ(X_r)|X_t = x]dr = A(t)ϕ(x).

Remark 9.3.5. For all intents and purposes, when X is the solution of an SDE, the generator A(t) is simply the operator that acts on ϕ in the dt term of dϕ(X_t). Thus, we can write the Itô formula more compactly as

dϕ(X_t) = A(t)ϕ(X_t)dt + σ(t, X_t)∂_x ϕ(X_t)dW_t.

Of course, if X is not the solution of an SDE, we cannot write the differential dϕ(X_t) in this more compact form.

We can now state and prove the KBE and KFE.

Theorem 9.3.6. Let X = (X_t)_{t≥0} be the unique solution (assumed to exist) of

dX_t = µ(t, X_t)dt + σ(t, X_t)dW_t.

Let Γ denote the transition density of X:

Γ(t, x; T, y)dy = P(X_T ∈ dy|X_t = x).

Then, seen as a function of the backward variables (t, x), the transition density Γ(·, ·; T, y) satisfies

KBE:  (∂_t + A(t))Γ(t, ·; T, y) = 0,    Γ(T, ·; T, y) = δ_y,



where A(t) is the infinitesimal generator of X:

A(t) = µ(t, x)∂_x + ½σ²(t, x)∂_x².

Seen as a function of the forward variables (T, y), the transition density Γ(t, x; ·, ·) satisfies

KFE:  (-∂_T + A*(T))Γ(t, x; T, ·) = 0,    Γ(t, x; t, ·) = δ_x,

where A*(T) is the L²(dx) adjoint of A(T):

A*(T) = -∂_y µ(T, y)· + ½∂_y² σ²(T, y)·.

Proof. We have already established in Theorem 9.2.1 that

u(t, x) := E[ϕ(X_T)|X_t = x] = ∫ dy Γ(t, x; T, y)ϕ(y)

satisfies (9.10). Thus, we must have

0 = (∂_t + A(t))u(t, ·) = ∫ dy ϕ(y)(∂_t + A(t))Γ(t, ·; T, y),
ϕ = u(T, ·) = ∫ dy ϕ(y)Γ(T, ·; T, y).

The above equations must hold for all functions ϕ. It follows that

(∂_t + A(t))Γ(t, ·; T, y) = 0,    Γ(T, ·; T, y) = δ_y,

which is precisely the KBE for Γ.

Proving the KFE requires a little more effort. First, we note that, by definition, the generator A(t) and its L²(dx) adjoint satisfy

∫ dy f(y)A(t)g(y) = ∫ dy g(y)A*(t)f(y),

where both A(t) and A*(t) act on the y variable and f, g → 0 as y → ±∞. Now, observe that

∫ dy ϕ(y)Γ(t, x; T, y) = E[ϕ(X_T)|X_t = x]
    = ϕ(x) + ∫_t^T ds E[A(s)ϕ(X_s)|X_t = x] + E[ ∫_t^T σ(s, X_s)∂_x ϕ(X_s)dW_s | X_t = x ]
    = ϕ(x) + ∫_t^T ds ∫ dy Γ(t, x; s, y)A(s)ϕ(y)
    = ϕ(x) + ∫_t^T ds ∫ dy ϕ(y)A*(s)Γ(t, x; s, y).

Now, we differentiate the above expression with respect to T and find

∫ dy ϕ(y)∂_T Γ(t, x; T, y) = ∫ dy ϕ(y)A*(T)Γ(t, x; T, y).

We also have

∫ dy ϕ(y)Γ(t, x; t, y) = E[ϕ(X_t)|X_t = x] = ϕ(x).

Again, the above expressions must hold for all ϕ. Thus, we see that

0 = (-∂_T + A*(T))Γ(t, x; T, ·),    Γ(t, x; t, ·) = δ_x,

which is exactly the KFE.

Example 9.3.7. Let us check that the KBE and KFE are satisfied in the following simple case:

dX_t = σdW_t.

Since (X_T|X_t = x) ∼ N(x, σ²(T - t)), we have

Γ(t, x; T, y) = ( 1/√(2πσ²(T - t)) ) exp( -(y - x)²/(2σ²(T - t)) ).

One can check by direct computation that

(∂_t + A)Γ(t, ·; T, y) = 0,    lim_{t↗T} Γ(t, ·; T, y) = δ_y,
(-∂_T + A*)Γ(t, x; T, ·) = 0,    lim_{T↘t} Γ(t, x; T, ·) = δ_x,

where A = ½σ²∂_x² and A* = ½σ²∂_y².
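The direct computation can be delegated to a computer algebra system. The following sketch (Python, assuming the SymPy library is available) verifies symbolically that the Gaussian transition density above satisfies both the KBE and the KFE.

    import sympy as sp

    t, x, T, y, sigma = sp.symbols("t x T y sigma", positive=True)
    Gamma = sp.exp(-(y - x)**2 / (2 * sigma**2 * (T - t))) \
            / sp.sqrt(2 * sp.pi * sigma**2 * (T - t))

    kbe = sp.diff(Gamma, t) + sp.Rational(1, 2) * sigma**2 * sp.diff(Gamma, x, 2)
    kfe = -sp.diff(Gamma, T) + sp.Rational(1, 2) * sigma**2 * sp.diff(Gamma, y, 2)

    print(sp.simplify(kbe))   # 0
    print(sp.simplify(kfe))   # 0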

9.4 Poisson equations


The KFE and KBE are parabolic PDEs (think heat-type equations). It turns out we can also relate the expected value of a functional of X to an elliptic PDE (think potential equations). Consider a one-dimensional time-homogeneous diffusion X = (X_t)_{t≥0} that lives on some interval I = (l, r), where -∞ ≤ l < r ≤ ∞. We suppose X satisfies

dX_t = µ(X_t)dt + σ(X_t)dW_t.    (9.17)

Let us define the hitting time

τ = inf{t ≥ 0 : X_t ∉ I},    (9.18)

where ∂I = {l} ∪ {r} denotes the boundary of I.



Theorem 9.4.1. Let X = (X_t)_{t≥0} and τ be given by (9.17) and (9.18). Define

u(x) := E[ e^{-λ(τ-t)} ϕ(X_τ) + ∫_t^τ e^{-λ(s-t)} g(X_s)ds | X_t = x ],    t ≤ τ.

Then the function u satisfies

(A - λ)u + g = 0,  in I,    (9.19)
u = ϕ,  on ∂I,    (9.20)

where A = µ(x)∂_x + ½σ²(x)∂_x² is the infinitesimal generator of X.

Proof. Let F = (F_t)_{0≤t≤τ} be the filtration generated by X. The process M = (M_t)_{0≤t≤τ}, defined by

M_t := e^{-λt} u(X_t) + ∫_0^t e^{-λs} g(X_s)ds
     = e^{-λt} E[ e^{-λ(τ-t)} ϕ(X_τ) + ∫_t^τ e^{-λ(s-t)} g(X_s)ds | X_t ] + ∫_0^t e^{-λs} g(X_s)ds
     = e^{-λt} E[ e^{-λ(τ-t)} ϕ(X_τ) + ∫_t^τ e^{-λ(s-t)} g(X_s)ds | F_t ] + ∫_0^t e^{-λs} g(X_s)ds
     = E[ e^{-λτ} ϕ(X_τ) + ∫_0^τ e^{-λs} g(X_s)ds | F_t ],

is a martingale since, for 0 ≤ s ≤ t ≤ τ, we have

E[M_t|F_s] = E[ e^{-λτ} ϕ(X_τ) + ∫_0^τ e^{-λr} g(X_r)dr | F_s ] = M_s.

Taking the differential of M, we obtain

dM_t = d( e^{-λt} u(X_t) ) + d ∫_0^t e^{-λs} g(X_s)ds
     = e^{-λt} ( -λu(X_t) + Au(X_t) + g(X_t) ) dt + e^{-λt} σ(X_t)∂_x u(X_t)dW_t.

Since M is a martingale, the dt term must be zero for all X_t ∈ I. Thus, we have

(A - λ)u + g = 0,  in I,

in agreement with (9.19). The boundary condition (9.20) follows from

u(x) = E[ e^{-λ(τ-t)} ϕ(X_τ) + ∫_t^τ e^{-λ(s-t)} g(X_s)ds | X_t = x ∈ ∂I ] = ϕ(x),

where we have used the fact that, if X_t ∈ ∂I, then τ = t and X_τ = X_t = x.

Corollary 9.4.2. The Laplace transform of τ, given by

u(x) := E[e^{-λτ}|X_0 = x],

satisfies

(A - λ)u = 0,  in I,
u = 1,  on ∂I.

Proof. Take ϕ(x ) = 1, g(x ) = 0 and t = 0 in Theorem 9.4.1.

Example 9.4.3 (First hitting time of Brownian motion). Let us define a process X and a hitting time τ by

dX_t = dW_t,    X_0 = x < r < ∞,    τ = inf{t ≥ 0 : X_t ∉ (-∞, r)}.

According to Corollary 9.4.2, the function

u(x) := E[e^{-λτ}|X_0 = x]

should satisfy

(A - λ)u = 0,  in (-∞, r),    A = ½∂_x²,    u(r) = 1.    (9.21)

Let us check that this is the case. We showed in Section 7.5 that

Ee^{-λτ_m} = e^{-|m|√(2λ)},    τ_m := inf{t ≥ 0 : W_t = m},

where W_0 = 0 by definition. Since Brownian motion is invariant to translations, it follows that

u(x) = e^{-(r-x)√(2λ)}.    (9.22)

One can easily see that (9.22) satisfies (9.21), since

u(r) = 1,    ½∂_x² u(x) = λu(x).
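The closed form (9.22) can also be checked by brute force. The sketch below (Python with NumPy; the step size, horizon, and parameter values are illustrative assumptions) simulates Brownian paths started at x, records the first time each path reaches r, and averages e^{-λτ}. Paths that have not hit r by the truncation horizon contribute at most e^{-λ t_max} and are simply dropped; the discrete time step also biases the estimate slightly low.

    import numpy as np

    rng = np.random.default_rng(3)
    x, r, lam = 0.0, 1.0, 0.5
    n_paths, dt, t_max = 50_000, 1e-2, 20.0
    n_steps = int(t_max / dt)

    X = np.full(n_paths, x)
    tau = np.full(n_paths, np.inf)              # first hitting times of r
    for k in range(1, n_steps + 1):
        X += rng.normal(0.0, np.sqrt(dt), n_paths)
        newly_hit = (X >= r) & np.isinf(tau)
        tau[newly_hit] = k * dt

    mc = np.where(np.isinf(tau), 0.0, np.exp(-lam * tau)).mean()
    print(mc)                                    # Monte Carlo estimate
    print(np.exp(-(r - x) * np.sqrt(2 * lam)))   # (9.22): exp(-1) ~ 0.368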

9.5 In depth look: scalar time-homogeneous diffusions


In this section we will specialize to time-homogeneous scalar diffusions. Specifically, we consider a diffusion X = (X_t)_{t≥0} that lives on some interval I with endpoints l and r, where -∞ ≤ l < r ≤ ∞. We suppose X satisfies

dX_t = µ(X_t)dt + σ(X_t)dW_t.

The endpoints l and r may or may not be part of the interval I. The generator A of X is given by

A = µ(x)∂_x + ½σ²(x)∂_x².

We can alternatively write A as

A = (1/m(x)) ∂_x ( (1/s(x)) ∂_x · ),    (9.23)

where s and m, called the scale and speed densities, respectively, are given by

s(x) = exp( -∫ dx (2µ(x)/σ²(x)) ),    m(x) = 2/(σ²(x)s(x)) = (2/σ²(x)) exp( ∫ dx (2µ(x)/σ²(x)) ).

The constant of integration is arbitrary.

It is interesting to note that, if m is normalizable,

∫_I dy m(y) = 1,

then m is a time-homogeneous solution of the KFE. To see this, simply observe that

(-∂_T + A*)m(y) = A*m(y) = ∂_y( -µ(y)m(y) + ∂_y( (σ²(y)/2) m(y) ) )
                = ∂_y( -(2µ(y)/σ²(y)) exp( ∫ dy (2µ(y)/σ²(y)) ) + ∂_y exp( ∫ dy (2µ(y)/σ²(y)) ) ) = 0.

Thus, m is a stationary density for X.
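For instance, for the OU process dX_t = -X_t dt + dW_t (treated in Example 9.5.7 below), the normalized speed density is m(x) = e^{-x²}/√π, i.e. a N(0, ½) density. A quick simulation (Python with NumPy; the parameter choices are illustrative assumptions) confirms that the law of X_t settles down to m.

    import numpy as np

    rng = np.random.default_rng(4)
    dt, n_steps, n_paths = 0.01, 5_000, 50_000

    # Euler scheme for the OU process dX = -X dt + dW, started at X_0 = 0
    X = np.zeros(n_paths)
    for _ in range(n_steps):
        X += -X * dt + rng.normal(0.0, np.sqrt(dt), n_paths)

    # stationary density m is N(0, 1/2): mean 0 and variance 1/2
    print(X.mean(), X.var())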

Useful results from functional analysis

Before continuing further, it will be convenient to recall a few definitions from functional analysis.

Definition 9.5.1. Let L²(E, ρ) denote the set of square-integrable functions on E weighted by ρ:

L²(E, ρ) := {f : ⟨f, f⟩_ρ < ∞},    ⟨f, g⟩_ρ := ∫_E dx ρ(x)f(x)g(x).

The adjoint of L is the operator L* that satisfies

⟨f, Lg⟩_ρ = ⟨L*f, g⟩_ρ,    ∀ g ∈ dom(L), ∀ f ∈ dom(L*).

We say that L is self-adjoint in L²(E, ρ) if L = L* and dom(L) = dom(L*).

Expression (9.23) is called the self-adjoint form of A as, for functions f and g that satisfy appropriate boundary conditions (to be determined below), we have

⟨f, Ag⟩_m = ⟨Af, g⟩_m,    ⟨f, g⟩_m := ∫_l^r dx m(x)f(x)g(x).

To see this, simply note that, integrating by parts twice,

⟨f, Ag⟩_m = ∫_I dx m(x)f(x) · (1/m(x)) ∂_x( (1/s(x)) ∂_x g(x) )
          = ∫_I dx f(x) · ∂_x( (1/s(x)) ∂_x g(x) )
          = -∫_I dx (∂_x f(x)) · (1/s(x)) ∂_x g(x) + [ f(x) · (1/s(x)) ∂_x g(x) ]_{∂I}
          = ∫_I dx g(x) · ∂_x( (1/s(x)) ∂_x f(x) ) + [ f(x) · (1/s(x)) ∂_x g(x) - g(x) · (1/s(x)) ∂_x f(x) ]_{∂I}
          = ∫_I dx m(x)g(x) · (1/m(x)) ∂_x( (1/s(x)) ∂_x f(x) ) + [ f(x) · (1/s(x)) ∂_x g(x) - g(x) · (1/s(x)) ∂_x f(x) ]_{∂I}
          = ⟨Af, g⟩_m + [ f(x) · (1/s(x)) ∂_x g(x) - g(x) · (1/s(x)) ∂_x f(x) ]_{∂I}.

Thus, we have

⟨f, Ag⟩_m = ⟨Af, g⟩_m,  assuming  [ f(x) · (1/s(x)) ∂_x g(x) - g(x) · (1/s(x)) ∂_x f(x) ]_{∂I} = 0.
If we require that A acts only on functions h that satisfy

α h(l) + β h'(l)/s(l) = 0,    γ h(r) + δ h'(r)/s(r) = 0,    for some constants α, β, γ, δ ∈ R,    (9.24)

then the boundary terms will vanish, because we have

f(l) g'(l)/s(l) - g(l) f'(l)/s(l) = f(l)g(l)( g'(l)/(s(l)g(l)) - f'(l)/(s(l)f(l)) ) = f(l)g(l)( (-α/β) - (-α/β) ) = 0,

and likewise at x = r. Thus, the operator A with domain

dom(A) = {h ∈ L²(I, m) : Ah ∈ L²(I, m) and (9.24) holds}

is self-adjoint in L²(I, m). Note that A is not self-adjoint in L²(I, dx). Also note that, when we talk about the operator A, we are really talking about the pair (A, dom(A)). The domain of A includes boundary conditions that must be satisfied in order for A to be self-adjoint in L²(I, m).

Boundary classification and boundary conditions


The endpoints l and r of the interval I fall into one of four categories, which we describe below. In order to determine which of these four categories the endpoints belong to, we introduce S, the scale measure

S[x, y] := ∫_x^y dz s(z),    x, y ∈ (l, r),    S(l, y] = lim_{x↘l} S[x, y],    S[x, r) = lim_{y↗r} S[x, y].

Next we introduce four useful integrals related to the scale measure

I_l = ∫_l^x dz S(l, z],    I_r = ∫_x^r dz S[z, r),    (9.25)
J_l = ∫_l^x dz S[z, x],    J_r = ∫_x^r dz S[x, z].    (9.26)
l x

Here, x is any point in the open interval (l, r ).

Definition 9.5.2. Let X be a time-homogeneous scalar diffusion, which lives on an interval I with endpoints l and r. An endpoint l or r is said to be

• regular if I < ∞ and J < ∞,
• exit if I < ∞ and J = ∞,
• entrance if I = ∞ and J < ∞,
• natural if I = ∞ and J = ∞,

where I and J are given in (9.25) and (9.26).

The definition above, while precise, gives us very little intuition as to what the four different boundary classifications mean. Thus, we elaborate a bit.

• Regular boundary: The process X can be started from a regular boundary, and X can reach a regular boundary in finite time.
• Exit boundary: The process X cannot be started from an exit boundary, but X can reach an exit boundary in finite time. If X reaches an exit boundary, it does not return.
• Entrance boundary: The process X can be started from an entrance boundary, but X cannot reach an entrance boundary in finite time.
• Natural boundary: The process X cannot be started from a natural boundary, nor can X reach a natural boundary in finite time.

We can further classify the boundaries as follows:

• Unattainable: I = ∞. Entrance and natural boundaries are unattainable. A diffusion starting in (l, r) cannot reach an unattainable boundary in finite time.
• Attainable: I < ∞. Regular and exit boundaries are attainable. A diffusion starting in (l, r) can reach an attainable boundary in finite time.

We must specify the behavior of X at a regular boundary. Different boundary behaviors correspond to different boundary conditions for dom(A). The two most common behaviors are killing and reflecting.

If a regular boundary is specified as killing, then the process X is killed as soon as it hits this boundary, and it cannot return to the state space I. Thus, if l and r are regular killing boundaries, we clearly have

P(X_T ∈ dy|X_t = l) ≡ Γ(t, l; T, y)dy = 0,    P(X_T ∈ dy|X_t = r) ≡ Γ(t, r; T, y)dy = 0.



Since Γ(t , ·; T, y) ∈ dom(A), we clearly want A to act on functions f that satisfy f (l) = f (r ) = 0.

If a regular boundary is specified as reflecting, then the process X is instantaneously reflected back into I if it hits this boundary. Thus, if l and r are regular reflecting boundaries, we clearly have

1 = ∫_I dy Γ(t, x; T, y),

and hence

0 = ∫_I dy ∂_T Γ(t, x; T, y)
  = ∫_I dy A*Γ(t, x; T, y)
  = ∫_I dy ∂_y( -µ(y)Γ(t, x; T, y) + ½∂_y( σ²(y)Γ(t, x; T, y) ) )
  = [ -µ(y)Γ(t, x; T, y) + ½∂_y( σ²(y)Γ(t, x; T, y) ) ]_{∂I}
  = [ -µ(y)m(y) (Γ(t, x; T, y)/m(y)) + ½∂_y( σ²(y)m(y) (Γ(t, x; T, y)/m(y)) ) ]_{∂I}
  = [ -(2µ(y)/σ²(y)) exp( ∫ dy (2µ(y)/σ²(y)) ) (Γ(t, x; T, y)/m(y)) + ∂_y( exp( ∫ dy (2µ(y)/σ²(y)) ) (Γ(t, x; T, y)/m(y)) ) ]_{∂I}
  = [ exp( ∫ dy (2µ(y)/σ²(y)) ) ∂_y( Γ(t, x; T, y)/m(y) ) ]_{∂I}
  = [ (1/s(y)) ∂_y( Γ(t, x; T, y)/m(y) ) ]_{∂I}.
Since Γ(t, x; T, ·) ∈ dom(A*), it follows that A* acts on functions f that satisfy ∂_y(f(y)/m(y)) = 0 at y = l and y = r. But we are interested in the domain of A. Recall that A* is the L²(I, dx) adjoint of A - not the L²(I, m) adjoint of A. Thus, we seek boundary conditions for A such that ⟨f, Ag⟩ = ⟨A*f, g⟩.
Note that

⟨f, Ag⟩ = ∫_I dx f(x) (1/m(x)) ∂_x( (1/s(x)) ∂_x g(x) )
        = -∫_I dx ∂_x( f(x)/m(x) ) · (1/s(x)) ∂_x g(x) + [ (f(x)/m(x)) (1/s(x)) ∂_x g(x) ]_{∂I}
        = ∫_I dx ∂_x( (1/s(x)) ∂_x( f(x)/m(x) ) ) · g(x) + [ (f(x)/m(x)) (1/s(x)) ∂_x g(x) - (g(x)/s(x)) ∂_x( f(x)/m(x) ) ]_{∂I}
        = ⟨A*f, g⟩ + [ (f(x)/m(x)) (1/s(x)) ∂_x g(x) - (g(x)/s(x)) ∂_x( f(x)/m(x) ) ]_{∂I}.

Since

[ ∂_x( f(x)/m(x) ) ]_{∂I} = 0,    ∀ f ∈ dom(A*),

in order for ⟨f, Ag⟩ = ⟨A*f, g⟩, we must have

[ (1/s(x)) ∂_x g(x) ]_{∂I} = 0,    ∀ g ∈ dom(A).

To summarize: if the endpoints of I are regular, we must specify whether the boundary is killing or reflecting. The correct boundary conditions to impose for the generator A are

killing:  [ f(x) ]_{∂I} = 0,    reflecting:  [ (1/s(x)) ∂_x f(x) ]_{∂I} = 0.

Although we will not derive it, functions in dom(A) should also satisfy certain boundary conditions at exit and entrance boundaries:

exit:  [ f(x) ]_{∂I} = 0,    entrance:  [ (1/s(x)) ∂_x f(x) ]_{∂I} = 0.

No boundary conditions are required at natural boundaries.

Eigenfunction expansions

Consider the following eigenvalue equation

Aψ_n = λ_n ψ_n,    ψ_n ∈ dom(A).

Here, and throughout this subsection, we implicitly assume that the spectrum of A is discrete.

Theorem 9.5.3. Eigenfunctions corresponding to different eigenvalues of A are orthogonal in L²(I, m).

Proof. Let n ≠ k. Then

λ_n⟨ψ_n, ψ_k⟩_m = ⟨Aψ_n, ψ_k⟩_m = ⟨ψ_n, Aψ_k⟩_m = λ_k⟨ψ_n, ψ_k⟩_m.

It follows that ⟨ψ_n, ψ_k⟩_m = 0.

Theorem 9.5.4. Suppose the eigenfunctions of A are complete in L²(I, m) and ϕ ∈ L²(I, m). Then the function u(t, x) := E[ϕ(X_T)|X_t = x] is given by

u(t, x) = Σ_n e^{(T-t)λ_n} ⟨ψ_n, ϕ⟩_m ψ_n(x),    (9.27)

where Aψ_n = λ_n ψ_n and m is the speed density.

Proof. We must show that the function u satisfies the KBE. We have

(∂_t + A)u(t, x) = Σ_n ⟨ψ_n, ϕ⟩_m ( ∂_t e^{(T-t)λ_n} ψ_n(x) + e^{(T-t)λ_n} Aψ_n(x) )
                 = Σ_n ⟨ψ_n, ϕ⟩_m ( -λ_n e^{(T-t)λ_n} ψ_n(x) + e^{(T-t)λ_n} λ_n ψ_n(x) ) = 0,
u(T, x) = Σ_n ⟨ψ_n, ϕ⟩_m ψ_n(x) = ϕ(x),

where, to establish the terminal condition u(T, x) = ϕ(x), we have used the fact that the eigenfunctions (ψ_n)_{n≥0} are complete in L²(I, m).

Corollary 9.5.5. The transition density Γ(t, x; T, y) of X has the following eigenfunction expansion

Γ(t, x; T, y) = m(y) Σ_n e^{(T-t)λ_n} ψ_n(y)ψ_n(x).    (9.28)

Proof. The transition density Γ(t, x; T, y) satisfies the KBE with terminal condition Γ(T, x; T, y) = δ_y(x). Setting ϕ = δ_y in (9.27), we obtain

Γ(t, x; T, y) = Σ_n e^{(T-t)λ_n} ⟨ψ_n, δ_y⟩_m ψ_n(x) = m(y) Σ_n e^{(T-t)λ_n} ψ_n(y)ψ_n(x),

where we have used

⟨ψ_n, δ_y⟩_m = ∫_I dx ψ_n(x)δ_y(x)m(x) = ψ_n(y)m(y).

Example 9.5.6 (Brownian motion in a finite interval). Consider a diffusion X = (X_t)_{t≥0} that lives on a finite interval (l, r) and satisfies the SDE

dX_t = dW_t.

We identify the drift µ(x) = 0 and the diffusion coefficient σ(x) = 1. The generator A of X and the speed density m are given by

A = ½∂_x²,    m(x) = 1/(r - l).

One can easily check that the endpoints l and r are regular. As such, we must specify the behavior at the endpoints as either killing or reflecting. In the case of killing boundaries we have

killing:  Aψ_n = λ_n ψ_n,    ψ_n(l) = ψ_n(r) = 0,
ψ_n(x) = √2 sin( nπ(x - l)/(r - l) ),    λ_n = -½( nπ/(r - l) )²,    n = 1, 2, ...,

where we have normalized the eigenfunctions so that ⟨ψ_n, ψ_k⟩_m = δ_{n,k}. In the case of reflecting boundaries we have

reflecting:  Aψ_n = λ_n ψ_n,    ψ_n'(l) = ψ_n'(r) = 0,
ψ_n(x) = √2 cos( nπ(x - l)/(r - l) ),    λ_n = -½( nπ/(r - l) )²,    n = 0, 1, 2, ...,

where, again, we have normalized the eigenfunctions. One can immediately write down the transition density Γ(t, x; T, y) using (9.28). Of course, the sin and cos expansions should be familiar to anybody who has solved the heat equation on a finite domain.
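For the killing case, the truncated expansion (9.28) can be compared against the classical method-of-images formula for Brownian motion absorbed at l and r. A sketch is given below (Python with NumPy; the interval, evaluation points, and truncation levels are illustrative assumptions); the two evaluations of Γ(t, x; T, y) agree to machine precision.

    import numpy as np

    l, r = 0.0, 1.0
    t, T, x, y = 0.0, 0.1, 0.3, 0.6
    L = r - l

    # Truncated expansion (9.28): m = 1/L, psi_n(x) = sqrt(2) sin(n pi (x-l)/L)
    n = np.arange(1, 100)
    lam = -0.5 * (n * np.pi / L) ** 2
    psi = lambda u: np.sqrt(2.0) * np.sin(n * np.pi * (u - l) / L)
    series = (1.0 / L) * np.sum(np.exp((T - t) * lam) * psi(y) * psi(x))

    # Method of images for Brownian motion killed at l and r
    phi = lambda z: np.exp(-z**2 / (2 * (T - t))) / np.sqrt(2 * np.pi * (T - t))
    k = np.arange(-50, 51)
    images = np.sum(phi(y - x + 2 * k * L) - phi(y + x - 2 * l + 2 * k * L))

    print(series, images)   # the two values agree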

Example 9.5.7 (Ornstein-Uhlenbeck process). Consider now an OU process

dX_t = -X_t dt + dW_t,

which has state space I = R. We identify the drift µ(x) = -x and the volatility coefficient σ(x) = 1. The generator and speed density of X are

A = -x∂_x + ½∂_x²,    m(x) = e^{-x²}.

One can easily verify that the endpoints l = -∞ and r = +∞ are natural. As such, we do not need to specify any boundary conditions. The eigenfunctions and eigenvalues of A are given by

ψ_n(x) = H_n(x),    λ_n = -n,    n = 0, 1, 2, ...,

where (H_n)_{n≥0} are the Hermite polynomials, properly normalized so that ⟨ψ_n, ψ_k⟩_m = δ_{n,k}.

Heuristic computations

We have used the fact that A is self-adjoint in L²(I, m) in order to write an eigenfunction expansion for u(t, x) = E[ϕ(X_T)|X_t = x], the solution of the KBE. In fact, this is a special case of a more general method of solving PDEs and ODEs involving a self-adjoint operator A. In what follows, we continue to assume that the generator A of a diffusion X has a discrete spectrum, and that the eigenfunctions of A are a complete basis in L²(I, m).

For any Borel-measurable function g : R → R we define the operator

g(A)· = Σ_n g(λ_n)⟨ψ_n, ·⟩_m ψ_n,    dom(g(A)) := {f ∈ L²(I, m) : Σ_n g²(λ_n)⟨ψ_n, f⟩_m² < ∞}.    (9.29)

We call (9.29) the eigenfunction or spectral representation of g(A). For example, taking g(λ) = 1 gives the spectral representation of the identity operator

I· = Σ_n ⟨ψ_n, ·⟩_m ψ_n,    dom(I) := {f ∈ L²(I, m) : Σ_n ⟨ψ_n, f⟩_m² < ∞}.

Similarly, taking g(λ) = (λ - µ)^{-1} and defining R_µ := g(A) gives

R_µ· = Σ_n ( 1/(λ_n - µ) ) ⟨ψ_n, ·⟩_m ψ_n,    dom(R_µ) := {f ∈ L²(I, m) : Σ_n ( 1/(λ_n - µ) )² ⟨ψ_n, f⟩_m² < ∞},

and taking g(λ) = e^{(T-t)λ} and defining P(t, T) := g(A) gives

P(t, T)· = Σ_n e^{(T-t)λ_n} ⟨ψ_n, ·⟩_m ψ_n,    dom(P(t, T)) := {f ∈ L²(I, m) : Σ_n e^{(T-t)λ_n} ⟨ψ_n, f⟩_m² < ∞}.

Note that P(t, T) as defined above is in fact the semigroup operator we introduced previously. To see this, note that

P(t, T)ϕ(x) = Σ_n e^{(T-t)λ_n} ⟨ψ_n, ϕ⟩_m ψ_n(x)
            = Σ_n e^{(T-t)λ_n} ∫_I dy ψ_n(y)ϕ(y)m(y) ψ_n(x)
            = ∫_I dy m(y) ( Σ_n e^{(T-t)λ_n} ψ_n(y)ψ_n(x) ) ϕ(y)
            = ∫_I dy Γ(t, x; T, y)ϕ(y),

where we have used (9.28).

The usefulness of defining operators of the form g(A) is as follows. Consider a linear ODE or PDE for a function u involving the operator A. One can obtain a solution to this ODE or PDE as follows:

1. Solve the ODE or PDE assuming A is a constant; the solution will involve terms of the form g(A)f .
2. Replace g(A)f with its spectral representation.

Let us see how this works with a few examples.

Example 9.5.8. Let us check that the above two-step method works for the KBE. We want to solve

(∂_t + A)u(t, ·) = 0,    u(T, ·) = ϕ,    (9.30)

where ϕ ∈ L²(I, m). Treating A like a constant, we have an ODE in t. Solving this ODE and then using the spectral representation for any expression involving A, we obtain

u(t, ·) = e^{(T-t)A} ϕ = Σ_n e^{(T-t)λ_n} ⟨ψ_n, ϕ⟩_m ψ_n,

which agrees with the expression given in (9.27). Noting that e^{(T-t)A} = P(t, T), we can write the solution to (9.30) compactly as u(t, ·) = P(t, T)ϕ.

Example 9.5.9. Consider the following PDE involving A

(∂_t + A)u(t, ·) + g(t, ·) = 0,    u(T, ·) = ϕ,    (9.31)

where ϕ ∈ L²(I, m) and g(s, ·) ∈ L²(I, m) for all s ∈ [0, T]. Treating A as a constant, we have an ODE in t. The solution is given by

u(t, ·) = e^{(T-t)A} ϕ + ∫_t^T ds e^{(s-t)A} g(s, ·).

Using the spectral representation of e^{(s-t)A}, we obtain

u(t, ·) = Σ_n e^{(T-t)λ_n} ⟨ψ_n, ϕ⟩_m ψ_n + ∫_t^T ds Σ_n e^{(s-t)λ_n} ⟨ψ_n, g(s, ·)⟩_m ψ_n.

Again, recognizing that e^{(s-t)A} = P(t, s), we can write the solution of (9.31) as

u(t, ·) = P(t, T)ϕ + ∫_t^T ds P(t, s)g(s, ·).

Example 9.5.10. Consider one last example. We wish to solve

(A - µ)u = g.    (9.32)

Treating A as a constant, we have an algebraic equation for u. Solving this, and using the spectral representation for any expression involving A, we obtain

u = ( 1/(A - µ) ) g = Σ_n ( 1/(λ_n - µ) ) ⟨ψ_n, g⟩_m ψ_n.

Noting that 1/(A - µ) = R_µ, we can write the solution u of (9.32) compactly as u = R_µ g. Note that this solution makes sense only if g ∈ dom(R_µ). In particular, this means that we must have λ_n ≠ µ for all n.

The meaning of "scale" and "speed"

From the scale density s we can define a scale function S, which is given by

S(x) := ∫ dx s(x).

By construction, the scale function S is one-to-one. It is interesting to note that the process S(X) = (S(X_t))_{t≥0} is a martingale, as the following computation shows:

dS(X_t) = AS(X_t)dt + σ(X_t)∂_x S(X_t)dW_t
        = ( µ(X_t)s(X_t) + ½σ²(X_t)∂_x s(X_t) ) dt + σ(X_t)s(X_t)dW_t
        = ( µ(X_t)s(X_t) + ½σ²(X_t)( -2µ(X_t)/σ²(X_t) )s(X_t) ) dt + σ(X_t)s(X_t)dW_t
        = σ(X_t)s(X_t)dW_t.

Now, consider two regular points {l} and {r} of a scalar diffusion X. Let τ_l and τ_r be the first hitting times of X to l and r, respectively:

τ_l := inf{t ≥ 0 : X_t = l},    τ_r := inf{t ≥ 0 : X_t = r}.



We wish to compute P(τ_l < τ_r|X_0 = x), the probability that X hits l prior to hitting r. Let us define

τ := τ_l ∧ τ_r.

As the process S(X) is a martingale, it follows from Theorems 2.4.3 and 2.4.5 that

S(x) = ES(X_τ) = P(τ_l < τ_r|X_0 = x)S(l) + P(τ_r < τ_l|X_0 = x)S(r).

Also, because {τ_l < τ_r} ∪ {τ_r < τ_l} = Ω, we have trivially that

1 = P(τ_l < τ_r|X_0 = x) + P(τ_r < τ_l|X_0 = x).

Using the above two equations, we find

P(τ_l < τ_r|X_0 = x) = (S(r) - S(x))/(S(r) - S(l)),    P(τ_r < τ_l|X_0 = x) = (S(x) - S(l))/(S(r) - S(l)).    (9.33)

If we set Y = S(X), let L = S(l) and R = S(r), and define

τ_L := inf{t ≥ 0 : Y_t = L},    τ_R := inf{t ≥ 0 : Y_t = R},    (9.34)

then equation (9.33) becomes

P(τ_L < τ_R|Y_0 = y) = (R - y)/(R - L),    P(τ_R < τ_L|Y_0 = y) = (y - L)/(R - L).    (9.35)

Any process that satisfies (9.34)-(9.35) is said to be on a natural scale. Thus, given a scalar diffusion X, we can always associate with it a diffusion S(X) that is on a natural scale.
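Formula (9.33) is easy to test by simulation. In the sketch below (Python with NumPy; the example is a Brownian motion with drift, dX_t = µdt + dW_t, whose scale density is s(x) = e^{-2µx}, and all parameter values are illustrative assumptions), the empirical frequency of hitting r before l is compared with the scale-function formula.

    import numpy as np

    rng = np.random.default_rng(5)
    mu, l, r, x = 0.5, 0.0, 1.0, 0.4
    n_paths, dt = 50_000, 1e-3

    S = lambda z: -np.exp(-2.0 * mu * z) / (2.0 * mu)   # a scale function

    X = np.full(n_paths, x)
    done = np.zeros(n_paths, dtype=bool)
    hit_r = np.zeros(n_paths, dtype=bool)
    while not done.all():
        alive = ~done
        X[alive] += mu * dt + rng.normal(0.0, np.sqrt(dt), alive.sum())
        hit_r |= alive & (X >= r)
        done |= (X <= l) | (X >= r)

    print(hit_r.mean())                      # Monte Carlo frequency
    print((S(x) - S(l)) / (S(r) - S(l)))     # (9.33): P(tau_r < tau_l | X_0 = x)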

To do: meaning of speed.

9.6 Extensions to higher dimensions


Throughout this chapter, we have mostly dealt with scalar SDEs and PDEs with one spatial dimension. To conclude this chapter, we simply restate some of the main results in higher dimensions, without providing the conditions under which they hold rigorously.

Throughout this section, we consider a d-dimensional diffusion process X = (X_t)_{t≥0} that satisfies the following SDE

dX_t = µ(t, X_t)dt + σ(t, X_t)dW_t,    (9.36)

where W is an m-dimensional Brownian motion and

µ : R_+ × R^d → R^d,    σ : R_+ × R^d → R_+^{d×m}.

SDE (9.36) has a unique strong solution, which is square-integrable for all t (i.e., E|X_t|² < ∞) and adapted to the filtration F = (F_t)_{t≥0} generated by W (i.e., X_t ∈ F_t), if the following are satisfied:

Linear growth:  |µ(t, x)| + |σ(t, x)| < C_1(1 + |x|),    (9.37)
Lipschitz continuity:  |µ(t, x) - µ(t, y)| + |σ(t, x) - σ(t, y)| < C_2|x - y|,    (9.38)

for some constants C_1 and C_2, independent of t, x, y, where

|σ|² := Σ_{i=1}^d Σ_{j=1}^m σ_{ij}²,    |µ|² := Σ_{i=1}^d µ_i²,    |x|² := Σ_{i=1}^d x_i².

The generator A(t) and its L²(R^d, dx) adjoint are given by

A(t) = Σ_{i=1}^d µ_i(t, x)∂_{x_i} + ½ Σ_{i=1}^d Σ_{j=1}^d (σσ^T)_{i,j}(t, x)∂_{x_i}∂_{x_j},
A*(t) = -Σ_{i=1}^d ∂_{y_i} µ_i(t, y)· + ½ Σ_{i=1}^d Σ_{j=1}^d ∂_{y_i}∂_{y_j} (σσ^T)_{i,j}(t, y)·.

We can write the Itô formula most compactly as

dϕ(X_t) = A(t)ϕ(X_t)dt + (∇ϕ)(X_t)σ(t, X_t)dW_t.

The transition density Γ, defined by Γ(t, x; T, y)dy = P(X_T ∈ dy|X_t = x), satisfies the KBE in the backward variables (t, x) and the KFE in the forward variables (T, y):

KBE:  (∂_t + A(t)) Γ(t, ·; T, y) = 0,    Γ(T, ·; T, y) = δ_y,
KFE:  (-∂_T + A*(T)) Γ(t, x; T, ·) = 0,    Γ(t, x; t, ·) = δ_x.

If we define a function u : R_+ × R^d → R by

u(t, x) := E[ e^{-A(t,T)} ϕ(X_T) + ∫_t^T e^{-A(t,s)} g(s, X_s)ds | X_t = x ],

where A(t, T) is given by

A(t, T) = ∫_t^T γ(s, X_s)ds,    γ : R_+ × R^d → R_+,

then u satisfies

( ∂_t - γ(t, ·) + A(t) ) u(t, ·) + g(t, ·) = 0,    u(T, ·) = ϕ.

Lastly, consider the time-homogeneous diffusion: µ(t, x) = µ(x) and σ(t, x) = σ(x). Let D ⊂ R^d be an open, connected set. Denote by ∂D the boundary of D. Assume X_0 ∈ D and define the hitting time

τ_D := inf{t ≥ 0 : X_t ∉ D}.

Assume P(τ_D < ∞) = 1 and let

u(x) = E[ e^{-λ(τ_D - t)} ϕ(X_{τ_D}) + ∫_t^{τ_D} e^{-λ(s-t)} g(X_s)ds | X_t = x ],    t < τ_D.

Then u satisfies

(A - λ)u + g = 0,  in D,
u = ϕ,  on ∂D.

9.7 Exercises
Exercise 9.1. A (one-dimensional) backward stochastic differential equation (BSDE), defined on a filtered probability space (Ω, F, F = (F_t)_{t≥0}, P), is an equation of the form

dY_t = -f(ω, t, Y_t, Z_t)dt + Z_t dW_t,    Y_T = ξ,    (9.39)

where W = (W_t)_{0≤t≤T} is a (P, F)-Brownian motion, the process Y = (Y_t)_{0≤t≤T} lives in R, the process Z = (Z_t)_{0≤t≤T} lives in R, the driver f : Ω × [0, T] × R × R → R satisfies f(ω, t, Y_t, Z_t) ∈ F_t for all t ∈ [0, T], and the random variable ξ is F_T-measurable. A solution of BSDE (9.39) is any pair of F-adapted processes (Y, Z) such that the terminal condition Y_T = ξ is satisfied. A forward-backward stochastic differential equation (FBSDE) is a BSDE of the form

dX_t = µ(t, X_t, Y_t, Z_t)dt + σ(t, X_t, Y_t)dW_t,    X_0 = x,    (9.40)
dY_t = -f(t, X_t, Y_t, Z_t)dt + Z_t dW_t,    Y_T = ϕ(X_T) ∈ R,

where the processes W, Y, and Z are as described above, the process X = (X_t)_{0≤t≤T} lives in R, and the functions µ, σ, f and ϕ are maps

µ : [0, T] × R × R × R → R,    σ : [0, T] × R × R → R,
f : [0, T] × R × R × R → R,    ϕ : R → R.

In general, the coefficient σ in (9.40) could depend on Z as well. However, for simplicity, we do not consider this case here. FBSDEs naturally arise in mathematical finance, where components of X are

assets in a market, components of Y are values of hedging portfolios, and components of Z are the associated hedges.

We wish to solve FBSDE (9.40) - meaning we wish to find a pair of F-adapted processes (Y, Z) such that Y_T = ϕ(X_T). Let us suppose that Y_t = u(t, X_t) for some function u : [0, T] × R → R, which is to be determined. Let us further suppose that u ∈ C^{1,2}. Compute du(t, X_t) and compare your result to the expression given for dY_t in (9.40). Conclude that, if the function u satisfies the semilinear PDE

0 = (∂_t + A^u(t))u(t) + f^u(t),    u(T) = ϕ,

where A^u and f^u are given by

A^u(t) = µ(t, x, u(t, x), σ(t, x)∂_x u(t, x))∂_x + ½σ²(t, x, u(t, x))∂_x²,
f^u(t) = f(t, x, u(t, x), σ(t, x)∂_x u(t, x)),

then the pair (Y, Z) defined by

Y_t := u(t, X_t),    Z_t := σ(t, X_t)∂_x u(t, X_t),

is (at least formally) a solution of FBSDE (9.40).

Exercise 9.2. Let X be the solution of the following SDE

dX_t = κ(θ - X_t)dt + δ√(X_t) dW_t.

Define

u(t, x) := E[ exp( -∫_t^T X_s ds ) | X_t = x ].

Derive a PDE for the function u. To solve the PDE for u, try a solution of the form

u(t, x) = exp(-xA(t) - B(t)),

where A and B are deterministic functions of t. Show that A and B must satisfy a pair of coupled ODEs (with appropriate terminal conditions at time T). Bonus question: solve the ODEs (it may be helpful to note that one of the ODEs is a Riccati equation).

Exercise 9.3. For i = 1, 2, ..., d, let X^{(i)} satisfy

dX_t^{(i)} = -(b/2) X_t^{(i)} dt + (1/2) σ dW_t^{(i)},

where the (W^{(i)})_{i=1}^d are independent Brownian motions. Define

R_t := Σ_{i=1}^d (X_t^{(i)})²,    B_t := Σ_{i=1}^d ∫_0^t (1/√(R_s)) X_s^{(i)} dW_s^{(i)}.

Show that B is a Brownian motion. Derive an SDE for R that involves only dt and dB_t terms (i.e., no dW_t^{(i)} terms should appear).

Exercise 9.4. Consider the following 2-dimensional SDE

dX_t = -½Z_t dt + √(Z_t) dW_t,
dZ_t = κ(θ - Z_t)dt + σ√(Z_t) ( ρdW_t + √(1 - ρ²) dB_t ),

where W and B are independent Brownian motions. Define

u(t, x, z) := E[ϕ(X_T)|X_t = x, Z_t = z].

Derive a PDE for the function u. Let û be the Fourier transform of u in the x variable

û(t, ξ, z) = (1/2π) ∫_R dx e^{-iξx} u(t, x, z).

Show that û satisfies a PDE in (t, z) with terminal condition û(T, ξ, z) = ϕ̂(ξ), where ϕ̂ is the Fourier transform of ϕ. Assume that û is of the form

û(t, ξ, z) = ϕ̂(ξ) e^{A(t,ξ) + zB(t,ξ)}.

Show that A and B satisfy a pair of coupled ODEs in t (with appropriate terminal conditions at time T). Bonus question: solve the ODEs (it may be helpful to note that one of the ODEs is a Riccati equation).

Exercise 9.5. Consider a diffusion X = (X_t)_{t≥0} that lives on a finite interval (l, r), 0 < l < r < ∞, and satisfies the SDE

dX_t = µX_t dt + σX_t dW_t.

One can easily check that the endpoints l and r are regular (you do not have to prove it here). Assume both endpoints are killing. Find the transition density Γ(t, x; T, y) of X.

Exercise 9.6. Consider two diffusion processes X = (X_t)_{t≥0} and Y = (Y_t)_{t≥0} that satisfy the SDEs

dX_t = dW_t¹,
dY_t = dW_t²,

where W¹ and W² are two independent Brownian motions. Define a function u as follows:

u(x, y) = E[φ(X_τ)|X_t = x, Y_t = y],    τ = inf{s ≥ t : Y_s = a}.

(1) State a PDE and boundary conditions satisfied by the function u.

(2) Let us define the Fourier transform and the inverse Fourier transform, respectively, as follows:

Fourier transform:  f̂(ω) := ∫ e^{-iωx} f(x)dx,
Inverse transform:  f(x) := (1/2π) ∫ e^{iωx} f̂(ω)dω.

Use Fourier transforms and a conditioning argument to derive an expression for u(x, y) as an inverse Fourier transform. Use this result to derive an explicit form for P(X_τ ∈ dz|X_t = x, Y_t = y) (i.e., an expression involving no integrals).

(3) Show that the expression you derived in part (2) for u(x, y) satisfies the PDE and BCs you stated in part (1).

Exercise 9.7. Suppose W = (W_t¹, W_t², ..., W_t^d)_{t≥0} is a d-dimensional Brownian motion. Define

R_t = ( Σ_{i=1}^d (W_t^i)² )^{1/2}.

Clearly, R lives in an interval I with endpoints {0} and {∞}. Show that, when d = 1, the origin is a regular endpoint. Show that, when d ≥ 2, the origin is an entrance point.
Chapter 10

Jump diffusions

Notes from this chapter are taken primarily from (Øksendal and Sulem, 2005, Chapter 1). Notes for
Section 10.5 on Hawkes processes follow Hawkes (1971).

10.1 Basic definitions and results on Lévy processes


Definition 10.1.1. A d-dimensional stochastic process η = (η_t)_{t≥0}, defined on a probability space (Ω, F, P), is called a Lévy process if it satisfies

1. η_0 = 0.
2. Independent increments: for any 0 ≤ t_1 < t_2 < t_3 < t_4 < ∞, we have η_{t_4} - η_{t_3} ⊥⊥ η_{t_2} - η_{t_1}.
3. Stationary increments: for any 0 ≤ t_1 < t_2 < ∞, we have η_{t_2} - η_{t_1} ∼ η_{t_2-t_1}.
4. Continuity in probability: for any ε > 0 and t ≥ 0, we have lim_{s↘0} P(|η_{t+s} - η_t| > ε) = 0.

Note, Item 4 in Definition 10.1.1 does not mean that a Lévy process cannot jump. For example, consider a Poisson process N = (N_t)_{t≥0} with intensity λ. A Poisson process is a jump process, and yet it is easy to see that it is continuous in probability as, for any ε ∈ (0, 1), we have

P(|N_{t+s} - N_t| > ε) = P(N_{t+s} - N_t ≥ 1) = 1 - P(N_{t+s} - N_t = 0) = 1 - e^{-λs} → 0  as s ↘ 0.

Item 4 simply means that, at a fixed t , the probability that a Lévy process has a discontinuity at t is
zero.

We can and do assume that any Lévy process is right-continuous with left limits (RCLL). That is,

lim_{s↘0} η_{t+s} = η_t,    ∀ t ≥ 0.


A process that is RCLL is sometimes called càdlàg (for those who speak French: continue à droite,
limite à gauche). We have already encountered two examples of Lévy processes.

Example 10.1.2. A Brownian motion W = (W_t)_{t≥0} is a continuous Lévy process with W_{t+s} - W_t ∼ N(0, s).

Example 10.1.3. A Poisson process N = (N_t)_{t≥0} with intensity λ is a pure-jump Lévy process with N_{t+s} - N_t ∼ Poi(λs).

A filtration for a Lévy process is defined just as a filtration for a Brownian motion.

Definition 10.1.4. Let (Ω, F, P) be a probability space on which a Lévy process η = (η_t)_{t≥0} is defined. A filtration for the Lévy process η is a collection of σ-algebras F = (F_t)_{t≥0} satisfying:

1. Information accumulates: if 0 ≤ s < t, then F_s ⊂ F_t.
2. Adaptivity: for all t ≥ 0, we have η_t ∈ F_t.
3. Independence of future increments: if u > t ≥ 0, then (η_u - η_t) ⊥⊥ F_t.

The most natural choice for this filtration F is (not surprisingly) the natural filtration for η, that is, F_t = σ(η_s, 0 ≤ s ≤ t). In principle the filtration F could contain more than the information obtained by observing η. However, the information in the filtration is not allowed to destroy the independence of future increments of the Lévy process.

Definition 10.1.5. The jump of a Lévy process η at time t ≥ 0 is defined as

∆η_t := η_t - η_{t-},    η_{t-} := lim_{s↗t} η_s.

Obviously, if η does not experience a jump at time t, then ∆η_t = 0.

Definition 10.1.6. We define N : R_+ × B_0^d × Ω → N_0, the Poisson random measure of η, by

N(t, U, ω) = Σ_{s: 0<s≤t} 1_{∆η_s(ω) ∈ U},    U ∈ B_0^d := {B ∈ B(R^d) : {0} ∉ B}.

As with most random variables and stochastic processes, we will typically omit the dependence of N
on ω, writing simply N(t , U) as opposed to N(t , U, ω). The Poisson random measure N(t , U) counts
the number of jumps of size ∆ηs ∈ U prior to time t . It will be convenient to introduce the following
differential form

N(dt , dz ),

which counts the number of jumps of size dz over the time interval dt .

Definition 10.1.7. Let N be the Poisson random measure of a Lévy process η. We define ν : B_0^d → R_+, the Lévy measure of η, as follows:

ν(U) := EN(1, U),    U ∈ B_0^d.

Theorem 10.1.8. Let U ∈ B_0^d. Then the process (N(t, U))_{t≥0} is a Poisson process with intensity ν(U).

Because N(t, U) is a Poisson process with intensity ν(U), we have

EN(t, U) = ν(U)t    ⇒    (1/t) EN(t, U) = ν(U).

Thus, ν(U) is the expected number of times η has a jump of size z ∈ U per unit time. Note that

ν(U)dt = EN(dt, U) = Σ_{k=0}^∞ k P(N(dt, U) = k) = P(N(dt, U) = 1) + O(dt²),  as dt → 0.

Hence, for small dt, one can think of ν(U)dt as the probability that η experiences a jump of size z ∈ U in the time interval dt.

Example 10.1.9 (Compound Poisson process). Let (X_n)_{n∈N} be a sequence of iid random vectors in R^d with distribution F_X. Let (P_t)_{t≥0} be a Poisson process with intensity λ. Assume (P_t)_{t≥0} ⊥⊥ (X_n)_{n∈N}. We define a compound Poisson process η = (η_t)_{t≥0} by

η_t = Σ_{n=1}^{P_t} X_n.

The increments of η are given by

η_t - η_s = Σ_{n=P_s+1}^{P_t} X_n.

The distribution F_{η_t-η_s} depends only on (t - s) and F_X. As such, we see that η is a stationary process. Also, non-overlapping increments of η are clearly independent, as they depend on different (X_i). Thus, η is a Lévy process in R^d. Let us find the Lévy measure ν corresponding to η. For any U ∈ B_0^d we have

ν(U) = EN(1, U) = E Σ_{s: 0<s≤1} 1_{∆η_s ∈ U}
     = E Σ_{n=1}^{P_1} 1_{X_n ∈ U} = E Σ_{n=1}^{P_1} E[1_{X_n ∈ U}|P_1]
     = E Σ_{n=1}^{P_1} E[1_{X_n ∈ U}] = E Σ_{n=1}^{P_1} P(X_n ∈ U)
     = F_X(U) EP_1 = λF_X(U).

Thus, the Lévy measure of η is ν = λF_X.
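The identity ν = λF_X can be checked numerically. The sketch below (Python with NumPy; standard normal jumps and the set U = [1, ∞) are illustrative choices, as are the parameter values) counts, per path and per unit time, the jumps of a simulated compound Poisson process whose size lands in U, and compares with λF_X(U).

    import numpy as np
    from math import erfc, sqrt

    rng = np.random.default_rng(6)
    lam, t, n_paths = 3.0, 1.0, 100_000

    # n_paths copies of eta_t: P_t ~ Poi(lam t) jumps, each jump X_n ~ N(0,1)
    P = rng.poisson(lam * t, n_paths)
    jumps = rng.normal(0.0, 1.0, P.sum())     # all jumps, across all paths

    # Average number of jumps with size in U = [1, oo), per path, per unit time
    print((jumps >= 1.0).sum() / (n_paths * t))
    # lam * F_X(U) = lam * P(N(0,1) >= 1) = lam * erfc(1/sqrt(2))/2 ~ 0.476
    print(lam * 0.5 * erfc(1.0 / sqrt(2.0)))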



A pure-jump Lévy process η has a finite Lévy measure ν(R^d) < ∞ if and only if it can be represented by a compound Poisson process. In this case, we can express η in one of two ways:

η_t = ∫_{R^d} z N(t, dz),    or    η_t = Σ_{n=1}^{P_t} X_n.    (10.1)

However, there exist Lévy processes for which ν(R^d) = ∞. We call a Lévy process for which ν(R^d) = ∞ an infinite activity Lévy process. For an infinite activity Lévy process, neither of the representations in (10.1) makes sense. To write the most general form of a Lévy process, we introduce the compensated Poisson random measure.

Definition 10.1.10. Let N be a Poisson random measure with associated Lévy measure ν. The compensated Poisson random measure, denoted Ñ, is defined as

Ñ(t, A) := N(t, A) - ν(A)t.

In differential form, we have

Ñ(dt, dz) := N(dt, dz) - ν(dz)dt.

For any A ∈ B_0^d, the process (Ñ(t, A))_{t≥0} is a martingale (with respect to its own filtration) as

E[Ñ(t, A)|F_s] = Ñ(s, A) + E[Ñ(t, A) - Ñ(s, A)|F_s]
              = Ñ(s, A) + E[N(t, A) - N(s, A)|F_s] - ν(A)(t - s)
              = Ñ(s, A) + ν(A)(t - s) - ν(A)(t - s) = Ñ(s, A),

where we have used EN(t, A) = tν(A). It follows that, for any fixed R > 0, the process M^{(k)} = (M_t^{(k)})_{t≥0}, defined as

M_t^{(k)} := ∫_{1/k≤|z|<R} z Ñ(t, dz),    k = 1, 2, ...,

is a (d-dimensional) martingale satisfying E|M_t^{(k)}|² < ∞. One can show that, as k → ∞, the sequence of processes (M^{(k)})_{k∈N} converges in L²(Ω, F, P) to a process M = (M_t)_{t≥0} defined as

M_t ≡ ∫_{|z|<R} z Ñ(t, dz) := lim_{k→∞} ∫_{1/k≤|z|<R} z Ñ(t, dz),

which is also a martingale and satisfies E|M_t|² < ∞.

Remark 10.1.11. In general we cannot separate an integral with respect to Ñ(t, dz) into an integral with respect to N(t, dz) and an integral with respect to t ν(dz). The reason is that we may have ∫_{|z|<R} |z| ν(dz) = ∞, and in this case, we have
\[
\int_{|z| < R} z \, \widetilde{N}(t, dz) \neq \int_{|z| < R} z \, N(t, dz) - \int_{|z| < R} z \, \nu(dz) \, t.
\]

The following theorem gives the most general form of a Lévy process.

Theorem 10.1.12 (Itô-Lévy decomposition). Let η = (η_t)_{t≥0} be a Lévy process in R^d. Then η has the decomposition
\[
\eta_t = \mu_R \, t + \sigma W_t + \int_{|z| < R} z \, \widetilde{N}(t, dz) + \int_{|z| \geq R} z \, N(t, dz), \tag{10.2}
\]
for some vector µ_R ∈ R^d, matrix σ ∈ R₊^{d×d}, constant R ∈ [0, ∞], Poisson random measure N on R^d, and d-dimensional Brownian motion W that is independent of N.

We can always choose R = 1 in (10.2). It is useful to recognize when we can choose R = 0 and R = ∞ as,
in these cases, the right-hand side of (10.2) reduces from four to three terms.

Example 10.1.13. If the Lévy measure ν satisfies
\[
\int_{|z| \geq 1} |z| \, \nu(dz) < \infty, \tag{10.3}
\]
then we can write (10.2) as
\[
\begin{aligned}
\eta_t &= \Big( \mu_1 + \int_{|z| \geq 1} z \, \nu(dz) \Big) t + \sigma W_t + \int_{|z| < 1} z \, \widetilde{N}(t, dz) + \int_{|z| \geq 1} z \, \big( N(t, dz) - \nu(dz) \, t \big) \\
&= \mu_\infty \, t + \sigma W_t + \int_{\mathbb{R}^d} z \, \widetilde{N}(t, dz),
\end{aligned}
\]
where µ_∞ = µ₁ + ∫_{|z|≥1} z ν(dz). We call µ_∞ the center of η since Eη_t = µ_∞ t. It is straightforward to show that if µ_∞ = 0 then η is a martingale.

Example 10.1.14. If the Lévy measure ν satisfies
\[
\int_{|z| < 1} |z| \, \nu(dz) < \infty, \tag{10.4}
\]
then we can write (10.2) as
\[
\begin{aligned}
\eta_t &= \mu_1 \, t + \sigma W_t + \int_{|z| < 1} z \, N(t, dz) - \int_{|z| < 1} z \, \nu(dz) \, t + \int_{|z| \geq 1} z \, N(t, dz) \\
&= \mu_0 \, t + \sigma W_t + \int_{\mathbb{R}^d} z \, N(t, dz), \tag{10.5}
\end{aligned}
\]
where µ₀ = µ₁ − ∫_{|z|<1} z ν(dz). We call µ₀ the drift of η.

Example 10.1.15. If the Lévy measure ν satisfies ν(R^d) < ∞, then (10.4) holds, and we can write (10.5) as a compound Poisson process
\[
\eta_t = \mu_0 \, t + \sigma W_t + \sum_{n=1}^{P_t} X_n,
\]
where P is a Poisson process with intensity ν(R^d) and the (X_n)_{n∈N} are iid random vectors in R^d with common distribution F_X = ν / ν(R^d).

We will typically drop the subscript from µR , writing it instead simply as µ.

Theorem 10.1.16. Lévy processes are Markov processes.

This should come as no surprise. We have already seen that Brownian motion and the Poisson process are Markov processes. As these processes serve as building blocks for more general Lévy processes, it follows that Lévy processes are Markov as well.

Theorem 10.1.17 (Lévy-Khintchine formula). Let η = (η_t)_{t≥0} be the following d-dimensional Lévy process
\[
\eta_t = \mu t + \sigma W_t + \int_{|z| < R} z \, \widetilde{N}(t, dz) + \int_{|z| \geq R} z \, N(t, dz),
\]
where µ ∈ R^d, σ ∈ R₊^{d×d}, W is a d-dimensional Brownian motion and N is a Poisson random measure on R^d. Then the Lévy measure ν associated with N satisfies
\[
\int_{\mathbb{R}^d} (1 \wedge |z|^2) \, \nu(dz) < \infty, \tag{10.6}
\]
and the characteristic function φ_{η_t} : R^d → C is given by
\[
\varphi_{\eta_t}(u) := \mathbb{E} e^{i \langle u, \eta_t \rangle} = e^{t \psi(u)}, \qquad \langle a, b \rangle := \sum_{i=1}^d a_i b_i, \tag{10.7}
\]
where ψ, called the characteristic exponent of η, is given by
\[
\psi(u) = i \langle \mu, u \rangle - \tfrac{1}{2} \langle u, \sigma \sigma^T u \rangle + \int_{|z| < R} \big( e^{i \langle u, z \rangle} - 1 - i \langle u, z \rangle \big) \nu(dz) + \int_{|z| \geq R} \big( e^{i \langle u, z \rangle} - 1 \big) \nu(dz). \tag{10.8}
\]
Conversely, given a triplet (µ, a, ν) with a = σσᵀ and ν satisfying (10.6), there exists a Lévy process η satisfying (10.7)-(10.8).

Proof. We will show why (10.7)-(10.8) holds for a scalar Lévy process of the compound Poisson type. In this case, η has the decomposition
\[
\eta_t = \mu t + \sigma W_t + \sum_{n=1}^{P_t} X_n,
\]
where W is a scalar Brownian motion, P is a Poisson process with intensity λ and the (X_n)_{n∈N} are iid random variables with common distribution F_X. We compute
\[
\begin{aligned}
\mathbb{E} e^{i u \eta_t}
&= \mathbb{E} e^{i u \mu t + i u \sigma W_t + i u \sum_{n=1}^{P_t} X_n} \\
&= e^{i u \mu t} \, \mathbb{E} e^{i u \sigma W_t} \, \mathbb{E} e^{i u \sum_{n=1}^{P_t} X_n} \\
&= e^{i u \mu t - \frac{1}{2} \sigma^2 u^2 t} \, \mathbb{E} \, \mathbb{E}\big[ e^{i u \sum_{n=1}^{P_t} X_n} \,\big|\, P_t \big] \\
&= e^{i u \mu t - \frac{1}{2} \sigma^2 u^2 t} \, \mathbb{E} \big( \mathbb{E}[e^{i u X}] \big)^{P_t} \\
&= e^{i u \mu t - \frac{1}{2} \sigma^2 u^2 t} \, \mathbb{E} \Big( \int_{\mathbb{R}} e^{i u z} \, F_X(dz) \Big)^{P_t} \\
&= e^{i u \mu t - \frac{1}{2} \sigma^2 u^2 t} \exp\Big( \lambda t \Big( \int_{\mathbb{R}} e^{i u z} \, F_X(dz) - 1 \Big) \Big) \\
&= e^{i u \mu t - \frac{1}{2} \sigma^2 u^2 t} \exp\Big( \lambda t \int_{\mathbb{R}} \big( e^{i u z} - 1 \big) F_X(dz) \Big) \\
&= e^{t \psi(u)},
\end{aligned}
\]
where we have used E sᴾᵗ = exp(λt(s − 1)) and ν = λF_X. The computation for a more general Lévy process in multiple dimensions is similar.
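To see the Lévy-Khintchine formula in action, here is a minimal Monte Carlo sketch (illustrative parameters, Gaussian jump distribution) comparing the empirical characteristic function of a compound-Poisson-type jump-diffusion with e^{tψ(u)}:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters for a scalar jump-diffusion of compound Poisson type
mu, sigma, lam, m, s, t = 0.1, 0.2, 3.0, -0.1, 0.15, 1.0
u = 1.7  # test frequency

# Monte Carlo samples of eta_t = mu*t + sigma*W_t + sum_{n <= P_t} X_n
n_paths = 100_000
W = rng.normal(0.0, np.sqrt(t), n_paths)
P = rng.poisson(lam * t, n_paths)
jump_sums = np.array([rng.normal(m, s, k).sum() for k in P])
eta = mu * t + sigma * W + jump_sums

phi_mc = np.mean(np.exp(1j * u * eta))

# Levy-Khintchine: psi(u) = i*u*mu - sigma^2*u^2/2 + lam*(E[e^{iuX}] - 1),
# with E[e^{iuX}] = exp(i*u*m - s^2*u^2/2) for Gaussian jumps.
psi = 1j * u * mu - 0.5 * sigma**2 * u**2 \
      + lam * (np.exp(1j * u * m - 0.5 * s**2 * u**2) - 1.0)
print("Monte Carlo :", phi_mc)
print("Closed form :", np.exp(t * psi))
```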

Let us provide a few examples of Lévy measures ν and compute the integrals that appear in expression
(10.8) for the characteristic exponent ψ.

Example 10.1.18 (Dirac comb). A Dirac comb Lévy measure on R is a measure of the form
\[
\nu(dz) = \sum_{i=1}^n \lambda_i \, \delta_{z_i}(z) \, dz,
\]
where λ_i is the intensity of a jump of size z_i. In this case, both (10.3) and (10.4) are satisfied, so we can choose either R = 0 or R = ∞. Suppose we choose R = 0. Then the third term in (10.8) disappears and the last term becomes
\[
\int_{\mathbb{R}} \big( e^{i u z} - 1 \big) \nu(dz) = \sum_{i=1}^n \lambda_i \big( e^{i u z_i} - 1 \big).
\]
Example 10.1.19 (Gaussian jumps). A Gaussian Lévy measure on R is a measure of the form
\[
\nu(dz) = \lambda \, \frac{1}{\sqrt{2 \pi s^2}} \exp\Big( \frac{-(z - m)^2}{2 s^2} \Big) dz,
\]
where m is the mean jump size, s² is the variance of the jumps, and the intensity of all jumps is ν(R) = λ. In this case, both (10.3) and (10.4) are satisfied, so we can choose either R = 0 or R = ∞. Suppose we choose R = 0. Then the third term in (10.8) disappears and the last term becomes
\[
\int_{\mathbb{R}} \big( e^{i u z} - 1 \big) \nu(dz) = \lambda \Big( e^{i u m - \frac{1}{2} s^2 u^2} - 1 \Big).
\]

Example 10.1.20 (Generalized tempered stable jumps). A generalized tempered stable Lévy measure on R is a measure of the form
\[
\nu(dz) = \frac{A_- \, e^{-\beta_- |z|}}{|z|^{1+\alpha_-}} \mathbb{1}_{\{z < 0\}} \, dz + \frac{A_+ \, e^{-\beta_+ z}}{|z|^{1+\alpha_+}} \mathbb{1}_{\{z > 0\}} \, dz,
\]
where the parameters (A_±, α_±, β_±) are all positive. In this case (10.3) is satisfied but (10.4) is not. Thus, we must choose R = ∞. The fourth term in (10.8) then disappears and the third term becomes
\[
\begin{aligned}
\int_{\mathbb{R}} \big( e^{i u z} - 1 - i u z \big) \nu(dz)
&= A_- \Gamma(-\alpha_-) \beta_-^{\alpha_-} \Big( \Big( 1 + \frac{i u}{\beta_-} \Big)^{\alpha_-} - 1 - \frac{i u \alpha_-}{\beta_-} \Big) \\
&\quad + A_+ \Gamma(-\alpha_+) \beta_+^{\alpha_+} \Big( \Big( 1 - \frac{i u}{\beta_+} \Big)^{\alpha_+} - 1 + \frac{i u \alpha_+}{\beta_+} \Big),
\end{aligned}
\]
assuming α_± ≠ 0 and α_± ≠ 1. The above integral can also be computed explicitly for the cases α_± = 0 and α_± = 1 (see, e.g., (Cont and Tankov, 2004, Proposition 4.2)).

Assumption 10.1.21. From this point onward we assume E|η_t|² < ∞, and thus E|η_t| < ∞. Thus, we can take R = ∞ and express η as follows
\[
\eta_t = \mu t + \sigma W_t + \int_{\mathbb{R}^d} z \, \widetilde{N}(t, dz). \tag{10.9}
\]

Lévy processes are semimartingales, which (roughly speaking) form the class of stochastic processes that can be used to construct an Itô integral. In Section 8.2 we defined the Itô integral of an (F_t)_{t≥0}-adapted process Γ = (Γ_t)_{t≥0} with respect to a Brownian motion W = (W_t)_{t≥0} by introducing a sequence of simple processes Γ^{(n)} = (Γ_t^{(n)})_{t≥0} that converged to Γ in the following sense
\[
\Gamma_t^{(n)} := \sum_{j=0}^{n-1} \Gamma_{t_j} \mathbb{1}_{\{t_j \leq t < t_{j+1}\}}, \qquad 0 \leq t_0 < t_1 < \ldots < t_n = T, \qquad \lim_{n \to \infty} \mathbb{E} \int_0^T |\Gamma_t - \Gamma_t^{(n)}|^2 \, dt = 0.
\]
We then defined the Itô integral as the following limit
\[
\int_0^T \Gamma_t \, dW_t := \lim_{n \to \infty} \int_0^T \Gamma_t^{(n)} \, dW_t = \lim_{n \to \infty} \sum_{j=0}^{n-1} \Gamma_{t_j} \big( W_{t_{j+1}} - W_{t_j} \big).
\]
We can define the Itô integral of Γ with respect to a Lévy process η in the same manner
\[
\int_0^T \Gamma_t \, d\eta_t := \lim_{n \to \infty} \int_0^T \Gamma_t^{(n)} \, d\eta_t = \lim_{n \to \infty} \sum_{j=0}^{n-1} \Gamma_{t_j} \big( \eta_{t_{j+1}} - \eta_{t_j} \big).
\]

Here, we must add a technical condition that the integrand Γ be a left-continuous process. If a process Γ is not left-continuous, then we can still define a stochastic integral as follows
\[
I_T = \int_0^T \Gamma_{t-} \, d\eta_t, \qquad \Gamma_{t-} = \lim_{s \uparrow t} \Gamma_s,
\]
where (Γ_{t−})_{t≥0} is the left-continuous version of (Γ_t)_{t≥0}. Note that, if Γ is left-continuous then Γ_t = Γ_{t−} for all t ≥ 0. The process (I_t)_{t≥0} will still be a right-continuous process as η is right-continuous. In view of (10.9), for a left-continuous process Γ, we can separate an Itô integral into three terms
\[
\int_0^T \Gamma_t \, d\eta_t = \int_0^T \Gamma_t \, \mu \, dt + \int_0^T \Gamma_t \, \sigma \, dW_t + \int_0^T \int_{\mathbb{R}^d} \Gamma_t \, z \, \widetilde{N}(dt, dz). \tag{10.10}
\]
Expression (10.10) suggests that we consider general processes of the form
\[
dX_t = \mu_t \, dt + \sigma_t \, dW_t + \int \gamma_{t-}(z) \, \widetilde{N}(dt, dz), \tag{10.11}
\]
where the processes (µ_t)_{t≥0}, (σ_t)_{t≥0} and (γ_{t−}(z))_{t≥0} must be adapted to a filtration F = (F_t)_{t≥0} obtained by observing the processes W and ∫ z N(·, dz). Note that we have written things in differential form; to obtain the integral form, simply integrate and add an initial condition. We have been somewhat vague about the dimension of the various objects appearing in (10.11). In general, for X ∈ R^d, we could have
\[
\mu : \mathbb{R}_+ \times \Omega \to \mathbb{R}^d, \qquad \sigma : \mathbb{R}_+ \times \Omega \to \mathbb{R}^{d \times n}, \qquad \gamma : \mathbb{R}_+ \times \mathbb{R}^m \times \Omega \to \mathbb{R}^{d \times k},
\]
where W is an n-dimensional Brownian motion (with W^{(i)} ⊥⊥ W^{(j)} for i ≠ j) and Ñ is a k-dimensional compensated Poisson random measure (with N^{(i)} ⊥⊥ N^{(j)} for i ≠ j) on R^m. Component-wise, we have
\[
dX_t^{(i)} = \mu_t^{(i)} \, dt + \sum_{j=1}^n \sigma_t^{(i,j)} \, dW_t^{(j)} + \sum_{j=1}^k \int_{\mathbb{R}^m} \gamma_{t-}^{(i,j)}(z) \, \widetilde{N}^{(j)}(dt, dz), \qquad i = 1, 2, \ldots, d.
\]

We call any process of the form (10.11) a Lévy-Itô process.

We showed (for the scalar case) in Theorem 8.1.5 that if σ satisfies E ∫₀ᵀ |σ_t|² dt < ∞ then the Itô integral I_t := ∫₀ᵗ σ_s dW_s is a martingale. Likewise, if
\[
\mathbb{E} \int_0^T \int_{\mathbb{R}} |\gamma_t(z)|^2 \, \nu(dz) \, dt < \infty,
\]
then the process M = (M_t)_{0≤t≤T}, defined by
\[
M_t := \int_0^t \int_{\mathbb{R}} \gamma_s(z) \, \widetilde{N}(ds, dz),
\]
is a martingale. It follows that if µ = 0 in (10.11), and σ and γ(z) satisfy the above square integrability conditions, then the process X = (X_t)_{t≥0} is a martingale.

10.2 The Itô formula for Lévy-Itô processes.


In this section, we suppose X is given by (10.11) and we show how to compute df(X_t). We simply state the result and then explain heuristically why it is true.

Theorem 10.2.1 (One-dimensional Itô formula for Lévy-Itô processes). Suppose X = (X_t)_{t≥0} is a one-dimensional Lévy-Itô process of the form
\[
dX_t = \mu_t \, dt + \sigma_t \, dW_t + \int_{\mathbb{R}} \gamma_{t-}(z) \, \widetilde{N}(dt, dz), \tag{10.12}
\]
where the drift µ, the volatility σ, and the jump coefficient γ are maps
\[
\mu : \mathbb{R}_+ \times \Omega \to \mathbb{R}, \qquad \sigma : \mathbb{R}_+ \times \Omega \to \mathbb{R}_+, \qquad \gamma : \mathbb{R}_+ \times \Omega \times \mathbb{R} \to \mathbb{R},
\]
and N is a one-dimensional Poisson random measure on R, which is independent of a one-dimensional Brownian motion W. Let f : R → R satisfy f ∈ C²(R). Then
\[
\begin{aligned}
df(X_t) &= \Big( \mu_t f'(X_t) + \tfrac{1}{2} \sigma_t^2 f''(X_t) \Big) dt + \sigma_t f'(X_t) \, dW_t \\
&\quad + \int_{\mathbb{R}} \Big( f(X_{t-} + \gamma_t(z)) - f(X_{t-}) \Big) \widetilde{N}(dt, dz) \\
&\quad + \int_{\mathbb{R}} \Big( f(X_{t-} + \gamma_t(z)) - f(X_{t-}) - \gamma_t(z) f'(X_t) \Big) \nu(dz) \, dt. \tag{10.13}
\end{aligned}
\]

We are not going to prove Theorem 10.2.1. However, we will attempt to understand why the formula is correct. Suppose for simplicity that ν(R) < ∞. In this case we have a finite activity Lévy process and we can separate the compensated Poisson random measure Ñ into two parts, N(dt, dz) and ν(dz)dt. In this case, we can write dX_t as
\[
\begin{aligned}
dX_t &= \mu_t \, dt + \sigma_t \, dW_t + \int_{\mathbb{R}} \gamma_t(z) \, N(dt, dz) - \int_{\mathbb{R}} \gamma_t(z) \, \nu(dz) \, dt \\
&= \Big( \mu_t - \int_{\mathbb{R}} \gamma_t(z) \, \nu(dz) \Big) dt + \sigma_t \, dW_t + \int_{\mathbb{R}} \gamma_t(z) \, N(dt, dz) \\
&= \mu_t' \, dt + \sigma_t \, dW_t + \int_{\mathbb{R}} \gamma_t(z) \, N(dt, dz), \qquad \mu_t' := \mu_t - \int_{\mathbb{R}} \gamma_t(z) \, \nu(dz),
\end{aligned}
\]
where we can now identify the drift of X as µ_t′. Similarly, we can write df(X_t) as
\[
\begin{aligned}
df(X_t) &= \Big( \mu_t f'(X_t) + \tfrac{1}{2} \sigma_t^2 f''(X_t) \Big) dt + \sigma_t f'(X_t) \, dW_t \\
&\quad + \int_{\mathbb{R}} \Big( f(X_{t-} + \gamma_t(z)) - f(X_{t-}) \Big) N(dt, dz) \\
&\quad - \int_{\mathbb{R}} \Big( f(X_{t-} + \gamma_t(z)) - f(X_{t-}) \Big) \nu(dz) \, dt \\
&\quad + \int_{\mathbb{R}} \Big( f(X_{t-} + \gamma_t(z)) - f(X_{t-}) - \gamma_t(z) f'(X_t) \Big) \nu(dz) \, dt \\
&= \Big( \Big( \mu_t - \int_{\mathbb{R}} \gamma_t(z) \, \nu(dz) \Big) f'(X_t) + \tfrac{1}{2} \sigma_t^2 f''(X_t) \Big) dt + \sigma_t f'(X_t) \, dW_t \\
&\quad + \int_{\mathbb{R}} \Big( f(X_{t-} + \gamma_t(z)) - f(X_{t-}) \Big) N(dt, dz) \\
&= \Big( \mu_t' f'(X_t) + \tfrac{1}{2} \sigma_t^2 f''(X_t) \Big) dt + \sigma_t f'(X_t) \, dW_t \\
&\quad + \int_{\mathbb{R}} \Big( f(X_{t-} + \gamma_t(z)) - f(X_{t-}) \Big) N(dt, dz). \tag{10.14}
\end{aligned}
\]

Things are looking a bit more familiar now. The non-integral terms in (10.14) arise from the µ_t′ dt + σ_t dW_t part of dX_t. To understand the integral term in (10.14), suppose there is a jump of size y at time t. Since ν(R) < ∞, there can only be a single jump at time t and thus
\[
N(dt, dz) = \delta_y(z) \, dz.
\]
It follows that
\[
\Delta X_t := X_t - X_{t-} = \int_{\mathbb{R}} \gamma_t(z) \, N(dt, dz) = \int_{\mathbb{R}} \gamma_t(z) \, \delta_y(z) \, dz = \gamma_t(y),
\]
and finally, that
\[
\begin{aligned}
\Delta f(X_t) = f(X_t) - f(X_{t-}) &= f(X_{t-} + \gamma_t(y)) - f(X_{t-}) \\
&= \int_{\mathbb{R}} \Big( f(X_{t-} + \gamma_t(z)) - f(X_{t-}) \Big) \delta_y(z) \, dz \\
&= \int_{\mathbb{R}} \Big( f(X_{t-} + \gamma_t(z)) - f(X_{t-}) \Big) N(dt, dz).
\end{aligned}
\]
This last expression agrees with the integral term in (10.14). Of course, if there is no jump at time t then N(dt, dz) = 0 for all dz, and in this case ΔX_t = Δf(X_t) = 0.

Example 10.2.2. Consider the following geometric Lévy process
\[
dX_t = b X_t \, dt + a X_t \, dW_t + \int_{\mathbb{R}} \big( e^{c(z)} - 1 \big) X_{t-} \, \widetilde{N}(dt, dz). \tag{10.15}
\]
Note, if X₀ > 0 then this process remains strictly positive since
\[
X_{t-} + \big( e^{c(z)} - 1 \big) X_{t-} = e^{c(z)} X_{t-} > 0.
\]
Comparing (10.15) with (10.12) we identify
\[
\mu_t = b X_t, \qquad \sigma_t = a X_t, \qquad \gamma_t(z) = \big( e^{c(z)} - 1 \big) X_{t-}.
\]
We seek an explicit expression for X_t. To this end, we define
\[
f(X_t) = \log X_t, \qquad f'(X_t) = \frac{1}{X_t}, \qquad f''(X_t) = -\frac{1}{X_t^2}.
\]
Plugging all of the above into (10.13) we obtain
\[
\begin{aligned}
df(X_t) = d \log X_t
&= \Big( b X_t \frac{1}{X_t} - \tfrac{1}{2} a^2 X_t^2 \frac{1}{X_t^2} \Big) dt + a X_t \frac{1}{X_t} \, dW_t \\
&\quad + \int_{\mathbb{R}} \Big( \log\big( X_{t-} + ( e^{c(z)} - 1 ) X_{t-} \big) - \log X_{t-} \Big) \widetilde{N}(dt, dz) \\
&\quad + \int_{\mathbb{R}} \Big( \log\big( X_{t-} + ( e^{c(z)} - 1 ) X_{t-} \big) - \log X_{t-} - \big( e^{c(z)} - 1 \big) X_{t-} \frac{1}{X_{t-}} \Big) \nu(dz) \, dt \\
&= \Big( b - \tfrac{1}{2} a^2 \Big) dt + a \, dW_t + \int_{\mathbb{R}} c(z) \, \widetilde{N}(dt, dz) + \int_{\mathbb{R}} \Big( c(z) - e^{c(z)} + 1 \Big) \nu(dz) \, dt \\
&= b' \, dt + a \, dW_t + \int_{\mathbb{R}} c(z) \, \widetilde{N}(dt, dz), \qquad b' = b - \tfrac{1}{2} a^2 - \int_{\mathbb{R}} \Big( e^{c(z)} - 1 - c(z) \Big) \nu(dz).
\end{aligned}
\]
Integrating from 0 to t, we obtain
\[
\log X_t - \log X_0 = b' t + a W_t + \int_{\mathbb{R}} c(z) \, \widetilde{N}(t, dz).
\]
Finally, adding log X₀ to both sides and exponentiating, we have
\[
X_t = X_0 \exp\Big( b' t + a W_t + \int_{\mathbb{R}} c(z) \, \widetilde{N}(t, dz) \Big).
\]
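The closed-form solution makes it easy to check the dynamics numerically. The sketch below (illustrative parameters; c(z) = z and a finite-activity Gaussian jump measure ν = λ·N(m, s²), so every integral is available in closed form) samples X_t exactly and verifies that EX_t = X₀e^{bt}, as implied by (10.15) and the martingale property of the dW_t and Ñ(dt, dz) integrals:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative parameters; c(z) = z and nu = lam * N(m, s^2)
X0, b, a = 1.0, 0.05, 0.2
lam, m, s, t = 1.5, -0.05, 0.1, 1.0

# b' = b - a^2/2 - int (e^z - 1 - z) nu(dz); for Gaussian nu the integral
# equals lam * (exp(m + s^2/2) - 1 - m).
b_prime = b - 0.5 * a**2 - lam * (np.exp(m + 0.5 * s**2) - 1.0 - m)

n_paths = 200_000
W = rng.normal(0.0, np.sqrt(t), n_paths)
P = rng.poisson(lam * t, n_paths)
# int c(z) Ntilde(t,dz) = (sum of jumps) minus its compensator lam*m*t
comp_int = np.array([rng.normal(m, s, k).sum() for k in P]) - lam * m * t

X_t = X0 * np.exp(b_prime * t + a * W + comp_int)

print("Monte Carlo E[X_t]:", X_t.mean())
print("Theory      E[X_t]:", X0 * np.exp(b * t))
```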

Theorem 10.2.3 (Lévy-Itô isometry). Let X = (X_t)_{t≥0} satisfy
\[
dX_t = \sigma_t \, dW_t + \int_{\mathbb{R}} \gamma_t(z) \, \widetilde{N}(dt, dz), \qquad X_0 = 0,
\]
and assume that σ = (σ_t)_{t≥0} and γ(z) = (γ_t(z))_{t≥0} are F-adapted. Then
\[
\mathbb{E} X_T^2 = \mathbb{E} \int_0^T \Big( \sigma_t^2 + \int_{\mathbb{R}} \gamma_t^2(z) \, \nu(dz) \Big) dt,
\]
assuming the right-hand side is finite.

Proof. One could prove Theorem 10.2.3 directly from the definition of the Itô integral. However, we will use the Itô formula instead. Setting f(x) = x² and noting that
\[
f'(X_t) = 2 X_t, \qquad f''(X_t) = 2, \qquad f(X_{t-} + \gamma_t(z)) - f(X_{t-}) = 2 X_{t-} \gamma_t(z) + \gamma_t^2(z),
\]
we obtain from (10.13) that
\[
\begin{aligned}
df(X_t) &= \tfrac{1}{2} \sigma_t^2 \cdot 2 \, dt + 2 X_t \sigma_t \, dW_t \\
&\quad + \int_{\mathbb{R}} \Big( 2 X_{t-} \gamma_t(z) + \gamma_t^2(z) \Big) \widetilde{N}(dt, dz) \\
&\quad + \int_{\mathbb{R}} \Big( 2 X_{t-} \gamma_t(z) + \gamma_t^2(z) - \gamma_t(z) \cdot 2 X_{t-} \Big) \nu(dz) \, dt \\
&= \Big( \sigma_t^2 + \int_{\mathbb{R}} \gamma_t^2(z) \, \nu(dz) \Big) dt + (\cdots) \, dW_t + \int_{\mathbb{R}} (\cdots) \, \widetilde{N}(dt, dz).
\end{aligned}
\]
Integrating from 0 to T and taking an expectation, we obtain
\[
\begin{aligned}
\mathbb{E} X_T^2 &= \mathbb{E} \int_0^T \Big( \sigma_t^2 + \int_{\mathbb{R}} \gamma_t^2(z) \, \nu(dz) \Big) dt + \mathbb{E} \int_0^T (\cdots) \, dW_t + \mathbb{E} \int_0^T \int_{\mathbb{R}} (\cdots) \, \widetilde{N}(dt, dz) \\
&= \mathbb{E} \int_0^T \Big( \sigma_t^2 + \int_{\mathbb{R}} \gamma_t^2(z) \, \nu(dz) \Big) dt,
\end{aligned}
\]
where we have used E ∫₀ᵀ (⋯) dW_t = 0 and E ∫₀ᵀ ∫_R (⋯) Ñ(dt, dz) = 0.
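A Monte Carlo sketch of the isometry (illustrative parameters, constant integrands σ_t ≡ σ and γ_t(z) = z, with ν = λ·N(m, s²)): here X_T = σW_T + (sum of jumps) − λmT, and the isometry predicts EX_T² = T(σ² + λ(m² + s²)).

```python
import numpy as np

rng = np.random.default_rng(3)

sigma, lam, m, s, T = 0.3, 2.0, 0.1, 0.2, 1.0   # illustrative values

n_paths = 200_000
W = rng.normal(0.0, np.sqrt(T), n_paths)
P = rng.poisson(lam * T, n_paths)
jump_sums = np.array([rng.normal(m, s, k).sum() for k in P])
X_T = sigma * W + jump_sums - lam * m * T       # compensated jump integral

# int z^2 nu(dz) = lam * E[Z^2] = lam * (m^2 + s^2) for Z ~ N(m, s^2)
print("Monte Carlo E[X_T^2]:", np.mean(X_T**2))
print("Isometry    E[X_T^2]:", T * (sigma**2 + lam * (m**2 + s**2)))
```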

We now provide the multi-dimensional version of the Itô formula.

Theorem 10.2.4. Let X = (X_t)_{t≥0} be a d-dimensional Lévy-Itô process with components satisfying
\[
dX_t = \mu_t \, dt + \sigma_t \, dW_t + \int_{\mathbb{R}^m} \gamma_t(z) \, \widetilde{N}(dt, dz),
\]
where W is an n-dimensional Brownian motion, Ñ(dt, dz) = N(dt, dz) − ν(dz)dt is a k-dimensional compensated Poisson random measure on R^m and
\[
\mu : \mathbb{R}_+ \times \Omega \to \mathbb{R}^d, \qquad \sigma : \mathbb{R}_+ \times \Omega \to \mathbb{R}^{d \times n}, \qquad \gamma : \mathbb{R}_+ \times \Omega \times \mathbb{R}^m \to \mathbb{R}^{d \times k}.
\]
Component-wise, we can express the dynamics of X as
\[
dX_t^{(i)} = \mu_t^{(i)} \, dt + \sum_{j=1}^n \sigma_t^{(i,j)} \, dW_t^{(j)} + \sum_{j=1}^k \int_{\mathbb{R}^m} \gamma_t^{(i,j)}(z) \, \widetilde{N}^{(j)}(dt, dz), \qquad i = 1, 2, \ldots, d.
\]
Let f : R^d → R satisfy f ∈ C²(R^d). Then
\[
\begin{aligned}
df(X_t) &= \Big( \sum_{i=1}^d \mu_t^{(i)} \partial_{x_i} f(X_t) + \tfrac{1}{2} \sum_{i=1}^d \sum_{j=1}^d \big( \sigma_t \sigma_t^T \big)_{i,j} \partial_{x_i} \partial_{x_j} f(X_t) \Big) dt + \sum_{i=1}^d \sum_{j=1}^n \sigma_t^{(i,j)} \partial_{x_i} f(X_t) \, dW_t^{(j)} \\
&\quad + \sum_{j=1}^k \int_{\mathbb{R}^m} \Big( f\big( X_{t-} + \gamma_t^{(\cdot,j)}(z) \big) - f(X_{t-}) \Big) \widetilde{N}^{(j)}(dt, dz) \\
&\quad + \sum_{j=1}^k \int_{\mathbb{R}^m} \Big( f\big( X_{t-} + \gamma_t^{(\cdot,j)}(z) \big) - f(X_{t-}) - \gamma_t^{(\cdot,j)}(z) \cdot \nabla f(X_{t-}) \Big) \nu^{(j)}(dz) \, dt.
\end{aligned}
\]

I would not recommend memorizing the above formula. Simply book-mark this page. If you work with
Lévy-Itô processes for a sufficient amount of time (about 5 years for the author of these notes), you will
eventually have the above formula fixed in your mind.

Theorem 10.2.5 (Quadratic covariation). Let X = (X_t)_{t≥0} and Y = (Y_t)_{t≥0} be given by
\[
\begin{aligned}
dX_t &= \mu_t \, dt + \sum_{j=1}^n \sigma_t^{(j)} \, dW_t^{(j)} + \sum_{j=1}^k \int_{\mathbb{R}^m} \gamma_t^{(j)}(z) \, \widetilde{N}^{(j)}(dt, dz), \tag{10.16} \\
dY_t &= b_t \, dt + \sum_{j=1}^n a_t^{(j)} \, dW_t^{(j)} + \sum_{j=1}^k \int_{\mathbb{R}^m} g_t^{(j)}(z) \, \widetilde{N}^{(j)}(dt, dz).
\end{aligned}
\]
Here, X and Y are both one-dimensional processes, W is an n-dimensional Brownian motion, Ñ is a k-dimensional compensated Poisson random measure on R^m and
\[
\mu_t, b_t \in \mathbb{R}, \qquad \sigma_t, a_t \in \mathbb{R}^n, \qquad \gamma_t(\cdot), g_t(\cdot) : \mathbb{R}^m \to \mathbb{R}^k.
\]
Recall Definition 7.3.4 of quadratic covariation. The quadratic covariation of X and Y up to time T, denoted [X, Y]_T, is given by
\[
\begin{aligned}
[X, Y]_T &= \int_0^T \sum_{j=1}^n \sigma_t^{(j)} a_t^{(j)} \, dt + \int_0^T \sum_{j=1}^k \int_{\mathbb{R}^m} \gamma_t^{(j)}(z) \cdot g_t^{(j)}(z) \, N^{(j)}(dt, dz) \\
&= \int_0^T \Big( \sum_{j=1}^n \sigma_t^{(j)} a_t^{(j)} + \sum_{j=1}^k \int_{\mathbb{R}^m} \gamma_t^{(j)}(z) \cdot g_t^{(j)}(z) \, \nu^{(j)}(dz) \Big) dt \\
&\quad + \int_0^T \sum_{j=1}^k \int_{\mathbb{R}^m} \gamma_t^{(j)}(z) \cdot g_t^{(j)}(z) \, \widetilde{N}^{(j)}(dt, dz).
\end{aligned}
\]

We will not prove Theorem 10.2.5. Rather, we simply note that the above result relies on the following facts
\[
[W^{(i)}, N^{(j)}(\cdot, A)]_t = 0, \qquad [N^{(i)}(\cdot, A), N^{(j)}(\cdot, B)]_t = \delta_{i,j} \, N^{(i)}(t, A \cap B), \qquad [N^{(i)}(\cdot, B), \mathrm{Id}]_t = 0,
\]
in addition to the previously derived facts
\[
[W^{(i)}, W^{(j)}]_t = \delta_{i,j} \, t, \qquad [W^{(i)}, \mathrm{Id}]_t = 0, \qquad [\mathrm{Id}, \mathrm{Id}]_t = 0.
\]
Heuristically, one can derive Theorem 10.2.5 by writing d[X, Y]_t = dX_t dY_t and using the rules
\[
\begin{aligned}
dW_t^{(i)} \, dW_t^{(j)} &= \delta_{i,j} \, dt, \qquad & \widetilde{N}^{(i)}(dt, dz) \, \widetilde{N}^{(j)}(dt, dy) &= \delta_{i,j} \, \delta(z - y) \, N^{(i)}(dt, dz) \, dy, \\
dW_t^{(i)} \, dt &= 0, & dW_t^{(i)} \, \widetilde{N}^{(j)}(dt, dz) &= 0, \\
dt \, dt &= 0, & \widetilde{N}^{(i)}(dt, dz) \, dt &= 0.
\end{aligned}
\]

Example 10.2.6. Suppose X is given by (10.16). Then, from Theorem 10.2.5, the quadratic variation of X, denoted [X, X]_T, is given by
\[
\begin{aligned}
[X, X]_T &= \int_0^T \sum_{j=1}^n \big( \sigma_t^{(j)} \big)^2 \, dt + \int_0^T \sum_{j=1}^k \int_{\mathbb{R}^m} \big( \gamma_t^{(j)}(z) \big)^2 \, N^{(j)}(dt, dz) \\
&= \int_0^T \Big( \sum_{j=1}^n \big( \sigma_t^{(j)} \big)^2 + \sum_{j=1}^k \int_{\mathbb{R}^m} \big( \gamma_t^{(j)}(z) \big)^2 \, \nu^{(j)}(dz) \Big) dt \\
&\quad + \int_0^T \sum_{j=1}^k \int_{\mathbb{R}^m} \big( \gamma_t^{(j)}(z) \big)^2 \, \widetilde{N}^{(j)}(dt, dz).
\end{aligned}
\]

Example 10.2.7. Some texts define the quadratic covariation of two semimartingales X and Y as the unique process [X, Y]_T that satisfies
\[
X_T Y_T = X_0 Y_0 + \int_0^T X_{t-} \, dY_t + \int_0^T Y_{t-} \, dX_t + [X, Y]_T.
\]
Let us check that this definition yields the same result as Theorem 10.2.5. For simplicity, let us take
\[
\begin{aligned}
dX_t &= \mu_t \, dt + \sigma_t \, dW_t + \int_{\mathbb{R}^m} \gamma_t(z) \, \widetilde{N}(dt, dz), \\
dY_t &= b_t \, dt + a_t \, dW_t + \int_{\mathbb{R}^m} g_t(z) \, \widetilde{N}(dt, dz),
\end{aligned}
\]
where W is a one-dimensional Brownian motion and Ñ is a one-dimensional compensated Poisson random measure on R^m. Let f(X_t, Y_t) = X_t Y_t and observe that
\[
\partial_x f = Y_t, \qquad \partial_y f = X_t, \qquad \partial_x \partial_y f = 1, \qquad \partial_x^2 f = 0, \qquad \partial_y^2 f = 0,
\]
and also that
\[
f(X_{t-} + \gamma_t(z), Y_{t-} + g_t(z)) - f(X_{t-}, Y_{t-}) = X_{t-} g_t(z) + Y_{t-} \gamma_t(z) + \gamma_t(z) g_t(z).
\]
Plugging all of this into the multidimensional Itô formula (the cross-derivative term ∂_x∂_y f = 1 contributes σ_t a_t dt) yields
\[
\begin{aligned}
d(X_t Y_t) &= Y_t (\mu_t \, dt + \sigma_t \, dW_t) + X_t (b_t \, dt + a_t \, dW_t) + \sigma_t a_t \, dt \\
&\quad + \int_{\mathbb{R}^m} \Big( X_{t-} g_t(z) + Y_{t-} \gamma_t(z) + \gamma_t(z) g_t(z) \Big) \widetilde{N}(dt, dz) \\
&\quad + \int_{\mathbb{R}^m} \Big( X_{t-} g_t(z) + Y_{t-} \gamma_t(z) + \gamma_t(z) g_t(z) - X_{t-} g_t(z) - Y_{t-} \gamma_t(z) \Big) \nu(dz) \, dt \\
&= Y_{t-} \Big( \mu_t \, dt + \sigma_t \, dW_t + \int_{\mathbb{R}^m} \gamma_t(z) \, \widetilde{N}(dt, dz) \Big) + X_{t-} \Big( b_t \, dt + a_t \, dW_t + \int_{\mathbb{R}^m} g_t(z) \, \widetilde{N}(dt, dz) \Big) \\
&\quad + \sigma_t a_t \, dt + \int_{\mathbb{R}^m} \gamma_t(z) g_t(z) \, \widetilde{N}(dt, dz) + \int_{\mathbb{R}^m} \gamma_t(z) g_t(z) \, \nu(dz) \, dt \\
&= Y_{t-} \, dX_t + X_{t-} \, dY_t + d[X, Y]_t.
\end{aligned}
\]
Thus, we have verified that the new definition of quadratic covariation agrees (at least, for the processes in this example) with Theorem 10.2.5.

10.3 Lévy-Itô SDE


We now introduce a Lévy-Itô stochastic differential equation (SDE), which is an equation of the form
\[
dX_t = \mu(t, X_t) \, dt + \sigma(t, X_t) \, dW_t + \int_{\mathbb{R}^m} \gamma(t, X_{t-}, z) \, \widetilde{N}(dt, dz), \qquad X_0 = x \in \mathbb{R}^d, \tag{10.17}
\]
where W is an n-dimensional Brownian motion, Ñ(dt, dz) is a k-dimensional compensated Poisson random measure on R^m, and
\[
\mu : \mathbb{R}_+ \times \mathbb{R}^d \to \mathbb{R}^d, \qquad \sigma : \mathbb{R}_+ \times \mathbb{R}^d \to \mathbb{R}^{d \times n}, \qquad \gamma : \mathbb{R}_+ \times \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^{d \times k}.
\]
For i = 1, 2, . . . , d, the dynamics of the i-th component of X are then given by
\[
dX_t^{(i)} = \mu^{(i)}(t, X_t) \, dt + \sum_{j=1}^n \sigma^{(i,j)}(t, X_t) \, dW_t^{(j)} + \sum_{j=1}^k \int_{\mathbb{R}^m} \gamma^{(i,j)}(t, X_{t-}, z) \, \widetilde{N}^{(j)}(dt, dz),
\]
with X₀^{(i)} = x_i ∈ R.

Definition 10.3.1. A solution (specifically, a strong solution) of (10.17) is a functional F of the Brownian motion W and the compensated Poisson random measure Ñ,
\[
X_t = F\big( W_s, \widetilde{N}(s, dz) ; \ 0 \leq s \leq t \big),
\]
such that
\[
X_T = x + \int_0^T \mu(t, X_t) \, dt + \int_0^T \sigma(t, X_t) \, dW_t + \int_0^T \int_{\mathbb{R}^m} \gamma(t, X_{t-}, z) \, \widetilde{N}(dt, dz), \qquad \forall \ T \geq 0.
\]

Theorem 10.3.2. Consider the one-dimensional process X driven by a single Brownian motion and a single Poisson random measure on R. If there exist constants C₁ and C₂ such that the linear growth condition
\[
\mu^2(t, x) + \sigma^2(t, x) + \int_{\mathbb{R}} \gamma^2(t, x, z) \, \nu(dz) < C_1 (1 + x^2), \qquad \forall \ t \geq 0,
\]
and the Lipschitz continuity condition
\[
|\mu(t, x) - \mu(t, y)|^2 + |\sigma(t, x) - \sigma(t, y)|^2 + \int_{\mathbb{R}} |\gamma(t, x, z) - \gamma(t, y, z)|^2 \, \nu(dz) < C_2 |x - y|^2, \qquad \forall \ t \geq 0, \ x, y \in \mathbb{R},
\]
hold, then SDE (10.17) has a unique strong solution adapted to F_t := σ(X_s : 0 ≤ s ≤ t) (i.e., the solution X of SDE (10.17) is a Markov process). The conditions for higher dimensions are analogous.

Since X, the solution of (10.17), is Markov, we can define a corresponding semigroup P(t, s) and infinitesimal generator A(t). For a sufficiently nice function ϕ : R^d → R, we have
\[
\begin{aligned}
P(t, s) \varphi(x) &:= \mathbb{E}[\varphi(X_s) \,|\, X_t = x], \qquad 0 \leq t \leq s < \infty, \\
\mathcal{A}(t) \varphi(x) &:= \lim_{s \searrow t} \frac{1}{s - t} \Big( P(t, s) \varphi(x) - \varphi(x) \Big).
\end{aligned}
\]

Theorem 10.3.3. Suppose X is given by SDE (10.17). Then
\[
\begin{aligned}
\mathcal{A}(t) &= \sum_{i=1}^d \mu_i(t, x) \partial_{x_i} + \tfrac{1}{2} \sum_{i=1}^d \sum_{j=1}^d \big( \sigma \sigma^T \big)_{i,j}(t, x) \partial_{x_i} \partial_{x_j} \\
&\quad + \sum_{j=1}^k \int_{\mathbb{R}^m} \nu^{(j)}(dz) \Big( \theta_{\gamma^{(\cdot,j)}(t,x,z)} - 1 - \gamma^{(\cdot,j)}(t, x, z) \cdot \nabla \Big),
\end{aligned}
\]
where θ_γ is the shift operator, θ_γ f(x) := f(x + γ), and C₀²(R^d) ⊂ dom(A(t)).

Proof. The proof is the same as in the no-jump case. First, write ϕ(X_s) = ϕ(x) + ∫ₜˢ dϕ(X_u). Next, use the Itô formula to write dϕ(X_u) explicitly as terms involving du, dW_u and Ñ(du, dz). Finally, take an expectation, note that integrals with respect to W and Ñ are martingales, and send s ↘ t. We omit the details.

We can now write the Itô formula for the solution X of the Lévy-Itô SDE (10.17) in a more compact form
\[
d\varphi(X_t) = \mathcal{A}(t) \varphi(X_t) \, dt + \nabla \varphi(X_t) \cdot \sigma(t, X_t) \, dW_t + \int_{\mathbb{R}^m} \Big( \varphi\big( X_{t-} + \gamma(t, X_{t-}, z) \big) - \varphi(X_{t-}) \Big) \widetilde{N}(dt, dz).
\]

Theorem 10.3.4. Let X = (X_t)_{t≥0} satisfy
\[
dX_t = \mu(t, X_t) \, dt + \sigma(t, X_t) \, dW_t + \int_{\mathbb{R}^m} \gamma(t, X_{t-}, z) \, \widetilde{N}(dt, dz).
\]
Then the function u(t, x) := E[ϕ(X_T) | X_t = x] satisfies the Kolmogorov backward equation
\[
(\partial_t + \mathcal{A}(t)) u(t, \cdot) = 0, \qquad u(T, \cdot) = \varphi.
\]
Proof. The proof is exactly analogous to the diffusion case. First, show that E[ϕ(X_T)|F_t] = u(t, X_t) is a martingale. Take the differential of u(t, X_t) and set the dt-term equal to zero to obtain the partial integro-differential equation that must be satisfied by u. The terminal condition is obtained from E[ϕ(X_T)|F_T] = ϕ(X_T) = u(T, X_T).

Example 10.3.5. Suppose X = (X_t)_{t≥0} is the following one-dimensional Lévy process
\[
dX_t = \mu \, dt + \sigma \, dW_t + \int_{\mathbb{R}} z \, \widetilde{N}(dt, dz),
\]
where W is a one-dimensional Brownian motion and Ñ is a one-dimensional compensated Poisson random measure on R. We wish to find an expression for u(t, x) := E[ϕ(X_T) | X_t = x], which satisfies the KBE; in this case
\[
(\partial_t + \mathcal{A}) u(t, \cdot) = 0, \qquad u(T, \cdot) = \varphi, \qquad \mathcal{A} = \mu \partial_x + \tfrac{1}{2} \sigma^2 \partial_x^2 + \int_{\mathbb{R}} \nu(dz) \big( \theta_z - 1 - z \partial_x \big).
\]

It will be useful to define the Fourier and inverse Fourier transforms
\[
\begin{aligned}
\text{Fourier transform} &: \quad \mathcal{F}[\varphi](\xi) = \widehat{\varphi}(\xi) := \int_{\mathbb{R}} e^{-i \xi x} \varphi(x) \, dx, \\
\text{Inverse transform} &: \quad \mathcal{F}^{-1}[\widehat{\varphi}](x) := \frac{1}{2\pi} \int_{\mathbb{R}} e^{i \xi x} \widehat{\varphi}(\xi) \, d\xi.
\end{aligned}
\]
Now, note that
\[
\begin{aligned}
\mathcal{F}[\partial_t u(t, \cdot)](\xi) &= \int_{\mathbb{R}} e^{-i \xi x} \partial_t u(t, x) \, dx = \partial_t \int_{\mathbb{R}} e^{-i \xi x} u(t, x) \, dx = \partial_t \widehat{u}(t, \xi), \\
\mathcal{F}[\mathcal{A} u(t, \cdot)](\xi) &= \int_{\mathbb{R}} e^{-i \xi x} \mathcal{A} u(t, x) \, dx = \int_{\mathbb{R}} u(t, x) \, \mathcal{A}^* e^{-i \xi x} \, dx \\
&= \int_{\mathbb{R}} u(t, x) \, \psi(\xi) e^{-i \xi x} \, dx = \psi(\xi) \, \widehat{u}(t, \xi),
\end{aligned}
\]

where A* is the L²(R, dx) adjoint of A and ψ(ξ) is the characteristic exponent of X. Specifically,
\[
\begin{aligned}
\mathcal{A}^* &= -\mu \partial_x + \tfrac{1}{2} \sigma^2 \partial_x^2 + \int_{\mathbb{R}} \nu(dz) \big( \theta_{-z} - 1 + z \partial_x \big), \\
\psi(\xi) &= \mu i \xi - \tfrac{1}{2} \sigma^2 \xi^2 + \int_{\mathbb{R}} \nu(dz) \big( e^{i \xi z} - 1 - i \xi z \big)
\end{aligned}
\]
(check that you can derive A* and ψ(ξ) for yourself!). Taking the Fourier transform of the KBE and using the above results, we obtain an ODE in t for û(t, ξ)
\[
(\partial_t + \psi(\xi)) \, \widehat{u}(t, \xi) = 0, \qquad \widehat{u}(T, \xi) = \widehat{\varphi}(\xi).
\]

The solution of this ODE is
\[
\widehat{u}(t, \xi) = e^{(T - t) \psi(\xi)} \, \widehat{\varphi}(\xi).
\]
Finally, we obtain u as the inverse Fourier transform of û
\[
u(t, x) = \mathcal{F}^{-1}[\widehat{u}(t, \cdot)](x) = \frac{1}{2\pi} \int_{\mathbb{R}} e^{i \xi x + (T - t) \psi(\xi)} \, \widehat{\varphi}(\xi) \, d\xi,
\]
where we have assumed, for simplicity, that ϕ = F⁻¹[ϕ̂]. This is usually satisfied if ϕ is continuous and in L¹(R, dx).
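The Fourier representation is easy to implement numerically. The sketch below (illustrative parameters; the payoff ϕ(x) = e^{−x²/2} is chosen because ϕ̂(ξ) = √(2π)e^{−ξ²/2} is known in closed form, and ν = λ·N(m, s²) so ψ is closed form) evaluates u(t, x) by quadrature and checks it against a Monte Carlo estimate of E[ϕ(X_T)|X_t = x]:

```python
import numpy as np

# Model: dX = mu dt + sigma dW + int z Ntilde(dt, dz), nu = lam * N(m, s^2)
mu, sigma, lam, m, s = 0.1, 0.2, 2.0, -0.1, 0.15
t, T, x = 0.0, 1.0, 0.0

def psi(xi):
    # characteristic exponent; the jump integral is closed form for Gaussian nu
    jump = lam * (np.exp(1j * xi * m - 0.5 * s**2 * xi**2) - 1.0 - 1j * xi * m)
    return 1j * mu * xi - 0.5 * sigma**2 * xi**2 + jump

# phi(x) = exp(-x^2/2), with phihat(xi) = sqrt(2*pi) * exp(-xi^2/2)
xi = np.linspace(-40, 40, 20001)
phihat = np.sqrt(2 * np.pi) * np.exp(-0.5 * xi**2)
integrand = np.exp(1j * xi * x + (T - t) * psi(xi)) * phihat
u = np.real(np.sum(integrand)) * (xi[1] - xi[0]) / (2 * np.pi)
print("Fourier     u(0, 0):", u)

# Monte Carlo check of u(t, x) = E[phi(X_T) | X_t = x]
rng = np.random.default_rng(4)
n = 200_000
W = rng.normal(0.0, np.sqrt(T - t), n)
P = rng.poisson(lam * (T - t), n)
J = np.array([rng.normal(m, s, k).sum() for k in P])
X_T = x + mu * (T - t) + sigma * W + J - lam * m * (T - t)
print("Monte Carlo u(0, 0):", np.mean(np.exp(-0.5 * X_T**2)))
```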

10.4 Change of measure


Assumption 10.4.1. Throughout Section 10.4, W is a one-dimensional Brownian motion and Ñ is a one-dimensional compensated Poisson random measure on R. All processes are scalar, unless specifically stated otherwise.

Review of the Girsanov change of measure for an Itô process

Fix a probability space (Ω, F, P) and let F = (F_t)_{t≥0} be a filtration on this space. In Section 1.7 we showed that if Z is a random variable that satisfies
\[
Z \geq 0, \qquad \mathbb{E} Z = 1,
\]
then we can define a new probability measure P̃ using
\[
\widetilde{\mathbb{P}}(A) = \mathbb{E} \, Z \mathbb{1}_A, \qquad \forall \ A \in \mathcal{F},
\]
and we call Z = dP̃/dP the Radon-Nikodym derivative of P̃ with respect to P. Moreover, if Z > 0, then probabilities under P can be obtained from probabilities under P̃ using
\[
\mathbb{P}(A) = \widetilde{\mathbb{E}} \, \frac{1}{Z} \mathbb{1}_A, \qquad \forall \ A \in \mathcal{F},
\]
and thus, we identify 1/Z = dP/dP̃ as the Radon-Nikodym derivative of P with respect to P̃.

In Chapter 8, we defined a Radon-Nikodym derivative process (Definition 8.5.1) by fixing a time horizon T and setting
\[
Z_t := \mathbb{E}[Z \,|\, \mathcal{F}_t], \qquad 0 \leq t \leq T.
\]
We showed (Lemma 8.5.4) that
\[
\widetilde{\mathbb{E}}[Y \,|\, \mathcal{F}_s] = \frac{1}{Z_s} \mathbb{E}[Z_t Y \,|\, \mathcal{F}_s], \qquad Y \in \mathcal{F}_t, \qquad 0 \leq s \leq t \leq T.
\]
In particular, if a process Y = (Y_t)_{0≤t≤T} is adapted to (F_t)_{0≤t≤T} and (Y_t Z_t)_{0≤t≤T} is a martingale under P, then the process (Y_t)_{0≤t≤T} is a martingale under P̃ since
\[
\widetilde{\mathbb{E}}[Y_t \,|\, \mathcal{F}_s] = \frac{1}{Z_s} \mathbb{E}[Z_t Y_t \,|\, \mathcal{F}_s] = \frac{1}{Z_s} Z_s Y_s = Y_s. \tag{10.18}
\]
As a specific example of this machinery, we showed (Girsanov's Theorem 8.5.5) that if we defined
\[
Z = \frac{d\widetilde{\mathbb{P}}}{d\mathbb{P}} = Z_T, \qquad Z_t = \exp\Big( -\int_0^t \tfrac{1}{2} \Theta_s^2 \, ds - \int_0^t \Theta_s \, dW_s \Big),
\]

then, under P̃, the process
\[
\widetilde{W}_t = W_t + \int_0^t \Theta_s \, ds, \qquad 0 \leq t \leq T,
\]
is a martingale (in fact, a Brownian motion). We are now going to apply this machinery to a Lévy-Itô process.

Girsanov change of measure for Lévy-Itô processes


As changes of measure are more complicated for Lévy-Itô processes than they are for continuous processes,
we present Girsanov’s Theorem in three parts.

Theorem 10.4.2 (Girsanov theorem for Lévy-Itô processes (part I)). On a probability space (Ω, F, P), let F = (F_t)_{0≤t≤T} be a filtration and let X = (X_t)_{0≤t≤T} be a Lévy-Itô process of the form
\[
dX_t = \mu_t \, dt + \sigma_t \, dW_t + \int_{\mathbb{R}} \gamma_t(z) \, \widetilde{N}(dt, dz), \tag{10.19}
\]
where W is a Brownian motion and Ñ is a compensated Poisson random measure with associated Lévy measure ν. Suppose there exist F-adapted processes β = (β_t)_{0≤t≤T} and η(z) = (η_t(z))_{0≤t≤T} such that
\[
\mu_t = \sigma_t \beta_t - \int_{\mathbb{R}} \gamma_t(z) \big( e^{\eta_t(z)} - 1 \big) \nu(dz). \tag{10.20}
\]
Define a process (Z_t)_{0≤t≤T} by
\[
Z_t := \exp\Big( \int_0^t \alpha_s \, ds - \int_0^t \beta_s \, dW_s + \int_0^t \int_{\mathbb{R}} \eta_s(z) \, \widetilde{N}(ds, dz) \Big), \qquad \alpha_t = -\tfrac{1}{2} \beta_t^2 - \int_{\mathbb{R}} \Big( e^{\eta_t(z)} - 1 - \eta_t(z) \Big) \nu(dz), \tag{10.21}
\]
and assume that (β, η) are such that EZ_T < ∞. Set Z = dP̂/dP = Z_T. Then X is a local martingale under P̂.

Proof. First, we observe that (Z_t)_{0≤t≤T} is a martingale. To see this, simply note that
\[
dZ_t = Z_{t-} \Big( -\beta_t \, dW_t + \int_{\mathbb{R}} \Big( e^{\eta_t(z)} - 1 \Big) \widetilde{N}(dt, dz) \Big).
\]
Thus, we have
\[
1 = Z_0 = \mathbb{E} Z_T = \mathbb{E} Z, \qquad Z_t = \mathbb{E}[Z_T \,|\, \mathcal{F}_t] = \mathbb{E}[Z \,|\, \mathcal{F}_t].
\]



Since EZ = 1 and, by construction, Z > 0, the random variable Z defines a Radon-Nikodym derivative dP̂/dP. Also, since Z_t = E[Z|F_t], we see that (Z_t)_{0≤t≤T} is a Radon-Nikodym derivative process. Now, note that (X_t)_{0≤t≤T} is adapted to the filtration F. In light of (10.18), to show that (X_t)_{0≤t≤T} is a martingale under P̂ we need only show that (X_t Z_t)_{0≤t≤T} is a martingale under P. To this end we compute
\[
\begin{aligned}
d(X_t Z_t) &= X_t \, dZ_t + Z_t \, dX_t + d[X, Z]_t \\
&= X_t \, dZ_t + Z_{t-} \Big( \mu_t \, dt + \sigma_t \, dW_t + \int_{\mathbb{R}} \gamma_t(z) \, \widetilde{N}(dt, dz) \Big) \\
&\quad + Z_{t-} \Big( -\sigma_t \beta_t + \int_{\mathbb{R}} \gamma_t(z) \big( e^{\eta_t(z)} - 1 \big) \nu(dz) \Big) dt + Z_{t-} \int_{\mathbb{R}} \gamma_t(z) \big( e^{\eta_t(z)} - 1 \big) \widetilde{N}(dt, dz) \\
&= X_t \, dZ_t + Z_{t-} \Big( \sigma_t \, dW_t + \int_{\mathbb{R}} \gamma_t(z) \, \widetilde{N}(dt, dz) \Big) + Z_{t-} \int_{\mathbb{R}} \gamma_t(z) \big( e^{\eta_t(z)} - 1 \big) \widetilde{N}(dt, dz),
\end{aligned}
\]
where, in the last line, we used (10.20). Since integrals with respect to dZ_t, dW_t and Ñ(dt, dz) are martingales under P, we see that (X_t Z_t)_{0≤t≤T} is a martingale under P. And thus, (X_t)_{0≤t≤T} is a martingale under P̂.

Theorem 10.4.3 (Girsanov theorem for Lévy-Itô processes (part II)). On a probability space (Ω, F, P), let Z = dP̂/dP and (Z_t)_{t≥0} be as defined in Theorem 10.4.2. Define
\[
d\widehat{W}_t := \beta_t \, dt + dW_t, \tag{10.22}
\]
\[
\widehat{N}(dt, dz) := \widetilde{N}(dt, dz) - \big( e^{\eta_t(z)} - 1 \big) \nu(dz) \, dt = N(dt, dz) - e^{\eta_t(z)} \nu(dz) \, dt. \tag{10.23}
\]
Then Ŵ is a Brownian motion under P̂ and N̂ is a compensated Poisson random measure under P̂ in the sense that the process M = (M_t)_{0≤t≤T}, defined by
\[
M_t := \int_0^t \int_{\mathbb{R}} \gamma_s(z) \, \widehat{N}(ds, dz),
\]
is a martingale for all F_t-adapted processes (γ_t(z))_{0≤t≤T} satisfying
\[
\mathbb{E} \int_0^T \int_{\mathbb{R}} \gamma_t^2(z) \, e^{\eta_t(z)} \nu(dz) \, dt < \infty.
\]

Proof. For any ε ∈ [0, 1] we define
\[
X_t^{\varepsilon} := \varepsilon \widehat{W}_t + M_t.
\]
Writing all terms explicitly, we have
\[
\begin{aligned}
dX_t^{\varepsilon} &= \varepsilon \, d\widehat{W}_t + \int_{\mathbb{R}} \gamma_t(z) \, \widehat{N}(dt, dz) \\
&= \mu_t^{\varepsilon} \, dt + \varepsilon \, dW_t + \int_{\mathbb{R}} \gamma_t(z) \, \widetilde{N}(dt, dz), \qquad \mu_t^{\varepsilon} = \varepsilon \beta_t - \int_{\mathbb{R}} \gamma_t(z) \big( e^{\eta_t(z)} - 1 \big) \nu(dz).
\end{aligned}
\]
By Theorem 10.4.2, we see that, for all ε ∈ [0, 1], the process X^ε is a martingale under P̂. In particular, X⁰ = M is a martingale under P̂, as claimed. Next, note that
\[
X_t^1 - M_t = \widehat{W}_t
\]
is a martingale under P̂ (since X¹ and M are martingales under P̂). Moreover, Ŵ is continuous and [Ŵ, Ŵ]_t = t. Thus, by Lévy's characterization of Brownian motion (Theorem 8.3.7), we conclude that Ŵ is a Brownian motion under P̂.

Theorem 10.4.4 (Girsanov theorem for Lévy-Itô processes (part III)). On a probability space (Ω, F, P), equipped with a filtration F = (F_t)_{0≤t≤T}, let X and Z = dP̂/dP be as in equations (10.19) and (10.21), respectively. Then X has the representation
\[
dX_t = \sigma_t \, d\widehat{W}_t + \int_{\mathbb{R}} \gamma_t(z) \, \widehat{N}(dt, dz), \tag{10.24}
\]
where Ŵ and N̂ are defined in (10.22) and (10.23), and X is a martingale under P̂.

Proof. To show that X has the representation (10.24) we simply note that
\[
\begin{aligned}
dX_t &= \mu_t \, dt + \sigma_t \, dW_t + \int_{\mathbb{R}} \gamma_t(z) \, \widetilde{N}(dt, dz) \\
&= \mu_t \, dt + \sigma_t \Big( d\widehat{W}_t - \beta_t \, dt \Big) + \int_{\mathbb{R}} \gamma_t(z) \Big( \widehat{N}(dt, dz) + \big( e^{\eta_t(z)} - 1 \big) \nu(dz) \, dt \Big) \\
&= \sigma_t \, d\widehat{W}_t + \int_{\mathbb{R}} \gamma_t(z) \, \widehat{N}(dt, dz),
\end{aligned}
\]
where, in the second line, we used (10.22) and (10.23), and in the last line we used (10.20). Since Ŵ is a Brownian motion under P̂ and N̂ is a compensated Poisson random measure under P̂ with associated Lévy measure e^{η_t(z)}ν(dz), it follows that X is a martingale under P̂.

Example 10.4.5. Consider a pure-jump Lévy process X = (X_t)_{0≤t≤T},
\[
dX_t = \mu \, dt + \int_{\mathbb{R}} z \, \widetilde{N}(dt, dz), \qquad \widetilde{N}(dt, dz) = N(dt, dz) - \nu(dz) \, dt.
\]
We wish to find a change of measure Z = dP̂/dP such that, under P̂, jumps of size dz arrive with intensity ν̂(dz). From Theorem 10.4.2, equation (10.21), and the time-homogeneity of X, we see that Z must be of the form
\[
Z_t := \exp\Big( \int_0^t \alpha \, ds + \int_0^t \int_{\mathbb{R}} \eta(z) \, \widetilde{N}(ds, dz) \Big), \qquad \alpha = -\int_{\mathbb{R}} \Big( e^{\eta(z)} - 1 - \eta(z) \Big) \nu(dz). \tag{10.25}
\]
With Z given by (10.25) we have that
\[
\widehat{N}(dt, dz) = \widetilde{N}(dt, dz) - \big( e^{\eta(z)} - 1 \big) \nu(dz) \, dt = N(dt, dz) - e^{\eta(z)} \nu(dz) \, dt
\]
is a compensated Poisson random measure. We want
\[
e^{\eta(z)} \nu(dz) = \widehat{\nu}(dz).
\]
Thus, we identify
\[
\eta(z) = \log \frac{\widehat{\nu}(dz)}{\nu(dz)}.
\]
Note that, in order for η to be well-defined we must have 0 < dν̂/dν < ∞, which holds if and only if ν̂ ∼ ν (i.e., the measures are equivalent). Re-writing the dynamics of X as follows
\[
\begin{aligned}
dX_t &= \mu \, dt + \int_{\mathbb{R}} z \Big( \widehat{N}(dt, dz) + \big( e^{\eta(z)} - 1 \big) \nu(dz) \, dt \Big) \\
&= \mu \, dt + \int_{\mathbb{R}} z \, \widehat{N}(dt, dz) + \int_{\mathbb{R}} z \Big( \frac{\widehat{\nu}(dz)}{\nu(dz)} - 1 \Big) \nu(dz) \, dt \\
&= \mu \, dt + \int_{\mathbb{R}} z \, \widehat{N}(dt, dz) + \int_{\mathbb{R}} z \, \big( \widehat{\nu}(dz) - \nu(dz) \big) \, dt \\
&= \widehat{\mu} \, dt + \int_{\mathbb{R}} z \, \widehat{N}(dt, dz), \qquad \widehat{\mu} = \mu + \int_{\mathbb{R}} z \, \big( \widehat{\nu}(dz) - \nu(dz) \big),
\end{aligned}
\]
we see that X will be a martingale under P̂ if and only if µ̂ = 0. Suppose, for example, that ν = λF, where λ = ν(R) is the intensity of all jumps (assumed finite) and F(dz) = ν(dz)/ν(R) is the jump distribution. If we wish to change only the intensity of jumps (say, from λ to λ̂) and not the distribution of jumps F, then, if we want X to be a martingale under P̂, we must have
\[
\widehat{\lambda} - \lambda = -\mu \cdot \Big( \int_{\mathbb{R}} z \, F(dz) \Big)^{-1}.
\]
Note that, had we included a Brownian component in the P dynamics of X, we would have complete freedom to change the jump measure, since any non-zero drift could be absorbed into the drift of the Brownian motion under the change of measure.
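As a numerical illustration (illustrative parameters; F = N(m, s²) so ∫ z F(dz) = m), the following sketch computes λ̂ and verifies by simulation that X is centered under P̂:

```python
import numpy as np

rng = np.random.default_rng(6)

# Pure-jump X with drift mu and jumps z ~ F = N(m, s^2), intensity lam under P.
mu, lam, m, s, T = -0.3, 2.0, 0.1, 0.05, 1.0

# Keeping F fixed and changing only the intensity, the martingale condition
# mu_hat = 0 forces lam_hat = lam - mu / int z F(dz) = lam - mu/m.
lam_hat = lam - mu / m
print("lam_hat =", lam_hat)

# Under P_hat the jumps arrive with intensity lam_hat, while the compensator
# in Ntilde is still the *original* nu, i.e. lam*m*t, so
# X_T = mu*T + (sum of jumps at rate lam_hat) - lam*m*T has mean zero.
n = 200_000
J = np.array([rng.normal(m, s, k).sum() for k in rng.poisson(lam_hat * T, n)])
X_T = mu * T + J - lam * m * T
print("E_hat[X_T] ~", X_T.mean())   # should be close to 0
```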

10.5 Hawkes processes


Recall that if P = (P_t)_{t≥0} is a Poisson process with intensity λ, then we have
\[
\mathbb{E}[dP_t \,|\, \mathcal{F}_t] = \mathbb{E} \, dP_t = \lambda \, dt.
\]



Note that the probability of a jump, P(dP_t = 1) = E dP_t, in the instant dt is entirely independent of the history F_t of P. This memorylessness property is convenient from a computational point of view and is necessary in order for P to be Markov. But, from a modeling perspective, limiting our analysis to Poisson processes (or, more generally, Poisson random measures) is very restrictive. For example, if we want to model the arrival of earthquakes in Seattle, it is well-known that the occurrence of an earthquake today will increase the probability of an earthquake tomorrow. Thus, it would be useful to have some way of modeling events whose arrival intensities are history-dependent. This is precisely what a Hawkes process will allow us to do.

Definition 10.5.1. A Hawkes process N = (N_t)_{t≥0} is a simple counting process satisfying
\[
\mathbb{E}[dN_t \,|\, \mathcal{F}_t] = \lambda_t \, dt, \qquad \lambda_t := \Lambda\Big( \int_0^{t-} h(t - s) \, dN_s \Big), \tag{10.26}
\]
where h(·) : R₊ → R₊ is the kernel (sometimes also called an exciting function) and Λ(·) : R₊ → R₊ is locally integrable. The intensity λ_t is an F_t-measurable random variable, where F_t is the natural filtration of N. When the function Λ(·) is linear, the process N is known as a linear Hawkes process; otherwise, it is called a nonlinear Hawkes process.

If we denote by τ₁, τ₂, . . . the (random) times that N jumps, then the intensity at time t is given by
\[
\lambda_t = \Lambda\Big( \sum_{\tau_i < t} h(t - \tau_i) \Big).
\]
Typically, h is a strictly decreasing function and Λ is increasing. Under these assumptions, the intensity λ_t is decreasing over intervals of the form [τ_i, τ_{i+1}).
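Hawkes processes are easy to simulate by Ogata's thinning algorithm. Here is a minimal Python sketch (illustrative parameters), specialized to a linear Hawkes process with exponential kernel h(t) = ae^{−bt}, for which the intensity can be updated recursively between events. The empirical mean event rate can be compared with the stationary mean ν(1 − a/b)⁻¹ derived later in this section.

```python
import numpy as np

rng = np.random.default_rng(5)

nu, a, b, T = 1.0, 0.8, 2.0, 5000.0   # note a/b < 1, so a stationary mean exists

def simulate_hawkes_exp(nu, a, b, T):
    """Ogata thinning for h(t) = a*exp(-b*t); between events the intensity is
    decreasing, so its current value dominates and can be updated recursively:
    lam(t0 + w) = nu + (lam(t0+) - nu) * exp(-b*w)."""
    times = []
    t, lam_plus = 0.0, nu                      # intensity just after current time
    while True:
        w = rng.exponential(1.0 / lam_plus)    # candidate inter-arrival time
        lam_cand = nu + (lam_plus - nu) * np.exp(-b * w)
        t += w
        if t > T:
            break
        if rng.uniform() < lam_cand / lam_plus:   # thinning acceptance step
            times.append(t)
            lam_plus = lam_cand + a            # self-excitation: intensity jumps by a
        else:
            lam_plus = lam_cand                # rejected: tighten bound, continue
    return np.array(times)

events = simulate_hawkes_exp(nu, a, b, T)
print("empirical mean intensity:", len(events) / T)
print("stationary mean lam_bar :", nu / (1.0 - a / b))
```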

For reasons of analytic tractability, in what follows we concentrate on linear Hawkes processes. Specifically, let us assume
\[
\Lambda(z) = \nu + z,
\]
where ν > 0 is a strictly positive constant. Thus, from (10.26), we have
\[
\lambda_t = \nu + \int_0^{t-} h(t - s) \, dN_s \tag{10.27}
\]
\[
\phantom{\lambda_t} = \nu + \int_0^{t-} h(t - s) \, d\widetilde{N}_s + \int_0^{t-} h(t - s) \, \lambda_s \, ds, \tag{10.28}
\]
where we have introduced the compensated Hawkes process, whose differential is given by
\[
d\widetilde{N}_s = dN_s - \lambda_s \, ds.
\]

As, by construction, the expected infinitesimal change of Ñ is zero,
\[
\mathbb{E}[d\widetilde{N}_t \,|\, \mathcal{F}_t] = \mathbb{E}[dN_t \,|\, \mathcal{F}_t] - \lambda_t \, dt = 0,
\]
it follows that the compensated Hawkes process is a martingale
\[
\mathbb{E}[\widetilde{N}_t \,|\, \mathcal{F}_s] = \mathbb{E}\Big[ \int_s^t \mathbb{E}\big( d\widetilde{N}_u \,|\, \mathcal{F}_u \big) \Big| \mathcal{F}_s \Big] + \widetilde{N}_s = \widetilde{N}_s, \qquad 0 \leq s \leq t < \infty.
\]
Thus, integrals with respect to the compensated Hawkes process are martingales as well.

Suppose the intensity λ_t of a Hawkes process approaches a stationary distribution as t → ∞. This distribution must have a mean, which we shall call λ̄. If we start the process from time −∞, which is equivalent to looking at the behavior of the process as t → ∞, then we have Eλ_t = λ̄ for all t > −∞ and thus, taking an expectation of (10.28), we find
\[
\begin{aligned}
\mathbb{E} \lambda_t &= \nu + \int_{-\infty}^{t-} h(t - s) \, \mathbb{E} \, d\widetilde{N}_s + \int_{-\infty}^{t-} h(t - s) \, \mathbb{E} \lambda_s \, ds \\
&= \nu + \int_{-\infty}^{t-} h(t - s) \, \mathbb{E} \lambda_s \, ds, \\
\bar{\lambda} &= \nu + \bar{\lambda} \int_0^{\infty} h(s) \, ds.
\end{aligned}
\]
Hence, we find that the mean λ̄ of the stationary distribution of the intensity is given by
\[
\bar{\lambda} = \nu \cdot \Big( 1 - \int_0^{\infty} h(s) \, ds \Big)^{-1}.
\]
Clearly, a stationary distribution only exists if h satisfies
\[
\int_0^{\infty} h(s) \, ds < 1. \tag{10.29}
\]
Thus, some texts require that h satisfy (10.29).

In general, a given Hawkes process N = (N_t)_{t≥0} and the associated intensity process λ = (λ_t)_{t≥0} are non-Markovian. However, when the kernel h(·) has an exponential form
\[
h(t) = a e^{-b t}, \qquad a, b > 0, \tag{10.30}
\]
then the intensity process λ by itself and the pair (λ, N) are Markov. To see this, observe from (10.27) and (10.30) that
\[
\lambda_t = \nu + \int_0^{t-} a e^{-b(t-s)} \, dN_s.
\]
Taking the differential of λ_t we find
\[
\begin{aligned}
d\lambda_t &= -b \Big( \int_0^{t-} a e^{-b(t-s)} \, dN_s \Big) dt + a \, dN_t \\
&= -b (\lambda_t - \nu) \, dt + a \, dN_t \\
&= -b (\lambda_t - \nu) \, dt + a \, d\widetilde{N}_t + a \lambda_t \, dt \\
&= \Big( (a - b) \lambda_t + b \nu \Big) dt + a \, d\widetilde{N}_t.
\end{aligned}
\]
Since the intensity of N_t is λ_t, we see that the future dynamics of λ depend only on the value of λ_t and not the entire history F_t. Thus, the process λ is Markov and its generator is
\[
\begin{aligned}
\mathcal{A} &= \Big( (a - b) \lambda + b \nu \Big) \partial_{\lambda} + \lambda \big( \theta_a - 1 - a \partial_{\lambda} \big) \\
&= b (\nu - \lambda) \partial_{\lambda} + \lambda \big( \theta_a - 1 \big),
\end{aligned}
\]
where, we remind the reader, θ_a is the shift operator: θ_a f(λ) = f(λ + a). Since λ alone is Markov, it follows that the pair (λ, N) is Markov and has generator
\[
\mathcal{A} = b (\nu - \lambda) \partial_{\lambda} + \lambda \big( \theta_{(a,1)} - 1 \big), \tag{10.31}
\]
where θ_{(a,1)} f(λ, n) = f(λ + a, n + 1).

We cannot, in general, find the probability mass function of N_T in closed form. However, we can find its Laplace transform Ee^{−ηN_T}. Let us define
\[
u(t, \lambda, n) = \mathbb{E}\big[ e^{-\eta N_T} \,\big|\, \lambda_t = \lambda, \, N_t = n \big].
\]
The function u satisfies the Kolmogorov backward equation
\[
0 = (\partial_t + \mathcal{A}) u, \qquad u(T, \lambda, n) = e^{-\eta n}, \tag{10.32}
\]
where A is given by (10.31). We seek a solution to (10.32) in exponential affine form
\[
u(t, \lambda, n) = e^{-\eta n + A(t) + B(t) \lambda}. \tag{10.33}
\]
Inserting (10.33) into (10.32), we obtain
\[
0 = u(t, \lambda, n) \Big( A'(t) + \lambda B'(t) + b (\nu - \lambda) B(t) + \lambda \big( e^{-\eta + a B(t)} - 1 \big) \Big), \qquad e^{-\eta n + A(T) + B(T) \lambda} = e^{-\eta n}.
\]
This must hold for all λ. Thus, we collect terms of like order in λ and obtain a pair of coupled ODEs for A and B:
\[
\begin{aligned}
O(\lambda^0) &: \quad 0 = A'(t) + b \nu B(t), \qquad A(T) = 0, \\
O(\lambda^1) &: \quad 0 = B'(t) - b B(t) + \big( e^{-\eta + a B(t)} - 1 \big), \qquad B(T) = 0.
\end{aligned}
\]
These ODEs must be solved numerically.
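As a sketch of the numerical solution (with hypothetical parameter values), one can integrate the coupled ODEs backward from the terminal conditions A(T) = B(T) = 0 using SciPy's `solve_ivp`:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative parameters: Hawkes with exponential kernel h(t) = a*exp(-b*t).
nu, a, b, eta, T = 1.0, 0.8, 2.0, 0.5, 1.0

def rhs(t, y):
    A, B = y
    dA = -b * nu * B                               # from 0 = A' + b*nu*B
    dB = b * B - (np.exp(-eta + a * B) - 1.0)      # from 0 = B' - b*B + (e^{-eta+aB}-1)
    return [dA, dB]

# Terminal conditions A(T) = B(T) = 0; integrate backward from T to 0.
sol = solve_ivp(rhs, [T, 0.0], [0.0, 0.0], rtol=1e-10, atol=1e-12)
A0, B0 = sol.y[0, -1], sol.y[1, -1]

lam0, n0 = nu, 0
print("E[exp(-eta*N_T)] for lambda_0 =", lam0, ":", np.exp(-eta * n0 + A0 + B0 * lam0))
```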

We finish this section by defining a d-dimensional Hawkes process.

Definition 10.5.2. A d-dimensional Hawkes process N = (N_t^{(1)}, N_t^{(2)}, . . . , N_t^{(d)})_{t≥0} is a simple counting process satisfying
\[
\mathbb{E}\big[ dN_t^{(i)} \,\big|\, \mathcal{F}_t \big] = \lambda_t^{(i)} \, dt, \qquad \lambda_t^{(i)} := \Lambda^{(i)}\Big( \sum_{j=1}^d \int_0^{t-} h^{(i,j)}(t - s) \, dN_s^{(j)} \Big), \tag{10.34}
\]
where h^{(i,j)}(·) : R₊ → R₊ and Λ^{(i)}(·) : R₊ → R₊ is locally integrable. The intensity λ_t^{(i)} is an F_t-measurable random variable, where F_t is the natural filtration of the d-dimensional process (N^{(i)})_{i=1}^d. When the function Λ^{(i)}(·) is linear for all i, the process N is known as a linear d-dimensional Hawkes process; otherwise, it is called a nonlinear d-dimensional Hawkes process.

As in the one-dimensional case, linear d-dimensional Hawkes processes with exponential kernels are Markov and admit many analytically tractable results.

10.6 Exercises
Exercise 10.1. Let P = (Pt )t ≥0 be a Poisson process with intensity λ.
(a) What is the Lévy measure ν of P?
(b) Let dXt = dPt . Define u(t , x ) := E[ϕ(XT )|Xt = x ]. Find u(t , x ) and verify that it solves the
Kolmogorov Backward equation.

Exercise 10.2. Let X = (X_t)_{t≥0} and Y = (Y_t)_{t≥0} be Lévy processes defined by
\[
X_t = b t + \sigma W_t + \int_{\mathbb{R}} z \, \widetilde{N}(t, dz), \qquad Y_t = \gamma t + \int_{\mathbb{R}_+} z \, M(t, dz),
\]
where b ∈ R, σ, γ ∈ R₊, Ñ is a compensated Poisson random measure with Lévy measure ν and M is a Poisson random measure with Lévy measure µ. Assume that N and M are independent and ν(R) < ∞ and µ(R₊) < ∞. As the process Y is strictly increasing, we can construct a process Z = (Z_t)_{t≥0} as follows
\[
Z_t := X_{Y_t}.
\]
(a) Show that Z is a Lévy process. For this problem, you may assume that Z is continuous in probability.
(b) Find ψ_Z where Ee^{iξZ_t} = e^{tψ_Z(ξ)}. You may leave your answer as a composition of functions if you find it helpful.
(c) Suppose ν ≡ 0 (i.e., the X process experiences no jumps). Find the drift A, volatility Σ and Lévy measure Π(dz) corresponding to Z. Write the Lévy-Itô decomposition of Z_T in terms of a Brownian motion B and a Poisson random measure P whose Lévy measure is Π.

Exercise 10.3. Let X = (X_t)_{t≥0} and Y = (Y_t)_{t≥0} be processes defined by
\[
\begin{aligned}
dX_t &= \mu_t X_t \, dt + \sigma_t X_t \, dW_t + \int_{\mathbb{R}} \big( e^{\gamma_t(z)} - 1 \big) X_{t-} \, \widetilde{N}(dt, dz), \\
dY_t &= b_t Y_t \, dt + a_t Y_t \, dW_t + \int_{\mathbb{R}} \big( e^{g_t(z)} - 1 \big) Y_{t-} \, \widetilde{N}(dt, dz),
\end{aligned}
\]
where W is a one-dimensional Brownian motion, Ñ is a one-dimensional compensated Poisson random measure on R, and µ, b, σ, a, γ, g are F-adapted stochastic processes.

(a) Define Z_t := X_t / Y_t. Compute the differential dZ_t. Your answer should not involve X_t or Y_t.
(b) Find µ_t so that Z is a martingale.

Exercise 10.4. Let η = (η_t)_{t≥0} be a one-dimensional Lévy process and define X = (X_t)_{t≥0} by
\[
dX_t = \kappa (\theta - X_t) \, dt + d\eta_t.
\]
(a) Find X_t explicitly as a functional of η.
(b) Assume η_t = σW_t + ∫_R z Ñ(t, dz). Compute m(t) := EX_t and c(t, s) := E(X_t − m(t))(X_s − m(s)).

Exercise 10.5. Let X be the following one-dimensional jump-diffusion
\[
dX_t = \mu(t, X_t) \, dt + \sigma(t, X_t) \, dW_t + \int_{\mathbb{R}} \gamma(t, X_{t-}, z) \, \widetilde{N}(dt, dz),
\]
where W is a one-dimensional Brownian motion and Ñ is a one-dimensional compensated Poisson random measure on R. Using the Lévy-Itô formula, derive the infinitesimal generator A(t) of the X process
\[
\mathcal{A}(t) \varphi(x) := \lim_{s \searrow t} \frac{\mathbb{E}[\varphi(X_s) \,|\, X_t = x] - \varphi(x)}{s - t}.
\]
Chapter 11

Stochastic Control

Notes from this chapter are taken primarily from (Björk, 2004, Chapter 19) and (van Handel, 2007,
Chapter 6).

11.1 Problem formulation


Fix a probability space (Ω, F, P) and a filtration F = (F_t)_{t≥0}, and consider the following controlled diffusion X^u = (X_t^u)_{t≥0}, whose dynamics are given by
\[
dX_t^u = \mu(t, X_t^u, u_t) \, dt + \sigma(t, X_t^u, u_t) \, dW_t, \qquad X_0^u = x \in \mathbb{R}^d, \tag{11.1}
\]
where W is an n-dimensional Brownian motion with respect to (P, F) and
\[
\mu : \mathbb{R}_+ \times \mathbb{R}^d \times U \to \mathbb{R}^d, \qquad \sigma : \mathbb{R}_+ \times \mathbb{R}^d \times U \to \mathbb{R}^{d \times n}.
\]
Here, X^u is called the state process and u = (u_t)_{t≥0} is the control. The control must live in some control set U. Frequently, we take the control set U = R^k, but this is not required. The superscript u on X^u indicates the dependence of the state process X^u on the control u (clearly, if you change u you change X^u).

Definition 11.1.1. We call a control u admissible if the following hold:

1. the control is F-adapted, that is, u_t ∈ F_t for all t ≥ 0,

2. the control remains in the control set: u_t(ω) ∈ U for all (t, ω),

3. the process X^u has a unique (strong) solution.

We denote by U the set of admissible controls.


Within the class of admissible controls is a subset of controls that we will find very useful:

Definition 11.1.2. We call an admissible control u a feedback or Markov control if it is of the form u_t = α(t, X_t) for some function α : R₊ × R^d → U.

Note Markov controls are a strict subset of admissible controls, since a Markov control at time t is only allowed to depend on (t, X_t) whereas, in general, an admissible control at time t may depend on the entire history F_t. Nevertheless, Markov controls are important because they are easy to implement, and (as we shall see) they can actually be computed! Moreover, it often turns out that the optimal Markov control coincides with the optimal admissible control.

If we consider a Markov control u_t = α(t, X_t) then X^u satisfies an SDE
\[
dX_t^u = \mu\big( t, X_t^u, \alpha(t, X_t^u) \big) dt + \sigma\big( t, X_t^u, \alpha(t, X_t^u) \big) dW_t, \qquad X_0^u = x \in \mathbb{R}^d.
\]
Since the solution of an SDE is a Markov process, a Markov control yields a Markov state process X^u.

Below, we introduce some cost or gain functionals J(·) that assign to each admissible control strategy u a cost or gain J(u). The goal of optimal control is to find an optimal strategy u* (obviously!), which minimizes or maximizes this cost functional. Three common types of cost functionals are:

1. Finite time. Here, the cost functional is given by
\[
J(u) = \mathbb{E}\Big[ \int_0^T F(t, X_t^u, u_t) \, dt + \varphi(X_T^u) \Big], \tag{11.2}
\]
where F : [0, T] × R^d × U → R and ϕ : R^d → R.

2. Indefinite time. Here, the cost functional is given by
\[
J(u) = \mathbb{E}\Big[ \int_0^{\tau^u} F(X_t^u, u_t) \, dt + \varphi(X_{\tau^u}^u) \Big], \qquad \tau^u = \inf\{ t \geq 0 : X_t^u \notin A \},
\]
where A ⊂ R^d is some open set, F : A × U → R and ϕ : ∂A → R.

3. Infinite time. In this case, the cost functional is given by
\[
J(u) = \mathbb{E} \int_0^{\infty} e^{-\lambda t} F(X_t^u, u_t) \, dt,
\]
where F : R^d × U → R and λ > 0.

Note, for the indefinite time and infinite time functionals, the dynamics of X^u should be time-homogeneous (i.e., the coefficient functions µ and σ should not depend on t).

Of course, one might construct other cost functionals, but the above three are by far the most common. As stated above, the optimal control (if it exists) is the strategy u* that minimizes or maximizes a given cost or gain functional J(u). Thus, we make the following definition:

Definition 11.1.3. For a given cost functional J : U → R, we define the optimal control u*, if it exists, by
\[
\begin{aligned}
u^* &:= \operatorname*{argmax}_{u \in \mathcal{U}} J(u), \qquad \text{if our goal is to maximize } J(u), \\
u^* &:= \operatorname*{argmin}_{u \in \mathcal{U}} J(u), \qquad \text{if our goal is to minimize } J(u).
\end{aligned}
\]

11.2 The Dynamic Programming Principle and the HJB PDE


In this section we focus on the finite time horizon control problem, and we shall attempt to find the control u* that maximizes J(u) in (11.2). The minimization procedure is completely analogous. We shall restrict our controls to Markov controls: u_t = α(t, X_t^u). We shall make the following assumption.

Assumption 11.2.1. We assume that there exists an admissible Markov strategy u*, which is optimal.

Clearly, this assumption is not always justified. Nevertheless, we will go with it for the time being. We define the reward-to-go function J^u (or cost-to-go in the case of minimization) as
\[
J^u(t, X_t^u) := \mathbb{E}\Big[ \int_t^T F(s, X_s^u, u_s) \, ds + \varphi(X_T^u) \,\Big|\, \mathcal{F}_t \Big] = \mathbb{E}\Big[ \int_t^T F(s, X_s^u, u_s) \, ds + \varphi(X_T^u) \,\Big|\, X_t^u \Big],
\]
where we have used the Markov property of X^u to replace F_t with the σ-algebra generated by X_t^u. We also define the value function V as
\[
V(t, x) := J^{u^*}(t, x) = \max_{u \in \mathcal{U}} J^u(t, x). \tag{11.3}
\]
The idea of the Dynamic Programming Principle (DPP) is to split the optimization problem into two intervals [t, t + δ) and [t + δ, T], where δ is small and positive. Note that, for any δ ≥ 0 we have
\[
\begin{aligned}
J^u(t, x) &= \mathbb{E}\Big[ \int_t^{t+\delta} F(s, X_s^u, u_s) \, ds + \mathbb{E}\Big[ \int_{t+\delta}^T F(s, X_s^u, u_s) \, ds + \varphi(X_T^u) \,\Big|\, \mathcal{F}_{t+\delta} \Big] \,\Big|\, X_t^u = x \Big] \\
&= \mathbb{E}\Big[ \int_t^{t+\delta} F(s, X_s^u, u_s) \, ds + J^u(t + \delta, X_{t+\delta}^u) \,\Big|\, X_t^u = x \Big].
\end{aligned}
\]
Now, fix an arbitrary control û and set
\[
u_s =
\begin{cases}
\widehat{u}_s & s \in [t, t + \delta), \\
u_s^* & s \in [t + \delta, T].
\end{cases}
\]
In words, the control u may be sub-optimal over the interval [t, t + δ) and it is optimal over the interval [t + δ, T]. Clearly, we have
\[
V(t, x) \geq \mathbb{E}\Big[ \int_t^{t+\delta} F(s, X_s^u, u_s) \, ds + V(t + \delta, X_{t+\delta}^u) \,\Big|\, X_t^u = x \Big]. \tag{11.4}
\]

The inequality arises from the fact that the strategy u is not necessarily optimal over the interval [t, t + δ). If we had û = u*, then we would have obtained an equality since, in this case, we have u = u*. Now observe that
\[
V(t + \delta, X_{t+\delta}^u) = V(t, X_t^u) + \int_t^{t+\delta} (\partial_s + \mathcal{A}^u(s)) V(s, X_s^u) \, ds + \text{martingale part},
\]
where A^u(s) is the infinitesimal generator of X^u. Taking an expectation, we have
\[
\mathbb{E}\big[ V(t + \delta, X_{t+\delta}^u) \,\big|\, X_t^u = x \big] = V(t, x) + \mathbb{E}\Big[ \int_t^{t+\delta} (\partial_s + \mathcal{A}^u(s)) V(s, X_s^u) \, ds \,\Big|\, X_t^u = x \Big]. \tag{11.5}
\]
Inserting (11.5) into (11.4), we obtain
\[
0 \geq \mathbb{E}\Big[ \int_t^{t+\delta} \Big( F(s, X_s^u, u_s) + (\partial_s + \mathcal{A}^u(s)) V(s, X_s^u) \Big) ds \,\Big|\, X_t^u = x \Big].
\]
Finally, we divide by δ and take a limit as δ → 0. Assuming there is no problem with passing the limit through the expectation, we have
\[
\begin{aligned}
0 &\geq \lim_{\delta \to 0} \mathbb{E}\Big[ \frac{1}{\delta} \int_t^{t+\delta} \Big( F(s, X_s^u, u_s) + (\partial_s + \mathcal{A}^u(s)) V(s, X_s^u) \Big) ds \,\Big|\, X_t^u = x \Big] \\
&= \mathbb{E}\Big[ \lim_{\delta \to 0} \frac{1}{\delta} \int_t^{t+\delta} \Big( F(s, X_s^u, u_s) + (\partial_s + \mathcal{A}^u(s)) V(s, X_s^u) \Big) ds \,\Big|\, X_t^u = x \Big] \\
&= F(t, x, u) + (\partial_t + \mathcal{A}^u(t)) V(t, x).
\end{aligned}
\]
Once again, if u is optimal, we obtain an equality. Thus, we have arrived at the Hamilton-Jacobi-Bellman (HJB) PDE
\[
0 = \partial_t V + \max_{u \in U} \big( F^u(t) + \mathcal{A}^u(t) V \big), \qquad V(T, x) = \varphi(x), \tag{11.6}
\]
where we have defined
\[
F^u(t) = F(t, x, u), \qquad \mathcal{A}^u(t) = \sum_{i=1}^d \mu_i(t, x, u) \partial_{x_i} + \tfrac{1}{2} \sum_{i=1}^d \sum_{j=1}^d \big( \sigma \sigma^T \big)_{i,j}(t, x, u) \partial_{x_i} \partial_{x_j}.
\]
We summarize this result with the following theorem.

Theorem 11.2.2. If (i) an optimal control u* exists and is Markov, (ii) the value function V, defined in (11.3), satisfies V ∈ C^{1,2} and (iii) the limiting procedures performed above are allowed, then V satisfies the HJB PDE (11.6) and the optimal control u* is given by
\[
u_t^* = \alpha(t, X_t^u), \qquad \text{where} \quad \alpha(t, x) := \operatorname*{argmax}_{u \in U} \big( F(t, x, u) + \mathcal{A}^u V(t, x) \big). \tag{11.7}
\]

Theorem 11.2.2 is so incredibly unsatisfactory! It only tells us that, if an optimal control exists and is Markov, then (modulo some technical conditions) the value function V satisfies the HJB PDE and the optimal control is given by (11.7). But, this is not what we want. What we would like is to solve the HJB PDE (11.6) and conclude that the solution actually is the value function and that u*, given by (11.7), is the optimal control. The following theorem gives us such a result.

Theorem 11.2.3 (Verification theorem). Suppose H : [0, T] × R^d → R solves the HJB PDE (11.6). Define
\[
g(t, x) = \operatorname*{argmax}_{u \in U} \big( F(t, x, u) + \mathcal{A}^u H(t, x) \big).
\]
Suppose that M_t := H(t, X_t^u) − ∫₀ᵗ (∂_s + A^u(s)) H(s, X_s^u) ds is a true martingale (not just a local martingale), where u_t = g(t, X_t^u). Then the value function V and optimal control u* are given by
\[
V(t, x) = H(t, x), \qquad u_t^* = u_t.
\]
The verification theorem tells us that, if we solve the HJB PDE, and the solution satisfies some regularity and integrability conditions, then the solution is the value function and the Markov control is the optimal control. We will not prove the verification theorem.

11.3 Solving the HJB PDE


Solving the HJB PDE essentially involves two steps:
\[
\begin{aligned}
\text{Step 1 (easy part)} &: \quad \text{find } u^*(t, x) = \operatorname*{argmax}_{u \in U} \big( F(t, x, u) + \mathcal{A}^u V(t, x) \big), \\
\text{Step 2 (hard part)} &: \quad \text{solve } 0 = (\partial_t + \mathcal{A}^{u^*}(t)) V + F^{u^*}, \qquad V(T, \cdot) = \varphi.
\end{aligned}
\]
We illustrate this process with a simple, explicitly-solvable example.

Example 11.3.1 (Merton problem). The Merton problem, due to Nobel Prize winner Robert Merton, is a classical problem in mathematical finance whereby an investor seeks to optimize his expected utility at a fixed future date T by investing in a stock. Specifically, suppose a stock S = (S_t)_{0≤t≤T} follows a geometric Brownian motion
\[
dS_t = \mu S_t \, dt + \sigma S_t \, dW_t.
\]
Let X^u = (X_t^u)_{0≤t≤T} be the wealth process of an investor who invests u_t dollars in S at time t and keeps the rest of his money in a bank account (which we assume pays no interest). The dynamics of X^u are given by
\[
dX_t^u = \frac{u_t}{S_t} \, dS_t = \mu u_t \, dt + \sigma u_t \, dW_t.
\]
The investor's control problem is to maximize
\[
J(u) = \mathbb{E}\big[ U(X_T^u) \big], \qquad U(x) = \frac{1}{1 - \gamma} x^{1 - \gamma}, \qquad \gamma \in (0, 1) \cup (1, \infty). \tag{11.8}
\]
The function U is called the investor's utility function; it is intended to map the investor's wealth to his happiness. Since "more money" = "more happy," the function U is strictly increasing. It is also concave since an additional dollar means less to somebody with a wealth of 1 million dollars than it does to somebody with a wealth of 10 dollars. For the investor's control problem (11.8), we identify
\[
F^u = 0, \qquad \mathcal{A}^u = u \mu \partial_x + \tfrac{1}{2} u^2 \sigma^2 \partial_x^2.
\]
In order to maximize (F^u + A^u V) we simply differentiate this quantity, set it equal to zero, and solve for u. We have
\[
0 = \partial_u \big( F^u + \mathcal{A}^u V \big) \big|_{u^*} = \mu \partial_x V + u^* \sigma^2 \partial_x^2 V \qquad \Rightarrow \qquad u^* = \frac{-\mu \partial_x V}{\sigma^2 \partial_x^2 V}.
\]
Next, we try to solve
\[
\begin{aligned}
0 &= (\partial_t + \mathcal{A}^{u^*}) V + F^{u^*} \\
&= \partial_t V + \Big( \frac{-\mu \partial_x V}{\sigma^2 \partial_x^2 V} \Big) \mu \partial_x V + \tfrac{1}{2} \Big( \frac{-\mu \partial_x V}{\sigma^2 \partial_x^2 V} \Big)^2 \sigma^2 \partial_x^2 V \\
&= \partial_t V - \tfrac{1}{2} \lambda^2 \frac{(\partial_x V)^2}{\partial_x^2 V}, \qquad \lambda = \frac{\mu}{\sigma}, \qquad V(T, x) = U(x), \tag{11.9}
\end{aligned}
\]
where we have introduced the Sharpe ratio λ. Of course, if you have never seen this horrible-looking nonlinear PDE before, you would never know how to solve it. However, if you have spent some time around optimal investment problems, you would know that the correct thing to do at this point is to guess that V is of the form
\[
V(t, x) = f(t) U(x), \qquad f(T) = 1. \tag{11.10}
\]
Such a guess clearly satisfies the terminal condition V(T, x) = U(x). If we insert the guess (11.10) into (11.9) we find
\[
0 = \Big( f' + \tfrac{1}{2} \lambda^2 \tfrac{1 - \gamma}{\gamma} f \Big) U,
\]
which, together with the terminal condition f(T) = 1, implies that
\[
f(t) = \exp\Big( \tfrac{1}{2} \lambda^2 \tfrac{1 - \gamma}{\gamma} (T - t) \Big).
\]
Thus, we have obtained the value function V(t, x) = U(x)f(t). The optimal Markov control is given by
\[
u^*(t, x) = -\frac{\mu}{\sigma^2} \frac{\partial_x V(t, x)}{\partial_x^2 V(t, x)} = \frac{\mu x}{\gamma \sigma^2}.
\]
Thus, the total proportion of wealth the investor should keep in the stock is constant: u_t^* / X_t^{u^*} = µ/(γσ²).
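A quick numerical sanity check of the Merton ratio (illustrative parameters): for a constant-proportion strategy π (fraction of wealth held in the stock), the wealth process is a geometric Brownian motion and E[U(X_T)] is available in closed form, so a grid search over π should recover π* = µ/(γσ²):

```python
import numpy as np

# For constant-proportion pi, X_T = x0 * exp((pi*mu - pi^2*sigma^2/2)*T + pi*sigma*W_T)
# and a direct computation gives
#   E[U(X_T)] = U(x0) * exp((1-gamma) * (pi*mu - gamma*pi^2*sigma^2/2) * T).
mu, sigma, gamma, T, x0 = 0.08, 0.2, 2.0, 1.0, 1.0

pis = np.linspace(0.0, 3.0, 3001)
U0 = x0**(1 - gamma) / (1 - gamma)
expected_utility = U0 * np.exp((1 - gamma) * (pis * mu - 0.5 * gamma * pis**2 * sigma**2) * T)

print("grid-search pi*:", pis[np.argmax(expected_utility)])
print("Merton      pi*:", mu / (gamma * sigma**2))
```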

Example 11.3.2 (Linear quadratic regulator). Suppose the dynamics of a controlled process X^u = (X_t^u)_{0≤t≤T} are given by
\[
dX_t^u = (a X_t^u + b u_t) \, dt + c \, dW_t,
\]
and consider the following cost functional, which we wish to minimize
\[
J(u) = \mathbb{E}\Big[ \int_0^T \big( q (X_t^u)^2 + r u_t^2 \big) dt + h (X_T^u)^2 \Big]. \tag{11.11}
\]
One could imagine, for example, that X^u represents the position of a particle that we are attempting to keep near the origin. We can adjust the drift of the particle through the control u, but this has a cost, which is quadratic in u. Likewise, there is a quadratic cost for allowing X^u to be away from the origin. This control problem is known as the Linear Quadratic Regulator (LQR) because the dynamics of X are linear in the state process X^u and control process u, and the costs are quadratic in X^u and u. The HJB equation associated with cost functional (11.11) is
\[
0 = V_t + \inf_{u \in \mathbb{R}} \big( \mathcal{A}^u V + F^u \big), \qquad V(T, x) = h x^2,
\]
where the operator A^u and the function F^u are given by
\[
\mathcal{A}^u = (a x + b u) \partial_x + \tfrac{1}{2} c^2 \partial_x^2, \qquad F^u = q x^2 + r u^2.
\]
Step one in solving the HJB PDE is finding the optimal control u* in feedback form. To this end, we have
\[
0 = \partial_u \big( \mathcal{A}^u V + F^u \big) \Big|_{u = u^*} = b V_x + 2 r u^* \qquad \Rightarrow \qquad u^* = \frac{-b V_x}{2r}.
\]
The HJB PDE thus becomes
\[
\begin{aligned}
0 &= V_t + \mathcal{A}^{u^*} V + F^{u^*} \\
&= V_t + \Big( a x + b \, \frac{-b V_x}{2r} \Big) V_x + \tfrac{1}{2} c^2 V_{xx} + q x^2 + r \Big( \frac{-b V_x}{2r} \Big)^2 \\
&= V_t + a x V_x + \tfrac{1}{2} c^2 V_{xx} + q x^2 - \frac{b^2 V_x^2}{4 r}. \tag{11.12}
\end{aligned}
\]
As with all non-linear PDEs with explicit solutions, the method to solve (11.12) is to guess. In this case, the correct guess is
\[
V(t, x) = P(t) x^2 + Q(t), \qquad P(T) = h, \qquad Q(T) = 0,
\]
where the terminal conditions for P and Q will ensure that V(T, x) = hx². We have
\[
V_t = P' x^2 + Q', \qquad V_x = 2 x P, \qquad V_{xx} = 2 P. \tag{11.13}
\]
Inserting (11.13) into (11.12) we obtain
\[
\begin{aligned}
0 &= (P' x^2 + Q') + a x (2 x P) + \tfrac{1}{2} c^2 (2 P) + q x^2 - \frac{b^2}{4 r} (2 x P)^2 \\
&= x^2 \Big( P' + 2 a P + q - (b^2 / r) P^2 \Big) + \Big( Q' + c^2 P \Big).
\end{aligned}
\]
Collecting terms of like order in x we obtain
\[
\begin{aligned}
O(x^0) &: \quad 0 = Q' + c^2 P, \qquad Q(T) = 0, \\
O(x^2) &: \quad 0 = P' + 2 a P + q - (b^2 / r) P^2, \qquad P(T) = h.
\end{aligned}
\]
Thus, we have obtained a coupled system of ODEs for (P, Q). The O(x²) equation is a Riccati equation, which can be solved analytically (though the solution is a tad messy and not worth writing down here). Once one has obtained an expression for the function P, one can obtain Q as an integral
\[
Q(t) = \int_t^T c^2 P(s) \, ds.
\]
Finally, the optimal control is given by
\[
u_t^* = u^*(t, X_t^{u^*}) = \frac{-b}{2r} V_x(t, X_t^{u^*}) = \frac{-b}{r} P(t) X_t^{u^*}.
\]
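A numerical sketch (hypothetical parameter values) that integrates the (P, Q) system backward from the terminal conditions using SciPy:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative LQR parameters.
a, b, c, q, r, h, T = 0.5, 1.0, 0.3, 1.0, 0.5, 2.0, 1.0

def rhs(t, y):
    P, Q = y
    dP = -(2 * a * P + q - (b**2 / r) * P**2)   # from 0 = P' + 2aP + q - (b^2/r)P^2
    dQ = -c**2 * P                              # from 0 = Q' + c^2 P
    return [dP, dQ]

# Terminal conditions P(T) = h, Q(T) = 0; integrate backward to t = 0.
sol = solve_ivp(rhs, [T, 0.0], [h, 0.0], dense_output=True, rtol=1e-10, atol=1e-12)

t0, x = 0.0, 0.7
P0, Q0 = sol.sol(t0)
print("V(0, x)  =", P0 * x**2 + Q0)            # value function
print("u*(0, x) =", -(b / r) * P0 * x)         # optimal feedback control
```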
2r r

11.4 HJB equations associated with other cost functionals


In this section we suppose the dynamics of a controlled diffusion X^u are time-homogeneous
\[
dX_t^u = \mu(X_t^u, u_t) \, dt + \sigma(X_t^u, u_t) \, dW_t. \tag{11.14}
\]
In what follows, we state without proof the HJB PDEs associated with the indefinite time and infinite time cost functionals J(u), which were introduced in Section 11.1.

1. Consider the indefinite time cost functional
\[
J(u) = \mathbb{E}\Big[ \int_0^{\tau^u} F(X_t^u, u_t) \, dt + \varphi(X_{\tau^u}^u) \Big], \qquad \tau^u = \inf\{ t \geq 0 : X_t^u \notin A \},
\]
where A ⊂ R^d is some open set, F : A × U → R and ϕ : ∂A → R. The associated HJB PDE is
\[
\begin{aligned}
0 &= \sup_{u \in U} \big( \mathcal{A}^u V + F^u \big), \qquad x \in A, \\
V(x) &= \varphi(x), \qquad x \in \partial A,
\end{aligned}
\]
where the operator A^u is the generator of the process X^u defined in (11.14) with u_t = u fixed. In the one-dimensional case, we have
\[
\mathcal{A}^u = \mu(x, u) \partial_x + \tfrac{1}{2} \sigma^2(x, u) \partial_x^2.
\]

2. Consider the infinite time cost functional
\[
J(u) = \mathbb{E} \int_0^{\infty} e^{-\lambda t} F(X_t^u, u_t) \, dt,
\]
where F : R^d × U → R and λ > 0. The associated HJB PDE is
\[
0 = \sup_{u \in U} \big( \mathcal{A}^u V + F^u \big) - \lambda V.
\]

We leave the formal proof of the above HJB equations as an exercise for the reader.

11.5 Exercises
Exercise 11.1. Derive the HJB equations in Section 11.4.

Exercise 11.2. Optimal investment with consumption.


Chapter 12

Optimal Stopping

Notes from this chapter closely follow (van Handel, 2007, Chapter 9).

12.1 Strong Markov processes

We have previously defined the notion of a Markov process. For optimal stopping, we will need to
consider a subset of these processes called strong Markov processes.

Definition 12.1.1. Let (Ω, F, P) be a probability space, let T be a fixed positive number, and let F = (F_t)_{t≥0} be a filtration of sub-σ-algebras of F. Consider an F-adapted stochastic process X = (X_t)_{t≥0}. We say X is a strong Markov process, or simply "X is strong Markov," if, for any F-stopping time τ and t ≥ 0, we have
\[
\mathbb{E}[f(X_{\tau + t}) \,|\, \mathcal{F}_{\tau}] = \mathbb{E}[f(X_{\tau + t}) \,|\, X_{\tau}].
\]

Proposition 12.1.2. Suppose that on a probability space (Ω, F, P) equipped with a filtration F =
(Ft )t ≥0 the process X is the unique strong solution of the time-homogeneous SDE

dXt = µ(Xt )dt + σ(Xt )dWt , X0 ∈ Rd ,

where W = (Wt )t ≥0 is a Brownian motion in Rn and µ : Rd → Rd and σ : Rd → Rd×n satisfy


(9.37)-(9.38). Then X is a strong Markov process.


12.2 Optimal Stopping Problem Formulation


On a probability space (Ω, F, P) equipped with a filtration F = (Ft )t ≥0 we consider a time-homogeneous
diffusion X = (Xt )t ≥0 , whose dynamics are given by

dXt = µ(Xt )dt + σ(Xt )dWt , X0 ∈ Rd . (12.1)

As usual, W = (Wt )t ≥0 is an m-dimensional Brownian motion. Accordingly, we have

µ : Rd → Rd , σ : Rd → Rd×m .

Although we have specified a time-homogeneous diffusion X, our framework is general enough to incorporate time-inhomogeneous diffusions by considering the augmented R^{1+d}-valued process Y = (Y_t)_{t≥0} given by Y_t = (t, X_t).

For an F-stopping time τ, we introduce the cost functional (again, some authors prefer the phrase gain functional if the goal is to maximize)

J(τ) = E[ ∫_0^τ e^{−λt} F(X_t) dt + e^{−λτ} ϕ(X_τ) ],    (12.2)
where F, ϕ : Rd → R and λ ≥ 0. Expression (12.2) is not the only possible choice for a cost functional.
However, (12.2) is general enough to encompass many optimal stopping problems of practical interest.
Definition 12.2.1. We call an F-stopping time τ admissible if (12.2) is well-defined.
We will denote by T the set of admissible stopping times. The goal of optimal stopping is to find the
admissible stopping time τ ∗ that maximizes or minimizes J(τ ). Thus, we make the following definition:
Definition 12.2.2. For a given cost functional J : T → R (not necessarily (12.2)), we define the optimal
stopping time τ ∗ , if it exists, by

τ* := argmax_{τ∈T} J(τ),    if our goal is to maximize J(τ),
τ* := argmin_{τ∈T} J(τ),    if our goal is to minimize J(τ).
In principle, an optimal stopping time τ* may be an arbitrarily complicated functional of the path of X. However, there is a class of simple stopping times that play the same role in optimal stopping as Markov controls played in optimal control.
Definition 12.2.3. We call an F-stopping time τ a first exit time if τ can be expressed as

τ = inf{t ≥ 0 : X_t ∉ D},    (12.3)

where D ⊂ Rd is an open set.


The set D in (12.3) is called the continuation region for the stopping time τ. Conveniently, optimal
stopping rules of the form (12.3) are Markov.

12.3 Variational inequality


In this section we will attempt to find the optimal stopping time τ* that maximizes the cost functional J(τ) given in (12.2) (the minimization problem is completely analogous). We will restrict our attention to
first exit time strategies. To this end, we make the following assumption:

Assumption 12.3.1. We assume there exists an optimal stopping time τ ∗ which is a first exit time. We
denote by D the continuation region of τ ∗ . We also assume that X is a strong Markov process.

A priori, there is no reason that the optimal stopping time τ ∗ must be an exit time. But, we will go with
it for the time being. To begin, we define the reward-to-go functional (or, cost-to-go, if the goal is to
minimize)
J_τ(x) := E[ ∫_0^τ e^{−λt} F(X_t) dt + e^{−λτ} ϕ(X_τ) | X_0 = x ].
Observe that J_τ(x) is the cost of the stopping rule τ when X_0 = x. We define the value function as the optimal cost-to-go

V(x) := J_{τ*}(x) = max_{τ∈T} J_τ(x).
We will attempt to find an equation for V(x ). To this end, let τ be any admissible stopping time and
define

τ' := inf{t ≥ τ : X_t ∉ D},

where D is the continuation region of τ*. Observe that τ' ≥ τ by construction, and that τ' agrees with the optimal rule τ* after time τ.
Thus, we have

V(X_0) = J_{τ*}(X_0)
       ≥ J_{τ'}(X_0)
       = E[ ∫_0^{τ'} e^{−λt} F(X_t) dt + e^{−λτ'} ϕ(X_{τ'}) | X_0 ]
       = E[ ∫_0^{τ} e^{−λt} F(X_t) dt + E( ∫_τ^{τ'} e^{−λt} F(X_t) dt + e^{−λτ'} ϕ(X_{τ'}) | X_τ ) | X_0 ]
       = E[ ∫_0^{τ} e^{−λt} F(X_t) dt + e^{−λτ} E( ∫_τ^{τ'} e^{−λ(t−τ)} F(X_t) dt + e^{−λ(τ'−τ)} ϕ(X_{τ'}) | X_τ ) | X_0 ]
       = E[ ∫_0^{τ} e^{−λt} F(X_t) dt + e^{−λτ} V(X_τ) | X_0 ].    (12.4)
Here, we are using the fact that X is a strong Markov process. Now, supposing that V is sufficiently
smooth to apply Itô’s Lemma, we have
e^{−λτ} V(X_τ) = V(X_0) + ∫_0^τ e^{−λt} (A − λ)V(X_t) dt + local martingale,    (12.5)

where A is the infinitesimal generator of the X process. If the local martingale above is a true martingale,
then inserting (12.5) into (12.4) we obtain
0 ≥ E[ ∫_0^τ e^{−λt} ( F(X_t) + (A − λ)V(X_t) ) dt | X_0 ]    (with equality if τ' = τ*).    (12.6)

We will consider the following special case:

Case: τ ≤ τ*. In this case, we have by construction that τ' = τ*, and inequality (12.6) becomes an
equality
0 = E[ ∫_0^τ e^{−λt} ( F(X_t) + (A − λ)V(X_t) ) dt | X_0 ].

As the above must hold for any τ ≤ τ ∗ we must have

0 = F(X_0) + (A − λ)V(X_0),    provided that τ* > 0.

If τ ∗ = 0 then we clearly have V(x ) = ϕ(x ). Noting that τ ∗ > 0 if and only if X0 = x ∈ D, we obtain

0 = F(x) + (A − λ)V(x),  x ∈ D;        V(x) = ϕ(x),  x ∉ D.    (12.7)

Note that, at this point, we do not know D. Without knowing D we cannot solve for V using (12.7).
Now, we consider the general case.

Case: τ is arbitrary. In this case, it is possible that τ' ≠ τ*, and the inequality (12.6) must hold for any τ. Hence, we must have

0 ≥ F(X_0) + (A − λ)V(X_0),    provided that τ > 0.

In the case of τ = 0 we have J(τ ∗ ) ≥ J(0) and hence V(x ) ≥ ϕ(x ). Putting everything together we obtain

0 = max{F(x ) + (A – λ)V(x ), ϕ(x ) – V(x )}. (12.8)

The PDE (12.8) is called a variational inequality. Note that if V is the solution to (12.8) we can
reconstruct the continuation region D as follows

D = {x ∈ Rd : V(x ) > ϕ(x )}. (12.9)

Note that the formal derivation of the variational inequality (12.8) assumes that the optimal stopping
rule τ ∗ is a first exit time. What we would like is to simply find V(x ) by solving (12.8) and then conclude
that the optimal stopping rule τ ∗ is a first exit time with continuation region D given by (12.9). This is
the subject of the following theorem.

Theorem 12.3.2 (Verification). Fix a probability space (Ω, F, P) and a filtration F = (F_t)_{t≥0}, and let X = (X_t)_{t≥0} be the d-dimensional time-homogeneous diffusion given by (12.1). Let J(τ) be the cost functional defined in (12.2). Suppose there is a function V : R^d → R that is sufficiently smooth to apply Itô's Lemma and satisfies the variational inequality (12.8). Let D be given by (12.9). Denote by T the class of admissible stopping times (see Definition 12.2.1) that satisfy

E[ Σ_{i=1}^d Σ_{j=1}^m ∫_0^τ e^{−λs} ∂_{x_i}V(X_s) σ^{(ij)}(X_s) dW^{(j)}_s ] = 0.    (12.10)

Define τ* := inf{t ≥ 0 : X_t ∉ D} and suppose τ* ∈ T. Then τ* is optimal: J(τ*) ≥ J(τ) for all τ ∈ T. Moreover, the optimal expected cost-to-go satisfies J_{τ*}(x) = V(x).

We will not prove Theorem 12.3.2. Rather, we will simply comment on the theorem's usefulness. What Theorem 12.3.2 tells us is this: if we find a function V that satisfies the variational inequality (12.8), define τ* = inf{t ≥ 0 : X_t ∉ D} with D given by (12.9), and verify that V is sufficiently smooth and satisfies (12.10) with τ = τ*, then we can conclude that τ* is the optimal stopping rule.

Remark 12.3.3 (Principle of smooth fit). There will often be more than one solution to the
variational inequality (12.8). However, only one of these solutions will be sufficiently smooth to apply
Itô’s Lemma, namely, the solution for which both V and its gradient ∇V are continuous on the boundary
∂D. Thus, when looking for solutions of the variational inequality, we often impose that the solution V
of (12.8) and its gradient ∇V be continuous on the boundary ∂D. We call this requirement the principle
of smooth fit.

12.4 Example: Optimal stopping of resource extraction


Consider a company that extracts a resource (e.g., metal, oil, gas) from the earth. Denote by R = (R_t)_{t≥0} the total amount of resource left to be extracted. We suppose the dynamics of R are given by

dRt = –γRt dt .

Suppose the market price P = (Pt )t ≥0 of the resource follows a geometric Brownian motion

dP_t = bP_t dt + aP_t dW_t,

where W = (Wt )t ≥0 is a Brownian motion. The company incurs a fixed cost c per unit time to extract
R from the earth. Thus, if the company operates from time t = 0 to time t = τ it generates a profit of
Profit = ∫_0^τ ( γR_t P_t − c ) dt.

Let us define X = (Xt )t ≥0 by Xt = Rt Pt . One easily deduces that the dynamics of X are given by

dXt = (b – γ)Xt dt + aXt dWt .

Suppose the company wishes to choose a stopping rule τ that maximizes its expected profit. The cost
functional J(τ ) is given by
J(τ) = E[ ∫_0^τ ( γX_t − c ) dt ].    (12.11)

Comparing (12.11) with (12.2) we identify

F(x ) = γx – c, ϕ(x ) = 0, λ = 0.

The infinitesimal generator A of the process X is given by

A = (b − γ) x ∂_x + ½a²x² ∂²_x.

From (12.8), we see that our candidate value function V is a solution to

0 = max{ γx − c + (b − γ)xV'(x) + ½a²x²V''(x),  −V(x) }.

Thus, we must have

0 = γx − c + (b − γ)xV'(x) + ½a²x²V''(x),    0 > −V(x),    x ∈ D,      (12.12)
0 ≥ γx − c + (b − γ)xV'(x) + ½a²x²V''(x),    0 = −V(x),    x ∈ D^c.    (12.13)

Let us focus first on (12.13). We have

x ∈ Dc , ⇒ V(x ) = 0,
⇒ γx – c + (b – γ)x V 0 (x ) + 12 a 2 x 2 V 00 (x ) = γx – c ≤ 0,
⇒ x ≤ c/γ,
⇒ D ⊆ (c/γ, ∞). (12.14)

Now, let us try to solve for V on D. We guess a solution of the form


V(x) = c log x / (b − γ − a²/2) − γx/(b − γ) − d =: V_d(x),    x ∈ D,

where d ∈ R is a constant to be determined. One can easily verify by direct substitution that

0 = γx − c + (b − γ)xV_d'(x) + ½a²x²V_d''(x),    ∀ x > 0.



From (12.12) we must have

x ∈ D  ⇒  V_d(x) > 0  ⇒  b < γ.

Intuitively, it makes sense that we must have b < γ. If we had b ≥ γ then the growth of X would be non-negative, and it would make sense to extract resources forever (i.e., τ* = ∞). We will assume that b < γ, which will lead to an optimal stopping rule τ* that is finite.

The continuation region D must be of the form

D = (x*, ∞),    x* ≤ c/γ,

where the restriction x* ≤ c/γ comes from (12.14). In order to find x* and d we shall use the principle of smooth fit (see Remark 12.3.3). Namely, we shall require that V and V' be continuous at the boundary of D. We have

lim_{ε→0+} V(x* − ε) = lim_{ε→0+} V(x* + ε),        lim_{ε→0+} V'(x* − ε) = lim_{ε→0+} V'(x* + ε),
⇒    0 = V_d(x*),    0 = V_d'(x*).    (12.15)

We can use the two equations in (12.15) to solve for d and x ∗ . We find

x* = (c/γ) · (b − γ)/(b − γ − a²/2),        d = c log x* / (b − γ − a²/2) − γx*/(b − γ) =: d*.

Thus, we have obtained a formal solution for the value function V and the optimal stopping rule τ*:

V(x) = V_{d*}(x) for x > x*,    V(x) = 0 for x ≤ x*,        τ* = inf{t ≥ 0 : X_t ≤ x*}.

We leave it as an exercise for the reader to check that the value function V satisfies the conditions of
Theorem 12.3.2 (and therefore, V is the value function and τ ∗ is the optimal stopping rule).
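As a sanity check, the smooth-fit construction is easy to verify numerically. The following Python sketch (the parameter values for b, γ, a, c are illustrative, not from the text; we only require b < γ) computes x* and d* and checks that V_{d*} and its derivative vanish at x*.

```python
# Numerical check of the smooth-fit solution for the resource-extraction example.
import numpy as np

b, gamma, a, c = 0.02, 0.10, 0.30, 1.0
kappa = b - gamma - a**2 / 2          # common denominator b - gamma - a^2/2, negative here

x_star = (c / gamma) * (b - gamma) / kappa
d_star = c * np.log(x_star) / kappa - gamma * x_star / (b - gamma)

def V(x):
    return c * np.log(x) / kappa - gamma * x / (b - gamma) - d_star

def dV(x):
    return c / (kappa * x) - gamma / (b - gamma)

print(x_star <= c / gamma)            # True: x* <= c/gamma, as required by (12.14)
print(np.isclose(V(x_star), 0.0))     # smooth fit: V(x*) = 0
print(np.isclose(dV(x_star), 0.0))    # smooth fit: V'(x*) = 0
```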

12.5 Exercises
To Do.
Chapter 13

Monte Carlo Methods

The notes in this chapter are derived from Glasserman (2013).

13.1 Overview of Monte Carlo methods


Let X be a random variable defined on a probability space (Ω, F, P). Suppose we wish to compute the expectation µ := EX, but that we do not have an explicit expression for the distribution F_X or the characteristic function φ_X of X. Then we cannot compute µ using

µ = ∫_R x F_X(dx),        or        µ = −iφ_X'(0).

Suppose, however that we have the ability to generate a sequence of iid random variables (Xi )i ∈N where
Xi ∼ FX . Let us define

µ̂_n := (1/n) Σ_{i=1}^n X_i.

Note that µ̂_n is a random variable. By the SLLN we have

µ̂_n → µ  almost surely as  n → ∞.

Thus, it seems reasonable to take µb n as an estimate for µ. A natural question to ask, then, is: for a
given n, how good is our estimate µb n of µ? To answer this question, let us define σ 2 := VX. We
compute
Vµ̂_n = Σ_{i=1}^n V(X_i/n) = (1/n²) Σ_{i=1}^n VX_i = (1/n²) · nVX = (1/n)σ².

So the variance of µ̂_n scales like 1/n, and the standard deviation √(Vµ̂_n) scales like 1/√n. From the CLT, we know that µ̂_n is asymptotically normal:

µ̂_n ≈ Z_n ∼ N(µ, σ²/n)    (in distribution).

Using this, we can construct the c% confidence interval (c-C.I.) for µ:

c-C.I. = ( µ̂_n − z_c σ/√n,  µ̂_n + z_c σ/√n ),        ∫_{−z_c}^{z_c} (1/√(2π)) e^{−x²/2} dx = c.

Thus, if we want to decrease the size of the c-C.I. for µ by a factor of n, then we need to multiply the number of samples we take of X by a factor of n².

One quick side note: without an expression for FX or φX we cannot know σ 2 . However, we can estimate
σ 2 with the sample variance
σ̂²_n := (1/(n−1)) Σ_{i=1}^n ( X_i − µ̂_n )².

When σ 2 is unknown and n is large, we can construct the c-C.I. by replacing σ with σbn .
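The estimator, sample variance, and confidence interval translate directly into code. Here is a minimal Python sketch (numpy only) that estimates µ = EX for a toy choice of X and reports the approximate 95% C.I. (z_c ≈ 1.96); the sampler below is illustrative.

```python
# Basic Monte Carlo estimate of mu = E[X] with an approximate 95% confidence interval.
# The target X = exp(Z), Z ~ N(0,1), is a toy example (true mean exp(1/2)).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = np.exp(rng.standard_normal(n))

mu_hat = X.mean()                       # sample mean, estimates mu
sigma_hat = X.std(ddof=1)               # sample standard deviation (denominator n-1)
half_width = 1.96 * sigma_hat / np.sqrt(n)

print(f"mu_hat = {mu_hat:.4f}, 95% C.I. = "
      f"({mu_hat - half_width:.4f}, {mu_hat + half_width:.4f})")
print(f"true mean = {np.exp(0.5):.4f}")
```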

An alternative method of estimating µ can be described as follows. Suppose we can generate a sequence
of iid random variables (Yi ) and we know that EYi = µ. Then the following random variable
ν̂_n := (1/n) Σ_{i=1}^n Y_i,

is an alternative estimator of µ. In particular, ν̂_n is unbiased (as Eν̂_n = µ) and satisfies (by the SLLN) ν̂_n → µ. Moreover, if VY_i < VX, then it will follow that Vν̂_n < Vµ̂_n. Choosing (Y_i) cleverly so that Vν̂_n < Vµ̂_n is called variance reduction.

The method described above for estimating the mean µ of a random variable X by generating a sequence of iid random variables – either (X_i) or (Y_i) – is the Monte Carlo method in a nutshell. More generally, we may be interested in computing, not just the expectation of a random variable, but the expectation EF[X] of a functional F of an entire stochastic process X = (X_t)_{t≥0}. For example, we may wish to compute the time-weighted average of X, given by

EF[X] := E[ (1/T) ∫_0^T X_t dt ].
To do this, we will need to know how to generate iid random processes X^{(i)} = (X^{(i)}_t)_{t≥0}, i = 1, 2, .... Note that a collection of processes (X^{(i)}) is iid if the processes are independent and have the same finite-dimensional distributions:

X^{(i)} ⊥⊥ X^{(j)},    i ≠ j,
(X^{(i)}_{t_1}, X^{(i)}_{t_2}, ..., X^{(i)}_{t_n}) =^D (X^{(j)}_{t_1}, X^{(j)}_{t_2}, ..., X^{(j)}_{t_n}),    ∀ 0 ≤ t_1 ≤ t_2 ≤ ... ≤ t_n < ∞.

With the above in mind, the focus of this chapter will be on answering the following two questions:

1. How can we generate iid random variables? Or, more generally, how can we generate iid random
processes?

2. How can we “intelligently sample” (i.e., construct estimators of µ) in order to reduce the variance
of our estimates?

The first question will occupy Sections 13.2 through 13.6. Variance reduction methods will be the focus of Section 13.7.

13.2 Generating Random variables


Although programs such as Mathematica and Matlab can generate random variables with a variety of
distributions, it is worth knowing how this is accomplished.

Assumption 13.2.1. Throughout Section 13.2, U and the elements of the sequence (U_i) represent iid random variables that are uniformly distributed: U, U_i ∼ U(0, 1).

Nearly every computer can generate uniformly distributed random variables.

13.2.1 Inverse Transform Method


When the distribution function FX of a random variable X is known in closed-form, we can use the
Inverse Transform Method to generate iid random variables with the same distribution as X. The
following theorem describes how this is accomplished.

Theorem 13.2.2 (Inverse Transform Method). Let FX be the distribution function of a random
variable X defined on (Ω, F, P). Define

F_X^{-1}(u) := inf{x : F_X(x) ≥ u}.    (13.1)

Then we have

F_X^{-1}(U) ∼ F_X.    (13.2)

Proof. For simplicity, let us suppose F_X is strictly increasing (this is always the case for continuous random variables). Noting that F_X(x) ∈ (0, 1) for all x ∈ R and P(U ≤ y) = y for all y ∈ (0, 1), we have

P( F_X^{-1}(U) ≤ x ) = P( U ≤ F_X(x) ) = F_X(x),

which establishes (13.2).

Example 13.2.3. Let us apply Theorem 13.2.2 to generate an exponential random variable X ∼ E(λ).
For x ∈ (0, ∞) we have
F_X(x) = ∫_0^x λe^{−λs} ds = 1 − e^{−λx},    ⇒    F_X^{-1}(u) = (−1/λ) log(1 − u).

Using the fact that U ∼ 1 − U, we find that

(−1/λ) log U ∼ E(λ).
Example 13.2.4. Suppose X is a discrete random variable with P(X = xi ) = pi for i = 1, 2, . . . , n. Then
F_X(x) = Σ_{i: x_i ≤ x} p_i,    ⇒    F_X^{-1}(u) = Σ_{i=1}^n x_i 1_{{P_{i−1} < u ≤ P_i}},    P_i := Σ_{j=1}^i p_j,    (13.3)

where P_0 := 0. To understand (13.3), it may help to draw a graph of F_X and F_X^{-1}.
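Both examples are straightforward to implement. Below is a minimal Python sketch: an exponential sampler via (−1/λ) log U, and a discrete sampler implementing (13.3) via a search over the cumulative probabilities P_i.

```python
# Inverse transform sampling: exponential and discrete examples.
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(lam, n):
    # F^{-1}(u) = -(1/lam) log(1 - u); using U ~ 1 - U we take -(1/lam) log U.
    return -np.log(1.0 - rng.random(n)) / lam

def sample_discrete(values, probs, n):
    # F^{-1}(u) = x_i on the interval P_{i-1} < u <= P_i, with P_i the cumulative sums.
    P = np.cumsum(probs)
    U = rng.random(n)
    idx = np.searchsorted(P, U)    # smallest i with U <= P_i
    return np.asarray(values)[idx]

print(sample_exponential(2.0, 5))
print(sample_discrete([1, 2, 3], [0.2, 0.5, 0.3], 10))
```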

Theorem 13.2.5. Let FX be the distribution of a random variable X defined on (Ω, F, P). Let
Y := X|a < X ≤ b be a random variable whose distribution FY is the conditional distribution of X
given a < X ≤ b. Define
 
V := F_X(a) + ( F_X(b) − F_X(a) ) U.

Then, with F_X^{-1} as defined in (13.1), we have

F_X^{-1}(V) ∼ F_Y = F_{X|a<X≤b}.    (13.4)

Proof. Assume x ∈ (a, b). Then we have

P( F_X^{-1}(V) ≤ x ) = P( V ≤ F_X(x) ) = P( F_X(a) + (F_X(b) − F_X(a))U ≤ F_X(x) )
                     = P( U ≤ (F_X(x) − F_X(a)) / (F_X(b) − F_X(a)) )
                     = (F_X(x) − F_X(a)) / (F_X(b) − F_X(a))
                     = P(X ≤ x, a < X ≤ b) / P(a < X ≤ b)
                     = P(X ≤ x | a < X ≤ b) = F_{X|a<X≤b}(x),

which proves (13.4).



13.2.2 Acceptance-Rejection method


The inverse transform method described in Section 13.2.1 works only when the function F_X^{-1} is known in closed form. Unfortunately, there are many random variables for which F_X^{-1} is not known in closed form. An alternative method of generating random variables with a particular distribution is the Acceptance-Rejection Method, which we describe in the following theorem.

Theorem 13.2.6 (Acceptance-Rejection Method). Suppose X ∈ D ⊂ R^d and Y ∈ D are random vectors defined on (Ω, F, P) with densities f_X : D → R_+ and f_Y : D → R_+, respectively. Suppose further that there exists a constant c > 0 such that f_Y(x) ≤ cf_X(x) for all x ∈ D. Let U ∼ U(0, 1) be independent of X. Then

Y =^D X | U ≤ f_Y(X)/(cf_X(X)).    (13.5)

Proof. Suppose A ∈ B(D). Then we have

P( X ∈ A | U ≤ (1/c) f_Y(X)/f_X(X) ) = P( X ∈ A, U ≤ (1/c) f_Y(X)/f_X(X) ) / P( U ≤ (1/c) f_Y(X)/f_X(X) ).    (13.6)

For the numerator in (13.6), we have

P( X ∈ A, U ≤ (1/c) f_Y(X)/f_X(X) ) = E P( X ∈ A, U ≤ (1/c) f_Y(X)/f_X(X) | X )
                                    = E 1_{{X∈A}} (1/c) f_Y(X)/f_X(X)
                                    = ∫_A dx f_X(x) (1/c) f_Y(x)/f_X(x)
                                    = ∫_A dx (1/c) f_Y(x) = (1/c) P(Y ∈ A).    (13.7)

The denominator of (13.6) is obtained from (13.7) by taking A = D, which yields

P( U ≤ (1/c) f_Y(X)/f_X(X) ) = (1/c) P(Y ∈ D) = 1/c.    (13.8)

Inserting (13.7) and (13.8) into (13.6) yields

P( X ∈ A | U ≤ (1/c) f_Y(X)/f_X(X) ) = P(Y ∈ A),

which proves (13.5).

Using Theorem 13.2.6, if we can generate a random variable X with density f_X, then we can generate a random variable Y with density f_Y using the following algorithm:

1. Generate X.
2. Generate U.
3. If U ≤ (1/c) f_Y(X)/f_X(X), return X. Otherwise, go back to Step 1.

In order for the acceptance-rejection method to work efficiently, we would like the acceptance probability P(U ≤ (1/c) f_Y(X)/f_X(X)) = 1/c to be as close to one as possible. Thus, we should choose f_X to be similar to f_Y and we should choose c to be as small as possible.
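As an illustration, the following Python sketch samples from the density f_Y(x) = 6x(1−x) on (0, 1) (a Beta(2,2) density, chosen for this example) using the uniform proposal f_X ≡ 1 on (0, 1), for which f_Y ≤ cf_X with c = 3/2.

```python
# Acceptance-rejection sampling of the Beta(2,2) density f_Y(x) = 6x(1-x) on (0,1),
# using the uniform density f_X = 1 as the proposal. Here f_Y <= c f_X with c = 1.5.
import numpy as np

rng = np.random.default_rng(0)
c = 1.5                                  # max of 6x(1-x) is 6/4 = 1.5

def f_Y(x):
    return 6.0 * x * (1.0 - x)

def sample_beta22(n):
    out = []
    while len(out) < n:
        X = rng.random()                 # proposal draw, density f_X = 1 on (0,1)
        U = rng.random()
        if U <= f_Y(X) / c:              # accept with probability f_Y(X)/(c f_X(X))
            out.append(X)
    return np.array(out)

samples = sample_beta22(10_000)
print(samples.mean())                    # should be close to the true mean 1/2
```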

13.2.3 Generating Gaussian RVs


In this section, we will show how to generate independent and correlated Gaussian RVs.

Box-Muller Method

The Box-Muller Method of generating two independent Gaussian random variables is summarized in the following theorem.

Theorem 13.2.7 (Box-Muller Method). Define

X = √(−2 log U_1) cos(2πU_2),        Y = √(−2 log U_1) sin(2πU_2).

Then X ⊥⊥ Y and X, Y ∼ N(0, 1).


Proof. Define R := √(−2 log U_1). Then

P(R ≤ r) = P(−2 log U_1 ≤ r²) = P(log U_1 ≥ −r²/2)
         = 1 − P(U_1 < e^{−r²/2}) = 1 − e^{−r²/2}.

Denoting by f_R the density of R, we have

f_R(r) = (d/dr) F_R(r) = r e^{−r²/2},    r > 0.
Next, define Φ := 2πU_2. It is obvious that Φ ∼ U(0, 2π) and thus

f_Φ(φ) = 1/(2π),    φ ∈ (0, 2π).
As U_1 ⊥⊥ U_2 by assumption, it follows that R ⊥⊥ Φ and thus

f_{R,Φ}(r, φ) = f_R(r) f_Φ(φ) = (r/2π) e^{−r²/2}.

Next, define g : R_+ × (0, 2π) → R² by

g(r, φ) := (r cos φ, r sin φ),    ⇒    g^{-1}(x, y) = ( √(x² + y²), tan^{-1}(y/x) ).

Noting that (X, Y) = g(R, Φ), we have

f_{X,Y}(x, y) = f_{R,Φ}( g^{-1}(x, y) ) · 1/|det g'( g^{-1}(x, y) )|,        g'(r, φ) := [ ∂g_1/∂r  ∂g_1/∂φ ; ∂g_2/∂r  ∂g_2/∂φ ],
             = (1/√(2π)) e^{−x²/2} · (1/√(2π)) e^{−y²/2} = f_Z(x) · f_Z(y),        f_Z(z) := (1/√(2π)) e^{−z²/2},

where we have omitted some nontrivial but straightforward algebraic manipulations.

Note that we can generate a normal random variable X ∼ N(µ, σ²) from a standard normal as follows:

X := µ + σZ ∼ N(µ, σ²),    Z ∼ N(0, 1).
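A minimal Python implementation of the Box-Muller transform:

```python
# Box-Muller: two independent U(0,1) draws give two independent N(0,1) draws.
import numpy as np

rng = np.random.default_rng(0)

def box_muller(n):
    U1 = 1.0 - rng.random(n)          # lies in (0,1], avoids log(0)
    U2 = rng.random(n)
    R = np.sqrt(-2.0 * np.log(U1))    # radius: P(R <= r) = 1 - exp(-r^2/2)
    return R * np.cos(2 * np.pi * U2), R * np.sin(2 * np.pi * U2)

X, Y = box_muller(100_000)
print(X.mean(), X.std(), np.corrcoef(X, Y)[0, 1])   # approx 0, 1, 0
```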

Generating correlated Gaussian RVs

We now know how to generate independent Gaussian random variables. In this section, we will show
how to generate correlated Gaussian random variables. The density f_X of a multivariate Gaussian vector X ∈ R^d is given by

f_X(x) = (2π)^{−d/2} |Σ|^{−1/2} exp( −½ (x − µ)ᵀ Σ^{-1} (x − µ) ),
where we have defined the mean vector µ and the covariance matrix Σ componentwise by

µ_i = EX_i,        Σ_ij = E(X_i − µ_i)(X_j − µ_j),        i, j = 1, 2, ..., d.

If I is the d-dimensional identity matrix then, by definition, Z ∼ N(0, I) is a d-dimensional Gaussian with independent components. Suppose now that A ∈ R^{d×d}. One can show by direct computation that

X := µ + AZ ∼ N(µ, AAᵀ),    Z ∼ N(0, I).

Thus, if we wish to generate a d-dimensional Gaussian random vector X ∼ N(µ, Σ), we need only find a matrix A such that AAᵀ = Σ. Such a matrix A can be constructed via the Cholesky Factorization, which we describe in the following theorem.

Theorem 13.2.8 (Cholesky Decomposition). Suppose Σ is a d × d symmetric positive definite matrix. Define a d × d lower triangular (i.e., A_ij = 0 if j > i) matrix A by

A_ii = ( Σ_ii − Σ_{k=1}^{i−1} A²_ik )^{1/2},        i = 1, 2, ..., d,

A_ij = (1/A_jj) ( Σ_ij − Σ_{k=1}^{j−1} A_ik A_jk ),        j = 1, 2, ..., i − 1.

Then AAᵀ = Σ.

We will not provide a proof of Theorem 13.2.8. Instead, we direct the interested reader to (Glasserman,
2013, Chapter 2).
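In practice one would call a library routine (e.g., numpy.linalg.cholesky); the Python sketch below instead implements the recursions of Theorem 13.2.8 directly and uses the factor to generate correlated Gaussian draws, with illustrative values for µ and Σ.

```python
# Cholesky factorization A A^T = Sigma, and correlated Gaussian sampling X = mu + A Z.
import numpy as np

def cholesky_lower(Sigma):
    d = Sigma.shape[0]
    A = np.zeros_like(Sigma, dtype=float)
    for i in range(d):
        for j in range(i):                                   # off-diagonal, j < i
            A[i, j] = (Sigma[i, j] - A[i, :j] @ A[j, :j]) / A[j, j]
        A[i, i] = np.sqrt(Sigma[i, i] - A[i, :i] @ A[i, :i])  # diagonal
    return A

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])

A = cholesky_lower(Sigma)
Z = rng.standard_normal((2, 100_000))
X = mu[:, None] + A @ Z                   # columns are iid N(mu, Sigma) draws

print(np.allclose(A @ A.T, Sigma))        # True
print(np.cov(X))                          # approximately Sigma
```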

13.3 Simulating Discrete-Time Markov Chains


Let X = (Xi )i ∈N0 be a time-homogeneous discrete-time Markov chain with state space S = {1, 2, . . . , |S|}
(note that |S| may be infinite). Define p : S × S → [0, 1] by

p(i , j ) = P(Xn+1 = j |Xn = i ).

We can simulate a path of X by setting

Xn+1 = Yn+1 , n = 0, 1, 2, . . . ,

where Yn+1 is a discrete random variable with

P(Yn+1 = j |Xn ) = p(Xn , j ), j = 1, 2, . . . , |S|.

Note that Yn+1 can be generated using the Inverse Transform Method, described in Section 13.2.1.
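A short Python sketch of this procedure for a finite state space, drawing Y_{n+1} from the row p(X_n, ·) of the transition matrix by inverse transform (the transition matrix below is illustrative):

```python
# Simulate a path of a discrete-time Markov chain with transition matrix p.
import numpy as np

rng = np.random.default_rng(0)

# A toy 3-state transition matrix (rows sum to one); illustrative values.
p = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

def simulate_dtmc(p, x0, n_steps):
    path = [x0]
    for _ in range(n_steps):
        # Draw Y_{n+1} from the discrete distribution p(X_n, .) by inverse transform.
        path.append(int(np.searchsorted(np.cumsum(p[path[-1]]), rng.random())))
    return np.array(path)

print(simulate_dtmc(p, x0=0, n_steps=20))
```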

13.4 Simulating Continuous-Time Markov Chains


Let X = (Xt )t ≥0 be a time-homogeneous continuous-time Markov chain with state space S = {1, 2, . . . , |S|}
(note that |S| may be infinite). Denote by G = (g(i , j ))i ,j the generator of X. Let us denote by τn the
time of the nth jump of X with τ0 := 0. Recall that
τ_{n+1} − τ_n | X_{τ_n} ∼ E( −g(X_{τ_n}, X_{τ_n}) ),        P( X_{τ_{n+1}} = j | X_{τ_n} ) = −g(X_{τ_n}, j) / g(X_{τ_n}, X_{τ_n}),    j ≠ X_{τ_n}.
With this in mind, we can simulate a path of X by setting

τ_{n+1} = τ_n + T_{n+1},    X_{τ_{n+1}} = Y_{n+1},    n = 0, 1, 2, ...,

where T_{n+1} is an exponentially distributed random variable and Y_{n+1} is a discrete random variable with

T_{n+1} | X_{τ_n} ∼ E( −g(X_{τ_n}, X_{τ_n}) ),        P( Y_{n+1} = j | X_{τ_n} ) = −g(X_{τ_n}, j) / g(X_{τ_n}, X_{τ_n}),    j ≠ X_{τ_n}.

Note that both T_{n+1} and Y_{n+1} can be generated using the Inverse Transform Method, described in Section 13.2.1. Finally, as X is constant between jumps, we have

X_{τ_j + s} = X_{τ_j},    s ∈ [0, τ_{j+1} − τ_j).
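A Python sketch of this procedure, given a generator matrix G (off-diagonal entries are jump rates, rows sum to zero); the example generator is illustrative:

```python
# Simulate a continuous-time Markov chain from its generator matrix G.
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-state generator: rows sum to zero, off-diagonals are jump rates.
G = np.array([[-1.0,  0.6,  0.4],
              [ 0.5, -1.5,  1.0],
              [ 0.3,  0.7, -1.0]])

def simulate_ctmc(G, x0, t_max):
    times, states = [0.0], [x0]
    while times[-1] < t_max:
        i = states[-1]
        hold = rng.exponential(1.0 / -G[i, i])       # holding time ~ E(-g(i,i))
        probs = np.delete(G[i], i) / -G[i, i]        # jump probs -g(i,j)/g(i,i), j != i
        j = np.delete(np.arange(len(G)), i)[np.searchsorted(np.cumsum(probs), rng.random())]
        times.append(times[-1] + hold)
        states.append(int(j))
    return np.array(times), np.array(states)

print(simulate_ctmc(G, x0=0, t_max=5.0))
```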

13.5 Simulating Diffusions


In this section, we shall assume that X = (Xt )t ≥0 is the solution of an SDE of the form

dXt = µ(t , Xt )dt + σ(t , Xt )dWt , (13.9)

or, in integral form,

X_{t_{i+1}} = X_{t_i} + ∫_{t_i}^{t_{i+1}} µ(t, X_t) dt + ∫_{t_i}^{t_{i+1}} σ(t, X_t) dW_t.

In principle, X and W could live in Rd and Rm , respectively. However, for simplicity, we shall assume
that d = m = 1.

Assumption 13.5.1. Throughout Sections 13.5 and 13.6 the random variable Z and the sequence (Zi )
will represent iid random variables that are Normally distributed Z, Zi ∼ N(0, 1).

13.5.1 Non-random coefficients


Let us consider the case when the drift and diffusion coefficients µ and σ of the diffusion X are deterministic
functions of time. In this case we have
X_{t_{i+1}} = X_{t_i} + ∫_{t_i}^{t_{i+1}} µ(t) dt + ∫_{t_i}^{t_{i+1}} σ(t) dW_t.    (13.10)

From Proposition 8.2.9 we have that the Itô integral is independent of X_{t_i} and normally distributed:

∫_{t_i}^{t_{i+1}} σ(t) dW_t ∼ N(0, v²(t_i, t_{i+1})),        v²(t_i, t_{i+1}) := ∫_{t_i}^{t_{i+1}} σ²(t) dt.

Thus, if we wish to simulate the value of X on a grid Π = {t0 = 0, t1 , t2 , . . .}, we can do so using the
following algorithm
X_{t_{i+1}} = X_{t_i} + ∫_{t_i}^{t_{i+1}} µ(t) dt + v(t_i, t_{i+1}) Z_{i+1},    (13.11)

where we have used vZ_{i+1} ∼ N(0, v²). Notice that the algorithm (13.11) used to generate the value of X on the grid Π is exact, meaning the distributions of (X_{t_0}, X_{t_1}, X_{t_2}, ...) as generated using (13.10) and (13.11) are the same.

Example 13.5.2. Suppose S = (St )t ≥0 is a geometric Brownian motion with time-dependent coefficients

dSt = b(t )St dt + a(t )St dWt .

Note that the drift and diffusion coefficients of S are random. Nevertheless, we can simulate S exactly as
follows. First define X = log S. Then, by Itô’s Lemma, we have
 
dX_t = ( b(t) − ½a²(t) ) dt + a(t) dW_t.

As the drift and diffusion coefficients of X are deterministic functions of time, we can simulate (X_{t_0}, X_{t_1}, X_{t_2}, ...) exactly. Then we can set (S_{t_0}, S_{t_1}, S_{t_2}, ...) = (e^{X_{t_0}}, e^{X_{t_1}}, e^{X_{t_2}}, ...).
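A Python sketch of this exact scheme for the time-dependent GBM on a uniform grid; the coefficient functions b and a below are illustrative:

```python
# Exact simulation of dS = b(t) S dt + a(t) S dW via X = log S.
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(0)
b = lambda t: 0.05 + 0.01 * t           # illustrative drift coefficient
a = lambda t: 0.20 + 0.05 * np.sin(t)   # illustrative diffusion coefficient

def simulate_gbm(S0, grid):
    X = [np.log(S0)]
    for ti, tj in zip(grid[:-1], grid[1:]):
        drift = quad(lambda t: b(t) - 0.5 * a(t)**2, ti, tj)[0]   # int (b - a^2/2) dt
        v2 = quad(lambda t: a(t)**2, ti, tj)[0]                   # variance of Ito integral
        X.append(X[-1] + drift + np.sqrt(v2) * rng.standard_normal())
    return np.exp(X)

print(simulate_gbm(1.0, np.linspace(0.0, 1.0, 13)))
```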

13.5.2 Ornstein-Uhlenbeck and other Gaussian Processes


Suppose that X = (Xt )t ≥0 is an OU process

dXt = κ(θ – Xt )dt + adWt .

As we showed in Example 8.3.6, we can solve the SDE for X explicitly:

X_{t_{i+1}} = θ + e^{−κ(t_{i+1}−t_i)} ( X_{t_i} − θ ) + ∫_{t_i}^{t_{i+1}} e^{−κ(t_{i+1}−t)} a dW_t.

From Proposition 8.2.9 we have that the Itô integral is independent of X_{t_i} and normally distributed:

∫_{t_i}^{t_{i+1}} e^{−κ(t_{i+1}−t)} a dW_t ∼ N(0, v²(t_i, t_{i+1})),        v²(t_i, t_{i+1}) = ∫_{t_i}^{t_{i+1}} e^{−2κ(t_{i+1}−t)} a² dt = (a²/2κ) ( 1 − e^{−2κ(t_{i+1}−t_i)} ).    (13.12)

Thus, if we wish to simulate the value of X on a grid Π = {t0 = 0, t1 , t2 , . . .}, we can do so using the
following algorithm

X_{t_{i+1}} = θ + e^{−κ(t_{i+1}−t_i)} ( X_{t_i} − θ ) + v(t_i, t_{i+1}) Z_{i+1},

where v is given by (13.12). Note that this simulation is exact. More generally, one can simulate exactly any process of the form

dX_t = ( f(t) + g(t)X_t ) dt + h(t) dW_t,

where (f, g, h) are deterministic functions of time. We leave it as an exercise for students to show that X_{t_{i+1}} | X_{t_i} is normally distributed.
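A minimal Python sketch of the exact OU scheme on a uniform grid (with illustrative parameters):

```python
# Exact simulation of the OU process dX = kappa (theta - X) dt + a dW.
import numpy as np

rng = np.random.default_rng(0)
kappa, theta, a = 2.0, 1.0, 0.5          # illustrative parameters

def simulate_ou(x0, T, n):
    dt = T / n
    decay = np.exp(-kappa * dt)
    v = np.sqrt(a**2 / (2 * kappa) * (1 - decay**2))   # std dev from (13.12)
    X = np.empty(n + 1)
    X[0] = x0
    for i in range(n):
        X[i + 1] = theta + decay * (X[i] - theta) + v * rng.standard_normal()
    return X

print(simulate_ou(x0=0.0, T=1.0, n=12))
```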

13.5.3 Euler Discretization of an SDE


Unfortunately, most SDEs cannot be simulated exactly. In such cases, we must resort to approximation schemes. The work-horse of approximation schemes for SDEs is the Euler Discretization Scheme, which we now describe. To begin, we fix a time horizon T > 0 and a time grid Π as follows:

Π = {t_0, t_1, ..., t_n},    t_i = iδ,    δ = T/n.

For δ ≪ 1 it seems reasonable to approximate the process X in (13.9) on the grid Π by a process X̂ given by

X̂_{t_{i+1}} = X̂_{t_i} + ∫_{t_i}^{t_{i+1}} µ(t_i, X̂_{t_i}) dt + ∫_{t_i}^{t_{i+1}} σ(t_i, X̂_{t_i}) dW_t
            = X̂_{t_i} + µ(t_i, X̂_{t_i}) (t_{i+1} − t_i) + σ(t_i, X̂_{t_i}) ( W_{t_{i+1}} − W_{t_i} ).    (13.13)

From the definition of the Itô integral, as n → ∞ the process X̂ converges (in L²(Ω, F, P) and therefore in probability) to X. Moreover, for a fixed n, we can simulate the process X̂ exactly using

X̂_{t_{i+1}} = X̂_{t_i} + µ(t_i, X̂_{t_i}) δ + σ(t_i, X̂_{t_i}) √δ Z_{i+1},

where we have used t_{i+1} − t_i = δ and W_{t_{i+1}} − W_{t_i} ∼ N(0, δ). The question naturally arises, then: how fast does X̂ converge to X as δ → 0? To give a quantifiable response to this question, we need to define some reasonable measures of convergence.

Definition 13.5.3. Let X and X̂ be as given in (13.9) and (13.13), respectively. We say that X̂ converges strongly to X with order β > 0 if there exists a constant C such that

E|X̂_{nδ} − X_T| ≤ Cδ^β,    as δ → 0.    (13.14)

We say that X̂ converges weakly to X with order β > 0 if there exists a constant C such that

|Ef(X̂_{nδ}) − Ef(X_T)| ≤ Cδ^β,    as δ → 0,    ∀ f ∈ C_P^{2β+1},    (13.15)

where C_P^k is the set of functions whose derivatives up to order k are polynomially bounded.

Although we will not prove it, the Euler discretization X̂ converges strongly to X with order β = 1/2 if the drift and diffusion coefficients of X satisfy the linear growth and Lipschitz conditions required for existence and uniqueness of a strong solution of an SDE (see Theorem 9.1.4) and additionally satisfy

|µ(t, x) − µ(s, x)| + |σ(t, x) − σ(s, x)| ≤ C(1 + |x|) √(t − s),    ∀ 0 ≤ s < t < ∞,

for some C. Under the additional condition that µ, σ ∈ C_P^{2β+1}, one can show that X̂ converges weakly to X with order β = 1. For detailed statements concerning weak and strong convergence of X̂ to X we refer the reader to Kloeden and Platen (1992).

Statements concerning order of convergence give us information about how quickly the approximation X̂ approaches X. However, as the constants C in (13.14) and (13.15) are typically unknown, an order of convergence statement does not provide much guidance for choosing the size of δ to obtain a desired level of accuracy. From a practical standpoint, one typically must resort to trial-and-error in order to find an appropriate δ. For example, if one wishes to estimate Ef(X_T), one could simulate m sample paths (X̂^{(1)}, X̂^{(2)}, ..., X̂^{(m)}) with δ fixed and then compute

(1/m) Σ_{i=1}^m f(X̂^{(i)}_{nδ}).    (13.16)

One could then repeat this procedure with smaller and smaller δ's until the value of (13.16) does not change "too much," where "too much" depends on the desired level of accuracy.
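A Python sketch of the Euler scheme for a generic scalar SDE; the coefficients µ and σ below are illustrative (a mean-reverting SDE with state-dependent diffusion, for which no exact scheme is available):

```python
# Euler discretization of dX = mu(t, X) dt + sigma(t, X) dW on a uniform grid.
import numpy as np

rng = np.random.default_rng(0)

def euler_paths(mu, sigma, x0, T, n, m):
    """Simulate m Euler paths on the grid t_i = i*T/n; returns an (m, n+1) array."""
    delta = T / n
    X = np.empty((m, n + 1))
    X[:, 0] = x0
    for i in range(n):
        Z = rng.standard_normal(m)
        t = i * delta
        X[:, i + 1] = (X[:, i] + mu(t, X[:, i]) * delta
                       + sigma(t, X[:, i]) * np.sqrt(delta) * Z)
    return X

# Illustrative coefficients (not from the text).
mu = lambda t, x: 2.0 * (1.0 - x)
sigma = lambda t, x: 0.3 * np.sqrt(1.0 + x**2)

paths = euler_paths(mu, sigma, x0=0.0, T=1.0, n=100, m=10_000)
print(paths[:, -1].mean())   # Monte Carlo estimate of E[X_T]
```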

13.6 Simulating Jump-Diffusions


In this section, we suppose that X = (X_t)_{t≥0} is the solution of the following SDE:

dX_t = µ(t, X_t) dt + σ(t, X_t) dW_t + ∫_R γ(t, X_{t−}, z) N(dt, dz),    (13.17)

where N is a finite-activity Poisson random measure

EN(dt , dz ) = ν(dz )dt = λF(dz )dt , ν(R) = λ.

We could alternatively have written the dynamics of X using the compensated Poisson random measure Ñ(dt, dz) = N(dt, dz) − ν(dz)dt as follows:

dX_t = ( µ(t, X_t) + ∫_R γ(t, X_{t−}, z) ν(dz) ) dt + σ(t, X_t) dW_t + ∫_R γ(t, X_{t−}, z) Ñ(dt, dz).

However, for the purposes of simulating sample paths of X, it is preferable to work with the expression (13.17). We will present two Euler-like discretizations of (13.17).

13.6.1 Euler Discretization on a fixed time grid


As in Section 13.5.3, let us begin by fixing a time horizon T > 0 and a time grid

Π = {t_0, t_1, ..., t_n},    t_i = iδ,    δ = T/n.

We can approximate X on Π by a process X̂ given by

X̂_{t_{i+1}} = X̂_{t_i} + ∫_{t_i}^{t_{i+1}} µ(t_i, X̂_{t_i}) dt + ∫_{t_i}^{t_{i+1}} σ(t_i, X̂_{t_i}) dW_t + ∫_{t_i}^{t_{i+1}} ∫_R γ(t_i, X̂_{t_i}, z) N(dt, dz)
            = X̂_{t_i} + µ(t_i, X̂_{t_i})(t_{i+1} − t_i) + σ(t_i, X̂_{t_i})( W_{t_{i+1}} − W_{t_i} ) + ∫_R γ(t_i, X̂_{t_i}, z) ( N(t_{i+1}, dz) − N(t_i, dz) ).    (13.18)
We know that t_{i+1} − t_i = δ and W_{t_{i+1}} − W_{t_i} ∼ N(0, δ). Now, for the jump term we note that

P( N(t_{i+1}, R) − N(t_i, R) = 0 ) = 1 − λδ + O(δ²),
P( N(t_{i+1}, R) − N(t_i, R) = 1 ) = λδ + O(δ²),
P( N(t_{i+1}, R) − N(t_i, R) ≥ 2 ) = O(δ²).

Thus, we can approximate the jump term in (13.18) using a Bernoulli random variable B_{i+1} and a random variable Y_{i+1} with distribution F as follows:

∫_R γ(t_i, X̂_{t_i}, z) ( N(t_{i+1}, dz) − N(t_i, dz) ) ≈^D γ(t_i, X̂_{t_i}, Y_{i+1}) B_{i+1},        Y_{i+1} ∼ F,    B_{i+1} ∼ B(λδ).

Putting everything together, we can approximately simulate X̂ using the following algorithm:

X̂_{t_{i+1}} = X̂_{t_i} + µ(t_i, X̂_{t_i}) δ + σ(t_i, X̂_{t_i}) √δ Z_{i+1} + γ(t_i, X̂_{t_i}, Y_{i+1}) B_{i+1} + O(δ²).

Note that our simulation of X̂ is not exact, due to the error term of O(δ²). However, as we are already making an error of O(δ) by approximating X with X̂, the O(δ²) term is not problematic.

13.6.2 Euler Discretization on a random time grid


An alternative Euler-like discretization of X can be constructed as follows. First, we fix a time horizon T < ∞. Noting that N(t, R) is a Poisson process with intensity λ = ν(R), we can simulate the jump times τ_i of X as follows:

τ_{i+1} = τ_i + T_{i+1},    τ_0 := 0,    T_{i+1} ∼ E(λ).

Next, we construct a time grid

Π = {t0 = 0, t1 , t2 , . . . , ti , τ1 –, τ1 , ti +1 , . . . , tj , τ2 –, τ2 , tj +1 , . . . , tn = T}, tk = k δ, δ = T/n,

where the jump times (τi ) and the fixed times (ti ) are placed in chronological order. Now, from time τi
to τi +1 – the dynamics of X are given by

dXt = µ(t , Xt )dt + σ(t , Xt )dWt , t ∈ [τi , τi +1 ).



Thus, in between jump dates, we can simulate X using the Euler scheme described in Section 13.5.3. At
the jump dates we have
X_{τ_i} = X_{τ_i−} + ∫_R γ(τ_i, X_{τ_i−}, z) ∆N(τ_i, dz) =^D X_{τ_i−} + γ(τ_i, X_{τ_i−}, Y_i),    Y_i ∼ F.
One advantage of simulating sample paths of X on a random time grid is that we obtain, for each sample path, the exact jump times of X. When we simulate X on a fixed time grid, we can only specify intervals [t_i, t_{i+1}) during which jumps occur.

13.7 Variance Reduction Techniques


Having thoroughly discussed how to generate iid sample paths of random processes, we now turn our attention to variance reduction techniques. The basic idea of this section can be summarized as follows. Suppose we wish to estimate µ = EX. If we can generate iid random variables (Y_i) such that EY_i = µ (where Y_i may have a different distribution than X), we can construct an unbiased estimator of µ as follows:

ν̂_n = (1/n) Σ_{i=1}^n Y_i.

The variance of our estimator is given by

Vν̂_n = Σ_{i=1}^n V(Y_i/n) = (1/n²) Σ_{i=1}^n VY_i = (1/n) VY.

The goal, then, is to cleverly choose Y so that VY is small.

13.7.1 Control Variates


Let us suppose that (X, Y) are correlated random variables and that we wish to compute µY := EY. Let
us further suppose that we have a method of generating a sequence (Xi , Yi ) of iid random vectors with
the same distribution as (X, Y). Our usual method of estimating µY is to compute
µ̂_n := (1/n) Σ_{i=1}^n Y_i.
As Vµ̂_n = σ²_Y/n, where σ²_Y := VY, the only way to improve our estimate µ̂_n of µ_Y is to make n larger. Now, suppose we know that EX = µ_X. We can use this information to create a new estimator µ̂_n(b), which is given by

µ̂_n(b) := (1/n) Σ_{i=1}^n Y_i(b),        Y_i(b) := Y_i − b(X_i − µ_X),    b ∈ R.

Note that µ̂_n(b), like µ̂_n, is an unbiased estimator of µ_Y, as Eµ̂_n(b) = µ_Y. Moreover, we will show that, by choosing b appropriately, we can obtain Vµ̂_n(b) ≤ Vµ̂_n. To see how this is so, we compute

VY_i(b) = E( Y_i(b) − EY_i(b) )²
        = σ²_Y − 2bρ_XY σ_X σ_Y + b²σ²_X =: σ²_Y(b),

where σ²_X = VX and ρ_XY is the correlation between X and Y. Note that σ²_Y(b) is quadratic and convex as a function of b and achieves a minimum at

b* = ρ_XY σ_Y / σ_X = Cov(X, Y) / VX.
As b* is the minimizer of σ²_Y(b), we know that σ²_Y(b*) ≤ σ²_Y(0) = σ²_Y. Thus, we have

Vµ̂_n(b*) = (1/n) σ²_Y(b*) ≤ (1/n) σ²_Y = Vµ̂_n.
Thus, by using µ̂_n(b*) in place of µ̂_n, we can achieve a lower-variance estimator of µ_Y. The variable X, whose expectation µ_X is known, is called a control variate.

In many practical scenarios, it may not be realistic to assume that we know Cov(X, Y) and VX, which are needed to compute b*. However, we can always estimate b* using

b̂*_n = Σ_{i=1}^n (X_i − µ_X)(Y_i − µ̂_n) / Σ_{i=1}^n (X_i − µ_X)².
One cautionary note: unlike µ̂_n(b), which is an unbiased estimator of µ_Y for any fixed b ∈ R, the estimator µ̂_n(b̂*_n) is a biased estimator of µ_Y (because Eµ̂_n(b̂*_n) ≠ µ_Y).

Example 13.7.1. Suppose we wish to estimate Ef (ST ) where S = (St )t ≥0 is the solution of the following
SDE

dSt = a(St )dWt .

Note that S is a martingale and thus ES_T = S_0. Assuming we can simulate iid paths of S (either exactly or approximately), we can estimate Ef(S_T) using the control variate method with (X_i, Y_i) = (S^{(i)}_T, f(S^{(i)}_T)).
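A Python sketch of the control variate estimator in the setting of Example 13.7.1, with S taken to be a driftless geometric Brownian motion (so it can be simulated exactly) and f a call-type payoff; both choices are illustrative:

```python
# Control variate estimate of E f(S_T) using X = S_T (with known mean E S_T = S_0).
import numpy as np

rng = np.random.default_rng(0)
S0, a, T, n = 1.0, 0.2, 1.0, 100_000

# Driftless GBM dS = a S dW simulated exactly: S_T = S0 exp(-a^2 T/2 + a W_T).
ST = S0 * np.exp(-0.5 * a**2 * T + a * np.sqrt(T) * rng.standard_normal(n))
Y = np.maximum(ST - 1.0, 0.0)                  # f(S_T): an illustrative payoff

b_hat = np.cov(ST, Y)[0, 1] / ST.var(ddof=1)   # estimate of b* = Cov(X,Y)/VX
Y_cv = Y - b_hat * (ST - S0)                   # control-variate-adjusted samples

print("plain      :", Y.mean(), "+/-", Y.std(ddof=1) / np.sqrt(n))
print("control var:", Y_cv.mean(), "+/-", Y_cv.std(ddof=1) / np.sqrt(n))
```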

13.7.2 Importance Sampling


Consider a random variable X defined on (Ω, F, P). Let P̃ be a probability measure satisfying P̃(A) = 0 ⇒ P(A) = 0 (i.e., P ≪ P̃). We can compute µ := Eϕ(X) in two ways:

Eϕ(X) = Ẽ Zϕ(X),        Z = dP/dP̃.
e

Let us denote by F the distribution of X under P and by F̃ the distribution of (X, Z) under P̃. Suppose we can generate iid random variables (X_i) with X_i ∼ F and iid random vectors (X_i, Z_i) with (X_i, Z_i) ∼ F̃. Then we can estimate µ in two ways:

µ̂_n = (1/n) Σ_{i=1}^n ϕ(X_i),        X_i ∼ F,
ν̂_n = (1/n) Σ_{i=1}^n Z_i ϕ(X_i),        (X_i, Z_i) ∼ F̃.

Both µ̂_n and ν̂_n are unbiased estimators of µ, as Eµ̂_n = Ẽν̂_n = µ. To see which estimator has the lower variance, we compute

Vϕ(X) = Eϕ²(X) − (Eϕ(X))²,        Ṽ Zϕ(X) = Ẽ Z²ϕ²(X) − (Ẽ Zϕ(X))².

Noting that Eϕ(X) = Ẽ Zϕ(X), it follows that

if  Eϕ²(X) ≥ Ẽ Z²ϕ²(X) = E Zϕ²(X),    then  Vµ̂_n ≥ Ṽν̂_n.

Now, let us consider the case where Z = p(X)/p̃(X), where p is the density of X under P and p̃ is some other density. Note that, with this choice of Z, we have P̃(X ∈ dx) = p̃(x)dx, because

P(X ∈ A) = E 1_{{X∈A}} = Ẽ Z 1_{{X∈A}} = Ẽ ( p(X)/p̃(X) ) 1_{{X∈A}} = ∫_A dx p̃(x) p(x)/p̃(x) = ∫_A dx p(x).

If p̃ were not the density of X under P̃, then we would not have P(X ∈ A) = ∫_A dx p(x). The question arises: how should we choose p̃ in order to minimize E Zϕ²(X)? If we set

p̃(x) = (1/c) p(x)|ϕ(x)|,        c = ∫ dx p(x)|ϕ(x)| = E|ϕ(X)|,    (13.19)

then we have

E Zϕ²(X) = E ( p(X)/p̃(X) ) ϕ²(X) = c E ( ϕ²(X)/|ϕ(X)| ) = c E|ϕ(X)| = ( E|ϕ(X)| )² ≤ Eϕ²(X),

where the final inequality follows from Jensen's inequality.

Thus, if we choose Z = p(X)/pe (X) with pe given by (13.19), we will obtain an estimator νbn of µ with a
lower variance than µb n . Of course, from a practical standpoint, if we knew p, then we could compute
Eϕ(X) without the need for Monte Carlo simulation. Nevertheless, this exercise gives us insight as to
how we might choose Z. Namely, we should try to choose Z so that pe (x ) is large when |ϕ(x )|p(x ) is large.

Example 13.7.2. Let X = (Xt )t ≥0 be a process on (Ω, F, P) defined by

dXt = a(Xt )dWt ,


13.7. VARIANCE REDUCTION TECHNIQUES 231

where W is a Brownian motion under P. Suppose we wish to compute P(X_T > x) = E 1_{{X_T > x}}, where x ≫ X_0. We could approximate X on a grid Π = {t_0 = 0, t_1, t_2, ..., t_m = T} with a process X̂, where

X̂_{t_{i+1}} = X̂_{t_i} + a(X̂_{t_i}) ( W_{t_{i+1}} − W_{t_i} ),    i = 0, 1, ..., m − 1.

Next, we can estimate µ := P(X_T > x) = E 1_{{X_T > x}} using

µ̂_n = (1/n) Σ_{i=1}^n 1_{{X̂^{(i)}_T > x}},        where the X̂^{(i)} are iid realizations of X̂ under P.

Alternatively, consider the change of measure

Z = dP/dP̃ = e^{−½γ²T − γW̃_T} = e^{−½γ²T} ∏_{i=0}^{m−1} e^{−γ( W̃_{t_{i+1}} − W̃_{t_i} )},
i =0

for some γ > 0. The dynamics of X can be written as

dX_t = a(X_t) ( γ dt + dW̃_t ),

where W̃ is a Brownian motion under P̃. We could approximate X on the grid Π with a process X̂, where

X̂_{t_{i+1}} = X̂_{t_i} + a(X̂_{t_i}) ( γ(t_{i+1} − t_i) + ( W̃_{t_{i+1}} − W̃_{t_i} ) ),    i = 0, 1, ..., m − 1.

Then, using

P(X_T > x) = Ẽ (dP/dP̃) 1_{{X_T > x}} = Ẽ Z 1_{{X_T > x}},

we could compute

ν̂_n = (1/n) Σ_{i=1}^n Z^{(i)} 1_{{X̂^{(i)}_T > x}},        Z^{(i)} = e^{−½γ²T} ∏_{j=0}^{m−1} e^{−γ( W̃^{(i)}_{t_{j+1}} − W̃^{(i)}_{t_j} )},

where the (X̂^{(i)}, Z^{(i)}) are iid realizations of (X̂, Z) under P̃.

As X has a positive drift γa(X) under P̃, the event {X_T > x} occurs more frequently under P̃, and it is likely that Ṽν̂_n < Vµ̂_n.
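As a simple concrete illustration of the same idea (not the diffusion of Example 13.7.2), the following Python sketch estimates the tail probability P(X > 4) for X ∼ N(0, 1) by sampling from the shifted density p̃ = N(θ, 1) and weighting by the likelihood ratio Z = p(x)/p̃(x) = e^{−θx + θ²/2}:

```python
# Importance sampling for a Gaussian tail probability P(X > 4), X ~ N(0,1).
# Sample under the shifted density p_tilde = N(theta, 1), weight by Z = p/p_tilde.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, x, theta = 100_000, 4.0, 4.0

# Plain Monte Carlo: almost no samples land in the tail.
X_plain = rng.standard_normal(n)
mu_hat = (X_plain > x).mean()

# Importance sampling: X ~ N(theta, 1) under P_tilde; Z = exp(-theta X + theta^2/2).
X_is = theta + rng.standard_normal(n)
Z = np.exp(-theta * X_is + 0.5 * theta**2)
nu_hat = (Z * (X_is > x)).mean()

print(mu_hat, nu_hat, norm.sf(x))   # exact value is approx 3.167e-05
```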

13.7.3 Antithetic Variates


Suppose a random variable Y⁺ can be expressed as a function g of a sequence of m iid standard normal RVs (Z_i):

Y⁺ = g(Z_1, Z_2, ..., Z_m).

Because Z_i ∼ −Z_i, it follows that

Y⁻ := g(−Z_1, −Z_2, ..., −Z_m) =^D Y⁺.

We call the pair (Y⁺, Y⁻) antithetic variables. Note that Y⁺ and Y⁻ are not independent.

Now, suppose we wish to compute µ := EY⁺. As we know how to generate iid standard normal random variables (Z_i), we can generate iid random variables (Y⁺_i) and (Y⁻_i). Thus, we can construct two unbiased estimators of µ as follows:

µ̂_n := (1/2n) Σ_{i=1}^n ( Y⁺_i + Y⁺_{n+i} ),        ν̂_n := (1/2n) Σ_{i=1}^n ( Y⁺_i + Y⁻_i ).

We would like to know under which conditions we have Vν̂_n < Vµ̂_n. Noting that

V( Y⁺_i + Y⁺_{n+i} ) = VY⁺_i + VY⁺_{n+i} = 2VY⁺,
V( Y⁺_i + Y⁻_i ) = VY⁺_i + VY⁻_i + 2Cov(Y⁺_i, Y⁻_i) = 2VY⁺ + 2Cov(Y⁺, Y⁻),

we see that, if Cov(Y⁺, Y⁻) < 0, then we will have Vν̂_n < Vµ̂_n. Note that we will have Cov(Y⁺_i, Y⁻_i) < 0
if the function g is monotonic in each of its components, i.e., if

zj < yj ⇒ g(z1 , . . . , zj , . . . , zm ) ≤ g(z1 , . . . , yj , . . . , zm ), ∀ j = 1, 2, . . . m.

Example 13.7.3. Suppose we wish to compute Eϕ(X_T), where X = (X_t)_{t≥0} satisfies the SDE dX_t = µ(t, X_t)dt + σ(t, X_t)dW_t. We can generate an approximate path of X on a grid Π = {t_0 = 0, t_1, ..., t_m = T} with a process X̂⁺ defined by

X̂⁺_{t_{i+1}} = X̂⁺_{t_i} + µ(t_i, X̂⁺_{t_i})(t_{i+1} − t_i) + σ(t_i, X̂⁺_{t_i}) √(t_{i+1} − t_i) Z_{i+1},    i = 0, 1, ..., m − 1.

Alternatively, we can generate an approximate path of X on Π with a process X̂⁻ defined by

X̂⁻_{t_{i+1}} = X̂⁻_{t_i} + µ(t_i, X̂⁻_{t_i})(t_{i+1} − t_i) + σ(t_i, X̂⁻_{t_i}) √(t_{i+1} − t_i) (−Z_{i+1}),    i = 0, 1, ..., m − 1.

We can now construct antithetic variables (Y⁺, Y⁻) = (ϕ(X̂⁺_T), ϕ(X̂⁻_T)).
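A Python sketch of the antithetic estimator for an illustrative GBM-type SDE with a monotone payoff (all parameter choices are assumptions for the example):

```python
# Antithetic variates for E phi(X_T): drive paired Euler paths with Z and -Z.
import numpy as np

rng = np.random.default_rng(0)
x0, T, m, n = 1.0, 1.0, 50, 20_000
mu = lambda t, x: 0.05 * x
sigma = lambda t, x: 0.20 * x
phi = lambda x: np.maximum(x - 1.0, 0.0)     # monotone payoff, illustrative

dt = T / m
Xp = np.full(n, x0)                          # paths driven by +Z
Xm = np.full(n, x0)                          # antithetic paths driven by -Z
for i in range(m):
    t = i * dt
    Z = rng.standard_normal(n)
    Xp = Xp + mu(t, Xp) * dt + sigma(t, Xp) * np.sqrt(dt) * Z
    Xm = Xm + mu(t, Xm) * dt + sigma(t, Xm) * np.sqrt(dt) * (-Z)

pairs = 0.5 * (phi(Xp) + phi(Xm))            # one antithetic pair average per path
# Note: each pair uses two paths, so this comparison ignores the 2x path cost.
print("antithetic:", pairs.mean(), "+/-", pairs.std(ddof=1) / np.sqrt(n))
print("plain     :", phi(Xp).mean(), "+/-", phi(Xp).std(ddof=1) / np.sqrt(n))
```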

13.8 Exercises
To do.
Bibliography

Björk, T. (2004). Arbitrage theory in continuous time. Oxford University Press.

Cont, R. and P. Tankov (2004). Financial modelling with jump processes, Volume 2. Chapman & Hall.

Glasserman, P. (2013). Monte Carlo methods in financial engineering, Volume 53. Springer Science &
Business Media.

Grimmett, G. and D. Stirzaker (2001). Probability and random processes. Oxford University Press.

Hawkes, A. G. (1971). Spectra of some self-exciting and mutually exciting point processes.
Biometrika 58 (1), 83–90.

Karlin, S. and H. Taylor (1981). A second course in stochastic processes. Academic Press.

Kloeden, P. and E. Platen (1992). Numerical Solution of Stochastic Differential Equations, Volume 23.
Springer-Verlag Berlin Heidelberg.

Linetsky, V. (2007). Chapter 6 spectral methods in derivatives pricing. In J. R. Birge and V. Linetsky
(Eds.), Financial Engineering, Volume 15 of Handbooks in Operations Research and Management
Science, pp. 223 – 299. Elsevier.

Øksendal, B. (2005). Stochastic Differential Equations: An Introduction with Applications (6 ed.). Springer-Verlag.

Øksendal, B. and A. Sulem (2005). Applied stochastic control of jump diffusions. Springer Verlag.

Shreve, S. E. (2004). Stochastic calculus for finance II: Continuous-time models, Volume 11. Springer
Science & Business Media.

van Handel, R. (2007). Stochastic Calculus, Filtering, and Stochastic Control.
