ProbStochProc 1.42 NoSolns PDF
Matthew Lorig 1
1 Department of Applied Mathematics, University of Washington, Seattle, WA, USA. e-mail: mlorig@uw.edu
Contents
Preface
1 Review of probability
1.1 Events as sets
1.2 Probability
1.3 Infinite probability spaces
1.3.1 Uniform Lebesgue measure on (0, 1)
1.3.2 Infinite sequence of coin tosses
1.4 Random variables and distributions
1.5 Stochastic Processes
1.6 Expectation
1.6.1 Integration in the Lebesgue sense
1.6.2 Computing expectations
1.7 Change of measure
1.8 Exercises
These notes are intended to give first-year PhD students in applied mathematics a broad introduction
to measure-theoretic probability and stochastic processes. Because the focus of this course is on the
applied aspects of these topics, we will sometimes forgo mathematical rigor, favoring instead a heuristic
development. The mathematical statements in these notes should be taken as “true in spirit,” but perhaps
not always rigorously true in the mathematical sense. The hope is that what the notes lack in rigor,
they make up in clarity. Each chapter begins with a list of references, which the interested student can
go to for rigorously true statements.
It should be noted that these notes are a work in progress. Students are encouraged to e-mail the
professor if they find errors.
Acknowledgments
The author of these notes wishes to express his sincere thanks to Weston Barger and Yu-Chen Cheng for
checking and writing homework solutions as well as making corrections and improvements to the text.
Chapter 1
Review of probability
The notes from this chapter are taken primarily from (Shreve, 2004, Chapter 1) and (Grimmett and
Stirzaker, 2001, Chapters 1–5).
1.1 Events as sets

Definition 1.1.2. An event is a subset of the sample space. We usually denote events by capital Roman
letters A, B, C, . . ..
Example 1.1.3 (Toss a coin once). When one tosses a coin, there are two possible outcomes: heads (H) and tails (T). Thus, we have Ω = {H, T}. One element of Ω is, e.g., ω = H. Possible event: “toss a head,” A = {H}.
Example 1.1.4 (Toss two distinguishable coins). Ω = {(HH), (HT), (TH), (TT)}. One element of Ω is, e.g., ω = (HT). Possible event: “second toss a tail,” A = {(HT), (TT)}.
Example 1.1.5 (Toss two indistinguishable coins). Ω = {{HH}, {HT}, {TT}}. One element of Ω is, e.g., ω = {HH}. Possible event: “coins match,” A = {{HH}, {TT}}.
Example 1.1.6 (Roll a die). Ω = {1, 2, 3, 4, 5, 6}. One element of Ω is, e.g., ω = 2. Possible event:
“roll an odd number.” A = {1, 3, 5}.
Here, we use (·) to denote an ordered sequence and we use {·} to denote an unordered set. Thus,
(HT) ≠ (TH) but {HT} = {TH}.
If A and B are subsets of Ω, we can reasonably concern ourselves with events such as “not A” (Ac ), “A
or B” (A ∪ B), “A and B” (A ∩ B), etc. A σ-algebra is a mathematical way to describe all possible sets of
interest for a given sample space Ω.
Note that if F is a σ-algebra then Ω ∈ F by items 1 and 3. Note also that F is closed under countable intersections since, by De Morgan's laws, ∩i Ai = (∪i A_i^c)^c.
Alternatively, one can define a σ-algebra F as a set of subsets of Ω that contains at least the empty set ∅ and is closed under countable set operations (though not necessarily closed under uncountable set operations).
Example 1.1.8 (Trivial σ-algebra). The set of subsets F0 := {∅, Ω} of Ω is commonly referred to as
the trivial σ-algebra.
Example 1.1.10. The power set of Ω, written 2^Ω, is the collection of all subsets of Ω. The power set F = 2^Ω is a σ-algebra.
Definition 1.1.11. Let G be a collection of subsets of Ω. The σ-algebra generated by G, written σ(G),
is the smallest σ-algebra that contains G.
By “smallest” σ-algebra we mean the σ-algebra with the fewest sets. One can show (although we will not
do so in these notes) that σ(G) is equal to the intersection of all σ-algebras that contain G.
Example 1.1.12. Let A be a nonempty proper subset of Ω. The collection of sets G = {∅, A, Ω} is not a σ-algebra because it does not contain A^c. However, we can create a σ-algebra from G by simply adding the set A^c. Thus, we have σ(G) = {∅, Ω, A, A^c}.
Definition 1.1.13. The pair (Ω, F) where Ω is a sample space and F is a σ-algebra of subsets of Ω is
called a measurable space.
1.2 Probability
So far, we have not yet talked about probabilities at all – only outcomes of a random experiment (elements
ω ∈ Ω) and events (subsets A ⊆ Ω). A probability measure assigns probabilities to events.
Definition 1.2.1. A probability measure defined on (Ω, F) is a function P : F → [0, 1] that satisfies

1. P(Ω) = 1;

2. if Ai ∩ Aj = ∅ for i ≠ j then P(∪i Ai) = Σi P(Ai). (countable additivity)

As we shall see in Section 1.3.2, it is very important to recognize that Item 2 holds only for countable unions. For an uncountable union it is not true that P(∪α Aα) = Σα P(Aα).
How can we see the well-known result P(A^c) = 1 − P(A) from the above definition? Simply note that A and A^c are disjoint with A ∪ A^c = Ω, so items 1 and 2 give 1 = P(Ω) = P(A) + P(A^c).
A probability measure P does not need to correspond to empirically observed probabilities! For example, from experience, we know that if we toss a fair coin we have P(H) = P(T) = 1/2. However, we can always define a measure P̃ that assigns different probabilities P̃(H) = p and P̃(T) = 1 − p. As long as p ∈ [0, 1], the measure P̃ is a probability measure on (Ω, F), where Ω = {H, T} and F = {∅, {H}, {T}, Ω}.
The triple (Ω, F, P) is often referred to as a probability space or probability triple. To review: the sample space Ω is the collection of all possible outcomes of an experiment; the σ-algebra F contains all sets of interest of an experiment; and the probability measure P assigns probabilities to these sets.
1.3 Infinite probability spaces

When a sample space is finite, Ω = {ω1, ω2, . . . , ωn}, we can always take the σ-algebra to be the power set F = 2^Ω and construct a probability measure P on (Ω, F) by specifying the probability pi := P({ωi}) of each individual outcome, where pi ≥ 0 and Σi pi = 1. However, when the sample space Ω is infinite, choosing an appropriate σ-algebra F and constructing a probability measure P on (Ω, F) is a more delicate procedure.
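To make the finite construction concrete, the following Python sketch (ours, not part of the original development; the function names are illustrative) represents P by the map ωi ↦ pi and computes P(A) by finite additivity.

# A minimal sketch of a finite probability space: the measure is the
# map omega_i -> p_i, and P(A) is computed by (finite) additivity.

def make_measure(weights):
    """weights: dict mapping each outcome omega_i to p_i >= 0, summing to 1."""
    assert all(p >= 0 for p in weights.values())
    assert abs(sum(weights.values()) - 1.0) < 1e-12
    return weights

def prob(P, event):
    """P(A) = sum of p_i over outcomes omega_i in A."""
    return sum(p for omega, p in P.items() if omega in event)

# Two tosses of a fair coin: Omega = {HH, HT, TH, TT}, F = 2^Omega.
P = make_measure({"HH": 0.25, "HT": 0.25, "TH": 0.25, "TT": 0.25})
print(prob(P, {"HT", "TT"}))   # "second toss is a tail": 0.5
print(prob(P, set(P)))         # P(Omega) = 1.0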
1.3.1 Uniform Lebesgue measure on (0, 1)

Consider the experiment of choosing a number ω at random from Ω = (0, 1), in such a way that the probability that ω falls in an interval is the length of that interval:

P({ω : ω ∈ (a, b)}) = b − a =: µ((a, b)), 0 ≤ a < b ≤ 1, (1.1)

where we have defined the Lebesgue measure µ. Equation (1.1) tells us how to determine the probability that ω falls within an open interval. But, in fact, (1.1) tells us more than that. If P is to be a probability measure, then it must satisfy the countable additivity property given in Definition 1.2.1. Thus, we also know, for example, that

P({ω : ω ∈ (a, b) ∪ (c, d)}) = P({ω : ω ∈ (a, b)}) + P({ω : ω ∈ (c, d)}) = (b − a) + (d − c), 0 < a < b < c < d < 1.
It is natural to ask: what are all of the subsets of (0, 1) whose probabilities are determined by (1.1) and the properties of probability measures given in Definition 1.2.1? Surprisingly, the answer is not the power set 2^(0,1). It turns out that the power set 2^(0,1) contains sets whose probabilities are not determined by (1.1). The sets whose probabilities are uniquely determined by (1.1) and Definition 1.2.1 are the sets in the σ-algebra generated by the open intervals:

B((0, 1)) := σ({(a, b) : 0 ≤ a < b ≤ 1}).
We call B((0, 1)) the Borel σ-algebra on (0, 1). Thus, the appropriate probability space for our experiment is (Ω, F, P) = ((0, 1), B((0, 1)), µ).
Definition 1.3.1. Let Ω be some topological space and let O(Ω) be the set of open sets in Ω. We define the Borel σ-algebra on Ω, denoted B(Ω), by B(Ω) := σ(O(Ω)).
Remark 1.3.2. Do not worry too much about what exactly Borel σ-algebras are. Just think of them as
“reasonable” sets. In fact, you would have to think very hard to come up with a set that is not a Borel set.
1.3.2 Infinite sequence of coin tosses

Consider the experiment of tossing a coin infinitely many times, so that the sample space Ω is the set of all infinite sequences of heads and tails. Note that this set is not only infinite but uncountably infinite because there is a one-to-one correspondence between Ω and the set of reals in [0, 1]. We will denote a generic element of Ω as follows:
ω = ω1 ω2 ω3 . . .
where ωi is the result of the i-th coin toss. We want to construct a σ-algebra for this experiment. Before any coin is tossed, the only sets we can resolve are the trivial ones:

F0 = {∅, Ω}.

Next, define the two sets determined by the first toss:

AH = {ω ∈ Ω : ω1 = H}, AT = {ω ∈ Ω : ω1 = T}.

The collection

F1 := {∅, Ω, AH, AT}

satisfies the conditions of a σ-algebra. Given ω1 it is possible to say whether or not ω is in each of the sets in F1. For example, if ω1 = H then ω ∈ AH and ω ∈ Ω, but ω ∉ AT and ω ∉ ∅. Next define four sets:

AHH = {ω : ω1 = H, ω2 = H}, AHT = {ω : ω1 = H, ω2 = T}, ATH = {ω : ω1 = T, ω2 = H}, ATT = {ω : ω1 = T, ω2 = T}.
We wish to construct a σ-algebra that contains these sets and the sets in F1. The smallest such σ-algebra is

F2 = {∅, Ω, AH, AT, AHH, AHT, ATH, ATT, A^c_HH, A^c_HT, A^c_TH, A^c_TT, AHH ∪ ATH, AHH ∪ ATT, AHT ∪ ATH, AHT ∪ ATT}.
Given ω1 and ω2 , we can say if ω belongs to each of the sets in F2 . Continuing in this way, we can define
a σ-algebra Fn for every n ∈ N. Finally, we take
F := σ(F∞ ), F∞ = ∪n Fn .
One might ask: could we have simply taken F = F∞? Well, F∞ contains every set that can be described
in terms of finitely many coin tosses. However, we may be interested in sets such as “sequences for
which x percent of coin tosses are heads,” and these sets are not in F∞ . It turns out such sets are in F.
Now, we want to construct a probability measure on F. Let us assume the coin tosses are independent (a term we will describe rigorously later on) and that the probability of a head is p. Setting q = 1 − p, it should be obvious that

P(AH) = p, P(AT) = q, P(AHH) = p², P(AHT) = P(ATH) = pq, P(ATT) = q².
Continuing in this way, we can define P(A) for every A ∈ F∞ . What about the sets that are in F but
not in F∞ ? It turns out that once we have defined P for sets in F∞ there is only one way to assign
probabilities to those sets that are in F but not in F∞ . We refer the interested reader to Carathéodory’s
Extension Theorem for details.
Consider the event that the long-run fraction of heads equals one half:

A = {ω ∈ Ω : lim_{n→∞} (1/n) Σ_{i=1}^n 1{ωi = H} = 1/2}.

The strong law of large numbers (SLLN) tells us that P(A) = 1 if p = 1/2 and P(A) = 0 if p ≠ 1/2 (if you have not yet seen the SLLN, you should be able to see this from intuition). Now it should be clear why uncountable additivity does not hold for probability measures. The probability of any given infinite sequence of coin tosses is zero: P(ω) = 0. If we were to attempt to compute P(A) by adding up the probabilities P(ω) of all elements ω ∈ A, we would find

Σ_{ω∈A} P(ω) = Σ_{ω∈A} 0 = 0 ≠ 1 = P(A), (when p = 1/2).
We finish this example (we will come back to it!) with the following definition.
Definition 1.3.3. Let (Ω, F, P) be a probability space. If a set A ∈ F satisfies P(A) = 1, we say that
the event A occurs P-almost surely (written P-a.s.).
Note in the example above that, when p = 1/2, we have P(A) = 1 and thus A occurs almost surely. But it is important to recognize that A ≠ Ω and A^c ≠ ∅. The elements of A^c are part of the sample space Ω, but they have zero probability of occurring.
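Although no computer can produce an infinite sequence of tosses, a long finite prefix illustrates why A is almost sure exactly when p = 1/2. A short simulation sketch (numpy assumed available; the sample sizes are arbitrary choices of ours):

import numpy as np

# Track the fraction of heads in a long sequence of independent tosses;
# by the SLLN it converges to p, so the event A (limit = 1/2) has
# probability one iff p = 1/2.
rng = np.random.default_rng(0)
for p in (0.5, 0.7):
    tosses = rng.random(200_000) < p      # True = head
    print(f"p = {p}: fraction of heads = {tosses.mean():.4f}")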
1.4 Random variables and distributions

Definition 1.4.1. A random variable defined on (Ω, F) is a function X : Ω → R with the property that

{X ∈ A} := {ω ∈ Ω : X(ω) ∈ A} ∈ F, ∀ A ∈ B(R).
Observe that any random variable must be defined on a measurable space (Ω, F), as these appear in the definition. Note, however, that the probability measure P does not appear in the definition. Random variables are defined independently of a probability measure P.
What does Definition 1.4.1 mean? Recall that a probability measure P defined on (Ω, F) maps F → [0, 1].
In order for us to answer the question “what is the probability that X ∈ A?” we need the set {X ∈ A} to be an element of F. And this is precisely what Definition 1.4.1 requires. Why do we only consider sets A ∈ B(R)
rather than any set A ⊂ R? The answer is rather technical and, frankly, not worth exploring at the
moment.
A word on notation: the standard convention is to use capital Roman letters (typically, X, Y, Z) for
random variables and lower case Roman letters (x , y, z ) for real numbers.
Example 1.4.2 (Discrete time model for stock prices). Consider the infinite sequence of coin tosses in Section 1.3.2. We define a sequence of random variables (Sn)n≥0 via

S0(ω) = 1, Sn+1(ω) = u Sn(ω) if ωn+1 = H, and Sn+1(ω) = d Sn(ω) if ωn+1 = T. (1.2)
Here, Sn represents the value of a stock at time n. Note that P(S1 = u) = P(AH ) = p. Likewise
P(S2 = ud) = P(AHT ∪ ATH) = 2pq. More generally, one can show that

P(Sn = u^k d^{n−k}) = (n choose k) p^k q^{n−k}, k = 0, 1, . . . , n. (1.3)
Note that if we had simply defined the random variables (Sn)n≥1 as having probabilities given by (1.3), we would have no information about how, e.g., Sn relates to Sn−1. From the above construction (1.2), however, we know that if Sn = u^n then Sn−1 = u^{n−1}. Thus, the structure of a given probability space, not just the probabilities of events, is very important.
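The construction (1.2) is straightforward to simulate, and comparing empirical frequencies with (1.3) is a useful sanity check. A sketch (numpy assumed; u, d, p and the sample size are arbitrary choices of ours):

import numpy as np
from math import comb

# Simulate paths of the binomial stock model (1.2) and compare the
# empirical frequency of {S_n = u^k d^(n-k)} with (1.3).
u, d, p, n, paths = 1.1, 0.9, 0.6, 10, 100_000
rng = np.random.default_rng(1)

heads = rng.random((paths, n)) < p   # coin tosses for each path
k = heads.sum(axis=1)                # number of up-moves; S_n = u^k d^(n-k)

for kk in range(n + 1):
    empirical = (k == kk).mean()
    exact = comb(n, kk) * p**kk * (1 - p)**(n - kk)
    print(f"k = {kk:2d}: empirical {empirical:.4f}, exact {exact:.4f}")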
Example 1.4.3. Let (Ω, F) = ((0, 1), B((0, 1))). Define random variables X(ω) = ω and Y(ω) = 1 − ω. Clearly, we have X = 1 − Y. Now, suppose we define P(dω) := dω. Then X and Y have the same distribution. For x ∈ [0, 1] we have

P(X ≤ x) = P(ω ≤ x) = ∫_0^x P(dω) = ∫_0^x dω = x,

and

P(Y ≤ x) = P(1 − ω ≤ x) = ∫_{1−x}^1 P(dω) = ∫_{1−x}^1 dω = x.
The distribution of a random variable X is most easily described through its cumulative distribution function

FX(x) := P(X ≤ x), x ∈ R.
Observe that, while a random variable X is defined with respect to (Ω, F) (with no reference to P), the
distribution FX is specific to a probability measure P.
Note that we put the random variable X in the subscript of FX to remind us that FX is the distribution
function corresponding to the random variable X (and not, e.g., Y). It is a good idea to do this.
Many (but not all) random variables fall into one of two categories: discrete and continuous. We describe these two categories below.
Definition 1.4.5. A random variable X is called discrete if it takes values in some countable set A := {x1, x2, . . .} ⊂ R. We associate to a discrete random variable a probability mass function fX : A → R, defined by fX(xi) := P(X = xi).
Definition 1.4.6. A random variable X is called continuous if its distribution function FX can be written as

FX(x) = ∫_{−∞}^x du fX(u), x ∈ R,

for some function fX : R → [0, ∞) called the probability density function of X.
If X is either discrete or continuous, it is easy to compute P(X ∈ A) for any A ∈ B(R). We have

discrete: P(X ∈ A) = Σ_{i : xi ∈ A} fX(xi),
continuous: P(X ∈ A) = ∫_A dx fX(x).
Example 1.4.8. If X is distributed as a Bernoulli random variable with parameter p ∈ [0, 1], written X ∼ Ber(p), then

X ∈ {0, 1}, fX(0) = 1 − p, fX(1) = p.
Example 1.4.10. If X is distributed as a Geometric random variable with parameter p ∈ [0, 1], written X ∼ Geo(p), then

X ∈ N, fX(k) = p(1 − p)^{k−1}.

Note that if Xi ∼ Ber(p), i = 1, 2, . . ., are independent of each other, then Y := inf{i : Xi = 1} ∼ Geo(p).
Example 1.4.11. If X is distributed as a Poisson random variable with parameter λ > 0, written X ∼ Poi(λ), then

X ∈ {0} ∪ N, fX(k) = (λ^k/k!) e^{−λ}.
Definition 1.4.12. Let A be a set in some topological space Ω (e.g., Ω = R^d). The indicator function 1A : Ω → {0, 1} is defined as follows:

1A(x) := 1 if x ∈ A, and 1A(x) := 0 if x ∉ A.

Notation: We will sometimes write 1A(x) = 1{x∈A}.
We now introduce some continuous random variables that frequently arise in applications.
Example 1.4.13. If X is distributed as a Uniform random variable on the interval [a, b] ⊂ R, written X ∼ U([a, b]), then

X ∈ [a, b], fX(x) = 1[a,b](x) · 1/(b − a).
Example 1.4.14. If X is distributed as an Exponential random variable with parameter λ > 0, written X ∼ E(λ), then

X ∈ [0, ∞), fX(x) = λ e^{−λx}.
Example 1.4.15. If X is distributed as a Gaussian or Normal random variable with mean µ ∈ R and variance σ² > 0 (we will give a meaning to “mean” and “variance” below), written X ∼ N(µ, σ²), then

X ∈ R, fX(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).
A random variable Z ∼ N(0, 1) is referred to as standard normal.
1.5 Stochastic Processes

A stochastic process is a collection of random variables X = (Xt)t∈T indexed by a set T, which we usually interpret as time. We can think of a stochastic process X : T × Ω → R in (at least) two ways. First, for any t ∈ T we have that Xt : Ω → R is a random variable. Second, for any ω ∈ Ω, we have that X·(ω) : T → R is a function of time. Both interpretations can be useful.
1.6 Expectation
When we think of averaging we think of weighting outcomes by their probabilities. The mathematical
way to encode this is via the expectation.
1.6.1 Integration in the Lebesgue sense

Definition 1.6.1. Let X be a random variable defined on (Ω, F, P). The expectation of X, written EX, is defined as

EX := ∫_Ω X(ω) P(dω),

where the integral is in the Lebesgue sense, which we now describe.
Definition 1.6.2. Fix a probability space (Ω, F, P). Let A ∈ F. The indicator random variable, denoted 1A, is defined by

1A(ω) := 1 if ω ∈ A, and 1A(ω) := 0 if ω ∉ A.
Observe that 1A ∼ Ber(p) with p = P(A). For disjoint sets A and B we have

1A∪B = 1A + 1B, A ∩ B = ∅,

and for general sets A and B we have

1A∩B = 1A 1B.
Definition 1.6.4. Let (Ai) be a finite partition of Ω. A non-negative random variable X, defined on a probability space (Ω, F, P), which is of the form

X(ω) = Σ_{i=1}^n xi 1Ai(ω), xi ≥ 0, Ai ∈ F,

is called simple.
The expectation of a simple random variable is defined as EX := Σi xi P(Ai). In particular,

E1A = P(A).

Thus, we can always represent probabilities of sets as expectations of indicator random variables.
Now, consider a non-negative random variable X, which is not necessarily simple. Let (Xn)n≥0 be an increasing sequence of simple random variables that converges almost surely to X. That is,

Xi ≤ Xi+1, lim_{i→∞} Xi = X, P-a.s.

We then define

EX := lim_{i→∞} EXi,

where each of the expectations on the right-hand side is well-defined because all of the Xi are simple by construction. Finally, consider a general random variable X that could take either positive or negative values. Define

X+ := max(X, 0), X− := max(−X, 0).
Note that X+ and X− are non-negative and X = X+ − X−. With this in mind, we define

EX := EX+ − EX−.

Definition 1.6.1 of EX makes sense if E|X| < ∞ or if EX± = ∞ and EX∓ < ∞. In the latter case, we have EX = ±∞. If both EX+ = ∞ and EX− = ∞, then we find ourselves in an ∞ − ∞ situation and, in this case, EX is undefined.
1.6.2 Computing expectations

If X is either discrete or continuous, Definition 1.6.1 reduces to the formulas one learns as an undergraduate:

discrete: EX = Σi xi fX(xi),
continuous: EX = ∫_R dx x fX(x).
In the discrete case, the sum runs over all possible values of x. More generally, we can express the expected value of X as

EX = ∫_R x FX(dx) := lim_{‖Π‖→0} Σi ((xi + xi+1)/2) (FX(xi+1) − FX(xi)), (1.5)

where Π = {x0 < x1 < x2 < . . .} is a partition of R and ‖Π‖ := supi (xi+1 − xi) denotes its mesh.
The expression on the right-hand side of (1.5) is known as a Stieltjes integral. The advantage of using the Stieltjes integral ∫ x FX(dx) to compute an expectation is that every random variable X – whether it be discrete or continuous – has a distribution FX. Thus, by using the Stieltjes integral, we avoid having to treat discrete and continuous cases separately.
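The Stieltjes sums in (1.5) can be evaluated numerically for any distribution function, which is exactly the point: the same code handles discrete and continuous FX. A sketch (numpy assumed; the partitions and examples are ours):

import numpy as np

# Approximate EX = ∫ x F_X(dx) by the midpoint Stieltjes sums in (1.5).
def stieltjes_mean(F, lo, hi, m=200_000):
    x = np.linspace(lo, hi, m)            # partition of [lo, hi]
    mid = 0.5 * (x[:-1] + x[1:])          # midpoints (x_i + x_{i+1})/2
    return np.sum(mid * np.diff(F(x)))    # sum of mid * (F(x_{i+1}) - F(x_i))

# Continuous: X ~ E(2), F(x) = 1 - e^{-2x}, EX = 1/2.
F_exp = lambda x: np.where(x < 0, 0.0, 1.0 - np.exp(-2.0 * x))
print(stieltjes_mean(F_exp, -1.0, 20.0))  # ~0.5

# Discrete: X ~ Ber(0.3), F jumps at 0 and 1, EX = 0.3.
F_ber = lambda x: np.where(x < 0, 0.0, np.where(x < 1, 0.7, 1.0))
print(stieltjes_mean(F_ber, -1.0, 2.0))   # ~0.3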
Note that E is a linear operator. If X and Y are random variables and a and b are constants, then

E(aX + bY) = a EX + b EY.
How does one compute Eg(X) where g : R → R? Although we have not stated it explicitly, it should be obvious that if X is a random variable, then Y := g(X) is also a random variable. Thus, we have

EY = Eg(X) = ∫_Ω g(X(ω)) P(dω) = ∫_R g(x) FX(dx),

which, in the discrete and continuous cases, reduces to Σi g(xi) fX(xi) and ∫_R dx g(x) fX(x), respectively.
1.7 Change of measure

Theorem 1.7.1. Fix a probability space (Ω, F, P) and let Z ≥ 0 be a random variable satisfying EZ = 1. Define P̃ : F → [0, 1] by

P̃(A) := E Z 1A. (1.6)

Then P̃ is a probability measure on (Ω, F). Moreover, denoting by Ẽ expectation under P̃, we have

ẼX = E Z X, and, if Z > 0, EX = Ẽ (1/Z) X, (1.7)

where X is a random variable defined on (Ω, F).
Definition 1.7.2. We call the random variable Z in Theorem 1.7.1 the Radon-Nikodym derivative of P̃ with respect to P.
Proof. Let us check that P̃ is a probability measure. First, we have

P̃(Ω) = E Z 1Ω = E Z = 1.

Countable additivity of P̃ follows from that of E: for disjoint sets (Ai) we have

P̃(∪i Ai) = E Z Σi 1Ai = Σi E Z 1Ai = Σi P̃(Ai).

Note that interchanging the sum with the expectation, E Σi → Σi E, is allowed by Tonelli's Theorem.
Finally, to show equation (1.7) holds, it is enough to check that it holds for simple random variables X = Σi xi 1Ai. We have

ẼX = Ẽ Σi xi 1Ai = Σi xi E Z 1Ai = Σi xi P̃(Ai),

which agrees with the definition of expectation for simple random variables. Finally, if Z > 0, we have

Ẽ (1/Z) X = E Z (1/Z) X = EX.
Definition 1.7.3. A probability measure P̃ defined on (Ω, F) is absolutely continuous with respect to another probability measure P, written P̃ ≪ P, if

P(A) = 0 ⇒ P̃(A) = 0.
Two probability measures P and P̃ are called equivalent if

P(A) = 0 ⇔ P̃(A) = 0.

Note that P and P̃ related by (1.6) are equivalent when the Radon-Nikodym derivative Z relating them is strictly positive: Z > 0. Equivalent measures agree on which events will happen with probability zero (and thus, they agree on which events will happen with probability one).
Example 1.7.5. Let us return to Example 1.4.3. We set (Ω, F) = ((0, 1), B((0, 1))). On this measurable space, we define two probability measures P(dω) = dω and P̃(dω) = 2ω dω. Note that we have

P̃(A) = Ẽ 1A = ∫_Ω 1A(ω) P̃(dω) = ∫_Ω 1A(ω) 2ω dω = ∫_Ω 1A(ω) 2ω P(dω) = E 1A Z, Z(ω) := 2ω.
Thus, Z(ω) = 2ω is the Radon-Nikodym derivative of P̃ with respect to P. While the suggestive notation

Z(ω) = (dP̃/dP)(ω), P̃(dω) = (dP̃/dP)(ω) P(dω),

is not a technical definition, it gives the correct intuition. In particular, for the special case of an infinite probability space in which P(dω) = p(ω) dω and P̃(dω) = p̃(ω) dω with P̃ ≪ P, we have Z(ω) = p̃(ω)/p(ω).
Example 1.7.6 (Change of measure: Normal random variable). On (Ω, F, P) let X ∼ N(0, 1) and define Y = X + θ. Clearly, we have Y ∼ N(θ, 1). Now, define a random variable Z by

Z = e^{−θX − θ²/2}.
Clearly Z > 0, and a direct computation shows EZ = 1, so that P̃(A) := E Z 1A defines a probability measure. We compute

P̃(Y ≤ b) = E Z 1{Y≤b} = E e^{−θX − θ²/2} 1{X≤b−θ}
= ∫_{−∞}^{b−θ} dx e^{−θx − θ²/2} fX(x)
= ∫_{−∞}^{b−θ} dx (1/√(2π)) e^{−(x+θ)²/2}
= ∫_{−∞}^{b} dz (1/√(2π)) e^{−z²/2}.
Thus, under P̃ we see that Y ∼ N(0, 1). The Radon-Nikodym derivative Z changes the mean of Y from θ to 0, but it does not affect the variance of Y.
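The identity ẼX = EZX is the basis of importance sampling: P̃-quantities can be estimated by weighting P-samples with Z. A Monte Carlo sketch of Example 1.7.6 (numpy assumed; θ, b and the sample size are arbitrary choices of ours):

import numpy as np
from math import erf, sqrt

# Under P, X ~ N(0,1) and Y = X + theta. Weighting by
# Z = exp(-theta*X - theta^2/2) should make Y standard normal under P~.
theta, b, nsim = 1.5, 0.3, 1_000_000
rng = np.random.default_rng(2)

X = rng.standard_normal(nsim)
Y = X + theta
Z = np.exp(-theta * X - 0.5 * theta**2)

print(Z.mean())                       # EZ, should be ~1
print((Z * (Y <= b)).mean())          # P~(Y <= b) = E[Z 1_{Y<=b}]
print(0.5 * (1 + erf(b / sqrt(2))))   # N(0,1) CDF at b, the target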
Example 1.7.7 (Change of measure: Exponential random variable). On (Ω, F, P), let X ∼ E(λ) and define

Z = (µ/λ) e^{−(µ−λ)X}.
Clearly, we have Z ≥ 0. We also have

EZ = E (µ/λ) e^{−(µ−λ)X} = ∫_0^∞ dx (µ/λ) e^{−(µ−λ)x} fX(x) = ∫_0^∞ dx (µ/λ) e^{−(µ−λ)x} λ e^{−λx} = ∫_0^∞ dx µ e^{−µx} = 1.
Thus, P̃(A) := E Z 1A defines a probability measure. We compute

P̃(X ≤ b) = E Z 1{X≤b} = E (µ/λ) e^{−(µ−λ)X} 1{X≤b}
= ∫_0^b dx (µ/λ) e^{−(µ−λ)x} fX(x) = ∫_0^b dx (µ/λ) e^{−(µ−λ)x} λ e^{−λx} = ∫_0^b dx µ e^{−µx}.

Thus, under P̃, we have X ∼ E(µ).
1.8 Exercises
Exercise 1.1. Let F be a σ-algebra of Ω. Suppose B ∈ F. Show that G := {A ∩ B : A ∈ F} is a σ-algebra
of B.
Exercise 1.2. Let F and G be σ-algebras of Ω. (a) Show that F ∩ G is a σ-algebra of Ω. (b) Show that
F ∪ G is not necessarily a σ-algebra of Ω.
Exercise 1.3. Describe the probability space (Ω, F, P) for the following three experiments: (a) a biased
coin is tossed three times; (b) two balls are drawn without replacement from an urn which originally
contained two blue and two red balls; (c) a biased coin is tossed repeatedly until a head turns up.
Exercise 1.4. Suppose X is a continuous random variable with distribution FX. Let g be a strictly increasing continuous function. Define Y = g(X). (a) What is FY, the distribution of Y? (b) What is fY, the density of Y?
Exercise 1.5. Suppose X is a continuous random variable with distribution FX. Find FY where Y is given by (a) X², (b) |X|, (c) sin X, (d) √(FX(X)).
Exercise 1.6. Suppose X is a continuous random variable defined on a probability space (Ω, F, P). Let f be the density of X under P and assume f > 0. Let g be the density function of a random variable. Define Z := g(X)/f(X). (a) Show that Z ≡ dP̃/dP defines a Radon-Nikodym derivative. (b) What is the density of X under P̃?
Exercise 1.7. Let X be uniformly distributed on [0, 1]. For what function g is the random variable g(X)
exponentially distributed with parameter 1 (i.e. g(X) ∼ E(1))?
Chapter 2

Information and conditioning
The notes from this chapter are taken primarily from (Shreve, 2004, Chapter 2).
Let us return to the infinite coin-toss example of Section 1.3.2. Now suppose we are given the value of ω1. What are the subsets of Ω for which we can say: “ω is in this
set” or “ω is not in this set”? The answer is the sets in F0 as well as AH and AT . Together, these sets
form the σ-algebra F1 = {∅, Ω, AH , AT }. We say the sets in F1 are resolved by the first coin toss.
Now suppose we are given the value of ω1 and ω2 . What are the subsets of Ω for which we can say: “ω is
in this set” or “ω is not in this set”? The answer is the sets in F2, given by

F2 = {∅, Ω, AH, AT, AHH, AHT, ATH, ATT, A^c_HH, A^c_HT, A^c_TH, A^c_TT, AHH ∪ ATH, AHH ∪ ATT, AHT ∪ ATH, AHT ∪ ATT}.
Continuing in this way, for each n ∈ N we can define Fn as the σ-algebra containing the sets that are
resolved by the first n coin tosses. Note that if a set A ∈ Fn then A ∈ Fn+1 . Thus, Fn ⊂ Fn+1 . In
other words, Fn+1 contains more “information” than Fn . This kind of structure is encapsulated in the
following definition.
Definition 2.1.1. Let Ω be a nonempty set. Let T be a fixed positive number, and assume that for each t ∈ [0, T] there is a σ-algebra Ft. Assume further that if 0 ≤ s ≤ t ≤ T, then Fs ⊆ Ft. Then we call the collection of σ-algebras F = (Ft)t∈[0,T] a filtration.
A discrete time filtration is a sequence of σ-algebras F = (Fn )n∈N0 that satisfies Fn ⊆ Fn+1 for all n.
Example 2.1.2. Let Ω = C0 [0, T], the set of continuous functions defined on [0, T], starting from zero.
We denote by ω = (ωt)t∈[0,T] an element of Ω. Let Ft be the σ-algebra generated by observing ω over
the interval [0, t ]. Mathematically, we write this as
Ft := σ(ωs , 0 ≤ s ≤ t ).
It should be obvious that the sequence of σ-algebras F = (Ft)t∈[0,T] forms a filtration. Below, we define two sets, one of which is in Ft, one of which is not (for t < T):

A := {ω : sup_{0≤s≤t} ωs ≤ 1} ∈ Ft, B := {ω : ωT ≤ 1} ∉ Ft.
The set A is an element of Ft because, given the path of ω over the interval [0, t ] one can answer the
question: is the maximum of ω over the interval [0, t ] less than 1? The set B is not an element of Ft
because one needs to know ωT in order to answer the question: is ωT ≤ 1?
Definition 2.1.3. Let X be a random variable defined on a nonempty sample space Ω. The σ-algebra
generated by X, denoted σ(X), is the collection of all subsets of Ω of the form {X ∈ A} where A ∈ B(R).
Example 2.1.4. Let us return to Example 1.4.2. What is σ(S2)? From the definition, we need to ask: which sets are of the form {S2 ∈ A}? Since S2 can only take three values, u², ud and d², we check the following sets:

{S2 = u²} = AHH, {S2 = ud} = AHT ∪ ATH, {S2 = d²} = ATT.

We add to these sets the sets that are necessary to form a σ-algebra (i.e., ∅, Ω and unions and complements of the above sets) to obtain σ(S2).
Definition 2.1.5. Let X be a random variable defined on a nonempty sample space Ω. Let G be a
σ-algebra of subsets of Ω. If σ(X) ⊂ G we say that X is G-measurable, and we write X ∈ G.
A random variable X is G-measurable if and only if the information in G is sufficient to determine the
value of X. Obviously, if X ∈ G then g(X) ∈ G (assuming g is a measurable map from (R, B(R)) to (R, B(R))).
Eventually, we will want to consider stochastic processes X = (Xt)t∈[0,T] and we will want to know, at each time t, if Xt is measurable with respect to the σ-algebra Ft.
Definition 2.1.6. Let Ω be a nonempty sample space equipped with a filtration F = (Ft )t ∈[0,T] . Let
X = (Xt )t ∈[0,T] be a collection of random variables indexed by t ∈ [0, T]. We say this collection of
random variables is F-adapted if Xt ∈ Ft for all t ∈ [0, T].
2.2 Independence
When X ∈ G this means that the information in G is sufficient to determine the value of X. On the other
extreme, if X is independent (a term we will define soon) of G, this means that the information in G tells
us nothing about the value of X.
Definition 2.2.1. Let (Ω, F, P) be a probability space. We say that two sets A and B in F are
independent, written A ⊥⊥ B, if P(A ∩ B) = P(A) · P(B).
Example 2.2.2. Let us return to the coin-toss example of Section 1.3.2. Consider two sets and their intersection:

{ω1 = H} = AH, {ω2 = H} = AHH ∪ ATH, {ω1 = H} ∩ {ω2 = H} = AHH.

Since the coin tosses are independent, we should have {ω1 = H} ⊥⊥ {ω2 = H}. Let us verify that these events are independent according to Definition 2.2.1. We have

P({ω1 = H} ∩ {ω2 = H}) = P(AHH) = p² = P(AH) · P(AHH ∪ ATH) = P({ω1 = H}) · P({ω2 = H}).
Example 2.2.3. Can a set be independent of itself? Surprisingly, the answer is “yes.” Suppose A ⊥⊥ A. Then, by the definition of independent sets, we have P(A ∩ A) = P(A) · P(A). We also have P(A ∩ A) = P(A). Combining these equations, we obtain P(A) · P(A) = P(A). This equation has two solutions: P(A) = 1 and P(A) = 0. Thus, a set is independent of itself if the probability of that set is zero or one.
Having defined independent sets, we can now extend to independent σ-algebras and random variables.
Definition 2.2.4. Let (Ω, F, P) be a probability space, and let G and H be sub-σ-algebras of F (i.e., G, H ⊆ F). We say these two σ-algebras are independent, written G ⊥⊥ H, if

P(A ∩ B) = P(A) · P(B), ∀ A ∈ G, ∀ B ∈ H.

Let X and Y be random variables on (Ω, F, P). We say these two random variables are independent, written X ⊥⊥ Y, if σ(X) ⊥⊥ σ(Y). Lastly, we say the random variable X is independent of the σ-algebra G, written X ⊥⊥ G, if σ(X) ⊥⊥ G.
Recall from Definition 2.1.3 that σ(X) contains all sets of the form {X ∈ A}, where A ∈ B(R). Combining this with Definition 2.2.4, we see that

X ⊥⊥ Y ⇔ P({X ∈ A} ∩ {Y ∈ B}) = P(X ∈ A) · P(Y ∈ B), ∀ A, B ∈ B(R). (2.1)

One useful consequence of independence is that

X ⊥⊥ Y ⇒ EXY = EX · EY.
The above notion of independence is called pairwise independence. If X ⊥⊥ Y and Y ⊥⊥ Z, this notion of independence does not imply X ⊥⊥ Z (for example, what if Z = X?). Thus, at times, we may need a stronger notion of independence.
Definition 2.2.5. Let X1, X2, . . . , Xn be a sequence of random variables on (Ω, F, P). We say the sequence of random variables is independent if the σ-algebras σ(X1), σ(X2), . . . , σ(Xn) are independent.
As with a pair of random variables, a sequence of random variables (Xi)i≥1 is independent if and only if

P(∩_{i=1}^n {Xi ∈ Ai}) = Π_{i=1}^n P(Xi ∈ Ai), ∀ A1 ∈ B(R), ∀ A2 ∈ B(R), . . . , ∀ An ∈ B(R),

for every n ∈ N.
We will often say that a sequence of random variables (Xi)i≥0 is independent and identically distributed (iid), by which we mean all Xi have the same distribution and (Xi)1≤i≤n are independent for every n ∈ N.
Example 2.2.6. Let us return to the coin-toss example of Section 1.3.2. Let us define a sequence of random variables (Xi)i≥1 via

X1(ω) = 1 if ω1 = H and 0 if ω1 = T, X2(ω) = 1 if ω2 = H and 0 if ω2 = T,

and, for i ≥ 3,

Xi(ω) = X1(ω) if i is odd, Xi(ω) = X2(ω) if i is even.

Clearly, since the coin tosses are independent, we have X1 ⊥⊥ X2 and Xi ⊥⊥ Xj if i is even and j is odd. But the sequence (Xi)1≤i≤n is not independent for any n ≥ 3 since Xi = Xi+2n for any i, n ∈ N.
It is not easy to verify if two random variables X and Y are independent using Expression (2.1), since
the equation must be verified for all Borel sets A, B ∈ B(R). In fact, there is an easier way to check
independence.
Definition 2.2.7. The joint distribution function FX,Y : R² → [0, 1] of two random variables X and Y defined on a probability space (Ω, F, P) is given by

FX,Y(x, y) := P(X ≤ x, Y ≤ y).
Again, we have two special cases for jointly discrete and jointly continuous random variables.
Definition 2.2.8. Two random variables X and Y are called jointly discrete if the pair (X, Y) takes values in some countable set A = {x1, x2, . . .} × {y1, y2, . . .} ⊂ R². We associate to a jointly discrete pair a joint probability mass function fX,Y : A → R, defined by fX,Y(xi, yj) := P(X = xi, Y = yj).
Definition 2.2.9. A pair of random variables X and Y is called jointly continuous if its joint distribution function FX,Y can be written as

FX,Y(x, y) = ∫_{−∞}^x ∫_{−∞}^y du dv fX,Y(u, v), (x, y) ∈ R²,

for some fX,Y : R² → [0, ∞) called the joint probability density function.
As in the one-dimensional case, it may help to think of the joint density function fX,Y as fX,Y (x , y)dx dy =
P(X ∈ dx , Y ∈ dy).
Note that for jointly continuous random variables X and Y we have fX,Y(x, y) = ∂x ∂y FX,Y(x, y).
If the pair (X, Y) is either jointly discrete or jointly continuous, it is easy to compute P((X, Y) ∈ A) for any A ∈ B(R²). We have

discrete: P((X, Y) ∈ A) = Σ_{i,j : (xi,yj)∈A} fX,Y(xi, yj),
continuous: P((X, Y) ∈ A) = ∫_A dx dy fX,Y(x, y).
To recover the marginal distribution FX from FX,Y, simply note that

FX(x) = P(X ≤ x, Y < ∞) = lim_{y→∞} FX,Y(x, y).

It follows that for the discrete and continuous cases, we have, respectively,
discrete: fX(xi) = Σj fX,Y(xi, yj),
continuous: fX(x) = ∫_R dy fX,Y(x, y).
The following theorem gives some easy-to-check conditions for independence.

Theorem 2.2.10. Let X and Y be random variables defined on a probability space (Ω, F, P). The following conditions are equivalent (that is, if one of them holds, all of them hold):
1. X ⊥⊥ Y.
2. FX,Y(x, y) = FX(x) FY(y) for every (x, y) ∈ R².
3. Discrete case: fX,Y(x, y) = fX(x) fY(y) for every (x, y) ∈ R².
Continuous case: fX,Y(x, y) = fX(x) fY(y) for ‘almost’ every (x, y) ∈ R².
4. E e^{iuX+ivY} = E e^{iuX} · E e^{ivY} for all (u, v) ∈ R².
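For jointly discrete random variables, condition 3 is easy to check numerically: store the joint pmf as a matrix, marginalize, and compare with the outer product of the marginals. A sketch (numpy assumed; the pmf tables are made up):

import numpy as np

# Condition 3 of Theorem 2.2.10 for jointly discrete (X, Y):
# independence iff joint[i, j] = f_X(x_i) * f_Y(y_j) for all i, j.
def is_independent(joint, tol=1e-12):
    fx = joint.sum(axis=1)    # marginal f_X by summing over y
    fy = joint.sum(axis=0)    # marginal f_Y by summing over x
    return np.allclose(joint, np.outer(fx, fy), atol=tol)

indep = np.outer([0.2, 0.8], [0.5, 0.5])   # product-form joint pmf
dep = np.array([[0.5, 0.0],                # X = Y with probability 1
                [0.0, 0.5]])
print(is_independent(indep))   # True
print(is_independent(dep))     # False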
Together with expectation, the most important statistical properties of a random variable (or pair) are
the variance and co-variance.
Definition 2.2.12. The co-variance of two random variables X and Y, written CoV[X, Y], is defined by

CoV[X, Y] := E[(X − EX)(Y − EY)] = EXY − EX · EY.

We say X and Y are uncorrelated if CoV[X, Y] = 0. Note that X ⊥⊥ Y implies X and Y are uncorrelated. However, the converse is not true.
2.3 Conditional expectation
Presumably, you have run across the following formula for the conditional probability of a set A given B:

P(A|B) = P(A ∩ B)/P(B), P(B) > 0.
When (X, Y) is either jointly discrete or jointly continuous, this readily leads to the conditional probability mass function or conditional density

fX|Y(x, y) := fX,Y(x, y)/fY(y), fY(y) > 0.
And from this, we can define E[X|Y = y], the conditional expectation of X given Y = y:

discrete: E[X|Y = yj] := Σi xi fX|Y(xi, yj),
continuous: E[X|Y = y] := ∫_R dx x fX|Y(x, y).
Note that E[X|Y = y] is simply a function of y – there is nothing random about it.
Unfortunately, there are cases for which the pair (X, Y) is neither jointly discrete nor jointly continuous. And, for these cases, we need a more general notion of conditional expectation. Here we will make two conceptual leaps: first, we will condition on a σ-algebra G rather than on an event such as {Y = y}; second, the conditional expectation E[X|G] will itself be a random variable. We will just hop in with our new definition of conditional expectation and then we will see, through an example, that this new definition makes sense.
Definition 2.3.1. Let (Ω, F, P) be a probability space, let G be a sub-σ-algebra of F, and let X be a
random variable that is either nonnegative or integrable. The conditional expectation of X given G,
denoted E[X|G], is any random variable that satisfies
1. Measurability: E[X|G] ∈ G.
2. Partial averaging: E[1A E[X|G]] = E[1A X] for all A ∈ G.
Alternatively, E[ZE[X|G]] = E[ZX] for all Z ∈ G.
When G = σ(Y) we shall often use the short-hand notation E[X|Y] := E[X|σ(Y)].
Admittedly, Definition 2.3.1 is rather abstract (and, for the purposes of computation, useless). In fact, it
is not at all clear from Definition 2.3.1 that E[X|G] even exists! It does exist, though we will not prove
this here.
Conditional expectation has an interesting L² interpretation. Consider a probability space (Ω, F, P). Let G be a sub-σ-algebra of F (i.e., G ⊂ F). Define

L²(Ω, F, P) := {X : σ(X) ⊆ F, EX² < ∞},

and likewise for L²(Ω, G, P). Clearly, since G ⊂ F we have L²(Ω, G, P) ⊂ L²(Ω, F, P). Next, define an inner product

⟨X, Y⟩ := E XY.

One can show that, with respect to this inner product, E[X|G] is the projection of X ∈ L²(Ω, F, P) onto the subspace L²(Ω, G, P).
When conditioning on the σ-algebra generated by a random variable Y, it is easiest to use the following formula:

E[X|σ(Y)] = g(Y), where g(y) := E[X|Y = y]. (2.2)

The following example should help to build some intuition for conditional expectation.
Example 2.3.2. Let Ω = {a, b, c, d, e, f}, F = 2^Ω and P(ω) = 1/6 for ω = a, b, . . . , f. Define two random variables X and Y on (Ω, F, P) as follows:

ω     | a  b  c  d  e  f
X(ω)  | 1  3  3  3  5  7
Y(ω)  | 2  2  1  1  7  7
Let G = σ(Y). Next, let us compute E[X|Y] using (2.2) and check that it agrees with Definition 2.3.1. We have

ω          | a  b  c  d  e  f
Y(ω)       | 2  2  1  1  7  7
E[X|Y](ω)  | 2  2  3  3  6  6

Observe that E[X|Y] is constant on each set on which Y is constant.
So, yes, E[X|Y] ∈ σ(Y). Another way to think of measurability is to simply ask: given the value of Y,
can one determine the value of E[X|Y]? Clearly, in this case the answer is “yes.” Next, let us check the
partial averaging property: does E[1A E[X|Y]] = E[1A X] for all A ∈ σ(Y)? Rather than check this for every A ∈ σ(Y), let us just check that it holds for the sets {a, b}, {c, d} and {e, f}. We have
A = {a, b}, E[1A E[X|Y]] = P(a)2 + P(b)2 = 4/6, E[1A X] = P(a)1 + P(b)3 = 4/6,
A = {c, d}, E[1A E[X|Y]] = P(c)3 + P(d)3 = 6/6, E[1A X] = P(c)3 + P(d)3 = 6/6,
A = {e, f }, E[1A E[X|Y]] = P(e)6 + P(f )6 = 12/6, E[1A X] = P(e)5 + P(f )7 = 12/6.
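Example 2.3.2 can be replayed mechanically in a few lines of Python (ours; the dictionaries mirror the tables above): average X over each level set of Y, then verify partial averaging on the generating sets of σ(Y).

# Reproduce Example 2.3.2: E[X|Y] averages X over each set {Y = y}.
Omega = ["a", "b", "c", "d", "e", "f"]
P = {w: 1 / 6 for w in Omega}
X = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 5, "f": 7}
Y = {"a": 2, "b": 2, "c": 1, "d": 1, "e": 7, "f": 7}

def cond_exp(X, Y, P):
    out = {}
    for w in P:
        level = [v for v in P if Y[v] == Y[w]]   # the set {Y = Y(w)}
        out[w] = sum(P[v] * X[v] for v in level) / sum(P[v] for v in level)
    return out

EXY = cond_exp(X, Y, P)
print(EXY)   # {a: 2, b: 2, c: 3, d: 3, e: 6, f: 6}

# Partial averaging E[1_A E[X|Y]] = E[1_A X] on the generators of sigma(Y):
for A in [{"a", "b"}, {"c", "d"}, {"e", "f"}]:
    lhs = sum(P[w] * EXY[w] for w in A)
    rhs = sum(P[w] * X[w] for w in A)
    print(sorted(A), lhs, rhs)   # the two partial averages agree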
The following properties are arguably more important to remember than the definition of conditional
expectation. Memorize them!
Theorem 2.3.3. Let (Ω, F, P) be a probability space and let G be a sub-σ-algebra of F. Conditional expectations satisfy the following properties.

1. (Linearity) E[aX + bY|G] = a E[X|G] + b E[Y|G] for constants a and b.
2. (Taking out what is known) If X ∈ G, then E[XY|G] = X E[Y|G].
3. (Iterated conditioning) If H ⊆ G, then E[E[X|G]|H] = E[X|H]. In particular, E[E[X|G]] = EX.
4. (Independence) If X ⊥⊥ G, then E[X|G] = EX.
Theorem 2.3.3 can be proved directly from Definition 2.3.1, though we will not do so here. In addition to
the above properties, the following Theorem, which we state without proof is often useful:
Theorem 2.3.4 (Jensen's inequality). Let X be a random variable defined on (Ω, F, P) and let G be a sub-σ-algebra of F. Suppose φ : R → R is a convex function. Then we have

E[φ(X)|G] ≥ φ(E[X|G]). (2.3)

In order to keep straight which direction the inequality in (2.3) goes, it is helpful to remember that φ(x) = x² is a convex function and the conditional variance of a random variable satisfies

V[X|G] := E[X²|G] − (E[X|G])² ≥ 0.
Now that we have defined conditional expectation and established some of its key properties, we can
define “Markov process” and “martingale” – two seemingly similar, but very distinct concepts.
Definition 2.3.5. Let (Ω, F, P) be a probability space, let T be a fixed positive number, and let F = (Ft)t∈[0,T] be a filtration of sub-σ-algebras of F. Consider an F-adapted stochastic process M = (Mt)t∈[0,T] satisfying E|Mt| < ∞ for all t. We say that M is

a martingale if E[Mt|Fs] = Ms for all 0 ≤ s ≤ t ≤ T;
a sub-martingale if E[Mt|Fs] ≥ Ms for all 0 ≤ s ≤ t ≤ T;
a super-martingale if E[Mt|Fs] ≤ Ms for all 0 ≤ s ≤ t ≤ T.
We have given above the definition of a continuous time martingale (resp. sub-, super-). We can also
define discrete-time martingales by making the obvious modifications.
Admittedly, the definition of sub- and super-martingales seems backwards; sub-martingales tend to rise in expectation, whereas super-martingales tend to fall in expectation.
Note: when we say that a process M is a martingale (or sub- or super-martingale), this is with respect to a fixed probability measure and filtration. If P and P̃ are two probability measures and F and G are two filtrations, it is entirely possible that a process M may be a martingale with respect to (P, F) and may not be a martingale with respect to (P̃, F), (P, G) or (P̃, G).
Definition 2.3.6. Let (Ω, F, P) be a probability space, let T be a fixed positive number, and let F = (Ft)t∈[0,T] be a filtration of sub-σ-algebras of F. Consider an F-adapted stochastic process X = (Xt)t∈[0,T]. Assume that for all 0 ≤ s ≤ t ≤ T and for every nonnegative, Borel-measurable function f, there is another Borel-measurable function g (which depends on s, t, and f) such that

E[f(Xt)|Fs] = g(Xs).

Then we say that X is a Markov process. Identifying g(Xs) ≡ E[f(Xt)|Xs], we can write the Markov property as follows:

E[f(Xt)|Fs] = E[f(Xt)|Xs]. (2.4)

A Markov process is a process for which the following holds: given the present (i.e., Xs), the future (i.e., Xt, t ≥ s) is independent of the past (i.e., Fs). What this means in practice is that

P(Xt ∈ A|Fs) = P(Xt ∈ A|Xs), ∀ A ∈ B(R), 0 ≤ s ≤ t ≤ T.
If Xt is a discrete or continuous random variable for every t then we have a transition kernel, written as
P in the discrete case and Γ in the continuous case.
If you can write the transition kernel of a process explicitly, then you have essentially proved that the
process is Markov.
Note that any process that has independent increments is Markov since, if Xt − Xs ⊥⊥ Fs for t ≥ s, then

E[f(Xt)|Fs] = E[f(Xs + (Xt − Xs))|Fs] = g(Xs), where g(x) := E f(x + Xt − Xs).
Markov processes and Martingales are entirely separate concepts. A process X can be both a martingale
and a Markov process, it can be a martingale but not a Markov process, it can be a Markov process but
not a martingale, and it can be neither a Markov process nor a martingale. We illustrate the difference
with an example.
Example 2.3.7. Let us return to the stock price Example 1.4.2. Let us show that S = (Sn)n≥0 is a Markov process. Recall that Fm is the σ-algebra generated by observing ω1, ω2, . . . , ωm. Observe that Sm ∈ Fm. Next, note that

P(Sn+m = Sm u^k d^{n−k} | Sm) = (n choose k) p^k q^{n−k}.
Since we have written the transition kernel explicitly, we have established that S is Markov. Let us also find the function g in Definition 2.3.6. For any f : R → R we have

E[f(Sn+m)|Fm] = Σ_{k=0}^n f(Sm u^k d^{n−k}) · (n choose k) p^k q^{n−k} =: g(Sm).
2.4 Stopping times

Definition 2.4.1. Fix a probability space (Ω, F, P) and a filtration F = (Ft)t≥0. A random time τ : Ω → [0, ∞] is called an F-stopping time if it satisfies

{τ ≤ t} ∈ Ft, ∀ t ≥ 0. (2.5)
Above, we have focused on the continuous-time setting. If we are working with a discrete-time filtration
F = (Fn )n∈N0 , then a stopping time is a random time τ : Ω → N0 ∪ {∞} that satisfies
{τ ≤ n} ∈ Fn , ∀ n ∈ N0 .
Observe that stopping times, like martingales, are defined with respect to a specific filtration F. Also
note that a stopping time may be infinite. The meaning of a stopping time should be fairly clear from
(2.5). If τ is an F-stopping time, then for any t ≥ 0 we should be able to say whether or not τ has
occurred given the information in Ft . Below, we give two examples of random times, one of which is a
stopping time, one of which is not.
Example 2.4.2. Let X = (Xt)t≥0 be a continuous-time stochastic process on (Ω, F, P) and let F = (Ft)t≥0 be the filtration generated by observing the path of X, that is, Ft = σ(Xs, 0 ≤ s ≤ t). Define

τ := inf{t ≥ 0 : Xt = a}, ρ := sup{t ≥ 0 : Xt = a}.

The first random time τ is clearly an F-stopping time. At any time t ≥ 0, using the information in Ft, we can clearly answer the question: has X hit a? On the other hand, the second random time ρ is not a stopping time. At time t we will not be able to answer the question: when is the last time that X will hit a?
Theorem 2.4.3. Let M = (Mt)t∈[0,T] be a martingale with respect to a filtration F = (Ft)t∈[0,T], and let τ be an F-stopping time. Then the stopped process Mτ := (Mt∧τ)t∈[0,T] is a martingale; that is,

E[MT∧τ |Ft] = Mt∧τ.
We will not prove Theorem 2.4.3 rigorously. However, we will provide a bit of intuition for why the theorem is true. For any t < τ we have Mt∧τ = Mt and M is a martingale. Likewise, for any t ≥ τ we have Mt∧τ = Mτ, which is a constant (and thus trivially a martingale). As the process Mτ is a martingale both prior to and after τ, it is reasonable to expect that Mτ is in fact a martingale.
Just as we can construct a σ-algebra Ft of information up to a fixed time t , we can construct a σ-algebra
Fτ of information up to a stopping time τ .
Definition 2.4.4. Fix a probability space (Ω, F, P) and a filtration F = (Ft )t ≥0 . Suppose τ is an
F-stopping time. Then we define the σ-algebra Fτ at the stopping time τ as follows
Fτ := {A ∈ F∞ : A ∩ {τ ≤ t } ∈ Ft ∀ t ≥ 0}.
The idea underlying Definition 2.4.4 is that if a set A is observable at time τ, then for any time t, its restriction to the set {τ ≤ t} should be in Ft. The restriction to sets A ∈ F∞ takes account of the possibility that the stopping time can be infinite and ensures that A ∩ {τ ≤ ∞} ∈ F∞. If A ∉ F∞ then we could have A ∩ {τ ≤ ∞} ∉ F∞. From the above definition, a random variable X is Fτ-measurable if and only if 1{τ≤t} X ∈ Ft for all t ∈ [0, ∞].
Now, suppose a process M = (Mt )t ≥0 is a martingale with respect to a filtration F = (Ft )t ≥0 . Suppose
further that M has a well-defined limit M∞ := limt →∞ Mt . As M is a martingale we have (by definition)
that Mt1 = E[Mt2 |Ft1 ] for any 0 ≤ t1 ≤ t2 ≤ ∞. Now, consider two F-stopping times τ1 and τ2 satisfying
0 ≤ τ1 ≤ τ2 ≤ ∞. One may ask: is it true that Mτ1 = E[Mτ2 |Fτ1 ]? Unfortunately, the answer to this
question is “no, not in general.” However, under certain conditions, provided in the following theorem,
we will have Mτ1 = E[Mτ2 |Fτ1 ].
Theorem 2.4.5 (Doob's Optional Stopping). Let M = (Mt)t≥0 be a martingale with respect to a filtration F = (Ft)t≥0. Suppose that M is uniformly integrable: for any ε > 0 there exists a constant Kε ∈ [0, ∞) such that

sup_{t≥0} E[|Mt| 1{|Mt|>Kε}] ≤ ε. (2.6)

Suppose that τ1 and τ2 are F-stopping times and that 0 ≤ τ1 ≤ τ2. Then we have

Mτ1 = E[Mτ2 |Fτ1]. (2.7)
We will not prove Theorem 2.4.5. Rather, we will demonstrate its usefulness through an example.
Example 2.4.6. Let S = (Sn)n∈N0 be a simple symmetric random walk started from x ∈ Z:

S0 = x, Sn = x + Σ_{i=1}^n Xi,

where the (Xi) are i.i.d. random variables with P(Xi = 1) = P(Xi = −1) = 1/2. Let F = (Fn)n∈N0 where Fn := σ(Si, 0 ≤ i ≤ n). For any a ∈ Z define the first hitting time to a as follows:

τa := inf{n ∈ N0 : Sn = a}.
Now, suppose a < x < b. We wish to find P(τa < τb). To this end, we define

τ := τa ∧ τb.

Observe that τ is an F-stopping time and S is an F-martingale. The stopped process Sτ := (Sn∧τ)n∈N0 is a martingale by Theorem 2.4.3. Moreover, because a ≤ Sτn ≤ b, we see that Sτ satisfies (2.6). Thus, we have by Theorem 2.4.5 that

x = S0 = E Sτ = a P(τa < τb) + b P(τb < τa),

where we have used the fact that Sτ = a when τa < τb and Sτ = b when τb < τa. We also have

P(τa < τb) + P(τb < τa) = 1.

Solving these two equations yields P(τa < τb) = (b − x)/(b − a).
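The answer P(τa < τb) = (b − x)/(b − a) is easy to confirm by simulation. A sketch (numpy assumed; the boundary values are arbitrary choices of ours):

import numpy as np

# Simulate the walk of Example 2.4.6 until it hits a or b and estimate
# P(tau_a < tau_b); optional stopping predicts (b - x)/(b - a).
a, x, b, trials = -3, 1, 5, 20_000
rng = np.random.default_rng(3)

hits_a = 0
for _ in range(trials):
    s = x
    while a < s < b:
        s += 1 if rng.random() < 0.5 else -1
    hits_a += (s == a)

print(hits_a / trials)      # empirical estimate
print((b - x) / (b - a))    # optional-stopping value: 0.5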
The condition given in (2.6) is essential in order for (2.7) to hold, as we shall see in the next example.
Example 2.4.7. Let B = (Bi)i∈N be an iid sequence of Bernoulli random variables: Bi ∼ Ber(p) with p = 1/2. Construct a sequence X = (Xi)i∈N0 of random variables as follows:

X0 = 1, Xn+1 = 2 Bn+1 Xn.
Define a filtration F = (Fn)n∈N0 where Fn = σ(Bi, 0 ≤ i ≤ n). Observe that both B and X are F-adapted. It is easy to see that X is a martingale with respect to F because

E[Xn+1|Fn] = E[2 Bn+1 Xn|Fn] = 2 Xn E Bn+1 = Xn,

where we have used Bn+1 ⊥⊥ Fn and E Bn+1 = 1/2.
Note, however, that condition (2.6) does not hold. Now, let us define the first hitting time of X to zero: τ := inf{n ≥ 0 : Xn = 0}. Then we have

EXτ = E 0 = 0 ≠ X0.
Thus, we see that, without condition (2.6), we cannot expect (2.7) to hold.
2.5 Exercises
Exercise 2.1. Let Ω = {a, b, c, d} and let F = 2Ω (the set of all subsets of Ω). We define a probability
measure P as follows
and Z = X + Y. (a) List the sets in σ(X). (b) What are the values of E[Y|X] for {a, b, c, d}? Verify the
partial averaging property: E[1A E[Y|X]] = E[1A Y] for all A ∈ σ(X). (c) What are the values of E[Z|X]
for {a, b, c, d}? Verify the partial averaging property.
Exercise 2.2. Fix a probability space (Ω, F, P). Let Y be a square integrable random variable: EY2 < ∞
and let G be a sub-σ-algebra of F. Show that
Exercise 2.3. Give an example of a probability space (Ω, F, P), a random variable X and a function
f such that σ(f (X)) is strictly smaller than σ(X) but σ(f (X)) 6= {∅, Ω}. Give a function g such that
σ(g(X)) = {∅, Ω}.
Exercise 2.4. On a probability space (Ω, F, P) define random variables X and Y0 , Y1 , Y2 , . . . and suppose
E|X| < ∞. Define Fn := σ(Y0 , Y1 , . . . , Yn ) and Xn = E[X|Fn ]. Show that the sequence X0 , X1 , X2 , . . .
is a martingale under P with respect to the filtration (Fn )n≥0 .
Exercise 2.5. Let X1, X2, . . . be i.i.d. Bernoulli random variables with parameter p (i.e., P(Xi = 1) = p). Define Sn = Σ_{i=1}^n Xi with S0 = 0. Define

Zn := ((1 − p)/p)^{2Sn − n}, n = 0, 1, 2, . . . .

Show that Z = (Zn)n≥0 is a martingale.
Chapter 3

Generating and characteristic functions

The notes from this chapter are taken primarily from (Grimmett and Stirzaker, 2001, Chapter 5).

3.1 Generating functions
Definition 3.1.1. Suppose X is a discrete random variable taking values in {0} ∪ N. We define the probability generating function of X, written GX(s), by

GX(s) := E s^X = Σ_{k≥0} s^k fX(k).
Why is GX called the “probability generating function”? The reason is that the coefficient fX(k) of the s^k term in the series expansion of GX(s) is precisely P(X = k). So, one can expand GX as a power series and obtain the probability mass function fX.
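Expanding a generating function as a power series to read off the pmf can be done symbolically. A sketch using sympy (assumed available), applied to the pgf GX(s) = e^{λ(s−1)} of a Poisson random variable:

import sympy as sp

# Read off the pmf of X ~ Poi(lambda) from its pgf by series expansion:
# the coefficient of s^k in G_X(s) is P(X = k) = e^{-lam} lam^k / k!.
s = sp.symbols("s")
lam = sp.Rational(3, 2)
G = sp.exp(lam * (s - 1))

series = sp.series(G, s, 0, 6).removeO()
for k in range(6):
    coeff = sp.simplify(series.coeff(s, k))
    exact = sp.exp(-lam) * lam**k / sp.factorial(k)
    print(k, coeff, sp.simplify(coeff - exact) == 0)   # True for each k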
Example 3.1.2. If X ∼ Ber(p) with q := 1 − p, then

GX(s) = q s^0 + p s^1 = q + ps.
Theorem 3.1.3. Let GX^(n) denote the n-th derivative of GX. Then

GX^(n)(1) = E [X!/(X − n)!]. (3.1)
Proof. We have

GX^(n)(s) = (d^n/ds^n) GX(s) = E (d^n/ds^n) s^X = E [X!/(X − n)!] s^{X−n}. (3.2)

Taking s = 1 in (3.2) yields (3.1).
Given two independent random variables X and Y taking values in {0} ∪ N, one can compute P(X + Y = n) using

P(X + Y = n) = Σ_{k=0}^n P(X = k) P(Y = n − k).
Alternatively, one can compute the probability generating function of X + Y and then expand the
generating function to obtain the probability P(X + Y = n).
Theorem 3.1.5. Let X and Y be independent random variables taking values in {0} ∪ N. Then GX+Y(s) = GX(s) GY(s).

Proof. We have GX+Y(s) = E s^{X+Y} = E s^X s^Y = E s^X · E s^Y = GX(s) GY(s), where the third equality uses X ⊥⊥ Y.
The real use of generating functions arises when one wants to compute P(Σ_{i=1}^n Xi = k) where (Xi)i≥1 is an iid sequence of random variables. In this case, computing the generating function GSn(s) with Sn = Σ_{i=1}^n Xi is relatively easy. Upon computing the generating function GSn, one can compute probabilities of the form P(Sn = k) by expanding GSn. By contrast, computing P(Sn = k) directly from the probability mass function of Xi is difficult.
Theorem 3.1.7. Suppose (Xi)i≥1 are iid with common distribution X. Define Sn := Σ_{i=1}^n Xi. Then

GSn(s) = (GX(s))^n.

Proof. We have GSn(s) = E s^{X1+···+Xn} = Π_{i=1}^n E s^{Xi} = (GX(s))^n, by independence.
Example 3.1.8. Let (Xi)i≥1 be iid with each Xi ∼ Ber(p). Define Sn := Σ_{i=1}^n Xi. Recall from Example 1.4.9 that Sn ∼ Bin(n, p). We have

GSn(s) = (GX(s))^n = (q + ps)^n = Σ_{k=0}^n (n choose k) p^k q^{n−k} s^k.

Thus, we have obtained the probability generating function of the binomial random variable Sn. Although we could have computed GSn using the probability mass function fSn directly, this approach would have been much more work.
Theorem 3.1.9. Suppose (Xi)i≥1 are iid with common distribution X. Define Sn := Σ_{i=1}^n Xi. Let N be a random variable taking values in {0} ∪ N, independent of (Xi)i≥1. Then

GSN(s) = GN(GX(s)).

Proof. We have GSN(s) = E s^{SN} = Σ_{n≥0} P(N = n) E s^{Sn} = Σ_{n≥0} P(N = n) (GX(s))^n = GN(GX(s)).
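Theorem 3.1.9 has a classical consequence known as Poisson thinning: if N ∼ Poi(λ) and the Xi are iid Ber(p), then GSN(s) = GN(GX(s)) = e^{λp(s−1)}, so SN ∼ Poi(λp). A simulation sketch (numpy assumed; the parameters are arbitrary choices of ours):

import numpy as np
from math import factorial

# Check by simulation that S_N ~ Poi(lam * p) when N ~ Poi(lam) and
# the summands are iid Ber(p), as predicted by Theorem 3.1.9.
lam, p, nsim = 4.0, 0.3, 200_000
rng = np.random.default_rng(4)

N = rng.poisson(lam, nsim)
S = rng.binomial(N, p)         # sum of N iid Ber(p) random variables

for k in range(6):
    empirical = (S == k).mean()
    exact = np.exp(-lam * p) * (lam * p) ** k / factorial(k)
    print(k, round(empirical, 4), round(exact, 4))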
One can extend the notion of a probability generating function to multiple random variables in the
obvious way.
Definition 3.1.10. Suppose X and Y are discrete random variables taking values in {0} ∪ N. We define the joint probability generating function of (X, Y), written GX,Y(s, t), as

GX,Y(s, t) := E s^X t^Y.

The coefficient fX,Y(k, m) of the s^k t^m term in the power series expansion of GX,Y(s, t) about (0, 0) gives P(X = k, Y = m).
Theorem 3.1.11. Let GX,Y^(n,m) denote the (n, m)-th partial derivative of GX,Y with respect to the first and second arguments. Then

GX,Y^(n,m)(1, 1) = E [(X!/(X − n)!) · (Y!/(Y − m)!)].
3.2 Branching processes

Consider a population in which each individual independently produces a random number of offspring. Let Zn denote the size of the n-th generation, set Z0 = 1, and let Xn,i denote the number of offspring of the i-th member of the n-th generation, so that

Zn+1 = Σ_{i=1}^{Zn} Xn,i. (3.4)

We would like to know what the probability mass function, mean and variance of Zn are. Our route for obtaining these will be through the generating function GZn.
Theorem 3.2.1. Assume Zn+1 is given by (3.4) and that the collection of random variables (Xn,i) are iid with common distribution X. Define Gn(s) := E s^{Zn} and G(s) := E s^X. Then

Gn+m(s) = Gn(Gm(s)) = Gm(Gn(s)), and thus Gn(s) = G(G(. . . G(s) . . .)) (n-fold iteration).
Proof. Let Ym,i denote the number of offspring in the (m + n)-th generation that descend from the i-th member of the m-th generation. Clearly, the (Ym,i) are iid and Ym,i ∼ Zn. The number of members of the (m + n)-th generation is given by

Zm+n = Σ_{i=1}^{Zm} Ym,i.

We have a random sum of iid random variables. By Theorem 3.1.9 it follows that Gm+n(s) = Gm(Gn(s)). Obviously, we can interchange m ↔ n and obtain Gm+n(s) = Gn(Gm(s)). Finally, we have

Gn(s) = Gn−1(G(s)) = Gn−2(G(G(s))) = . . . = G(G(. . . G(s) . . .)),

as claimed.
In principle, one can obtain the probability mass function fn of the n-th generation Zn by expanding the generating function Gn(s) as a power series about s = 0. In practice, this may be difficult to carry out. However, moments of Zn can usually be computed with relative ease.
Theorem 3.2.2. Let µ := EZ1 and σ² := VZ1. Then EZn = µ^n and

VZn = nσ² if µ = 1, VZn = σ² µ^{n−1} (µ^n − 1)/(µ − 1) if µ ≠ 1. (3.5)

Proof. Differentiating Gn(s) = Gn−1(G(s)) and setting s = 1, we obtain

EZn = Gn′(1) = Gn−1′(G(1)) · G′(1) = EZn−1 · µ,

where we have used G(1) = E 1^{Z1} = 1. Iterating, we obtain EZn = µ^n. Next, from (3.3), we compute
EZn² − EZn = (d²/ds²) Gn(s)|s=1 = (d/ds) [Gn−1′(G(s)) · G′(s)]|s=1
= [Gn−1″(G(s)) · (G′(s))² + Gn−1′(G(s)) · G″(s)]|s=1
= Gn−1″(1) µ² + Gn−1′(1) G″(1). (3.6)

Thus, using (3.6), as well as EZk = µ^k and EZk² = VZk + (µ^k)², we obtain

VZn = µ² VZn−1 + σ² µ^{n−1}.
Thus, we have obtained an expression for VZn in terms of VZn–1 , µ and σ 2 . Solving this explicitly
yields (3.5).
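The moment formulas are easy to check by simulation. A sketch (numpy assumed) that takes the offspring distribution to be Poi(µ), a choice of ours that makes the recursion (3.4) vectorize, since a sum of Zn iid Poi(µ) variables is Poi(µZn):

import numpy as np

# Simulate many replications of a branching process with Poi(mu)
# offspring and compare the empirical mean of Z_n with mu^n.
mu, n_gens, nsim = 1.3, 8, 20_000
rng = np.random.default_rng(5)

Z = np.ones(nsim, dtype=np.int64)   # Z_0 = 1 in every replication
for _ in range(n_gens):
    Z = rng.poisson(mu * Z)         # Z_{n+1} = sum of Z_n iid Poi(mu) draws

print(Z.mean(), mu**n_gens)         # empirical vs exact EZ_n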
3.3 Characteristic functions

Definition 3.3.1. The characteristic function of a random variable X, written φX, is defined as

φX(t) := E e^{itX}, t ∈ R.

The characteristic function (unlike the moment generating function) always exists since E|e^{itX}| = 1. Clearly, if GX and φX both exist, then we have

φX(t) = GX(e^{it}).

Theorem 3.3.2. Let X and Y be independent random variables and let a, b ∈ R. Then

φX+Y(t) = φX(t) φY(t), φaX+b(t) = e^{itb} φX(at).
Example 3.3.4. Let X ∼ N(µ, σ²). Then

φX(t) = ∫_{−∞}^∞ dx e^{itx} (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)) = exp(iµt − σ²t²/2).
Characteristic functions have several uses. First, they can be used to capture the moments of a random
variable when they exist.
Theorem 3.3.5. Suppose E|X|^n < ∞. Then φX^(n)(0) = i^n EX^n.

Proof. We have

(d^n/dt^n) φX(t) = E (d^n/dt^n) e^{itX} = i^n E X^n e^{itX}.

Now set t = 0 to complete the proof.
The characteristic function uniquely determines the distribution of a random variable. In other words,
there is a one-to-one correspondence between FX and φX . We show this for a continuous random variable.
Theorem 3.3.6 (Inversion). Suppose X is a continuous random variable with density fX. Then

fX(x) = (1/2π) ∫_R dt e^{−itx} φX(t),

for all x where fX is differentiable. To obtain FX from fX, simply use FX(x) = ∫_{−∞}^x dy fX(y).
Proof. The proof of Theorem 3.3.6 follows from standard Fourier results. We have

fX(x) = (1/2π) ∫_R dt e^{−itx} ∫_R dy e^{ity} fX(y) = (1/2π) ∫_R dt e^{−itx} E e^{itX} = (1/2π) ∫_R dt e^{−itx} φX(t).
An inversion theorem for random variables that are not continuous also exists, though it is not particularly useful for the purposes of computation.
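For continuous random variables, the inversion formula can be evaluated numerically by truncating and discretizing the integral. A sketch (numpy assumed; the truncation level and grid are ours) that recovers the N(0, 1) density from its characteristic function:

import numpy as np

# Numerically invert a characteristic function via Theorem 3.3.6:
# f_X(x) ~ (1/2pi) * Riemann sum of exp(-i t x) phi_X(t) over [-T, T].
def density_from_cf(phi, x, T=40.0, m=200_001):
    t = np.linspace(-T, T, m)
    vals = np.exp(-1j * t * x) * phi(t)
    return (vals.sum() * (t[1] - t[0])).real / (2 * np.pi)

phi_normal = lambda t: np.exp(-0.5 * t**2)   # phi of N(0, 1)
for x in (0.0, 1.0, 2.0):
    approx = density_from_cf(phi_normal, x)
    exact = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
    print(x, round(approx, 6), round(exact, 6))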
Perhaps the most important property of characteristic functions is that they can be used to prove the convergence of a sequence of random variables to a limiting distribution.
Definition 3.3.7. We say that a sequence of distribution functions (Fn )n≥1 converges to a distribution
F, written Fn → F, if limn→∞ Fn (x ) = F(x ) at all points x where F is continuous.
Theorem 3.3.8 (Continuity Theorem). Let (Fn)n≥1 be a sequence of distribution functions with corresponding characteristic functions (φn)n≥1.

1. If Fn → F for some distribution F with characteristic function φ, then φn(t) → φ(t) for all t.
2. Conversely, if φ(t) := lim_{n→∞} φn(t) exists and is continuous at t = 0, then φ is the characteristic function of some distribution F, and Fn → F.

Item 2 in Theorem 3.3.8 is particularly powerful. If Fn and φn are, respectively, the distribution and characteristic function of a sum of n independent random variables, it is often easier to compute φn than Fn. If we can compute φn and find its limit, then we can obtain F.
Example 3.3.9. For each n, let Xn ∼ Geo(λ/n) and define Yn := Xn/n, where λ > 0 is a fixed constant. For n large enough, λ/n < 1. Note that EYn = (1/n) EXn = (1/n)(n/λ) = 1/λ for all n. We would like to know what the limiting distribution of the sequence (Yn) is; we will use the Continuity Theorem 3.3.8 to do this. We have

φYn(t) = E e^{itXn/n} = Σ_{k≥1} e^{itk/n} (λ/n)(1 − λ/n)^{k−1} = (λ/n) e^{it/n} / (1 − (1 − λ/n) e^{it/n}) → λ/(λ − it), n → ∞,

which is the characteristic function of an E(λ) random variable. Thus, FYn → FY where Y ∼ E(λ).
Definition 3.4.1. We say that a sequence of random variables (Xn)n≥1 converges in distribution to a random variable X, written Xn →D X, if FXn → FX.
Theorem 3.4.2 (Law of Large Numbers). Let (Xn)n≥1 be a sequence of iid random variables with EXn = µ. Define a sequence of random variables (Sn)n≥1 by Sn := (1/n) Σ_{i=1}^n Xi. Then Sn →D µ.

Proof. We have

φSn(t) = (φX(t/n))^n = (1 + iµ(t/n) + O((t/n)²))^n → e^{iµt}, n → ∞.

As e^{iµt} is the characteristic function of the constant random variable µ, the claim follows from the Continuity Theorem 3.3.8.
In the proof of Theorem 3.4.2 we actually assumed that EX2n < ∞ when we wrote the O((t /n)2 ). In
fact, the Law of Large Numbers (LLN) holds even when EX2n = ∞.
Theorem 3.4.3 (Central Limit Theorem). Let (Xn)n≥1 be a sequence of iid random variables with EXn = µ and VXn = σ². Define two sequences of random variables (Sn)n≥1 and (Un)n≥1 by

Sn = Σ_{i=1}^n Xi, Un = (Sn − nµ)/√(nσ²).

Then Un →D Z where Z ∼ N(0, 1).
Proof. Define Yi := (Xi − µ)/σ, so that Un = (1/√n) Σ_{i=1}^n Yi. Note that EYn = 0 and VYn = 1. Next, we compute the characteristic function of Un. From Theorem 3.3.2 we have

φUn(t) = (φY(t/√n))^n = (1 − t²/(2n) + O((t/√n)³))^n → exp(−t²/2), n → ∞.

From Example 3.3.4, we see that exp(−t²/2) = φZ(t) where Z ∼ N(0, 1). Thus, we have shown that φUn → φZ. By Theorem 3.3.8 we have FUn → FZ and thus, from Definition 3.4.1, we have Un →D Z, as claimed.
In the proof of Theorem 3.4.3 we assumed E|Xi|³ < ∞ when we wrote the O((t/√n)³) term. The Central Limit Theorem (CLT) holds even when E|Xi|³ = ∞.
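A quick simulation makes the CLT visible: standardized sums of iid uniforms match the standard normal closely even for moderate n. A sketch (numpy assumed; the choice of n and of the uniform distribution is ours):

import numpy as np
from math import erf, sqrt

# Standardize sums of iid U([0,1]) variables (mu = 1/2, sigma^2 = 1/12)
# and compare empirical probabilities with the N(0,1) CDF.
n, nsim = 50, 200_000
rng = np.random.default_rng(6)

S = rng.random((nsim, n)).sum(axis=1)
U = (S - n * 0.5) / np.sqrt(n / 12.0)   # U_n = (S_n - n mu)/sqrt(n sigma^2)

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
for z in (-1.0, 0.0, 1.0, 2.0):
    print(z, round((U <= z).mean(), 4), round(Phi(z), 4))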
3.5 Large deviations principle

Define the moment generating function MX and the cumulant generating function ΛX of a random variable X by

MX(t) := E e^{tX}, ΛX(t) := log MX(t). (3.7)

Note that

ΛX′(0) = MX′(0)/MX(0) = (E X e^{tX} / E e^{tX})|_{t=0} = EX = µ.

We also note that ΛX is convex because

ΛX″(t) = (MX(t) MX″(t) − (MX′(t))²)/MX²(t) = (E e^{tX} E X² e^{tX} − (E X e^{tX})²)/MX²(t) ≥ 0, (3.8)

by the Cauchy-Schwarz inequality. Finally, define the Legendre transform ΛX∗ of ΛX by

ΛX∗(a) := sup_{t∈R} {at − ΛX(t)}. (3.9)
Theorem 3.5.1 (Large Deviations Principle). Let (Xi) be a sequence of iid random variables with common distribution X. Define µ := EX and suppose the moment generating function MX(t) := E e^{tX} is finite in some neighborhood of t = 0. Let ΛX and ΛX∗ be given by (3.7) and (3.9), respectively. Suppose a > µ and P(X > a) > 0. Then ΛX∗(a) > 0 and

lim_{n→∞} (1/n) log P(Sn > na) = −ΛX∗(a), Sn = Σ_{i=1}^n Xi. (3.10)
Theorem 3.5.1 asserts that, under appropriate conditions, we have P(Sn > na) ∼ exp(–nΛ∗X (a)).
Although the theorem appears to deal only with deviations of Sn in excess of its mean, the corresponding
result for deviations of Sn below the mean can be obtained by considering the sequence of iid random
variables (–Xi ).
Proof of Theorem 3.5.1. Without loss of generality, we may assume that µ = 0 (if µ ≠ 0, we can define Yi = Xi − µ and translate the result accordingly). We begin by proving that ΛX∗(a) > 0. We have

at − ΛX(t) = log (e^{at}/MX(t)) = log ((1 + at + O(t²))/(1 + σ²t²/2 + O(t³))), where σ² := VX.

For t > 0 sufficiently small we have

(1 + at + O(t²))/(1 + σ²t²/2 + O(t³)) > 1.

Hence, we have

ΛX∗(a) = sup_{t∈R} {at − ΛX(t)} ≥ sup_{t∈R+} log ((1 + at + O(t²))/(1 + σ²t²/2 + O(t³))) > 0.
We make two notes for future use. First, by assumption we have a > µ = 0. As ΛX is convex with ΛX′(0) = µ = 0 and ΛX(0) = 0, it follows that

ΛX∗(a) = sup_{t>0} {at − ΛX(t)}. (3.11)

Second,

ΛX is strictly convex at points where ΛX″ exists. (3.12)

To see this, note that VX > 0 under the hypotheses of the theorem, implying by (3.8) and the Cauchy-Schwarz inequality that ΛX″(t) > 0.
We now proceed to derive an upper bound for P(Sn > na). Using the fact that e^{tSn} > e^{nat} 1{Sn>na} for all t > 0, we obtain

P(Sn > na) = E 1{Sn>na} ≤ e^{−nat} E e^{tSn} = e^{−nat} MX^n(t) = e^{−n(at − ΛX(t))}, ∀ t > 0. (3.13)

Optimizing over t > 0 and using (3.11), we obtain

(1/n) log P(Sn > na) ≤ −sup_{t>0} {at − ΛX(t)} = −ΛX∗(a). (3.14)
Before obtaining a lower bound for P(Sn > na), let us define

T := sup{t > 0 : MX(t) < ∞},

and let us assume that the supremum in (3.11) is attained at some t∗ ∈ (0, T). (3.15)

This assumption is not necessary, but simplifies the proof considerably. As at − ΛX(t) has a maximum at t∗, and as ΛX is C∞ on (0, T), the derivative of at − ΛX(t) equals zero at t = t∗ and hence

ΛX′(t∗) = a.
Next, define the tilted distribution

dF_X̃(x) := (e^{t∗x}/MX(t∗)) dFX(x). (3.16)

Let (X̃i) be a sequence of iid random variables with common distribution F_X̃ and define S̃n = Σ_{i=1}^n X̃i. Observe that

M_X̃(t) := ∫_R e^{tx} dF_X̃(x) = ∫_R (e^{(t+t∗)x}/MX(t∗)) dFX(x) = MX(t + t∗)/MX(t∗).
It follows that

E X̃ = M_X̃′(0) = MX′(t∗)/MX(t∗) = ΛX′(t∗) = a,
V X̃ = E X̃² − (E X̃)² = M_X̃″(0) − (M_X̃′(0))² = ΛX″(t∗) ∈ (0, ∞),

where the fact that 0 < ΛX″(t∗) < ∞ follows from (3.12) and the assumption (3.15). Noting that Sn and S̃n are sums of iid random variables, we have

MSn(t) = MX^n(t), M_S̃n(t) = M_X̃^n(t).
Using the above and (3.16), it is easy to show that FSn and FSe are related as follows
n
Z x
1 t ∗ y dF (y).
FeS (x ) = n e Sn
n MX (t ∗ ) –∞
Now, let b > a. We have
\[ P(S_n > na) = \int_{na}^{\infty} dF_{S_n}(x) = M_X^n(t^*) \int_{na}^{\infty} e^{-t^* x} \, dF_{\widetilde{S}_n}(x) \ge M_X^n(t^*)\, e^{-n t^* b} \int_{na}^{nb} dF_{\widetilde{S}_n}(x) = e^{-n(t^* b - \Lambda_X(t^*))}\, P(na < \widetilde{S}_n < nb). \]
Taking logarithms and dividing by n, we obtain
\[ \frac{1}{n} \log P(S_n > na) \ge -(t^* b - \Lambda_X(t^*)) + \frac{1}{n} \log P(na < \widetilde{S}_n < nb) \]
\[ \to -(t^* b - \Lambda_X(t^*)) \qquad \text{as } n \to \infty, \]
\[ \to -(t^* a - \Lambda_X(t^*)) = -\Lambda_X^*(a) \qquad \text{as } b \to a, \tag{3.17} \]
where the n → ∞ limit holds because E\widetilde{X} = a and V\widetilde{X} < ∞ imply, by the central limit theorem, that P(na < \widetilde{S}_n < nb) → 1/2, so that n^{-1} \log P(na < \widetilde{S}_n < nb) → 0.
Example 3.5.2. Let (X_i) be a sequence of iid random variables with P(X_i = 1) = P(X_i = −1) = 1/2. We claim that
\[ P(S_n > an)^{1/n} \to \frac{1}{\sqrt{(1+a)^{1+a} (1-a)^{1-a}}}, \qquad 0 < a < 1. \]
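The claimed rate can be checked numerically. Below is a small sketch (standard library only; the value a = 0.3 is an illustrative choice): since S_n = 2(#heads) − n, the tail P(S_n > an) is an exact binomial sum, and −n^{−1} log P(S_n > na) should approach Λ*_X(a).

```python
import math

def tail(n, a):
    # P(S_n > a n) for the +/-1 walk: #heads must exceed n(1+a)/2
    k0 = math.floor(n * (1 + a) / 2)
    return sum(math.comb(n, k) for k in range(k0 + 1, n + 1)) / 2**n

def rate(a):
    # Lambda*_X(a) for P(X = +1) = P(X = -1) = 1/2
    return 0.5 * (1 + a) * math.log(1 + a) + 0.5 * (1 - a) * math.log(1 - a)

a = 0.3
for n in (100, 400, 1600):
    print(n, -math.log(tail(n, a)) / n, rate(a))  # columns 2 and 3 should converge
```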
3.6 Exercises
Exercise 3.1. Let X ∼ Bin(n, U) where U ∼ U((0, 1)). What is the probability generating function G_X(s) of X? What is P(X = k) where k ∈ {0, 1, 2, . . . , n}?
Exercise 3.2. Let Z_n be the size of the nth generation in an ordinary branching process with Z_0 = 1, EZ_1 = µ and VZ_1 > 0. Show that E[Z_n Z_m] = µ^{n−m} E[Z_m^2] for m ≤ n. Use this to find the correlation coefficient ρ(Z_m, Z_n) in terms of µ, n, and m. Consider the case µ = 1 and the case µ ≠ 1.
Exercise 3.3. Consider a branching process with generation sizes Z_n satisfying Z_0 = 1 and P(Z_1 = 0) = 0. Pick two individuals at random with replacement from the nth generation and let L be the index of the generation which contains their most recent common ancestor. Show that P(L = r) = E[Z_r^{−1}] − E[Z_{r+1}^{−1}] for 0 ≤ r < n.
Exercise 3.5. Find \varphi_{X^2}(t) := E e^{itX^2} where X ∼ N(µ, σ²).
Exercise 3.7. A coin is tossed repeatedly, with heads turning up with probability p on each toss. Let N be the minimum number of tosses required to obtain k heads. Show that, as p → 0, the distribution function of 2Np converges to that of a gamma distribution. Note that, if X ∼ Γ(λ, r) then
\[ f_X(x) = \frac{1}{\Gamma(r)} \lambda^r x^{r-1} e^{-\lambda x} \mathbb{1}_{\{x \ge 0\}}. \]
Chapter 4
Discrete time Markov chains
The notes from this chapter are primarily taken from (Grimmett and Stirzaker, 2001, Chapter 6).
Definition 4.1.1. A discrete time Markov chain X = (X_n)_{n∈N_0} is a discrete-time Markov process with a countable state space S.
Recall from (2.4) that a Markov process defined on a probability space (Ω, F, P), which is equipped with a filtration F = (F_n)_{n∈N_0}, is a process X = (X_n)_{n∈N_0} that satisfies P(X_{n+k} ∈ A|F_n) = P(X_{n+k} ∈ A|X_n).
What the Markov property means in practice is that the evolution of a Markov chain is described by its one-step transition probabilities P(X_{n+1} = j|X_n = i). In general, such probabilities may depend on the time n. Throughout this Chapter, we will restrict our attention to Markov chains whose transition probabilities do not depend on n; such chains are called homogeneous, and we write
\[ p(i, j) := P(X_{n+1} = j \,|\, X_n = i), \qquad \forall\, i, j \in S. \]
We call the |S| × |S| matrix P = (p(i, j)) the one-step transition matrix.
Since P(X_{n+1} = j|X_n = i) = P(X_1 = j|X_0 = i), it follows that P(X_{n+m} = j|X_m = i) = P(X_n = j|X_0 = i). We therefore define
\[ p_n(i, j) := P(X_n = j \,|\, X_0 = i), \qquad \forall\, n \in \mathbb{N}_0, \ \forall\, i, j \in S. \]
We call the |S| × |S| matrix P_n = (p_n(i, j)) the n-step transition matrix. From the one-step transition matrix P, one can easily derive the n-step transition matrix P_n.
Theorem 4.1.4 (Chapman–Kolmogorov Equation). Let P and P_n be the one-step and n-step transition matrices of a homogeneous discrete-time Markov chain. Then
\[ P_{m+n} = P_m P_n, \qquad P_n = P^n. \]
Proof. Conditioning on the value of the chain at the intermediate time yields P_{m+n} = P_m P_n. Iterating this relation,
\[ P_n = P_{n-1} P = P_{n-2} P^2 = \ldots = P^n, \]
as claimed.
Lemma 4.1.5. Let X be a homogeneous discrete time Markov chain. Denote the probability mass function of X_n by µ_n, i.e., µ_n(j) := P(X_n = j). Then, viewing µ_n as a row vector, we have µ_{n+m} = µ_m P_n and, in particular, µ_n = µ_0 P^n.
Example 4.1.6. Consider a random walk on a circle with n nodes. At each step a particle moves clockwise with probability p and counter-clockwise with probability q. Let us write the one-step transition matrix. We have
\[ P = \begin{pmatrix} 0 & p & 0 & 0 & \ldots & 0 & 0 & q \\ q & 0 & p & 0 & \ldots & 0 & 0 & 0 \\ 0 & q & 0 & p & \ldots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \ldots & q & 0 & p \\ p & 0 & 0 & 0 & \ldots & 0 & q & 0 \end{pmatrix}. \]
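For concreteness, here is a short sketch (assuming NumPy; the node count n = 8 and p = 0.7 are illustrative) that builds the matrix above and computes an n-step transition matrix by matrix powers, in line with Theorem 4.1.4.

```python
import numpy as np

def circle_walk(n, p):
    """One-step transition matrix for the random walk on a circle with n nodes."""
    q = 1 - p
    P = np.zeros((n, n))
    for i in range(n):
        P[i, (i + 1) % n] = p  # clockwise step
        P[i, (i - 1) % n] = q  # counter-clockwise step
    return P

P = circle_walk(8, 0.7)
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution
P10 = np.linalg.matrix_power(P, 10)     # 10-step transition matrix P_10 = P^10
print(P10[0])                           # distribution after 10 steps from node 0
```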
Example 4.1.7. Consider a random walk on N_0. At each step a particle moves up one unit with probability p and returns to the origin with probability q. Let us write the one-step transition matrix. We have
\[ P = \begin{pmatrix} q & p & 0 & 0 & 0 & \ldots \\ q & 0 & p & 0 & 0 & \ldots \\ q & 0 & 0 & p & 0 & \ldots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}. \]
If a Markov chain visits a persistent state once, then it is guaranteed to return to that state. In fact, it is guaranteed to return to that state infinitely often. We would like to find conditions on the transition matrix P that enable us to classify a state as either persistent or transient.
Definition 4.2.2. Let X be a Markov chain. We define the first passage time to state j by
\[ \tau_j := \inf\{ n \ge 1 : X_n = j \}. \]
\[ P_{ii}(s) = 1 + F_{ii}(s) P_{ii}(s), \qquad P_{ij}(s) = F_{ij}(s) P_{jj}(s), \quad i \ne j. \tag{4.1} \]
Note that we must restrict |s| < 1 because P_{ij} is not a probability generating function. When we need s = 1 we can take a limit as s ↗ 1 and use Abel's Theorem, which states that
\[ \lim_{s \nearrow 1} P_{ij}(s) = \sum_{n=0}^{\infty} p_n(i, j). \tag{4.2} \]
Proof. Fix i , j ∈ S and define
Here τ̄_j := E[τ_j | X_0 = j] denotes the mean recurrence time of state j. Clearly, τ̄_j = ∞ if a state is transient. However, τ̄_j may be infinite even if j is recurrent.
Definition 4.2.7. A recurrent state j is said to be null if τ̄_j = ∞ and non-null or positive if τ̄_j < ∞.
The following is a simple condition which differentiates null recurrent from positive recurrent states.
Theorem 4.2.8. A persistent state j is null if and only if pn (j , j ) → 0 as n → ∞; if this holds then
pn (i , j ) → 0 as n → ∞ for all i .
Definition 4.2.9. The period of a state i , denoted d(i ), is defined by d(i ) = gcd{n : pn (i , i ) > 0}. If
d(i ) = 1 we say that state i is aperiodic. Here gcd means “greatest common divisor.”
Proof. We will prove Item 1. If i ↔ j then there exist k, n such that α := p_k(i, j) p_n(j, i) > 0. Therefore, for any m, we have
\[ p_{k+m+n}(i, i) \ge p_k(i, j)\, p_m(j, j)\, p_n(j, i) = \alpha\, p_m(j, j), \]
and thus, if i is transient, then j is transient as well. To show the converse simply switch i and j in the above argument.
Definition 4.3.3. Let C ⊂ S. We say that C is closed if p(i , j ) = 0 for all i ∈ C and j ∈
/ C. We say
that C is irreducible if i ↔ j for all i , j ∈ C.
If a closed set C consists of a single state (e.g., C = {j }), then we call this state absorbing.
Theorem 4.3.4 (Markov chain decomposition). The state space S of a Markov chain can be
uniquely partitioned as follows
S = T ∪ C1 ∪ C2 ∪ . . . ,
where T is the set of transient states and C1 , C2 , . . . are closed sets of persistent states.
Lemma 4.3.5. If S is finite, then at least one state is persistent and all persistent states are
non-null.
Proof. First, suppose all states are transient. Then, by Corollary 4.2.5, we have
\[ 1 = \lim_{n\to\infty} \sum_j p_n(i, j) = \sum_j \lim_{n\to\infty} p_n(i, j) = 0, \]
where exchanging the limit and the (finite) sum is justified because S is finite. Thus, we have a contradiction and we conclude that at least one state is persistent. Now, suppose j ∈ C_k, where C_k is a closed set of persistent states. Suppose j is null. Then by Theorem 4.3.2, all states i ∈ C_k are null. Then, by Theorem 4.2.8 we have
\[ 1 = \lim_{n\to\infty} \sum_{j \in C_k} p_n(i, j) = \sum_{j \in C_k} \lim_{n\to\infty} p_n(i, j) = 0, \]
which is again a contradiction; hence all persistent states are non-null.
The subsets C_1 = {1, 2} and C_2 = {5, 6} are closed, persistent, non-null sets, since, once the chain visits these sets, it cannot escape. The subset T = {3, 4} is transient. If X_0 ∈ T, the chain will eventually move to either C_1 or C_2, where it will remain forever.
Definition 4.4.1. Let X be a Markov chain with one-step transition matrix P. We say that a row vector π = (π(i))_{i∈S} with π(i) ≥ 0 and \sum_{i∈S} π(i) = 1 is a stationary (or invariant) distribution for X if π = πP.
The vector π is called a stationary distribution for the following reason: suppose the Markov chain X has an initial distribution µ_0 = π. Then we have
\[ \mu_n = \pi P^n = \pi P \cdot P^{n-1} = \pi P^{n-1} = \pi P \cdot P^{n-2} = \ldots = \pi P = \pi. \]
Example 4.4.2. Consider a two-state Markov chain with state space S = {1, 2} and with one-step transition matrix
\[ P = \begin{pmatrix} 1-p & p \\ q & 1-q \end{pmatrix}, \qquad q, p \in (0, 1). \]
Let us find π. From π = πP we derive
\[ \pi(1) = (1-p)\pi(1) + q\pi(2), \qquad \pi(2) = p\pi(1) + (1-q)\pi(2). \]
We also have π(1) + π(2) = 1. Solving for π(1) and π(2), we obtain
\[ \pi(1) = \frac{q}{p+q}, \qquad \pi(2) = \frac{p}{p+q}. \]
For a given Markov chain X, a stationary distribution may not exist. And, if it does exist, it may not be
unique.
Theorem 4.4.3. An irreducible chain X has a stationary distribution π if and only if all the states are non-null persistent; in this case π is unique and is given by
\[ \pi(i) = 1/\bar{\tau}_i, \qquad \forall\, i \in S. \]
It is typically much easier to compute π directly than to compute the recurrence times τ̄_i. Theorem 4.4.3 thus gives us a method of computing τ̄_i from π(i).
Example 4.4.4. Consider a Markov chain with state space S = {0, 1, 2, . . .} and with one-step transition matrix
\[ P = \begin{pmatrix} q & p & 0 & 0 & 0 & \ldots \\ q & 0 & p & 0 & 0 & \ldots \\ 0 & q & 0 & p & 0 & \ldots \\ 0 & 0 & q & 0 & p & \ldots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}, \qquad p + q = 1. \]
Note that the chain is irreducible. Thus, if an invariant distribution π exists, it is unique. Let us find π (if it exists). From π = πP we have
\[ \pi(0) = q\pi(0) + q\pi(1) \;\Rightarrow\; \pi(1) = \frac{1-q}{q}\pi(0) = \frac{p}{q}\pi(0), \]
\[ \pi(1) = p\pi(0) + q\pi(2) \;\Rightarrow\; \pi(2) = \frac{1}{q}\big( \pi(1) - p\pi(0) \big) = \left( \frac{p}{q} \right)^2 \pi(0), \]
\[ \pi(2) = p\pi(1) + q\pi(3) \;\Rightarrow\; \pi(3) = \frac{1}{q}\big( \pi(2) - p\pi(1) \big) = \left( \frac{p}{q} \right)^3 \pi(0). \]
This suggests that π(n) = (p/q)^n π(0). Assume this is true for every i ≤ n. We show it holds for n + 1 (and thus, by induction, for every n). We have
\[ \pi(n) = p\pi(n-1) + q\pi(n+1) \;\Rightarrow\; \pi(n+1) = \frac{1}{q}\big( \pi(n) - p\pi(n-1) \big) = \frac{1}{q}\left( \left( \frac{p}{q} \right)^n \pi(0) - p \left( \frac{p}{q} \right)^{n-1} \pi(0) \right) = \left( \frac{p}{q} \right)^{n+1} \pi(0). \]
q
Next, we use \sum_n \pi(n) = 1 to find π(0). We have
\[ 1 = \sum_{n=0}^{\infty} \pi(n) = \pi(0) \sum_{n=0}^{\infty} (p/q)^n = \pi(0) \frac{1}{1 - (p/q)}, \qquad \text{if } (p/q) < 1. \]
Thus, if (p/q) < 1 then π(0) = 1 − (p/q). Finally, let us find the mean recurrence time of the nth state. We have
\[ \bar{\tau}_n = \frac{1}{\pi(n)} = \frac{1}{\pi(0)} \left( \frac{q}{p} \right)^n = \frac{1}{1 - (p/q)} \left( \frac{q}{p} \right)^n. \]
Note that if (p/q) > 1, then there is no invariant distribution π. In this case, the chain is transient and the mean recurrence time for every state is infinite. What happens if p = q? It turns out that in this case, all states are null recurrent (need to check).
4.5 Reversibility
Definition 4.5.1. Let X = (Xn )0≤n≤N be a Markov chain. The time reversal of X is the process
Y = (Yn )0≤n≤N , where
Yn = XN–n .
Theorem 4.5.2. Let X = (Xn )0≤n≤N be an irreducible Markov chain with a one-step transition
matrix P and invariant distribution π. Suppose X0 ∼ π (so that µn = π for every n). Then Y, the
time reversal of X, is a Markov chain and its one-step transition matrix, denoted Q = (q(i , j ))ij ,
is given by
\[ q(i, j) = \frac{\pi(j)}{\pi(i)}\, p(j, i). \tag{4.5} \]
Proof. Let F_n = σ(Y_k, 0 ≤ k ≤ n). We cannot assume, a priori, that Y is Markov. Thus, in computing P(Y_{n+1} = i_{n+1}|F_n) we must condition on the entire history F_n of Y rather than conditioning on Y_n only. Carrying out this computation (using the Markov property of X and the fact that µ_n = π for every n), one finds that P(Y_{n+1} = j|F_n) depends only on Y_n. Thus, we have shown that Y is Markov and its one-step transition matrix is given by (4.5).
Definition 4.5.3. Let X = (X_n)_{n≥0} be an irreducible Markov chain with a one-step transition matrix P and invariant distribution π. Suppose X_0 ∼ π (so that µ_n = π for every n). Let Y be the time reversal of X and denote by Q the one-step transition matrix of Y. We say that X is reversible if P = Q, or equivalently (by Theorem 4.5.2), if
\[ \pi(i) p(i, j) = \pi(j) p(j, i), \qquad \forall\, i, j \in S. \]
Definition 4.5.4. Let P be an |S| × |S| one-step transition matrix. We say that a distribution λ = (λ(1), λ(2), . . . , λ(|S|)) is in detailed balance with P if
\[ \lambda(i) p(i, j) = \lambda(j) p(j, i), \qquad \forall\, i, j \in S. \tag{4.6} \]
Theorem 4.5.5. Let P be the one-step transition matrix of an irreducible Markov chain X. Suppose there exists a distribution λ such that (4.6) holds. Then λ is a stationary distribution for X. That is, λ = λP.
For a finite irreducible Markov chain with period d and one-step transition matrix P, the eigenvalues of P satisfy the following.
1. λ_1 = 1 is an eigenvalue of P.
2. The d complex numbers
\[ \lambda_n = \omega^{n-1}, \qquad \omega = e^{2\pi i/d}, \qquad n = 1, 2, \ldots, d, \]
are eigenvalues of P.
3. The remaining eigenvalues λ_{d+1}, λ_{d+2}, . . . , λ_{|S|} satisfy |λ_j| < 1.
Suppose the eigenvalues of P are distinct. Then it is well-known that P has the decomposition
\[ P = U^{-1} \Lambda U, \qquad U = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_{|S|} \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} \lambda_1 & 0 & \ldots & 0 \\ 0 & \lambda_2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \lambda_{|S|} \end{pmatrix}, \]
where (e_i)_{1≤i≤|S|} are the left eigenvectors of P. It follows that
\[ P^n = \underbrace{U^{-1} \Lambda U \cdot U^{-1} \Lambda U \cdot \ldots \cdot U^{-1} \Lambda U}_{n \text{ times}} = U^{-1} \Lambda^n U. \tag{4.7} \]
Expression (4.7) allows us to study the long-run behavior n → ∞ of a Markov chain X. In what follows, assume the period of X is one so that the eigenvalues of P satisfy 1 = λ_1 > |λ_2| > |λ_3| > . . . > |λ_{|S|}|. Let X_0 ∼ µ_0. As the eigenvectors (e_i)_{1≤i≤|S|} form a basis for R^{|S|} we can express µ_0 as
\[ \mu_0 = \sum_{i=1}^{|S|} c_i e_i, \tag{4.8} \]
for some constants (c_i). Next, using (4.7) and (4.8), we obtain
\[ \mu_n = \mu_0 P^n = \sum_{i=1}^{|S|} c_i e_i P^n = \sum_{i=1}^{|S|} c_i \lambda_i^n e_i = c_1 \alpha \pi + \sum_{i=2}^{|S|} c_i \lambda_i^n e_i, \]
where, in the last step, we used λ_1 = 1 and e_1 = απ for some constant α. As n → ∞, the terms in the sum go to zero at least as fast as |λ_2|^n = exp(n log |λ_2|). Thus, µ_n → π exponentially fast with a rate determined by the second eigenvalue.
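This spectral picture is easy to observe numerically. The sketch below (assuming NumPy; the 3×3 chain is an illustrative aperiodic example whose eigenvalues are 1, 0.9 and 0.8) shows that the error ‖µ_n − π‖₁ decays roughly like |λ₂|ⁿ.

```python
import numpy as np

P = np.array([[0.90, 0.10, 0.00],
              [0.05, 0.90, 0.05],
              [0.00, 0.10, 0.90]])
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()                    # invariant distribution
lam2 = sorted(np.abs(w))[-2]      # second-largest eigenvalue modulus

mu = np.array([1.0, 0.0, 0.0])    # start in state 1
errs = []
for n in range(31):
    errs.append(np.abs(mu - pi).sum())
    mu = mu @ P
print(errs[30] / errs[20], lam2**10)  # the two ratios should be comparable
```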
4.7 Exercises
Exercise 4.1. A six-sided die is rolled repeatedly. Which of the following are Markov chains? For those that are, find the one-step transition matrix. (a) X_n is the largest number rolled up to the nth roll. (b) X_n is the number of sixes rolled in the first n rolls. (c) At time n, X_n is the time since the last six was rolled. (d) At time n, X_n is the time until the next six is rolled.
Exercise 4.2. Let Y_n = X_{2n}. Compute the transition matrix for Y when (a) X is a simple random walk (i.e., X increases by one with probability p and decreases by one with probability q) and (b) X is a branching process where G is the generating function of the number of offspring from each individual.
Exercise 4.3. Let X be a Markov chain with state space S and absorbing state k (i.e., p(k , j ) = 0 for
all j ∈ S). Suppose j → k for all j ∈ S. Show that all states other than k are transient.
Find Pn , the invariant distribution π and the mean-recurrence times τ̄j for j = 1, 2, 3.
Exercise 4.6. Let X_n be the number of mistakes in the nth edition of a book. Between the nth and the (n + 1)th edition, an editor corrects each mistake independently with probability p and introduces Y_n new mistakes, where the (Y_n) are iid and Poisson distributed with parameter λ. Find the invariant distribution π of the number of mistakes in the book.
Exercise 4.7. Give an example of a transition matrix P that admits multiple stationary distributions π.
Chapter 5
Continuous time Markov chains
The notes from this chapter are primarily taken from (Grimmett and Stirzaker, 2001, Chapter 6).
Definition 5.1.1. A counting process is a stochastic process N = (N_t)_{t≥0}, taking values in S = {0, 1, 2, . . .}, such that the following hold.
1. N0 = 0.
2. If s < t then Ns ≤ Nt .
As its name suggests, a counting process simply counts the number of times an event occurs prior to a
given time. For example, we could define Nt := {number of buy orders prior to time t }. Of course, there
are many ways to model a counting process. And, depending on how we model N, the counting process
may or may not be Markov. However, because Markov processes allow for many analytically tractable
computations, we will focus on Markov counting processes; the most basic of these is the Poisson process.
Definition 5.1.2. A Poisson process with intensity λ is a stochastic process N = (Nt )t ≥0 taking values
in S = {0, 1, 2, . . .} such that the following hold.
1. N0 = 0.
2. If s < t then Ns ≤ Nt .
3. If s < t then (Nt – Ns ) ⊥⊥ Fs , where Fs = σ(Nr , 0 ≤ r ≤ s).
4. Lastly, as s → 0⁺,
\[ P(N_{t+s} = n + m \,|\, N_t = n) = \begin{cases} \lambda s + O(s^2) & m = 1, \\ O(s^2) & m > 1, \\ 1 - \lambda s + O(s^2) & m = 0. \end{cases} \]
Note that Item 3 implies that N is Markov. To see this, define F_t := σ(N_s, 0 ≤ s ≤ t) and observe that
\[ E[g(N_t)|F_s] = E[g(N_t - N_s + N_s)|N_s] = \sum_{n=0}^{\infty} g(n + N_s) f_{N_t - N_s}(n). \]
Theorem 5.1.3. Let N = (N_t)_{t≥0} be a Poisson process with parameter λ. Then, for all t ≥ 0 we have N_t ∼ Poi(λt). That is,
\[ p_t(j) := P(N_t = j) = \frac{(\lambda t)^j}{j!} e^{-\lambda t}. \tag{5.1} \]
Proof. We have
\[ P(N_{t+s} = j) = \sum_{i=0}^{j} P(N_{t+s} = j \,|\, N_t = i) P(N_t = i) = \sum_{i=0}^{j} P(N_{t+s} - N_t = j - i) P(N_t = i) \]
\[ = P(N_{t+s} - N_t = 0) P(N_t = j) + P(N_{t+s} - N_t = 1) P(N_t = j - 1) + \sum_{i=0}^{j-2} P(N_{t+s} - N_t = j - i) P(N_t = i) \]
\[ = (1 - \lambda s) P(N_t = j) + \lambda s P(N_t = j - 1) + O(s^2). \]
Subtracting P(N_t = j) =: p_t(j) from both sides, dividing by s and taking a limit as s → 0 we obtain
\[ \frac{d}{dt} p_t(j) = \lambda p_t(j-1) - \lambda p_t(j), \quad j \ge 1, \qquad \frac{d}{dt} p_t(0) = -\lambda p_t(0), \tag{5.2} \]
\[ p_0(j) = \delta_{j,0}, \tag{5.3} \]
where we have included the obvious boundary conditions. Thus, we have obtained a sequence of nested ODEs for (p_t(j))_{j≥0}. One can easily verify that (5.1) satisfies (5.2)–(5.3).
For any counting process (whether or not it is Poisson), we may be interested to know the arrival
time of the nth event and the time between the nth and (n + 1)th events.
Definition 5.1.4. Let N = (N_t)_{t≥0} be a counting process. We define S_n, the nth arrival time, by
\[ S_0 := 0, \qquad S_n := \inf\{ t \ge 0 : N_t = n \}, \quad n \ge 1, \tag{5.4} \]
and τ_n, the nth inter-arrival time, by
\[ \tau_n := S_n - S_{n-1}, \qquad n \ge 1. \tag{5.5} \]
Given the full history of N, we can construct the sequence (τ_n)_{n≥1} using (5.4)–(5.5). Alternatively, given the sequence (τ_n)_{n≥1}, we can construct the path of N using
\[ S_n = \sum_{i=1}^n \tau_i, \quad n \ge 1, \qquad N_t = \sum_{n=1}^{\infty} n \mathbb{1}_{\{S_n \le t < S_{n+1}\}}. \tag{5.6} \]
Theorem 5.1.5. Suppose the inter-arrival times τi (see Definition 5.1.4) of a counting process
N = (Nt )t ≥0 are iid and exponentially distributed with parameter λ. Then N is a Poisson process
with parameter λ.
Proof. We will show that if the inter-arrival times (τ_i)_{i≥1} are iid with τ_i ∼ E(λ) then N_t, given by (5.6), has a probability mass function given by (5.1). First, we show that S_n, as defined in (5.6), has a Gamma density:
\[ f_{S_n}(x) = \mathbb{1}_{\{x \ge 0\}} \frac{(\lambda x)^{n-1}}{(n-1)!} \lambda e^{-\lambda x}. \tag{5.7} \]
Note that S_1 = τ_1 and f_τ(x) = f_{S_1}(x) = \mathbb{1}_{\{x \ge 0\}} \lambda e^{-\lambda x}. Thus, (5.7) is correct for n = 1. We now assume (5.7) holds for n and show that it holds for n + 1. As S_{n+1} = S_n + τ_{n+1} we have
\[ f_{S_{n+1}}(x) = \int_{\mathbb{R}^2} dy\, dz\, f_{S_n}(y) \cdot f_\tau(z) \cdot \delta(y + z - x) = \int_{\mathbb{R}} dy\, f_{S_n}(y) \cdot f_\tau(x - y) \]
\[ = \int_{\mathbb{R}} dy\, \mathbb{1}_{\{y \ge 0\}} \frac{(\lambda y)^{n-1}}{(n-1)!} \lambda e^{-\lambda y} \cdot \mathbb{1}_{\{x - y \ge 0\}} \lambda e^{-\lambda(x-y)} = \mathbb{1}_{\{x \ge 0\}} \frac{\lambda^{n+1} e^{-\lambda x}}{(n-1)!} \int_0^x dy\, y^{n-1} = \mathbb{1}_{\{x \ge 0\}} \frac{(\lambda x)^n}{n!} \lambda e^{-\lambda x}, \]
which agrees with (5.7). Thus, by induction, (5.7) holds for every n. Using the above result, we now
show that Nt , as defined in (5.6), has probability mass function (5.1). For n ≥ 1 we clearly have
\[ P(N_t \ge n) = P(S_n \le t) = \int_0^t dx\, f_{S_n}(x) = \int_0^t dx\, \frac{(\lambda x)^{n-1}}{(n-1)!} \lambda e^{-\lambda x}. \]
Using this result, we compute
\[ P(N_t \ge n + 1) = \int_0^t dx\, \frac{(\lambda x)^n}{n!} \lambda e^{-\lambda x} = -\frac{(\lambda t)^n}{n!} e^{-\lambda t} + P(N_t \ge n), \]
where we have used integration by parts to obtain the second equality. Thus, we have
\[ P(N_t = n) = P(N_t \ge n) - P(N_t \ge n + 1) = \frac{(\lambda t)^n}{n!} e^{-\lambda t}, \]
which agrees with (5.1). To complete the proof, we observe that P(N_t = 0) = P(τ_1 > t) = e^{-λt}, which is (5.1) with n = 0.
The inter-arrival times (τ_i)_{i≥1} of a Poisson process N are memoryless in the sense that
\[ P(\tau_i > s + t \,|\, \tau_i > s) = P(\tau_i > t), \qquad \forall\, s, t \ge 0, \]
which is a characterizing property of the exponential distribution.
It should be clear from the construction of N that the Poisson process has stationary increments,
\[ N_t - N_s \sim N_{t-s}, \qquad 0 \le s < t, \]
as well as independent increments: N_t − N_s ⊥⊥ F_s.
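Theorem 5.1.5 suggests a direct way to simulate a Poisson process: draw iid E(λ) inter-arrival times and count the arrivals before t. A minimal sketch, assuming NumPy (λ, t and the number of trials are illustrative); the samples of N_t should have mean and variance close to λt.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, t, trials = 2.0, 3.0, 20_000

def sample_Nt():
    s, n = 0.0, 0
    while True:
        s += rng.exponential(1 / lam)  # tau_i ~ E(lam) has mean 1/lam
        if s > t:
            return n
        n += 1

samples = np.array([sample_Nt() for _ in range(trials)])
print(samples.mean(), samples.var())  # both should be close to lam * t = 6
```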
Definition 5.2.1. A continuous time Markov chain X = (Xt )t ≥0 is a Markov process with a countable
state space S.
Recall from (2.4) that a Markov process satisfies P(X_{t+s} ∈ A|F_s) = P(X_{t+s} ∈ A|X_s). Here, we take the filtration F_s to be the natural filtration for X. That is, F_s = σ(X_u, 0 ≤ u ≤ s). What the Markov property means in practice is that the evolution of a Markov chain is described by its transition probabilities P(X_{t+s} = j|X_s = i). In general, such probabilities may depend on the times t, s. Throughout this Chapter, we will restrict our attention to Markov chains whose transition probabilities do not depend on s, and we write
\[ p_t(i, j) := P(X_{t+s} = j \,|\, X_s = i) = P(X_t = j \,|\, X_0 = i). \]
Let P_t be the |S| × |S| matrix P_t = (p_t(i, j)). We call the collection of matrices (P_t)_{t≥0} the transition semigroup.
Theorem 5.2.3. The transition semigroup (Pt )t ≥0 satisfies the following properties.
1. P0 = I.
2. \sum_j p_t(i, j) = 1.
3. P_t P_s = P_{t+s}.
Proof. Item 1 follows from P(X_0 = j|X_0 = i) = δ_{i,j}. Item 2 follows from \sum_j P(X_t = j|X_0 = i) = 1. Lastly, to show Item 3, note that
\[ p_{t+s}(i, j) = P(X_{t+s} = j \,|\, X_0 = i) = \sum_k P(X_{t+s} = j \,|\, X_t = k) \cdot P(X_t = k \,|\, X_0 = i) = \sum_k P(X_s = j \,|\, X_0 = k) \cdot P(X_t = k \,|\, X_0 = i) = \sum_k p_t(i, k) p_s(k, j). \]
We now construct a continuous-time Markov chain in very much the same manner we constructed the
Poisson process in Definition 5.1.2. The following derivation is purely formal (i.e., not rigorous). Some
of what we write is not true in general, but holds in many practical applications.
Definition 5.2.4. A continuous time Markov chain with generator G = (g(i, j))_{i,j} is a stochastic process satisfying
\[ P(X_{t+s} = j \,|\, X_t = i) = \delta_{i,j} + g(i, j) s + O(s^2), \qquad s \to 0^+. \]
The essence of Definition 5.2.4 is as follows. In a small time interval [t, t + s) the process X either remains at its current state X_t = i or it jumps to a new state j ≠ i. The probability of remaining at i is 1 + g(i, i)s + O(s²). The probability of jumping to state j is g(i, j)s + O(s²). Since probabilities should always fall in [0, 1] we must have
\[ g(i, i) \le 0, \qquad g(i, j) \ge 0, \quad i \ne j. \]
Moreover, since X should be found somewhere with probability one we must also have
\[ 1 = \sum_j P(X_{t+s} = j \,|\, X_t = i) = 1 + s \sum_j g(i, j) + O(s^2) \qquad \Rightarrow \qquad \sum_j g(i, j) = 0, \quad \forall\, i \in S. \]
Note that the generator is the derivative of the semigroup at zero:
\[ G = \lim_{s \to 0^+} \frac{1}{s} \left( P_s - I \right). \]
Thus, it is clear that we can obtain G from knowledge of Pt . We can also obtain Pt from G.
Theorem 5.2.5. Let X = (X_t)_{t≥0} be a continuous time Markov chain with generator G. The transition semigroup of X satisfies the following ODEs:
\[ \text{Kolmogorov Forward equation:} \qquad \frac{d}{dt} P_t = P_t G, \tag{5.8} \]
\[ \text{Kolmogorov Backward equation:} \qquad \frac{d}{dt} P_t = G P_t. \tag{5.9} \]
Proof. To derive the forward equation (5.8) we compute P(X_{t+s} = j|X_0 = i) by conditioning on the value of X_t. We have
\[ p_{t+s}(i, j) = \sum_k P(X_{t+s} = j \,|\, X_t = k) P(X_t = k \,|\, X_0 = i) = \sum_k p_t(i, k) p_s(k, j) = p_t(i, j)(1 + g(j, j)s) + \sum_{k : k \ne j} p_t(i, k) g(k, j) s + O(s^2). \]
Subtracting p_t(i, j) from both sides, dividing by s and taking a limit as s goes to zero, we obtain
\[ \lim_{s \searrow 0} \frac{1}{s} \big( p_{t+s}(i, j) - p_t(i, j) \big) = \frac{d}{dt} p_t(i, j) = \sum_k p_t(i, k) g(k, j), \]
which is the component-wise representation of (5.8). To derive the backward equation (5.9) we compute P(X_{t+s} = j|X_0 = i) by conditioning on the value of X_s. We have
\[ p_{t+s}(i, j) = \sum_k P(X_{t+s} = j \,|\, X_s = k) P(X_s = k \,|\, X_0 = i) = \sum_k p_s(i, k) p_t(k, j) = (1 + g(i, i)s) p_t(i, j) + \sum_{k : k \ne i} g(i, k) s\, p_t(k, j) + O(s^2). \]
Subtracting p_t(i, j) from both sides, dividing by s and taking a limit as s goes to zero, we obtain
\[ \lim_{s \searrow 0} \frac{1}{s} \big( p_{t+s}(i, j) - p_t(i, j) \big) = \frac{d}{dt} p_t(i, j) = \sum_k g(i, k) p_t(k, j), \]
which is the component-wise representation of (5.9).
If Pt solves the forward equation (5.8) then it also solves the backward equation (5.9) and vice versa.
The solution to (5.8) and (5.9), subject to the initial condition P_0 = I, is
\[ P_t = e^{tG} := \sum_{n=0}^{\infty} \frac{1}{n!} t^n G^n. \]
Example 5.2.6. Let the generator G of a continuous time Markov chain X be given by
\[ G = \begin{pmatrix} -\alpha & \alpha \\ \beta & -\beta \end{pmatrix}, \qquad \alpha, \beta > 0. \tag{5.11} \]
We wish to find the semigroup P_t generated by G. We note that G has left eigenvectors and eigenvalues given by
\[ \lambda_0 = 0, \quad e_0 = \frac{1}{\sqrt{\alpha^2 + \beta^2}} (\beta, \alpha), \qquad \lambda_1 = -\alpha - \beta, \quad e_1 = \frac{1}{\sqrt{2}} (1, -1). \]
For a Poisson process with intensity λ, we found that we could construct N from a sequence of iid
inter-arrival times (τi )i ≥1 that were exponentially distributed: τi ∼ E(λ). A continuous time Markov
chain can be constructed in a similar manner.
Let G = (g(i, j))_{i,j∈S} be the generator of a continuous time Markov chain. Consider a discrete time Markov chain Y = (Y_n)_{n∈N_0} with one-step transition matrix P = (p(i, j))_{i,j∈S}. Suppose the entries of P are given by
\[ p(i, j) = \frac{-g(i, j)}{g(i, i)}, \quad j \ne i, \qquad p(i, i) = 0. \]
Observe that p(i, j) ≥ 0 for all i, j ∈ S and \sum_j p(i, j) = 1 for all i, as required. Now, let (τ_i)_{i∈N} be a sequence of independent random variables satisfying
\[ \tau_n \,|\, Y_{n-1} \sim \mathcal{E}(-g(Y_{n-1}, Y_{n-1})), \]
and define the jump times S_n := \sum_{i=1}^n \tau_i (with S_0 := 0) and the process X_t := Y_n for S_n ≤ t < S_{n+1}.
We claim that the process X is a continuous time Markov chain with generator G. We call (S_n)_{n∈N_0} the jump times and (τ_n)_{n∈N} the inter-jump times. Compare this construction of a Markov chain to the second construction we gave for a Poisson process (see Theorem 5.1.5 and the text preceding it).
The above construction makes clear what the dynamics of a CTMC X with generator G are. Specifically,
once X jumps into state i , the amount of time it remains in that state is exponentially distributed with
parameter –g(i , i ). When a jump occurs, the probability that X jumps from i to j is –g(i , j )/g(i , i ).
To see how this construction is equivalent to Definition 5.2.4, note that, for small s, the probability of two jumps in the time interval [0, s) is O(s²) because
\[ P(S_2 \le s) \le P(\tau_1 \le s) P(\tau_2 \le s) \le (1 - e^{-\bar g s})^2 = \big( 1 - (1 - \bar g s + O(s^2)) \big)^2 = O(s^2), \]
where we have defined \bar g := \sup_{i \in S} \{-g(i, i)\}, assumed finite. As the probability of two jumps is O(s²), for small s we have
For discrete time Markov chains, we found that µn+m = µn Pm . We defined an invariant distribution
π as any probability mass function that satisfied π = πP, which implied that π = πPn for all n. In
continuous time, we have the following analogs
µt +s = µt Ps , ∀ t , s ≥ 0,
π = πPt , ∀ t ≥ 0.
It is not always easy to find an invariant distribution π from Pt . However, π can also be obtained from
the generator G.
Theorem 5.2.7. Let X be a continuous time Markov chain with generator G and let P_t be the semigroup generated by G. Then
\[ \pi = \pi P_t \ \ \forall\, t \ge 0 \qquad \Leftrightarrow \qquad \pi G = 0. \]
Proof. We have
\[ \pi G = 0 \;\Leftrightarrow\; \pi G^n = 0, \ \forall\, n \ge 1 \;\Leftrightarrow\; \sum_{n=1}^{\infty} \frac{1}{n!} t^n \pi G^n = 0 \;\Leftrightarrow\; \pi \sum_{n=0}^{\infty} \frac{1}{n!} t^n G^n = \pi \;\Leftrightarrow\; \pi P_t = \pi. \]
Example 5.2.8. Let us return to Example 5.2.6. Let us find the stationary distribution for the Markov process X with generator G given by (5.11). We solve
\[ 0 = (\pi(0), \pi(1)) \begin{pmatrix} -\alpha & \alpha \\ \beta & -\beta \end{pmatrix}, \qquad 1 = \pi(0) + \pi(1), \qquad \pi(i) \ge 0 \qquad \Rightarrow \qquad \pi = \left( \frac{\beta}{\alpha+\beta}, \frac{\alpha}{\alpha+\beta} \right). \]
As in the discrete time case, for a given Markov chain X a stationary distribution π may not exist. If a stationary distribution does exist, then it may not be unique. Given the existence of a stationary distribution, the condition for uniqueness is that the chain be irreducible.
The chain X is called a birth-death process since, when X_t = i, it either jumps up to i + 1 (i.e., a birth) or jumps down to i − 1 (i.e., a death). The waiting time τ in state i is exponentially distributed: τ ∼ E(λ_i + µ_i). When the process does jump out of state i, the probability of an up-jump is λ_i/(λ_i + µ_i) and the probability of a down-jump is µ_i/(λ_i + µ_i). If λ_0 > 0 then the chain is irreducible. Let us see if we can find the stationary distribution π. We seek π such that πG = 0, which leads to
While a stationary distribution π can often be computed with relative ease, it is almost always difficult (or impossible) to compute µ_t explicitly. Nevertheless, we can still obtain moments E X_t^n if we can compute the generating function G_{X_t}(s) := E s^{X_t}.
Example 5.2.10. Consider the birth-death process given in Example 5.2.9 and assume λ_i = iλ and µ_i = iµ. Note that the state X_t = 0 is an absorbing state since λ_0 = 0. Assume X_0 = i ≥ 1. The Kolmogorov forward equations (5.8) become
\[ \frac{d}{dt} p_t(i, 0) = \mu\, p_t(i, 1), \qquad \frac{d}{dt} p_t(i, j) = \lambda(j-1)\, p_t(i, j-1) - (\lambda + \mu) j\, p_t(i, j) + \mu(j+1)\, p_t(i, j+1), \quad j \ge 1. \]
Multiplying the jth equation by s^j and summing all of the equations, we obtain
\[ \frac{\partial}{\partial t} G_{X_t} = \lambda s^2 \frac{\partial G_{X_t}}{\partial s} - (\lambda + \mu) s \frac{\partial G_{X_t}}{\partial s} + \mu \frac{\partial G_{X_t}}{\partial s}, \tag{5.14} \]
where we have used
\[ G_{X_t}(s) := E[s^{X_t} \,|\, X_0 = i] = \sum_{j=0}^{\infty} s^j p_t(i, j), \qquad \frac{\partial G_{X_t}}{\partial s} = \sum_{j=1}^{\infty} j s^{j-1} p_t(i, j). \]
One can check by direct substitution that the solution to PDE (5.14) with initial condition G_{X_0}(s) = s^i is given by
\[ G_{X_t}(s) = \left( \frac{\mu(1-s) - (\mu - \lambda s) e^{-(\lambda-\mu)t}}{\lambda(1-s) - (\mu - \lambda s) e^{-(\lambda-\mu)t}} \right)^i, \qquad \mu \ne \lambda. \]
From the generating function we can compute the first two moments of X_t easily. We have
\[ E X_t = i e^{(\lambda-\mu)t}, \qquad V X_t = i \frac{\lambda + \mu}{\lambda - \mu} e^{(\lambda-\mu)t} \left( e^{(\lambda-\mu)t} - 1 \right), \qquad \mu \ne \lambda. \]
5.3 Reversibility
In Section 4.5, we introduced a notion of reversibility for discrete time Markov chains. In this section,
we will discuss reversibility of continuous time Markov chains. As we shall see, most of the definitions
and theorems discussed below are direct analogs of the definitions and theorems given in Section 4.5.
Definition 5.3.1. Let X = (X_t)_{0≤t≤T} be a continuous time Markov chain. The time reversal of X is the process Y = (Y_t)_{0≤t≤T} defined by Y_t := X_{T−t}.
Compare Definition 5.3.1 for the time reversal of a continuous time Markov chain with Definition 4.5.1
for the time reversal of a discrete time Markov Chain. Notice the similarities.
Theorem 5.3.2. Let X = (X_t)_{0≤t≤T} be an irreducible continuous time Markov chain with generator G, transition semigroup P_t = e^{tG} and invariant distribution π, and suppose X_0 ∼ π. Let Y = (Y_t)_{0≤t≤T} be the time reversal of X. Then Y is a continuous time Markov chain with generator H and transition semigroup Q_t = e^{tH}, which satisfy
\[ h(i, j) = \frac{\pi(j)}{\pi(i)}\, g(j, i), \qquad q_t(i, j) = \frac{\pi(j)}{\pi(i)}\, p_t(j, i). \tag{5.15} \]
Rather than prove Theorem 5.3.2, we will explain intuitively why it is true. Fix a large N ∈ N, define δ := T/N ≪ 1 and consider the discrete-time process \widehat{X} = (\widehat{X}_n)_{0≤n≤N} where \widehat{X}_n := X_{nδ}. The process \widehat{X} is a discrete time Markov chain with transition probabilities \widehat{p}(i, j) := P(\widehat{X}_{n+1} = j \,|\, \widehat{X}_n = i) given by
\[ \widehat{p}(i, j) = \delta_{i,j} + g(i, j)\delta + O(\delta^2). \tag{5.16} \]
Now, let \widehat{Y} = (\widehat{Y}_n)_{0≤n≤N} be the time-reversal of \widehat{X}. That is, \widehat{Y}_n := \widehat{X}_{N-n} = X_{T-n\delta} = Y_{n\delta}. We know from Theorem 4.5.2 that \widehat{Y} is a discrete time Markov chain with transition probabilities \widehat{q}(i, j) given by
\[ \widehat{q}(i, j) = \frac{\pi(j)}{\pi(i)}\, \widehat{p}(j, i). \tag{5.17} \]
On the other hand, if Y is a continuous time Markov chain, then we must also have
\[ \widehat{q}(i, j) = \delta_{i,j} + h(i, j)\delta + O(\delta^2) \tag{5.18} \]
for some generating matrix H = (h(i, j))_{i,j∈S}. Inserting (5.16) and (5.18) in (5.17) we obtain
\[ \delta_{i,j} + h(i, j)\delta = \frac{\pi(j)}{\pi(i)} \big( \delta_{j,i} + g(j, i)\delta \big) + O(\delta^2). \]
The terms of order O(1) cancel (you can check they are equal both for i = j and for i ≠ j). In order for the O(δ) terms to match, we must have
\[ h(i, j) = \frac{\pi(j)}{\pi(i)}\, g(j, i), \]
which is the left-hand side of (5.15). Moreover, this relation gives
\[ \pi(i) h(i, j) = \pi(j) g(j, i) \qquad \Rightarrow \qquad \pi(i)(H^n)_{ij} = \pi(j)(G^n)_{ji}, \quad \forall\, n \in \mathbb{N}_0, \]
which implies
\[ \pi(i)(e^{tH})_{ij} = \pi(j)(e^{tG})_{ji}. \]
The right-hand side of (5.15) follows from (e^{tG})_{ij} = (P_t)_{ij} = p_t(i, j) and (e^{tH})_{ij} = (Q_t)_{ij} = q_t(i, j).
Definition 5.3.3. Let X = (X_t)_{0≤t≤T} be an irreducible continuous-time Markov chain with transition semigroup P_t = e^{tG} and invariant distribution π. Suppose X_0 ∼ π. Let Y be the time reversal of X and denote by Q_t = e^{tH} its transition semigroup. We say that X is reversible if P_t = Q_t for all t ∈ [0, T], or equivalently (by Theorem 5.3.2), if either of the following holds:
\[ \pi(i) g(i, j) = \pi(j) g(j, i), \qquad \text{or} \qquad \pi(i) p_t(i, j) = \pi(j) p_t(j, i), \quad \forall\, t \in [0, T]. \]
Once again, it is instructive to compare Definition 5.3.3 with its discrete time analog Definition 4.5.3.
Definition 5.3.4. Let P_t = e^{tG} be the transition semigroup of a continuous time Markov process X with state space S. We say that a distribution λ = (λ(1), λ(2), . . . , λ(|S|)) is in detailed balance with P_t or G if
\[ \lambda(i) g(i, j) = \lambda(j) g(j, i), \qquad \text{or} \qquad \lambda(i) p_t(i, j) = \lambda(j) p_t(j, i), \quad \forall\, t \in [0, T]. \tag{5.19} \]
Theorem 5.3.5. Let P_t = e^{tG} be the transition semigroup of an irreducible continuous time Markov chain X. Suppose there exists a distribution λ such that (5.19) holds. Then λ is a stationary distribution for X.
Theorem 5.3.7. Let X = (X_t)_{0≤t≤T} be an irreducible continuous-time Markov chain with state space S, transition semigroup P_t = e^{tG} and stationary distribution π. Then π is in detailed balance with G if and only if G satisfies the Kolmogorov cycle condition
\[ g(s_1, s_2) g(s_2, s_3) \cdots g(s_n, s_1) = g(s_1, s_n) g(s_n, s_{n-1}) \cdots g(s_2, s_1) \tag{5.20} \]
for every finite cycle of states s_1, s_2, . . . , s_n.
Proof. (⇒) Suppose π is in detailed balance with G, so that π(i)g(i, j) = π(j)g(j, i). Multiplying these relations around any cycle s_1, s_2, . . . , s_n, s_1, the factors of π cancel and (5.20) follows.
(⇐) The Kolmogorov cycle condition (5.20) is equivalent to the statement that the log-ratios of rates form a conservative (path-independent) field:
\[ 0 = \sum_{\text{cycle}\{s_k\}} \log \left( \frac{g(s_k, s_{k+1})}{g(s_{k+1}, s_k)} \right), \]
so that we may define a potential function U(s) by summing log(g(s_k, s_{k+1})/g(s_{k+1}, s_k)) along any path from a fixed reference state to s.
From the potential function U(s), we can construct a distribution λ = (λ(1), λ(2), . . . , λ(|S|)) by setting, for each state s ∈ S,
\[ \lambda(s) := A e^{U(s)}, \qquad A = \left( \sum_{s_k \in S} e^{U(s_k)} \right)^{-1}. \]
Note that A is just a normalization factor. It is easy to check that λ(i) and λ(j) satisfy
\[ \lambda(i) g(i, j) = \lambda(j) g(j, i) \]
for any states i and j. Therefore, by Definition 5.3.4, λ is in detailed balance with G. Additionally, by Theorem 5.3.5, λ is a stationary distribution for X. As X is irreducible, the stationary distribution is unique, which implies λ = π. Therefore, π is in detailed balance with G.
For any distributions µ and ν on some countable state space S we can define
\[ \text{Entropy:} \qquad H[\mu] := -\sum_{i \in S} \mu(i) \log \mu(i), \]
\[ \text{Relative Entropy:} \qquad K[\mu \| \nu] := \sum_{i \in S} \mu(i) \log \frac{\mu(i)}{\nu(i)}. \]
Consider a physical system with N ≫ 1 particles, each of which must be in some state s ∈ S. Let us define
\[ \mu_t(i) := \frac{\#\{\text{particles in state } i \text{ at time } t\}}{N}. \]
Then µt is a distribution on S. Suppose this system has an invariant distribution π. We can compute
the relative entropy of µt with respect to π. We have
\[ \underbrace{K[\mu_t \| \pi]}_{\text{free energy of } \mu_t} = -\underbrace{\left( -\sum_{i \in S} \mu_t(i) \log \mu_t(i) \right)}_{\text{entropy of } \mu_t} + \underbrace{\left( -\sum_{i \in S} \mu_t(i) \log \pi(i) \right)}_{\text{mean energy of } \mu_t}, \]
where we have indicated the names that the above quantities are given in physics.
Theorem 5.3.8. Let X be an irreducible Markov chain with semigroup P_t = e^{tG} and invariant distribution π. Suppose π is in detailed balance with G. Then we have
\[ \frac{d}{dt} K[\mu_t \| \pi] \le 0. \]
That is, the relative entropy K[µ_t || π] is non-increasing.
Proof. Take µ_t(j) = p_t(i, j) (the chain is started at state i). We compute
\[
\begin{aligned}
\frac{d}{dt} K[\mu_t \| \pi]
&= \frac{d}{dt} \sum_{j \in S} \mu_t(j) \log \frac{\mu_t(j)}{\pi(j)} = \sum_{j \in S} \left( 1 + \log \frac{\mu_t(j)}{\pi(j)} \right) \frac{d}{dt} \mu_t(j) \\
&= \frac{d}{dt} \sum_{j \in S} p_t(i, j) + \sum_{j \in S} \log \frac{p_t(i, j)}{\pi(j)} \frac{d}{dt} p_t(i, j) \\
&= \sum_{j \in S} \log \frac{p_t(i, j)}{\pi(j)} \frac{d}{dt} p_t(i, j) \qquad \Big( \text{as } \textstyle\sum_{j \in S} p_t(i, j) = 1 \Big) \\
&= \sum_{j \in S} \log \frac{p_t(i, j)}{\pi(j)} \sum_{k \in S} p_t(i, k) g(k, j) \qquad (\text{by the KFE}) \\
&= \sum_{j, k \in S} \log \frac{p_t(i, j)\, \pi(k)}{\pi(j)\, p_t(i, k)}\, p_t(i, k) g(k, j) + \underbrace{\sum_{k \in S} \log \frac{p_t(i, k)}{\pi(k)}\, p_t(i, k) \sum_{j \in S} g(k, j)}_{=\,0,\ \text{since } \sum_{j \in S} g(k, j) = 0} \\
&= \sum_{j, k \in S} \log \frac{p_t(i, j) g(j, k)}{p_t(i, k) g(k, j)}\, p_t(i, k) g(k, j) \qquad (\pi \text{ is in detailed balance with } G) \\
&= -\frac{1}{2} \sum_{j, k \in S} \underbrace{\log \frac{p_t(i, j) g(j, k)}{p_t(i, k) g(k, j)}}_{A}\, \underbrace{\big( p_t(i, j) g(j, k) - p_t(i, k) g(k, j) \big)}_{B} \\
&=: f(i) \le 0 \qquad (\text{as } A \text{ and } B \text{ have the same sign}),
\end{aligned}
\]
where the final symmetrized form follows by swapping the roles of j and k and averaging, which flips the sign of A while exchanging the two terms of B.
In statistical physics, the term A is called the thermodynamic force, the term B is called the thermodynamic flux, the product A·B is called the entropy production or free energy dissipation, and the result \frac{d}{dt} K[\mu_t \| \pi] \le 0 is the Second Law of Thermodynamics.
5.4 Exercises
Exercise 5.1. Patients arrive at an emergency room as a Poisson process with intensity λ. The time to
treat each patient is an independent exponential random variable with parameter µ. Let X = (Xt )t ≥0 be
the number of patients in the system (either being treated or waiting). Write down the generator of X.
Show that X has an invariant distribution π if and only if λ < µ. Find π. What is the total expected time (waiting plus treatment) a patient spends in the system when the system is in its invariant distribution?
Exercise 5.2. Let X = (Xt )t ≥0 be a Markov chain with stationary distribution π. Let N be an
independent Poisson process with intensity λ and denote by τn the time of the nth arrival of N. Define
Yn := Xτn + (i.e., Yn is the value of X immediately after the nth jump). Show that Y is a discrete time
Markov chain with the same stationary distribution as X.
Exercise 5.3. Let X = (X_t)_{t≥0} be a Markov chain with state space S = {0, 1, 2, . . .} and with a generator G whose ith row has entries
\[ g_{i,i-1} = i\mu, \qquad g_{i,i} = -i\mu - \lambda, \qquad g_{i,i+1} = \lambda, \]
with all other entries being zero (the zeroth row has only two entries: g_{0,0} and g_{0,1}). Assume X_0 = j. Find G_{X_t}(s) := E s^{X_t}. What is the distribution of X_t as t → ∞?
Exercise 5.4. Let N be a time-inhomogeneous Poisson process with intensity function λ(t). That is, the probability of a jump of size one in the time interval (t, t + dt) is λ(t)dt and the probability of two jumps in that interval of time is O(dt²). Write down the Kolmogorov forward and backward equations of N and solve them. Let N_0 = 0 and let τ_1 be the time of the first jump of N. If λ(t) = c/(1 + t), show that Eτ_1 < ∞ if and only if c > 1.
Exercise 5.5. Let N be a Poisson process with a random intensity Λ which is equal to λ_1 with probability p and λ_2 with probability 1 − p. Find G_{N_t}(s) = E s^{N_t}. What are the mean and variance of N_t?
Exercise 5.6. Let X = (Xt )0≤t ≤T be an irreducible continuous-time Markov Chain with state space S,
transition semigroup Pt = et G and stationary distribution π. Prove the following six statements are
equivalent
(i) Its stationary distribution satisfies detailed balance: π(i)g(i, j) = π(j)g(j, i), ∀ i, j ∈ S.
(ii) Any path connecting states i and j, i ≡ s_0, s_1, s_2, . . . , s_n ≡ j, has a path-independent log-ratio sum:
\[ \log \left( \frac{g(s_0, s_1)}{g(s_1, s_0)} \right) + \log \left( \frac{g(s_1, s_2)}{g(s_2, s_1)} \right) + \cdots + \log \left( \frac{g(s_{n-1}, s_n)}{g(s_n, s_{n-1})} \right) = \log \pi(s_n) - \log \pi(s_0). \]
(iii) It defines a time reversible stationary Markov process.
(iv) Its G matrix satisfies the Kolmogorov cycle condition for every sequence of states.
(v) There exists a positive diagonal matrix Π such that the matrix GΠ is symmetric.
(vi) Its stationary process has zero entropy production rate. The entropy production rate, e_p(µ_t), is defined by
\[ e_p(\mu_t) = \frac{1}{2} \sum_{i,j} \big( \mu_t(i) g(i, j) - \mu_t(j) g(j, i) \big) \log \left( \frac{\mu_t(i) g(i, j)}{\mu_t(j) g(j, i)} \right). \]
Chapter 6
Convergence of random variables
The notes from this chapter are primarily taken from (Grimmett and Stirzaker, 2001, Chapter 7). The
goals of this chapter are (i) to introduce various modes of convergence of random variables and (ii) to
establish when convergence of one mode implies convergence of another mode.
Example 6.1.1. Consider the infinite sequence of coin tosses described in Section 1.3.2. Suppose you have an initial wealth X_0. Just before the nth coin toss, you bet your entire wealth on the outcome of the nth toss being a heads. It is easy to see that, at time n, your wealth is given by
\[ X_n(\omega) = \begin{cases} 2^n X_0 & \text{if } \omega_i = H \text{ for } i = 1, 2, \ldots, n, \\ 0 & \text{else.} \end{cases} \]
Suppose the probability of a heads is p ∈ (0, 1). As each of the coin tosses are independent, we deduce that
\[ P(X_n = 2^n X_0) = p^n, \qquad P(X_n = 0) = 1 - p^n. \]
As p^n → 0 as n → ∞, we see that
\[ \lim_{n \to \infty} P(X_n = 0) = 1, \tag{6.1} \]
which suggests that X_n converges, in some sense, to X_∞ := 0.
From this computation one might expect that E|X_n − X_∞| → 0 as n → ∞. However, this is not correct, as
\[ E|X_n - X_\infty| = E X_n = 2^n X_0 \cdot p^n = (2p)^n X_0 \to \begin{cases} \infty & \text{if } p \in (1/2, 1), \\ X_0 & \text{if } p = 1/2, \\ 0 & \text{if } p \in (0, 1/2). \end{cases} \]
Thus, we see that (6.1) does not imply E|Xn – X∞ | → 0. This simple example forces us to consider more
carefully what we mean by “convergence” of a random variable.
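A quick simulation makes the example concrete (assuming NumPy; p = 0.6 and the sample sizes are illustrative): P(X_n = 0) rapidly approaches one, yet the sample mean of X_n tracks (2p)^n X_0, which blows up for p > 1/2.

```python
import numpy as np

rng = np.random.default_rng(3)
p, X0, trials = 0.6, 1.0, 200_000
for n in (5, 10, 20):
    all_heads = (rng.random((trials, n)) < p).all(axis=1)
    Xn = np.where(all_heads, 2.0**n * X0, 0.0)
    print(n, (Xn == 0).mean(), Xn.mean(), (2 * p) ** n * X0)
```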
Definition 6.2.1. Consider a sequence of random variables (X_n)_{n≥0} defined on a probability space (Ω, F, P). We say that X_n converges to X_∞ almost surely, written X_n \xrightarrow{a.s.} X_\infty, if
\[ P\left( \lim_{n \to \infty} |X_n - X_\infty| = 0 \right) = 1. \]
We say that X_n converges to X_∞ in L^p, written X_n \xrightarrow{L^p} X_\infty, if E|X_n|^p < ∞ for all n and
\[ \lim_{n \to \infty} E|X_n - X_\infty|^p = 0. \tag{6.2} \]
We say that X_n converges to X_∞ in probability, written X_n \xrightarrow{P} X_\infty, if
\[ \lim_{n \to \infty} P(|X_n - X_\infty| > \varepsilon) = 0, \qquad \forall\, \varepsilon > 0. \]
And we say that X_n converges to X_∞ in distribution, written X_n \xrightarrow{D} X_\infty, if
\[ \lim_{n \to \infty} F_{X_n}(x) = F_{X_\infty}(x) \quad \text{at every point } x \text{ at which } F_{X_\infty} \text{ is continuous.} \]
It will be helpful to comment a bit more on the above definitions. Observe that almost sure convergence does not require that X_n(ω) → X_∞(ω) for every ω ∈ Ω. It may be helpful to define
\[ A := \{ \omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X_\infty(\omega) \}. \]
Almost sure convergence simply asks that P(A) = 1; it does not ask that A = Ω. There may be infinitely many ω ∈ A^c (uncountably infinitely many, in fact). But, if X_n \xrightarrow{a.s.} X_\infty then P(A^c) = 0. At times, it is helpful to write that X_n → X_∞ P-a.s. to indicate that X_n converges to X_∞ almost surely under a specific probability measure P. Note that X_n → X_∞ P-a.s. does not imply that X_n → X_∞ \widetilde{P}-a.s. under a different measure \widetilde{P}.
For any p ≥ 1 we can define L^p(Ω, F, P), the space of random variables X : Ω → R with a finite pth moment: E|X|^p < ∞. On this space we define the p-norm \|X\|_p := (E|X|^p)^{1/p}. Then, from (6.2), we see that convergence in L^p is equivalent to convergence in the p-norm:
\[ \lim_{n \to \infty} \|X_n - X_\infty\|_p = 0. \]
As with almost sure convergence, it is sometimes necessary to specify a particular probability measure under which L^p convergence occurs by writing X_n → X_∞ in L^p(P) or, even more specifically, in L^p(Ω, F, P). It is common to say X_n → X_∞ in mean and in mean square when X_n → X_∞ in L¹ and L², respectively.
Convergence in probability is based on the following intuition: two random variables X and Y are "close to each other" if there is a low probability that their difference is large: P(|X − Y| > ε) ≈ 0 for ε ≪ 1. In some sense, the "distance" between X and Y is measured by d_ε(X, Y) := P(|X − Y| > ε). Convergence in probability simply asks that this measure of distance goes to zero for any ε > 0 as n → ∞. That is, lim_{n→∞} d_ε(X_n, X_∞) = 0.
Note that we have already defined convergence in distribution (see Definition 3.4.1). We have simply repeated the definition here for convenience. In a sense, convergence in distribution is the weakest of the four modes of convergence, as the definition makes no reference to the underlying probability space (Ω, F, P). Indeed, suppose Z ∼ N(0, 1) and X_n = −Z for all n. Then it is clear that F_{X_n} = F_Z for all n. But, it is easy to see that X_n does not converge to Z in probability, in L^p or almost surely. Convergence in distribution is sometimes alternatively called weak convergence, written X_n \xrightarrow{w} X_\infty, or convergence in law, written X_n \xrightarrow{L} X_\infty.
Sometimes, we may wish to see if a sequence of random variables (X_n) converges (in some sense) to a random variable X_∞, but we do not know what the random variable X_∞ is. In such cases, it can be helpful to check for Cauchy convergence of some sort.
Definition 6.2.2. Consider a sequence of random variables (Xn )n≥0 defined on a probability space
(Ω, F, P). We say that (Xn ) is Cauchy convergent almost surely if
P( lim |Xn – Xm | = 0) = 1.
n,m→∞
We say that (Xn ) is Cauchy convergent in Lp if E|Xn |p < ∞ for all n and
lim E|Xn – Xm |p = 0.
n,m→∞
Theorem 6.2.3. Consider a sequence of random variables (X_n) defined on a probability space (Ω, F, P). Then:
\[ \exists\, X_\infty \text{ s.t. } X_n \xrightarrow{a.s.} X_\infty \;\Leftrightarrow\; (X_n) \text{ is Cauchy convergent almost surely,} \]
\[ \exists\, X_\infty \text{ s.t. } X_n \xrightarrow{L^p} X_\infty \;\Leftrightarrow\; (X_n) \text{ is Cauchy convergent in } L^p, \]
\[ \exists\, X_\infty \text{ s.t. } X_n \xrightarrow{P} X_\infty \;\Leftrightarrow\; (X_n) \text{ is Cauchy convergent in probability.} \]
The usefulness of Theorem 6.2.3 is that it allows us to establish the existence of a limit X∞ of a sequence
of random variables (Xn ) (in some sense) without having knowledge of what X∞ is.
In Example 6.1.1 we saw that the wealth process (X_n) converged in probability to zero: X_n \xrightarrow{P} 0. Suppose P was such that P(ω_i = H) = p ∈ (0, 1/2). Then the wealth process (X_n) also converged in mean to zero: X_n → 0 in L¹(P). However, if \widetilde{P}(ω_i = H) = \widetilde{p} ∈ [1/2, 1), then (X_n) did not converge in L¹(\widetilde{P}). It would be very helpful to know when one mode of convergence implies convergence in another mode. This is the subject of the next section.
Theorem 6.3.1. Let (X_n) be a sequence of random variables defined on (Ω, F, P) and let X_∞ be a random variable on the same probability space. Suppose 1 ≤ q ≤ p. Then the following implications hold:
\[ \left. \begin{array}{r} X_n \xrightarrow{a.s.} X_\infty \\[4pt] X_n \xrightarrow{L^p} X_\infty \Rightarrow X_n \xrightarrow{L^q} X_\infty \end{array} \right\} \;\Rightarrow\; X_n \xrightarrow{P} X_\infty \;\Rightarrow\; X_n \xrightarrow{D} X_\infty. \]
Note: the right bracket "}" indicates that convergence in probability holds if X_n converges to X_∞ either in L^q or P-a.s.; one does not need X_n to converge to X_∞ both in L^q and P-a.s. in order for convergence in probability to hold. We will prove Theorem 6.3.1 in a series of steps. Along the way, we will introduce some related (and useful!) inequalities.
Proof that X_n \xrightarrow{P} X_\infty \Rightarrow X_n \xrightarrow{D} X_\infty. First we note that, if ε > 0, then
\[ F_{X_n}(x) = P(X_n \le x) = P(X_n \le x, X_\infty \le x + \varepsilon) + P(X_n \le x, X_\infty > x + \varepsilon) \le F_{X_\infty}(x + \varepsilon) + P(|X_n - X_\infty| > \varepsilon). \tag{6.3} \]
If you find the above inequality difficult to derive, it may help to draw a picture of the regions in the X_n-X_∞ plane to which the above probabilities correspond. Likewise, we have
\[ F_{X_\infty}(x - \varepsilon) = P(X_\infty \le x - \varepsilon) = P(X_\infty \le x - \varepsilon, X_n \le x) + P(X_\infty \le x - \varepsilon, X_n > x) \le F_{X_n}(x) + P(|X_n - X_\infty| > \varepsilon). \tag{6.4} \]
Combining (6.3) and (6.4), letting n → ∞ (so that P(|X_n − X_∞| > ε) → 0) and then letting ε → 0, we find
\[ F_{X_\infty}(x^-) \le \liminf_n F_{X_n}(x) \le \limsup_n F_{X_n}(x) \le F_{X_\infty}(x), \]
which implies that lim_{n→∞} F_{X_n}(x) = F_{X_∞}(x) at any point x where F_{X_∞} is continuous. And this establishes that X_n \xrightarrow{D} X_\infty.
Theorem 6.3.2 (Hölder’s inequality). Let X and Y be random variables defined on (Ω, F, P) and
let p, q ≥ 1 satisfy 1/p + 1/q = 1. Then we have
Corollary 6.3.3 (Lyapunov’s inequality). Let Z be a random variable defined on (Ω, F, P) and
suppose 1 ≤ q ≤ p. Then we have
Proof. Let r , p ≥ 1. Applying Hölder’s inequality (6.5) with Y = 1 and X = |Z|r we obtain
Lemma 6.3.4 (Markov’s inequality). Let Z be a random variable defined on (Ω, F, P). For any
a > 0 we have
where, to deduce the last equality, we have used the fact that X_n \xrightarrow{L^p} X_\infty implies X_n \xrightarrow{L^1} X_\infty.
Theorem 6.4.2 (Continuity of Probability Measures). Fix a probability space (Ω, F, P). Suppose (A_n) is a sequence of events in F. Then
\[ A_n \uparrow A \;\Rightarrow\; \lim_{n \to \infty} P(A_n) = P\left( \bigcup_{k=1}^{\infty} A_k \right) = P(A), \]
\[ A_n \downarrow A \;\Rightarrow\; \lim_{n \to \infty} P(A_n) = P\left( \bigcap_{k=1}^{\infty} A_k \right) = P(A). \]
Proof (sketch). For the first statement, define the disjoint sets
\[ B_n := A_n \setminus A_{n-1}, \qquad \forall\, n \in \mathbb{N}, \qquad A_0 = \emptyset, \]
so that P(A_n) = \sum_{k=1}^n P(B_k) \to \sum_{k=1}^{\infty} P(B_k) = P(\bigcup_k B_k) = P(A) by countable additivity. The second statement follows by taking complements.
To get a more intuitive understanding of the above definitions, it may be helpful to note the following:
\[ \liminf_n A_n = \{ \omega : \omega \in A_n \text{ for all but finitely many } n \}, \qquad \limsup_n A_n = \{ \omega : \omega \in A_n \text{ for infinitely many } n \}. \]
Clearly, we have
\[ \liminf_n A_n \subseteq \limsup_n A_n. \]
Note that the lim inf and lim sup of a sequence of sets always exist (just like the lim inf and lim sup of a sequence of real numbers always exist). To see this, observe that, if we define
\[ B_m := \bigcup_{n=m}^{\infty} A_n, \qquad C_m := \bigcap_{n=m}^{\infty} A_n, \]
then (B_m) decreases and (C_m) increases, so that \limsup_n A_n = \bigcap_m B_m and \liminf_n A_n = \bigcup_m C_m are well-defined. As an example, consider the sequence of sets
\[ A_{2n} = B, \qquad A_{2n+1} = C, \qquad n = 0, 1, 2, \ldots, \]
where B and C are some arbitrary sets in a sample space Ω. To compute \liminf_n A_n consider the following question: for which ω ∈ Ω does the event ω ∈ A_n occur all but finitely many times? The answer is all of the ω ∈ B ∩ C. To compute \limsup_n A_n consider the following question: for which ω ∈ Ω does the event ω ∈ A_n occur infinitely often? The answer is all of the ω ∈ B ∪ C. Thus, we have
\[ \liminf_n A_n = B \cap C, \qquad \limsup_n A_n = B \cup C. \]
We can use the Borel–Cantelli Lemma to establish that fast convergence in probability implies almost sure convergence.
Theorem 6.6.2. Suppose X_n \xrightarrow{P} X_\infty and
\[ \sum_{n=1}^{\infty} P(|X_n - X_\infty| > \varepsilon) < \infty, \qquad \forall\, \varepsilon > 0. \]
Then X_n \xrightarrow{a.s.} X_\infty.
Proof. Fix ε > 0 and take A_n = \{|X_n - X_\infty| > \varepsilon\}. We have by assumption that \sum_n P(A_n) < \infty. It follows from the Borel–Cantelli Lemma 6.6.1 that P(\limsup_n A_n) = 0. That is, with probability one, |X_n − X_∞| > ε for only finitely many n. As ε > 0 was arbitrary, we conclude that X_n \xrightarrow{a.s.} X_\infty.
Definition 6.7.1. Let X = (Xn )n∈N0 be a sequence of random variables defined on a probability space
(Ω, F, P). The sequence X is said to be a martingale with respect to a filtration (Fn )n∈N0 if the following
hold:
(i) E|Xn | < ∞ for all n ∈ N0 ,
(ii) E[Xn+m |Fn ] = Xn for all n, m ∈ N0 . Alternatively, E[Xn+1 |Fn ] = Xn for all n ∈ N0 .
The special structure of martingales will allow us to prove some very powerful theorems about the
behavior of a martingale X as n → ∞. Before stating and proving these theorems, however, let us take a
look at some example of discrete-time martingales.
Example 6.7.2 (Martingales from branching processes). Suppose Z_n is the size of the nth generation of a branching process with Z_0 = 1 (see Section 3.2). We have
\[ Z_{n+1} = \sum_{i=1}^{Z_n} X_{n,i}, \]
where the (X_{n,i}) are iid with common distribution X. Let F_n := σ(Z_0, Z_1, . . . , Z_n). We have
\[ E[Z_{n+1}|F_n] = E\left[ \sum_{i=1}^{Z_n} X_{n,i} \,\Big|\, Z_n \right] = Z_n\, EX = Z_n \mu, \qquad \mu := EX. \]
From this, one obtains by induction that EZ_n = µ^n. Clearly the process Z is not a martingale when µ ≠ 1. However, consider W_n := µ^{−n} Z_n. We have
\[ E[W_{n+1}|F_n] = \mu^{-(n+1)} E[Z_{n+1}|F_n] = \mu^{-(n+1)} Z_n \mu = W_n. \]
Thus, the process W = (W_n) is a martingale with respect to (F_n). Next, suppose η is the smallest solution of η = G(η), where G(s) := E s^X is the probability generating function of X. Defining V_n := η^{Z_n}, we find
\[ E[V_{n+1}|F_n] = E[\eta^{Z_{n+1}}|F_n] = E\left[ \eta^{\sum_{i=1}^{Z_n} X_{n,i}} \,\Big|\, F_n \right] = (E\eta^X)^{Z_n} = (G(\eta))^{Z_n} = \eta^{Z_n} = V_n, \]
so V = (V_n) is also a martingale with respect to (F_n).
Example 6.7.3 (Martingales from Markov chains). Let X = (X_n) be a discrete time Markov chain with state space S and one-step transition matrix P = (p(i, j)). Suppose ψ is a right eigenvector of P with corresponding eigenvalue λ = 1. That is,
\[ P\psi = \psi, \qquad \text{or component-wise} \qquad \sum_{j \in S} p(i, j) \psi(j) = \psi(i), \qquad \forall\, i \in S. \]
Then the process M = (M_n) defined by M_n := ψ(X_n) is a martingale with respect to the natural filtration (F_n) of X, since
\[ E[M_{n+1}|F_n] = E[\psi(X_{n+1})|X_n] = \sum_{j \in S} p(X_n, j) \psi(j) = \psi(X_n) = M_n. \]
To prove the main result of this section (Theorem 6.7.5) we require the following result.
Theorem 6.7.4 (Doob–Kolmogorov inequality). Let M = (M_n) be a martingale with respect to a filtration (F_n) with EM_n² < ∞ for all n. Then, for any ε > 0,
\[ P\left( \sup_{0 \le i \le n} |M_i| \ge \varepsilon \right) \le \frac{EM_n^2}{\varepsilon^2}. \]
Proof. Define the events
\[ A_k := \{ |M_i| < \varepsilon \ \forall\, i \le k \}, \qquad B_k := \{ |M_k| \ge \varepsilon, \ |M_i| < \varepsilon \ \forall\, i < k \}. \]
Observe that A_k is the event that |M| does not reach ε prior to time k, and B_k is the event that |M| reaches or exceeds ε for the first time at time k. Thus, for any k, the events A_k, B_1, B_2, . . . , B_k are a partition of Ω. That is, for all k, we have
\[ \Omega = A_k \cup \bigcup_{i=1}^k B_i, \qquad A_k \cap B_i = \emptyset, \quad i \le k, \qquad B_i \cap B_j = \emptyset, \quad i \ne j. \]
It follows that
\[ EM_n^2 = EM_n^2 \mathbb{1}_{A_n} + \sum_{i=1}^n EM_n^2 \mathbb{1}_{B_i} \ge \sum_{i=1}^n EM_n^2 \mathbb{1}_{B_i}. \]
Moreover, we have
\[ EM_n^2 \mathbb{1}_{B_i} = \underbrace{E(M_n - M_i)^2 \mathbb{1}_{B_i}}_{\alpha} + \underbrace{2 E M_i (M_n - M_i) \mathbb{1}_{B_i}}_{\beta} + \underbrace{E M_i^2 \mathbb{1}_{B_i}}_{\gamma}. \]
Let us look at these three terms one-by-one. Clearly, the first term satisfies α ≥ 0. Next, using the fact that M is a martingale, the second term satisfies
\[ \beta = 2 E\big[ M_i \mathbb{1}_{B_i} E[(M_n - M_i)|F_i] \big] = 0, \]
and, since |M_i| ≥ ε on B_i, the third term satisfies γ ≥ ε² P(B_i). Hence
\[ EM_n^2 \ge \sum_{i=1}^n EM_n^2 \mathbb{1}_{B_i} \ge \varepsilon^2 \sum_{i=1}^n P(B_i) = \varepsilon^2 P\left( \sup_{0 \le i \le n} |M_i| \ge \varepsilon \right), \]
which proves the claim.
Theorem 6.7.5. Let M = (M_n) be a martingale with respect to a filtration (F_n). Suppose that sup_n EM_n² < ∞. Then there exists a random variable M_∞ such that
\[ M_n \xrightarrow{a.s.} M_\infty \qquad \text{and} \qquad M_n \xrightarrow{L^2} M_\infty. \]
Proof. First, observe that
\[ EM_{n+m}^2 = E(M_{n+m} - M_n + M_n)^2 = E(M_{n+m} - M_n)^2 + EM_n^2 + 2 E\big[ M_n E[(M_{n+m} - M_n)|F_n] \big] = E(M_{n+m} - M_n)^2 + EM_n^2 \ge EM_n^2. \]
Thus, the sequence (EM_n²) is non-decreasing and (by assumption) bounded. The sequence therefore has a limit ℓ := lim_{n→∞} EM_n². We will now show that the sequence (M_n) is Cauchy convergent almost surely (see Definition 6.2.2). By Theorem 6.2.3, this will establish that M_n \xrightarrow{a.s.} M_\infty. Define C as the set of ω ∈ Ω for which the sequence (M_n(ω)) is Cauchy:
\[ C = \bigcap_{\varepsilon > 0} \bigcup_{m=1}^{\infty} \{ |M_{m+i} - M_m| < \varepsilon \ \forall\, i \in \mathbb{N} \}. \]
The complement of C is given by
\[ C^c = \bigcup_{\varepsilon > 0} \bigcap_{m=1}^{\infty} \{ \exists\, i \in \mathbb{N} \text{ s.t. } |M_{m+i} - M_m| \ge \varepsilon \} = \bigcup_{\varepsilon > 0} \bigcap_{m=1}^{\infty} A_m(\varepsilon), \qquad A_m(\varepsilon) := \{ \exists\, i \in \mathbb{N} \text{ s.t. } |M_{m+i} - M_m| \ge \varepsilon \}. \]
Thus, to prove that P(C^c) = 0, it is sufficient to show that P(A_m(ε)) → 0 as m → ∞ for all ε > 0. To this end, for a fixed m ∈ N define the sequence Y = (Y_n) by Y_n := M_{n+m} − M_m. The process (Y_n) is a martingale with respect to (G_n), where G_n = σ(Y_1, Y_2, . . . , Y_n), since
\[ E[Y_{n+1}|G_n] = E\big[ E[M_{n+m+1} - M_m | F_{n+m}] \,\big|\, G_n \big] = E[M_{n+m} - M_m | G_n] = Y_n. \]
As the process Y is a martingale, we have by the Doob–Kolmogorov inequality (Theorem 6.7.4) that
\[ P(A_m(\varepsilon)) = \lim_{n \to \infty} P\left( \sup_{1 \le i \le n} |Y_i| \ge \varepsilon \right) \le \lim_{n \to \infty} \frac{EY_n^2}{\varepsilon^2} = \lim_{n \to \infty} \frac{EM_{n+m}^2 - EM_m^2}{\varepsilon^2} = \frac{\ell - EM_m^2}{\varepsilon^2} \to 0 \quad \text{as } m \to \infty, \]
which establishes almost sure convergence. Finally, E(M_{n+m} − M_n)² = EM_{n+m}² − EM_n² → 0 as n, m → ∞, so (M_n) is also Cauchy convergent in L², and M_n \xrightarrow{L^2} M_\infty by Theorem 6.2.3.
Example 6.7.6. Suppose (X_n) is a Markov chain with one-step transition probabilities given by
\[ p(i, j) = P(X_{n+1} = j \,|\, X_n = i) = \binom{N}{j} \left( \frac{i}{N} \right)^j \left( 1 - \frac{i}{N} \right)^{N-j}, \qquad i, j \in \{0, 1, \ldots, N\}. \]
Let F_n = σ(X_1, X_2, . . . , X_n). The process (X_n) is a martingale with respect to the filtration (F_n) because
\[ E[X_{n+1}|F_n] = \sum_{j=0}^{N} j\, p(X_n, j) = \sum_{j=0}^{N} j \binom{N}{j} \left( \frac{X_n}{N} \right)^j \left( 1 - \frac{X_n}{N} \right)^{N-j} = N \cdot \frac{X_n}{N} = X_n, \]
where the last equality is the mean of the binomial distribution: X_{n+1}|X_n ∼ Bin(N, X_n/N).
Clearly, as (X_n) is bounded, we have sup_n EX_n² < ∞. As such, by Theorem 6.7.5 there exists a random variable X_∞ such that X_n \xrightarrow{a.s.} X_\infty and X_n \xrightarrow{L^2} X_\infty. To find the distribution of X_∞, we note that both {0} and {N} are absorbing states. As there are no other absorbing states, we conclude that (X_n) will eventually end up in one of these two states. Thus, using the martingale property X_0 = EX_n, we find
\[ X_0 = \lim_{n \to \infty} EX_n = E X_\infty = N \cdot P(X_\infty = N) + 0 \cdot P(X_\infty = 0), \]
where passing the limit through the expectation is allowed by Lebesgue's dominated convergence (Theorem 6.3.5). From the above computation, we conclude that
\[ P(X_\infty = N) = \frac{X_0}{N}, \qquad P(X_\infty = 0) = 1 - \frac{X_0}{N}. \]
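This chain (a Wright–Fisher-type model) is straightforward to simulate. A minimal sketch, assuming NumPy (N, X_0 and the run lengths are illustrative), that checks the absorption probabilities just computed:

```python
import numpy as np

rng = np.random.default_rng(4)
N, X0, trials, steps = 20, 5, 20_000, 500

X = np.full(trials, X0)
for _ in range(steps):
    X = rng.binomial(N, X / N)       # X_{n+1} | X_n ~ Bin(N, X_n / N)

print((X == N).mean(), X0 / N)       # P(X_inf = N) should be X_0/N = 0.25
print((X == 0).mean(), 1 - X0 / N)   # P(X_inf = 0) should be 0.75
```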
6.8 Uniform integrability
Example 6.8.1. Suppose (Ω, F, P) = ([0, 1], B([0, 1]), Leb) so that P(dω) = dω. Define a sequence of random variables (X_n) by
\[ X_n(\omega) := n \mathbb{1}_{(0, 1/n)}(\omega). \]
It is easy to see that X_n \xrightarrow{P} 0 as, for any ε > 0, we have
\[ P(|X_n| > \varepsilon) \le 1/n \to 0, \qquad n \to \infty. \]
In fact, in this case, we have X_n \xrightarrow{a.s.} 0 as well. However, we clearly do not have L¹ convergence of X_n → 0 because
\[ E|X_n - 0| = E X_n = n \cdot \tfrac{1}{n} = 1, \qquad \forall\, n. \]
It is natural to ask: are there conditions under which X_n \xrightarrow{P} X_\infty implies X_n \xrightarrow{L^1} X_\infty? It turns out that the answer is "yes." We introduce these conditions in this section.
Definition 6.8.2. A collection of random variables (X_n) (possibly uncountable) is said to be uniformly integrable (UI) if, for every ε > 0, there exists K_ε ∈ [0, ∞) such that
\[ \sup_n E\big[ |X_n| \mathbb{1}_{\{|X_n| > K_\varepsilon\}} \big] < \varepsilon. \]
Example 6.8.3. Let us return to Example 6.8.1, discussed above. For every n we have E|X_n| = 1. Thus, the collection of random variables (X_n) is integrable. However, we have
\[ E|X_n| \mathbb{1}_{\{|X_n| > K\}} = 1 \qquad \text{for every } n > K. \]
Thus, for a given ε ≤ 1 there is no K_ε ∈ [0, ∞) for which sup_n E|X_n| \mathbb{1}_{\{|X_n| > K_\varepsilon\}} < ε. As such, we conclude that the collection (X_n) is not UI.
Let us provide some simple conditions which, if satisfied by a collection of random variables, ensure uniform integrability.
Theorem 6.8.4. Let (X_n) be a collection of random variables. If sup_n E|X_n|^p < ∞ for some p > 1, then (X_n) is UI.
Proof. Assuming K > 0 and p > 1, on the event {|X_n| > K} we have |X_n|^{1−p} ≤ K^{1−p}, and thus |X_n| \mathbb{1}_{\{|X_n| > K\}} \le K^{1-p} |X_n|^p \mathbb{1}_{\{|X_n| > K\}}. Hence
\[ \sup_n E|X_n| \mathbb{1}_{\{|X_n| > K\}} \le K^{1-p} \sup_n E|X_n|^p =: K^{1-p} c, \qquad c := \sup_n E|X_n|^p < \infty. \]
The right-hand side can be made smaller than ε by choosing K > K_ε := (c/ε)^{1/(p-1)}. Thus, we have that (X_n) is UI.
Theorem 6.8.5. Let (X_n) be a collection of random variables. Suppose there exists a random variable Y with EY < ∞ such that |X_n| ≤ Y for all n. Then (X_n) is UI.
Proof. By assumption |X_n| ≤ Y, which implies \mathbb{1}_{\{|X_n| > K\}} \le \mathbb{1}_{\{Y > K\}}. Hence, we have
\[ E|X_n| \mathbb{1}_{\{|X_n| > K\}} \le E Y \mathbb{1}_{\{Y > K\}}, \qquad \forall\, n. \]
Because EY < ∞ by assumption, for any ε > 0, there exists K_ε such that E Y \mathbb{1}_{\{Y > K_\varepsilon\}} < \varepsilon. Thus, by choosing K > K_ε above, we can ensure sup_n E|X_n| \mathbb{1}_{\{|X_n| > K\}} < \varepsilon. It follows that the collection of random variables (X_n) is UI.
Theorem 6.8.6 (Bounded convergence). Let (X_n) be a sequence of random variables and let X_∞ be a random variable. Suppose that X_n \xrightarrow{P} X_\infty and that, for some K < ∞, we have
\[ \sup_n |X_n| \le K. \]
Then X_n \xrightarrow{L^1} X_\infty.
Proof. First, we show that P(|X_∞| ≤ K) = 1. For any k ∈ N we have, since |X_∞| ≤ |X_∞ − X_n| + K,
\[ P(|X_\infty| > K + 1/k) \le P(|X_n - X_\infty| > 1/k). \]
The right-hand side goes to zero as n → ∞ because (by assumption) X_n \xrightarrow{P} X_\infty. Thus, we have P(|X_∞| > K + 1/k) = 0. Now, observe that
\[ P(|X_\infty| > K) = P\left( \bigcup_k \{ |X_\infty| > K + 1/k \} \right) = \lim_{k \to \infty} P(|X_\infty| > K + 1/k) = 0. \]
Hence, P(|X_∞| ≤ K) = 1. Next, because X_n \xrightarrow{P} X_\infty, for any ε > 0 we can choose n_ε such that P(|X_n − X_∞| > ε) < ε for all n ≥ n_ε. Then, for n ≥ n_ε,
\[ E|X_n - X_\infty| = E|X_n - X_\infty| \mathbb{1}_{\{|X_n - X_\infty| \le \varepsilon\}} + E|X_n - X_\infty| \mathbb{1}_{\{|X_n - X_\infty| > \varepsilon\}} \le \varepsilon + 2K \varepsilon, \]
where we have used |X_n − X_∞| ≤ 2K. As ε was arbitrary, we can make E|X_n − X_∞| as small as we like. Hence, X_n \xrightarrow{L^1} X_\infty.
We can now state and prove the main result of this section.
Theorem 6.8.7. Let (X_n) be a sequence of random variables and let X_∞ be a random variable. Suppose that E|X_n| < ∞ for every n and E|X_∞| < ∞. Then
\[ X_n \xrightarrow{L^1} X_\infty \qquad \Leftrightarrow \qquad X_n \xrightarrow{P} X_\infty \ \text{ and } \ (X_n) \text{ is UI.} \]
Proof. We have already proved the ⇒ part² (see Theorem 6.3.1). So, we focus now on the ⇐ part. To this end, suppose X_n \xrightarrow{P} X_\infty and (X_n) is UI. For any K ∈ [0, ∞), define φ_K : R → [−K, K] as follows:
\[ \varphi_K(x) := \begin{cases} -K & x < -K, \\ x & |x| \le K, \\ K & x > K. \end{cases} \]
Observe that
\[ |\varphi_K(X) - X| \le |X| \mathbb{1}_{\{|X| > K\}}. \tag{6.12} \]
Fix ε > 0. From (6.12) and the fact that (X_n) is UI and X_∞ is integrable, there exists a K_ε such that
\[ \sup_n E|\varphi_{K_\varepsilon}(X_n) - X_n| < \varepsilon/3 \qquad \text{and} \qquad E|\varphi_{K_\varepsilon}(X_\infty) - X_\infty| < \varepsilon/3. \]
Now, X_n \xrightarrow{P} X_\infty implies that φ_{K_ε}(X_n) \xrightarrow{P} φ_{K_ε}(X_∞). And furthermore, sup_n |φ_{K_ε}(X_n)| ≤ K_ε. Thus, it follows from Theorem 6.8.6 that φ_{K_ε}(X_n) \xrightarrow{L^1} φ_{K_ε}(X_∞). Thus, there exists an n_ε such that E|φ_{K_ε}(X_n) − φ_{K_ε}(X_∞)| < ε/3 for all n ≥ n_ε, and hence, by the triangle inequality,
\[ E|X_n - X_\infty| \le E|X_n - \varphi_{K_\varepsilon}(X_n)| + E|\varphi_{K_\varepsilon}(X_n) - \varphi_{K_\varepsilon}(X_\infty)| + E|\varphi_{K_\varepsilon}(X_\infty) - X_\infty| < \varepsilon. \]
As ε was arbitrary, it follows that X_n \xrightarrow{L^1} X_\infty, as claimed.
Remark 6.8.8. From Theorem 6.3.1, we know that X_n \xrightarrow{a.s.} X_\infty implies X_n \xrightarrow{P} X_\infty. From this and Theorem 6.8.7, it follows that
\[ X_n \xrightarrow{a.s.} X_\infty \ \text{ and } \ (X_n) \text{ is UI} \qquad \Rightarrow \qquad X_n \xrightarrow{L^1} X_\infty. \]
2 More specifically, we have proved that convergence in L1 implies convergence in probability, but it also implies uniform
integrability.
We will show below that the set of conditional expectations of an integrable random variable forms a class of uniformly integrable random variables. To prove this result, we require the following lemma.
Lemma 6.8.9. Suppose E|X| < ∞. Then, for all ε > 0 there exists δ_ε > 0 such that P(A) < δ_ε implies E|X|\mathbb{1}_A < ε.
Proof. Suppose, by contradiction, that E|X| < ∞ and, for a given ε > 0, there exists a sequence of sets (A_n) such that
\[ P(A_n) \le 2^{-n} \qquad \text{and} \qquad E|X| \mathbb{1}_{A_n} \ge \varepsilon \qquad \text{for every } n. \]
Set B_m := \bigcup_{n \ge m} A_n, so that E|X|\mathbb{1}_{B_m} \ge \varepsilon for every m. By the Borel–Cantelli Lemma, the set A := \bigcap_m B_m satisfies P(A) = 0. But B_m ↓ A, so by dominated convergence E|X|\mathbb{1}_{B_m} \to E|X|\mathbb{1}_A = 0, a contradiction.
Theorem 6.8.10. Let X be a random variable on (Ω, F, P) and suppose E|X| < ∞. Then the collection C of random variables defined by
\[ \mathcal{C} := \{ E[X|\mathcal{G}] : \mathcal{G} \subseteq \mathcal{F} \text{ a sub-}\sigma\text{-algebra} \} \]
is uniformly integrable.
Proof. Fix ε > 0. From Lemma 6.8.9 there exists δ_ε > 0 such that P(A) < δ_ε implies E|X|\mathbb{1}_A < ε. As E|X| < ∞ we can choose K_ε > 0 so that E|X|/K_ε < δ_ε. Now, for any G ⊆ F, we can define Y = E(X|G). By Jensen's inequality (see Theorem 2.3.4) we have
\[ |Y| \le E[|X| \,|\, \mathcal{G}], \qquad \text{so that} \qquad P(|Y| > K_\varepsilon) \le \frac{E|Y|}{K_\varepsilon} \le \frac{E|X|}{K_\varepsilon} < \delta_\varepsilon, \]
where we have used Markov's inequality, so that
\[ E|Y| \mathbb{1}_{\{|Y| > K_\varepsilon\}} \le E\big[ E[|X| \,|\, \mathcal{G}] \mathbb{1}_{\{|Y| > K_\varepsilon\}} \big] = E|X| \mathbb{1}_{\{|Y| > K_\varepsilon\}} < \varepsilon. \]
As the above inequality holds for any G ⊆ F, the class C is UI, as claimed.
6.9 Exercises
Exercise 6.1. Let X1 , X2 , X3 , ... be a sequence of random variables such that
Exercise 6.2. Consider the sample space Ω = [0, 1] with uniform probability distribution, i.e.,
Exercise 6.3. Let {Xn , n = 1, 2, ...} and {Yn , n = 1, 2, ...} be two sequences of random variables, defined
on some probability space (Ω, F, P). Suppose that we know
\[ X_n \xrightarrow{a.s.} X, \qquad Y_n \xrightarrow{a.s.} Y. \]
Prove that X_n + Y_n \xrightarrow{a.s.} X + Y.
Exercise 6.4. Let {Xn , n = 1, 2, ...} and {Yn , n = 1, 2, ...} be two sequences of random variables, defined
on some probability space (Ω, F, P). Suppose that we know
\[ X_n \xrightarrow{P} X, \qquad Y_n \xrightarrow{P} Y. \]
Prove that X_n + Y_n \xrightarrow{P} X + Y.
Exercise 6.5. Show that if (X_n) is any sequence of random variables, there are constants c_n → ∞ so that X_n/c_n \xrightarrow{a.s.} 0.
Exercise 6.6. Let X_1, X_2, . . . be independent with P(X_n = 1) = p_n and P(X_n = 0) = 1 − p_n. Show that
(a) X_n \xrightarrow{P} 0 if and only if p_n → 0;
(b) X_n \xrightarrow{a.s.} 0 if and only if \sum_n p_n < \infty.
Exercise 6.7. Suppose that X_1, X_2, . . . are independent with P(X_n > x) = x^{−5} for all x ≥ 1 and n = 1, 2, . . .. Show that \limsup_{n \to \infty} (\log X_n)/\log n = c almost surely for some number c, and find c.
Chapter 7
Brownian motion
The notes from this chapter are primarily taken from (Shreve, 2004, Chapter 3). The goals of this
chapter are (i) to define what we mean by “Brownian motion” and (ii) to develop important properties of
Brownian motion.
7.1 Scaled random walks
To construct Brownian motion, we begin with a symmetric random walk. Define
\[ X_j := \begin{cases} 1 & \omega_j = H, \\ -1 & \omega_j = T, \end{cases} \qquad M_0 := 0, \qquad M_n := \sum_{j=1}^n X_j, \]
where ω_i is the result of the ith coin toss. We take F_n to be the σ-algebra generated by observing the first n coin tosses and we set F = σ(∪_n F_n). The coin tosses are assumed to be independent and we take P(ω_i = H) = P(ω_i = T) = 1/2.
Let k_i < k_{i+1} for all i ∈ N_0. As the (X_i)_{i∈N} are independent, we clearly have that the increments (M_{k_{i+1}} − M_{k_i})_{i∈N_0} are independent, with
\[ E(M_{k_{i+1}} - M_{k_i}) = 0, \qquad V(M_{k_{i+1}} - M_{k_i}) = k_{i+1} - k_i. \tag{7.1} \]
Thus, for the discrete time process M = (Mi )i ≥0 we see that variance accumulates at a rate of one per
unit time.
Next, we note that M is a martingale with respect to the filtration F = (F_n)_{n∈N_0}. To see this, let k ≤ l and note that
\[ E[M_l | F_k] = E[M_l - M_k | F_k] + E[M_k | F_k] = E(M_l - M_k) + M_k = M_k, \]
where we have used the independent increments property (M_l − M_k) ⊥⊥ F_k, equation (7.1) and the fact that M_k ∈ F_k.
The quadratic variation of the symmetric random walk M up to time k, denoted [M, M]_k, is defined as
\[ [M, M]_k := \sum_{j=1}^k (M_j - M_{j-1})^2 = \sum_{j=1}^k X_j^2 = k, \]
where we have used X_j² = 1. The astute reader will notice that [M, M]_k = VM_k. However, it is important to note that the computation of variance and the computation of quadratic variation are different! To see this, note that if P(ω_i = H) = p ≠ 1/2 then VX_i ≠ 1 and thus VM_k ≠ k. However, since X_j² = 1 is unaffected by the value of p = P(ω_i = H), the computation of [M, M]_k is also unaffected by p. Another way to see that VM_k and [M, M]_k are different is to note that VM_k is a statistical quantity (i.e., it is an average over all ω), whereas [M, M]_k is computed ω-by-ω (it just turns out that, for each ω, we have [M, M]_k(ω) = k).
Let 0 = t0 < t1 < t2 < . . . and suppose ntj ∈ N for every j . Then we have
(n) (n) (n) (n)
independent increments : (Wt2 – Wt1 ) ⊥⊥ (Wt4 – Wt3 ),
7.1. SCALED RANDOM WALKS 105
The scaled random walk W(n) , restricted to the set of t for which nt ∈ N0 , is a martingale with respect
to the filtration F(n) . To see this, assume 0 ≤ s ≤ t are such that ns ∈ N and nt ∈ N. Then we have
(n) (n) (n) (n) (n) (n) (n) (n) (n) (n) (n)
E[Wt |Fs ] = E[Wt – Ws + Ws |Fs ] = E[Wt – Ws |Fs ] + E[Ws |Fs ]
(n) (n) (n) (n)
= E(Wt – Ws ) + Ws = Ws ,
The quadratic variation of the scaled symmetric random walk W(n) up to time t , denoted [W(n) , W(n) ]t ,
is defined as
nt nt 2 nt
(n) (n)
[W(n) , W(n) ]t W(j –1)/n )2 √1 Xj 1
X X X
:= (Wj /n – = n
= n = t,
j =1 j =1 j =1
(n)
where we again assume nt ∈ N0 . Thus, for the scaled symmetric random walk, we see that VWt =
[W(n) , W(n) ]t . However, we emphasize one more time that the computation of variance is a statistical
average over all ω and the computation of quadratic variation is done ω-by-ω.
(n)
Theorem 7.1.1. Fix t ≥ 0. Define a random variable Wt := limn→∞ Wt . Then Wt ∼ N(0, t ).
The limit as n → ∞ is in fact not easy to show. Nevertheless, the limit above is correct. From Example
3.3.4, we know that if a random variable Z is normally distributed Z ∼ N(0, t ) then its characteristic
function φZ is given by
1 2
φZ (u) = e– 2 tu .
(n) D
As limn→∞ φ (n) → φZ , we have from the Continuity Theorem 3.3.8 that Wt → N(0, t ), as claimed.
Wt
1. W0 = 0.
2. If 0 ≤ r < s < t < u < ∞ then (Wu – Wt ) ⊥⊥ (Ws – Wr ).
3. If 0 ≤ r < s then Ws – Wr ∼ N(0, s – r ).
4. The map t → Wt is continuous for every ω.
It is clear from the previous sections that we can construct a Brownian motion as a limit of a scaled
symmetric random walk. Had we simply given Definition 7.2.1 at the beginning of this chapter with
no further introduction, one might have legitimately aksed if there exists a process that satisfies the
properties of a Brownian motion. There are other methods to prove the existence of Brownian motion.
But, the scaled random walk construction is perhaps the most intuitive.
What is Ω in Definition 7.2.1? It could be an infinite series of Hs and Ts, representing movements up
and down of a scaled random walk. Or, it could be Ω = C0 (R+ ), the set of continuous functions on
R+ , starting from zero. In this case, an element of Ω is a continuous function t → ω(t ) and one could
simply take Wt (ω) = ω(t ). Whatever the sample space, the probability of any single element ω is zero:
P(ω) = 0, but probabilities such as P(Wt ≤ 0) are well-defined.
Let 0 < t1 < t2 < . . . < td < ∞. Note that the vector W := (Wt1 , Wt2 , . . . , Wtd ) is a d-dimensional
normally distributed random variable. The distribution of a normally distributed random vector is
uniquely determined by its mean vector and covariance matrix. We clearly have E(Wt1 , Wt2 , . . . , Wtd ) =
(0, 0, . . . , 0). The entries of the covariance matrix are of the following form: for T ≥ t we have
Thus, CoV[Ws , Wt ] = s ∧ t , and the covariance matrix for (Wt1 , Wt2 , . . . , Wtd ) is
t1 t1 . . . t1
t1 t2 . . . t2
C= . . . . (7.3)
. . . ..
. . . .
t1 t2 . . . td
Definition 7.2.2. Let (Ω, F, P) be on probability space on which a Brownian motion W = (Wt )t ≥0 is
defined. A filtration for the Brownian motion W is a collection of σ-algebras F = (Ft )t ≥0 satisfying:
The most natural choice for this filtration F is the natural filtration for W. That is Ft = σ(Wu , 0 ≤ u ≤
t ). In principle the filtration (Ft )t ≥0 could contain more than the information obtained by observing W.
However, the information in the filtration is not allowed to destroy the independence of future increments
of Brownian motion.
Not surprisingly, if F = (Ft )t ≥0 is a filtration for a Brownian motion W then W is a martingale with
respect to this filtration. We see this, let 0 ≤ s < t and observe that
Suppose that f ∈ C([0, T]) and f 0 (t ) exists and is finite for all t ∈ (0, T). Then, by the Mean Value
Theorem, there exits tj∗ ∈ [tj , tj +1 ] such that
Thus, if f ∈ C([0, T]) and f 0 (t ) exists and is finite for all t ∈ (0, T), we have
n–1 Z T
|f 0 (tj∗ )|(tj +1 dt |f 0 (t )|.
X
FVT (f ) := lim – tj ) =
kΠk→0 j =0 0
Definition 7.3.1. Let f : [0, T] → R. We define the quadratic variation of f up to time T, denoted
[f , f ]T as
n–1
Xh i2
[f , f ]T := lim f (tj +1 ) – f (tj ) ,
kΠk→0 j =0
Proposition 7.3.2. Suppose f : [0, T] → R has a continuous first derivative: f ∈ C1 ((0, T)). Then
[f , f ]T = 0.
In ordinary calculus, we typically deal with functions f ∈ C1 , and hence [f , f ]T = 0. For this reason,
quadratic variation never arises in usual calculus. However, it turns out that for almost every ω ∈ Ω, we
have that t 7→ Wt (ω) is not differentiable. We can see this from the scaled random walk construction of
(n) dW (n)
Brownian motion. The slope of the scaled random walk Wt at any t for which dt t (ω) is defined is
(n) (n)
W – Wt √
lim t +ε = ± n → ±∞ as n → ∞.
ε→0 ε
Thus, Brownian motion, which we constructed as a limit of a scaled random walk W(ω) := limn→∞ W(n) (ω)
/ C1 ((0, T)), then Proposition 7.3.2 can fail.
is not differentiable at any t , P-a.s.. When a function f ∈
Indeed, as we will show, paths of BM have strictly positive quadratic variation. It is for this reason that
stochastic calculus is different from ordinary calculus.
7.3. QUADRATIC VARIATION 109
Theorem 7.3.3. Let W be a Brownian motion. Then, for all T ≥ 0 we have [W, W]T = T almost
surely.
Note that QΠ → [W, W]T as kΠk → 0. We will show that EQΠ → T and VQΠ → 0. Using the fact that
Wtj +1 – Wtj ∼ N(0, tj +1 – tj ), we compute
Thus, EQΠ → T and VQΠ → 0 as kΠk → 0, which proves that [W, W]T = limkΠk→0 QΠ = T.
Suppose dt 1 and define dWt := Wt +dt – Wt . The above computations show that E(dWt )2 = dt and
V(dWt ) = 2dt 2 . Since dt 2 is practically zero for dt 1, one can imagine that (dWt )2 is almost equal
to a constant dt . Informally, we write this as
This informal statement, while not rigorously correct, captures the spirit of the quadratic variation
computation for W.
Definition 7.3.4. Let f , g : [0, T] → R. We define the covaration of f and g up to time T, denoted
[f , g]T as
n–1
Xh ih i
[f , g]T := lim f (tj +1 ) – f (tj ) g(tj +1 ) – g(tj ) ,
kΠk→0 j =0
Theorem 7.3.5. Let W be a Brownian motion and let Id be the identity function: Id(t ) = t . Then,
for all T ≥ 0 we have [W, Id]T = 0 almost surely and [Id, Id]T = 0.
Proof. For a fixed partition Π = {0 = t0 , t1 , . . . , tn = T} we have
n–1
X
[W, Id]T
= lim (Wtj +1 – Wtj ) · (tj +1 – tj )
kΠk→0 j =0
n–1
X
= lim (Wtj +1 – Wtj ) · (tj +1 – tj )
kΠk→0 j =0
Just as (7.7) captures the spirit of the computation of [W, W]T , the following equations
dWt dt = 0, dt dt = 0,
informally capture the spirit of the [W, Id]T and [Id, Id]T computations.
Thus, Z is a martingale.
The process Z is sometimes referred to as an exponential martingale. The exponential martingale will
be used to compute the distribution of
We call τm the first hitting time or first passage time of a Brownian motion W to level m.
where Z is given by (7.8). We call Z(m) a stopped process as it remains at Zτm forever after W hits m.
As Z is a martingale it follows that the stopped process Z(m) is also a martingale. Thus, we have
(m) (m) 1 2
1 = Z0 = EZt = Ee– 2 σ t ∧τm +σWt ∧τm .
Now, it turns out that P(τm < ∞) = 1 (a fact that can be proved with relative ease). As a result, we
have limt →∞ t ∧ τm = τm and Wτm = m. Thus, we obtain
1 2 1 2 1 2
1 = lim Ee– 2 σ t ∧τm +σWt ∧τm = E lim e– 2 σ t ∧τm +σWt ∧τm = Ee– 2 σ τm +σm .
t →∞ t →∞
√ √
Setting σ = 2α we obtain Ee–ατm = e–m 2α , which agrees with (7.10) for m ≥ 0. To obtain (7.10) for
m < 0, simply note that, since Brownian motion is symmetric about zero, the distribution of τm is the
same as the distribution of τ–m .
Theorem 7.6.1. For all m 6= 0, the first hitting time τm of Brownian motion to level m has a
density fτm , which given by
|m| –m 2 /(2t )
fτm (t ) = 1{t ≥0} √ e . (7.12)
t 2πt
d d d Z∞ 1 x2
fτm (t ) = P(τm ≤ t ) = 2P(Wt ≥ m) = 2 dx √ exp –
dt dt dt m 2πt 2t
d Z∞ 1 y2 m
2
= 2 √ dy √ exp – = √ e–m /(2t )
dt m/ t 2π 2 t 2πt
The case m ≤ 0 is proved in a similar manner.
Now, let us define the running maximum, denoted W = (Wt )t ≥0 , of Brownian motion
Wt = max Ws .
0≤s≤t
Theorem 7.6.2. For any t > 0, the joint density of Brownian motion Wt and its running maximum
Wt is
where fWt is the density of Wt , which is a N(0, t ) random variable. To obtain the density (7.14) use
∂2 ∂2 Z ∞
fW ,W (w , m) = P(Wt ≤ w , Wt ≤ m) = – dx fWt (x ).
t t ∂m∂w ∂m∂w 2m–w
The rest of the computation is algebra.
114 CHAPTER 7. BROWNIAN MOTION
Corollary 7.6.3. For any t > 0, the conditional density of Wt given Wt is given by
fW ,W (w , m)
t t
fW |W (m, w ) = .
t t fWt (w )
7.7 Exercises
Exercise 7.1. Let W be a Brownian motion and let F be a filtration for W. Show that W2t – t is a
martingale with respect to the filtration F.
Exercise 7.2. Compute the characteristic function of WNt where N is a Poisson process with intensity
λ and the Brownian motion W is independent of the Poisson process N.
Exercise 7.3. The nth variariation of a function f , over the interval [0, T] is defined as
n–1
|f (tj +1 ) – f (tj )|m ,
X
VT (m, f ) := lim Π = {0 = t0 , t1 , . . . tn = T}, kΠk = max(tj +1 – tj ).
kΠk→0 j =0 j
Xt = µt + Wt , τm := inf{t ≥ 0 : Xt = m},
where W = (Wt )t ≥0 is a Brownian motion. Let F = (Ft )t ≥0 be a filtration for W. Show that Z is a
martingale with respect to F where
Zt = exp σXt – (σµ + σ 2 /2)t .
Assume µ > 0 and m ≥ 0. Assume further that τm < ∞ with probability one and the stopped process
Zt ∧τm is a martingale. Find the Laplace transform Ee–ατm .
Chapter 8
Stochastic calculus
The notes from this chapter are taken primarily from (Shreve, 2004, Chapters 4 and 5).
Assumption 8.1.1. In what follows W = (Wt )t ≥0 will always represent a Brownian motion and
F = (Ft )t ≥0 will always be a filtration for this Brownian motion. We shall assume the integrand
∆ = (∆t )t ≥0 is adapted to F, meaning ∆t ∈ Ft for all t .
Note that the process ∆ can and, in many cases, will be random. However, the information available in
Ft will always be sufficient to determine the value of ∆t at time t . Also note, since (WT – Wt ) ⊥
⊥ Ft
for T > t , it follows that (WT – Wt ) ⊥⊥ ∆t . In other words, future increments of Brownian motion are
independent of the ∆ process.
115
116 CHAPTER 8. STOCHASTIC CALCULUS
Since the process ∆ is constant over intervals of the form [tj , tj +1 ), it makes sense to define
Z T n–1
X
IT = ∆t dWt := ∆tj (Wtj +1 – Wtj ). (for ∆ a simple process) (8.1)
0 j =0
Theorem 8.1.2. The process I = (It )t ≥0 defined in (8.1) is a martingale with respect to the filtration
(Ft )t ≥0 .
Proof. Without loss of generality assume T = tn and t = ti for some 0 ≤ i ≤ n – 1 (we can always
re-define our time grid so that this is true). Then we have
n–1
X
E[IT |Ft ] = E[∆tj (Wtj +1 – Wtj )|Fti ]
j =0
iX–1 n–1
X
= ∆tj (Wtj +1 – Wtj ) + E[∆tj (Wtj +1 – Wtj )|Fti ]
j =0 j =i
iX–1 n–1
X
= ∆tj (Wtj +1 – Wtj ) + E[∆tj E[Wtj +1 – Wtj |Ftj ]|Fti ]
j =0 j =i
iX–1
= ∆tj (Wtj +1 – Wtj ) = It .
j =0
Thus, the process I is a martingale, as claimed.
Theorem 8.1.3 (Itô Isometry). The process I = (It )t ≥0 defined in (8.1) satisfies
Z T
VIT = EI2T =E ∆2t dt . (8.2)
0
Proof. We have
n–1
X n–1
EI2T =
X
E∆ti ∆tj (Wti +1 – Wti )(Wtj +1 – Wtj )
j =0 i =0
n–1 X jX
n–1 –1
E∆2tj (Wtj +1 )2
X
= – W tj +2 E∆ti ∆tj (Wti +1 – Wti )(Wtj +1 – Wtj )
j –0 j =0 i =0
n–1 X jX
n–1 –1
E∆2tj E[(Wtj +1 – Wtj )2 |Ftj ] + 2
X
= E∆ti ∆tj (Wti +1 – Wti )E[(Wtj +1 – Wtj )|Ftj ]
j –0 j =0 i =0
n–1 Z T
E∆2tj (tj +1 – tj ) = E ∆2t dt .
X
=
j –0 0
8.1. ITÔ INTEGRALS 117
Thus, we obtain
n–1
X n–1 Z T
∆2tj ∆2t dt ,
X
[I, I]T = [I, I]tj +1 – [I, I]tj = tj +1 – tj =
j =0 j =0 0
as claimed.
When we computed VWT and [W, W]T we found that these two quantities were equal, even though the
computations for these quantities were completely different. From (8.2) and (8.3) we now see how the
variance and quadratic variation of a stochastic process can be different. Note that VIT is a non-random
constant, whereas [I, I]T is random.
To construct an Itô integral with ∆ as the integrand, we first approximate ∆ by a simple process
n–1
(n) X
∆t ≈ ∆t := ∆tj 1{tj ≤t <tj +1 } , 0 ≤ t0 < t1 < . . . < tn = T.
j =0
(n) 2
Z T
lim E ∆t – ∆t dt = 0. (8.5)
n→∞ 0
118 CHAPTER 8. STOCHASTIC CALCULUS
the condition (8.5) ensures that the limit exists in L2 (Ω, F, P). The Itô integral for general integrands
inherits the properties we established for simple integrands.
Theorem 8.1.5. Let W be a Brownian motion and let F = (Ft )t ≥0 be filtration for this Brownian
motion. Let ∆ = (∆t )0≤t ≤T be adapted to the filtration F and satisfy (8.5). Define I = (It )0≤t ≤T ,
be given by 0t ∆s ds, where the integral is defined as in (8.6). Then the process I has the following
R
properties.
Theorem 8.2.1. Let W = (Wt )t ≥0 be a Brownian motion and suppose f : R → R satisfies f ∈ C2 (R).
Then, for any T ≥ 0 we have
Z T Z T
f (WT ) – f (W0 ) = f 0 (W t )dWt + 1 00
2 0 f (Wt )dt . (8.8)
0
Proof. We shall simply sketch the proof of Theorem 8.2.1. Suppose for simplicity that f is analytic
(i.e., that f is equal to its power series expansion at every point). Let 0 = t0 < t1 < . . . < tn = T be a
partition Π of [0, T]. Then
Z T n–1
X n–1
X
df (Wt ) = f (WT ) – f (W0 ) = f (Wtj +1 ) – f (Wtj ) = Aj + B j + C j ,
0 j =0 j =0
2
Bj := 12 f 00 (Wtj ) Wtj +1 – Wtj
,
3
1 f 000 (W ) W
Cj := 3! tj tj +1 – Wtj + ....
Example 8.2.2. What is 0T Wt dWt ? To answer this question, consider f (Wt ) with f (x ) = x 2 .
R
where we have used f 0 (x ) = 2x and f 00 (x ) = 2. Noting that W0 = 0 and solving for 0T Wt dWt we obtain
R
Z T
Wt dWt = 12 W2T – 12 T.
0
Not surprisingly, there are stochastic processes that are not adequately described by Brownian motion
alone. However, a large class of stochastic processes can be constructed from Brownian motion.
Definition 8.2.3. Let W = (Wt )t ≥0 be a Brownian motion and let F = (Ft )t ≥0 be a filtration for this
Brownian motion. An Itô process is any process X = (Xt )t ≥0 of the form
Z t Z t
Xt = X0 + Θs ds + ∆s dWs , (8.9)
0 0
120 CHAPTER 8. STOCHASTIC CALCULUS
where Θ = (Θt )t ≥0 and ∆ = (∆t )t ≥0 are adapted to the filtration F and satisfy
Z T Z T
|Θt |dt < ∞, E ∆2t dt < ∞, ∀ T ≥ 0.
0 0
and X0 is not random.
Expression (8.10) literally means that X satisfies (8.9). Informally, the differential form can be understood
as follows: in a small interval of time δt , the process X changes according to
In fact, noting that Wt +δt – Wt ∼ N(0, δt ) and Wt +δt – Wt ⊥⊥ Ft , one can use expression (8.11) to
simulate the increment Xt +δt – Xt . This way of simulating X is called the Euler scheme and is the
workhorse of many Monte Carlo methods.
Lemma 8.2.4. The quadratic variation [X, X]T of an Itô process (8.9) is given by
Z T
[X, X]T = ∆2t dt .
0
Proof. We sketch the proof of Lemma 8.2.4. Let 0 = t0 < t1 < . . . < tn = T be a partition Π of [0, T].
By definition we have
n–1
X 2 n–1
X
[X, X]T = lim Xtj +1 – Xtj = lim Aj + B j + C j
kΠk→0 j =0 kΠk→0 j =0
with
Z t Z t
It = ∆s dWs , Jt = Θs ds.
0 0
In the limit as kΠk → 0 we obtain
n–1
X n–1
X n–1
X
Aj → [I, I]T , Bj → 0, Cj → 0.
j =0 j =0 j =0
Definition 8.2.5. Let X = (Xt )t ≥0 be an Itô process, as described in Definition 8.2.3. Let Γ = (Γt )t ≥0
be adapted to the filtration of the Brownian motion F = (Ft )t ≥0 . We define
Z T Z T Z T
Γt dXt := Γt Θt dt + Γt ∆t dWt ,
0 0 0
where we assume
Z T Z T
|Γt Θt |dt < ∞, E (Γt ∆t )2 dt < ∞, ∀ T ≥ 0.
0 0
Theorem 8.2.6 (Itô formula in one dimension). Let X = (Xt )t ≥0 be an Itô process and suppose
f : R → R satisfies f ∈ C2 (R). Then, for any T ≥ 0 we have
Z T Z T
f (XT ) – f (X0 ) = f 0 (Xt )dXt + 12 f 00 (Xt )d[X, X]t .
0 0
Proof. The proof of Theorem 8.2.6 if very similar to the proof of Theorem 8.2.1. We outline the proof
here. Suppose for simplicity that f is analytic (i.e., that f is equal to its power series expansion at every
point). Let 0 = t0 < t1 < . . . < tn = T be a partition Π of [0, T]. Then
Z T n–1
X n–1
X
df (Xt ) = f (XT ) – f (X0 ) = f (Xtj +1 ) – f (Xtj ) = Aj + B j + C j ,
0 j =0 j =0
2
Bj := 12 f 00 (Xtj ) Xtj +1 – Xtj
,
3
1 f 000 (X ) X
Cj := 3! tj tj +1 – Xtj + ....
where we have used d[X, X]t = ∆2t dt . Perhaps the easiest way to remember (8.12) is to use the following
two-step procedure:
122 CHAPTER 8. STOCHASTIC CALCULUS
2. Insert the differential dXt = Θt dt + ∆t dWt into (8.13), expand (dXt )2 and use the rules
Assuming µ = (µt )t ≥0 and σ = (σt )t ≥0 are bounded above and below and X0 > 0, the process X remains
strictly positive. We call X a generalized geometric Brownian motion. The “geometric” part refers to
the fact that the relative step size dXt /Xt has dyanmics µt dt + σt dWt . The “generalized” part refers to
p
the fact that the processes σ and µ are stochastic rather than constant. Define Yt = Xt . What is dYt ?
Let f (x ) = x p . Then f 0 (x ) = px p–1 and f 00 (x ) = p(p – 1)x p–2 . Thus, we have
p–1 p–2
dYt = df (Xt ) = pXt dXt + 21 p(p – 1)Xt (dXt )2
p–1 p–2
= pXt (µt Xt dt + σt Xt dWt ) + 21 p(p – 1)Xt (µt Xt dt + σt Xt dWt )2
p–1 p–2
= pXt (µt Xt dt + σt Xt dWt ) + 21 p(p – 1)Xt σt2 X2t dt
p p
= pµt + 21 p(p – 1)σt2 Xt dt + pσt Xt dWt
= pµt + 21 p(p – 1)σt2 Yt dt + pσt Yt dWt .
We see from the last line that Y = (Yt )t ≥0 is also a generalized geometric Brownian motion.
Example 8.2.8. Let X have generalized geometric Brownian motion dynamics as in (8.14). We would
like to find an explicit expression for Xt (i.e., an expression of the form Xt = . . . where . . . does not
contain X). To this end,we let Yt = log Xt . With f (x ) = log x we have f 0 (x ) = 1/x and f 00 (x ) = –1/x 2 .
Thus, we have
1 1 –1
2 = µ – 1 σ 2 dt + σ dW .
dYt = dXt + (dXt ) t 2 t t t
Xt 2 X2t
Thus, we have
Z T Z T !
XT = exp(YT ) = exp Y0 + µt – 1 σ2 dt + σt dWt
0 2 t 0
Z T Z T !
= X0 exp µt – 1 2
0 2 σt dt +
0
σt dWt , (8.15)
Proof. Set µt = 0 and σt = ug(t ) in (8.14) where u is a constant. And suppose X0 = 1. Then we have
Z T
XT = 1 + ug(t )Xt dWt ,
0
which is a martingale since Itô integrals are martingales. From (8.15), we know that XT can be written
explicitly as
Z T Z T ! Z T !
XT = exp 1
–2 u 2 g 2 (t )dt + ug(t )dWt = exp 2
–u 21 g 2 (t )dt + uIT .
0 0 0
moment generating function of a normal random variable with mean zero and variance v (T) = 0T g 2 (t )dt .
R
Theorem 8.3.2. Let W = (W1t , W2t , . . . , Wdt )t ≥0 be a d-dimensional Brownian motion. The covari-
ation of independent components of W is zero: [Wi , Wj ]T = 0 for all i 6= j and T ≥ 0.
Proof. Let 0 = t0 < t1 < . . . < tn = T be a partition Π of [0, T]. The sampled covariation CΠ of Wi
and Wj is given by
n–1
X j j
CΠ = Witk +1 – Witk Wt – Wt
k +1 k
k =0
j j
Since E Witk +1 – Witk Wt – Wt = 0, we clearly have ECΠ = 0. Next, we compute the variance of
k +1 k
CΠ . We have
n–1
X n–1 j j j j
EC2Π E Witk +1 – Witk Witl+1 – Witl
X
VCΠ = = Wt – Wt Wt – Wt
k +1 k l+1 l
k =0 l=0
n–1 2 2
j j
E Witk +1 – Witk
X
= Wt – Wt
k +1 k
k =0
n–1
X kX–1
j j j j
+2 E Witk +1 – Witk Wt – Wt Witl+1 – Witl Wt – Wt
k +1 k l+1 l
k =0 l=0
n–1
X
= E (tk +1 – tk ) (tk +1 – tk )
k =0
n–1
X
≤ kΠk E (tk +1 – tk ) = kΠk T.
k =0
Thus, ECΠ → 0 and VCΠ → 0 as kΠk → 0, which proves that [Wi , Wj ]T := limkΠk→0 CΠ = 0.
Theorem 8.3.2 can be used to derive the covariation of two Itô processes Xi and Xj .
k =1
We will not prove Theorem 8.3.3. Rather, we simply remark that it can be obtained informally by
writing
j
d[Xi , Xj ]t = dXit dXt , (8.17)
8.3. MULTIVARIATE STOCHASTIC CALCULUS 125
inserting expression (8.16) into (8.17) and using the multiplication rules
j 1, i = j ,
j
dWit dWt = δij dt , δij = dWt dt = 0, dt dt = 0. (8.18)
6 j,
0, i =
Note that d[Xi , Xj ]t = 0 unless Xj and Xj are driven by at least one common one-dimensional Brownian
motion.
We can now give a n-dimensional version of Itô’s Lemma. We present the formula in differential form, as
it is written more compactly in this way.
The prove of Theorem 8.3.4 is a straightforward extension of Theorem 8.2.6 to the n-dimensional case
and will not be presented here.
To obtain an explicit expression for df (Xt ) in terms of dW1t , dW2t , . . . , dWdt and dt we can repeat the
same informal procedure we used in the one-dimensional case.
1. Expand df (Xt ) = f (Xt + dXt ) – f (Xt ) about the point Xt to second order
n ∂f (Xt ) i 1 X n X n ∂ 2 f (X )
t j
dXit dXt .
X
df (Xt ) = dXt + (8.19)
i =1
∂x i 2 i =1 j =1
∂x ∂x
i j
2. Insert expression for dXit into (8.19) and use the multiplication rules given in (8.18).
Example 8.3.5 (Product rule). To compute d(Xt Yt ) wherer X and Y and one-dimensional Itô
processes, we define f (x , y) = xy and use fx = y, fy = x , fxy = 1 and fxx = fyy = 0 to compute
Example 8.3.6 (OU process). An Ornstein-Uhlenbeck process (OU process, for short) is an Itô
process X = (Xt )t ≥0 that satisfies
where W = (Wt )t ≥0 is a one-dimensional Brownian motion and κ, θ > 0. The OU process is mean-
reverting in the following sense. If Xt > θ then κ(θ – Xt ) < 0 and the deterministic part of (8.20) (i.e.,
126 CHAPTER 8. STOCHASTIC CALCULUS
the dt -term) pushes the process down towards θ. If Xt < θ then κ(θ – Xt ) > 0 and the deterministic part
of (8.20) pushes the process up towards θ. The the OU process mean-reverts to the long-run mean θ.
We often call κ the rate of mean reversion, though, this nomenclature is somewhat misleading since the
instantaneous rate of mean reversion is actually κ(θ – Xt ).
We will find an explicit expression for Xt and also compute EXt and VXt . To this end, let us define
Yt = Xt – θ so that
Note that Y is an OU process that mean-reverts to zero. Next, we define Zt = f (t , Yt ) = eκt Yt . We can
use the two-dimensional Itô formula to compute dZt . Using fyy = 0 and the heuristic rules dt dWt = 0
and dt dt = 0 we have
2 f (t , M )d[M, M]
df (t , Mt ) = ∂t f (t , Mt )dt + ∂m f (t , Mt )dMt + 21 ∂m t t
= ∂t + 12 ∂m 2 f (t , M )dt + ∂ f (t , M )dM .
t m t t
Although we have not proved it, since M is a martingale (by assumption), it follows that any integral of
the form It := 0t ∆s dMs , where ∆ = (∆t )t ≥0 is adapted to (Ft )t ≥0 , is a martingale. Thus, using the
R
It follows that
1 2 1 2 1 2
Et euMT – 2 u T = f (t , Mt ) = euMt – 2 u t ⇒ Et eu(MT –Mt ) = e 2 u (T–t ) . (8.21)
1 2
By definition, Et eu(MT –Mt ) is the Ft -conditional moment generating function of (MT –Mt ). And e 2 u (T–t )
is the moment generating function of a N(0, T – t ) random variable. It follows that MT – Mt ∼ N(0, T – t )
for all 0 ≤ t ≤ T < ∞, just like a Brownian motion. Furthermore, we see that MT – Mt ⊥⊥ Ft , as the
right-hand-side of (8.21) does not depend on Ft . As M satisfies all properties of a Brownian motion, it
must be a Brownian motion.
128 CHAPTER 8. STOCHASTIC CALCULUS
where W = (W1t , W2t ) is a two-dimensional Brownian motion. We will use Theorem 8.3.7 to show that B
is a Brownian motion. It is clear that B0 = 0 and B has sample paths that are continuous. It is also
clear that B is a martingale since W1 and W2 are martingales. What remains is to show that [B, B]t = t .
We have
As previously mentioned, the distribution of a normally distributed random vector is uniquely determined
by its mean vector m and covariance matrix. Thus, for a Gaussian process, we are interested in
Example 8.4.2. A Brownian motion W is a Gaussian process. To see this, fix an arbitrary sequence of
times 0 = t0 < t1 < t2 < . . . tn . Note that
kX
–1
Wtk = (Wtj +1 – Wtj ).
j =0
The increments are independent and normally distributed Wtj +1 – Wtj ∼ N(0, tj +1 – tj ). It fol-
lows that the vector (Wt1 , Wt2 , . . . , Wtn ) is jointly normal. The mean vector is given by m =
(m(t1 ), m(t2 ), . . . , m(tn )) = (0, 0, . . . , 0) and the covariance matrix C = (c(ti , tj ))1≤i ,j ≤n has entries
c(ti , tj ) = ti ∧ tj ; see equation (7.3).
8.4. BROWNIAN BRIDGE 129
is a Gaussian process. To see this, fix an arbitrary sequence of times 0 = t0 < t1 < t2 < . . . tn . Note that
kX
–1 Z tj +1
Itk = g(s)dWs .
j =0 tj
The increments are independent and normally distributed with mean zero and variance:
Z t Z t
j +1 j +1
V g(s)dWs = g 2 (s)ds
tj tj
It follows that the vector (It1 , It2 , . . . , Itn ) is jointly normal. The mean vector is given by m =
(m(t1 ), m(t2 ), . . . , m(tn )) = (0, 0, . . . , 0). To compute the covariance matrix C = (c(ti , tj ))1≤i ,j ≤n ,
assume without loss of generality that ti < tj . Then we have
Z t Z t
i j
c(ti , tj ) = E g(s)dWs g(u)dWu
0 0
Z t 2 Z t Z t
i i j
g(u)dWu Ftj
=E g(s)dWs +E g(s)dWs E
0 0 ti
Z t Z t ∧t
i i j
= g 2 (s)ds = g 2 (s)ds,
0 0
Definition 8.4.4 (Brownian bridge, version I). / Let W = (Wt )t ≥0 be a Brownian motion and fix
T > 0. We define Xa→b = (Xa→b
t )0≤t ≤T , a Brownian bridge from a to b on [0, T], by
t t
Xa→b
t =a+ (b – a) + Wt – WT , 0 ≤ t ≤ T. (8.22)
T T
Theorem 8.4.5. The Brownian Bridge from a to b, defined in (8.22), is a Guassian process. The
mean vector and covariance matrix have entries given by
t
m a→b (t ) = a + (b – a),
T
st
c a→b (t , s) = t ∧ s – .
T
Proof. Since the sum of two (possibly correlated) normal random variables is again normal, and since
for every t ∈ [0, T], the value of Xa→b
t is a linear combination of Wt and WT , both of which are normal,
130 CHAPTER 8. STOCHASTIC CALCULUS
Suppose F = (Ft )t ≥0 is a filtration for a Brownian motion W. Then Xa→b is clearly not adapted to F.
Indeed, since Xa→b
t is expressed in terms of WT , we require the information in FT in order to write the
value of Xta→b . There is an alterantive definition of a Brownian bridge, which is adapted to F.
Definition 8.4.6 (Brownian bridge, version II). Let W = (Wt )t ≥0 be a Brownian motion and fix
T > 0. We define Ya→b = (Yta→b )0≤t ≤T , a Brownian bridge from a to b on [0, T], by
Z t
t 1
Ya→b
t =a+ (b – a) + (T – t ) dWs , 0 ≤ t ≤ T. (8.23)
T 0 T–s
Definition 8.4.8 (Brownian bridge, version III). Let W = (Wt )t ≥0 be a Brownian motion and fix
T > 0. We define Za→b = (Zta→b )0≤t ≤T , a Brownian bridge from a to b on [0, T], by
Za→b
t = a + Wt |WT = b – a, 0 ≤ t ≤ T.
P(Wt ∈ dx , WT ∈ db)
P(Z0→b
t ∈ dx ) = P(Wt ∈ dx |WT = b) =
P(WT ∈ db)
P(Wt ∈ dx )P(WT – Wt + x ∈ db)
= .
P(WT ∈ db)
132 CHAPTER 8. STOCHASTIC CALCULUS
where m a→b (t ) and c a→b (t , s) are as given in Theorem 8.4.9. We can clearly see then, that Z0→b
t
is normally distributed at every t with mean m 0→b (t ). We leave computation of the entries of the
covariance matrix c 0→b (t , s) as an exercise for the reader.
via
P(A)
e = EZ1A , A ∈ F,
e 11 ,
P(A) = E A ∈ F,
Z A
and we call Z1 = dP
e the Radon-Nykodym derivative of P with respect to P.
dP
In Example 1.7.6, on a probability space (Ω, F, P), we defined X ∼ N(0, 1) and a Radon-Nikodym
1 2
derivative Z = e–θX– 2 θ . We showed that Y := X + θ was N(θ, 1) under P and N(0, 1) under P.
e Thus, Z
We would like to extend this idea from a static to a dynamics setting. Specifically, we would like to find
a measure change that modifies the dynamics of a stochastic process X = (Xt )t ≥0 .
Definition 8.5.1. Let (Ω, F, P) be a probability space and let F = (Ft )0≤t ≤T be a filtration on this
space. A Radon-Nykodým derivative process (Zt )0≤t ≤T is any process of the form
Zt := E[Z|Ft ]
Note that Z in Definition 8.5.1 satisfies the conditions of a Radon-Nikodým derivative. As such, one can
define a measure change ddP
P from Z.
e
8.5. GIRSANOV’S THEOREM FOR A SINGLE BROWNIAN MOTION 133
Lemma 8.5.3. Let (Zt )0≤t ≤T be a Radon-Nikodým derivative process and define ddP
P = Z. Suppose
e
EY
e = EZ Y.
s
EY
e = EZY = EYE[Z|F ] = EYZ .
s s
Lemma 8.5.4. Let (Zt )0≤t ≤T be a Radon-Nikodým derivative process and define ddP
P = Z. Suppose
e
Y ∈ Ft where 0 ≤ s ≤ t ≤ T. Then
1
E[Y|F
e
s] = E[Zt Y|Fs ].
Zs
Proof. From Definition 2.3.1, we recall that a conditional expectation E[Y|F
e
s ] must satisfy two
properties:
(i) E[Y|Fs ] ∈ Fs .
e
Theorem 8.5.5 (Girsanov). Let W = (Wt )0≤t ≤T be a Brownian motion on a probability space
(Ω, F, P) and let F = (Ft )0≤t ≤T be a filtration for W. Suppose Θ = (Θt )0≤t ≤T is adapted to the
filtration F. Define (Zt )0≤t ≤T and W
f = (W
f )
t 0≤t ≤T by
Z t Z t
Zt = exp – 1 Θ2 ds – Θs dWs , f = Θ dt + dW ,
dW W
f = 0.
s t t t 0
0 2 0
Assume that
Z T
E Θ2t Z2t dt < ∞.
0
under P.
e
W
f = 0. Also, we see that
0
d[W,
f W] f )2 = (dW + Θ dt )2 = (dW )2 + 2Θ dW dt + Θ2 (dt )2 = dt .
f = (dW
t t t t t t t t
Since Itô integrals are martingales, it follows that (Zt )0≤t ≤T is a martingale under P. In particular we
have EZ = EZT = Z0 = 1. We also have
for all 0 ≤ t ≤ T, which shows that (Zt )0≤t ≤T is a Radon-Nikodým derivative process. Next, we show
that (Wf Z )
t t 0≤t ≤T is a martingale under P. We have
d(W
f Z )=W
t t
f dZ + Z dW
t t t
f + d[W,
t
f Z]
t
f (–Z Θ dW ) + Z (Θ d + dW
=W f ) + (Θ dt + dW )(–Z Θ dW )
t t t t t t t t t t t t t
f (–Z Θ dW ) + Z (Θ d + dW
=W f ) – Z Θ dt
t t t t t t t t t t
8.6. GIRSANOV’S THEOREM FOR D-DIMENSIONAL BROWNIAN MOTION 135
f Θ + 1)Z dW .
= (–W t t t t
1
E[
e Wf |F ] =
t s E[Zt W
f |F ]
t s (by Lemma 8.5.4)
Zs
1
= Zs W
f =W
s
f .
s
Zs
Thus, W
f is a martingale, and therefore, a Brownian motion under P.
e
Theorem 8.6.1 (Girsanov). Let W = (W1t , W2t , . . . , Wdt )0≤t ≤T be a d-dimensional Brownian
motion on a probability space (Ω, F, P) and let F = (Ft )0≤t ≤T be a filtration for W. Sup-
pose Θ = (Θ1t , Θ2t , . . . , Θdt )0≤t ≤T is adapted to the filtration F. Define (Zt )0≤t ≤T and W
f =
1 2 d
(W
f ,W
t
f ,...,W
t
f )
t 0≤t ≤T by
Z t Z t
Zt = exp – 1 hΘ , Θ ids – hΘs , dWs i , f = Θ dt + dW ,
dW W
f = 0,
s s t t t 0
0 2 0
where h·, ·i denotes a d-dimensional Euclidean inner product. Assume that
Z T
E hΘt , Θt iZ2t dt < ∞.
0
Theorem 8.6.2 (Martingale representation). Let W = (W1t , W2t , . . . , Wdt )0≤t ≤T be a d-dimensional
Brownian motion on a probability space (Ω, F, P) and let F = (Ft )0≤t ≤T be a filtration generated
by W (that is, Ft = σ(Ws , 0 ≤ s ≤ t )). Let M = (Mt )0≤t ≤T be a martingale with respect to the
filtration F. Then there exists a process Γ = (Γ1t , Γ2t , . . . , Γdt )0≤t ≤T that is adapted to the filtration
F such that
Z t
Mt = M0 + hΓs , dWs i, t ∈ [0, T].
0
8.7 Exercises
Exercise 8.1. Compute d(W4t ). Write W4T as an integral with respect to W plus an integral with respect
to t . Use this representation of W4T to show that EW4T = 3T2 . Compute EW6T using the same technique.
Where W = (Wt )0≤tleqT is a Brownian motion under probablity measure P. Then we can define a new
probability measure P̃ such that the process W̃ = (W̃t )0≤t ≤T is a Brownian motion under P̃. Then the
OU process X = (Xt )0≤t ≤T on the new probablity space (Ω, F, P̃) will be
Exercise 8.6. Let X be a Brownian bridge from zero to zero on the interval t = 0 to t = 1
(1) Prove that the process X1–t , 0 ≤ t ≤ 1, is also a Brownian bridge.
(2) Prove that if W is a Brownian motion, then the processes (1 – t )W1/(1–t ) and t W(1/t )–1 , 0 ≤ t ≤ 1,
are both Brownian bridges.
(3) If we write t̄ for t modulo 1, prove that the process Yt = Xt +s–Xs , 0 ≤ t ≤ 1 is a Brownian bridge
for every fixed s ∈ (0, 1). This is called the cyclic invariance of Brownian bridge.
138 CHAPTER 8. STOCHASTIC CALCULUS
Chapter 9
The notes from this chapter are taken primarily from (Shreve, 2004, Chapter 6), (Øksendal, 2005,
Chapters 5, 7, 8 and 9) and Linetsky (2007). Another good reference is (Karlin and Taylor, 1981, Chapter
15).
We call functions µ and σ the drift and diffusion, respectively, and we call Xt = x the initial condition.
A (strong) solution of an SDE is a stochastic process X = (Xs )s≥t such that
Z T Z T
XT = x + µ(s, Xs )ds + σ(s, Xs )dWs , (9.2)
t t
for all T ≥ t .
One way to envision a strong solution of an SDE is as follows: think of a sample path W· (ω) : [t , ∞) → R
as input. From this input, we can construct a unique sample path X· (ω) : [t , ∞) → R.
Ideally, we would like to write XT as an explicit functional of the Brownian path (Ws )s≥t . Unfortunately,
this is typically not possible. Still, it will help to build intuition if we see some explicitly solvable examples.
139
140 CHAPTER 9. SDES AND PDES
where µ and σ are deterministic functions of t . To solve this SDE, we consider Xt = log Zt . Using the
Itô formula, we obtain
!
1 1 –1
dXt = d log Zt = dZt + d[Z, Z]t
Zt 2 Z2t
= µ(t ) – 21 σ 2 (t ) dt + σ(t )dWt .
Example 9.1.3 (Linear SDE). Consider the following SDE with linear coefficients
dXt = d(Yt Zt )
= Yt dZt + Zt dYt + d[Y, Z]t
= µ(t )Yt Zt dt + σ(t )Yt Zt dWt + (b(t ) – σ(t )a(t )) dt + a(t )dWt + a(t )σ(t )dt
We also have X0 = Y0 Z0 = x · 1 = x . Thus, we have shown that YZ solves (9.3). Now, we note that
Z T Z T
b(t ) – σ(t )a(t ) a(t )
YT = x + dt + dWt ,
0 Zt 0 Zt
9.1. STOCHASTIC DIFFERENTIAL EQUATIONS 141
and thus
Z T Z T
b(t ) – σ(t )a(t ) a(t )
X T = Y T ZT = x ZT + Z T dt + ZT dWt .
0 Zt 0 Zt
where, from Example 9.1.2, we have
Z t Z t
Zt = exp µ(s) – 1 σ 2 (s) ds + σ(s)dWs .
0 2 0
Although we have mostly thrown mathematical rigor out the window in these notes, in an effort to
be responsible mathematicians, we should at least state (even if we do not prove) an existence and
uniqueness result.
Theorem 9.1.4 (Existence and Uniqueness of SDEs). Consider the following SDE
for some constants C1 , C2 < ∞. Then SDE (9.4) has a unique solution, which is adapted to to the
filtration F = (Ft )t ≥0 generated by W = (Wt )t ≥0 and satisfies E 0T X2t dt for all T < ∞.
R
Remark 9.1.5. Theorem 9.1.4 actually refers to a strong solution of an SDE. There is another notion of
a solution of an SDE called a weak solution. We will not discuss weak solutions here.
We will not prove Theorem 9.1.4. We refer the interested reader, instead, to (Øksendal, 2005, Theorem
5.21). However, we will illustrate with two examples what can go wrong if equations (9.5) and (9.6) are
not satisfied.
We identify µ(t , x ) = x 2 , which does not satisfy the linear growth condition (9.5). The unique solution
to (9.7) is
1
Xt = , 0 ≤ t < 1.
1–t
Note that Xt blows up as t → 1 and that Xt is not defined for t ≥ 1. Thus, it is impossible to find a
solution that is defined for all t ≥ 0.
142 CHAPTER 9. SDES AND PDES
We identify µ(t , x ) = 3x 2/3 , which does not satisfy the Lipschitz condition (9.6) at x = 0. One can
check directly that any X(a) of the form
(a)
0 t ≤a
Xt =
(t – a)3 t > a,
Theorem 9.1.8 (Markov property of solutions of an SDE). Let X = (Xt )t ≥0 be the solution
of an SDE of the form (9.1). The X is a Markov process. That is, for t ≤ T and for some suitable
function ϕ, there exists a function g (which depends on t , T and ϕ) such that
The proof of Theorem 9.1.8 is somewhat technical and will not be given here. But, the intuitive idea for
why the theorem is true is rather simple. From (9.2), we see that the value of XT depends only on the
path of the Brownian motion over the interval [t , T] and the initial value Xt = x . The path that X took
to arrive at Xt = x plays no role. In other words, given the present Xt = x , the future (XT )T>t is
independent of the past Ft . With this in mind, the process X should admit a transition density
Of course, finding an explicit representation of the transition density Γ may not be possible.
Theorem 9.2.1 (Kolmogorov Backward equation). Let X be the solution of SDE (9.1). For
some suitable function ϕ, define
If the function u ∈ C1,2 , then it satisfies the Kolmogorov Backward Equation (KBE), a linear PDE
of the form
Proof. First, we note that the process (u(t , Xt ))0≤t ≤T is a martingale since, for any 0 ≤ s ≤ t ≤ T,
we have
Next we take the differential of u(t , Xt ) and find using the Itô formula that
Integrating, we have
Z t Z t
u(t , Xt ) = u(s, Xs ) + (∂r + A(r )) u(r , Xr )dr + σ(r , Xr )∂x u(r , Xr )dWr .
s s
Taking a conditional expectation and using the fact that Itô integrals are martingales, we find
Z t
E[u(t , Xt )|Fs ] = u(s, Xs ) + E[(∂r + A(r )) u(r , Xr )|Fs ]dr .
s
Since (u(t , Xt ))0≤t ≤T is a martingale, the integral above must be zero for every 0 ≤ s ≤ t ≤ T and for
every possible value of Xs . The only way for this to be true is if the function u satisfies the PDE in
(9.10). To see why u(T, x ) = ϕ(x ) simply use the fact that ϕ(XT ) ∈ FT to write
Theorem 9.2.1 tells us that the function u defined in (9.9) satisfies a PDE (9.10). Alternatively, the
Feynman-Kac formula says that the solution u of the PDE (9.10) has the stochastic representation
u(t , x ) = E[ϕ(x )|Xt = x ].
The methods outlined in the proof above can be applied more generally to find PDE representations for
more complicated functionals of the path of X. The basic steps are as follows
144 CHAPTER 9. SDES AND PDES
Proof. Note that the function u is not a martingale as, for 0 ≤ s ≤ t ≤ T we have
Z T
E[u(t , Xt )|Fs ] = E[E[e–A(t ,T) ϕ(XT ) + dr e–A(t ,r ) g(r , Xr )|Ft ]|Fs ]
t
Z T
= E[e–A(t ,T) ϕ(XT ) + dr e–A(t ,r ) g(r , Xr )|Fs ]
t
Z T
6= E[e–A(s,T) ϕ(XT ) + dr e–A(s,r ) g(r , Xr )|Fs ] = u(s, Xs ).
s
Setting the dt term equal to zero, we obtain (9.11). The terminal condition is obtained using
Z T
u(T, XT ) = E[e–A(T,T) ϕ(XT ) + ds e–A(t ,s) g(s, Xs )|FT ]
T
= E[ϕ(XT )|FT ] = ϕ(XT ),
Killing a diffusion
On a probability space (Ω, F, P), consider the following model for a diffusion X and a random time τ
or exceeds E. The integral depends on the path of X, and therefore is stochastic. The exponentially
distributed random variable E is also random (obviously). For these reasons, we say that τ is doubly
stochastic. The random time τ called the killing time as it sometimes is used to model the lifetime of
the process X.
Let FX = (FtX )t ≥0 be the filtration generated by observing the X process. Note that
/ FtX .
1{τ >t } ∈
In order to keep track of the information obtained by observing τ , we introduce an auxiliary process D
as follows
Dt = 1{τ ≤t } .
Theorem 9.2.3. Let X = (Xt )t ≥0 and τ be as given in (9.12) and (9.13). Then
RT
E[1{τ >T} ϕ(XT )|Ft ] = 1{τ >t } E[e– t γ(s,Xs )ds ϕ(X )|F ],
T t (9.14)
where T ≥ t .
146 CHAPTER 9. SDES AND PDES
Proof. Noting that 1 = 1{τ ≤t } + 1{τ >t } and 1{τ ≤t } 1{τ >T} = 0, we have
E[1{τ >T} ϕ(XT )|Ft ] = 1{τ >t } E[1{τ >T} ϕ(XT )|Ft ] + 1{τ ≤t } E[1{τ >T} ϕ(XT )|Ft ]
X , F ]|F ].
= 1{τ >t } E[ϕ(XT )E[1{τ >T} |FT (9.15)
t t
In this Section, we will derive two PDEs which are satisfied by the transition density Γ: the Kolmogorov
Backward Equation (KBE), which is a PDE in the backward variables (t , x ), and the Komogorov
Forward Equation (KFE), which is a PDE satisfied by the forward variables (T, y). Physicists and
biologists sometimes call the KFE the Fokker-Planck Equation.
We have already seen the KFE and KBE in the continuous time Markov chain setting, discussed in
Section 5.2. The development of the KFE and KBE for diffusion processes is remarkably similar.
Definition 9.3.1. The two-parameter semigroup (P(t , T))0≤t ≤T<∞ , of a Markov diffusion X, is defined
as
Z
P(t , T)ϕ(x ) = E[ϕ(XT )|Xt = x ] = dy Γ(t , x ; T, y)ϕ(y).
To see the semigroup property P(t , s)P(s, T) = P(t , T), note that
The semigroup property can, alternatively, be derived from the Chapman-Kolmogorov equations
Z
Γ(t , x ; T, y) = dz Γ(t , x ; s, z )Γ(s, z ; T, y), 0 ≤ t ≤ s ≤ T < ∞,
We have
Z
P(t , s)P(s, T)ϕ(x ) = dy Γ(s, x ; T, y)ϕ(y)
Z Z
= dz Γ(t , x ; s, z ) dy Γ(s, z ; T, y)ϕ(y)
Z
= dy Γ(t , x ; T, y)ϕ(y) = P(t , T)ϕ(x ).
Definition 9.3.3. The infinitesimal generator or simply the generator, of a semigroup of operators
(P(t , s))0≤t ≤T<∞ is defined as
1
A(t )ϕ(x ) := lim P(t , s)ϕ(x ) – ϕ(x )
s&t s – t
1
= lim E[ϕ(Xs )|Xt = x ] – ϕ(x )
s&t s – t
Theorem 9.3.4. If ϕ ∈ C20 (bounded and twice differentiable), then and X is the solution of
then the generator A(t ) of the semigroup (P(t , s))0≤t ≤T<∞ of X is given by
Therefore, we have
Z s Z s
ϕ(Xs ) = ϕ(Xt ) + A(r )ϕ(Xr )dr + σ(r , Xr )ϕ(Xr )dWr ,
Z st t
E[ϕ(Xs )|Xt = x ] = ϕ(x ) + E[A(r )ϕ(Xr )|Xt = x ]dr .
t
Finally,
1 1 Zs
lim E[ϕ(Xs )|Xt = x ] – ϕ(x ) = lim E[A(r )ϕ(Xr )|Xt = x ]dr = A(t )ϕ(x ).
s&t s – t s&t s – t t
Remark 9.3.5. For all intents and purposes, when X is the solution of an SDE the generator A(t ) is the
simply the operator that acts on ϕ in the dt term of dϕ(Xt ). Thus, we can write the Itô formula more
compactly as
Of course, if X is not the solution of an SDE, we cannot write the differential dϕ(Xt ) in this more
compact form.
Then, seen as a function of the backwards variables (t , x ), the transition density Γ(·, ·; T, y) satisfies
Seen as a function of the forward variables (T, y), the transition density Γ(t , x ; ·, ·) satisfies
The above equations must hold for all functions ϕ. It follows that
Proving the KFE requires a little more effort. First, we note that, by definition, the generator A(t ) and
its L2 (dx ) adjoint satisfy
Z Z
dy f (y)A(t )g(y) = dy g(y)A∗ (t )f (y),
where, both A(t ) and A∗ (t ) act on the y variable and f , g → 0 as y → ±∞. Now, observe that
Z
dy ϕ(y)Γ(t , x ; T, y) = E[ϕ(XT )|Xt = x ]
Z T Z T
= ϕ(x ) + ds E[A(s)ϕ(Xs )|Xt = x ] + E[ σ(s, Xs )∂x ϕ(Xs )dWs |Xt = x ]
t t
Z T Z
= ϕ(x ) + ds dy Γ(t , x ; s, y)A(s)ϕ(y)
t
Z T Z
= ϕ(x ) + ds dy ϕ(y)A∗ (s)Γ(t , x ; s, y).
t
150 CHAPTER 9. SDES AND PDES
We also have
Z
dy ϕ(y)Γ(t , x ; t , y) = E[ϕ(Xt )|Xt = x ] = ϕ(x ).
Again, the above expressions must hold for all ϕ. Thus, we see that
Example 9.3.7. Let us check that the KBE and KFE are satisfied in the following simple case:
dXt = σdWt .
τ = inf{t ≥ 0 : Xt ∈
/ I}, (9.18)
Theorem 9.4.1. Let X = (Xt )t ≥0 and τ be given by (9.17) and (9.18). Define
Z τ
u(x ) := E[e–λ(τ –t ) ϕ(Xτ ) + e–λ(s–t ) g(Xs )ds|Xt = x ], t ≤ τ.
t
The the function u satisfies
(A – λ)u + g = 0 in I, (9.19)
u = ϕ, on ∂I, (9.20)
Proof. Let F = (Ft )0≤t ≤τ be the filtration generated by X. The process M = (Mt )0≤t ≤τ , defined by
Z t
Mt := e–λt u(Xt ) + e–λs g(Xs )ds
0
Z τ Z t
= e–λt E[e–λ(τ –t ) ϕ(Xτ ) + e–λ(s–t ) g(Xs )ds|Xt ] + e–λs g(Xs )ds
t 0
Z τ Z t
= e–λt E[e–λ(τ –t ) ϕ(Xτ ) + e–λ(s–t ) g(Xs )ds|Ft ] + e–λs g(Xs )ds
Z τ t 0
= E[e–λτ ϕ(Xτ ) + –λs
e g(Xs )ds|Ft ]
0
is a martingale since, for 0 ≤ s ≤ t ≤ τ we have
Z τ
E[Mt |Fs ] = E[e–λτ ϕ(Xτ ) + e–λr g(Xr )dr |Fs ] = Ms .
0
Taking the differential of M we obtain
Z t
dMt = –λt
d e u(Xt ) + d e–λs g(Xs )ds
0
= e–λt – λu(Xt ) + Au(Xt ) + g(Xt ) dt + e–λt σ(Xt )∂x u(Xt )dWt .
Since M is a martingale, the dt term must be zero for all Xt ∈ I. Thus, we have
(A – λ)u + g = 0, in I,
satisfies
(A – λ)u = 0, in I,
u = 1, on ∂I.
152 CHAPTER 9. SDES AND PDES
Example 9.4.3 (First hitting time of Brownian motion). Let us define a process X and a hitting
time τ by
should satisfy
Let us check that this is the case. We showed in Section 7.5 that
√
Ee–λτm = e–|m| 2λ , τm := inf{t ≥ 0 : Wt = m},
u(r ) = 1, 1 2
2 ∂x u(x ) = λu(x ).
The end-points l and r may or may not be part of the interval I. The generator A of X is given by
where s and m, called the scale and speed densities, respectively, are given by
! !
Z
2µ(x ) 2 2 Z
2µ(x )
s(x ) = exp – dx 2 , m(x ) = 2 = 2 exp dx 2 .
σ (x ) σ (x )s(x ) σ (x ) σ (x )
The constant of integration is arbitrary.
then m is a time-homogenous solution of the KFE. To see this simply observe that
σ 2 (y)
!
(–∂T + A∗ )m(y) = A∗ m(y)= ∂y –µ(y)m(y) + ∂y m(y)
2
! !!
2µ(y) Z
2µ(y) Z
2µ(y)
= ∂y – 2 exp dy 2 + ∂y exp dy 2 = 0.
σ (y) σ (y) σ (y)
Thus, m is a stationary density for X.
Definition 9.5.1. Let L2 (E, ρ) denote the set of square-integrable functions on E weighted by ρ
Z
L2 (E, ρ) := {f : hf , f iρ < ∞}, hf , giρ := dx ρ(x )f (x )g(x ).
E
Expression (9.23) is called the self-adjoint form of A as, for functions f and g that satisfy appropriate
boundary conditions (to be determined below) we have
Z r
hf , Agim = hAf , gim , hf , gim := dx m(x )f (x )g(x ).
l
is self-adjoint in L2 (I, m). Note that A is not self-adjoint in L2 (I, dx ). Also note, when we talk about
the operator A we are really talking about the pair (A, dom(A)). The domain of A includes boundary
conditions that must be satisfied in order of A to be self-adjoint in L2 (I, m).
Definition 9.5.2. Let X be a time-homogeneous scalar diffusion, which lives on an interval I with
endpoints l and r . An endpoint l or r is said to be
The definition above, while precise, gives us very little intuition as to what the four different boundary
classifications mean. Thus, we elaborate a bit.
• Regular boundary The process X can be started from a regular boundary and X can reach a regular
boundary in finite time.
• Exit boundary: The process X cannot be started from a exit boundary but X can reach a exit
boundary in finite time. If X reaches an exit boundary, it does not return.
• Entrance boundary: The process X can be started from an entrance boundary but X cannot reach
an entrance boundary in finite time.
• Natural boundary: The process X cannot be started from a natural boundary nor can X reach a
natural boundary in finite time.
We must specify the behavior of X at a regular boundary. Different boundary behaviors correspond to
different boundary conditions for dom(A). The two most common behaviors are killing an reflecting.
If a regular boundary is specified as killing then the process X is killed as soon as it hits this boundary
and it cannot return to the state space I. Thus, if l and r are regular killing boundaries we clearly have
Since Γ(t , ·; T, y) ∈ dom(A), we clearly want A to act on functions f that satisfy f (l) = f (r ) = 0.
If a regular boundary is specified as reflecting the the process X is instantaneously reflected back into I
if it hits this boundary. Thus, if l and r are regular reflecting boundaries we clearly have
Z
1= dy Γ(t , x ; T, y),
ZI
0= dy ∂T Γ(t , x ; T, y)
ZI
= dy A∗ Γ(t , x ; T, y)
ZI
= dy ∂y –µ(y)Γ(t , x ; T, y) + 21 ∂y σ 2 (y)Γ(t , x ; T, y)
I
h i
= –µ(y)Γ(t , x ; T, y) + 21 ∂y σ 2 (y)Γ(t , x ; T, y)
" ∂I #
Γ(t , x ; T, y) 1 2 Γ(t , x ; T, y)
= –µ(y)m(y) + 2 ∂y σ (y)m(y)
m(y) m(y) ∂I
" #
2µ(y) 2µ(y) Γ(t , x ; T, y) 2µ(y) Γ(t , x ; T, y)
Z Z
= – 2 exp dy 2 + ∂y exp dy 2
σ (y) σ (y) m(y) σ (y) m(y) ∂I
" #
2µ(y) Γ(t , x ; T, y)
Z
= exp dy 2 ∂y
σ (y) m(y) ∂I
" #
1 Γ(t , x ; T, y)
= ∂y .
s(y) m(y) ∂I
Since Γ(t , x ; T, ·) ∈ dom(A∗ ), if follows that A∗ acts on functions f that satisfy ∂y (f (y)/m(y)) = 0 at
y = l and y = r . But we are interested in the domain of A. Recall that A∗ is the L2 (I, dx ) adjoint of A
– not the L2 (I, m) adjoint of A. Thus, we seek boundary conditions for A such that hf , Agi = hA∗ f , gi.
Note that
!
Z
1 1
hf , Agi = dx f (x ) ∂x ∂x g(x )
I m(x ) s(x )
! ! " #
Z
f (x ) 1 f (x ) 1
= – dx ∂x · ∂x g(x ) + ∂x g(x )
I m(x ) s(x ) m(x ) s(x ) ∂I
! " #
Z
1 f (x ) f (x ) 1 g(x ) f (x )
= dx ∂x ∂x · g(x ) + ∂x g(x ) – ∂x
I s(x ) m(x ) m(x ) s(x ) s(x ) m(x ) ∂I
" #
∗ f (x ) 1 g(x ) f (x )
= hA f , gi + ∂x g(x ) – ∂x .
m(x ) s(x ) s(x ) m(x ) ∂I
Since
" #
f (x )
∂x = 0, ∀ f ∈ dom(A∗ ),
m(x ) ∂I
in order for hf , Agi = hA∗ f , gi, we must have
" #
1
∂x g(x ) = 0, ∀ g ∈ dom(A).
s(x ) ∂I
9.5. IN DEPTH LOOK: SCALAR TIME-HOMOGENOUES DIFFUSIONS 157
To summarize: If the endpoints of I are regular, we must specify if the boundary is killing or reflecting.
The correct boundary conditions to impose for the generator A are
1
killing : f (x ) = 0, reflecting : ∂x f (x ) = 0.
∂I s(x ) ∂I
Although we will not derive it, the dom(A) should also satisfy certain boundary conditions at exit and
entrance boundaries
1
exit : f (x ) = 0, entrance : ∂x f (x ) = 0.
∂I s(x ) ∂I
No boundary conditions are required for natural boundaries.
Eigenfunction expansions
Consider the following eigenvalue equation
Aψn = λn ψn , ψn ∈ dom(A).
Here, and throughout this subsection, we have implicitly assumed that the spectrum of A is discrete.
Theorem 9.5.4. Suppose the eigenfunctions of A are complete in L2 (I, m) and ϕ ∈ L2 (I, m). Then
the function u(t , x ) := E[ϕ(XT )|Xt = x ] is given by
Proof. We must show that the function u satisfies the KBE. We have
(∂t + A)u(t , x ) = hψn , ϕim ∂t e(T–t )λn ψn (x ) + e(T–t )λn Aψn (x )
X
n
hψn , ϕim –λn e(T–t )λn ψn (x ) + e(T–t )λn λn ψn (x ) = 0.
X
=
n
X
u(T, x ) = hψn , ϕim ψn (x ) = ϕ(x ).
n
where, to establish the terminal condition u(T, x ) = ϕ(x ) we have used the fact that the eigenfunctions
(ψn )n≥0 are complete in L2 (I, m).
158 CHAPTER 9. SDES AND PDES
Corollary 9.5.5. The transition density Γ(t , x ; T, y) of X has the following eigenfunction expansion
Proof. The transition density Γ(t , x ; T, y) satisfies the KBE with a terminal condition Γ(T, x ; T, y) =
δy (x ). Setting ϕ = δy in (9.27),we obtain
Example 9.5.6 (Brownian motion in a finite interval). Consider a diffusion X = (Xt )t ≥0 that
lives on a finite interval (l, r ) and satisfies the SDE
dXt = dWt .
We identify the drift µ(x ) = 0 and the diffusion coefficient σ(x ) = 1. The generator A of X and the
speed density m are given by
1
A = 12 ∂x2 , m(x ) = .
r –l
One can easily check that the endpoints l and r are regular. As such, we must specify the behavior at
the endpoints as either killing or reflecting. In the case of killing boundaries we have
which has a state space I = R. We identify the drift µ(x ) = –x and volatility coefficient σ(x ) = 1. The
generator and speed density of X are
2
A = –x ∂x + 12 ∂x2 , m(x ) = e–x .
One can easily verify that the endpoints l = –∞ and r = +∞ are natural. As such, we do not need to
specify any boundary conditions. The eigenfunctions and eigenvalues of A are given by
ψn (x ) = Hn (x ), λn = –n, n = 0, 1, 2, . . . ,
where (Hn )n≥0 are the Hermite polynomials, properly normalized so that hψn , ψk im = δn,k .
Heuristic computations
We have used the fact that A is self-adjoint in L2 (I, m) in order to derive to write an eigenfunction
expansion for u(t , x ) = E[ϕ(XT )|Xt = x ], the solution of the KBE. In fact, this is a special case of a
more general method of solving PDEs and ODEs involving a self-adjoint operator A. In what follows,
we continue to assume that the generator A of a diffusion X has a discrete spectrum, and that the
eigenfunctions of A are a complete basis in L2 (I, m).
We call (9.29) the eigenfunction or spectral representation of g(A). For example, taking g(λ) = 1 gives
the spectral representation of the identity operator
P(t , T)· = e(T–t )λn hψn , ·im ψn , dom(P(t , T)) := {f ∈ L2 (I, m) : e(T–t )λn hψn , f i2m < ∞},
X X
n n
160 CHAPTER 9. SDES AND PDES
Note that P(t , T) as defined above is in fact the semigroup operator we introduced previously. To see
this note thate
n
Z
e(T–t )λn
X
= dy ψn (y)ϕ(y)m(y)ψn (x )
n I
Z
e(T–t )λn ψn (y)ψn (x ) ϕ(y)
X
= dy m(y)
I n
Z
= dy Γ(t , x ; T, y)ϕ(y),
I
The usefulness in defining operators of the form g(A) is as follows. Consider an linear ODE or PDE for
a function u involving the operator A. One can obtain a solution to this ODE or PDE as follows:
1. Solve the ODE or PDE assuming A is a constant; the solution will involve terms of the form g(A)f .
2. Replace g(A)f with its spectral representation.
Example 9.5.8. Let us check that the above two-step method works for the KBE. We want to solve
where ϕ ∈ L2 (I, m). Treating A like a constant, we have an ODE in t . Solving this ODE and then using
the spectral representation any expression involving A we obtain
which agrees with the expression given in (9.27). Noting that e(T–t )A = P(t , T) we can write the solution
to (9.30) compactly as u(t , ·) = P(t , T)ϕ.
where ϕ ∈ L2 (I, m) and g(s, ·) ∈ L2 (I, m) for all s ∈ [0, T]. Treating A as a constant, we have an ODE
in t . The solution is given by
Z T
u(t , ·) = e(T–t )A ϕ + ds e(s–t )A g(s, ·).
t
9.5. IN DEPTH LOOK: SCALAR TIME-HOMOGENOUES DIFFUSIONS 161
Again, recognizing that e(s–t )A = P(t , s) we can write the solution of (9.31) as
Z T
u(t , ·) = P(t , T)ϕ + ds P(t , s)g(s, ·).
t
(A – µ)u = g. (9.32)
Treating A as a contant, we have an algebraic equation for u. Solving this, and using the spectral
representation for any expression involving A we obtain
1 X 1
u= = hψn , gim ψn .
A–µ n λn – µ
1 = R we can write the solution u of (9.32) compactly as u = R g. Note that his
Noting that A–µ µ µ
solution makes sense only if g ∈ dom(Rµ ). In particular, this means that we must have λn 6= µ for all n.
By construction, the scale function S is one-to-one. It is interesting to note that the process S(X) =
(S(Xt ))t ≥0 is a martingale, as the following computation shows
Now, consider two regular points {l} and {r } of a scalar diffusion X. Let τl and τr be the first hitting
times of X to l and r , respectively
We wish to compute P(τl < τr |X0 = x ), the probability that X hits l prior to hitting r . Let us define
τ := τl ∧ τr .
As the process S(X) is a martingale it follows from Theorems 2.4.3 and 2.4.5 that
S(x ) = ES(Xτ ) = P(τl < τr |X0 = x )S(l) + P(τr < τl |X0 = x )S(r ).
Throughout this section, we consider d-dimensional diffusion process X = (Xt )t ≥0 that satisfies the
following SDE
µ : R+ × Rd → Rd , σ : R+ × Rd → Rd×m
+ .
9.6. EXTENSIONS TO HIGHER DIMENSIONS 163
SDE (9.36) has a unique strong solution, which is square integrable for all t (i.e., EX2t < ∞) and adapted
to the filtration F = (Ft )t ≥0 generated by W (i.e., Xt ∈ Ft ) if the following are satisfies
d X
m d d
|σ|2 2, |µ|2 µ2i , |x |2 xi2 .
X X X
:= σij := :=
i =1 j =1 i =1 i =1
d d X
d
A(t ) = µi (t , x )∂xi + 12 (σσ T )i ,j (t , x )∂xi ∂xj ,
X X
i =1 i =1 j =1
d d X d
A∗ (t ) = – 1 ∂yi ∂yj (σσ T )i ,j (t , y).
X X
∂yi µi (t , y) + 2
i =1 i =1 j =1
The transition density Γ, defined by Γ(t , x ; T, y)dy = P(XT ∈ dy|Xt = x ) satisfies the KBE in the
backwards variables (t , x ) and the KFE in the forward variables (T, y)
If we define a function u : R+ × R → R by
Z T
e–A(t ,T) ϕ(XT ) + –A(t ,s)
u(t , x ) := E e g(s, Xs )ds Xt =x ,
t
Then u satisfies
∂t – γ(t , ·) + A(t ) u(t , ·) + g(t , ·) = 0, u(T, ·) = ϕ.
164 CHAPTER 9. SDES AND PDES
Lastly, consider the time-homogenous diffusion: µ(t , x ) = µ(x ) and σ(t , x ) = σ(x ). Let D ⊂ Rd be an
open, connect set. Denote by ∂D the boundary of D. Assume X0 ∈ D and define the hitting time
τD := inf{t ≥ 0 : Xt ∈
/ D}.
Then u satisfies
(A – λ)u + g = 0, in D,
u = ϕ, on ∂D.
9.7 Exercises
Exercise 9.1. A (one-dimensional) backward stochastic differential equation (BSDE), defined on a
probability space filtered probability space (Ω, F, F = (Ft )t ≥0 , P), is an equation of the form
where W = (Wt )0≤t ≤T is an (P, F)-Brownian motion, the process Y = (Yt )0≤t ≤T lives in R, the process
Z = (Zt )0≤t ≤T lives in R, the driver f : Ω × [0, T] × R × R → R, satisfies f (ω, t , Yt , Zt ) ∈ Ft for
all t ∈ [0, T] and the random variable ξ is FT -measurable. A solution of BSDE (9.39) is any pair of
F-adapted processes (Y, Z) such that the terminal condition YT = ξ is satisfied. A forward-backward
stochastic differential equation (FBSDE), is a BSDE of the form
where the processes W, Y, and Z are as described above, the process X = (Xt )0≤t ≤T lives in R, and the
functions µ, σ, f and ϕ are maps
µ : [0, T] × R × R × R → R, σ : [0, T] × R × R → R,
f : [0, T] × R × R × R → R, ϕ : R → R.
In general, the coefficient σ in (9.40) could depend on Z as well. However, for simplicity, we do not
consider this case here. FBSDEs naturally arise in mathematical finance where components of X are
9.7. EXERCISES 165
assets in a market, components of Y are values of hedging portfolios, and components Z are the associated
hedges.
We wish to solve FBSDE (9.40) – meaning we wish to find a pair of F-adapted processes (Y, Z) such
that YT = ϕ(XT ). Let us supposed that Yt = u(t , Xt ) for some function u : [0, T] × R → R, which is to
be determined. Let us further suppose that u ∈ C1,2 . Compute du(t , Xt ) and compare your result to
the expression given for dYt in (9.40). Conclude that, if the function u satisfies the semilinear PDE
Define
" Z T ! #
u(t , x ) := E exp – Xs ds Xt =x .
t
Derive a PDE for the function u. To solve the PDE for u, try a solution of the form
where A and B are deterministic functions of t . Show that A and B must satisfy a pair of coupled ODEs
(with appropriate terminal conditions at time T). Bonus question: solve the ODEs (it may be helpful
to note that one of the ODEs is a Riccati equation).
(i ) b (i ) 1 (i )
dXt = – Xt dt + σdWt ,
2 2
166 CHAPTER 9. SDES AND PDES
d d Z t 1
(i )
(Xt )2 , √ X(i ) (i )
X X
Rt := Bt := s dWs .
i =1 i =1 0 Rs
Show that B is a Brownian motion. Derive an SDE for R that involves only dt and dBt terms (i.e., no
(i )
dWt terms should appear).
Derive a PDE for the function u. Let û be the Fourier transform of u in the x variable,
\[
\hat u(t,\xi,z) = \frac{1}{2\pi} \int_R dx\, e^{-i\xi x}\, u(t,x,z).
\]
Show that û satisfies a PDE in (t, z) with a terminal condition û(T, ξ, z) = φ̂(ξ), where φ̂ is the Fourier transform of ϕ. Assume that û is of the form
\[
\hat u(t,\xi,z) = \hat\varphi(\xi)\, e^{A(t,\xi) + z B(t,\xi)}.
\]
Show that A and B satisfy a pair of coupled ODEs in t (with appropriate terminal conditions at time T).
Bonus question: solve the ODEs (it may be helpful to note that one of the ODEs is a Riccati equation).
Exercise 9.5. Consider a diffusion X = (Xt)t≥0 that lives on a finite interval (l, r), 0 < l < r < ∞, and satisfies the SDE
\[
dX_t = \mu X_t\,dt + \sigma X_t\,dW_t.
\]
One can easily check that the endpoints l and r are regular (you do not have to prove it here). Assume both endpoints are killing. Find the transition density Γ(t, x; T, y) of X.
Exercise 9.6. Consider two diffusion processes X = (Xt)t≥0 and Y = (Yt)t≥0 that satisfy the SDEs
\[
dX_t = dW_t^1, \qquad dY_t = dW_t^2,
\]
where W^1 and W^2 are two independent Brownian motions. Define a function u as follows
Exercise 9.7. Suppose W = (W_t^1, W_t^2, ..., W_t^d)t≥0 is a d-dimensional Brownian motion. Define
\[
R_t = \Big( \sum_{i=1}^d (W_t^i)^2 \Big)^{1/2}.
\]
Clearly, R lives in an interval I with endpoints {0} and {∞}. Show that, when d = 1, the origin is a regular endpoint. Show that, when d ≥ 2, the origin is an entrance point.
Chapter 10
Jump diffusions
Notes from this chapter are taken primarily from (Øksendal and Sulem, 2005, Chapter 1). Notes for
Section 10.5 on Hawkes processes follow Hawkes (1971).
Definition 10.1.1 (Lévy process). A stochastic process η = (ηt)t≥0 is called a Lévy process if:
1. η0 = 0.
2. Independent increments: for any 0 ≤ t1 < t2 < t3 < t4 < ∞, we have ηt4 – ηt3 ⊥⊥ ηt2 – ηt1 .
3. Stationary increments: for any 0 ≤ t1 < t2 < ∞, we have ηt2 – ηt1 ∼ ηt2 –t1 .
4. Continuity in probability: for any ε > 0 and t ≥ 0, we have lim_{s↘0} P(|η_{t+s} − η_t| > ε) = 0.
Note, Item 4 in Definition 10.1.1 does not mean that a Lévy process cannot jump. For example, consider a Poisson process N = (Nt)t≥0 with intensity λ. A Poisson process is a jump process, and yet it is easy to see that it is continuous in probability as, for any ε ∈ (0, 1), we have
\[
P(|N_{t+s} - N_t| > \varepsilon) = P(N_{t+s} - N_t \geq 1) = 1 - e^{-\lambda s} \to 0, \qquad s \searrow 0.
\]
Item 4 simply means that, at a fixed t, the probability that a Lévy process has a discontinuity at t is zero.
We can and do assume that any Lévy process is right-continuous with left limits (RCLL). That is,
\[
\lim_{s \searrow 0} \eta_{t+s} = \eta_t, \qquad \forall\, t \geq 0.
\]
A process that is RCLL is sometimes called càdlàg (for those who speak French: continue à droite,
limite à gauche). We have already encountered two examples of Lévy processes.
Example 10.1.3. A Poisson process N = (Nt )t ≥0 with intensity λ is a pure jump Lévy process with
Nt +s – Nt ∼ Poi(λs).
A filtration for a Lévy process is defined just as a filtration for a Brownian motion.
Definition 10.1.4. Let (Ω, F, P) be a probability space on which a Lévy process η = (ηt)t≥0 is defined. A filtration for the Lévy process η is a collection of σ-algebras F = (Ft)t≥0 satisfying: (i) information accumulates, i.e., Fs ⊆ Ft for 0 ≤ s ≤ t; (ii) adaptivity, i.e., ηt ∈ Ft for all t ≥ 0; and (iii) independence of future increments, i.e., η_u − η_t ⊥⊥ Ft for 0 ≤ t < u.
The most natural choice for this filtration F is (not surprisingly) the natural filtration for η, that is,
Ft = σ(ηs , 0 ≤ s ≤ t ). In principle the filtration F could contain more than the information obtained by
observing η. However, the information in the filtration is not allowed to destroy the independence of
future increments of the Lévy process.
The jump of η at time t is
\[
\Delta\eta_t := \eta_t - \eta_{t-}, \qquad \eta_{t-} := \lim_{s \nearrow t} \eta_s,
\]
and the Poisson random measure N associated with η counts these jumps:
\[
N(t, U, \omega) := \sum_{0 < s \leq t} 1_{\{\Delta\eta_s(\omega) \in U\}}.
\]
As with most random variables and stochastic processes, we will typically omit the dependence of N on ω, writing simply N(t, U) as opposed to N(t, U, ω). The Poisson random measure N(t, U) counts the number of jumps of size ∆ηs ∈ U prior to time t. It will be convenient to introduce the differential form N(dt, dz), which counts the number of jumps of size dz over the time interval dt.
Definition 10.1.7. Let N be the Poisson random measure of a Lévy process η. We define ν : B_0^d → R+, the Lévy measure of η, as follows:
\[
\nu(U) := E\, N(1, U), \qquad U \in B_0^d.
\]
Theorem 10.1.8. Let U ∈ Bd0 . Then the process (N(t , U))t ≥0 is a Poisson process with intensity
ν(U).
Example 10.1.9 (Compound Poisson process). Let (Xn )n∈N be a sequence of iid random vectors in
Rd with distribution FX . Let (Pt )t ≥0 be a Poisson process with intensity λ. Assume (Pt )t ≥0 ⊥
⊥ (Xn )n∈N .
We define a compound Poisson process η = (ηt)t≥0 by
\[
\eta_t = \sum_{n=1}^{P_t} X_n.
\]
The increments of η are given by
\[
\eta_t - \eta_s = \sum_{n=P_s+1}^{P_t} X_n.
\]
The distribution F_{ηt−ηs} depends only on (t − s) and FX. As such, we see that η has stationary increments. Also, non-overlapping increments of η are clearly independent, as they depend on different (Xi). Thus, η is a Lévy process in R^d. Let us find the Lévy measure ν corresponding to η. For any U ∈ B_0^d we have
\[
\nu(U) = E\,N(1,U) = E \sum_{s\,:\,0 < s \leq 1} 1_{\{\Delta\eta_s \in U\}}
= E \sum_{n=1}^{P_1} 1_{\{X_n \in U\}}
= E\, E\Big[ \sum_{n=1}^{P_1} 1_{\{X_n \in U\}} \,\Big|\, P_1 \Big]
= E \sum_{n=1}^{P_1} P(X_n \in U)
= F_X(U)\, E\,P_1 = \lambda\, F_X(U).
\]
A pure-jump Lévy process η has a finite Lévy measure ν(R^d) < ∞ if and only if it can be represented by a compound Poisson process. In this case, we can express η in one of two ways:
\[
\eta_t = \int_{R^d} z\, N(t, dz), \qquad \text{or} \qquad \eta_t = \sum_{n=1}^{P_t} X_n. \tag{10.1}
\]
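The compound Poisson representation in (10.1) is straightforward to simulate. The following minimal sketch (not part of the notes; the unit-variance Gaussian jump distribution and all parameter values are illustrative assumptions) simulates many copies of η1 and checks empirically that ν(U) = E N(1, U) = λ FX(U) for U = [1, ∞).

```python
import numpy as np
from scipy.stats import norm

# Sketch: compound Poisson process eta_t = sum_{n=1}^{P_t} X_n with intensity
# lam and iid N(0,1) jumps; empirically verify nu(U) = lam * F_X(U).
rng = np.random.default_rng(0)
lam, n_paths = 2.0, 100_000

counts = np.zeros(n_paths)
for i in range(n_paths):
    p1 = rng.poisson(lam)              # P_1 ~ Poi(lam * 1)
    jumps = rng.standard_normal(p1)    # X_1, ..., X_{P_1} iid N(0, 1)
    counts[i] = np.sum(jumps >= 1.0)   # N(1, U) with U = [1, infinity)

print("empirical E N(1,U):", counts.mean())
print("lam * F_X(U)      :", lam * (1 - norm.cdf(1.0)))
```

The two printed numbers should agree up to Monte Carlo error, illustrating that the expected number of jumps per unit time with sizes in U is exactly λ FX(U).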
However, there exist Lévy processes for which ν(R^d) = ∞. We call a Lévy process for which ν(R^d) = ∞ an infinite activity Lévy process. For an infinite activity Lévy process, neither of the representations in (10.1) makes sense. To write the most general form of a Lévy process, we introduce the compensated
Poisson random measure.
Definition 10.1.10. Let N be a Poisson random measure with associated Lévy measure ν. The
compensated Poisson random measure, denoted Ñ, is defined as
\[
\tilde N(t, A) := N(t, A) - \nu(A)\,t,
\]
or, in differential form,
\[
\tilde N(dt, dz) := N(dt, dz) - \nu(dz)\,dt.
\]
For s ≤ t we compute
\[
E[\tilde N(t,A) \,|\, F_s] = \tilde N(s,A) + E[\tilde N(t,A) - \tilde N(s,A) \,|\, F_s]
= \tilde N(s,A) + E[N(t,A) - N(s,A) \,|\, F_s] - \nu(A)(t-s)
= \tilde N(s,A) + \nu(A)(t-s) - \nu(A)(t-s) = \tilde N(s,A),
\]
where we have used E N(t, A) = tν(A). It follows that, for any fixed R > 0, the process M^{(k)} = (M_t^{(k)})t≥0, defined as
\[
M_t^{(k)} := \int_{1/k \leq |z| < R} z\, \tilde N(t, dz), \qquad k = 1, 2, \ldots,
\]
is a (d-dimensional) martingale satisfying E|M_t^{(k)}|² < ∞. One can show that, as k → ∞, the sequence of processes (M^{(k)})k∈N converges in L²(Ω, F, P) to a process M = (Mt)t≥0 defined as
\[
M_t \equiv \int_{|z|<R} z\,\tilde N(t,dz) := \lim_{k\to\infty} \int_{1/k \leq |z| < R} z\,\tilde N(t,dz).
\]
In general, we cannot separate this integral into an integral with respect to N(t, dz) and an integral with respect to tν(dz). The reason is that we may have ∫_{|z|<R} |z| ν(dz) = ∞, and in this case, we have
\[
\int_{|z|<R} z\,\tilde N(t,dz) \neq \int_{|z|<R} z\, N(t,dz) - \int_{|z|<R} z\,\nu(dz)\,t.
\]
The following theorem gives the most general form of a Lévy process: for some µ ∈ R^d, σ ∈ R^{d×d} and R ∈ [0, ∞], every Lévy process η can be decomposed as
\[
\eta_t = \mu t + \sigma W_t + \int_{|z|<R} z\,\tilde N(t,dz) + \int_{|z|\geq R} z\, N(t,dz). \tag{10.2}
\]
We can always choose R = 1 in (10.2). It is useful to recognize when we can choose R = 0 and R = ∞ as, in these cases, the right-hand side of (10.2) reduces from four to three terms.
If
\[
\nu(R^d) < \infty,
\]
then (10.4) holds, and we can write (10.5) as a compound Poisson process
\[
\eta_t = \mu_0\, t + \sigma W_t + \sum_{n=1}^{P_t} X_n,
\]
where P is a Poisson process with intensity ν(R^d) and the (Xn)n∈N are iid random vectors in R^d with common distribution FX = ν/ν(R^d).
This should come as no surprise. We have already seen that a Brownian motion and a Poisson process are Markov processes. As these processes serve as building blocks for more general Lévy processes, it follows that Lévy processes are Markov as well.
Theorem (Lévy–Khintchine). Let η be a Lévy process of the form (10.2), where µ ∈ R^d, σ ∈ R_+^{d×d}, W is a d-dimensional Brownian motion and N is a Poisson random measure on R^d. Then the Lévy measure ν associated with N satisfies
\[
\int_{R^d} (1 \wedge |z|^2)\,\nu(dz) < \infty, \tag{10.6}
\]
and the characteristic function of ηt is given by
\[
E\,e^{i u \cdot \eta_t} = e^{t \psi(u)}, \tag{10.7}
\]
\[
\psi(u) = i\,u\cdot\mu - \tfrac12\, u^T a\, u + \int_{|z|<R} \big(e^{i u\cdot z} - 1 - i\,u\cdot z\big)\nu(dz) + \int_{|z|\geq R} \big(e^{i u\cdot z} - 1\big)\nu(dz), \tag{10.8}
\]
where a = σσ^T. Conversely, given a triplet (µ, a, ν) with a = σσ^T and ν satisfying (10.6), there exists a Lévy process η satisfying (10.7)–(10.8).
Proof. We will show why (10.7)–(10.8) hold for a scalar Lévy process of the compound Poisson type. In this case, η has the decomposition
\[
\eta_t = \mu t + \sigma W_t + \sum_{n=1}^{P_t} X_n,
\]
where W is a scalar Brownian motion, P is a Poisson process with intensity λ and the (Xn )n∈N are iid
random variables with common distribution FX. We compute
\[
E\,e^{iu\eta_t} = E\,e^{iu\mu t + iu\sigma W_t + iu\sum_{n=1}^{P_t}X_n}
= e^{iu\mu t}\, E\,e^{iu\sigma W_t}\, E\,e^{iu\sum_{n=1}^{P_t}X_n}
= e^{iu\mu t - \frac12\sigma^2u^2t}\, E\,E\big[e^{iu\sum_{n=1}^{P_t}X_n}\,\big|\,P_t\big]
\]
\[
= e^{iu\mu t - \frac12\sigma^2u^2t}\, E\big(E[e^{iuX}]\big)^{P_t}
= e^{iu\mu t - \frac12\sigma^2u^2t}\, E\Big(\int e^{iuz}F_X(dz)\Big)^{P_t}
= e^{iu\mu t - \frac12\sigma^2u^2t}\exp\Big(\lambda t\Big(\int e^{iuz}F_X(dz) - 1\Big)\Big)
\]
\[
= e^{iu\mu t - \frac12\sigma^2u^2t}\exp\Big(\lambda t\int(e^{iuz}-1)F_X(dz)\Big)
= e^{t\psi(u)},
\]
where we have used E s^{P_t} = exp(λt(s − 1)) and ν = λFX. The computation for a more general Lévy process in multiple dimensions is similar.
Let us provide a few examples of Lévy measures ν and compute the integrals that appear in expression
(10.8) for the characteristic exponent ψ.
Example 10.1.18 (Dirac comb). A Dirac comb Lévy measure on R is a measure of the form
\[
\nu(dz) = \sum_{i=1}^n \lambda_i\,\delta_{z_i}(z)\,dz,
\]
where λi is the intensity of a jump of size zi. In this case, both (10.3) and (10.4) are satisfied, so we can choose either R = 0 or R = ∞. Suppose we choose R = 0. Then the third term in (10.8) disappears and the last term becomes
\[
\int_R (e^{iuz}-1)\,\nu(dz) = \sum_{i=1}^n \lambda_i\,\big(e^{iuz_i}-1\big).
\]
Example 10.1.19 (Gaussian jumps). A Gaussian Lévy measure on R is a measure of the form
\[
\nu(dz) = \lambda\,\frac{1}{\sqrt{2\pi s^2}}\exp\Big(\frac{-(z-m)^2}{2s^2}\Big)\,dz,
\]
where m is the mean jump size, s² is the variance of the jumps, and the intensity of all jumps is ν(R) = λ. In this case, both (10.3) and (10.4) are satisfied, so we can choose either R = 0 or R = ∞. Suppose we choose R = 0. Then the third term in (10.8) disappears and the last term becomes
\[
\int_R (e^{iuz}-1)\,\nu(dz) = \lambda\big(e^{ium - \frac12 s^2u^2} - 1\big).
\]
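The Gaussian-jump formula above is easy to verify numerically. The following sketch (not from the notes; parameter values are illustrative assumptions) integrates (e^{iuz} − 1) against the Gaussian Lévy measure on a grid and compares the result with the closed form.

```python
import numpy as np

# Sketch: check int (e^{iuz} - 1) nu(dz) = lam * (exp(i*u*m - 0.5*s^2*u^2) - 1)
# for the Gaussian Levy measure nu(dz) = lam * N(m, s^2)(dz).
lam, m, s, u = 1.5, 0.3, 0.7, 2.0
z = np.linspace(m - 10 * s, m + 10 * s, 200_001)
dens = lam * np.exp(-((z - m) ** 2) / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
integral = np.sum((np.exp(1j * u * z) - 1.0) * dens) * (z[1] - z[0])
closed_form = lam * (np.exp(1j * u * m - 0.5 * s**2 * u**2) - 1.0)
print(integral, closed_form)   # the two agree to high accuracy
```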
Example 10.1.20 (Generalized tempered stable jumps). A generalized tempered stable Lévy measure on R is a measure of the form
\[
\nu(dz) = \frac{A_-\, e^{-\beta_-|z|}}{|z|^{1+\alpha_-}}\,1_{\{z<0\}} + \frac{A_+\, e^{-\beta_+ z}}{|z|^{1+\alpha_+}}\,1_{\{z>0\}},
\]
where the parameters (A±, α±, β±) are all positive. In this case (10.3) is satisfied but (10.4) is not. Thus, we must choose R = ∞. The fourth term in (10.8) then disappears and the third term becomes
\[
\int_R (e^{iuz}-1-iuz)\,\nu(dz)
= A_-\,\Gamma(-\alpha_-)\,\beta_-^{\alpha_-}\Big(\Big(1+\frac{iu}{\beta_-}\Big)^{\alpha_-} - 1 - \frac{iu\alpha_-}{\beta_-}\Big)
+ A_+\,\Gamma(-\alpha_+)\,\beta_+^{\alpha_+}\Big(\Big(1-\frac{iu}{\beta_+}\Big)^{\alpha_+} - 1 + \frac{iu\alpha_+}{\beta_+}\Big),
\]
assuming α± ≠ 0 and α± ≠ 1. The above integral can also be computed explicitly for the cases α± = 0 and α± = 1 (see, e.g., (Cont and Tankov, 2004, Proposition 4.2)).
Assumption 10.1.21. From this point onward we assume E|ηt|² < ∞, and thus E|ηt| < ∞. Thus, we can take R = ∞ and express η as follows:
\[
\eta_t = \mu t + \sigma W_t + \int_{R^d} z\,\tilde N(t,dz). \tag{10.9}
\]
Lévy processes are semimartingales, which (roughly speaking) form the class of stochastic processes that can be used to construct an Itô integral. In Section 8.2 we defined the Itô integral of an (Ft)t≥0-adapted process Γ = (Γt)t≥0 with respect to a Brownian motion W = (Wt)t≥0 by introducing a sequence of simple processes Γ^{(n)} = (Γ_t^{(n)})t≥0 that converged to Γ in the following sense:
\[
\Gamma_t^{(n)} := \sum_{j=0}^{n-1} \Gamma_{t_j}\,1_{\{t_j \leq t < t_{j+1}\}}, \qquad 0 \leq t_0 < t_1 < \ldots < t_n = T, \qquad \lim_{n\to\infty} E\int_0^T |\Gamma_t^{(n)} - \Gamma_t|^2\,dt = 0.
\]
We can define the Itô integral of Γ with respect to a Lévy process η in the same manner:
\[
\int_0^T \Gamma_t\,d\eta_t := \lim_{n\to\infty} \int_0^T \Gamma_t^{(n)}\,d\eta_t = \lim_{n\to\infty} \sum_{j=0}^{n-1} \Gamma_{t_j}\,\big(\eta_{t_{j+1}} - \eta_{t_j}\big).
\]
Here, we must add a technical condition that the integrand Γ be a left-continuous process. If a process Γ is not left-continuous, then we can still define a stochastic integral as follows:
\[
I_T = \int_0^T \Gamma_{t-}\,d\eta_t, \qquad \Gamma_{t-} = \lim_{s\nearrow t}\Gamma_s,
\]
where (Γt−)t≥0 is the left-continuous version of (Γt)t≥0. Note that, if Γ is left-continuous, then Γt = Γt− for all t ≥ 0. The process (It)t≥0 will still be a right-continuous process as η is right-continuous. In view of (10.9), for a left-continuous process Γ, we can separate an Itô integral into three terms:
\[
\int_0^T \Gamma_t\,d\eta_t = \int_0^T \Gamma_t\,\mu\,dt + \int_0^T \Gamma_t\,\sigma\,dW_t + \int_0^T\!\!\int \Gamma_t\,z\,\tilde N(dt,dz). \tag{10.10}
\]
Expression (10.10) suggests that we consider general processes of the form
\[
dX_t = \mu_t\,dt + \sigma_t\,dW_t + \int \gamma_t(z)\,\tilde N(dt,dz), \tag{10.11}
\]
where the processes (µt)t≥0, (σt)t≥0 and (γt−(z))t≥0 must be adapted to a filtration F = (Ft)t≥0 obtained by observing the processes W and ∫ z N(·, dz). Note that we have written things in differential form; to obtain the integral form, simply integrate and add an initial condition. We have been somewhat vague about the dimension of the various objects appearing in (10.11). In general, for X ∈ R^d, we could have
\[
\mu : R_+\times\Omega \to R^d, \qquad \sigma : R_+\times\Omega \to R^{d\times n}, \qquad \gamma : R_+\times R^m\times\Omega \to R^{d\times k}.
\]
We showed (for the scalar case) in Theorem 8.1.5 that if σ satisfies E∫_0^T |σt|²dt < ∞, then the Itô integral ∫_0^t σs dWs is a martingale. Similarly, one can show that if γ satisfies
\[
E\int_0^T\!\!\int_R |\gamma_t(z)|^2\,\nu(dz)\,dt < \infty,
\]
then the process M = (Mt)0≤t≤T, defined by
\[
M_t := \int_0^t\!\!\int_R \gamma_s(z)\,\tilde N(ds,dz),
\]
is a martingale. It follows that if µ = 0 in (10.11), and σ and γ(z) satisfy the above square integrability conditions, then the process X = (Xt)t≥0 is a martingale. In the scalar version of the Itô formula below, the coefficients are maps
\[
\mu : R_+\times\Omega \to R, \qquad \sigma : R_+\times\Omega \to R_+, \qquad \gamma : R_+\times\Omega\times R \to R.
\]
We are not going to prove Theorem 10.2.1. However, we will attempt to understand why the formula is correct. Suppose for simplicity that ν(R) < ∞. In this case we have a finite activity Lévy process, and we can separate the compensated Poisson random measure Ñ into two parts, N(dt, dz) and ν(dz)dt. In this case, we can rewrite (10.11) as
\[
dX_t = \mu_t'\,dt + \sigma_t\,dW_t + \int_R \gamma_t(z)\,N(dt,dz), \qquad \mu_t' := \mu_t - \int_R \gamma_t(z)\,\nu(dz),
\]
where we can now identify the drift of X as µt′. Similarly, we can write df(Xt) as
\[
df(X_t) = \big(\mu_t f'(X_t) + \tfrac12\sigma_t^2 f''(X_t)\big)dt + \sigma_t f'(X_t)\,dW_t
+ \int_R \big(f(X_{t-}+\gamma_t(z)) - f(X_{t-})\big)N(dt,dz)
\]
\[
\qquad - \int_R \big(f(X_{t-}+\gamma_t(z)) - f(X_{t-})\big)\nu(dz)\,dt
+ \int_R \big(f(X_{t-}+\gamma_t(z)) - f(X_{t-}) - \gamma_t(z)f'(X_t)\big)\nu(dz)\,dt
\]
\[
= \Big(\Big(\mu_t - \int_R \gamma_t(z)\nu(dz)\Big) f'(X_t) + \tfrac12\sigma_t^2 f''(X_t)\Big)dt + \sigma_t f'(X_t)\,dW_t
+ \int_R \big(f(X_{t-}+\gamma_t(z)) - f(X_{t-})\big)N(dt,dz)
\]
\[
= \big(\mu_t'\, f'(X_t) + \tfrac12\sigma_t^2 f''(X_t)\big)dt + \sigma_t f'(X_t)\,dW_t
+ \int_R \big(f(X_{t-}+\gamma_t(z)) - f(X_{t-})\big)N(dt,dz). \tag{10.14}
\]
Things are looking a bit more familiar now. The non-integral terms in (10.14) arise from the µt′dt + σt dWt part of dXt. To understand the integral term in (10.14), suppose there is a jump of size y at time t. Since ν(R) < ∞, there can only be a single jump at time t, and thus
\[
N(dt,dz) = \delta_y(z)\,dz.
\]
It follows that
\[
\Delta X_t := X_t - X_{t-} = \int_R \gamma_t(z)\,N(dt,dz) = \int_R \gamma_t(z)\,\delta_y(z)\,dz = \gamma_t(y).
\]
This last expression agrees with the integral term in (10.14). Of course, if there is no jump at time t, then N(dt, dz) = 0 for all dz, and in this case ∆Xt = ∆f(Xt) = 0.
\[
df(X_t) = d\log X_t
= \Big(bX_t\,\frac{1}{X_t} - \tfrac12 a^2X_t^2\,\frac{1}{X_t^2}\Big)dt + aX_t\,\frac{1}{X_t}\,dW_t
+ \int_R \Big(\log\big(X_{t-} + (e^{c(z)}-1)X_{t-}\big) - \log(X_{t-})\Big)\tilde N(dt,dz)
\]
\[
\qquad + \int_R \Big(\log\big(X_{t-} + (e^{c(z)}-1)X_{t-}\big) - \log(X_{t-}) - \big(e^{c(z)}-1\big)X_t\,\frac{1}{X_t}\Big)\nu(dz)\,dt
\]
\[
= \big(b - \tfrac12 a^2\big)dt + a\,dW_t + \int_R c(z)\,\tilde N(dt,dz) + \int_R \big(c(z) - e^{c(z)} + 1\big)\nu(dz)\,dt
= b'\,dt + a\,dW_t + \int_R c(z)\,\tilde N(dt,dz),
\]
\[
b' = b - \tfrac12 a^2 - \int_R \big(e^{c(z)} - 1 - c(z)\big)\nu(dz).
\]
and assume that σ = (σt)t≥0 and γ(z) = (γt(z))t≥0 are F-adapted. Then
\[
E\,X_T^2 = E\int_0^T \Big(\sigma_t^2 + \int_R \gamma_t^2(z)\,\nu(dz)\Big)dt,
\]
Proof. One could prove Theorem 10.2.3 directly from the definition of the Itô integral. However, we will use the Itô formula instead. Setting f(x) = x² in the Itô formula and taking expectations yields the result; we omit the details.
Theorem 10.2.4. Let X = (Xt)t≥0 be a d-dimensional Lévy–Itô process with components satisfying
\[
dX_t = \mu_t\,dt + \sigma_t\,dW_t + \int_{R^m} \gamma_t(z)\,\tilde N(dt,dz),
\]
\[
\mu : R_+\times\Omega \to R^d, \qquad \sigma : R_+\times\Omega \to R^{d\times n}, \qquad \gamma : R_+\times\Omega\times R^m \to R^{d\times k}.
\]
\[
df(X_t) = \Big(\sum_{i=1}^d \mu_t^{(i)}\,\partial_{x_i}f(X_t) + \tfrac12\sum_{i=1}^d\sum_{j=1}^d(\sigma_t\sigma_t^T)_{i,j}\,\partial_{x_i}\partial_{x_j}f(X_t)\Big)dt + \sum_{i=1}^d\sum_{j=1}^n \sigma_t^{(i,j)}\,\partial_{x_i}f(X_t)\,dW_t^{(j)}
\]
\[
\qquad + \sum_{j=1}^k \int_{R^m} \big(f(X_{t-}+\gamma_t^{(\cdot,j)}(z)) - f(X_{t-})\big)\tilde N^{(j)}(dt,dz)
\]
\[
\qquad + \sum_{j=1}^k \int_{R^m} \big(f(X_{t-}+\gamma_t^{(\cdot,j)}(z)) - f(X_{t-}) - \gamma_t^{(\cdot,j)}(z)\cdot\nabla f(X_{t-})\big)\nu^{(j)}(dz)\,dt.
\]
I would not recommend memorizing the above formula. Simply book-mark this page. If you work with
Lévy-Itô processes for a sufficient amount of time (about 5 years for the author of these notes), you will
eventually have the above formula fixed in your mind.
where the coefficients are maps µt, bt ∈ R, σt, at ∈ R^n, and γt(·), gt(·) : R^m → R^k. Recall the Definition 7.3.4 of quadratic covariation. The quadratic covariation of X and Y up to time T, denoted [X, Y]T, is given by
\[
[X,Y]_T = \int_0^T \sum_{j=1}^n \sigma_t^{(j)}a_t^{(j)}\,dt + \sum_{j=1}^k \int_0^T\!\!\int_{R^m} \gamma_t^{(j)}(z)\cdot g_t^{(j)}(z)\,N^{(j)}(dt,dz)
\]
\[
= \int_0^T \Big(\sum_{j=1}^n \sigma_t^{(j)}a_t^{(j)} + \sum_{j=1}^k \int_{R^m} \gamma_t^{(j)}(z)\cdot g_t^{(j)}(z)\,\nu^{(j)}(dz)\Big)dt + \sum_{j=1}^k \int_0^T\!\!\int_{R^m} \gamma_t^{(j)}(z)\cdot g_t^{(j)}(z)\,\tilde N^{(j)}(dt,dz).
\]
We will not prove Theorem 10.2.5. Rather, we simply note that the above result relies on the following facts:
\[
[W^{(i)}, N^{(j)}(\cdot,A)]_t = 0, \qquad [N^{(i)}(\cdot,A), N^{(j)}(\cdot,B)]_t = \delta_{i,j}\,N^{(i)}(t, A\cap B), \qquad [N^{(i)}(\cdot,B), \mathrm{Id}]_t = 0.
\]
Heuristically, one can derive Theorem 10.2.5 by writing d[X, Y]t = dXt dYt and using the rules
\[
dW_t^{(i)}\,dW_t^{(j)} = \delta_{i,j}\,dt, \qquad \tilde N^{(i)}(dt,dz)\,\tilde N^{(j)}(dt,dy) = \delta_{i,j}\,\delta(z-y)\,N^{(i)}(dt,dz)\,dy,
\]
\[
dW_t^{(i)}\,dt = 0, \qquad dW_t^{(i)}\,\tilde N^{(j)}(dt,dz) = 0, \qquad dt\,dt = 0, \qquad \tilde N^{(i)}(dt,dz)\,dt = 0.
\]
Example 10.2.6. Suppose X is given by (10.16). Then, from Theorem 10.2.5, the quadratic variation of X, denoted [X, X]T, is given by
\[
[X,X]_T = \int_0^T \sum_{j=1}^n \big(\sigma_t^{(j)}\big)^2\,dt + \sum_{j=1}^k \int_0^T\!\!\int_{R^m} \big(\gamma_t^{(j)}(z)\big)^2\,N^{(j)}(dt,dz)
\]
\[
= \int_0^T \Big(\sum_{j=1}^n \big(\sigma_t^{(j)}\big)^2 + \sum_{j=1}^k \int_{R^m} \big(\gamma_t^{(j)}(z)\big)^2\,\nu^{(j)}(dz)\Big)dt + \sum_{j=1}^k \int_0^T\!\!\int_{R^m} \big(\gamma_t^{(j)}(z)\big)^2\,\tilde N^{(j)}(dt,dz).
\]
Example 10.2.7. Some texts define the quadratic covariation of two semimartingales X and Y as the unique process [X, Y]T that satisfies
\[
X_T Y_T = X_0 Y_0 + \int_0^T X_{t-}\,dY_t + \int_0^T Y_{t-}\,dX_t + [X,Y]_T.
\]
Let us check that this definition yields the same result as Theorem 10.2.5. For simplicity, let us take
\[
dX_t = \mu_t\,dt + \sigma_t\,dW_t + \int_{R^m}\gamma_t(z)\,\tilde N(dt,dz), \qquad
dY_t = b_t\,dt + a_t\,dW_t + \int_{R^m} g_t(z)\,\tilde N(dt,dz).
\]
Thus, we have verified that this definition of quadratic covariation agrees (at least, for the processes in this example) with Theorem 10.2.5.
\[
\mu : R_+\times R^d \to R^d, \qquad \sigma : R_+\times R^d \to R^{d\times n}, \qquad \gamma : R_+\times R^d\times R^m \to R^{d\times k},
\]
with X0 = x ∈ R^d. A strong solution of the SDE is a process of the form
\[
X_t = F\big(W_s,\ \tilde N(s,dz);\ 0 \leq s \leq t\big),
\]
such that
\[
X_T = x + \int_0^T \mu(t,X_t)\,dt + \int_0^T \sigma(t,X_t)\,dW_t + \int_0^T\!\!\int_{R^m} \gamma(t,X_{t-},z)\,\tilde N(dt,dz), \qquad \forall\, T \geq 0.
\]
Theorem 10.3.2. Consider the one-dimensional process X driven by a single Brownian motion and a single Poisson random measure on R. If there exist constants C1 and C2 such that the linear growth condition
\[
\mu^2(t,x) + \sigma^2(t,x) + \int_R \gamma^2(t,x,z)\,\nu(dz) < C_1(1+x^2), \qquad \forall\, t \geq 0,
\]
and the Lipschitz condition
\[
\big(\mu(t,x)-\mu(t,y)\big)^2 + \big(\sigma(t,x)-\sigma(t,y)\big)^2 + \int_R \big(\gamma(t,x,z)-\gamma(t,y,z)\big)^2\,\nu(dz) < C_2\,(x-y)^2, \qquad \forall\, t \geq 0,
\]
hold, then SDE (10.17) has a unique strong solution adapted to Ft := σ(Xs : 0 ≤ s ≤ t) (i.e., the solution X of SDE (10.17) is a Markov process). The conditions for higher dimensions are analogous.
Since the solution X of (10.17) is a Markov process, we can define a corresponding semigroup P(t, s) and infinitesimal generator A(t). For a sufficiently nice function ϕ : R^d → R, we have
\[
A(t) = \sum_{i=1}^d \mu_i(t,x)\,\partial_{x_i} + \tfrac12\sum_{i=1}^d\sum_{j=1}^d(\sigma\sigma^T)_{i,j}(t,x)\,\partial_{x_i}\partial_{x_j}
+ \sum_{j=1}^k \int_{R^m} \nu^{(j)}(dz)\Big(\theta_{\gamma^{(\cdot,j)}(t,x,z)} - 1 - \gamma^{(\cdot,j)}(t,x,z)\cdot\nabla\Big),
\]
where θh denotes the shift operator: θh f(x) = f(x + h).
Proof. The proof is the same as in the no-jump case. First, write ϕ(Xs) = ϕ(x) + ∫_t^s dϕ(Xu). Next, use the Itô formula to write dϕ(Xu) explicitly as terms involving du, dWu and Ñ(du, dz). Finally, take an expectation, note that integrals with respect to W and Ñ are martingales, and send s ↘ t. We omit the details.
We can now write the Itô formula for the solution X of Lévy–Itô SDE (10.17) in a more compact form:
\[
d\varphi(X_t) = A(t)\varphi(X_t)\,dt + \nabla\varphi\cdot\sigma(t,X_t)\,dW_t + \int_{R^m}\big(\varphi(X_{t-}+\gamma(t,X_{t-},z)) - \varphi(X_{t-})\big)\tilde N(dt,dz).
\]
Then the function u(t, x) := E[ϕ(XT)|Xt = x] satisfies the Kolmogorov Backward Equation
\[
(\partial_t + A(t))\,u(t,\cdot) = 0, \qquad u(T,\cdot) = \varphi.
\]
Proof. The proof is exactly analogous to the diffusion case. First, show that E[ϕ(XT )|Ft ] = u(t , Xt ) is
a martingale. Take the differential of u(t , Xt ) and set the dt -term equal to zero to obtain the partial
integro-differential equation that must be satisfied by u. The terminal condition is obtained from
E[ϕ(XT )|FT ] = ϕ(XT ) = u(T, XT ).
Example. Let X be a scalar Lévy process of the form (10.9), with drift µ, volatility σ and Lévy measure ν, where Ñ is a one-dimensional compensated Poisson random measure on R. We wish to find an expression for u(t, x) := E[ϕ(XT)|Xt = x], which satisfies the KBE; in this case
\[
(\partial_t + A)\,u(t,\cdot) = 0, \qquad u(T,\cdot) = \varphi, \qquad A = \mu\,\partial_x + \tfrac12\sigma^2\partial_x^2 + \int_R \nu(dz)\,(\theta_z - 1 - z\,\partial_x).
\]
where A* is the L²(R, dx) adjoint of A and ψ(ξ) is the characteristic exponent of X. Specifically,
\[
A^* = -\mu\,\partial_x + \tfrac12\sigma^2\partial_x^2 + \int_R \nu(dz)\,(\theta_{-z} - 1 + z\,\partial_x), \qquad
\psi(\xi) = i\mu\xi - \tfrac12\sigma^2\xi^2 + \int_R \nu(dz)\,(e^{i\xi z} - 1 - i\xi z)
\]
(check that you can derive A* and ψ(ξ) for yourself!). Taking the Fourier transform of the KBE and using the above results, we obtain an ODE in t for û(t, ξ):
\[
(\partial_t + \psi(\xi))\,\hat u(t,\xi) = 0, \qquad \hat u(T,\xi) = \hat\varphi(\xi).
\]
Throughout this section, W will denote a scalar Brownian motion and Ñ a one-dimensional compensated Poisson random measure on R. All processes are scalar, unless specifically stated otherwise.
\[
Z \geq 0, \qquad E\,Z = 1, \qquad \tilde P(A) = E\,Z\,1_A, \quad \forall\, A \in F, \qquad P(A) = \tilde E\,\tfrac{1}{Z}\,1_A, \quad \forall\, A \in F,
\]
and thus, we identify 1/Z = dP/dP̃ as the Radon–Nikodym derivative of P with respect to P̃.
In Chapter 8, we defined a Radon-Nikodým derivative process (Definition 8.5.1) by fixing a time horizon
T and setting
Zt := E[Z|Ft ], 0 ≤ t ≤ T.
then, under P̃, the process
\[
\tilde W_t = W_t + \int_0^t \Theta_s\,ds, \qquad 0 \leq t \leq T,
\]
is a martingale (in fact, a Brownian motion). We are now going to apply this machinery to a Lévy–Itô process.
Theorem 10.4.2 (Girsanov Theorem for Lévy–Itô processes (part I)). On a probability space (Ω, F, P), let F = (Ft)0≤t≤T be a filtration and let X = (Xt)0≤t≤T be a Lévy–Itô process of the form
\[
dX_t = \mu_t\,dt + \sigma_t\,dW_t + \int_R \gamma_t(z)\,\tilde N(dt,dz), \tag{10.19}
\]
where W is a (P, F)-Brownian motion and Ñ is a compensated Poisson random measure with Lévy measure ν. Suppose there exist F-adapted processes β = (βt)0≤t≤T and η(z) = (ηt(z))0≤t≤T such that
\[
\mu_t = \sigma_t\beta_t - \int_R \gamma_t(z)\big(e^{\eta_t(z)} - 1\big)\nu(dz), \tag{10.20}
\]
and assume that (β, η) are such that E ZT < ∞. Set Z = dP̂/dP = ZT. Then X is a local martingale under P̂.
Proof. First, we observe that (Zt)0≤t≤T is a martingale. To see this, simply note that
\[
dZ_t = Z_{t-}\Big( -\beta_t\,dW_t + \int_R \big(e^{\eta_t(z)} - 1\big)\tilde N(dt,dz) \Big).
\]
Thus, we have
\[
Z_t = \exp\Big( -\int_0^t \beta_s\,dW_s - \tfrac12\int_0^t \beta_s^2\,ds + \int_0^t\!\!\int_R \eta_s(z)\,\tilde N(ds,dz) - \int_0^t\!\!\int_R \big(e^{\eta_s(z)} - 1 - \eta_s(z)\big)\nu(dz)\,ds \Big). \tag{10.21}
\]
Since E Z = 1 and, by construction, Z > 0, the random variable Z defines a Radon–Nikodym derivative dP̂/dP. Also, since Zt = E[Z|Ft], we see that (Zt)0≤t≤T is a Radon–Nikodym derivative process. Now, note that (Xt)0≤t≤T is adapted to the filtration F. In light of (10.18), to show that (Xt)0≤t≤T is a martingale under P̂, we need only show that (XtZt)0≤t≤T is a martingale under P. To this end, one computes d(XtZt) and checks that the dt-terms vanish; we omit the details.
Theorem 10.4.3 (Girsanov Theorem for Lévy–Itô processes (part II)). On a probability space (Ω, F, P), let Z = dP̂/dP and (Zt)t≥0 be as defined in Theorem 10.4.2. Define
\[
d\hat W_t := \beta_t\,dt + dW_t, \tag{10.22}
\]
\[
\hat N(dt,dz) := \tilde N(dt,dz) - \big(e^{\eta_t(z)} - 1\big)\nu(dz)\,dt = N(dt,dz) - e^{\eta_t(z)}\,\nu(dz)\,dt. \tag{10.23}
\]
Then Ŵ is a Brownian motion under P̂, and N̂ is a compensated Poisson random measure under P̂ in the sense that the process M = (Mt)0≤t≤T, defined by
\[
M_t := \int_0^t\!\!\int_R \gamma_s(z)\,\hat N(ds,dz),
\]
is a martingale under P̂.
Proof. Define
\[
X_t^\varepsilon := \varepsilon\,\hat W_t + M_t.
\]
By Theorem 10.4.2, we see that, for all ε ∈ [0, 1], the process X^ε is a martingale under P̂. In particular, X⁰ = M is a martingale under P̂, as claimed. Next, note that
\[
X_t^1 - M_t = \hat W_t
\]
is a martingale under P̂ with [Ŵ, Ŵ]t = t. Thus, by Lévy's characterization of Brownian motion (Theorem 8.3.7), we conclude that Ŵ is a Brownian motion under P̂.
Theorem 10.4.4 (Girsanov Theorem for Lévy–Itô processes (part III)). On a probability space (Ω, F, P), equipped with a filtration F = (Ft)0≤t≤T, let X and Z = dP̂/dP be as in equations (10.19) and (10.21). Then X has the representation
\[
dX_t = \sigma_t\,d\hat W_t + \int_R \gamma_t(z)\,\hat N(dt,dz), \tag{10.24}
\]
where Ŵ and N̂ are defined in (10.22) and (10.23), and X is a martingale under P̂.
Proof. To show that X has the representation (10.24), we simply note that
\[
dX_t = \mu_t\,dt + \sigma_t\,dW_t + \int_R \gamma_t(z)\,\tilde N(dt,dz)
= \mu_t\,dt + \sigma_t\big(d\hat W_t - \beta_t\,dt\big) + \int_R \gamma_t(z)\Big(\hat N(dt,dz) + \big(e^{\eta_t(z)} - 1\big)\nu(dz)\,dt\Big)
\]
\[
= \sigma_t\,d\hat W_t + \int_R \gamma_t(z)\,\hat N(dt,dz),
\]
where, in the second line, we used (10.22) and (10.23), and in the last line we used (10.20). Since Ŵ is a Brownian motion and N̂ is a compensated Poisson random measure under P̂, the process X is a martingale under P̂.
Suppose now that X is a Lévy process under P with Lévy measure ν and we wish to find a measure P̂ under which X has Lévy measure ν̂(dz). From Theorem 10.4.2, equation (10.21), and the time-homogeneity of X, we see that Z must be of the form
\[
Z_t := \exp\Big( \int_0^t \alpha\,ds + \int_0^t\!\!\int_R \eta(z)\,\tilde N(ds,dz) \Big), \qquad \alpha = -\int_R \big(e^{\eta(z)} - 1 - \eta(z)\big)\nu(dz). \tag{10.25}
\]
Thus, we identify
\[
\eta(z) = \log \frac{\hat\nu(dz)}{\nu(dz)}.
\]
Note that, in order for η to be well-defined we must have 0 < ν̂(dz)/ν(dz) < ∞, which holds if and only if ν̂ ∼ ν (i.e., the measures are equivalent). Re-writing the dynamics of X as follows,
\[
dX_t = \mu\,dt + \int_R z\Big(\hat N(dt,dz) + \big(e^{\eta(z)} - 1\big)\nu(dz)\,dt\Big)
= \mu\,dt + \int_R z\,\hat N(dt,dz) + \int_R z\Big(\frac{\hat\nu(dz)}{\nu(dz)} - 1\Big)\nu(dz)\,dt
\]
\[
= \mu\,dt + \int_R z\,\hat N(dt,dz) + \int_R z\,\big(\hat\nu(dz) - \nu(dz)\big)\,dt
= \hat\mu\,dt + \int_R z\,\hat N(dt,dz), \qquad \hat\mu = \mu + \int_R z\,\big(\hat\nu(dz) - \nu(dz)\big),
\]
If ν = λF and ν̂ = λ̂F share a common jump distribution F, the change of intensity is determined by the drift:
\[
\hat\lambda - \lambda = \mu\cdot\Big(\int_R z\,F(dz)\Big)^{-1}.
\]
Note that, had we included a Brownian component in the P dynamics of X, we would have complete
freedom to change the jump measure since any non-zero drift could be absorbed into the drift of the
Brownian motion under the change of measure.
Note that the probability of a jump, P(dPt = 1) = E dPt, in the instant dt is entirely independent of the history Ft of P. This memorylessness property is convenient from a computational point of view and is necessary in order for P to be Markov. But, from a modeling perspective, limiting our analysis to Poisson processes (or, more generally, Poisson random measures) is very restrictive. For example, if we want to model the arrival of earthquakes in Seattle, it is well-known that the occurrence of an earthquake today will increase the probability of an earthquake tomorrow. Thus, it would be useful to have some way of modeling events whose arrival intensities are history-dependent. This is precisely what a Hawkes process will allow us to do.
A (one-dimensional) Hawkes process N = (Nt)t≥0 is a counting process whose intensity λt satisfies
\[
E[dN_t \,|\, F_t] = \lambda_t\,dt, \qquad \lambda_t := \Lambda\Big( \int_0^{t-} h(t-s)\,dN_s \Big), \tag{10.27}
\]
where h(·) : R+ → R+ is the kernel (sometimes also called an exciting function) and Λ(·) : R+ → R+ is locally integrable. The intensity λt is an Ft-measurable random variable, where Ft is the natural filtration for N. When the function Λ(·) is linear, the process N is known as a linear Hawkes process; otherwise, it is called a nonlinear Hawkes process.
If we denote by τ1, τ2, ... the (random) times that N jumps, then the intensity at time t is given by
\[
\lambda_t = \Lambda\Big( \sum_{\tau_i < t} h(t - \tau_i) \Big).
\]
Typically, h is a strictly decreasing function and Λ is increasing. Under these assumptions, the intensity
λt is decreasing over the intervals of the form [τi , τi +1 ).
For reasons of analytic tractability, in what follows we concentrate on linear Hawkes processes. Specifically, let us assume
\[
\Lambda(z) = \nu + z,
\]
where we have introduced the compensated Hawkes process, whose differential is given by
\[
d\tilde N_s = dN_s - \lambda_s\,ds.
\]
\[
E[d\tilde N_t \,|\, F_t] = E[dN_t \,|\, F_t] - \lambda_t\,dt = 0,
\]
Hence, we find that the mean λ̄ of the stationary distribution of the intensity is given by
\[
\bar\lambda = \nu\cdot\Big( 1 - \int_0^\infty h(s)\,ds \Big)^{-1}.
\]
Clearly, a stationary distribution only exists if h satisfies
\[
\int_0^\infty h(s)\,ds < 1. \tag{10.29}
\]
Thus, some texts require that h satisfy (10.29).
In general, a given Hawkes process N = (Nt)t≥0 and the associated intensity process λ = (λt)t≥0 are non-Markovian. However, when the kernel h(·) has an exponential form,
\[
h(t) = a\,e^{-bt}, \qquad a, b > 0, \tag{10.30}
\]
then the intensity process λ by itself and the pair (λ, N) are Markov. To see this, observe from (10.27) and (10.30) that
\[
\lambda_t = \nu + \int_0^{t-} a\,e^{-b(t-s)}\,dN_s.
\]
Taking the differential of λt we find
\[
d\lambda_t = -b\Big( \int_0^{t-} a\,e^{-b(t-s)}\,dN_s \Big)dt + a\,dN_t = b(\nu - \lambda_t)\,dt + a\,dN_t.
\]
Since the intensity of N is λt, we see that the future dynamics of λ depend only on the value of λt and not the entire history Ft. Thus, the process λ is Markov and its generator is
\[
A = \big((a-b)\lambda + b\nu\big)\,\partial_\lambda + \lambda\,\big(\theta_a - 1 - a\,\partial_\lambda\big),
\]
where, we remind the reader, θa is the shift operator: θa f(λ) = f(λ + a). Since λ alone is Markov, it follows that the pair (λ, N) is Markov and has generator
\[
A = b(\nu - \lambda)\,\partial_\lambda + \lambda\,\big(\theta_{(a,1)} - 1\big), \tag{10.31}
\]
where θ(a,1) f(λ, n) = f(λ + a, n + 1).
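The exponential-kernel case is also easy to simulate. The following sketch uses Ogata-style thinning (an algorithmic choice of ours, not spelled out in the notes; parameter values are illustrative assumptions) together with the fact that λ decays deterministically as ν + (λ0 − ν)e^{−b∆t} between jumps, so the current intensity bounds the intensity until the next jump.

```python
import numpy as np

# Sketch: simulate a linear Hawkes process with kernel h(t) = a * exp(-b*t)
# and base rate nu by thinning a dominating Poisson clock.
rng = np.random.default_rng(1)
nu, a, b, T = 0.5, 0.8, 1.2, 50.0      # a/b < 1, so condition (10.29) holds

t, lam, jumps = 0.0, nu, []
while True:
    lam_bar = lam                       # upper bound: lambda decays between jumps
    t_cand = t + rng.exponential(1.0 / lam_bar)
    if t_cand > T:
        break
    lam = nu + (lam - nu) * np.exp(-b * (t_cand - t))   # decayed intensity
    t = t_cand
    if rng.uniform() * lam_bar <= lam:  # accept candidate with prob lam / lam_bar
        jumps.append(t)
        lam += a                        # self-excitation: each jump kicks lambda by a

print(len(jumps), "jumps; long-run mean rate approx", nu / (1 - a / b))
```

With these parameters ∫h = a/b = 2/3 < 1, and the observed jump count per unit time should be close to the stationary mean ν/(1 − a/b).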
We cannot find the probability mass function of NT. However, we can find its Laplace transform E e^{−ηNT}. Let us define
\[
u(t,\lambda,n) := E\big[\,e^{-\eta N_T}\,\big|\,\lambda_t = \lambda,\ N_t = n\,\big],
\]
and try an exponential-affine ansatz in λ. This must hold for all λ; thus we collect terms of like order in λ and obtain a pair of coupled ODEs for A and B.
\[
E\big[dN_t^{(i)} \,\big|\, F_t\big] = \lambda_t^{(i)}\,dt, \qquad \lambda_t^{(i)} := \Lambda^{(i)}\Big( \sum_{j=1}^d \int_0^{t-} h^{(i,j)}(t-s)\,dN_s^{(j)} \Big), \tag{10.34}
\]
where h^{(i,j)}(·) : R+ → R+ and Λ^{(i)}(·) : R+ → R+ is locally integrable. The intensity λ_t^{(i)} is an Ft-measurable random variable, where Ft is the natural filtration for the d-dimensional process (N^{(i)})_{i=1}^d. When the function Λ^{(i)}(·) is linear for all i, the process N is known as a linear d-dimensional Hawkes process; otherwise, it is called a nonlinear d-dimensional Hawkes process.
As in the one-dimensional case, linear d-dimensional Hawkes processes with exponential kernels are
Markov and admit many analytically tractable results.
10.6 Exercises
Exercise 10.1. Let P = (Pt )t ≥0 be a Poisson process with intensity λ.
(a) What is the Lévy measure ν of P?
(b) Let dXt = dPt . Define u(t , x ) := E[ϕ(XT )|Xt = x ]. Find u(t , x ) and verify that it solves the
Kolmogorov Backward equation.
where b ∈ R, σ, γ ∈ R+, Ñ is a compensated Poisson random measure with Lévy measure ν and M is a Poisson random measure with Lévy measure µ. Assume that N and M are independent and ν(R) < ∞ and µ(R+) < ∞. As the process Y is strictly increasing, we can construct a process Z = (Zt)t≥0 as follows:
\[
Z_t := X_{Y_t}.
\]
(a) Show that Z is a Lévy process. For this problem, you may assume that Z is continuous in probability.
(b) Find ψZ where E e^{iξZt} = e^{tψZ(ξ)}. You may leave your answer as a composition of functions if you find it helpful.
(c) Suppose ν ≡ 0 (i.e., the X process experiences no jumps). Find the drift A, volatility Σ and Lévy measure Π(dz) corresponding to Z. Write the Lévy–Itô decomposition of ZT in terms of a Brownian motion B and a Poisson random measure P whose Lévy measure is Π.
Exercise 10.4. Let η = (ηt)t≥0 be a one-dimensional Lévy process and define X = (Xt)t≥0 as a Lévy–Itô process driven by η, where N is the associated Poisson random measure on R. Using the Lévy–Itô formula, derive the infinitesimal generator A(t) of the process X.
Chapter 11
Stochastic Control
Notes from this chapter are taken primarily from (Björk, 2004, Chapter 19) and (van Handel, 2007, Chapter 6).
Consider a controlled stochastic differential equation of the form
\[
dX_t^u = \mu(t, X_t^u, u_t)\,dt + \sigma(t, X_t^u, u_t)\,dW_t, \qquad
\mu : R_+\times R^d\times U \to R^d, \qquad \sigma : R_+\times R^d\times U \to R^{d\times n}.
\]
Here, X^u is called the state process and u = (ut)t≥0 is the control. The control must live in some control set U. Frequently, we take the control set U = R^k, but this is not required. The superscript u on X^u indicates the dependence of the state process X^u on the control u (clearly, if you change u, you change X^u).
Within the class of admissible controls is a subset of controls that we will find very useful:
Definition 11.1.2. We call an admissible control u a feedback or Markov control if it is of the form:
ut = α(t , Xt ) for some function α : R+ × Rd → U.
Note that Markov controls are a strict subset of admissible controls, since a Markov control at time t is only allowed to depend on (t, Xt) whereas, in general, an admissible control at time t may depend on the entire history Ft. Nevertheless, Markov controls are important because they are easy to implement, and (as we shall see) they can actually be computed! Moreover, it often turns out that the optimal Markov control coincides with the optimal admissible control.
Since the solution of an SDE is a Markov process, a Markov control yields a Markov state process Xu .
Below, we introduce some cost or gain functionals J(·) that assign to each admissible control strategy u a cost or gain J(u). The goal of optimal control is to find an optimal strategy u* (obviously!), which minimizes or maximizes this cost functional. Three common types of cost functionals are:
\[
\text{Finite time:}\quad J(u) = E\Big[ \int_0^T F(t, X_t^u, u_t)\,dt + \varphi(X_T^u) \Big],
\]
\[
\text{Indefinite time:}\quad J(u) = E\Big[ \int_0^\tau F(X_t^u, u_t)\,dt + \varphi(X_\tau^u) \Big], \qquad \tau = \inf\{t \geq 0 : X_t^u \notin A\},
\]
\[
\text{Infinite time:}\quad J(u) = E\Big[ \int_0^\infty e^{-\rho t}\, F(X_t^u, u_t)\,dt \Big].
\]
Note, for the indefinite time and infinite time functionals, the dynamics of X^u should be time-homogeneous (i.e., the coefficient functions µ and σ should not depend on t). Of course, one might construct other cost functionals, but the above three are by far the most common. As stated above, the optimal control (if it exists) is the strategy u* that minimizes or maximizes a given cost or gain functional J(u). Thus, we make the following definition:
Definition 11.1.3. For a given cost functional J : U → R, we define the optimal control u*, if it exists, by
\[
u^* := \underset{u}{\mathrm{argmax}}\ J(u) \qquad (\text{or } \underset{u}{\mathrm{argmin}}\ J(u), \text{ for a minimization problem}).
\]
Assumption 11.2.1. We assume that there exists an admissible Markov strategy u*, which is optimal.
Clearly, this assumption is not always justified. Nevertheless, we will go with it for the time being. We define the reward-to-go function J^u (or cost-to-go in the case of minimization) as
\[
J^u(t, X_t^u) := E\Big[ \int_t^T F(s, X_s^u, u_s)\,ds + \varphi(X_T^u) \,\Big|\, F_t \Big]
= E\Big[ \int_t^T F(s, X_s^u, u_s)\,ds + \varphi(X_T^u) \,\Big|\, X_t^u \Big],
\]
where we have used the Markov property of X^u to replace Ft with the σ-algebra generated by X_t^u. We also define the value function V as
\[
V(t,x) := J^{u^*}(t,x) = \max_{u\in U} J^u(t,x). \tag{11.3}
\]
The idea of the Dynamic Programming Principle (DPP) is to split the optimization problem into two intervals [t, t+δ) and [t+δ, T], where δ is small and positive. Note that, for any δ ≥ 0 we have
\[
J^u(t,x) = E\Big[ \int_t^{t+\delta} F(s,X_s^u,u_s)\,ds + E\Big[ \int_{t+\delta}^T F(s,X_s^u,u_s)\,ds + \varphi(X_T^u) \,\Big|\, F_{t+\delta} \Big] \,\Big|\, X_t^u = x \Big]
\]
\[
= E\Big[ \int_t^{t+\delta} F(s,X_s^u,u_s)\,ds + J^u(t+\delta, X_{t+\delta}^u) \,\Big|\, X_t^u = x \Big].
\]
In words, the control u may be sub-optimal over the interval [t, t+δ), and it is optimal over the interval [t+δ, T]. Clearly, we have
\[
V(t,x) \geq E\Big[ \int_t^{t+\delta} F(s,X_s^u,u_s)\,ds + V(t+\delta, X_{t+\delta}^u) \,\Big|\, X_t^u = x \Big]. \tag{11.4}
\]
The inequality arises from the fact that the strategy u is not necessarily optimal over the interval [t, t+δ). If we had u = u*, then we would have obtained an equality. Now observe that
\[
V(t+\delta, X_{t+\delta}^u) = V(t, X_t^u) + \int_t^{t+\delta} (\partial_s + A^u(s))\,V(s, X_s^u)\,ds + \text{martingale part},
\]
Finally, we divide by δ and take the limit as δ → 0. Assuming there is no problem with passing the limit through the expectation, we have
\[
0 \geq \lim_{\delta\to 0}\frac{1}{\delta}\, E\Big[ \int_t^{t+\delta}\Big( F(s,X_s^u,u_s) + (\partial_s + A^u(s))V(s,X_s^u) \Big)ds \,\Big|\, X_t^u = x \Big]
\]
\[
= E\Big[ \lim_{\delta\to 0}\frac{1}{\delta}\int_t^{t+\delta}\Big( F(s,X_s^u,u_s) + (\partial_s + A^u(s))V(s,X_s^u) \Big)ds \,\Big|\, X_t^u = x \Big]
= F(t,x,u) + (\partial_t + A^u(t))\,V(t,x).
\]
Once again, if u is optimal, we obtain an equality. Thus, we have arrived at the Hamilton–Jacobi–Bellman (HJB) PDE
\[
0 = \partial_t V(t,x) + \sup_{u\in U}\big( F(t,x,u) + A^u(t)\,V(t,x) \big), \qquad V(T,x) = \varphi(x). \tag{11.6}
\]
Theorem 11.2.2. If (i) an optimal control u* exists and is Markov, (ii) the value function V, defined in (11.3), satisfies V ∈ C1,2, and (iii) the limiting procedures performed above are allowed, then V satisfies the HJB PDE (11.6) and the optimal control u* is given by
\[
u_t^* = \alpha(t, X_t^{u^*}), \qquad \text{where} \quad \alpha(t,x) := \underset{u\in U}{\mathrm{argmax}}\,\big( F(t,x,u) + A^u V(t,x) \big). \tag{11.7}
\]
Theorem 11.2.2 is so incredibly unsatisfactory! It only tells us that, if an optimal control exists and is Markov, then (modulo some technical conditions) the value function V satisfies the HJB PDE and the optimal control is given by (11.7). But this is not what we want. What we would like is to solve the HJB PDE (11.6) and conclude that the solution actually is the value function and that u*, given by (11.7), is the optimal control. The following theorem gives us such a result.
Theorem 11.2.3 (Verification Theorem). Suppose H : [0, T] × R^d → R solves the HJB PDE (11.6). Define
\[
g(t,x) := \underset{u\in U}{\mathrm{argmax}}\,\big( F(t,x,u) + A^u H(t,x) \big).
\]
Suppose that Mt := H(t, X_t^u) − ∫_0^t (∂s + A^u(s))H(s, X_s^u)ds is a true martingale (not just a local martingale), where ut = g(t, X_t^u). Then the value function V and optimal control u* are given by
\[
V(t,x) = H(t,x), \qquad u_t^* = g(t, X_t^{u^*}).
\]
The Verification Theorem tells us that, if we solve the HJB PDE, and the solution satisfies some regularity and integrability conditions, then the solution is the value function and the associated Markov control is the optimal control. We will not prove the Verification Theorem.
Example 11.3.1 (Merton problem). The Merton problem, due to Nobel Prize winner Robert Merton, is a classical problem in mathematical finance whereby an investor seeks to optimize his expected utility at a fixed future date T by investing in a stock. Specifically, suppose a stock S = (St)0≤t≤T follows a geometric Brownian motion
\[
dS_t = \mu S_t\,dt + \sigma S_t\,dW_t.
\]
Let X^u = (X_t^u)0≤t≤T be the wealth process of an investor who invests ut dollars in S at time t and keeps the rest of his money in a bank account (which we assume pays no interest). The dynamics of X^u are
given by
\[
dX_t^u = \frac{u_t}{S_t}\,dS_t = \mu u_t\,dt + \sigma u_t\,dW_t.
\]
The investor seeks to maximize his expected utility of terminal wealth:
\[
\text{maximize } J(u) = E\,U(X_T^u), \tag{11.8}
\]
where we take the power utility U(x) = x^{1−γ}/(1−γ) with risk-aversion parameter γ > 0, γ ≠ 1. The function U is called the investor's utility function; it is intended to map the investor's wealth to his happiness. Since "more money" = "more happy," the function U is strictly increasing. It is also concave since an additional dollar means less to somebody with a wealth of 1 million dollars than it does to somebody with a wealth of 10 dollars. For the investor's control problem (11.8), we identify
\[
F^u = 0, \qquad A^u = u\mu\,\partial_x + \tfrac12 u^2\sigma^2\,\partial_x^2.
\]
In order to maximize (F^u + A^u V), we simply differentiate this quantity, set it equal to zero, and solve for u. We have
\[
0 = \partial_u (F^u + A^u V)\big|_{u^*} = \mu\,\partial_x V + u\sigma^2\,\partial_x^2 V\big|_{u^*} \qquad\Rightarrow\qquad u^* = \frac{-\mu\,\partial_x V}{\sigma^2\,\partial_x^2 V}.
\]
Inserting u* into the HJB PDE yields
\[
0 = \partial_t V - \tfrac12\lambda^2\,\frac{(\partial_x V)^2}{\partial_x^2 V}, \qquad \lambda := \frac{\mu}{\sigma}, \tag{11.9}
\]
where we have introduced the Sharpe ratio λ. Of course, if you have never seen this horrible-looking non-linear PDE before, you would never know how to solve it. However, if you have spent some time around optimal investment problems, you would know that the correct thing to do at this point is to guess that V is of the form
\[
V(t,x) = U(x)\,f(t), \qquad f(T) = 1. \tag{11.10}
\]
Such a guess clearly satisfies the terminal condition V(T, x) = U(x). If we insert the guess (11.10) into (11.9) we find
\[
0 = \Big( f'(t) + \tfrac12\lambda^2\,\tfrac{1-\gamma}{\gamma}\,f(t) \Big) U(x) \qquad\Rightarrow\qquad f(t) = \exp\Big( \tfrac12\lambda^2\,\tfrac{1-\gamma}{\gamma}\,(T-t) \Big).
\]
Thus, we have obtained the value function V(t, x) = U(x)f(t). The optimal Markov control is given by
\[
u^*(t,x) = -\frac{\mu}{\sigma^2}\,\frac{\partial_x V(t,x)}{\partial_x^2 V(t,x)} = \frac{\mu x}{\gamma\sigma^2}.
\]
Thus, the total proportion of wealth the investor should keep in the stock is constant: u_t^*/X_t^{u^*} = µ/(γσ²).
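The constant Merton proportion is easy to sanity-check by simulation. The sketch below (not from the notes; parameter values and the restriction to constant-proportion strategies are illustrative assumptions) compares E U(X_T^u) across a few constant proportions π; the maximum should sit near π* = µ/(γσ²).

```python
import numpy as np

# Sketch: under u_t = pi * X_t, the wealth is a GBM, so
# X_T = x0 * exp((pi*mu - 0.5*pi^2*sig^2)*T + pi*sig*W_T).
rng = np.random.default_rng(2)
mu, sig, gam, T, x0, n = 0.08, 0.2, 2.0, 1.0, 1.0, 200_000

W_T = np.sqrt(T) * rng.standard_normal(n)
for pi in [0.5, 1.0, 1.5, 2.0, 2.5]:
    X_T = x0 * np.exp((pi * mu - 0.5 * pi**2 * sig**2) * T + pi * sig * W_T)
    print(f"pi = {pi:4.2f}: E U(X_T) = {np.mean(X_T**(1 - gam) / (1 - gam)):.6f}")

print("Merton ratio mu/(gam*sig^2) =", mu / (gam * sig**2))
```

With these parameters the Merton ratio is 1.0, and the estimated expected utility is indeed largest near π = 1.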
Example 11.3.2 (Linear quadratic regulator). Suppose the dynamics of a controlled process X^u = (X_t^u)0≤t≤T are given by
\[
dX_t^u = \big( aX_t^u + b\,u_t \big)dt + c\,dW_t,
\]
and consider the cost functional
\[
J(u) = E\Big[ \int_0^T \big( q\,(X_t^u)^2 + r\,u_t^2 \big)\,dt + h\,(X_T^u)^2 \Big], \qquad q, r, h \geq 0. \tag{11.11}
\]
One could imagine, for example, that X^u represents the position of a particle that we are attempting to keep near the origin. We can adjust the drift of the particle through the control u, but this has a cost, which is quadratic in u. Likewise, there is a quadratic cost for allowing X^u to be away from the origin. This control problem is known as the Linear Quadratic Regulator (LQR) because the dynamics of X^u are linear in the state process X^u and control process u, and the costs are quadratic in X^u and u. The HJB equation associated with cost functional (11.11) is
\[
0 = V_t + \min_{u}\Big( (ax + bu)\,V_x + \tfrac12 c^2 V_{xx} + qx^2 + ru^2 \Big), \qquad V(T,x) = hx^2.
\]
Step one in solving the HJB PDE is finding the optimal control u* in feedback form. To this end, we have
\[
0 = \partial_u\big( A^u V + F^u \big)\Big|_{u=u^*} = b\,V_x + 2r\,u^* \qquad\Rightarrow\qquad u^* = \frac{-b\,V_x}{2r}.
\]
The HJB PDE thus becomes
\[
0 = V_t + A^{u^*}V + F^{u^*}
= V_t + \Big( ax + b\,\frac{-b V_x}{2r} \Big) V_x + \tfrac12 c^2 V_{xx} + qx^2 + r\Big( \frac{-b V_x}{2r} \Big)^2
= V_t + ax\,V_x + \tfrac12 c^2 V_{xx} + qx^2 - \frac{b^2 V_x^2}{4r}. \tag{11.12}
\]
As with all non-linear PDEs with explicit solutions, the method to solve (11.12) is to guess. In this case, the correct guess is
\[
V(t,x) = x^2\,P(t) + Q(t),
\]
where the terminal conditions for P and Q will ensure that V(T, x) = hx². We have
\[
O(x^0):\quad 0 = Q' + c^2 P, \qquad Q(T) = 0,
\]
\[
O(x^2):\quad 0 = P' + 2aP + q - (b^2/r)\,P^2, \qquad P(T) = h.
\]
Thus, we have obtained a coupled system of ODEs for (P, Q). The O(x²) equation is a Riccati equation, which can be solved analytically (though the solution is a tad messy and not worth writing down here). Once one has obtained an expression for the function P, one can obtain Q as an integral:
\[
Q(t) = \int_t^T c^2\,P(s)\,ds.
\]
Finally, the optimal control is given by
\[
u_t^* = u^*(t, X_t^{u^*}) = \frac{-b}{2r}\,V_x(t, X_t^{u^*}) = \frac{-b}{r}\,P(t)\,X_t^{u^*}.
\]
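When a closed form for the Riccati equation is not wanted, the coupled ODEs can simply be integrated backward from the terminal conditions. A sketch (parameter values are illustrative assumptions):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sketch: integrate P' = -2aP - q + (b^2/r)P^2, P(T) = h and Q' = -c^2 P,
# Q(T) = 0 backward in time; then V(t,x) = x^2 P(t) + Q(t).
a, b, c, q, r, h, T = -0.5, 1.0, 0.3, 1.0, 0.5, 2.0, 1.0

def rhs(t, y):
    P, Q = y
    return [-2 * a * P - q + (b**2 / r) * P**2, -c**2 * P]

sol = solve_ivp(rhs, [T, 0.0], [h, 0.0], dense_output=True, rtol=1e-9)
P0, Q0 = sol.sol(0.0)
print("P(0), Q(0) =", P0, Q0)
print("feedback gain at t=0: u* = -(b/r) P(0) x, coefficient =", -(b / r) * P0)
```

Integrating backward (t_span = [T, 0]) keeps the terminal conditions where the problem specifies them, which is the natural way to handle HJB-type ODE systems numerically.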
In what follows, we state without proof the HJB PDEs associated with the indefinite time and infinite time cost functionals J(u), which were introduced in Section 11.1.
\[
0 = \sup_{u\in U}\big( A^u V + F^u \big), \qquad x \in A,
\]
\[
V(x) = \varphi(x), \qquad x \in \partial A,
\]
where the operator A^u is the generator of the process X^u defined in (11.14) with ut = u fixed. In the one-dimensional case, we have
\[
A^u = \mu(x,u)\,\partial_x + \tfrac12\,\sigma^2(x,u)\,\partial_x^2.
\]
We leave the formal proof of the above HJB equations as an exercise for the reader.
11.5 Exercises
Exercise 11.1. Derive the HJB equations in Section 11.4.
Chapter 12
Optimal Stopping
Notes from this chapter closely follow (van Handel, 2007, Chapter 9).
We have previously defined the notion of a Markov process. For optimal stopping, we will need to
consider a subset of these processes called strong Markov processes.
Definition 12.1.1. Let (Ω, F, P) be a probability space, let T be a fixed positive number, and let F = (Ft)t≥0 be a filtration of sub-σ-algebras of F. Consider an F-adapted stochastic process X = (Xt)t≥0. We say X is a strong Markov process, or simply "X is strong Markov," if, for any F-stopping time τ and t ≥ 0, we have
\[
E\big[ f(X_{\tau+t}) \,\big|\, F_\tau \big] = E\big[ f(X_{\tau+t}) \,\big|\, X_\tau \big]
\]
for every bounded measurable function f.
Proposition 12.1.2. Suppose that on a probability space (Ω, F, P) equipped with a filtration F = (Ft)t≥0 the process X is the unique strong solution of the time-homogeneous SDE
\[
dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dW_t, \qquad \mu : R^d \to R^d, \qquad \sigma : R^d \to R^{d\times m}.
\]
Then X is a strong Markov process.
Although we have specified a time-homogeneous diffusion X, our framework is general enough to incorporate time-inhomogeneous diffusions by considering the augmented R^{1+d}-dimensional process Y = (Ys)s≥0 given by Yt = (t, Xt).
For an F-stopping time τ, we introduce the cost functional (again, some authors prefer the phrase gain functional if the goal is to maximize)
\[
J(\tau) = E\Big[ \int_0^\tau e^{-\lambda t}\,F(X_t)\,dt + e^{-\lambda\tau}\,\varphi(X_\tau) \Big], \tag{12.2}
\]
where F, ϕ : R^d → R and λ ≥ 0. Expression (12.2) is not the only possible choice for a cost functional. However, (12.2) is general enough to encompass many optimal stopping problems of practical interest.
Definition 12.2.1. We call an F-stopping time τ admissible if (12.2) is well-defined.
We will denote by T the set of admissible stopping times. The goal of optimal stopping is to find the admissible stopping time τ* that maximizes or minimizes J(τ). Thus, we make the following definition:
Definition 12.2.2. For a given cost functional J : T → R (not necessarily (12.2)), we define the optimal stopping time τ*, if it exists, by
\[
\tau^* := \underset{\tau\in T}{\mathrm{argmax}}\ J(\tau) \qquad (\text{or } \underset{\tau\in T}{\mathrm{argmin}}\ J(\tau), \text{ for a minimization problem}).
\]
A natural class of stopping times are first exit times from an open set D ⊆ R^d:
\[
\tau = \inf\{t \geq 0 : X_t \notin D\}. \tag{12.3}
\]
Assumption 12.3.1. We assume there exists an optimal stopping time τ ∗ which is a first exit time. We
denote by D the continuation region of τ ∗ . We also assume that X is a strong Markov process.
A priori, there is no reason that the optimal stopping time τ ∗ must be an exit time. But, we will go with
it for the time being. To begin, we define the reward-to-go functional (or, cost-to-go, if the goal is to
minimize)
Z τ
Jτ (x ) e–λt F(Xt )dt
:= E + ϕ(Xτ )X0 =x .
0
Observe, Jτ(x) is the cost of the stopping rule τ when X0 = x. We define the value function as the optimal cost-to-go
\[
V(x) := J_{\tau^*}(x) = \max_{\tau\in T} J_\tau(x).
\]
We will attempt to find an equation for V(x). To this end, let τ be any admissible stopping time and define
\[
\tau' := \inf\{t \geq \tau : X_t \notin D\},
\]
where D is the continuation region of τ*. Observe that τ′ ≥ τ by construction and τ′ = τ* after time τ.
Thus, we have
\[
V(X_0) = J_{\tau^*}(X_0) \geq J_{\tau'}(X_0)
= E\Big[ \int_0^{\tau'} e^{-\lambda t}F(X_t)\,dt + e^{-\lambda\tau'}\varphi(X_{\tau'}) \,\Big|\, X_0 \Big]
\]
\[
= E\Big[ \int_0^{\tau} e^{-\lambda t}F(X_t)\,dt + E\Big[ \int_\tau^{\tau'} e^{-\lambda t}F(X_t)\,dt + e^{-\lambda\tau'}\varphi(X_{\tau'}) \,\Big|\, X_\tau \Big] \,\Big|\, X_0 \Big]
\]
\[
= E\Big[ \int_0^{\tau} e^{-\lambda t}F(X_t)\,dt + e^{-\lambda\tau}\,E\Big[ \int_\tau^{\tau'} e^{-\lambda(t-\tau)}F(X_t)\,dt + e^{-\lambda(\tau'-\tau)}\varphi(X_{\tau'}) \,\Big|\, X_\tau \Big] \,\Big|\, X_0 \Big]
\]
\[
= E\Big[ \int_0^{\tau} e^{-\lambda t}F(X_t)\,dt + e^{-\lambda\tau}\,V(X_\tau) \,\Big|\, X_0 \Big]. \tag{12.4}
\]
Here, we are using the fact that X is a strong Markov process. Now, supposing that V is sufficiently smooth to apply Itô's Lemma, we have
\[
e^{-\lambda\tau}V(X_\tau) = V(X_0) + \int_0^\tau e^{-\lambda t}\,(A-\lambda)V(X_t)\,dt + \text{local martingale}, \tag{12.5}
\]
where A is the infinitesimal generator of the X process. If the local martingale above is a true martingale, then inserting (12.5) into (12.4) we obtain
\[
0 \geq E\Big[ \int_0^\tau e^{-\lambda t}\big( F(X_t) + (A-\lambda)V(X_t) \big)dt \,\Big|\, X_0 \Big] \qquad (\text{with equality if } \tau' = \tau^*). \tag{12.6}
\]
Case: τ ≤ τ*. In this case, we have by construction that τ′ = τ*, and the inequality (12.6) becomes an equality:
\[
0 = E\Big[ \int_0^\tau e^{-\lambda t}\big( F(X_t) + (A-\lambda)V(X_t) \big)dt \,\Big|\, X_0 \Big].
\]
If τ* = 0, then we clearly have V(x) = ϕ(x). Noting that τ* > 0 if and only if X0 = x ∈ D, we obtain
\[
0 = F(x) + (A-\lambda)V(x), \quad x \in D, \qquad V(x) = \varphi(x), \quad x \notin D. \tag{12.7}
\]
Now, we consider the general case.
In the case of τ = 0 we have J(τ ∗ ) ≥ J(0) and hence V(x ) ≥ ϕ(x ). Putting everything together we obtain
The PDE (12.8) is called a variational inequality. Note that if V is the solution to (12.8) we can
reconstruct the continuation region D as follows
Note that the formal derivation of the variational inequality (12.8) assumes that the optimal stopping
rule τ ∗ is a first exit time. What we would like is to simply find V(x ) by solving (12.8) and then conclude
that the optimal stopping rule τ ∗ is a first exit time with continuation region D given by (12.9). This is
the subject of the following theorem.
Theorem 12.3.2 (Verification). Fix a probability space (Ω, F, P) and a filtration F = (Ft)t≥0, and let X = (Xt)t≥0 be the d-dimensional time-homogeneous diffusion given by (12.1). Let J(τ) be the cost functional defined in (12.2). Suppose there is a function V : R^d → R that is sufficiently smooth to apply Itô's Lemma and satisfies the variational inequality (12.8). Let D be given by (12.9). Denote by T the class of admissible stopping times (see Definition 12.2.1) that satisfy
\[
E\Big[ \sum_{i=1}^d \sum_{j=1}^m \int_0^\tau e^{-\lambda t}\,\partial_{x_i}V(X_t)\cdot\sigma^{(ij)}(X_t)\,dW_t^{(j)} \Big] = 0. \tag{12.10}
\]
Define τ* := inf{t ≥ 0 : Xt ∉ D} and suppose τ* ∈ T. Then τ* is optimal: J(τ*) ≥ J(τ) for all τ ∈ T. Moreover, the optimal expected cost-to-go satisfies J_{τ*}(x) = V(x).
We will not prove Theorem 12.3.2. Rather, we will simply comment on the theorem's usefulness. What Theorem 12.3.2 tells us is that, if we find a function V that satisfies the variational inequality (12.8), define τ* = inf{t ≥ 0 : Xt ∉ D} where D is given by (12.9), and check that V is sufficiently smooth and satisfies (12.10) with τ = τ*, then we can conclude that τ* is the optimal stopping rule.
Remark 12.3.3 (Principle of smooth fit). There will often be more than one solution to the
variational inequality (12.8). However, only one of these solutions will be sufficiently smooth to apply
Itô’s Lemma, namely, the solution for which both V and its gradient ∇V are continuous on the boundary
∂D. Thus, when looking for solutions of the variational inequality, we often impose that the solution V
of (12.8) and its gradient ∇V be continuous on the boundary ∂D. We call this requirement the principle
of smooth fit.
Suppose a company extracts a natural resource whose remaining reserves R = (Rt)t≥0 are depleted at rate γ:
\[
dR_t = -\gamma R_t\,dt.
\]
Suppose the market price P = (Pt)t≥0 of the resource follows a geometric Brownian motion
\[
dP_t = b\,P_t\,dt + a\,P_t\,dW_t,
\]
where W = (Wt)t≥0 is a Brownian motion. The company incurs a fixed cost c per unit time to extract R from the earth. Thus, if the company operates from time t = 0 to time t = τ, it generates a profit of
\[
\text{Profit} = \int_0^\tau (\gamma R_t P_t - c)\,dt.
\]
Let us define X = (Xt)t≥0 by Xt = RtPt. One easily deduces that the dynamics of X are given by
\[
dX_t = (b-\gamma)\,X_t\,dt + a\,X_t\,dW_t.
\]
Suppose the company wishes to choose a stopping rule τ that maximizes its expected profit. The cost functional J(τ) is given by
\[
J(\tau) = E\int_0^\tau (\gamma X_t - c)\,dt. \tag{12.11}
\]
Comparing with (12.2), we identify
\[
F(x) = \gamma x - c, \qquad \varphi(x) = 0, \qquad \lambda = 0,
\]
and the generator of X is
\[
A = (b-\gamma)\,x\,\partial_x + \tfrac12 a^2 x^2\,\partial_x^2.
\]
The variational inequality (12.8) implies
\[
x \in D^c \ \Rightarrow\ V(x) = 0 \ \Rightarrow\ \gamma x - c + (b-\gamma)x\,V'(x) + \tfrac12 a^2x^2\,V''(x) = \gamma x - c \leq 0 \ \Rightarrow\ x \leq c/\gamma,
\]
and hence
\[
D \supseteq (c/\gamma, \infty). \tag{12.14}
\]
In the continuation region D, the value function must satisfy 0 = γx − c + (b−γ)xV′(x) + ½a²x²V″(x). Discarding the super-linearly growing homogeneous solution, we are led to consider
\[
V_d(x) = \frac{\gamma\,x}{\gamma - b} + \frac{c}{b - \gamma - a^2/2}\,\log x + d,
\]
where d ∈ R is a constant to be determined. One can easily verify by direct substitution that Vd solves the ODE above. In order to have Vd(x) > 0 for x ∈ D, we must have b < γ. Intuitively, it makes sense that we must have b < γ. If we had b ≥ γ, then the growth of X would be non-negative, and it would make sense to extract resources forever (i.e., τ* = ∞). We will assume that b < γ, which leads to an optimal stopping rule τ* that is finite.
We now conjecture that the continuation region is of the form
\[
D = (x^*, \infty), \qquad x^* \leq c/\gamma,
\]
where the restriction x* ≤ c/γ comes from (12.14). In order to find x* and d we shall use the principle of smooth fit (see Remark 12.3.3). Namely, we shall require that V and V′ be continuous at the boundary of D. We have
\[
V_d(x^*) = 0, \qquad V_d'(x^*) = 0. \tag{12.15}
\]
We can use the two equations in (12.15) to solve for d and x*. We find
\[
x^* = \frac{c}{\gamma}\cdot\frac{b-\gamma}{b-\gamma-a^2/2}, \qquad
d = \frac{\gamma\,x^*}{b-\gamma} - \frac{c\,\log x^*}{b-\gamma-a^2/2} =: d^*.
\]
Thus, we have obtained a formal solution for the value function V and optimal stopping rule τ*:
\[
V(x) = \begin{cases} V_{d^*}(x) & x > x^*, \\ 0 & x \leq x^*, \end{cases} \qquad
\tau^* = \inf\{t \geq 0 : X_t \leq x^*\}.
\]
We leave it as an exercise for the reader to check that the value function V satisfies the conditions of
Theorem 12.3.2 (and therefore, V is the value function and τ ∗ is the optimal stopping rule).
12.5 Exercises
To Do.
Chapter 13
Monte Carlo methods
Suppose we wish to compute the mean µ := EX of a random variable X whose distribution FX is not known in a form that permits exact computation. Suppose, however, that we have the ability to generate a sequence of iid random variables (Xi)i∈N where Xi ∼ FX. Let us define
\[
\hat\mu_n := \frac{1}{n}\sum_{i=1}^n X_i.
\]
By the Strong Law of Large Numbers (SLLN), we have
\[
\hat\mu_n \to \mu, \qquad n \to \infty.
\]
Thus, it seems reasonable to take µ̂n as an estimate for µ. A natural question to ask, then, is: for a given n, how good is our estimate µ̂n of µ? To answer this question, let us define σ² := VX. We compute
\[
V\hat\mu_n = V\Big( \sum_{i=1}^n X_i/n \Big) = \frac{1}{n^2}\sum_{i=1}^n VX_i = \frac{1}{n^2}\,n\,VX = \frac{1}{n}\,\sigma^2.
\]
So the variance of µ̂n scales like 1/n and the standard deviation √(Vµ̂n) scales like 1/√n. From the CLT, we know that µ̂n is asymptotically normal:
\[
\hat\mu_n \overset{D}{\approx} Z_n \sim N(\mu, \sigma^2/n).
\]
One quick side note: without an expression for FX or φX we cannot know σ². However, we can estimate σ² with the sample variance
\[
\hat\sigma_n^2 := \frac{1}{n-1}\sum_{i=1}^n (X_i - \hat\mu_n)^2.
\]
When σ² is unknown and n is large, we can construct the c-C.I. by replacing σ with σ̂n.
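The basic estimator and its confidence interval fit in a few lines. The sketch below (not from the notes; the Exp(1) test distribution is an illustrative assumption, chosen because its true mean is known) builds µ̂n together with an approximate 95% confidence interval µ̂n ± 1.96 σ̂n/√n.

```python
import numpy as np

# Sketch: plain Monte Carlo estimate of mu = E X with a CLT-based 95% C.I.
rng = np.random.default_rng(3)
n = 100_000
x = rng.exponential(1.0, size=n)      # X ~ Exp(1), so the true mean is 1

mu_hat = x.mean()
sig_hat = x.std(ddof=1)               # sample std, the 1/(n-1) version
half_width = 1.96 * sig_hat / np.sqrt(n)
print(f"mu_hat = {mu_hat:.4f} +/- {half_width:.4f}  (true mean = 1)")
```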
An alternative method of estimating µ can be described as follows. Suppose we can generate a sequence of iid random variables (Yi) and we know that EYi = µ. Then the random variable
\[
\hat\nu_n := \frac{1}{n}\sum_{i=1}^n Y_i
\]
is an alternative estimator of µ. In particular, ν̂n is unbiased (as Eν̂n = µ) and satisfies (by the SLLN) ν̂n → µ. Moreover, if VYi < VX, then it will follow that Vν̂n < Vµ̂n. Choosing (Yi) cleverly so that Vν̂n < Vµ̂n is called variance reduction.
The method described above for estimating the mean µ of a random variable X by generating a sequence of iid random variables – either (Xi) or (Yi) – is the Monte Carlo method in a nutshell. More generally, we may be interested in computing, not just the expectation of a random variable, but the expectation EF[X] of a functional F of an entire stochastic process X = (Xt)t≥0. For example, we may wish to compute the time-weighted average of X, given by
\[
E\,F[X] := E\,\frac{1}{T}\int_0^T X_t\,dt.
\]
To do this, we will need to know how to generate iid random processes X^{(i)} = (X_t^{(i)})t≥0, i = 1, 2, .... Note that a collection of processes (X^{(i)}) is iid if the processes are independent and have the same finite-dimensional distributions:
\[
X^{(i)} \perp\!\!\!\perp X^{(j)}, \quad i \neq j, \qquad
\big(X_{t_1}^{(i)}, X_{t_2}^{(i)}, \ldots, X_{t_n}^{(i)}\big) \overset{D}{=} \big(X_{t_1}^{(j)}, X_{t_2}^{(j)}, \ldots, X_{t_n}^{(j)}\big), \quad \forall\, 0 \leq t_1 \leq t_2 \leq \ldots \leq t_n < \infty.
\]
With the above in mind, the focus of this chapter will be on answering the following two questions:
1. How can we generate iid random variables? Or, more generally, how can we generate iid random processes?
2. How can we "intelligently sample" (i.e., construct estimators of µ) in order to reduce the variance of our estimates?
The first question will occupy Sections 13.2 through 13.6. Variance reduction methods will be the focus of Section 13.7.
Assumption 13.2.1. Throughout Section 13.2 both U and the sequence (Ui ) will represent iid random
variables that are uniformly distributed Ui ∼ U(0, 1).
Theorem 13.2.2 (Inverse Transform Method). Let FX be the distribution function of a random variable X defined on (Ω, F, P). Define
\[
F_X^{-1}(u) := \inf\{x : F_X(x) \geq u\}. \tag{13.1}
\]
Then we have
\[
F_X^{-1}(U) \sim F_X. \tag{13.2}
\]
Proof. For simplicity, let us suppose FX is strictly increasing (this is always the case for continuous random variables). Noting that FX(x) ∈ (0, 1) for all x ∈ R and P(U ≤ y) = y for all y ∈ (0, 1), we have
\[
P\big( F_X^{-1}(U) \leq x \big) = P\big( U \leq F_X(x) \big) = F_X(x),
\]
which is the desired result.
Example 13.2.3. Let us apply Theorem 13.2.2 to generate an exponential random variable X ∼ E(λ). For x ∈ (0, ∞) we have
\[
F_X(x) = \int_0^x \lambda e^{-\lambda x'}\,dx' = 1 - e^{-\lambda x} \qquad\Rightarrow\qquad F_X^{-1}(u) = \frac{-1}{\lambda}\log(1-u).
\]
Using the fact that U ∼ 1 − U, we find that
\[
\frac{-1}{\lambda}\log U \sim E(\lambda).
\]
Example 13.2.4. Suppose X is a discrete random variable with P(X = xi) = pi for i = 1, 2, ..., n. Then
\[
F_X(x) = \sum_{i\,:\,x_i \leq x} p_i \qquad\Rightarrow\qquad F_X^{-1}(u) = \sum_{i=1}^n x_i\,1_{\{P_{i-1} < u \leq P_i\}}, \qquad P_i = \sum_{j=1}^i p_j. \tag{13.3}
\]
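Both examples are one-liners in code. A sketch (not from the notes; the specific support points and probabilities are illustrative assumptions):

```python
import numpy as np

# Sketch: inverse transform for Exp(lam) (Example 13.2.3) and for a discrete
# distribution via its cumulative probabilities P_i (Example 13.2.4).
rng = np.random.default_rng(4)
n, lam = 100_000, 2.0

u = rng.uniform(size=n)
exp_samples = -np.log(u) / lam                 # uses U ~ 1 - U
print("Exp mean:", exp_samples.mean(), "vs 1/lam =", 1 / lam)

xs = np.array([-1.0, 0.0, 2.0])                # support points x_i
ps = np.array([0.2, 0.5, 0.3])                 # probabilities p_i
idx = np.searchsorted(np.cumsum(ps), rng.uniform(size=n))  # i with P_{i-1} < u <= P_i
disc_samples = xs[idx]
print("discrete mean:", disc_samples.mean(), "vs", np.dot(xs, ps))
```

The `searchsorted` call is exactly the indicator sum in (13.3): it returns the first index i whose cumulative probability Pi is at least u.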
Theorem 13.2.5. Let FX be the distribution of a random variable X defined on (Ω, F, P). Let Y := X | a < X ≤ b be a random variable whose distribution FY is the conditional distribution of X given a < X ≤ b. Define
\[
V := F_X(a) + \big( F_X(b) - F_X(a) \big)\,U.
\]
Then
\[
F_X^{-1}(V) \sim F_Y = F_{X|a<X\leq b}. \tag{13.4}
\]
\[
P\big( F_X^{-1}(V) \leq x \big) = P\big( V \leq F_X(x) \big) = P\big( F_X(a) + (F_X(b) - F_X(a))\,U \leq F_X(x) \big)
= P\Big( U \leq \frac{F_X(x) - F_X(a)}{F_X(b) - F_X(a)} \Big)
\]
\[
= \frac{F_X(x) - F_X(a)}{F_X(b) - F_X(a)} = \frac{P(X \leq x,\ a < X \leq b)}{P(a < X \leq b)}
= P(X \leq x \,|\, a < X \leq b) = F_{X|a<X\leq b}(x),
\]
Theorem 13.2.6 (Acceptance–Rejection). Let fX and fY be probability density functions for which there exists a constant c such that fY(x) ≤ c fX(x) for all x. Then
\[
Y \overset{D}{=} X \,\Big|\, \Big\{ U \leq \frac{f_Y(X)}{c\,f_X(X)} \Big\}. \tag{13.5}
\]
Using Theorem 13.2.6, if we can generate a random variable X with density fX, then we can generate a random variable Y with density fY using the following algorithm:
1. Generate X.
2. Generate U.
3. If U ≤ fY(X)/(c fX(X)), return X. Otherwise, go back to Step 1.
In order for the acceptance-rejection method to work efficiently, we would like P(U ≤ fY(X)/(c fX(X))) to be as close to one as possible. Thus, we should choose fX to be similar to fY, and we should choose c to be as small as possible.
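A concrete sketch of the algorithm (not from the notes; the target density fY(x) = 6x(1−x), a Beta(2,2), and the uniform proposal are illustrative assumptions, for which the smallest valid constant is c = 1.5, the maximum of fY):

```python
import numpy as np

# Sketch: acceptance-rejection sampling of f_Y(x) = 6x(1-x) on (0,1)
# using proposals X ~ U(0,1), so f_X = 1 and f_Y <= c * f_X with c = 1.5.
rng = np.random.default_rng(5)
c = 1.5

def sample_beta22():
    while True:
        x = rng.uniform()                      # Step 1: generate X ~ f_X
        u = rng.uniform()                      # Step 2: generate U
        if u <= 6 * x * (1 - x) / (c * 1.0):   # Step 3: U <= f_Y(X)/(c f_X(X))?
            return x

samples = np.array([sample_beta22() for _ in range(50_000)])
print("mean:", samples.mean(), "(Beta(2,2) mean = 0.5)")
```

Each proposal is accepted with probability 1/c = 2/3 here, illustrating why one wants c as small as possible.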
Box–Muller Method
The Box–Muller method of generating two independent Gaussian random variables is summarized in the following theorem.
Theorem 13.2.7 (Box–Muller). Let U1 and U2 be independent U(0, 1) random variables. Then
\[
Z_1 = \sqrt{-2\log U_1}\,\cos(2\pi U_2), \qquad Z_2 = \sqrt{-2\log U_1}\,\sin(2\pi U_2)
\]
are independent N(0, 1) random variables.
Note that we generate a normal random variable X ∼ N(µ, σ²) from a standard normal Z as follows:
\[
X = \mu + \sigma Z.
\]
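A minimal sketch of the transform and the affine map (parameter values are illustrative assumptions):

```python
import numpy as np

# Sketch: Box-Muller transform, then X = mu + sig * Z ~ N(mu, sig^2).
rng = np.random.default_rng(6)
n, mu, sig = 100_000, 1.0, 2.0

u1, u2 = rng.uniform(size=n), rng.uniform(size=n)
z1 = np.sqrt(-2 * np.log(u1)) * np.cos(2 * np.pi * u2)   # Z1 ~ N(0,1)
z2 = np.sqrt(-2 * np.log(u1)) * np.sin(2 * np.pi * u2)   # Z2 ~ N(0,1), indep of Z1
x = mu + sig * z1
print("sample mean/var of X:", x.mean(), x.var())
```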
We now know how to generate independent Gaussian random variables. In this section, we will show how to generate correlated Gaussian random variables. The density fX of a multivariate Gaussian vector X ∈ R^d is given by
\[
f_X(x) = \frac{1}{\sqrt{(2\pi)^d\,|\Sigma|}}\exp\big( -\tfrac12 (x-\mu)^T\Sigma^{-1}(x-\mu) \big),
\]
where we have defined the mean vector µ and the covariance matrix Σ componentwise by
\[
\mu_i = E\,X_i, \qquad \Sigma_{i,j} = E\,(X_i - \mu_i)(X_j - \mu_j), \qquad i, j = 1, \ldots, d.
\]
If Z ∼ N(0, I_d) is a vector of independent standard normals and A ∈ R^{d×d}, then µ + AZ ∼ N(µ, AA^T). Thus, if we wish to generate a d-dimensional Gaussian random vector X ∼ N(µ, Σ), we need only to find a matrix A such that AA^T = Σ. Such a matrix A can be constructed via the Cholesky factorization, which we describe in the following theorem.
Theorem 13.2.8 (Cholesky). Let Σ be a symmetric positive-definite matrix and let A be the lower-triangular matrix produced by the Cholesky factorization of Σ. Then AA^T = Σ.
We will not provide a proof of Theorem 13.2.8. Instead, we direct the interested reader to (Glasserman,
2013, Chapter 2).
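In practice one rarely codes the Cholesky recursions by hand. A sketch using numpy's built-in factorization (the mean vector and covariance matrix are illustrative assumptions):

```python
import numpy as np

# Sketch: generate X ~ N(mu, Sigma) as X = mu + A Z with A A^T = Sigma.
rng = np.random.default_rng(7)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

A = np.linalg.cholesky(Sigma)          # lower-triangular, A @ A.T == Sigma
Z = rng.standard_normal((100_000, 2))  # rows of iid N(0, I) vectors
X = mu + Z @ A.T                       # each row is a draw from N(mu, Sigma)
print("sample covariance:\n", np.cov(X.T))
```

The printed sample covariance should reproduce Σ up to Monte Carlo error.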
Xn+1 = Yn+1 , n = 0, 1, 2, . . . ,
Note that Yn+1 can be generated using the Inverse Transform Method, described in Section 13.2.1.
where Tn+1 is an exponentially distributed random variable and Yn+1 is a discrete random variable, with
\[
T_{n+1}\,|\,X_{\tau_n} \sim E\big( -g(X_{\tau_n}, X_{\tau_n}) \big), \qquad
P\big( Y_{n+1} = j \,|\, X_{\tau_n} \big) = \frac{-g(X_{\tau_n}, j)}{g(X_{\tau_n}, X_{\tau_n})}, \qquad j = 1, 2, \ldots, |S|.
\]
Note that both Tn+1 and Yn+1 can be generated using the Inverse Transform Method, described in Section 13.2.1. Finally, as X is constant between jumps, we have
\[
X_t = X_{\tau_n}, \qquad t \in [\tau_n, \tau_{n+1}).
\]
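The jump-chain construction above translates directly into code. A sketch (not from the notes; the particular generator matrix g is an illustrative assumption):

```python
import numpy as np

# Sketch: simulate a continuous-time Markov chain from its generator matrix g.
# Hold an Exp(-g[i,i]) time in state i, then jump to j with prob -g[i,j]/g[i,i].
rng = np.random.default_rng(8)
g = np.array([[-1.0, 0.7, 0.3],
              [0.4, -0.9, 0.5],
              [0.2, 0.8, -1.0]])       # rows sum to zero, off-diagonals >= 0

def simulate_ctmc(x0, T):
    t, x, path = 0.0, x0, [(0.0, x0)]
    while True:
        t += rng.exponential(-1.0 / g[x, x])   # T_{n+1} ~ E(-g(x, x))
        if t > T:
            return path
        probs = -g[x] / g[x, x]                # P(Y_{n+1} = j) = -g(x,j)/g(x,x)
        probs[x] = 0.0                         # no self-jumps
        x = rng.choice(len(probs), p=probs)
        path.append((t, x))

print(simulate_ctmc(0, 5.0)[:5])
```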
In principle, X and W could live in Rd and Rm , respectively. However, for simplicity, we shall assume
that d = m = 1.
Assumption 13.5.1. Throughout Sections 13.5 and 13.6 the random variable Z and the sequence (Zi )
will represent iid random variables that are Normally distributed Z, Zi ∼ N(0, 1).
From Proposition 8.2.9 we have that the Itô integral is independent of X_{t_i} and normally distributed:
\[
\int_{t_i}^{t_{i+1}} \sigma(t)\,dW_t \sim N\big(0,\, v^2(t_i, t_{i+1})\big), \qquad v^2(t_i, t_{i+1}) := \int_{t_i}^{t_{i+1}} \sigma^2(t)\,dt.
\]
Thus, if we wish to simulate the value of X on a grid Π = {t0 = 0, t1, t2, ...}, we can do so using the following algorithm:
\[
X_{t_{i+1}} = X_{t_i} + \int_{t_i}^{t_{i+1}} \mu(t)\,dt + v(t_i, t_{i+1})\,Z_{i+1}, \tag{13.11}
\]
where we have used v Z_{i+1} ∼ N(0, v²). Notice that the algorithm (13.11) used to generate the value of X on the grid Π is exact, meaning the distributions of (X_{t_0}, X_{t_1}, X_{t_2}, ...) as generated using (13.10) and (13.11) are the same.
Example 13.5.2. Suppose S = (St)t≥0 is a geometric Brownian motion with time-dependent coefficients:
\[
dS_t = b(t)\,S_t\,dt + a(t)\,S_t\,dW_t.
\]
Note that the drift and diffusion coefficients of S are random. Nevertheless, we can simulate S exactly as follows. First define X = log S. Then, by Itô's Lemma, we have
\[
dX_t = \big( b(t) - \tfrac12 a^2(t) \big)dt + a(t)\,dW_t.
\]
As the drift and diffusion coefficients of X are deterministic functions of time, we can simulate (X_{t_0}, X_{t_1}, X_{t_2}, ...) exactly. Then we can set (S_{t_0}, S_{t_1}, S_{t_2}, ...) = (e^{X_{t_0}}, e^{X_{t_1}}, e^{X_{t_2}}, ...).
Next, consider an Ornstein–Uhlenbeck process:
\[
dX_t = -\kappa X_t\,dt + a\,dW_t \qquad\Rightarrow\qquad
X_{t_{i+1}} = e^{-\kappa(t_{i+1}-t_i)}\,X_{t_i} + \int_{t_i}^{t_{i+1}} e^{-\kappa(t_{i+1}-t)}\,a\,dW_t.
\]
From Proposition 8.2.9 we have that the Itô integral is independent of X_{t_i} and normally distributed:
\[
\int_{t_i}^{t_{i+1}} e^{-\kappa(t_{i+1}-t)}\,a\,dW_t \sim N\big(0,\, v^2(t_i,t_{i+1})\big), \qquad
v^2(t_i,t_{i+1}) = \int_{t_i}^{t_{i+1}} e^{-2\kappa(t_{i+1}-t)}\,a^2\,dt = \frac{a^2}{2\kappa}\Big( 1 - e^{-2\kappa(t_{i+1}-t_i)} \Big). \tag{13.12}
\]
Thus, if we wish to simulate the value of X on a grid Π = {t0 = 0, t1, t2, ...}, we can do so using the following algorithm:
\[
X_{t_{i+1}} = e^{-\kappa(t_{i+1}-t_i)}\,X_{t_i} + v(t_i, t_{i+1})\,Z_{i+1},
\]
where v is given by (13.12). Note that this simulation is exact. More generally, one can simulate exactly any process of the form
\[
dX_t = \big( f(t) + g(t)\,X_t \big)dt + h(t)\,dW_t,
\]
where (f, g, h) are deterministic functions of time. We leave it as an exercise for students to show that X_{t_{i+1}} | X_{t_i} is normally distributed.
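The exact recursion above is a two-line loop in code. A sketch (assuming, as above, the mean-zero OU dynamics dXt = −κXt dt + a dWt; parameter values are illustrative):

```python
import numpy as np

# Sketch: exact simulation of dX_t = -kappa X_t dt + a dW_t on a uniform grid,
# using the transition variance v^2 from (13.12).
rng = np.random.default_rng(9)
kappa, a, x0, T, n = 1.5, 0.4, 1.0, 5.0, 500
delta = T / n

v = np.sqrt(a**2 / (2 * kappa) * (1 - np.exp(-2 * kappa * delta)))
X = np.empty(n + 1)
X[0] = x0
for i in range(n):
    X[i + 1] = np.exp(-kappa * delta) * X[i] + v * rng.standard_normal()

print("X_T =", X[-1], "; stationary std a/sqrt(2 kappa) =", a / np.sqrt(2 * kappa))
```

Because the recursion matches the exact transition law, no discretization bias is introduced, no matter how coarse the grid.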
Π = {t0 , t1 , . . . , tn }, ti = i δ, δ = T/n.
For δ ≪ 1, it seems reasonable to approximate the process X in (13.9) on the grid Π by a process X̂ given by
\[
\widehat X_{t_{i+1}}
= \widehat X_{t_i} + \int_{t_i}^{t_{i+1}} \mu(t_i, \widehat X_{t_i})\, dt + \int_{t_i}^{t_{i+1}} \sigma(t_i, \widehat X_{t_i})\, dW_t
= \widehat X_{t_i} + \mu(t_i, \widehat X_{t_i}) (t_{i+1} - t_i) + \sigma(t_i, \widehat X_{t_i}) \big( W_{t_{i+1}} - W_{t_i} \big). \tag{13.13}
\]
In practice, we simulate X̂ on the grid Π using
\[
\widehat X_{t_{i+1}} = \widehat X_{t_i} + \mu(t_i, \widehat X_{t_i})\, \delta + \sigma(t_i, \widehat X_{t_i}) \sqrt{\delta}\, Z_{i+1},
\]
where we have used t_{i+1} − t_i = δ and W_{t_{i+1}} − W_{t_i} ∼ N(0, δ). The question naturally arises: how fast does X̂ converge to X as δ → 0? To give a quantifiable response to this question, we need to define some reasonable measures of convergence.
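Before turning to convergence, here is a minimal Python sketch of the Euler scheme above; the coefficient functions are placeholder choices.

```python
import numpy as np

# Placeholder time- and state-dependent coefficients mu(t, x) and sigma(t, x).
mu = lambda t, x: -0.5 * x
sigma = lambda t, x: 0.2 * np.sqrt(1.0 + x**2)

def euler(mu, sigma, x0, T, n, rng):
    """Euler discretization X_hat of dX_t = mu(t, X_t) dt + sigma(t, X_t) dW_t."""
    delta = T / n
    X = np.empty(n + 1)
    X[0] = x0
    for i in range(n):
        Z = rng.standard_normal()
        # X_hat_{t_{i+1}} = X_hat_{t_i} + mu*delta + sigma*sqrt(delta)*Z_{i+1}
        X[i + 1] = X[i] + mu(i * delta, X[i]) * delta \
                        + sigma(i * delta, X[i]) * np.sqrt(delta) * Z
    return X

rng = np.random.default_rng(5)
print(euler(mu, sigma, x0=1.0, T=1.0, n=100, rng=rng))
```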
We say that X̂ converges strongly to X with order β > 0 if there exists a constant C such that
\[
E\big| \widehat X_{n\delta} - X_T \big| \le C \delta^{\beta}, \qquad \text{as } \delta \to 0. \tag{13.14}
\]
We say that X̂ converges weakly to X with order β > 0 if there exists a constant C such that
\[
\big| E f(\widehat X_{n\delta}) - E f(X_T) \big| \le C \delta^{\beta}, \qquad \text{as } \delta \to 0, \quad \forall\, f \in C_P^{2\beta+1}, \tag{13.15}
\]
where C_P^k is the set of functions whose derivatives up to order k are polynomially bounded.
One can show that X̂ converges strongly to X with order β = 1/2 provided that the drift and diffusion coefficients of X satisfy the linear growth and Lipschitz conditions required for existence and uniqueness of a strong solution of an SDE (see Theorem 9.1.4) and additionally satisfy
\[
|\mu(t, x) - \mu(s, x)| + |\sigma(t, x) - \sigma(s, x)| \le C (1 + |x|) \sqrt{t - s}, \qquad \forall\, 0 \le s < t < \infty,
\]
for some C. Under the additional condition that µ, σ ∈ C_P^{2β+1}, one can show that X̂ converges weakly to X with order β = 1. For detailed statements concerning weak and strong convergence of X̂ to X, we refer the reader to (Kloeden and Platen, 1992).
Statements concerning order of convergence give us information about how quickly the approximation X̂ approaches X. However, as the constants C in (13.14) and (13.15) are typically unknown, an order of convergence statement does not provide much guidance for choosing the size of δ to obtain a desired level of accuracy. From a practical standpoint, one typically must resort to trial-and-error in order to find an appropriate δ. For example, if one wishes to estimate Ef(X_T), one could simulate m sample paths (X̂⁽¹⁾, X̂⁽²⁾, . . . , X̂⁽ᵐ⁾) with δ fixed and then compute
\[
\frac{1}{m} \sum_{i=1}^{m} f\big( \widehat X^{(i)}_{n\delta} \big). \tag{13.16}
\]
One could then repeat this procedure with smaller and smaller δ’s until the value of (13.16) does not
change “too much,” where “too much” depends on the desired level of accuracy.
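A hedged sketch of this trial-and-error procedure, reusing the euler function from the sketch above; the payoff f and the tolerance are placeholders, and in practice the tolerance should not be set below the statistical (Monte Carlo) error of the estimator (13.16).

```python
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.maximum(x - 1.0, 0.0)   # placeholder payoff f
T, m_paths, tol = 1.0, 20000, 1e-3       # placeholder sample size and tolerance

n, prev = 10, None                        # start with delta = T / 10
while True:
    # Monte Carlo estimate (13.16) of E f(X_T) with the current delta.
    est = np.mean([f(euler(mu, sigma, 1.0, T, n, rng)[-1]) for _ in range(m_paths)])
    if prev is not None and abs(est - prev) < tol:
        break
    prev, n = est, 2 * n                  # halve delta and try again
print(n, est)
```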
We could have alternatively written the dynamics of X using a compensated Poisson random measure Ñ(dt, dz) = N(dt, dz) − ν(dz)dt as follows:
\[
dX_t = \Big( \mu(t, X_t) + \int_{\mathbb R} \gamma(t, X_{t-}, z)\, \nu(dz) \Big) dt + \sigma(t, X_t)\, dW_t + \int_{\mathbb R} \gamma(t, X_{t-}, z)\, \widetilde N(dt, dz).
\]
However, for the purposes of simulating sample paths of X, it is preferable to work with the expression (13.17). We will present two Euler-like discretizations of (13.17).
Π = {t0 , t1 , . . . , tn }, ti = i δ, δ = T/n.
Thus, we can approximate the jump term in (13.18) using a Bernoulli random variable B_{i+1} and a random variable Y_{i+1} with distribution F as follows:
\[
\int_{\mathbb R} \gamma(t_i, \widehat X_{t_i}, z) \big( N(t_{i+1}, dz) - N(t_i, dz) \big)
\overset{\mathrm D}{\approx} \gamma(t_i, \widehat X_{t_i}, Y_{i+1})\, B_{i+1}, \qquad Y_{i+1} \sim F, \quad B_{i+1} \sim B(\lambda\delta).
\]
Putting everything together, we can approximately simulate X̂ using the following algorithm:
\[
\widehat X_{t_{i+1}} = \widehat X_{t_i} + \mu(t_i, \widehat X_{t_i})\, \delta + \sigma(t_i, \widehat X_{t_i}) \sqrt{\delta}\, Z_{i+1} + \gamma(t_i, \widehat X_{t_i}, Y_{i+1})\, B_{i+1} + O(\delta^2).
\]
Note that our simulation of X̂ is not exact due to the error term of O(δ²). However, as we are already making an error of O(δ) by approximating X with X̂, the O(δ²) term is not problematic.
\[
\tau_{i+1} = \tau_i + T_{i+1}, \qquad \tau_0 := 0, \qquad T_{i+1} \sim E(\lambda),
\]
where the jump times (τ_i) and the fixed times (t_i) are placed in chronological order. Now, from time τ_i to τ_{i+1}−, the dynamics of X are given by the continuous part of (13.17):
\[
dX_t = \mu(t, X_t)\, dt + \sigma(t, X_t)\, dW_t.
\]
Thus, in between jump dates, we can simulate X using the Euler scheme described in Section 13.5.3. At the jump dates, we have
\[
X_{\tau_i} = X_{\tau_i -} + \int_{\mathbb R} \gamma(\tau_i, X_{\tau_i -}, z)\, \Delta N(\tau_i, dz)
\overset{\mathrm D}{=} X_{\tau_i -} + \gamma(\tau_i, X_{\tau_i -}, Y_{i+1}), \qquad Y_{i+1} \sim F.
\]
One advantage of simulating sample paths of X on a random time grid is that we obtain, for each sample path, the exact jump times of X. When we simulate X on a fixed time grid, we can only specify intervals [t_i, t_{i+1}) during which jumps occur.
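A hedged sketch of the random-grid scheme — exponential jump times, Euler steps in between, and an F-distributed jump applied at each jump date — reusing the placeholder inputs mu, sigma, gamma, lam, and sample_F from the previous sketch.

```python
import numpy as np

def jump_adapted_path(x0, T, delta, rng):
    """Simulate X on [0, T]: Euler steps of size <= delta between jump times tau_i."""
    t, x, jump_times = 0.0, x0, []
    while t < T:
        tau = t + rng.exponential(1.0 / lam)    # tau_{i+1} = tau_i + T_{i+1}, T ~ E(lam)
        while t < min(tau, T):                  # Euler steps up to the next jump (or T)
            h = min(delta, min(tau, T) - t)
            x += mu(t, x) * h + sigma(t, x) * np.sqrt(h) * rng.standard_normal()
            t += h
        if tau < T:                             # apply the jump: x += gamma(tau, x-, Y)
            x += gamma(tau, x, sample_F())
            jump_times.append(tau)
    return x, jump_times

rng = np.random.default_rng(8)
print(jump_adapted_path(x0=1.0, T=1.0, delta=0.01, rng=rng))
```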
Note that µ̂n(b), like µ̂n, is an unbiased estimator of µY, as Eµ̂n(b) = µY. Moreover, we will show that, by choosing b appropriately, we can obtain Vµ̂n(b) ≤ Vµ̂n. To see how this is so, we compute
\[
V \widehat\mu_n(b) = \frac{1}{n}\, \sigma_Y^2(b), \qquad \sigma_Y^2(b) := \sigma_Y^2 - 2 b\, \rho_{XY}\, \sigma_X \sigma_Y + b^2 \sigma_X^2,
\]
where σX² = VX and ρXY is the correlation between X and Y. Note that σY²(b) is quadratic and convex as a function of b and achieves a minimum at
\[
b^* = \frac{\rho_{XY}\, \sigma_Y}{\sigma_X} = \frac{\mathrm{CoV}(X, Y)}{VX}.
\]
As b* is the minimizer of σY²(b), we know that σY²(b*) ≤ σY²(0) = σY². Thus, we have
\[
V \widehat\mu_n(b^*) = \frac{1}{n}\, \sigma_Y^2(b^*) \le \frac{1}{n}\, \sigma_Y^2 = V \widehat\mu_n.
\]
Thus, by using µ̂n(b*) in place of µ̂n, we can achieve a lower-variance estimator of µY. The variable X, whose expectation µX is known, is called a control variate.
In many practical scenarios, it may not be realistic to assume that we know CoV(X, Y) and VX, which are needed to compute b*. However, we can always estimate b* using
\[
\widehat b_n^* = \frac{\sum_{i=1}^n (X_i - EX)(Y_i - \widehat\mu_n)}{\sum_{i=1}^n (X_i - EX)^2}.
\]
One cautionary note: unlike µ̂n(b), which is an unbiased estimator of µY for any fixed b ∈ ℝ, the estimator µ̂n(b̂n*) is a biased estimator of µY (because Eµ̂n(b̂n*) ≠ µY).
Example 13.7.1. Suppose we wish to estimate Ef (ST ) where S = (St )t ≥0 is the solution of the following
SDE
Note that S is a martingale and thus EST = S0 . Assuming we can simulate iid paths of S (either exactly
or approximately), then we can estimate Ef(S_T) using the control variate method with (X_i, Y_i) = (S_T^{(i)}, f(S_T^{(i)})).
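A hedged Python sketch of Example 13.7.1, taking S to be a driftless geometric Brownian motion dSt = σ St dWt (a placeholder choice of martingale, as the SDE itself is not reproduced here) and f a call-type payoff; b* is estimated from the samples via b̂n* above.

```python
import numpy as np

rng = np.random.default_rng(9)

# Placeholder martingale: driftless GBM dS_t = sig * S_t dW_t, simulated exactly.
S0, sig, T, n = 100.0, 0.2, 1.0, 100000
f = lambda s: np.maximum(s - 100.0, 0.0)   # placeholder payoff

ST = S0 * np.exp(-0.5 * sig**2 * T + sig * np.sqrt(T) * rng.standard_normal(n))

X, Y = ST, f(ST)       # control variate X = S_T, with known mean EX = S0
EX = S0

mu_hat = Y.mean()      # plain Monte Carlo estimator mu_hat_n
b_hat = np.sum((X - EX) * (Y - mu_hat)) / np.sum((X - EX) ** 2)
mu_cv = np.mean(Y - b_hat * (X - EX))   # control variate estimator mu_hat_n(b_hat)

print(mu_hat, mu_cv)
```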
\[
E \varphi(X) = \widetilde E\, Z \varphi(X), \qquad Z = \frac{dP}{d\widetilde P}.
\]
Suppose that we can generate iid random variables (X_i) with X_i ∼ F and iid random vectors (X_i, Z_i) with (X_i, Z_i) ∼ F̃. Then we can construct the estimators
\[
\widehat\mu_n = \frac{1}{n} \sum_{i=1}^n \varphi(X_i), \qquad X_i \sim F,
\qquad\qquad
\widehat\nu_n = \frac{1}{n} \sum_{i=1}^n Z_i\, \varphi(X_i), \qquad (X_i, Z_i) \sim \widetilde F.
\]
Both µ̂n and ν̂n are unbiased estimators of µ, as Eµ̂n = Ẽν̂n = µ. To see which estimator has a lower variance, we compute
\[
V \widehat\mu_n = \frac{1}{n} \big( E \varphi^2(X) - \mu^2 \big), \qquad
\widetilde V \widehat\nu_n = \frac{1}{n} \big( \widetilde E\, Z^2 \varphi^2(X) - \mu^2 \big).
\]
Thus, if Eϕ²(X) ≥ ẼZ²ϕ²(X) = EZϕ²(X), then Vµ̂n ≥ Ṽν̂n.
Now, let us consider the case where Z = p(X)/p̃(X), where p is the density of X under P and p̃ is some other density. Note that, with this choice for Z, we have P̃(X ∈ dx) = p̃(x)dx because
\[
P(X \in A) = E\, 1_{\{X \in A\}} = \widetilde E\, Z\, 1_{\{X \in A\}}
= \widetilde E\, \frac{p(X)}{\widetilde p(X)}\, 1_{\{X \in A\}}
= \int_A dx\, \widetilde p(x)\, \frac{p(x)}{\widetilde p(x)}
= \int_A dx\, p(x).
\]
Thus, if we choose Z = p(X)/p̃(X) with p̃ given by (13.19), we will obtain an estimator ν̂n of µ with a lower variance than µ̂n. Of course, from a practical standpoint, if we knew p, then we could compute Eϕ(X) without the need for Monte Carlo simulation. Nevertheless, this exercise gives us insight as to how we might choose Z. Namely, we should try to choose Z so that p̃(x) is large when |ϕ(x)|p(x) is large.
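To illustrate this principle, here is a small hedged sketch: estimating µ = Eϕ(X) = P(X > 3) for X ∼ N(0, 1) by sampling from a shifted normal p̃ whose mass sits where |ϕ(x)|p(x) is large; the shift value is a placeholder choice.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(10)
n = 100000

# Target: mu = P(X > 3) under the density p of N(0, 1).
phi = lambda x: (x > 3.0).astype(float)

# Plain Monte Carlo: X_i ~ p.
X = rng.standard_normal(n)
mu_hat = phi(X).mean()

# Importance sampling: X_i ~ p_tilde = N(3, 1) density, Z = p(X) / p_tilde(X).
Xt = rng.standard_normal(n) + 3.0
Z = norm.pdf(Xt) / norm.pdf(Xt, loc=3.0)
nu_hat = (Z * phi(Xt)).mean()

print(mu_hat, nu_hat, norm.sf(3.0))   # both estimate P(X > 3) ~ 0.00135
```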
where W is a Brownian motion under P. Suppose we wish to compute P(X_T > x) = E 1_{\{X_T > x\}} where x ≫ X₀. We could approximate X on a grid Π = {t₀ = 0, t₁, t₂, . . . , tₘ = T} with a process X̂ where
\[
\widehat X_{t_{i+1}} = \widehat X_{t_i} + a(\widehat X_{t_i}) \big( W_{t_{i+1}} - W_{t_i} \big), \qquad i = 0, 1, \ldots, m - 1,
\]
and then compute
\[
\widehat\mu_n = \frac{1}{n} \sum_{i=1}^n 1_{\{\widehat X_T^{(i)} > x\}},
\qquad \text{where the } \widehat X^{(i)} \text{ are iid realizations of } \widehat X \text{ under } P.
\]
\[
Z = \frac{dP}{d\widetilde P} = e^{-\frac{1}{2}\gamma^2 T - \gamma \widetilde W_T}
= e^{-\frac{1}{2}\gamma^2 T} \prod_{i=0}^{m-1} e^{-\gamma ( \widetilde W_{t_{i+1}} - \widetilde W_{t_i} )},
\]
Then, using
\[
P(X_T > x) = \widetilde E\, \frac{dP}{d\widetilde P}\, 1_{\{X_T > x\}} = \widetilde E\, Z\, 1_{\{X_T > x\}},
\]
we could compute
\[
\widehat\nu_n = \frac{1}{n} \sum_{i=1}^n Z^{(i)}\, 1_{\{\widehat X_T^{(i)} > x\}},
\qquad \text{where } (\widehat X^{(i)}, Z^{(i)}) \text{ are iid realizations of } (\widehat X, Z) \text{ under } \widetilde P,
\]
\[
Z^{(i)} = e^{-\frac{1}{2}\gamma^2 T} \prod_{j=0}^{m-1} e^{-\gamma ( \widetilde W^{(i)}_{t_{j+1}} - \widetilde W^{(i)}_{t_j} )}.
\]
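A hedged sketch of this estimator: under P̃ the process W has drift γ (with W̃ = W − γt a P̃-Brownian motion), so each path is simulated with drifted increments and weighted by Z; the diffusion coefficient a(x), the tilt γ, and the threshold x are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(11)

a_fn = lambda x: 0.3 * (1.0 + np.abs(x))   # placeholder diffusion coefficient a(x)
gam, X0, T, m, n, x_star = 1.5, 0.0, 1.0, 50, 100000, 2.0
delta = T / m

X = np.full(n, X0)
logZ = np.full(n, -0.5 * gam**2 * T)       # log of the factor e^{-gamma^2 T / 2}
for i in range(m):
    dW_tilde = np.sqrt(delta) * rng.standard_normal(n)  # P_tilde-BM increments
    X += a_fn(X) * (dW_tilde + gam * delta)             # dW = dW_tilde + gam*dt under P_tilde
    logZ += -gam * dW_tilde                             # accumulate the product defining Z
nu_hat = np.mean(np.exp(logZ) * (X > x_star))
print(nu_hat)   # importance-sampling estimate of P(X_T > x)
```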
Y+ = g(Z1 , Z2 , . . . , Zm ).
\[
Y^- := g(-Z_1, -Z_2, \ldots, -Z_m) \overset{\mathrm D}{=} Y^+.
\]
We call the pair (Y+ , Y– ) antithetic variables. Note that Y+ and Y– are not independent.
Now, suppose we wish to compute µ := EY⁺. As we know how to generate iid standard normal random variables (Z_i), we can generate iid random variables (Y_i⁺) and (Y_i⁻). Thus, we can construct two unbiased estimators of µ as follows:
\[
\widehat\mu_n := \frac{1}{2n} \sum_{i=1}^n \big( Y_i^+ + Y_{n+i}^+ \big), \qquad
\widehat\nu_n := \frac{1}{2n} \sum_{i=1}^n \big( Y_i^+ + Y_i^- \big).
\]
We would like to know under which conditions we have Vν̂n < Vµ̂n. Noting that
\[
V\big( Y_i^+ + Y_{n+i}^+ \big) = V Y_i^+ + V Y_{n+i}^+ = 2\, V Y^+,
\]
\[
V\big( Y_i^+ + Y_i^- \big) = V Y_i^+ + V Y_i^- + 2\, \mathrm{CoV}(Y_i^+, Y_i^-) = 2\, V Y^+ + 2\, \mathrm{CoV}(Y^+, Y^-),
\]
we see that, if CoV(Y⁺, Y⁻) < 0, then we will have Vν̂n < Vµ̂n. Note that we will have CoV(Y_i⁺, Y_i⁻) < 0 if the function g is monotonic in each of its components.
Example 13.7.3. Suppose we wish to compute Eϕ(X_T) where X = (X_t)_{t≥0} satisfies the SDE dX_t = µ(t, X_t)dt + σ(t, X_t)dW_t. We can generate an approximate path of X on a grid Π = {t₀ = 0, t₁, . . . , tₘ = T} with a process X̂⁺ defined by
\[
\widehat X^+_{t_{i+1}} = \widehat X^+_{t_i} + \mu(t_i, \widehat X^+_{t_i}) (t_{i+1} - t_i) + \sigma(t_i, \widehat X^+_{t_i}) \sqrt{t_{i+1} - t_i}\; Z_{i+1}, \qquad i = 0, 1, \ldots, m - 1.
\]
Alternatively, we can generate an approximate path of X on Π with a process X̂⁻ defined by
\[
\widehat X^-_{t_{i+1}} = \widehat X^-_{t_i} + \mu(t_i, \widehat X^-_{t_i}) (t_{i+1} - t_i) + \sigma(t_i, \widehat X^-_{t_i}) \sqrt{t_{i+1} - t_i}\; (-Z_{i+1}), \qquad i = 0, 1, \ldots, m - 1.
\]
We can now construct antithetic variables (Y⁺, Y⁻) = (ϕ(X̂⁺_T), ϕ(X̂⁻_T)).
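A hedged sketch of Example 13.7.3; the coefficients and the payoff ϕ are placeholder choices, and the same draws Z drive both paths, with flipped signs for X̂⁻.

```python
import numpy as np

rng = np.random.default_rng(12)

# Placeholder coefficients and payoff.
mu    = lambda t, x: 0.05 * x
sigma = lambda t, x: 0.2 * x
phi   = lambda x: np.maximum(x - 1.0, 0.0)
X0, T, m, n = 1.0, 1.0, 50, 100000
delta = T / m

Xp = np.full(n, X0)    # paths of X_hat^+
Xm = np.full(n, X0)    # paths of X_hat^-, driven by -Z
for i in range(m):
    t = i * delta
    Z = rng.standard_normal(n)
    Xp += mu(t, Xp) * delta + sigma(t, Xp) * np.sqrt(delta) * Z
    Xm += mu(t, Xm) * delta + sigma(t, Xm) * np.sqrt(delta) * (-Z)

Yp, Ym = phi(Xp), phi(Xm)
nu_hat = np.mean(0.5 * (Yp + Ym))   # antithetic estimator nu_hat_n
print(nu_hat, np.mean(Yp))          # compare with the plain estimator
```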
13.8 Exercises
To do.
Bibliography
Cont, R. and P. Tankov (2004). Financial modelling with jump processes, Volume 2. Chapman & Hall.
Glasserman, P. (2013). Monte Carlo methods in financial engineering, Volume 53. Springer Science &
Business Media.
Grimmett, G. and D. Stirzaker (2001). Probability and random processes. Oxford University Press.
Hawkes, A. G. (1971). Spectra of some self-exciting and mutually exciting point processes.
Biometrika 58 (1), 83–90.
Karlin, S. and H. Taylor (1981). A second course in stochastic processes. Academic Press.
Kloeden, P. and E. Platen (1992). Numerical Solution of Stochastic Differential Equations, Volume 23.
Springer-Verlag Berlin Heidelberg.
Linetsky, V. (2007). Chapter 6: Spectral methods in derivatives pricing. In J. R. Birge and V. Linetsky (Eds.), Financial Engineering, Volume 15 of Handbooks in Operations Research and Management Science, pp. 223–299. Elsevier.
Øksendal, B. and A. Sulem (2005). Applied stochastic control of jump diffusions. Springer-Verlag.
Shreve, S. E. (2004). Stochastic calculus for finance II: Continuous-time models, Volume 11. Springer
Science & Business Media.