ChrisRackauckas IntuitiveSDEs
ChrisRackauckas IntuitiveSDEs
Abstract
Stochastic differential equations (SDEs) are a generalization of deterministic differential
equations that incorporate a “noise term”. These equations can be useful in many applications
where we assume that there are deterministic changes combined with noisy fluctuations. Ito’s
Calculus is the mathematics for handling such equations. In this article we introduce stochastic
differential equations and Ito’s calculus from an intuitive point of view, building the ideas from
relatable probability theory and only straying into measure-theoretic probability (defining all
concepts along the way) as necessary. All of the proofs are discussed intuitively and rigorously:
step by step proofs are provided. We start by reviewing the relevant probability needed in order
to develop the stochastic processes. We then develop the mathematics of stochastic processes in
order to define the Poisson Counter Process. We then define Brownian Motion, or the Wiener
Process, as a limit of the Poisson Counter Process. By doing the definition in this manner,
we are able to solve for many of the major properties and theorems of the stochastic calculus
without resorting to measure-theoretic approaches. Along the way, examples are given to show
how the calculus is actually used to solve problems. After developing Ito’s calculus for solving
SDEs, we briefly discuss how these SDEs can be computationally simulated in case the ana-
lytical solutions are difficult or impossible. After this, we turn to defining some relevant terms
in measure-theoretic probability in order to develop ideas such as conditional expectation and
martingales. The conclusion to this article is a set of four applications. We show how the rules
of the stochastic calculus and some basic martingale theory can be applied to solve problems
such as option pricing, genetic drift, stochastic control, and stochastic filtering. The end of this
article is a cheat sheet that details the fundamental rules for “doing” Ito’s calculus, like one
would find on the cover flap of a calculus book. These are the equations/properties/rules that
one uses to solve stochastic differential equations that are explained and justified in the article
but put together for convenience.
Contents
1 Introduction 6
1.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1
2 Probability Review 7
2.1 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Example 1: Bernoulli Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Example 2: Binomial Random Variable . . . . . . . . . . . . . . . . . . . . . 8
2.2 Probability Generating Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Moment Generating Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Continuous Time Discrete Space Random Variables . . . . . . . . . . . . . . . . . . . 12
2.4.1 The Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.1 The Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5.2 Generalization: The Multivariate Gaussian Distribution . . . . . . . . . . . . 15
2.5.3 Gaussian in the Correlation-Free Coordinate System . . . . . . . . . . . . . . 15
2.6 Gaussian Distribution in PDEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.7 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.8 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.9 Change of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.9.1 Multivariate Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.10 Empirical Estimation of Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2
4 Introduction to Stochastic Processes: Brownian Motion 34
4.1 Brownian Motion / The Wiener Process . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Understanding the Wiener Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Ito’s Rules for Wiener Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 A Heuristic Way of Looking at Ito’s Rules . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5 Wiener Process Calculus Summarized . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.5.1 Example Problem: Geometric Brownian Motion . . . . . . . . . . . . . . . . 40
4.6 Kolmogorov Forward Equation Derivation . . . . . . . . . . . . . . . . . . . . . . . . 41
4.6.1 Example Application: Ornstein–Uhlenbeck Process . . . . . . . . . . . . . . . 43
4.7 Stochastic Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.8 Fluctuation-Dissipation Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3
6.5 Martingale SDEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.5.1 Example: Geometric Brownian Motion . . . . . . . . . . . . . . . . . . . . . . 65
6.6 Application of Martingale Theory: First-Passage Time Theory . . . . . . . . . . . . 65
6.6.1 Kolmogorov Solution to First-Passage Time . . . . . . . . . . . . . . . . . . 66
6.6.2 Stopping Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.6.3 Reflection Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.7 Levy Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.8 Markov Processes and the Backward Kolmogorov Equation . . . . . . . . . . . . . . 67
6.8.1 Markov Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.8.2 Martingales by Markov Processes . . . . . . . . . . . . . . . . . . . . . . . . 68
6.8.3 Transition Densities and the Backward Kolmogorov . . . . . . . . . . . . . . 68
6.9 Change of Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.9.1 Definition of Change of Measure . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.9.2 Simple Change of Measure Example . . . . . . . . . . . . . . . . . . . . . . . 70
6.9.3 Radon-Nikodym Derivative Process . . . . . . . . . . . . . . . . . . . . . . . 70
6.9.4 Girsanov Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7 Applications of SDEs 72
7.1 European Call Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.1.1 Solution Technique: Self-Financing Portfolio . . . . . . . . . . . . . . . . . . 73
7.1.2 Solution Technique: Conditional Expectation . . . . . . . . . . . . . . . . . . 74
7.1.3 Justification of µ = r via Girsanov Theorem . . . . . . . . . . . . . . . . . . . 76
7.2 Population Genetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.2.1 Definitions from Biology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7.2.2 Introduction to Genetic Drift and the Wright-Fisher Model . . . . . . . . . . 77
7.2.3 Formalization of the Wright-Fisher Model . . . . . . . . . . . . . . . . . . . . 78
7.2.4 The Diffusion Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.2.5 SDE Approximation of the Wright-Fisher Model . . . . . . . . . . . . . . . . 79
7.2.6 Extensions to the Wright-Fisher Model: Selection . . . . . . . . . . . . . . . . 80
7.2.7 Extensions to the Wright-Fisher Model: Mutation . . . . . . . . . . . . . . . 81
7.2.8 Hitting Probability (Without Mutation) . . . . . . . . . . . . . . . . . . . . . 81
7.2.9 Understanding Using Kolmogorov . . . . . . . . . . . . . . . . . . . . . . . . 82
7.3 Stochastic Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.3.1 Deterministic Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.3.2 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.3.3 Stochastic Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.3.4 Example: Linear Stochastic Control . . . . . . . . . . . . . . . . . . . . . . . 85
7.4 Stochastic Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.4.1 The Best Estimate: E [Xt |Gt ] . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.4.2 Linear Filtering Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.5 Discussion About the Kalman-Bucy Filter . . . . . . . . . . . . . . . . . . . . . . . . 91
4
8 Stochastic Calculus Cheat-Sheet 92
5
Acknowledgements
This article was based on the course notes of Math 271-C, stochastic differential equations, taught
by Xiaohui Xie at University of California, Irvine. It is the compilation of notes from Shan Jiang,
Anna LoPresti, Yu Liu, Alissa Klinzmann, Daniel Quang, Hannah Rubin, Jaleal Sanjek, Kathryn
Scannell, Andrew Schaub, Jienian Yang, and Xinwen Zhang.
1 Introduction
Newton’s calculus is about understanding and solving the following equation:
d dx
g(x) = g ′ (x)
dt dt
The purpose of this paper is to generalize these types of equations in order to include noise.
Let Wt be the Wiener process (aka Brownian motion whose properties will be determined later).
We write a stochastic differential equation (SDE) as
dx = f (x)dt + σ(x)dWt
which can be interpreted as “the change in x is given by deterministic changes f with noise of
variance σ”. For these equations, we will need to develop a new calculus. We will show that
Newton’s rules of calculus will not hold:
Instead, we can use Ito’s calculus (or other systems of calculus designed to deal with SDEs). In
Ito’ s calculus, we use the following equation to find dg(x):
1
dg(x) = g′ (x)dx + g′′ (x)σ 2 (x)dt
2
1
= g′ (x)f (x)dt + g ′ (x)σ(x)dt + g′′ (x)σ 2 (x)dt
2
If we let ρ(x, t) be the distribution of x at time t, we can describe the evolution of this distribution
using the PDE known as the Kolmogorov equation:
∂ρ(x, t) ∂ 1 ∂2 2
= − [f (x)ρ(x, t)] + 2
σ (x)ρ(x, t) .
∂t ∂x 2 ∂x
The we can understand the time development of differential equations with noise terms by un-
derstanding their probability distributions and how to properly perform the algebra to arrive at
solutions.
6
1.1 Outline
This article is structured as follows. We start out with a review of probability that would be
encountered in a normal undergraduate course. These concepts are then used to build the basic
theory of stochastic processes and importantly the Poisson counter process. We then define the
Wiener process, Brownian motion, as a certain limit of the Poisson counter process. Using this
definition, we derive the basic properties, theorems, and rules for solving SDEs, known as the
stochastic calculus. After we have developed the stochastic calculus, we develop some measure-
theoretic probability ideas that will be important for defining conditional expectation, an idea
central to fucture estimation of stochastic processes and martingales. These properties are then
applied to systems that may be of interest to the stochastic modeler. The first of which is the
European option market where we use our tools to derive the Black-Scholes equation. Next we
dabble in some continuous probability models for population genetics. Lastly, we look at stochastic
control and filtering problems that are central to engineering and many other disciplines.
2 Probability Review
This chapter is a review of probability concepts from an undergraduate probability course. These
ideas will be useful when doing the calculations for the stochastic calculus. If you feel as though
you may need to review some probability before continuing, we recommend Durrett’s Elementary
Probability for Applications, or at a slightly higher level which is more mathematical, Grinstead
and Snell’s Introduction to Probability. Although a full grasp of probability is not required, it is
recommended that you are comfortable with most of the concepts introduced in this chapter.
We defineE [X] as the mean of X, or the expectation of X. To find E [X], we take all the possible
values of X and weight them by the probability of X taking each of these values. So, for the
7
Bernoulli trial:
E [X] = Pr(H) · 1 + Pr(T ) · 0
=p
We use the probability distribution of X to describe all the probabilities of X taking each of its
values. There are only two possible outcomes in the Bernoulli trial and thus we can write the
probability distribution as
P (X = 1) = P (H) = p
.
P (X = 0) = P (T ) = 1 − p
Define V [X] as the variance of X. This is a measure of how much the values of X diverge from
the expectation of X on average and is defined as
h i
V [X] = σx2 = E (X − E [X])2 = E X 2 − E [X]2 .
S = {HH · · · H, T T · · · T, HT HT T · · · , ...}
or equivalently
S = {H, T }n
We may want to describe the how large the set S is, or the cardinality of the set S, represented by
|S|, as the “number of things in S”. For this example:
|S| = 2n
For each particular string of heads and tails, since each coin flip is independent, we can calculate
the probability of obtaining that particular string as
8
So for instance, P (HH · · · H) = pn . Say we want to talk about the probability of getting a certain
number of heads in this experiment. Then let X be the random variable for the number of heads.
We can describe the range(X) as the possible values that X can take: X ∈ {0, 1, 2, ..., n}. Note:
using “∈” is an abuse of notation since X is actually a function. Recall that the probability
distribution of X is
n k
Pr(X = k) = p (1 − p)n−k
k
Using this probability distribution, we can calculate the expectation and variance of X. The
expectation is
Xn Xn
n k
E [X] = µx = Pr(X = k) · k = p (1 − p)n−k · k
k
k=0 k=0
= n · p,
while the variance is
h i
V [X] = E (X − p)2
n
X
= Pr(X = k)(k − np)2
k=0
= n · p(1 − p).
Note that if Xi is the Bernoulli random variable associated with the ith coin toss, then
n
X
X= Xi .
i=1
Using these indicator variables we can compute the expectation and variance for the Binomial trial
more quickly. In order to do so, we use the following facts
E [aX + bY ] = aE [X] + bE [Y ]
and
V [aX + bY ] = a2 V [X] + b2 V [Y ] if X and Y are independent
to easily compute
n
X n
X
E [X] = E [Xi ] = p = np
i=1 i=1
9
and
n
X n
X
V [X] = V [Xi ] = p(1 − p) = np(1 − p)
i=1 i=1
The probability generating function is special because it gives an easy way to solve for the proba-
bility that the random variable equals a specific value. To do so, notice that
G(k) (0) = k! pk ,
10
and thus
G(k) (z)|z=0
Pr(X = k) = ,
k!
that is, the kth derivative of G evaluated at 0 gives a straight-forward way of finding p. Thus if
we can solve for G then we can recover the probability distribution (which in some cases may be
simple).
Thus we can set t = 0 and evaluate the derivative of the moment generating function at this point.
This gives us the following:
11
2.4 Continuous Time Discrete Space Random Variables
We can generalize these ideas to random variables with infinite time. To go from discrete to
continuous time, start with a line segment divided into intervals of size ∆t. Then let the interval
t
size ∆t → 0. The number of intervals is n = ∆t ., so we can also think of this as letting the number
of intervals n → ∞.
Let’s do this for the Binomial random variable. We can think of a continuous time random
variable version in the following way: there is a coin toss within each interval, the probability of
a “success” (the coin lands on heads) within each interval is λ∆t. Define X to be the number of
successes within the interval (0,t). Thus the probability of k successes in the interval (0, t) is
t
t
Pr(X = k) = ∆t (λ∆t)k (1 − λ∆t) ∆t −k
k
The probability generating function can then be written as follows:
12
1. Events are independent of each other
(a) Within a small interval ∆t, the probability of seeing one event is λ∆t
3. A special property of the Poisson distribution is that the expectation and variance are the
same:
where e−λt is the probability that no event occurs before t and λ∆t is the probability that the event
occurs in the (t, t + ∆t) time window. This is an exponential distribution:
f (t) = λe−λt
13
2.5.1 The Gaussian Distribution
Take a random variable X. We denote that X is Gaussian distributed with mean µ and variance
σ 2 squared by X ∼ N (µ, σ 2 ). This distribution can be denoted by the following properties:
(x−µ) 2
• Density function: ρ(X) = √ 1
2
e− 2σ2
2πσ
Ra
• Cumulative distribution function: P (X ≤ a) = −∞ ρ(x)dx
R∞
• Expectation: E [X] = −∞ xρ(x)dx = µ
R∞
• Variance: V [X] = −∞ (x − µ)2 ρ(x)dx = σ 2
The way to read this is that the probability that X will take a value within a certain interval is
given by P {x ∈ [x, x + dx]} = ρ(x) · dx.
Recall that the nth moment of X is defined as
Z ∞
p
E [x ] = xp ρ(x)dx.
−∞
Z ∞
p
E [x ] = xp ρ(x)dx
Z−∞
∞
1 x2
= xp √ e− 2σ2
−∞ 2πσ 2
Z ∞
1 x2
= √ xp−1 xe− 2σ2 dx
2πσ 2 −∞
x2
To solve this, we use integration by parts. We let u = xp−1 and dv = xe− 2σ2 . Thus du =
x2
(p − 1)xp−2 and v = −σ 2 e− 2σ2 . Therefore we see that
∞ Z ∞
σ2 x2 x2
E [xp ] = √ [(xp−1 e− 2σ2 ) + e− 2σ2 (p − 1)xp−2 dx].
2πσ 2 −∞ −∞
Notice that the constant term vanishes at both limits. Thus we get that
Z
p σ 2 (p − 1) ∞ − x22 p−2
E [x ] = √ e 2σ x dx
2πσ 2 −∞
= σ 2 (p − 1)E xp−2
14
Thus, using the base cases of the mean and the variance, we have a recursive algorithm for
finding all of the further variances. Notice that since we assumed µ = 0, we get that E [xp ] = 0 for
every odd p. For every even p, we can solve the recursive equation to get
p2
p σ2 p! p
E [x ] = p = (p − 1)!!σ
2 2 !
where the double factorial a!! means to multiply only the odd numbers from 1 to a.
and thus, since Σ gives the variance in each component and the covariance between components,
it is known as the Variance-Covariance Matrix.
since det(U ) = 1 (each eigenvector is of norm 1). Noting that the eigenvector matrix satisfies the
property U T = U −1 we can define a new coordinate system y = U x and substitute to get
Y n 2
1 1 T T T 1 1 T 1 y
− i
e− 2 x U U Σ U U x = p e− 2 y Λy =
−1
ρ(y) = p p e 2λi
(2π)n det Σ (2π)n det Λ i=1
(2πλi )
15
where λi is the ith eigenvalue. Notice that in the y-coordinate system, each of the components are
uncorrelated. Because this is a Gaussian distribution, this implies that each component of y is a
Gaussian random variance yi ∼ N (0, λi ).
∂p(x, t) 1 ∂ 2 p(x, t)
=
∂t 2 ∂x2
p(x, 0) = ψ(x)
Suppose we are in n dimensional space and matrices Q = QT are both positive and finite, and that
Qi,j are the entries of Q. Thus look at the PDE
∂p(x, t) 1 ∂ ∂
= Σni,j=1 qi,j p(x, t)
∂t 2 ∂xi ∂xj
1
= ∇p(t, x)T Q∇p(x, t)
2
The solution to this PDE is a multivariate Gaussian distribution
1
p(x, t) = p exp(−xT (2Qt)−1 x)
det(Q)(2πt)n
where the covariance matrix, Σ = tQ scales linearly with t. This shows a strong connection between
the heat equation (diffusion) and the Gaussian distribution, a relation we will expand upon later.
2.7 Independence
Two events are defined to be independent if the measure of doing both events is their product.
Mathematically we say that two events P1 and P2 are independent if
16
This definition can be generalized to random variables. By definition X and Y are independent
random variables if ρ(x, y) = ρx (x)ρy (y) which is the same as saying that the joint probability
density is equal to the product of the marginal probability densities
Z ∞
ρx (x) = ρ(x, y)dy,
−∞
µ(P1 ∩ P2 )
µ(P1 |P2 ) =
µ(P2 )
An important theorem here is Bayes’s rule. Bayes’ rule can also be referred to as “the flip-flop
theorem”. Let’s say we know the probability of P1 given P2 . Bayes’ theorem let’s us flip this
around and calculate the probability of P2 given P1 (note: these are not necessarily equal! The
probability of having an umbrella given that it’s raining is not the same as the probability of raining
given that you have an umbrella!). Mathematically, Bayes’ rule is the equation
µ(P1 |P2 )µ(P2 )
µ(P2 |P1 ) =
µ(P1 )
Bayes’s rule also works for the probability density functions. For the probability density functions
of random variables X and Y we get that
p(y|x)p(x)
p(x|y) = .
p(y)
17
or
p(x)|dx| = ρ(y)|dy|
and thus
dy ρ(y)
ρ(x) = ρ(y)| |= ′
dx |φ (y)|
where y = φ−1 (x).
since computationally this uses addition of the probabilities instead of the product (which in some
ways can be easier to manipulate and calculate). It can be proven that an equivalent estimator for
θ is
18
3 Introduction to Stochastic Processes: Jump Processes
In this chapter we will develop the ideas of stochastic processes without “high-powered mathemat-
ics” (measure theory). We develop the ideas of Markov processes in order to intuitively develop
the ideas of a jump process where the probability of jumps are Poisson distributed. We then use
this theory of Poisson jump processes in order to define the Brownian motion and prove its prop-
erties. Thus by the end of this chapter we will have intuitively defined the SDE and elaborated its
properties.
• Independent increment: Give a time interval τ form time point t, the probability of k things
happen in this interval does not depend on the time before:
• In a Poisson process, the probability k events happened before time t satisfies the Poisson
distribution: N (t) ∼ P oisson(λt),
(λt)k −λt
Pr(N (t) = k) = e
k!
That is to say: one can make predictions for the future of the process based solely on its present
state just as well as one could knowing the process’s full history. For example, if weather is a Markov
process, then the only thing that I need to know to predict the weather tomorrow is the weather
today. Note that the Poisson counter is a Markov process. This is trivial given the independent
increment part of the definition.
19
3.4 Time Evolution of Poisson Counter
To solve for the time evolution of the Poisson counter, we will instead of looking at a single trajectory
look at an ensemble of trajectories. Think of the ensemble of trajectories as an “amount” or
concentration of probability fluid that is flowing from one state to another. For a Poisson counter,
the flow is the average rate λ. Thus look at state i is the particles that have jumped i times. The
flow out of state i is λ times the amount of probability in state i, or λpi (t). The flow into the state
is simply the flow out of i − 1, or λpi−1 (t). Thus the change in the amount of probability at state
i is given by the differential equation
dpi (t)
= −λpi (t) + λpi−1 (t).
dt
or
pi (t + ∆t) − pi (t) = λ∆tpi−1 (t) − λ∆tpi (t).
To Solve this, define p(t) as the infinite vector (not necessarily a vector because it is countably
infinite but the properties we use here of vectors hold for a rigged basis) where of pi (t). Thus we
note that
ṗ(t) = Ap(t)
where
−λ
λ −λ
A=
.. .
λ .
..
.
To solve this, we just need to solve the cascade of equations. Notice that
p1 (t) = λte−λt .
20
To see this is the general solution, simply plug it in to see that it satisfies the differential equation.
Because
ṗ(t) = Ap(t),
is a linear system of differential equations, there is a unique solution. Thus our solution is the
unique solution.
ṗ(t)
= −2λpi (t) + λpi−1 (t) + λpi+1 (t).
dt
We assume that all of the probability starts at 0: pi (0) = δi0 . Notice that this can be written as
the system
.. .. ..
. . .
ṗ(t) = λ −2λ λ p(t) = Ap(t)
.. .. ..
. . .
where A is a tridiagonal and infinite in both directions. To solve for this, we use the probability
generating function. Define the probability generating function as
X∞ h i
g(t, z) = z i pi (t) = E z x(t)
i=−∞
where the summation is the Laurent series, which is the sum from 0 to infinity added with the sum
from -1 to negative infinity. Thus we use calculus and algebra to get
∞
X
∂g
= z i ṗi (t)
∂t
i=−∞
X∞
= z i [λpi−1 (t) + λpi+1 − 2λpi (t)]
i=−∞
X∞ ∞
X ∞
X
i i
= λ z pi−1 (t) + λ z pi+1 (t) − 2λ z i pi (t).
i=−∞ i=−∞ i=−∞
Notice that since the sum is infinite in both directions, we can trivially change the index and adjust
the amount of z appropriately, that is
∞
X ∞
X
z i pi−1 (t) = z z i pi (t) = zg(t, z)
i=−∞ i=−∞
21
∞
X ∞
i 1 X i 1
z pi+1 (t) = z pi (t) = g(t, z)
z z
i=−∞ i=−∞
and thus
∂g λ
= λz + − 2λ g(t, z).
∂t z
This is a simple linear differential equation which is solved as
−1 −2)t
g(t, z) = eλ(z+z .
g(k) (t, k)
pi (t) = |z=0 .
k!
Thus we can show by induction using this formula with our closed form solution of the probability
generating function that
∞
X (2λt)2m
pn (t) = e−2λt 2m m!(n + m)!
= e−2λt In (2λt)
m=0
2
This definition means that the probability of transition to another state simply depends on where
you are right now. This can be represented as a graph where your current state is the node i. The
probability of transition from state i to state j is simply Pij , the one-step transition probability.
Define
P1 (t)
−−→
P (t) = P (t) = ...
Pn (t)
22
as the vector of state probabilities. The way to interpret this is as though you were running many
simulations, then the ith component of the vector is the percent of the simulations that are currently
at state i. Since the transition probabilities only depend on the current state, we can write the
iteration equation
P (t + 1) = AP (t)
where
P11 P12 ··· P1n
P21 P22 ··· P2n
A= . .. .. .. .
.. . . .
Pn1 Pn2 · · · Pnn
Notice that this is simply a concise way to write the idea that at every timestep, Pi,j percent of
the simulations at state i transfer to state j. Notice then that the probabilities of transitioning at
any given time will add up to 1 (the probability of not moving is simply Pi,i ).
Definition: A Regular Markov Chain as a Markov Chain for which some power n of its
transition matrix A has only positive entries.
P ′ (t) = P (t)Q
where Q is a transition rate matrix which used to describe the flow of probability juices from state
i to the state j. Notice that
P ′ (t) = P (t)Q
P ′ (t + h) − P ′ (t)
= P ′ (t)Q
h
P (t + h) = (I + Qh)P (t)
P (t + h) = AP (t)
and thus we can think of a continuous-time Markov chain as a discrete-time Markov chain with
infinitely small timesteps and a transition matrix
A = I + Qh
Note that the transition rate matrix Q satisfies the following properties:
23
1. Transition flow between state i and j qij > 0 when i 6= j;
Property 1 is stating that the transition rate matrix is composed only “rates from i” and thus they
are all positive values. Property 2 is restating the relation between Q and A. Property 3 is stating
that the diagonal of Q is composed of the flows into the state i, and thus it will be a negative
number. One last property to note is that since
P ′ (t) = P (t)Q
we get that
P (t) = P (0)eQt
where eQt is the matrix exponential (defined by its Taylor Series expansion being the same as the
normal exponential). This means that there exists a unique solution to the time evolution of the
probability densities. This is an interesting fact to note: even though any given trajectory evolves
randomly, the way a large set of trajectories evolve together behaves deterministically.
0 = QP (t) = λ
since 0 is just a constant. Thus the vector P that satisfies this property is an eigenvector of Q. This
means that the eigenvector of Q corresponding to the eigenvalue of 0 gives the long run probabilities
of being in state i respectively.
24
3.8 The Differential Poisson Counting Process and the Stochastic Integral
We can intuitively define the differential Poisson Counting Process is the process that describes the
changes of a Poisson CounterNt (or equivalently N (t)). Define the differential stochastic process
dNt by its some integral Z t
Nt = dNt .
0
To understand dNt , let’s investigate some of its properties. Since Nt ∼ P oisson(λt), we know that
Z t
E [Nt ] = λt = E dNt .
0
Since E is defined as some kind of a summation or an integral, we assume that dNt2 is bounded
which, at least in the normal calculus, lets us swap the ordering of integrations. Thus
Z t
λt = E [dNt ] .
0
and thus
E [dNt ] = λ.
Notice this is simply saying that dNt represents the “flow of probability” that is on average λ.
Using a similar argument we also note that the variance of dNt = λ. Thus we can think of the
equation the term dNt as a kind of a limit, where
is probability of jumping k times in the increment dt which is given by the probability distribution
(λdt)k −λdt
P (k jumps in the interval (t, t + dt)) = e
k!
Then how do we make sense of the integral? Well, think about writing the integral as a Riemann
sum:
Z t n−1
X
dNt = lim (N (ti+1 ) − N (ti ))
0 ∆t→0
i=0
where ti = i∆t. One way to understand how this is done is algorithmically/computationally. Let’s
say you wanted to calculate one “stochastic trajectory” (one instantiation) of N (t). What we can
25
do is pick a time interval dt. We can get N (t) by, at each time step dt, sample a value from the
probability distribution
(λdt)k −λdt
P (k jumps in the interval (t, t + dt)) = e
k!
and repeatedly add up the number of jumps such that N (t) is the total number of jumps at time
t. This will form our basis for defining and understanding stochastic differential equations.
where f is some arbitrary function describing deterministic changes in time where g defines the
“jumping” properties. The way to interpret this is as a time evolution equation for X. As we
increment in time by ∆t, we add f (X(t), t) to X. If we jump in that interval, we also add g(X(t), t).
The probability of jumping in the interval is given by
(λdt)k −λdt
P (k jumps in the interval (t, t + dt)) = e
k!
Another way of thinking about this is to assume that the first jump happens at a time t1 . Then
X(t) evolves deterministically until t1 where it jumps by g(X(t), t), that is
!
lim X(t) = g lim X(t), t + lim X(t).
t→t+
1 t→t−
1 t→t−
1
Notice that we calculate the jump using the left-sided limit. This is known as Ito’s calculus and
it is interpreted as the jump process “not knowing” any information about the future, and thus it
jumps using only previous information.
We describe the solution to the SDE once again using an integral, this time we write it as
Z t Z t
X(t) = X(0) + f (X(t), t)dt + g(X(t), t)dN (t).
0 0
Notice that the first integral is simply a deterministic integral. The second one is a stochastic
integral. It can once again be understood as a Riemann summation, this time
Z t n−1
X
g(X(t), t)dNt = lim g(X(ti ), ti ) (N (ti+1 ) − N (ti )) .
0 ∆t→0
i=0
We can understand the stochastic part the same as before simply as the random amount of jumps
that happen in the interval (t, t + dt). However, now we multiply the number of jumps in the
interval by g(X(ti ), ti ), meaning “the jumps have changing amounts of power”.
26
3.10 Important Note: The Defining Feature of Ito’s Calculus
It is important to note that g is evaluated using the X and t before the jump. This is the defining
principle of the Ito Calculus and corresponds to the “Left-Hand Rule” for Riemann sums. Unlike
Newton’s calculus, the left-handed, right-handed, and midpoint summations do not converge to the
same value in the stochastic calculus. Thus all of these different ways of summing up intervals in
order to solve the integral are completely different calculi. Notably, the summation principle which
uses the midpoints
Z t X
n−1
g(X(ti+1 ), ti+1 ) + g(X(ti ), ti )
g(X(t), t)dNt = lim (N (ti+1 ) − N (ti )) .
0 ∆t→0 2
i=0
is known as the Stratonovich Calculus. You may ask, why choose Ito’s calculus? In some sense, it
is an arbitrary choice. However, it can be motivated theoretically by the fact that Ito’s Calculus is
the only stochastic calculus where the stochastic adder g does not “use information of the future”,
that is, the jump sizes do not adjust how far they will jump given the future information of knowing
where it will land. Thus, in some sense, Ito’s Calculus corresponds to the type of calculus we would
believe matches the real-world. Ultimately, because these give different answers, which calculus
best matches the real-world is an empirical question that could be investigated itself.
dXt = Xt dt
and thus X(t) = et before t1 . At t1 , we take this value and jump by Xt1 . Thus since immediately
before the jump we have a value et1 , immediately after the jump we have et1 + et1 = 2et1 . We once
again it begins to evolve as
dXt = Xt dt
but now with the initial condition X(t1 ) = 2et1 . We see that in this interval the linear equation
solves to X(t) = 2et . Now when we jump at t2 , we have the value 2et2 and jump by 2et2 to get 4et2
directly after the jump. Seeing the pattern, we get that
et , 0 ≤ t ≤ t1
2et , t1 ≤ t ≤ t2
X(t) = . .
.. ..
n t
2 e , tn ≤ t ≤ tn+1
27
3.12 Ito’s Rules for Poisson Jump Process
Given the SDE
n
X
dx(t) = f (x(t), t)dt + gi (x(t), t)dNi
i=1
where x ∈ R, Ni is a Poisson counter with rate λi , f : Rn → Rn , gi : Rn → Rn . Define Y = ψ(X, t)
as some random variable whose values are determined as a function of X and t. How do we find
dy(t), the time evolution of y? There are two parts: the determinsitic changes and the stochastic
jumps. The deterministic changes are found using Newton’s calculus. Notice using Newtonian
calculus that
∂ψ ∂t ∂ψ ∂x ∂ψ ∂ψ ∂ψ ∂ψ
∆Deterministic = + = + (dx)deterministic = + f (x).
∂t ∂t ∂x ∂t ∂t ∂x ∂t ∂x
The second part are the stochastic changes due to jumping. Notice that if the Poisson counter
process i jumps in the interval, the jump will change x from x to x + gi . This means that y will
change from ψ(x, t) to ψ(x+gi (x), t). Thus the change in y due to jumping is the difference between
the two times the number of jumps, calculated as
n
X
∆Jumps = [ψ(x + gi (x)) − ψ(x, t)]dNi ,
i=1
where dNi is the number of jumps in the interval. This approximation is not correct if some process
jumps multiple times in the interval, but if the interval is of size dt ([t, t + dt]), then the probability
that a Poisson process jumps twice goes to zero as dt → 0. Thus this approximation is correct for
infinitesimal changes. Putting these terms together we get
28
3.13 Dealing with Expectations of Poisson Counter SDEs
Take the SDE
m
X
dx = f (x, t)dt + gi (x, t)dNi
i=1
where Ni is a Poisson counter with rate λi . Notice that since Ni (t) ∼ P oisson(λt), we get
Also notice that because Ni (t) is a Poisson process, the probability Ni (t) will jump in interval
(t, t + h) is independent of X(σ) for any σ < t. This mean that, since the current change is
independent of the previous changes, E [g(x(t), t)dNi (t)] = E [g(x(t), t)] E [dNi ] = λi E [g(x(t), t)].
Thus we get the fact that
m
X
E [x(t + h) − x(t)] = E [f (x, s)h] + E [g(x(t), t)dNi (t)]
i=1
Xm
= E [f (x, t)] h + E [gi (x, t)] λi h
i=1
m
X
E [x(t + h) − x(t)]
= E [f (x, t)] + λi E [gi (x, t)]
h
i=1
29
and thus we get that the expectation changes as
dE x2
= −2E x2 + (2E [x] + 1) λ1 + (1 − 2E [x]) λ2 .
dt
To solve this, we would first need to complete solving the ODE for E [x], then plug that solution
into this equation to get another ODE, which we solve. However, notice that this has all been
changed into ODEs, something we know how to solve!
30
where the jump size is proportional to √1 . Then we have
λ
dE [xλ (t)] λ λ
= √ − √ =0
dt 2 λ 2 λ
We use Ito’s Rules with g = ±1 and the Binomial theorem to get
dE xpλ (t) 1 p 1 p
= E x+ √ − xp dN1 + x− √ − xp dN2
dt λ λ
p 1 p 1 p 1 p 1
p p−1 p−2
= E x + √ x + x + . . . − xp dN1 + xp − √ xp−1 + xp−2 + . . . − xp dN2
1 λ 2 λ 1 λ 2 λ
p 1 p−2 p 1 p−1 p 1 p−3
= E x + . . . (dN1 + dN2 ) + √ x + √ x + . . . (dN1 − dN2 )
2 λ 1 λ 3 λ3
p 1 p−2 λ λ
= E x +
2 λ 2 2
p
p−2
= E x
2
where we drop off all the higher order λ1 terms since the will go to zero as λ → ∞. This means
that in the limit we get
dE[xp (t)] p(p − 1)
= E[xp−2 (t)]
dt 2
Thus as λ → ∞ we can ignore higher order terms to get all of the odd moments as 0 and the even
moments as:
d 2
1. dt E[x (t)] =1
p! t
p
2. E[xp (t)] = p
! 2
2
2
Let σ 2 = t Then all moments match, so as λ → ∞ the random variable x∞ (t) will be Gaussian
with mean 0 and variance t. Thus we can think of x∞ (t) as a stochastic process whose probability
distribution starts as a squished Gaussian distribution which progressively flattens linearly with
time.
31
and thus Xm
dE [ψ] ∂ψ
=E f (x, t) + E [ψ(x + gi (x), t) − ψ(x, t)] λi .
dt ∂x
i=1
Recall that the definition of the expected value is
Z ∞
E [ψ(x)] = ρ(x, t)ψ(x)dx
−∞
and thus
Z ∞ Z ∞ m
X Z ∞
∂ρ ∂ψ
ψ(x)dx = f (x, t)ρ(x, t)dx + λi [ψ(x + gi (x), t) − ψ(x, t)]ρ(x, t)dx
−∞ ∂t −∞ ∂x −∞i=1
We next simplify the equation term by term using integration by parts. What we want to get
is every term having a ψ(x) term so we can group all the integrals. Thus take the first integral
Z ∞
∂ψ
f ρdx.
−∞ ∂x
32
Z Z m Z ∞ Z ∞
−∞
∂p −∞
∂(f ρ) X ρ(g̃i−1 (x), t)
ψdx = − ψdx + λi [ ψ(x) dx − ψpdx].
∞ ∂t ∞ ∂x
i=1 −∞ |1 + gi′ (g̃i−1 (x)| −∞
Since ψ(x) we arbitrary, let ψ be the indicator for the arbitrary set A, that is
(
1, if x ∈ A
ψ(x) = IA (x) =
0 o.w.
for any A ⊂ R. Thus, in order for this to be satisfied for all subsets of the real numbers, the
integrand must be identically zero. This means
m
∂ρ ∂(f ρ) X ρ(g̃i−1 (x), t)
+ − λi −ρ =0
∂t ∂x |1 + gi′ (g̃i−1 (x)|
i=1
which we arrange as
m
∂p ∂(f p) X ρ(g̃i−1 (x), t)
=− + λi −ρ .
∂t ∂x
i=1
|1 + gi′ (g̃i−1 (x)|
This equation describes the time evolution of the probability density function ρ(x, t) via a deter-
ministic PDE.
33
4 Introduction to Stochastic Processes: Brownian Motion
In this chapter we will use the properties of Poisson Counting Processes in order to define Brow-
nian Motion and derive a calculus for dealing with stochastic differential equations written with
differential Wiener/Brownian terms.
where ti = i∆t Notice once again that we are evaluating g using the Left-hand rule as this is the
defining feature of Ito’s Calculus: it does not use information of the future.
34
Recall that this definition is defined by the bidirectional Poisson Counter. Can we then un-
derstand the interval as a number of jumps? Since since the rate of jumps is infinitely high, we
can think of this process as making infinitely many infinitely small jumps in every interval of time.
Thus we cannot understand the interval as the “number of jumps” because infinitely many will
occur! However, given the proof from 3.13.3, we get that W (t) ∼ N (0, t). Thus we can think of
(dWi+1 − dWi ) ∼ N (0, dt), that is, the size of the increment is normally distributed. Algorithmi-
cally solving the integral by taking a normally distributed random number with variance dt and
multiplying it by g to get the value of g(X, t)dWt over the next interval of time. Using Ito’s Rules
for Wiener Processes (which we will derive shortly) we can easily prove that
h i
1. E (W (t) − W (s))2 = t − s for t > s.
Notice that this means E[(W (t + ∆t) − W (t))2 ] = ∆t and thus in the limit as ∆t → 0, then
E[(W (t + ∆t) − W (t))2 ] → 0 and thus the Wiener process is continuous almost surely (with prob-
ability 1). However, it can be proven that it is not differentiable with probability 1! Thus dWt is
some kind of abuse of notation because the derivative of Wt does not really exist. However, we can
still use it to understand the solution to an arbitrary SDE
as the integral Z Z
t t
Xt = X0 + f (Xt , t)dt + g(Xt , t)dWt .
0 0
We now define Yt = ψ(x, t). Using Ito’s Rules for Poisson Jump Processes, we get that
Xn Xn
∂ψ ∂ψ(x, t) gi (x) gi (x)
dYt = dψ(x, t) = dt + f (x, t)dt + ψ x+ √ − ψ(x, t) dNi + ψ x− √ − ψ(x, t) dN−i .
∂t ∂x i=1 λ i=1 λ
35
To simplify this, we expand ψ by λ to get
gi (x) gi (x) g 2 (x) 3
ψ x+ √ = ψ(x) + ψ ′ (x) √ + ψ ′′ (x) i + O(λ− 2 )
λ λ λ
gi (x) gi (x) g 2 (x) 3
ψ x− √ = ψ(x) − ψ ′ (x) √ + ψ ′′ (x) i + O(λ− 2 )
λ λ λ
and thus
Xn Xn
∂ψ ∂ψ(x, t) gi (x) 1
dψ(x) = dt + f (x, t)dt + ψ′ (x) √ (dNi − dN−i ) ψ′′ (x)gi2 (x) (dNi + dN−i ) .
∂t ∂x i=1 λ i=1
λ
Let us take a second to justify dropping off the higher order terms. Recall that the Wiener
3
process looks at the limit as λ → ∞. In the expansion, the terms dropped off are O(λ− 2 ). Recall
that
λ
E [dNi ] = .
2
Thus we see that the expected contribution of these higher order terms is
h 3
i 1
lim E O(λ− 2 )dNi = lim O(λ− 2 ) = 0.
λ→∞ λ→∞
λ
V [dNi ] = E [dNi ] = .
2
Thus we get that h i
3 1
lim V O(λ− 2 )dNi = lim O(λ− 2 ) = 0.
λ→∞ λ→∞
Therefore, since the variance goes to zero, the contribution of these terms are not stochastic. Thus
the higher order terms deterministically make zero contribution (in more rigorous terms, they make
no contribution to the change with probability 1). Therefore, although this at first glace looked
like an approximation, we were actually justified in only taking the first two terms of the Taylor
series expansion.
Sub-Calculation
To simplify this further, we need to do a small calculation. Define
1
dzi = (dNi + dN−i ) .
λ
36
Notice that
dE [zi ] 1 λ λ
= + =1
dt λ 2 2
which means
E [zi ] = t.
Using Ito’s rules
!
1 2
dzi2 = zi + 2
− zi (dNi + dN−i )
λ
2zi 1
= + 2 (dNi + dN−i )
λ λ
and thus
dE zi2 2E [zi ] 1 λ λ
= + 2 +
dt λ λ 2 2
2t 1
= + 2 λ
λ λ
1
= 2t +
λ
to make
t
E zi2 = t2 + .
λ
This means that
t t
V [zi ] = E zi2 − E [zi ]2 = t2 + − t2 = .
λ λ
Thus look at
Z = lim zi .
λ→∞
37
Solution for Ito’s Rules
Return to
n n
g 2 (x)
X gi (x) X gi (x) 1
dψ(x) = ψ ′ (x)(f (X, t)dt+ ψ ′ (x) √ (dNi − dN−i )+ ψ ′′ (x)gi2 (x) −ψ ′ (x) √ + ψ ′′ (x) i (dNi + dN−i ) .
i=1 λ i=1 λ λ λ
Notice that the last term is simply zi . Thus we take the limit as λ → ∞ to get that
n
X n
X
dψ(x) = ψ ′ (x)(f (X, t)dt + ψ ′ (x)gi (x)dWt + ψ ′′ (x)gi2 (x)dt
i=1 i=1
where hx, yi is the dot product between x and y and ∇2 is the Hessian.
If we let
dt × dt = 0
dWi × dt = 0
dt : i = j
dWi × dWj =
0 : i 6= j
then this simplifies to
n
! n
∂ψ ∂ψ 1 X 2 ∂2ψ ∂ψ X
dψ(x, t) = + f (x, t) + gi (x, t) 2 dt + gi (x, t)dWi
∂t ∂x 2 ∂x ∂x
i=1 i=1
38
which is once again Ito’s Rules. Thus we can think of Ito’s Rules is saying that dt2 is sufficiently
small, dt and dWi are uncorrelated, and dWi2 = dt which means that the differential Wiener process
squared is a deterministic process. In fact, we can formalize this idea as the defining property of
Brownian motion. This is captured in Levy Theorem which will be stated in 6.7.
where Wi (t) is a standard Brownian motion. We have showed that Ito’s Rules could be interpreted
as:
dt × dt = 0
dWi × dt = 0
dt : i = j
dWi × dWj =
0 : i 6= j
and thus if y = ψ(x, t), Ito’s rules can be written as
∂ψ ∂ψ 1 ∂2ψ
dy = dt + dx + (dx)2
∂t ∂x 2 ∂x2
where, if we plug in dx, we get
n
! n
∂ψ ∂ψ 1 X 2 ∂2ψ ∂ψ X
dy = dψ(x, t) = + f (x, t) + gi (x, t) 2 dt + gi (x, t)dWi
∂t ∂x 2 ∂x ∂x
i=1 i=1
Note that we can also generalize Ito’s lemma to the multidimensional X ∈ Rn case:
m
X m
∂ψ ∂ψ 1X
dψ(X) = , f (X) dt + , gi (X) dWi + gi (X)T ∇2 ψ(X)gi (X)dt
∂X ∂X 2
i=1 i=1
There are many other facts that we will state but not prove. These are proven using Ito’s Rules.
They are as follows:
39
1. Product Rule: d(Xt Yt ) = Xt dY + Yt dX + dXdY .
Rt Rt Rt
2. Integration By Parts: 0 Xt dYt = Xt Yt − X0 Y0 − 0 Yt dXt − 0 dXt dYt .
h i
3. E (W (t) − W (s))2 = t − s for t > s.
5. Independent Increments: E [(Wti − Ws1 ) (Wt2 − Ws2 )] = 0 if [t1 , s1 ] does not overlap [t2 , s2 ].
hR i
t
6. E 0 h(t)dWt = E [h(t)dWt ] = 0.
hR i hR i
T T
7. Ito Isometry: E 0 Xt dWt =E 0 Xt2 dt
dx = αxdt + σxdW.
To solve this, we start with our intuition from Newton’s Calculus that this may be an exponential
growth process. Thus we check Ito’s equation on the logarithm ψ(x, t) = ln(x) for this process is,
1 1 2 2 1 1
d (ln x) = 0 + (αx) − σ x 2
dt + (σx) dW
x 2 x x
1 2
d (ln x) = α − σ dt + σdW.
2
Notice that since the Wiener process W (t) ∼ N (0, t), the log of x is distributed normally as
N ( α − 21 σ 2 t, σ 2 t). Thus x(t) is distributed as what is known as the log-normal distribution.
40
4.6 Kolmogorov Forward Equation Derivation
The Kolmogorov Forward Equation, also known to Physicists as the Fokker-Planck Equation, is
important because it describes the time evolution of the probability density function. Whereas the
stochastic differential equation describes how one trajectory of the stochastic processes evolves, the
Kolmogorov Forward Equation describes how, if you were to be running many different simulations
of the trajectory, the percent of trajectories that are around a given value evolves with time.
We will derive this for the arbitrary drift process:
m
X
dx = f (x)dt + gi (x)dWi .
i=1
Take the expectation of both sides. Because expectation is a linear operator, we can move it inside
the derivative operator to get
"m #
d ∂ψ 1 X ∂2ψ
E [ψ(x)] = E f (x) + E g2 (x) .
dt ∂x 2 ∂x2 i
i=1
Notice that E [dWi ] = 0 and thus the differential Wiener terms dropped out.
Recall the definition of expected value is
Z ∞
E[ψ(x)] = ρ(x, t)ψ(x)dx
−∞
where ρ(x, t) is the probability density of equaling x at a time t. Thus we get that the first term as
Z ∞
d ∂ρ
E [ψ(x)] = ψ(x)dx
dt −∞ ∂t
41
In order for the probability distribution to be bounded (which it must be: hit must integrate
i to 1),
∂ψ
ρ must vanish at both infinities. Thus, assuming bounded expectation, E ∂x f (x) < ∞, we get
that
Z ∞
∂ψ ∂(ρf )
E f (x) = − ψ(x)dx.
∂x −∞ ∂x
The next term we manipulate similarly,
Z ∞
∂ψ 2 2 ∂2ψ 2
E g (x) = g (x)ρ(x, t)dx
∂x2 i −∞ ∂x
2 i
∞ Z ∞
2 ∂ψ ∂ gi2 (x)ρ(x, t) ∂ψ
= ρg − dx
∂x −∞ −∞ ∂x ∂x
∞ " #∞ Z
2 ∂ψ ∂ ρg2 1 ∞ ∂ 2 gi2 (x)ρ(x, t)
= ρg − ψ + ψ(x)dx
∂x −∞ ∂x 2 −∞ ∂x2
−∞
Z ∞ 2 2
1 ∂ gi (x)ρ(x, t)
= 2 ψ(x)dx
−∞ ∂x2
where we note that, at the edges, the derivative of ρ converges to zero since ρ converges to 0 and
thus the constant terms vanish. Thus we get that
Z ∞ Z ∞ Z
∂ρ ∂(ρf ) 1 X ∞ ∂(gi2 ρ)
ψ(x)dx = − ψ(x)dx + ψ(x)dx
−∞ ∂t −∞ ∂x 2 −∞i
∂x
Since ψ(x) is arbitrary, let ψ(x) = IA (x), the indicator function for the set A:
(
1, x ∈ A
IA (x) =
0 o.w.
42
which we re-arrange as
m
∂ρ(x, t) ∂ 1 X ∂2 2
= − [f (x)ρ(x, t)] + 2
gi (x)ρ(x, t) ,
∂t ∂x 2 ∂x
i=1
dx = −xdt + dWt ,
where Wt is Brownian motion. The Forward Kolmogorov Equation for this SDE is thus
∂ρ ∂ 1 ∂2
= (xρ) + ρ(x, t)
∂t ∂x 2 ∂x2
Assume that the initial conditions follow the distribution u to give
ρ(x, 0) = u(x)
and the boundary conditions are absorbing at infinity. To solve this PDE, let y = xet and apply
Ito’s lemma
dy = xet dt + et dx = et dW
and notice this follows has the Forward Kolmogorov Equation
∂ρ e2t ∂ 2 ρ
= .
∂t 2 ∂y 2
which is a simple form of the Heat Equation. If ρ(x, 0) = δ(x), the Dirac-δ function, then we know
2t
this solves as a Gaussian with diffusion constant e2 to give us
1 y2
ρ(y, t) = √ e− 2e2t t .
2πe2t t
and thus y ∼ N (0, e2t t). To get the probability density function in terms of x, we would simply do
the pdf transformation as described in 2.9.
dx = axdt + dWt .
43
Notice that
dE [x]
= aE [x]
dt
and thus
E [x] = E [x0 ] eat
which converges iff a < 0. Also notice that
dx2 = 2xdx + dt
= 2x(axdt + dWt ) + dt
= (2ax2 + 1)dt + dWt
and thus
dE x2
= 2aE x2 + 1
dt
which gives
2 2 1
E x = E x0 + e2at
2a
which converges iff a < 0. This isn’t a full proof, but it motivates the idea that if the deterministic
coefficient is less than 0, then, just as in the deterministic case, the system converges and is thus
stable.
44
can be found at the steady state using the formula
and thus
x(t + ∆t) = x(t) + f (x, t)∆t
defines a recursive solution for the value of x at a time t given values of x at previous times. All
that is left for the approximation is some initial condition needs to be started, such as x(0) = y
and thus problem is solved iteratively.
Now we take the stochastic differential equation
45
Once again, we define some small fixed constant ∆t and write
46
We can think of a different type of convergence, known as the weak convergence, as
Notice how different these ideas of convergence are. Weak convergence means that the average
trajectory we computationally simulate does ∆tβ good, where as strong convergence means that
every trajectory does ∆tγ good. Thus if we are looking at properties defined by ensembles of
trajectories, then weak convergence is what we are looking at. However, if we want to know how
good a single trajectory is, we have to look at the strong convergence.
Here comes the kicker. For the Euler-Maruyama method, it has a strong convergence of order
1
2 and a weak convergence of order 1. That’s to say it has a slower convergence than the Euler
method, and the average properties only converge as fast as the Euler method! Thus, in practice,
this method is not very practical given its extremely slow convergence.
Notice that we have added a − 21 g(X, t)gx (X, t)dt term in order to cancel out the 12 g(X, t)gx (X, t) (dWt )2
term in the stochastic Taylor series expansion. Notice too that we only sample one random number,
but the second order term uses that random number squared. This method is known as Milstein’s
method. As you may have guessed, since it now accounts for all order 1 effects, it has order 1 strong
and weak convergence.
47
∆t
X(t + ∆t) = X(t) + f ∆t + g∆Wt + ggx (∆Wt )2 − ∆t
2
1 1 2
+ gfx ∆Ut + f fx + g fxx ∆t2
2 2
1
+ f gx + g2 gxx (∆Wt ∆t − ∆Ut )
2
1 1 2
+ g (ggx )x (∆Wt ) − ∆t ∆Wt
2 3
where √
∆Wt = ∆tηi
and Z t+∆t Z s
∆Ut = dWs ds
t t
can be written as
1
∆Ut = ∆t3 λi
3
where λi ∼ N (0, 1), a standard normal random variable (that is not ηi !).
48
with stages
s
X
(0) (0) (0) (0)
Hi = Un + Aij f tn + cj ∆t, Hj ∆t (5.2)
j=1
s
X
(0) (1) (1) I(1,0)
+ Bij g tn + cj ∆t, Hj ,
∆t
j=1
s
X
(1) (1) (0) (0)
Hi = Un + Aij f tn + cj ∆t, Hj ∆t (5.3)
j=1
s
X √
(1) (1) (1)
+ Bij g tn + cj ∆t, Hj ∆t
j=1
2. αT B (0) e = 1 T 2
10. β (3) B (1) e = −1
2 3
3. αT B (0) e = 2 T 2
11. β (4) B (1) e = 2
T
4. β (1) A(1) e = 1 T
12. β (1) B (1) B (1) e = 0
5. β (2)T A(1) e = 0
T
T 13. β (2) B (1) B (1) e = 0
6. β (3) A(1) e = −1
T
T
7. β (4) A(1) e = 0 14. β (3) B (1) B (1) e = 0
T 2 T
8. β (1) B (1) e = 1 15. β (4) B (1) B (1) e = 1
49
1 (1)T (1) (0) 1 (3)T (1) (0)
16. β A B e + β A B e =0
2 3
These methods are the (Rößler) SRI methods. We will refer to the algorithms by the tuple of
44 coefficients A0 , B0 , β (i) , α . Note that this method can be easily extended to multiple Ito
dimensions in the case of diagonal noise with similar results. We only focus on a single Ito dimension
for simplicity of notation (though our results will extend to higher Ito dimensions as well in the
trivial manner). To satisfy the conditions, Rößler proposed the following scheme known as SRIW1:
T T
αT β (1) β (2)
T T
β (3) β (4)
T T
α̃T β̃ (3) β̃ (4)
3 3 3
4 4 2
0 0 0 0 0
0 0 0 0 0 0 0
1 1 1
4 4 2
1 1 0 -1 0
1 1 1
4 0 0 4 -5 3 2
1 2 4 2 4
3 3 0 0 -1 3 3 0 -1 3 − 13 0
2 − 43 − 32 0 2 5
3 − 23 1
1 1
2 2 0 0 0 0 0 0 0 0 0 0
50
In the case where noise is additive, the methods can be vastly simplified to
Xs s
X
(0) (0) (1) (2) I(1,0) (1)
Un+1 = Un + αi f tn + ci ∆t, Hi ∆t + βi I(1) + βi g(tn + ci ∆t) (5.4)
∆t
i=1 i=1
with stages
s
X s
X I
(0) (0) (0) (0) (0) (1) (1,0)
Hi = Un + Aij f tn + cj ∆t, Hj ∆t + Bij g tn + cj ∆t (5.5)
∆t
j=1 j=1
The coefficients A0 , B0 , β (i) , α must satisfy the conditions for order 1:
T T
1. αT e = 1 2. β (1) e = 1 3. β (2) e = 0
where c(0) = A(0) e with f ∈ C 1,3 (I × Rd , Rd ) and g ∈ C 1 (I, Rd ). These are the (Rößler) SRA
methods. From these conditions he proposed the following Strong Order 1.5 scheme known as
SRA1:
T T
αT β (1) β (2)
3 3 1
4 4 2
1 2
3 3
1 0 -1 1
51
estimate, it was shown that a natural error estimator exists for any high-order SRK method. A
simplified version is simply:
E = |δED + EN | (5.6)
Xs
(0) (0)
≤ δ ∆t f tn + ci ∆t, Hi
i=1
s
X
(3) I(1,0) (4) I(1,1,1) (1) (1)
+ βi + βi g tn + ci ∆t, Hi
∆t ∆t
i=1
where s is the number of stages and δ is a user-chosen balance between determinsitic and noise
error in the error estimate. A similar summation gives the estimate for additive noise equations.
With the error estimate, the overall algorithm is depicted as:
Accept if q≥1
Compute a step of Estimate the error
Use E to
size h using two E using the difference
calculate
different integration between the methods
q (Eq 21)
methods (e.g. Eq 2) (Sections 2 and 3) Reject if q<1
Update h:=q*h Save the information for Use the Brownian Bridge
the change of W over to determine a value
the interval (t+qh,t+h) for W(t+qh) (Eq 22)
52
For the acceptance/rejectance of the step, care must be taken to not bias the Wiener process. If
extreme values of ∆W are always thrown out then the sample properties are no longer valid. Thus
we must always keep any calculated value of ∆W . The procedure has to be enhanced as follows.
First, propose a step with ∆W P and ∆Z P for a timestep h. If these are rejected, we wish to instead
attempt a step of size qh. Thus we need to sample a new value at W (tn + qh) using the known
values of W (tn ) and W (tn +h). To do so, we use the result that if W (0) = 0 and W (h) = L, then by
the properties of the Brownian Bridge we calculate that for q ∈ (0, 1), W (qh) ∼ N (qL, (1 − q)qh).
We then propose to step by qh and take the random numbers ∆W = W (qh) and ∆Z = Z(qh)
found via their appropriate distribution from the Brownian bridge. We then store the modified
versions of ∆W P and ∆Z p . Notice that since we have moved ∆W in the qh timestep, what remains
is ∆W = ∆W P − ∆W and ∆Z = ∆Z P − ∆Z as the change in the Brownian path from qh to h.
We then store the values L = 1 − qh, ∆W , ∆Z as a 3-tuple in a stack to represent that after our
current calculations, over the next interval of size L1 , the Brownian process W will change by L2
and the process Z will change by L3 . Thus when we finally get to tn + qh, we look at these values
to tell us how the Brownian path changes over the next 1 − qh time units. By doing so, we will
effectively keep the properties of the Brownian path while taking arbitrary steps. This leads to the
RSwM1 algorithm. More complex handling of the timestep, but using the same general idea, leads
to the more efficient RSwM3 algorithm. Note that these stepping routines are compatible with any
high order SDE method as long as some error estimator exists.
This can thus be a way to solve for the probability density using computational PDE solvers. If we
have an initial condition, X(0) = x0 , then this corresponds to the initial condition ρ(x, 0) = δ(x−x0 )
where δ is the Dirac-δ function. This can be particularly useful for first-passage time problems,
where we can set the boundary conditions as absorbing: ρ(a, t) = ρ(b, t) = 0, and thus ρ is the
total probability distribution of trajectories that have not been absorbed. Likewise, if we want
a boundary condition where we say “trajectories reflect off the point a”, then we simply use a
∂ρ
condition ∂x |x=a = 0.
However, though this method may be enticing to those who are experienced with computational
PDE solvers, this method is not the holy grail because it cannot simulate trajectories. If you need
53
the trajectory of a single instantiation of the process, for example, “what does the stock price over
time look like?”, you cannot use this method.
1. ∅ ∈ F,
2. A ∈ F ⇒ AC ∈ F,
S
3. A1 , A2 , . . . ∈ F ⇒ ∞n=1 An ∈ F.
Propositions:
1. Ω ∈ F.
T∞
2. A1 , A2 , . . . ∈ F ⇒ n=1 An ∈ F.
To prove proposition 1, notice that since ∅ ∈ F, by property 2 the compliment of the empty set,
Ω, must be in F. Proposition two follows by DeMorgan’s Law. By property 3 the union is in F,
and so the union of the compliments are in F by applying property 2 to each component. Thus
the compliment of the union of the complements is in F by property 2. By DeMorgan’s Law,
the intersection of the complement of the complements, or simply the intersection, must be in F
proving the proposition.
Definition: Given Ω and F, a probability measure P is a function F → [0, 1] such that:
1. P (Ω) = 1
S P∞
2. A1 , A2 , , . . . is a sequence of disjoint subsets in F ⇒ P ( ∞n=1 An ) = i=1 P (An ) (Countable
additivity)
54
Definition: (Ω, F, P ) is a probability space. S T∞
Proposition: A1 ⊂ A2 . . . , ⊂ Ai , . . . ∈ F, then P ( ∞n=1 An ) = limn→∞ P (An ) and P ( n=1 An ) =
limn→∞ P (An ) .
These propositions come directly from the properties of measures. See a measure theory text
for more information if you do not know and you’re Curious George. There are many possibilities
for what kind of sample space, Ω, we may want to consider:
1. Ω can be finite.
P [a, b] = b − a, 0 ≤ a ≤ b ≤ 1.
whereΩ = [0, 1] and F is the set of all closed intervals of Ω. We note here that the Borel σ-algebra,
B[0, 1], which is F, also contains all open intervals as constructed by
∞
[
1 1
(a, b) = a + ,b − .
n n
n=1
Thus the Borel sets on the real line are the closed and opened sets! Thus, all of the sets you would
ever use in practice are in the Borel set σ-algebra.
55
Example: for the infinite coin-toss sample space Ω_∞ (sequences ω = ω_1 ω_2 · · · of H's and T's), we can build a filtration:

• F_0 = {∅, Ω_∞}, P(∅) = 0, P(Ω_∞) = 1

• F_1 = {∅, Ω_∞, A_H, A_T} where

  – A_H = {ω ∈ Ω_∞ : ω_1 = H}

  – A_T = {ω ∈ Ω_∞ : ω_1 = T}

• F_2 = {∅, Ω_∞, A_H, A_T, A_HH, A_TT, A_HT, A_TH, A_HH^C, A_TT^C, A_HT^C, A_TH^C, A_HH ∪ A_TT, . . . , A_HT ∪ A_TH, . . .} where

  – A_HH = {ω ∈ Ω_∞ : ω_1 = H, ω_2 = H}

  – A_HT = {ω ∈ Ω_∞ : ω_1 = H, ω_2 = T}

  – . . .

We can see that F_0 ⊂ F_1 ⊂ F_2 ⊂ · · · and that the cardinality can grow very quickly; in fact |F_n| = 2^(2^n). We can define

F_∞ = σ(⋃_{i=1}^∞ F_i).

Notice that

• Question 1: Is A ∈ F_∞?

  ω ∈ A ⟺ ω ∈ ⋂_{m=1}^∞ ⋃_{N=1}^∞ ⋂_{n=N}^∞ A_{n,m}, so A = ⋂_{m=1}^∞ ⋃_{N=1}^∞ ⋂_{n=N}^∞ A_{n,m}. Since each A_{n,m} ∈ F_∞ and F_∞ is closed under countable unions and intersections, A ∈ F_∞.

• Question 2: What is P(A)? By the Strong Law of Large Numbers, P(A) = 1.
6.2 Random Variables and Expectation
Definition: A random variable is a real valued function X : Ω → R ∪ {∞, −∞} with the property that for every Borel subset B of R, {X ∈ B} = {ω ∈ Ω : X(ω) ∈ B} ∈ F.

Definition: A measure is a nonnegative countably additive set function; that is, a function µ : F → [0, ∞] with µ(∅) = 0 and µ(⋃_{n=1}^∞ A_n) = Σ_{n=1}^∞ µ(A_n) for disjoint A_1, A_2, . . . ∈ F.

[Figure: binary tree of the stock price S_n over successive coin tosses, with branches labeled H and T and node values 8, 4, 2, and 1.]

Thus

• S_1(ω) = 8 if ω_1 = H; 2 otherwise;

• . . .
6.2.2 Example: Uniform Random Variable
A random variable X uniformly distributed on [0, 1] can be simulated based on the example of
infinite independent coin tosses with p = 0.5. To do so, let Y_n(ω) be the indicator that the nth coin toss is a heads, that is Y_n(ω) = 1 if ω_n = H and 0 otherwise. Thus we define the random variable

X = Σ_{n=1}^∞ Y_n(ω) / 2^n.

Notice if we look at the binary expansion of X, that would be the sequence of heads and tails where a 1 in the ith digit means the ith toss was a heads. Thus we see that the range of X is [0, 1] and every binary expansion has equal probability of occurring, meaning that X is uniformly distributed on [0, 1]. Thus we get that the probability of being between a and b is
µX (a, b) = b − a, 0 ≤ a ≤ b ≤ 1.
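As a quick sanity check of this construction, one can build X from simulated coin flips and verify that the empirical distribution is flat. This is only a sketch; the truncation at 53 flips (roughly the resolution of a double) and the sample size are illustrative choices:

# Construct X = sum_n Y_n / 2^n from simulated fair coin flips and check
# that the resulting samples look uniform on [0, 1].
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_flips = 100_000, 53
Y = rng.integers(0, 2, size=(n_samples, n_flips))   # coin flips: 1 = heads
weights = 0.5 ** np.arange(1, n_flips + 1)           # 1 / 2^n
X = Y @ weights                                      # truncated binary expansion

# Empirical check of mu_X(a, b) = b - a on a few intervals.
for a, b in [(0.0, 0.5), (0.25, 0.75), (0.1, 0.2)]:
    frac = np.mean((X >= a) & (X <= b))
    print(f"P({a} <= X <= {b}) ~ {frac:.3f}  (exact {b - a})")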
we know that almost all of the time f(x) = 1 (since the rationals are a countable set while the irrationals are an uncountable set (a larger infinity)). This is however not computable using the Riemann integral. But, using the Lebesgue integral we get

∫_0^1 f dµ = 1 · µ(A)

where µ(A) means the measure (the length) of the set where f(x) = 1. Since there are only countably many holes, µ(A) = µ([0, 1]) = 1. Thus

∫_0^1 f dµ = 1
which matches our intuition. This may sound scary at first, but if you're not comfortable with these integrals you can understand the following theory by just reminding yourself that it's simply another way to integrate that works on weirder sets than the Riemann integral.
4. For any general X, we define ∫_Ω X dP = ∫_Ω X^+ dP − ∫_Ω X^- dP, where X^+ = max{0, X} and X^- = max{−X, 0}. X is integrable if ∫_Ω |X| dP < ∞.
Note that the definition of almost surely (a.s.) will be explained soon.
Define
f (x) = lim fn (x).
n→∞
or equivalently
E(X) = lim E(Xn ).
n→∞
Theorem: Dominated Convergence Theorem. Let {X_n}_{n=1}^∞ be a sequence of random variables converging almost surely (a.s.) to a random variable X. Assume that there exists a random variable Y s.t. |X_n| ≤ Y a.s. for all n, and E(Y) < ∞. Then E(X) = lim_{n→∞} E(X_n).

Theorem: Fatou's Lemma. Let {X_n}_{n=1}^∞ be a sequence of nonnegative random variables converging almost surely (a.s.) to a random variable X. Then E(X) ≤ lim inf_{n→∞} E(X_n).
Definition: Let Ω 6= ∅. Let T be a fixed positive number. Assume that for all t ∈ [0, T ], there
exists a σ-algebra f (t). Assume that if s < t, f (s) ⊂ f (t). We define the collection of σ-algebras
FT = {f (t)}0≤t≤T as the filtration at time T . It is understood intuitively as the complete set of
information about the stochastic process up to time T .
Now we need to bring random variable definitions up to our measure-theoretic ideas.
Definition: Let X be a random variable on (Ω, f, p). The σ-algebra generated by X is the set σ(X) = {X^{-1}(B) : B a Borel subset of R}.
6.3.2 Independence
Definition: Take two events A,B∈ f . The events A and B are independent if P (A ∩ B) =
P (A)P (B).
Definition: Let G, H ⊂ f be two sub-σ-algebras of f . G and H are independent if, for all
A ∈ G and B ∈ H, P (A ∩ B) = P (A)P (B).
Definition: Take the random variables X and Y on the probability space (Ω, f, p). X and Y are independent if σ(X) and σ(Y) are independent.

These definitions are straightforward; they are stated mostly so that the measure-theoretic framework feels familiar and easy.

Definition: Given X and a sub-σ-algebra G, the conditional expectation E[X|G] is the random variable satisfying:

1. E[X|G] is G-measurable.

2. Partial averaging: ∫_A E[X|G] dP = ∫_A X dP for all A ∈ G.
Let us understand what this definition means. We can interpret E[X|G] to be the random variable
that is the “best guess” for the values of X given the information in G. Since the only information
that we have is the information in G, this means that E[X|G] is G-measurable. Our best guess for
the value of X is the expectation of X. Notice that in measure-theoretic probability, the expectation of E[X|G] over the event A is defined as ∫_A E[X|G] dp. Thus the partial averaging property is simply saying that we have adapted the random variable E[X|G] such that its expectation for every event in G is the same as the expectation of X, which means that for anything that has happened, our best guess is simply what X was!
Definition: Take the random variables X and Y . We define E[X|Y ] = E[X|σ(Y )].
This gives us our link back to the traditional definition of conditional expectations using random
variables. Notice that this is how we formally define functions of random variables. Writing the conditional expectation as E[X|Y = y_i] = x_i, we can think of this as a mapping from y_i to x_i. Thus we can measure-theoretically interpret E[X|Y] as a function f(Y).
(a) This can be generalized: For all A ⊂ R, if X ∈ A a.s., then E[X|G] ∈ A a.s.
5. Taking out what is known: If X is G-measurable, then E[XY|G] = X E[Y|G]. Notice that this is because if X is G-measurable, it is known given the information of G and thus can be treated as a constant.
[Figure: two-period binomial tree for the stock price. Starting from S_0, the first toss ω_1 sends the price to S_0 u (H) or S_0 d (T), and the second toss ω_2 sends it to S_0 u^2 (probability p^2), S_0 ud (probability p(1 − p), reached two ways), or S_0 d^2 (probability (1 − p)^2).]
Let’s say we have the information given by the filtration G and we let X be our earnings from
the game after two coin-flips. We explore the properties in the following scenarios.
1. G = F_0. Recall that F_0 = {Ω_∞, ∅} and thus it literally contains no information. Thus we have that X is independent of the information in F_0, so E[X|F_0] = E[X]. We can calculate this using the traditional measures:

E[X] = Σ_i x_i Pr(X = x_i) = S_0 [p^2 u^2 + 2p(1 − p)ud + (1 − p)^2 d^2].
2. G = F1 . Recall that F1 is the information contained after the first event. Thus we know the
result of the first event, ω1 . Thus we can calculate what we would expect X to be given what we
know about the first event. Thus
E[X|F_1] = S_0 [p u^2 + (1 − p) u d]  if ω_1 = H;    E[X|F_1] = S_0 [p u d + (1 − p) d^2]  if ω_1 = T.
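The following sketch computes E[X] and E[X|F_1] by brute-force enumeration of the four outcomes, matching the formulas above. The values of S_0, u, d, and p are illustrative:

# Brute-force the conditional expectations on the two-flip binomial tree.
from itertools import product

S0, u, d, p = 4.0, 2.0, 0.5, 0.5    # illustrative values

def payoff(omega):
    # Stock value after two flips: multiply by u for H and by d for T.
    s = S0
    for w in omega:
        s *= u if w == "H" else d
    return s

def prob(omega):
    # Probability of a particular path of flips.
    q = 1.0
    for w in omega:
        q *= p if w == "H" else 1 - p
    return q

outcomes = list(product("HT", repeat=2))
EX = sum(prob(w) * payoff(w) for w in outcomes)
print("E[X] =", EX)   # equals S0 [p^2 u^2 + 2p(1-p)ud + (1-p)^2 d^2]

for first in "HT":
    paths = [w for w in outcomes if w[0] == first]
    cond = sum(prob(w) * payoff(w) for w in paths) / sum(prob(w) for w in paths)
    print(f"E[X | omega_1 = {first}] =", cond)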
6.4 Martingales
Definition: Take the probability space (Ω, F, p) with the filtration {F_t}_{0≤t≤T}, and let {M_t}_{0≤t≤T} be a stochastic process adapted to {F_t}. M_t is a martingale if E[M_t|F_s] = M_s for all 0 ≤ s ≤ t ≤ T.
If E[Mt |Fs ] ≥ Ms then we call Mt a sub-martingale, while if E[Mt |Fs ] ≤ Ms then we call Mt a
super-martingale.
These definitions can be interpreted as follows. If, knowing the value at time s, our best guess
for Mt is that it is the value Ms , then Mt is a martingale. If we assume that it will grow in
expectation in time, it’s a sub-martingale while if we assume that its value will shrink in time, it is
a super-martingale.
Recalling that Sn is a martingale if we would expect that our value in the future is the same as
now, then
1. If α = 1, then Sn is a martingale.
2. If α ≥ 1, then Sn is a sub-martingale.
3. If α ≤ 1, then Sn is a super-martingale.
since on average Bt − Bs = 0.
6.5 Martingale SDEs
Theorem: Take the SDE
dXt = a(Xt , t)dt + b(Xt , t)dWt .
If a(Xt , t) = 0, then Xt is a martingale.
Proof: If we take the expectation of the SDE, then we see that

dE[X_t]/dt = E[a(X_t, t)]

and thus if a(X_t, t) = 0, the expectation does not change and thus X_t is a martingale.

For example, let Z_t = f(W_t, t) = e^{−θW_t − θ^2 t/2}. Ito's Rules give

dZ_t = (∂f/∂t) dt + (∂f/∂W_t) dW_t + (1/2)(∂^2 f/∂W_t^2)(dW_t)^2
     = −(1/2) θ^2 Z_t dt − θ Z_t dW_t + (1/2) θ^2 Z_t dt
     = −θ Z_t dW_t
since (dWt )2 = dt. Thus since there is no deterministic part, Zt is a martingale. Thus since Z0 = 1,
we get that
E[Zt |Fs ] = Zs ,
and
E[Zt |F0 ] = 1.
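A small Monte Carlo sketch of the fact that E[Z_t] = 1; the values of θ and t are illustrative:

# Monte Carlo check that Z_t = exp(-theta*W_t - theta^2 t / 2) has mean 1.
import numpy as np

rng = np.random.default_rng(2)
theta, t, n = 0.7, 2.0, 1_000_000
W_t = rng.normal(0.0, np.sqrt(t), size=n)          # W_t ~ N(0, t)
Z_t = np.exp(-theta * W_t - 0.5 * theta**2 * t)
print("E[Z_t] ~", Z_t.mean())                       # should be close to 1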
Here we will consider first-passage time problems. These are problems where we start in a set and want to know at what time we hit the boundary. Fix x > 0. Let τ = inf{t ≥ 0 : Xt = x}
which is simply the first time that Xt hits the point x.
6.6.1 Kolmogorov Solution to First-Passage Time
Notice that we can solve for the probability density using the Forward Kolmogorov Equation

∂ρ(x, t)/∂t = −∂/∂x [f(x)ρ(x, t)] + (1/2) Σ_{i=1}^m ∂^2/∂x^2 [g_i^2(x)ρ(x, t)].
Given the problem, we will also have some initial condition ρ(x, 0) = ρ0 (x). If we are looking for
first passage to some point x_0, then we put an absorbing condition there: ρ(x_0, t) = 0. Thus the probability that you have not hit x_0 by the time t is given by the total probability that has not been absorbed by the time t, and thus

Pr(τ > t) = ∫_{−∞}^{∞} ρ(x, t) dx.
This method will always work, though many times the PDE will not have an analytical solution.
However, this can always be solved using computational PDE solvers.
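As one concrete sketch of such a computational solution, the Forward Kolmogorov equation can be discretized with finite differences and an absorbing boundary at the target. The drift, diffusion, domain, and grid choices below are all illustrative, and the scheme is the simplest explicit one rather than anything optimized:

# Finite-difference sketch of drho/dt = -d/dx[mu*rho] + 0.5*sigma^2 d^2rho/dx^2
# with an absorbing boundary rho(x0, t) = 0 at the first-passage target x0.
# Pr(tau > t) is the total probability remaining in the domain.
import numpy as np

mu, sigma = 0.5, 1.0
x_left, x0 = -10.0, 1.0                  # far-away left edge, target at x0
nx = 1101
x = np.linspace(x_left, x0, nx)
dx = x[1] - x[0]
dt = 0.2 * dx**2 / sigma**2               # small step for explicit stability
T = 1.0

# Narrow Gaussian approximating the delta initial condition at x = 0.
rho = np.exp(-(x - 0.0) ** 2 / (2 * 0.01)) / np.sqrt(2 * np.pi * 0.01)

t = 0.0
while t < T:
    drho = np.zeros_like(rho)
    drho[1:-1] = (-mu * (rho[2:] - rho[:-2]) / (2 * dx)
                  + 0.5 * sigma**2 * (rho[2:] - 2 * rho[1:-1] + rho[:-2]) / dx**2)
    rho = rho + dt * drho
    rho[0] = 0.0      # absorbing far-field boundary
    rho[-1] = 0.0     # absorbing boundary at the target x0
    t += dt

survival = np.sum(rho) * dx               # total unabsorbed probability
print("Pr(tau > 1) ~", survival)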
where t ∧ τ = min(t, τ). M(t ∧ τ) is simply the process that we stop once it hits x.
Theorem: If Mt is a martingale, then Mt∧τ is a martingale.
The proof is straight-forward. Since the stopped martingale does not change after τ , then the
expectation definitely will not change after τ . Since it is a martingale before τ , the expectation
does not change before τ . Thus Mt∧τ is a martingale.
6.6.3 Reflection Principle
Another useful property is known as the Reflection Principle. Basically, if we look at a Brownian motion after the time τ, there is just as much of a probability of it going up as there is of it going down. Thus the trajectory that is reflected after a time τ is just as probable as the non-reflected path.

[Figure: a Brownian path and its reflection about the level x after the first hitting time τ. Picture taken from Oksendal, Stochastic Differential Equations: An Introduction with Applications.]

We can formally write the Reflection Principle as

P(τ ≤ t) = P(τ ≤ t, B_t < x) + P(τ ≤ t, B_t ≥ x)
         = 2 P(B_t ≥ x)
         = 2 ∫_x^∞ (1/√(2πt)) e^{−u^2/(2t)} du.
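A quick Monte Carlo sketch comparing the simulated hitting probability against 2P(B_t ≥ x); the level x, horizon t, and discretization are illustrative, and the discrete walk slightly underestimates hitting, so only rough agreement should be expected:

# Monte Carlo check of the Reflection Principle: P(tau <= t) = 2 P(B_t >= x).
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(3)
x, t = 1.0, 1.0
n_paths, n_steps = 100_000, 1_000
dt = t / n_steps

pos = np.zeros(n_paths)
running_max = np.zeros(n_paths)
for _ in range(n_steps):
    pos += rng.normal(0.0, np.sqrt(dt), size=n_paths)
    running_max = np.maximum(running_max, pos)

hit = (running_max >= x).mean()
exact = erfc(x / sqrt(2 * t))      # 2 P(B_t >= x) = erfc(x / sqrt(2t))
print("simulated P(tau <= t):", hit)
print("reflection principle :", exact)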
then we say x satisfies the Markov property and is thus a Markov process. Another way of stating
this is that
E [f (Xt )|Fs ] = E [f (Xt )|σ(Xs )] ,
or
P (xt ∈ A|Fs ) = P (xt ∈ A|xs ).
Intuitively, this property means that the total information for predicting the future value x_t is completely encapsulated by the current state x_s.

dv = (∂v/∂t) dt + (∂v/∂x) dx + (1/2)(∂^2 v/∂x^2)(dx)^2
   = (∂v/∂t) dt + (∂v/∂x)(a(X_t) dt + b(X_t) dW_t) + (1/2) b^2(x(t), t)(∂^2 v/∂x^2) dt
   = [∂v/∂t + a(X_t) ∂v/∂x + (1/2) b^2(x(t), t) ∂^2 v/∂x^2] dt + b(X_t)(∂v/∂x) dW_t.
Since v is a martingale, the deterministic part must be zero. Thus we get the equation
∂v/∂t + a(X_t) ∂v/∂x + (1/2) b^2(x(t), t) ∂^2 v/∂x^2 = 0.

We can solve this backward in time using the terminal condition v(x, T) = h(x) to obtain v(x(t), t).

Define the transition density function as ρ(t_0, t, x, y), the probability density of transitioning from x at time t_0 to y at time t. Notice that the expected terminal payoff h given the state x at time t is

v(x, t) = E^{t,x} [h(x(T))] = ∫_{−∞}^∞ ρ(t, T, x, y) h(y) dy.
Now suppose h(z) = δ(z − y), meaning that we know that the trajectory ends at y. Thus

v(x, t) = ∫_{−∞}^∞ ρ(t, T, x, z) δ(z − y) dz = ρ(t, T, x, y).

Therefore ρ itself satisfies

∂ρ/∂t + a(x(t), t) ∂ρ/∂x + (1/2) b^2(x(t), t) ∂^2 ρ/∂x^2 = 0

with the terminal condition ρ(T, x) = δ(x − y), where y is the place the trajectory ends. Thus this equation, the Kolmogorov Backward Equation, tells us: if we know the trajectory ends at y at a time T, what is the probability that it was at x at a time t < T. Notice that this is not the same as the Kolmogorov Forward Equation. This is because diffusion forward does not equal diffusion backwards. For example, say you place dye in water. It will diffuse to spread out uniformly around the water. However, if we were to play that process in reverse, it would look distinctly different.
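Since v(x, t) = E^{t,x}[h(x(T))], one practical way to evaluate v without solving the backward PDE is Monte Carlo over simulated trajectories. The drift, diffusion, payoff, and parameters in this sketch are all illustrative:

# Monte Carlo evaluation of v(x, t) = E^{t,x}[ h(X_T) ] for
# dX = a(X) dt + b(X) dW, which solves the Kolmogorov Backward Equation
# with terminal condition v(x, T) = h(x).
import numpy as np

def a(x): return -x                     # mean-reverting drift (illustrative)
def b(x): return 1.0                    # constant diffusion (illustrative)
def h(x): return np.maximum(x, 0.0)     # terminal payoff (illustrative)

def v_monte_carlo(x, t, T, n_paths=50_000, dt=1e-3, seed=4):
    rng = np.random.default_rng(seed)
    X = np.full(n_paths, x, dtype=float)
    for _ in range(int((T - t) / dt)):
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        X += a(X) * dt + b(X) * dW      # Euler-Maruyama step
    return h(X).mean()

print("v(x=0.5, t=0) ~", v_monte_carlo(0.5, 0.0, 1.0))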
Ẽ [x] = E [xz]
6.9.2 Simple Change of Measure Example
Take (Ω, F, p) and let X be a standard normal random variable under p. Define Y = X + θ. Notice that under p, Y ∼ N(θ, 1). However, if we define

z(ω) = e^{−θX(ω) − θ^2/2}

and define

p̃(A) = ∫_A z dp,

then in this reweighing of the probability space Y is standard normal. This is because, by definition, the pdf of Y under the measure p̃ evaluated at y = x + θ is

f̃(y) = z(x) · (1/√(2π)) e^{−x^2/2}
     = e^{−θx − θ^2/2} · (1/√(2π)) e^{−x^2/2}
     = (1/√(2π)) e^{−(x+θ)^2/2} = (1/√(2π)) e^{−y^2/2},

so Y ∼ N(0, 1) under p̃.
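A small numeric sketch of this reweighing (the value of θ is illustrative): sampling X ∼ N(0, 1), setting Y = X + θ, and weighting by z shows that the weighted moments of Y match a standard normal.

# Change of measure by reweighing: under p, Y = X + theta ~ N(theta, 1);
# weighting by z = exp(-theta*X - theta^2/2) makes Y standard normal under p~.
import numpy as np

rng = np.random.default_rng(5)
theta, n = 1.5, 1_000_000
X = rng.normal(0.0, 1.0, size=n)
Y = X + theta
z = np.exp(-theta * X - 0.5 * theta**2)

print("E[z]                ~", z.mean())                 # ~ 1 (valid reweighing)
print("mean of Y under p~  ~", np.mean(z * Y))            # ~ 0
print("var  of Y under p~  ~", np.mean(z * Y**2) - np.mean(z * Y) ** 2)   # ~ 1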
Zt = E[ζ|Ft ].
Notice that
1. Z(t) is a martingale.
6.9.4 Girsanov Theorem
Using the change of measure determined by the RNDP, we can always construct a Brownian motion
from an adapted stochastic process. This is known as Girsanov Theorem.
Theorem: Girsanov Theorem. Let Wt be a Brownian motion in (Ω, F, P ) and Ft be a
filtration with 0 ≤ t ≤ T. Let θ(t) be an adapted stochastic process on F_t. Define

Z(t) = e^{−∫_0^t θ(u) dW(u) − (1/2) ∫_0^t θ^2(u) du},

W̃_t = W_t + ∫_0^t θ(u) du,    i.e.    dW̃_t = dW_t + θ(t) dt.

Then:

1. E[Z_t] = 1,

2. W̃_t is a Brownian motion under the measure p̃ defined by p̃(A) = ∫_A Z(T) dp.

To prove the first property, write Z_t = e^{ln Z_t} and apply Ito's Rules to the exponential:

dZ_t = Z_t d(ln Z_t) + (1/2) Z_t (d(ln Z_t))^2
     = Z_t [−θ(t) dW − (1/2) θ^2(t) dt + (1/2) θ^2(t) dt]
     = −Z_t θ(t) dW
and thus since the deterministic changes are zero, Zt is a martingale. Notice that since Z(0) = 1,
E [Zt ] = Z0 = 1. Thus we have proven the first property.
In order to prove the second property, we employ the Levy Theorem from 6.7. By the theorem we have that if W̃_t is a martingale under p̃ and dW̃_t × dW̃_t = dt, then W̃_t is a Brownian Motion under p̃. Notice that

dW̃_t × dW̃_t = (dW_t + θ(t) dt)^2 = dW_t^2 = dt.
Thus in order to show W̃t is a Brownian motion under p̃, we simply need to show that W̃t is a
martingale. Since Zt is an RNDP, we use property 2 of RNDPs to see
Ẽ[W̃_t|F_s] = E[W̃_t Z_t | F_s] / Z_s.

Thus we use Ito's Rules to get

d(W̃_t Z_t) = W̃_t dZ_t + Z_t dW̃_t + dW̃_t dZ_t
           = −W̃_t Z_t θ(t) dW_t + Z_t dW_t + Z_t θ(t) dt − Z_t θ(t) dt
           = Z_t (1 − W̃_t θ(t)) dW_t

and thus, since the deterministic changes are zero, W̃_t Z_t is a martingale under p. Thus we use the definition of a martingale to get

Ẽ[W̃_t|F_s] = W̃_s Z_s / Z_s = W̃_s

and thus W̃_t is a martingale under p̃. Therefore by the Levy Theorem we get that W̃_t is a Brownian motion under p̃, completing the proof.
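A Monte Carlo sketch of Girsanov's theorem in action, using a constant θ and illustrative parameters: under the reweighting by Z_T, the drifted process W̃_T = W_T + θT behaves like a driftless Brownian motion at time T.

# Girsanov check: simulate W_T under p, form W~_T = W_T + theta*T and the
# density Z_T = exp(-theta*W_T - theta^2*T/2); under the reweighted measure,
# W~_T should have mean ~ 0 and variance ~ T.
import numpy as np

rng = np.random.default_rng(6)
theta, T, n = 0.8, 1.0, 1_000_000
W_T = rng.normal(0.0, np.sqrt(T), size=n)
W_tilde_T = W_T + theta * T
Z_T = np.exp(-theta * W_T - 0.5 * theta**2 * T)

mean_tilde = np.mean(Z_T * W_tilde_T)
var_tilde = np.mean(Z_T * W_tilde_T**2) - mean_tilde**2
print("Etilde[W~_T]   ~", mean_tilde, " (expect 0)")
print("Vartilde[W~_T] ~", var_tilde, " (expect", T, ")")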
7 Applications of SDEs
Using the theory we have developed, we will now look into some applications of SDEs.
Consider a European call option with strike k on the stock S(t); let [S(T) − k]^+ be the value of the option at the time T. Let v(t, S(t)) be the value of the European call option at
the time t. What we wish to do is find out how to evaluate v.
7.1.1 Solution Technique: Self-Financing Portfolio
Let X(t) be the value of our portfolio. We assume there are only two assets: this stock and the money market. By finding out the optimal amount of money we should be investing into the stock we can uncover its intrinsic value and thus determine v. Note that if we put our money in the money market, then it accrues interest at a rate r. Thus, letting ∆(t) be the number of shares of stock we hold, we can write the change in the value of our portfolio as

dX(t) = ∆(t) dS(t) + r (X(t) − ∆(t)S(t)) dt = [rX(t) + (µ − r)∆(t)S(t)] dt + σ∆(t)S(t) dW.

Applying Ito's Rules to v,

dv(t, S(t)) = (∂v/∂t) dt + (∂v/∂x) dS(t) + (1/2)(∂^2 v/∂x^2)(dS)^2
            = (∂v/∂t) dt + µS(t)(∂v/∂x) dt + σS(t)(∂v/∂x) dW + (1/2) σ^2 S^2(t)(∂^2 v/∂x^2) dt
            = [∂v/∂t + µS(t) ∂v/∂x + (1/2) σ^2 S^2(t) ∂^2 v/∂x^2] dt + σS(t)(∂v/∂x) dW.
We now assume that there is no arbitrage. This can be rooted in what is known as the “Efficient
Market Hypothesis” which states that a free-market running with complete information and rational
individuals will operate with no “arbitrage” where arbitrage is the “ability to beat the system”.
For example, if one store is selling a certain candy bar for $1 and another store is buying it for
$2, there is an arbitrage here of $1 and you can make money! But, if these people had complete
information and were rational, we assume that they will have worked this out and no opportunity
like this will be available. In stock market terms, this means that the price of a good equates with
the value of the good. This means that we can assume that the value of the option at the time
t, v(X, t), will equate with its market price. Notice that, since the only way we could have made
money with our portfolio is that the value of the invested stock has increased in order to make our
option more valuable, the value of the option at time t is equal to the value of our portfolio. Since
price equates with value we get the condition
v(X, t) = X(t)
and thus
dv = dX.
Thus we solve for v by equating the coefficients between the dX and the dv equations. Notice that
the noise term gives us that
σS(t)(∂v/∂x) = σ∆(t)S(t),   i.e.   ∂v/∂x = ∆(t),

where we interpret the derivative of v as evaluated at the stock price at time t,

∂v/∂x = (∂v/∂x)|_{x=S(t)}.

This is known as delta hedging. Then, matching the dt coefficients, we receive the equation

rX(t) + (µ − r)∆(t)S(t) = ∂v/∂t + µS(t) ∂v/∂x + (1/2) σ^2 S^2 ∂^2 v/∂x^2

where we replace X(t) = v(t, S(t)) and ∆(t) = ∂v/∂x to get

rv + (µ − r)S(t) ∂v/∂x = ∂v/∂t + µS(t) ∂v/∂x + (1/2) σ^2 S^2 ∂^2 v/∂x^2,
rv − rx ∂v/∂x = ∂v/∂t + (1/2) σ^2 x^2 ∂^2 v/∂x^2,

as a PDE for solving for v. Since we have no arbitrage, the value of the option, v, equals the price of the option, g(T, S(T)), at the time T. This gives us the system

∂v/∂t + rx ∂v/∂x + (1/2) σ^2 x^2 ∂^2 v/∂x^2 = rv    (a linear parabolic PDE)
v(T, x) = [x − k]^+

whose solution gives us the evaluation of the option at the time t. This is known as the Black-Scholes-Merton Equation. This equation can be solved using a change of variables, though we will use another method.
and thus by definition v is a martingale. Thus we use Ito’s Rules on e−rt v to get
d(e^{−rt} v(t, S(t))) = e^{−rt} [−rv + ∂v/∂t] dt + e^{−rt} (∂v/∂x) dS + e^{−rt} (1/2)(∂^2 v/∂x^2)(dS)^2
and use the martingale property to equate the deterministic changes with zero to once again uncover
the Black-Scholes equation.
Thus, since we know that it's equivalent to say µ = r and evaluate the equation, we note that

v(t, S(t)) = e^{−r(T−t)} E^{t,S(t)} [[S(T) − k]^+ | F_t].
Recall from the SDE that
dS(t) = µS(t)dt + σS(t)dW
which is Geometric Brownian Motion whose solution (with µ = r) is

S(T) = S(t) e^{(r − σ^2/2)(T−t) + σ(W(T) − W(t))}.
Thus, noting that W (T ) − W (t) = ∆W ∼ N (0, T − t), we use the definition of expected value to
get
v(t, S(t)) = ∫_{−∞}^∞ [S(t) e^{(r−σ^2/2)(T−t)+σu} − k]^+ (1/√(2π(T−t))) e^{−u^2/(2(T−t))} du
           = (1/√(2π(T−t))) ∫_α^∞ (S(t) e^{(r−σ^2/2)(T−t)+σu} − k) e^{−u^2/(2(T−t))} du

where

α = [ln(k/S(t)) − (r − σ^2/2)(T − t)] / σ.

We can solve this to get

v(t, x) = x N(d_+) − k e^{−rτ} N(d_−)

where τ = T − t,

d_±(τ, x) = (1/(σ√τ)) [ln(x/k) + (r ± σ^2/2) τ],

and

N(y) ≜ (1/√(2π)) ∫_{−∞}^y e^{−u^2/2} du = (1/2) erfc(−y/√2).
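For concreteness, here is a small sketch that evaluates this closed-form price and checks it against a risk-neutral Monte Carlo simulation of the geometric Brownian motion; all parameter values are illustrative:

# Black-Scholes call price v(t, x) = x N(d+) - k e^{-r tau} N(d-) versus a
# risk-neutral Monte Carlo estimate e^{-r tau} E[(S_T - k)^+].
import numpy as np
from math import log, sqrt, exp, erf

def N(y):
    return 0.5 * (1.0 + erf(y / sqrt(2.0)))

def black_scholes_call(x, k, r, sigma, tau):
    d_plus = (log(x / k) + (r + 0.5 * sigma**2) * tau) / (sigma * sqrt(tau))
    d_minus = d_plus - sigma * sqrt(tau)
    return x * N(d_plus) - k * exp(-r * tau) * N(d_minus)

x, k, r, sigma, tau = 100.0, 105.0, 0.03, 0.2, 1.0
print("closed form:", black_scholes_call(x, k, r, sigma, tau))

rng = np.random.default_rng(7)
Z = rng.normal(size=2_000_000)
S_T = x * np.exp((r - 0.5 * sigma**2) * tau + sigma * sqrt(tau) * Z)
mc = exp(-r * tau) * np.maximum(S_T - k, 0.0).mean()
print("Monte Carlo:", mc)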
7.1.3 Justification of µ = r via Girsanov Theorem
Take the generalized stock price SDE

dS(t) = µ(t)S(t) dt + σ(t)S(t) dW_t

and let

D(t) = e^{−∫_0^t r(s) ds}.

Thus we use Ito's Rules to get

d(D(t)S(t)) = σ(t)D(t)S(t) [θ(t) dt + dW_t]

where

θ(t) = (µ(t) − r(t)) / σ(t).

Thus we define dW̃_t = dW_t + θ(t) dt to get

d(D(t)S(t)) = σ(t)D(t)S(t) dW̃_t.

Therefore if we define

Z_t = e^{−∫_0^t θ(u) dW(u) − (1/2) ∫_0^t θ^2(u) du}

and

p̃(A) = ∫_A Z_t dp,

then W̃_t is a Brownian motion under p̃. Therefore, since there is no deterministic change, D(t)S(t) is a martingale under p̃. We note that D(t)X(t) is also a martingale under p̃. We call p̃ the Risk-Neutral World. Note that in the Risk-Neutral World we can set the price of the option, discounted by D(t), as the expectation conditioned on the totality of information that we have. Thus for V(S(T), T) as the payoff of a derivative security on the stock S(T) at time T, we get

D(t) v(t, S(t)) = Ẽ [D(T) V(S(T), T) | F_t].

This is an equivalent expression to the conditional expectation from before, saying that we can let
µ = r because this is simply a change of measure into the Risk-Neutral World.
7.2.1 Definitions from Biology
First we will introduce some definitions from biology. A gene locus is simply a location on a
chromosome or a portion of a chromosome. It can represent a gene, a SNP (single nucleotide polymorphism), or simply a location. An allele is one of a number of alternative forms of the locus.
A dominant allele is usually capitalized. For human and other diploid animals, there are typically
two alleles of paternal and maternal origin. A genotype is the genetic makeup of an individual.
In this case it will denote the types of alleles an individual carries. For example, if the
individual has one dominant allele and one recessive allele, its genotype is Aa. Having the same
pair of alleles is called homozygous while having different alleles is called heterozygous. A mutation
is a random genetic change. Here we refer to it as the random spontaneous change of one allele to
another. Fitness is the measure of survivability and ability to reproduce of an individual possessing
a certain genotype (this is wrong for many philosophical / scientific reasons, but we can take this
as a working definition since this is a major topic worth its own book) . Neutral evolution is the
case where all genotypes of interest have the same fitness. This implies that there is no selection
from such genetic variations and all change is inherently random.
4. Mating is completely random and is determined by random sampling with replacement. This means any individual can randomly give rise to 0, 1, . . . many offspring.
From this, the time evolution of the Wright-Fisher model is defined as:
2. At generation 1, each A or a allele from generation 0 may result in one, more or zero copies
of that allele.
7.2.3 Formalization of the Wright-Fisher Model
Let us characterize the Wright-Fisher model in probabilistic terms. Let X_n be the total number of A alleles at generation n. Considering that there are a total of 2N alleles in any given generation, we can define P_n = X_n / (2N) to be the fraction of alleles that are A. Let X_0 be some known initial condition. Because there are 2N independent alleles that could be passed and the probability of generating an A allele at generation n is P_{n−1} (because of sampling with replacement), we can derive the distribution for X_n using indicator functions. Order the alleles in the population. Let X_{n,i} be 1 if the ith allele is an A and 0 if the ith allele is an a. Since we are sampling with replacement, the choice of allele i is independent of the choice of allele j for any i ≠ j. Thus each X_{n,i} ∼ Bern(P_{n−1}), that is, each X_{n,i} is a Bernoulli random variable with probability of heads equal to P_{n−1}. Therefore, since X_n = Σ_i X_{n,i} and the X_{n,i} are independent, X_n is modeled by the result of 2N coin-flips each with a probability P_{n−1} of heads. Thus

X_n ∼ Bin(2N, P_{n−1}),

which makes

Pr(X_n = k) = (2N choose k) P_{n−1}^k (1 − P_{n−1})^{2N−k},    0 ≤ k ≤ 2N.

Thus we can show that

E[X_n|X_{n−1}] = 2N P_{n−1} = X_{n−1},

which implies X_n is a martingale. Notice that we can think of this process as some form of a random walk for X_n with probability of going right P_{n−1}. One can rigorously show that for a 1-dimensional random walk of this sort

lim_{n→∞} P_n = 0 or 1,

which means that the process will fix to one of the endpoints in finite time. Since X_n is a martingale, we note that

lim_{n→∞} E[X_n] = X_0,

and since X_∞ ∈ {0, 2N}, this means

Pr(X_∞ = 2N) = X_0 / (2N) = P_0.
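A quick simulation sketch of the discrete Wright-Fisher model, checking that the fixation probability of A is approximately P_0 = X_0 / (2N); the population size and initial count are illustrative:

# Simulate the neutral Wright-Fisher model until fixation and check that
# Pr(X_inf = 2N) is approximately P_0 = X_0 / (2N).
import numpy as np

rng = np.random.default_rng(8)
N, X0, n_runs = 50, 30, 5_000
two_N = 2 * N

fixed_at_2N = 0
for _ in range(n_runs):
    X = X0
    while 0 < X < two_N:
        # Each of the 2N alleles in the next generation is A with prob X / 2N.
        X = rng.binomial(two_N, X / two_N)
    fixed_at_2N += (X == two_N)

print("simulated Pr(X_inf = 2N):", fixed_at_2N / n_runs)
print("theory P_0 = X_0 / (2N) :", X0 / two_N)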
7.2.4 The Diffusion Generator
If we assume N is large, then it is reasonable to assume that the gene frequencies Pn will change in
a manner that is almost continuous and thus can be approximated by an SDE. Consider an SDE
of the form

dX_t = a(X_t) dt + √(b(X_t)) dW_t

with the initial condition X_0 = x. Applying Ito's rule gives us

df(X_t) = (df/dX) dX_t + (1/2)(d^2 f/dX^2)(dX_t)^2
        = (df/dX) a(x) dt + (df/dX) √(b(x)) dW + (1/2)(d^2 f/dX^2) b(x) dt.

Writing f_X = ∂f/∂X and f_XX = ∂^2 f/∂X^2 and taking expectations, we get that

dE[f(X_t)]/dt = f_X a(x) + (1/2) f_XX b(x).

Thus we define L to be the operator

Lf = dE[f(X_t)]/dt = f_X a(x) + (1/2) f_XX b(x).

L is known as the diffusion generator. If we let f(x) = x then

Lf = a(x) = dE[X_t]/dt

where a(x) defines the infinitesimal mean change. If we define f(y) = (y − x)^2 for a fixed x, then

b(x) = (d/dt) E[(X_t − x)^2]
makes b(x) define the infinitesimal variance. Therefore, we can generally relate discrete stochastic
processes to continuous stochastic processes by defining the SDE with the proper a and b such
that the mean and variance match up. This can formally be shown to be the best continuous
approximation to a discrete Markov process using the Kolmogorov Forward Equation.
Notice too then that
V [Pn ] = Pn−1 (1 − Pn−1 ).
Because

E[X_n|X_{n−1}] = X_{n−1}

we get that

dE[X_n]/dt = 0.

This means that we define the SDE approximation of the Wright-Fisher model to be the one that matches the mean and variances, that is

dX_t = √(X_t(1 − X_t)) dW_t.
and thus
∆Xn = Xn − Xn−1 ≈ Xn−1 (1 − Xn−1 )s.
Therefore we make

dE[X_n]/dt = s X_{n−1}(1 − X_{n−1}).

To make this continuous we note

dE[P_n]/dt = (s/(2N)) P_{n−1}(1 − P_{n−1}) = γ P_{n−1}(1 − P_{n−1})

where γ = s/(2N). Therefore we make this continuous by matching the diffusion operator as
L = γx(1 − x) d/dx + (1/2) x(1 − x) d^2/dx^2

and thus we approximate the Wright-Fisher model with selection as the SDE

dX_t = γ X_t(1 − X_t) dt + √(X_t(1 − X_t)) dW_t
when N is large.
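A minimal Euler-Maruyama sketch of this diffusion approximation; γ, the step size, and the initial frequency are illustrative, and the state is clipped to [0, 1] so the square root stays real near the boundaries:

# Euler-Maruyama simulation of dX = gamma X(1-X) dt + sqrt(X(1-X)) dW,
# the diffusion approximation to Wright-Fisher with selection.
import numpy as np

rng = np.random.default_rng(9)
gamma, x0, T, dt = 1.0, 0.2, 20.0, 1e-3

x = x0
for _ in range(int(T / dt)):
    dW = rng.normal(0.0, np.sqrt(dt))
    x += gamma * x * (1 - x) * dt + np.sqrt(max(x * (1 - x), 0.0)) * dW
    x = min(max(x, 0.0), 1.0)        # keep the frequency in [0, 1]
    if x in (0.0, 1.0):
        break                        # absorbed: allele lost or fixed

print("final allele frequency:", x)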
L = (γx(1 − x) + β_1(1 − x) + β_2 x) d/dx + (1/2) x(1 − x) d^2/dx^2.

Therefore the best SDE approximation to the Wright-Fisher Model with selection and mutation is

dX_t = (γ X_t(1 − X_t) + β_1(1 − X_t) + β_2 X_t) dt + √(X_t(1 − X_t)) dW_t.
E[h(X_t)|F_s] = h(X_s)

h(x) = ∫_0^1 Pr(X_t = y|X_0 = x) h(y) dy

∫_0^1 δ(x − y) h(y) dy = ∫_0^1 Pr(X_t = y|X_0 = x) h(y) dy

0 = ∫_0^1 [Pr(X_t = y|X_0 = x) − δ(x − y)] h(y) dy.
Thus

0 = ∫_0^1 (∂ρ/∂t) h(y) dy

0 = ∫_0^1 [−∂(aρ)/∂x + (1/2) b^2 ∂^2 ρ/∂x^2] h(y) dy
Lh = 0
h(a) = 1
h(b) = 0
where

L = a(x) d/dx + (1/2) b(x) d^2/dx^2.
To solve for this hitting probability, define ψ = dh/dx. Then Lh = 0 gives ψ′/ψ = −2a(x)/b(x), and therefore

ψ(x) = ψ(0) e^{−∫_0^x 2a(y)/b(y) dy}

and

h(x) = ∫_0^x ψ(y) dy,

up to the constants fixed by the boundary conditions. This is expanded in the book by Kimura and Crow.
0 = −∂/∂x [a(x)π(x, t)] + (1/2) ∂^2/∂x^2 [b^2(x)π(x, t)].
For the Wright-Fisher model with selection and mutation, we note that this can be solved so that
7.3 Stochastic Control
The focus of this section will be on stochastic control. We begin by looking at optimal control in
the deterministic case.
ẋ = F (x(t), u(t))
where x(t) is the state variable and u(t) is the control variable. We wish to apply optimal control
over a fixed time period [0, T ] such that
V(x(t), t) = min_u [ ∫_t^T C[x(s), u(s)] ds + D[x(T)] ]
where C is the cost function, D is the terminal cost, and V (x(t), t) is the optimal or minimal cost at
a time t. Thus what we are looking to do is, given some cost function C, we are looking to minimize
the total cost. Let’s say for example that we want to keep x at a value c, but sending control signals
costs some amount α. Thus an example could be that C(x(t), u(t)) = (x(t) − c)2 + αu(t). What
we want to do is solve for the best u(t).
[Diagram: the time interval from 0 to T, with a small step dt taken at time t.]
At point t there is an optimal value V (x(t), t). One way to try to control the system is by
taking small steps and observing how the system evolves. We will do this by stepping backwards.
Divide the cost into two components: the cost over the increment dt plus the optimal cost-to-go from t + dt. Notice that

V(x(t), t) = min_u [ C(x(t), u(t)) dt + V(x(t + dt), t + dt) ].

Essentially the goal is to find the u(t) for the increment dt that minimizes the growth of V. Do a
Taylor expansion on V to get
V(x(t + dt), t + dt) ≈ V(x(t), t) + (∂V/∂t) dt + ⟨∂V/∂x, ẋ(t)⟩ dt.
The third term of the above equation, ⟨∂V/∂x, ẋ(t)⟩ dt, is not a product of scalars in general, but rather the dot product between two vectors. The importance of this expansion is that it is recursive, and this recursion is what leads to the minimal solution. By plugging this in we get

min_u [ C(x(t), u(t)) + ∂V/∂t + ⟨∂V/∂x, F(x(t), u(t))⟩ ] = 0

since V(x(t), t) does not depend on u(t) in the range (t, t + dt). We move ∂V/∂t outside the minimization to get

∂V/∂t + min_u [ C(x, u) + ⟨∂V/∂x, F(x, u)⟩ ] = 0.

The boundary condition is that the cost at the end must equal the terminal cost:
V (x, T ) = D(x).
This defines a backward problem where, counting back from T, the terminal cost acts as the initial condition and we solve backwards using the stepping equation. However, the PDE that needs to be solved to find the optimal control of the system may be hard to solve because the minimization may be nontrivial. This is known as deterministic optimal control, where the best u will be found. These types of equations are known as Hamilton-Jacobi-Bellman (HJB) equations, and are famous if you are studying optimal control.
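A minimal dynamic-programming sketch of solving such an HJB problem backward in time on a grid. The dynamics F, running cost C, terminal cost D, and all discretization choices here are illustrative, not taken from the text:

# Backward dynamic programming for a deterministic HJB problem:
#   V(x, t) = min_u [ C(x, u) dt + V(x + F(x, u) dt, t + dt) ],  V(x, T) = D(x).
# Dynamics F(x, u) = u, running cost C = (x - c)^2 + alpha*u^2,
# terminal cost D = (x - c)^2.
import numpy as np

c, alpha, T, dt = 1.0, 0.1, 1.0, 0.01
xs = np.linspace(-2.0, 3.0, 251)          # state grid
us = np.linspace(-5.0, 5.0, 101)          # candidate controls

V = (xs - c) ** 2                          # terminal cost D(x)
for _ in range(int(T / dt)):               # step backward from T to 0
    V_new = np.empty_like(V)
    for i, x in enumerate(xs):
        x_next = x + us * dt                               # F(x, u) = u
        cost = ((x - c) ** 2 + alpha * us ** 2) * dt \
               + np.interp(x_next, xs, V)                  # cost-to-go
        V_new[i] = cost.min()                              # minimize over u
    V = V_new

print("optimal cost V(x=0, t=0) ~", np.interp(0.0, xs, V))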
What type of control is being used? That is the question that needs to be addressed, because there
are many types of controls:
1. Open Loop Control (Deterministic Control).
Suppose u(t, ω) = u(t). This case has no random events and thus it will be deterministic
control (open looped control).
2. Closed Loop Control (Feedback Control).
Suppose u_t is M_t-adapted, where M_t is the σ-algebra generated by X_s, 0 ≤ s ≤ t. Essentially, for this σ-algebra you have all the information about the trajectory from 0 up to the current time point; the whole history may be used.
3. Markov Control.
U(t, ω) = u_0(t, x_t(ω)). Markov control uses less information than closed loop control: it only uses the current state, nothing beyond that. This is commonly used in programming applications because only the last state needs to be saved, leading to iterative solutions that do not require much working memory or RAM. An example of where this would be used is in robotics: a robot has to decide to walk or stop, and this decision only depends on the current state.
These types of controls explain what type of information is allowed to be used. Regardless, we
solve the problem using dynamic programming but this time we use the stochastic Taylor series
expansion:
V(x(t + dt), t + dt) = V(x(t), t) + (∂V/∂t) dt + (∂V/∂x) · dX_t + (1/2)(dX_t)^T (∂^2 V/∂x^2)(dX_t)

where ∂^2 V/∂x^2 is the Hessian of V. Thus

V(x(t+dt), t+dt) = V(x(t), t) + (∂V/∂t) dt + ⟨∂V/∂x, b(x_t, u_t)⟩ dt + ⟨∂V/∂x, σ(x_t, u_t) dB_t⟩ + (1/2) Σ_{i,j} a_{ij} (∂^2 V/∂x_i ∂x_j) dt

where

a_{ij} = (σσ^T)_{ij}.

After taking the expectation we solve as before to get

∂V/∂t + min_u [ C(x, u) + ⟨∂V/∂x, b(x_t, u_t)⟩ + (1/2) Σ_{ij} a_{ij} ∂^2 V/∂x_i ∂x_j ] = 0

V(x, T) = D(x).
Whether the solution exists, or is unique, are hard questions. In general the goal is a practical solution to the HJB equations. This type of equation does not necessarily have an analytical solution, and caution needs to be exercised as numerical solutions have issues as well. For example, how could one find the u that minimizes this? Some form of the calculus of variations? There is no clear general way to do this.
Here the system is linear, dx_t = (H_t x_t + M_t u_t) dt + σ_t dB_t, with σ_t ∈ R^{n×m} and M_t ∈ R^{n×k}. We minimize over u the expected quadratic cost

V^u(x, 0) = E^{x,0} [ ∫_0^T (x_t^T C_t x_t + u_t^T D_t u_t) dt + x_T^T R x_T ]

where C_t is the state cost at t, D_t is the cost of controlling at t, and R is the terminal cost; the idea is to keep both the state and the control small. We denote

ψ(t, x) = min_u V^u.

Using s as the time stand-in variable, we plug the system into the HJB equation to get

∂ψ/∂s + min_u [ x^T C_s x + u^T D_s u + Σ_{i=1}^n (H_s x + M_s u)_i ∂ψ/∂x_i + (1/2) Σ_{ij} (σ_s σ_s^T)_{ij} ∂^2 ψ/∂x_i ∂x_j ] = 0.

Trying the quadratic ansatz ψ(t, x) = x^T S_t x + a_t, this becomes

x^T Ṡ_t x + ȧ_t + min_u [ x^T C_t x + u^T D_t u + ⟨H_t x + M_t u, 2 S_t x⟩ ] + tr[(σ_t σ_t^T) S_t] = 0.

Carrying out the minimization over u and collecting terms gives a matrix Riccati equation of the form

Ṡ_t + S_t A_t S_t + B_t S_t + C_t = 0

for appropriate matrices A_t and B_t, together with an ODE for a_t.
These two equations give you St and at such that we arrive at the optimal value.
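As a concrete scalar sketch, assume the standard linear-quadratic setup described above with dx = (h x + m u) dt + σ dB and cost ∫(c x^2 + d u^2) dt + r x_T^2. The corresponding scalar Riccati equation is S′ = −c − 2hS + (m^2/d)S^2 with S(T) = r, and it can be integrated backward numerically; all coefficient values here are illustrative assumptions:

# Backward integration of the scalar LQR Riccati equation
#   S'(t) = -c - 2 h S + (m^2 / d) S^2,   S(T) = r,
#   a'(t) = -sigma^2 S,                   a(T) = 0,
# giving the optimal feedback u*(t, x) = -(m S(t) / d) x and the value
# psi(t, x) = S(t) x^2 + a(t).
import numpy as np

h, m, sigma = -0.5, 1.0, 0.3      # dynamics dx = (h x + m u) dt + sigma dB
c, d, r = 1.0, 0.5, 2.0           # state, control, and terminal costs
T, dt = 1.0, 1e-4

S, a = r, 0.0                      # terminal conditions at t = T
for _ in range(int(T / dt)):       # march backward in time from T to 0
    dS = -c - 2 * h * S + (m**2 / d) * S**2
    da = -sigma**2 * S
    S -= dS * dt                   # step from t to t - dt
    a -= da * dt

print("S(0) =", S, "  a(0) =", a)
print("optimal feedback gain at t=0: u = ", -(m * S / d), "* x")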
7.4 Stochastic Filtering
The general problem involves a system of equations
(System)        dX_t = b(t, X_t) dt + σ(t, X_t) dU_t
(Observations)  dZ_t = c(t, X_t) dt + γ(t, X_t) dV_t
Ut : p-dim Brownian motion
Vt : r-dim Brownian motion
where Xt is the state, Zt is the observation, and Ut and Vt are two independent Brownian motions.
Assume F , G, C, D are bounded on bounded intervals, Z0 = 0.
In this problem, we seek to estimate the value of the system X_t based on the observations Z_s up to the present time s ≤ t, that is, conditional on G_t, the σ-algebra generated by {Z_s}_{0≤s≤t}. The best estimate X̂_t in the mean-square sense is the orthogonal projection of X_t onto

K_t := {Y : Ω → R^n : Y ∈ L^2(P) and Y is G_t-measurable},

with L^2(P) being the set of square-integrable functions with respect to the measure P.
Theorem: Let G_t ⊂ F_t be a sub-σ-algebra and let X ∈ L^2(P) be F_t-measurable. Let N = {Y ∈ L^2(P) : Y is G_t-measurable}. It follows that

P_N(X_t) = E[X_t|G_t]
where PN is the orthogonal projection of X onto N .
Proof: To prove that PN (Xt ) = E [Xt |Gt ], we simply need to show it satisfies the two properties
of the conditional expectation. Notice that it is trivial that PN (Xt ) is Gt -measurable since every
X ∈ N is Gt -measurable. Thus we just need to check the Partial Averaging Property. Since PN is
an orthogonal projection onto N , we get that X − PN (X) is orthogonal to N . Now take IA ∈ N
as an arbitrary indicator function for A ∈ G_t. This means that we define I_A as

I_A(ω) = 1 if ω ∈ A;  0 otherwise.
Since I_A ∈ N, X − P_N(X) is orthogonal to I_A. Thus we get from the Hilbert-space inner product that

⟨X − P_N(X), I_A⟩ = 0 = ∫_Ω (X − P_N(X)) I_A dp = ∫_A (X − P_N(X)) dp,

and thus the partial averaging property holds, completing the proof.
Now consider the linear (1-dimensional) filtering problem

(System)        dX_t = F(t) X_t dt + C(t) dU_t
(Observations)  dZ_t = G(t) X_t dt + D(t) dV_t

where X_t is the state, Z_t is the observation, and U_t and V_t are two independent Brownian motions. Assume F, G, C, D are bounded on bounded intervals, Z_0 = 0, and X_0 is normally distributed.
For this problem, we will simply outline the derivation for the Kalman-Bucy Filter and provide
intuition for what the derivation means (for the full proof, see Oksendal). The derivation proceeds
as follows:
Step 1: Show It's A Gaussian Process. Let L be the closure (the set including its limit points) in L^2(p) of the set of random variables that are linear combinations of the form

c_0 + c_1 Z_{s_1} + · · · + c_k Z_{s_k}

with s_j ≤ t and each c_j ∈ R. Let P_L be the projection from L^2(p) onto L. It follows that
X̂t = PL (Xt ).
We can interpret this step as saying that the best estimate for Xt can be written as a linear
combination of past values of Zt . Notice that since the variance term in the SDE is not dependent
on Z and X, the solution will be a Gaussian distribution. Since the sum of Gaussian distributions
is a Gaussian distribution, this implies that X̂ is Gaussian distributed! This gives the grounding for
our connection between estimating Xt from Zt by using Brownian motions and Gaussian processes.
Because this step is so important, we include a proof.
Theorem: Take X and Z_s, s ≤ t, to be random variables in L^2(p) and assume that (X, Z_{s_1}, . . . , Z_{s_n}) ∈ R^{n+1} has a normal distribution for all s_1, . . . , s_n ≤ t with n ≥ 1. It follows that

P_L(X) = E[X|G_t] = P_K(X).
Proof: Define X̌ = P_L(X) and X̃ = X − X̌, so that X̃ ⊥ L. We show that X̌ satisfies the defining properties of the conditional expectation, which gives P_L(X) = E[X|G_t] = P_K(X). We do this in steps:

1. If (y_1, . . . , y_k) ∈ R^k is normally distributed, then c_1 y_1 + . . . + c_k y_k is normally distributed. We leave out the proof that in the limit as k → ∞ such combinations remain normally distributed. Thus, since (X, Z_{s_1}, . . . , Z_{s_n}) is normally distributed, X̌ is normally distributed.

2. Since X̃ ⊥ L and each Z_{s_j} ∈ L, X̃ is orthogonal to each Z_{s_j}. Thus E[X̃ Z_{s_j}] = 0. Since the variables are jointly Gaussian, non-correlation implies independence, so X̃ is independent of Z_{s_1}, . . . , Z_{s_n}. Thus, denoting G_t as the σ-algebra generated by the Z_s, we get that X̃ is independent of G_t.

3. Take I_G to be the indicator function of an arbitrary event G ∈ G_t. Since X̃ = X − X̌, multiplying by the indicator and taking expectations gives

E[(X − X̌) I_G] = E[I_G X̃].

Since X̃ is independent of G_t,

E[(X − X̌) I_G] = E[I_G] E[X̃].

Since the constant function 1 is in L and X̃ ⊥ L, we have E[X̃] = E[X̃ · 1] = 0. Thus

E[(X − X̌) I_G] = 0,

which gives

∫_G X dp = ∫_G X̌ dp

for any G ∈ G_t. Thus the partial averaging property is satisfied, meaning
X̌ = E [X|Gt ]
completing the proof.
Step 2: Estimate Using a Gram-Schmidt-Like Procedure To understand this step, recall
the Gram-Schmidt Procedure. What the Gram-Schmidt procedure does is, given a countable set
of vectors, it finds a basis set of orthogonal vectors that will span the same space. It does this
by iteration. First, it takes the first vector as the first basis vector. Then it recursively does the following: take the next vector v, find the projection of this vector onto the current basis space by using the dot product (call it v_p), and then, knowing that v − v_p is orthogonal to the basis space, add v − v_p as a new basis vector.

Here we do a similar procedure. Since the Z_s used to span L are indexed by time, order them by time. We replace Z_t by the innovation process N_t defined as

N_t = Z_t − ∫_0^t (GX)^∧_s ds

where

(GX)^∧_s = P_{L(X,s)}(G(s)X(s)) = G(s)X̂(s),

or equivalently

dN_t = dZ_t − G(t)X̂_t dt = G(t)(X_t − X̂_t) dt + D(t) dV_t.

Note that (GX)^∧_s is basically the part of the observation already explained by the space spanned up to time s, and thus what we add to the basis is Z_t minus its projection onto that space. Thus this is a type of continuous version of the Gram-Schmidt procedure. We can prove that the following properties hold:
1. Nt has orthogonal increments: E [(Nt1 − Ns1 ) (Nt2 − Ns2 )] = 0 for every non-overlapping
[s1 , t1 ] and [s2 , t2 ]. So each time increment is orthogonal (since time is the basis of this
Gram-Schmidt-Like procedure).
2. L(N, t) = L(Z, t). That is simply that N and Z span the same space, as guaranteed by the
Gram-Schmidt process.
Step 3: Find the Brownian Motion. Define dR_t = dN_t / D(t). Using the non-overlapping independence property, we can show that R_t is actually a Brownian motion. Notice trivially that
L(N, t) = L(R, t) and thus the space spanned by this Brownian motion is sufficient for the estima-
tion of X̂. Therefore we get that
X̂_t = P_{L(R,t)}(X(t)) = E[X_t] + ∫_0^t (∂/∂s) E[X_t R_s] dR_s.
Carrying out this computation yields the Kalman-Bucy filter

dX̂_t = (F(t) − G^2(t)S(t)/D^2(t)) X̂_t dt + (G(t)S(t)/D^2(t)) dZ_t

where

X̂_0 = E[X_0]

and S(t) = E[(X_t − X̂_t)^2] satisfies the Riccati equation

S′(t) = 2F(t)S(t) − (G^2(t)/D^2(t)) S^2(t) + C^2(t)

S(0) = E[(X_0 − E[X_0])^2].
Note that, if we do not know the form of the equation, we can think of F , C, G, and D as unobservables
as well. Thus using a multidimensional version of the Kalman filter, we can iteratively estimate
the “constants” too! Thus, if we discretize this problem, we can estimate the future values by
estimating G, D, F , and C, and then using these to uncover X. Notice that since these constants
are changing, this type of a linear solution can approximate any non-linear interaction by simply
making the time-step small enough. Thus even though this is just the “linear approximation”, the
linear approximation can computationally solve the non-linear problem! For a detailed assessment
of how this is done, refer to Hamilton’s Time Series Analysis.
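To see the discretized idea in practice, here is a minimal scalar Kalman filter sketch. The discrete-time model x_{k+1} = F x_k + noise, z_k = G x_k + noise, and all parameter values are illustrative assumptions, not the notation of the continuous derivation above:

# Minimal scalar discrete-time Kalman filter for
#   x_{k+1} = F x_k + w_k,  w_k ~ N(0, Q)      (hidden state)
#   z_k     = G x_k + v_k,  v_k ~ N(0, R)      (observation)
import numpy as np

rng = np.random.default_rng(10)
F, G, Q, R = 0.95, 1.0, 0.1, 0.5
n = 200

# Simulate a hidden trajectory and noisy observations of it.
x_true = np.zeros(n)
z = np.zeros(n)
for k in range(1, n):
    x_true[k] = F * x_true[k - 1] + rng.normal(0.0, np.sqrt(Q))
    z[k] = G * x_true[k] + rng.normal(0.0, np.sqrt(R))

# Kalman filter: predict, then correct with the innovation z_k - G*x_pred.
x_hat, S = 0.0, 1.0                        # initial estimate and its variance
errors = []
for k in range(1, n):
    x_pred = F * x_hat                     # predict the state
    S_pred = F * S * F + Q                 # predict the error variance
    K = S_pred * G / (G * S_pred * G + R)  # Kalman gain
    x_hat = x_pred + K * (z[k] - G * x_pred)
    S = (1 - K * G) * S_pred
    errors.append(abs(x_hat - x_true[k]))

print("mean |x_hat - x_true|:", np.mean(errors))
print("mean |z/G  - x_true| :", np.mean(np.abs(z[1:] / G - x_true[1:])))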
8 Stochastic Calculus Cheat-Sheet
Basic Probability Definitions
Binomial Distribution ∼ Bin(n, p)
• Distribution function: P(X = k) = (n choose k) p^k (1 − p)^{n−k}

• Cumulative Distribution: P(X ≤ k) = Σ_{i≤k} P(X = i)

• Expectation: E[X] = Σ_{k=1}^n k P(X = k) = np
Useful Properties
Poisson Counter SDEs
dx = f(x, t) dt + Σ_{i=1}^m g_i(x, t) dN_i

N_i(t) ∼ Poisson(λ_i t)

E[N_i(t)] = λ_i t

∫_0^t dN_t = lim_{∆t→0} Σ_{i=0}^{n−1} (N(t_{i+1}) − N(t_i)),    t_i = i∆t

P(k jumps in the interval (t, t + dt)) = ((λ dt)^k / k!) e^{−λ dt}
Ito's Rule (Poisson counters)

dY_t = dψ(x, t) = (∂ψ/∂t) dt + (∂ψ/∂x) f(x) dt + Σ_{i=1}^m [ψ(x + g_i(x), t) − ψ(x, t)] dN_i

W_t ∼ N(0, t)

(dt)^n = dt if n = 1;  0 otherwise

dW_i^n dt^m = 0    (n, m ≥ 1)

dW_i × dW_j = dt if i = j;  0 if i ≠ j

∫_0^t g(X, t) dW_t = lim_{∆t→0} Σ_{i=0}^{n−1} g(X_{t_i}, t_i)(W_{t_{i+1}} − W_{t_i}),    t_i = i∆t

dy = dψ(x, t) = (∂ψ/∂t) dt + (∂ψ/∂x) dx + (1/2)(∂^2 ψ/∂x^2)(dx)^2
Ito's Rule (Wiener processes)

dy = dψ(x, t) = [∂ψ/∂t + f(x, t) ∂ψ/∂x + (1/2) Σ_{i=1}^n g_i^2(x, t) ∂^2 ψ/∂x^2] dt + (∂ψ/∂x) Σ_{i=1}^n g_i(x, t) dW_i

Kolmogorov Backward Equation

∂ρ/∂t + a(x(t), t) ∂ρ/∂x + (1/2) b^2(x(t), t) ∂^2 ρ/∂x^2 = 0
Fluctuation-Dissipation Theorem
Useful Properties
5. Independent Increments: E[(W_{t_1} − W_{s_1})(W_{t_2} − W_{s_2})] = 0 if [s_1, t_1] does not overlap [s_2, t_2].

6. E[∫_0^t h(s) dW_s] = 0.

7. Ito Isometry: E[(∫_0^T X_t dW_t)^2] = E[∫_0^T X_t^2 dt]
Simulation Methods
dx = f(x, t) dt + Σ_{i=1}^n g_i(x, t) dW_i,    η_i, λ_i ∼ N(0, ∆t), with η_i and λ_i independent

∆W_t = √(∆t) η_i,    ∆U_t = (1/√3) ∆t λ_i
Other Properties
Properties of Conditional Expectation
1. E[X|G] exists and is unique except on a set of measure 0.
(a) This can be generalized: For all A ⊂ R, if X ∈ A a.s., then E[X|G] ∈ A a.s.
5. Taking out what is known: If X is G-measurable, then E[XY|G] = X E[Y|G]. Notice that this is because if X is G-measurable, it is known given the information of G and thus can be treated as a constant.
6. Iterated Conditioning: If H ⊂ G ⊂ F, then E [E[X|G]|H] = E[X|H].