ChrisRackauckas IntuitiveSDEs
ChrisRackauckas IntuitiveSDEs
Abstract
Stochastic differential equations (SDEs) are a generalization of deterministic differential
equations that incorporate a “noise term”. These equations can be useful in many applications
where we assume that there are deterministic changes combined with noisy fluctuations. Ito’s
Calculus is the mathematics for handling such equations. In this article we introduce stochastic
differential equations and Ito’s calculus from an intuitive point of view, building the ideas from
relatable probability theory and only straying into measure-theoretic probability (defining all
concepts along the way) as necessary. All of the proofs are discussed intuitively and rigorously:
step by step proofs are provided. We start by reviewing the relevant probability needed in order
to develop the stochastic processes. We then develop the mathematics of stochastic processes in
order to define the Poisson Counter Process. We then define Brownian Motion, or the Wiener
Process, as a limit of the Poisson Counter Process. By doing the definition in this manner,
we are able to solve for many of the major properties and theorems of the stochastic calculus
without resorting to measure-theoretic approaches. Along the way, examples are given to show
how the calculus is actually used to solve problems. After developing Ito’s calculus for solving
SDEs, we briefly discuss how these SDEs can be computationally simulated in case the ana-
lytical solutions are difficult or impossible. After this, we turn to defining some relevant terms
in measure-theoretic probability in order to develop ideas such as conditional expectation and
martingales. The conclusion to this article is a set of four applications. We show how the rules
of the stochastic calculus and some basic martingale theory can be applied to solve problems
such as option pricing, genetic drift, stochastic control, and stochastic filtering. The end of this
article is a cheat sheet that details the fundamental rules for “doing” Ito’s calculus, like one
would find on the cover flap of a calculus book. These are the equations/properties/rules that
one uses to solve stochastic differential equations that are explained and justified in the article
but put together for convenience.
Contents
1 Introduction 6
1.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1
2 Probability Review 7
2.1 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Example 1: Bernoulli Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Example 2: Binomial Random Variable . . . . . . . . . . . . . . . . . . . . . 8
2.2 Probability Generating Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Moment Generating Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Continuous Time Discrete Space Random Variables . . . . . . . . . . . . . . . . . . . 12
2.4.1 The Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.1 The Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5.2 Generalization: The Multivariate Gaussian Distribution . . . . . . . . . . . . 15
2.5.3 Gaussian in the Correlation-Free Coordinate System . . . . . . . . . . . . . . 15
2.6 Gaussian Distribution in PDEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.7 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.8 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.9 Change of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.9.1 Multivariate Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.10 Empirical Estimation of Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2
4 Introduction to Stochastic Processes: Brownian Motion 34
4.1 Brownian Motion / The Wiener Process . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Understanding the Wiener Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Ito’s Rules for Wiener Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 A Heuristic Way of Looking at Ito’s Rules . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5 Wiener Process Calculus Summarized . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.5.1 Example Problem: Geometric Brownian Motion . . . . . . . . . . . . . . . . 40
4.6 Kolmogorov Forward Equation Derivation . . . . . . . . . . . . . . . . . . . . . . . . 41
4.6.1 Example Application: Ornstein–Uhlenbeck Process . . . . . . . . . . . . . . . 43
4.7 Stochastic Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.8 Fluctuation-Dissipation Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3
6.5 Martingale SDEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.5.1 Example: Geometric Brownian Motion . . . . . . . . . . . . . . . . . . . . . . 65
6.6 Application of Martingale Theory: First-Passage Time Theory . . . . . . . . . . . . 65
6.6.1 Kolmogorov Solution to First-Passage Time . . . . . . . . . . . . . . . . . . 66
6.6.2 Stopping Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.6.3 Reflection Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.7 Levy Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.8 Markov Processes and the Backward Kolmogorov Equation . . . . . . . . . . . . . . 67
6.8.1 Markov Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.8.2 Martingales by Markov Processes . . . . . . . . . . . . . . . . . . . . . . . . 68
6.8.3 Transition Densities and the Backward Kolmogorov . . . . . . . . . . . . . . 68
6.9 Change of Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.9.1 Definition of Change of Measure . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.9.2 Simple Change of Measure Example . . . . . . . . . . . . . . . . . . . . . . . 70
6.9.3 Radon-Nikodym Derivative Process . . . . . . . . . . . . . . . . . . . . . . . 70
6.9.4 Girsanov Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7 Applications of SDEs 72
7.1 European Call Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.1.1 Solution Technique: Self-Financing Portfolio . . . . . . . . . . . . . . . . . . 73
7.1.2 Solution Technique: Conditional Expectation . . . . . . . . . . . . . . . . . . 74
7.1.3 Justification of µ = r via Girsanov Theorem . . . . . . . . . . . . . . . . . . . 76
7.2 Population Genetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.2.1 Definitions from Biology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7.2.2 Introduction to Genetic Drift and the Wright-Fisher Model . . . . . . . . . . 77
7.2.3 Formalization of the Wright-Fisher Model . . . . . . . . . . . . . . . . . . . . 78
7.2.4 The Diffusion Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.2.5 SDE Approximation of the Wright-Fisher Model . . . . . . . . . . . . . . . . 79
7.2.6 Extensions to the Wright-Fisher Model: Selection . . . . . . . . . . . . . . . . 80
7.2.7 Extensions to the Wright-Fisher Model: Mutation . . . . . . . . . . . . . . . 81
7.2.8 Hitting Probability (Without Mutation) . . . . . . . . . . . . . . . . . . . . . 81
7.2.9 Understanding Using Kolmogorov . . . . . . . . . . . . . . . . . . . . . . . . 82
7.3 Stochastic Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.3.1 Deterministic Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.3.2 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.3.3 Stochastic Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.3.4 Example: Linear Stochastic Control . . . . . . . . . . . . . . . . . . . . . . . 85
7.4 Stochastic Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.4.1 The Best Estimate: E [Xt |Gt ] . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.4.2 Linear Filtering Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.5 Discussion About the Kalman-Bucy Filter . . . . . . . . . . . . . . . . . . . . . . . . 91
4
8 Stochastic Calculus Cheat-Sheet 92
5
Acknowledgements
This article was based on the course notes of Math 271-C, stochastic differential equations, taught
by Xiaohui Xie at University of California, Irvine. It is the compilation of notes from Shan Jiang,
Anna LoPresti, Yu Liu, Alissa Klinzmann, Daniel Quang, Hannah Rubin, Jaleal Sanjek, Kathryn
Scannell, Andrew Schaub, Jienian Yang, and Xinwen Zhang.
1 Introduction
Newton’s calculus is about understanding and solving the following equation:
d dx
g(x) = g ′ (x)
dt dt
The purpose of this paper is to generalize these types of equations in order to include noise.
Let Wt be the Wiener process (aka Brownian motion whose properties will be determined later).
We write a stochastic differential equation (SDE) as
dx = f (x)dt + σ(x)dWt
which can be interpreted as “the change in x is given by deterministic changes f with noise of
variance σ”. For these equations, we will need to develop a new calculus. We will show that
Newton’s rules of calculus will not hold:
Instead, we can use Ito’s calculus (or other systems of calculus designed to deal with SDEs). In
Ito’ s calculus, we use the following equation to find dg(x):
1
dg(x) = g′ (x)dx + g′′ (x)σ 2 (x)dt
2
1
= g′ (x)f (x)dt + g ′ (x)σ(x)dt + g′′ (x)σ 2 (x)dt
2
If we let ρ(x, t) be the distribution of x at time t, we can describe the evolution of this distribution
using the PDE known as the Kolmogorov equation:
∂ρ(x, t) ∂ 1 ∂2 2
= − [f (x)ρ(x, t)] + 2
σ (x)ρ(x, t) .
∂t ∂x 2 ∂x
The we can understand the time development of differential equations with noise terms by un-
derstanding their probability distributions and how to properly perform the algebra to arrive at
solutions.
6
1.1 Outline
This article is structured as follows. We start out with a review of probability that would be
encountered in a normal undergraduate course. These concepts are then used to build the basic
theory of stochastic processes and importantly the Poisson counter process. We then define the
Wiener process, Brownian motion, as a certain limit of the Poisson counter process. Using this
definition, we derive the basic properties, theorems, and rules for solving SDEs, known as the
stochastic calculus. After we have developed the stochastic calculus, we develop some measure-
theoretic probability ideas that will be important for defining conditional expectation, an idea
central to fucture estimation of stochastic processes and martingales. These properties are then
applied to systems that may be of interest to the stochastic modeler. The first of which is the
European option market where we use our tools to derive the Black-Scholes equation. Next we
dabble in some continuous probability models for population genetics. Lastly, we look at stochastic
control and filtering problems that are central to engineering and many other disciplines.
2 Probability Review
This chapter is a review of probability concepts from an undergraduate probability course. These
ideas will be useful when doing the calculations for the stochastic calculus. If you feel as though
you may need to review some probability before continuing, we recommend Durrett’s Elementary
Probability for Applications, or at a slightly higher level which is more mathematical, Grinstead
and Snell’s Introduction to Probability. Although a full grasp of probability is not required, it is
recommended that you are comfortable with most of the concepts introduced in this chapter.
We defineE [X] as the mean of X, or the expectation of X. To find E [X], we take all the possible
values of X and weight them by the probability of X taking each of these values. So, for the
7
Bernoulli trial:
E [X] = Pr(H) · 1 + Pr(T ) · 0
=p
We use the probability distribution of X to describe all the probabilities of X taking each of its
values. There are only two possible outcomes in the Bernoulli trial and thus we can write the
probability distribution as
P (X = 1) = P (H) = p
.
P (X = 0) = P (T ) = 1 − p
Define V [X] as the variance of X. This is a measure of how much the values of X diverge from
the expectation of X on average and is defined as
h i
V [X] = σx2 = E (X − E [X])2 = E X 2 − E [X]2 .
S = {HH · · · H, T T · · · T, HT HT T · · · , ...}
or equivalently
S = {H, T }n
We may want to describe the how large the set S is, or the cardinality of the set S, represented by
|S|, as the “number of things in S”. For this example:
|S| = 2n
For each particular string of heads and tails, since each coin flip is independent, we can calculate
the probability of obtaining that particular string as
8
So for instance, P (HH · · · H) = pn . Say we want to talk about the probability of getting a certain
number of heads in this experiment. Then let X be the random variable for the number of heads.
We can describe the range(X) as the possible values that X can take: X ∈ {0, 1, 2, ..., n}. Note:
using “∈” is an abuse of notation since X is actually a function. Recall that the probability
distribution of X is
n k
Pr(X = k) = p (1 − p)n−k
k
Using this probability distribution, we can calculate the expectation and variance of X. The
expectation is
Xn Xn
n k
E [X] = µx = Pr(X = k) · k = p (1 − p)n−k · k
k
k=0 k=0
= n · p,
while the variance is
h i
V [X] = E (X − p)2
n
X
= Pr(X = k)(k − np)2
k=0
= n · p(1 − p).
Note that if Xi is the Bernoulli random variable associated with the ith coin toss, then
n
X
X= Xi .
i=1
Using these indicator variables we can compute the expectation and variance for the Binomial trial
more quickly. In order to do so, we use the following facts
E [aX + bY ] = aE [X] + bE [Y ]
and
V [aX + bY ] = a2 V [X] + b2 V [Y ] if X and Y are independent
to easily compute
n
X n
X
E [X] = E [Xi ] = p = np
i=1 i=1
9
and
n
X n
X
V [X] = V [Xi ] = p(1 − p) = np(1 − p)
i=1 i=1
The probability generating function is special because it gives an easy way to solve for the proba-
bility that the random variable equals a specific value. To do so, notice that
G(k) (0) = k! pk ,
10
and thus
G(k) (z)|z=0
Pr(X = k) = ,
k!
that is, the kth derivative of G evaluated at 0 gives a straight-forward way of finding p. Thus if
we can solve for G then we can recover the probability distribution (which in some cases may be
simple).
Thus we can set t = 0 and evaluate the derivative of the moment generating function at this point.
This gives us the following:
11
2.4 Continuous Time Discrete Space Random Variables
We can generalize these ideas to random variables with infinite time. To go from discrete to
continuous time, start with a line segment divided into intervals of size ∆t. Then let the interval
t
size ∆t → 0. The number of intervals is n = ∆t ., so we can also think of this as letting the number
of intervals n → ∞.
Let’s do this for the Binomial random variable. We can think of a continuous time random
variable version in the following way: there is a coin toss within each interval, the probability of
a “success” (the coin lands on heads) within each interval is λ∆t. Define X to be the number of
successes within the interval (0,t). Thus the probability of k successes in the interval (0, t) is
t
t
Pr(X = k) = ∆t (λ∆t)k (1 − λ∆t) ∆t −k
k
The probability generating function can then be written as follows:
12
1. Events are independent of each other
(a) Within a small interval ∆t, the probability of seeing one event is λ∆t
3. A special property of the Poisson distribution is that the expectation and variance are the
same:
where e−λt is the probability that no event occurs before t and λ∆t is the probability that the event
occurs in the (t, t + ∆t) time window. This is an exponential distribution:
f (t) = λe−λt
13
2.5.1 The Gaussian Distribution
Take a random variable X. We denote that X is Gaussian distributed with mean µ and variance
σ 2 squared by X ∼ N (µ, σ 2 ). This distribution can be denoted by the following properties:
(x−µ) 2
• Density function: ρ(X) = √ 1
2
e− 2σ2
2πσ
Ra
• Cumulative distribution function: P (X ≤ a) = −∞ ρ(x)dx
R∞
• Expectation: E [X] = −∞ xρ(x)dx = µ
R∞
• Variance: V [X] = −∞ (x − µ)2 ρ(x)dx = σ 2
The way to read this is that the probability that X will take a value within a certain interval is
given by P {x ∈ [x, x + dx]} = ρ(x) · dx.
Recall that the nth moment of X is defined as
Z ∞
p
E [x ] = xp ρ(x)dx.
−∞
Z ∞
p
E [x ] = xp ρ(x)dx
Z−∞
∞
1 x2
= xp √ e− 2σ2
−∞ 2πσ 2
Z ∞
1 x2
= √ xp−1 xe− 2σ2 dx
2πσ 2 −∞
x2
To solve this, we use integration by parts. We let u = xp−1 and dv = xe− 2σ2 . Thus du =
x2
(p − 1)xp−2 and v = −σ 2 e− 2σ2 . Therefore we see that
∞ Z ∞
σ2 x2 x2
E [xp ] = √ [(xp−1 e− 2σ2 ) + e− 2σ2 (p − 1)xp−2 dx].
2πσ 2 −∞ −∞
Notice that the constant term vanishes at both limits. Thus we get that
Z
p σ 2 (p − 1) ∞ − x22 p−2
E [x ] = √ e 2σ x dx
2πσ 2 −∞
= σ 2 (p − 1)E xp−2
14
Thus, using the base cases of the mean and the variance, we have a recursive algorithm for
finding all of the further variances. Notice that since we assumed µ = 0, we get that E [xp ] = 0 for
every odd p. For every even p, we can solve the recursive equation to get
p2
p σ2 p! p
E [x ] = p = (p − 1)!!σ
2 2 !
where the double factorial a!! means to multiply only the odd numbers from 1 to a.
and thus, since Σ gives the variance in each component and the covariance between components,
it is known as the Variance-Covariance Matrix.
since det(U ) = 1 (each eigenvector is of norm 1). Noting that the eigenvector matrix satisfies the
property U T = U −1 we can define a new coordinate system y = U x and substitute to get
Y n 2
1 1 T T T 1 1 T 1 y
− i
e− 2 x U U Σ U U x = p e− 2 y Λy =
−1
ρ(y) = p p e 2λi
(2π)n det Σ (2π)n det Λ i=1
(2πλi )
15
where λi is the ith eigenvalue. Notice that in the y-coordinate system, each of the components are
uncorrelated. Because this is a Gaussian distribution, this implies that each component of y is a
Gaussian random variance yi ∼ N (0, λi ).
∂p(x, t) 1 ∂ 2 p(x, t)
=
∂t 2 ∂x2
p(x, 0) = ψ(x)
Suppose we are in n dimensional space and matrices Q = QT are both positive and finite, and that
Qi,j are the entries of Q. Thus look at the PDE
∂p(x, t) 1 ∂ ∂
= Σni,j=1 qi,j p(x, t)
∂t 2 ∂xi ∂xj
1
= ∇p(t, x)T Q∇p(x, t)
2
The solution to this PDE is a multivariate Gaussian distribution
1
p(x, t) = p exp(−xT (2Qt)−1 x)
det(Q)(2πt)n
where the covariance matrix, Σ = tQ scales linearly with t. This shows a strong connection between
the heat equation (diffusion) and the Gaussian distribution, a relation we will expand upon later.
2.7 Independence
Two events are defined to be independent if the measure of doing both events is their product.
Mathematically we say that two events P1 and P2 are independent if
16
This definition can be generalized to random variables. By definition X and Y are independent
random variables if ρ(x, y) = ρx (x)ρy (y) which is the same as saying that the joint probability
density is equal to the product of the marginal probability densities
Z ∞
ρx (x) = ρ(x, y)dy,
−∞
µ(P1 ∩ P2 )
µ(P1 |P2 ) =
µ(P2 )
An important theorem here is Bayes’s rule. Bayes’ rule can also be referred to as “the flip-flop
theorem”. Let’s say we know the probability of P1 given P2 . Bayes’ theorem let’s us flip this
around and calculate the probability of P2 given P1 (note: these are not necessarily equal! The
probability of having an umbrella given that it’s raining is not the same as the probability of raining
given that you have an umbrella!). Mathematically, Bayes’ rule is the equation
µ(P1 |P2 )µ(P2 )
µ(P2 |P1 ) =
µ(P1 )
Bayes’s rule also works for the probability density functions. For the probability density functions
of random variables X and Y we get that
p(y|x)p(x)
p(x|y) = .
p(y)
17
or
p(x)|dx| = ρ(y)|dy|
and thus
dy ρ(y)
ρ(x) = ρ(y)| |= ′
dx |φ (y)|
where y = φ−1 (x).
since computationally this uses addition of the probabilities instead of the product (which in some
ways can be easier to manipulate and calculate). It can be proven that an equivalent estimator for
θ is
18
3 Introduction to Stochastic Processes: Jump Processes
In this chapter we will develop the ideas of stochastic processes without “high-powered mathemat-
ics” (measure theory). We develop the ideas of Markov processes in order to intuitively develop
the ideas of a jump process where the probability of jumps are Poisson distributed. We then use
this theory of Poisson jump processes in order to define the Brownian motion and prove its prop-
erties. Thus by the end of this chapter we will have intuitively defined the SDE and elaborated its
properties.
• Independent increment: Give a time interval τ form time point t, the probability of k things
happen in this interval does not depend on the time before:
• In a Poisson process, the probability k events happened before time t satisfies the Poisson
distribution: N (t) ∼ P oisson(λt),
(λt)k −λt
Pr(N (t) = k) = e
k!
That is to say: one can make predictions for the future of the process based solely on its present
state just as well as one could knowing the process’s full history. For example, if weather is a Markov
process, then the only thing that I need to know to predict the weather tomorrow is the weather
today. Note that the Poisson counter is a Markov process. This is trivial given the independent
increment part of the definition.
19
3.4 Time Evolution of Poisson Counter
To solve for the time evolution of the Poisson counter, we will instead of looking at a single trajectory
look at an ensemble of trajectories. Think of the ensemble of trajectories as an “amount” or
concentration of probability fluid that is flowing from one state to another. For a Poisson counter,
the flow is the average rate λ. Thus look at state i is the particles that have jumped i times. The
flow out of state i is λ times the amount of probability in state i, or λpi (t). The flow into the state
is simply the flow out of i − 1, or λpi−1 (t). Thus the change in the amount of probability at state
i is given by the differential equation
dpi (t)
= −λpi (t) + λpi−1 (t).
dt
or
pi (t + ∆t) − pi (t) = λ∆tpi−1 (t) − λ∆tpi (t).
To Solve this, define p(t) as the infinite vector (not necessarily a vector because it is countably
infinite but the properties we use here of vectors hold for a rigged basis) where of pi (t). Thus we
note that
ṗ(t) = Ap(t)
where
−λ
λ −λ
A=
.. .
λ .
..
.
To solve this, we just need to solve the cascade of equations. Notice that
p1 (t) = λte−λt .
20
To see this is the general solution, simply plug it in to see that it satisfies the differential equation.
Because
ṗ(t) = Ap(t),
is a linear system of differential equations, there is a unique solution. Thus our solution is the
unique solution.
ṗ(t)
= −2λpi (t) + λpi−1 (t) + λpi+1 (t).
dt
We assume that all of the probability starts at 0: pi (0) = δi0 . Notice that this can be written as
the system
.. .. ..
. . .
ṗ(t) = λ −2λ λ p(t) = Ap(t)
.. .. ..
. . .
where A is a tridiagonal and infinite in both directions. To solve for this, we use the probability
generating function. Define the probability generating function as
X∞ h i
g(t, z) = z i pi (t) = E z x(t)
i=−∞
where the summation is the Laurent series, which is the sum from 0 to infinity added with the sum
from -1 to negative infinity. Thus we use calculus and algebra to get
∞
X
∂g
= z i ṗi (t)
∂t
i=−∞
X∞
= z i [λpi−1 (t) + λpi+1 − 2λpi (t)]
i=−∞
X∞ ∞
X ∞
X
i i
= λ z pi−1 (t) + λ z pi+1 (t) − 2λ z i pi (t).
i=−∞ i=−∞ i=−∞
Notice that since the sum is infinite in both directions, we can trivially change the index and adjust
the amount of z appropriately, that is
∞
X ∞
X
z i pi−1 (t) = z z i pi (t) = zg(t, z)
i=−∞ i=−∞
21
∞
X ∞
i 1 X i 1
z pi+1 (t) = z pi (t) = g(t, z)
z z
i=−∞ i=−∞
and thus
∂g λ
= λz + − 2λ g(t, z).
∂t z
This is a simple linear differential equation which is solved as
−1 −2)t
g(t, z) = eλ(z+z .
g(k) (t, k)
pi (t) = |z=0 .
k!
Thus we can show by induction using this formula with our closed form solution of the probability
generating function that
∞
X (2λt)2m
pn (t) = e−2λt 2m m!(n + m)!
= e−2λt In (2λt)
m=0
2
This definition means that the probability of transition to another state simply depends on where
you are right now. This can be represented as a graph where your current state is the node i. The
probability of transition from state i to state j is simply Pij , the one-step transition probability.
Define
P1 (t)
−−→
P (t) = P (t) = ...
Pn (t)
22
as the vector of state probabilities. The way to interpret this is as though you were running many
simulations, then the ith component of the vector is the percent of the simulations that are currently
at state i. Since the transition probabilities only depend on the current state, we can write the
iteration equation
P (t + 1) = AP (t)
where
P11 P12 ··· P1n
P21 P22 ··· P2n
A= . .. .. .. .
.. . . .
Pn1 Pn2 · · · Pnn
Notice that this is simply a concise way to write the idea that at every timestep, Pi,j percent of
the simulations at state i transfer to state j. Notice then that the probabilities of transitioning at
any given time will add up to 1 (the probability of not moving is simply Pi,i ).
Definition: A Regular Markov Chain as a Markov Chain for which some power n of its
transition matrix A has only positive entries.
P ′ (t) = P (t)Q
where Q is a transition rate matrix which used to describe the flow of probability juices from state
i to the state j. Notice that
P ′ (t) = P (t)Q
P ′ (t + h) − P ′ (t)
= P ′ (t)Q
h
P (t + h) = (I + Qh)P (t)
P (t + h) = AP (t)
and thus we can think of a continuous-time Markov chain as a discrete-time Markov chain with
infinitely small timesteps and a transition matrix
A = I + Qh
Note that the transition rate matrix Q satisfies the following properties:
23
1. Transition flow between state i and j qij > 0 when i 6= j;
Property 1 is stating that the transition rate matrix is composed only “rates from i” and thus they
are all positive values. Property 2 is restating the relation between Q and A. Property 3 is stating
that the diagonal of Q is composed of the flows into the state i, and thus it will be a negative
number. One last property to note is that since
P ′ (t) = P (t)Q
we get that
P (t) = P (0)eQt
where eQt is the matrix exponential (defined by its Taylor Series expansion being the same as the
normal exponential). This means that there exists a unique solution to the time evolution of the
probability densities. This is an interesting fact to note: even though any given trajectory evolves
randomly, the way a large set of trajectories evolve together behaves deterministically.
0 = QP (t) = λ
since 0 is just a constant. Thus the vector P that satisfies this property is an eigenvector of Q. This
means that the eigenvector of Q corresponding to the eigenvalue of 0 gives the long run probabilities
of being in state i respectively.
24
3.8 The Differential Poisson Counting Process and the Stochastic Integral
We can intuitively define the differential Poisson Counting Process is the process that describes the
changes of a Poisson CounterNt (or equivalently N (t)). Define the differential stochastic process
dNt by its some integral Z t
Nt = dNt .
0
To understand dNt , let’s investigate some of its properties. Since Nt ∼ P oisson(λt), we know that
Z t
E [Nt ] = λt = E dNt .
0
Since E is defined as some kind of a summation or an integral, we assume that dNt2 is bounded
which, at least in the normal calculus, lets us swap the ordering of integrations. Thus
Z t
λt = E [dNt ] .
0
and thus
E [dNt ] = λ.
Notice this is simply saying that dNt represents the “flow of probability” that is on average λ.
Using a similar argument we also note that the variance of dNt = λ. Thus we can think of the
equation the term dNt as a kind of a limit, where
is probability of jumping k times in the increment dt which is given by the probability distribution
(λdt)k −λdt
P (k jumps in the interval (t, t + dt)) = e
k!
Then how do we make sense of the integral? Well, think about writing the integral as a Riemann
sum:
Z t n−1
X
dNt = lim (N (ti+1 ) − N (ti ))
0 ∆t→0
i=0
where ti = i∆t. One way to understand how this is done is algorithmically/computationally. Let’s
say you wanted to calculate one “stochastic trajectory” (one instantiation) of N (t). What we can
25
do is pick a time interval dt. We can get N (t) by, at each time step dt, sample a value from the
probability distribution
(λdt)k −λdt
P (k jumps in the interval (t, t + dt)) = e
k!
and repeatedly add up the number of jumps such that N (t) is the total number of jumps at time
t. This will form our basis for defining and understanding stochastic differential equations.
where f is some arbitrary function describing deterministic changes in time where g defines the
“jumping” properties. The way to interpret this is as a time evolution equation for X. As we
increment in time by ∆t, we add f (X(t), t) to X. If we jump in that interval, we also add g(X(t), t).
The probability of jumping in the interval is given by
(λdt)k −λdt
P (k jumps in the interval (t, t + dt)) = e
k!
Another way of thinking about this is to assume that the first jump happens at a time t1 . Then
X(t) evolves deterministically until t1 where it jumps by g(X(t), t), that is
!
lim X(t) = g lim X(t), t + lim X(t).
t→t+
1 t→t−
1 t→t−
1
Notice that we calculate the jump using the left-sided limit. This is known as Ito’s calculus and
it is interpreted as the jump process “not knowing” any information about the future, and thus it
jumps using only previous information.
We describe the solution to the SDE once again using an integral, this time we write it as
Z t Z t
X(t) = X(0) + f (X(t), t)dt + g(X(t), t)dN (t).
0 0
Notice that the first integral is simply a deterministic integral. The second one is a stochastic
integral. It can once again be understood as a Riemann summation, this time
Z t n−1
X
g(X(t), t)dNt = lim g(X(ti ), ti ) (N (ti+1 ) − N (ti )) .
0 ∆t→0
i=0
We can understand the stochastic part the same as before simply as the random amount of jumps
that happen in the interval (t, t + dt). However, now we multiply the number of jumps in the
interval by g(X(ti ), ti ), meaning “the jumps have changing amounts of power”.
26
3.10 Important Note: The Defining Feature of Ito’s Calculus
It is important to note that g is evaluated using the X and t before the jump. This is the defining
principle of the Ito Calculus and corresponds to the “Left-Hand Rule” for Riemann sums. Unlike
Newton’s calculus, the left-handed, right-handed, and midpoint summations do not converge to the
same value in the stochastic calculus. Thus all of these different ways of summing up intervals in
order to solve the integral are completely different calculi. Notably, the summation principle which
uses the midpoints
Z t X
n−1
g(X(ti+1 ), ti+1 ) + g(X(ti ), ti )
g(X(t), t)dNt = lim (N (ti+1 ) − N (ti )) .
0 ∆t→0 2
i=0
is known as the Stratonovich Calculus. You may ask, why choose Ito’s calculus? In some sense, it
is an arbitrary choice. However, it can be motivated theoretically by the fact that Ito’s Calculus is
the only stochastic calculus where the stochastic adder g does not “use information of the future”,
that is, the jump sizes do not adjust how far they will jump given the future information of knowing
where it will land. Thus, in some sense, Ito’s Calculus corresponds to the type of calculus we would
believe matches the real-world. Ultimately, because these give different answers, which calculus
best matches the real-world is an empirical question that could be investigated itself.
dXt = Xt dt
and thus X(t) = et before t1 . At t1 , we take this value and jump by Xt1 . Thus since immediately
before the jump we have a value et1 , immediately after the jump we have et1 + et1 = 2et1 . We once
again it begins to evolve as
dXt = Xt dt
but now with the initial condition X(t1 ) = 2et1 . We see that in this interval the linear equation
solves to X(t) = 2et . Now when we jump at t2 , we have the value 2et2 and jump by 2et2 to get 4et2
directly after the jump. Seeing the pattern, we get that
et , 0 ≤ t ≤ t1
2et , t1 ≤ t ≤ t2
X(t) = . .
.. ..
n t
2 e , tn ≤ t ≤ tn+1
27
3.12 Ito’s Rules for Poisson Jump Process
Given the SDE
n
X
dx(t) = f (x(t), t)dt + gi (x(t), t)dNi
i=1
where x ∈ R, Ni is a Poisson counter with rate λi , f : Rn → Rn , gi : Rn → Rn . Define Y = ψ(X, t)
as some random variable whose values are determined as a function of X and t. How do we find
dy(t), the time evolution of y? There are two parts: the determinsitic changes and the stochastic
jumps. The deterministic changes are found using Newton’s calculus. Notice using Newtonian
calculus that
∂ψ ∂t ∂ψ ∂x ∂ψ ∂ψ ∂ψ ∂ψ
∆Deterministic = + = + (dx)deterministic = + f (x).
∂t ∂t ∂x ∂t ∂t ∂x ∂t ∂x
The second part are the stochastic changes due to jumping. Notice that if the Poisson counter
process i jumps in the interval, the jump will change x from x to x + gi . This means that y will
change from ψ(x, t) to ψ(x+gi (x), t). Thus the change in y due to jumping is the difference between
the two times the number of jumps, calculated as
n
X
∆Jumps = [ψ(x + gi (x)) − ψ(x, t)]dNi ,
i=1
where dNi is the number of jumps in the interval. This approximation is not correct if some process
jumps multiple times in the interval, but if the interval is of size dt ([t, t + dt]), then the probability
that a Poisson process jumps twice goes to zero as dt → 0. Thus this approximation is correct for
infinitesimal changes. Putting these terms together we get
28
3.13 Dealing with Expectations of Poisson Counter SDEs
Take the SDE
m
X
dx = f (x, t)dt + gi (x, t)dNi
i=1
where Ni is a Poisson counter with rate λi . Notice that since Ni (t) ∼ P oisson(λt), we get
Also notice that because Ni (t) is a Poisson process, the probability Ni (t) will jump in interval
(t, t + h) is independent of X(σ) for any σ < t. This mean that, since the current change is
independent of the previous changes, E [g(x(t), t)dNi (t)] = E [g(x(t), t)] E [dNi ] = λi E [g(x(t), t)].
Thus we get the fact that
m
X
E [x(t + h) − x(t)] = E [f (x, s)h] + E [g(x(t), t)dNi (t)]
i=1
Xm
= E [f (x, t)] h + E [gi (x, t)] λi h
i=1
m
X
E [x(t + h) − x(t)]
= E [f (x, t)] + λi E [gi (x, t)]
h
i=1
29
and thus we get that the expectation changes as
dE x2
= −2E x2 + (2E [x] + 1) λ1 + (1 − 2E [x]) λ2 .
dt
To solve this, we would first need to complete solving the ODE for E [x], then plug that solution
into this equation to get another ODE, which we solve. However, notice that this has all been
changed into ODEs, something we know how to solve!
30
where the jump size is proportional to √1 . Then we have
λ
dE [xλ (t)] λ λ
= √ − √ =0
dt 2 λ 2 λ
We use Ito’s Rules with g = ±1 and the Binomial theorem to get
dE xpλ (t) 1 p 1 p
= E x+ √ − xp dN1 + x− √ − xp dN2
dt λ λ
p 1 p 1 p 1 p 1
p p−1 p−2
= E x + √ x + x + . . . − xp dN1 + xp − √ xp−1 + xp−2 + . . . − xp dN2
1 λ 2 λ 1 λ 2 λ
p 1 p−2 p 1 p−1 p 1 p−3
= E x + . . . (dN1 + dN2 ) + √ x + √ x + . . . (dN1 − dN2 )
2 λ 1 λ 3 λ3
p 1 p−2 λ λ
= E x +
2 λ 2 2
p
p−2
= E x
2
where we drop off all the higher order λ1 terms since the will go to zero as λ → ∞. This means
that in the limit we get
dE[xp (t)] p(p − 1)
= E[xp−2 (t)]
dt 2
Thus as λ → ∞ we can ignore higher order terms to get all of the odd moments as 0 and the even
moments as:
d 2
1. dt E[x (t)] =1
p! t
p
2. E[xp (t)] = p
! 2
2
2
Let σ 2 = t Then all moments match, so as λ → ∞ the random variable x∞ (t) will be Gaussian
with mean 0 and variance t. Thus we can think of x∞ (t) as a stochastic process whose probability
distribution starts as a squished Gaussian distribution which progressively flattens linearly with
time.
31
and thus Xm
dE [ψ] ∂ψ
=E f (x, t) + E [ψ(x + gi (x), t) − ψ(x, t)] λi .
dt ∂x
i=1
Recall that the definition of the expected value is
Z ∞
E [ψ(x)] = ρ(x, t)ψ(x)dx
−∞
and thus
Z ∞ Z ∞ m
X Z ∞
∂ρ ∂ψ
ψ(x)dx = f (x, t)ρ(x, t)dx + λi [ψ(x + gi (x), t) − ψ(x, t)]ρ(x, t)dx
−∞ ∂t −∞ ∂x −∞i=1
We next simplify the equation term by term using integration by parts. What we want to get
is every term having a ψ(x) term so we can group all the integrals. Thus take the first integral
Z ∞
∂ψ
f ρdx.
−∞ ∂x
32
Z Z m Z ∞ Z ∞
−∞
∂p −∞
∂(f ρ) X ρ(g̃i−1 (x), t)
ψdx = − ψdx + λi [ ψ(x) dx − ψpdx].
∞ ∂t ∞ ∂x
i=1 −∞ |1 + gi′ (g̃i−1 (x)| −∞
Since ψ(x) we arbitrary, let ψ be the indicator for the arbitrary set A, that is
(
1, if x ∈ A
ψ(x) = IA (x) =
0 o.w.
for any A ⊂ R. Thus, in order for this to be satisfied for all subsets of the real numbers, the
integrand must be identically zero. This means
m
∂ρ ∂(f ρ) X ρ(g̃i−1 (x), t)
+ − λi −ρ =0
∂t ∂x |1 + gi′ (g̃i−1 (x)|
i=1
which we arrange as
m
∂p ∂(f p) X ρ(g̃i−1 (x), t)
=− + λi −ρ .
∂t ∂x
i=1
|1 + gi′ (g̃i−1 (x)|
This equation describes the time evolution of the probability density function ρ(x, t) via a deter-
ministic PDE.
33
4 Introduction to Stochastic Processes: Brownian Motion
In this chapter we will use the properties of Poisson Counting Processes in order to define Brow-
nian Motion and derive a calculus for dealing with stochastic differential equations written with
differential Wiener/Brownian terms.
where ti = i∆t Notice once again that we are evaluating g using the Left-hand rule as this is the
defining feature of Ito’s Calculus: it does not use information of the future.
34
Recall that this definition is defined by the bidirectional Poisson Counter. Can we then un-
derstand the interval as a number of jumps? Since since the rate of jumps is infinitely high, we
can think of this process as making infinitely many infinitely small jumps in every interval of time.
Thus we cannot understand the interval as the “number of jumps” because infinitely many will
occur! However, given the proof from 3.13.3, we get that W (t) ∼ N (0, t). Thus we can think of
(dWi+1 − dWi ) ∼ N (0, dt), that is, the size of the increment is normally distributed. Algorithmi-
cally solving the integral by taking a normally distributed random number with variance dt and
multiplying it by g to get the value of g(X, t)dWt over the next interval of time. Using Ito’s Rules
for Wiener Processes (which we will derive shortly) we can easily prove that
h i
1. E (W (t) − W (s))2 = t − s for t > s.
Notice that this means E[(W (t + ∆t) − W (t))2 ] = ∆t and thus in the limit as ∆t → 0, then
E[(W (t + ∆t) − W (t))2 ] → 0 and thus the Wiener process is continuous almost surely (with prob-
ability 1). However, it can be proven that it is not differentiable with probability 1! Thus dWt is
some kind of abuse of notation because the derivative of Wt does not really exist. However, we can
still use it to understand the solution to an arbitrary SDE
as the integral Z Z
t t
Xt = X0 + f (Xt , t)dt + g(Xt , t)dWt .
0 0
We now define Yt = ψ(x, t). Using Ito’s Rules for Poisson Jump Processes, we get that
Xn Xn
∂ψ ∂ψ(x, t) gi (x) gi (x)
dYt = dψ(x, t) = dt + f (x, t)dt + ψ x+ √ − ψ(x, t) dNi + ψ x− √ − ψ(x, t) dN−i .
∂t ∂x i=1 λ i=1 λ
35
To simplify this, we expand ψ by λ to get
gi (x) gi (x) g 2 (x) 3
ψ x+ √ = ψ(x) + ψ ′ (x) √ + ψ ′′ (x) i + O(λ− 2 )
λ λ λ
gi (x) gi (x) g 2 (x) 3
ψ x− √ = ψ(x) − ψ ′ (x) √ + ψ ′′ (x) i + O(λ− 2 )
λ λ λ
and thus
Xn Xn
∂ψ ∂ψ(x, t) gi (x) 1
dψ(x) = dt + f (x, t)dt + ψ′ (x) √ (dNi − dN−i ) ψ′′ (x)gi2 (x) (dNi + dN−i ) .
∂t ∂x i=1 λ i=1
λ
Let us take a second to justify dropping off the higher order terms. Recall that the Wiener
3
process looks at the limit as λ → ∞. In the expansion, the terms dropped off are O(λ− 2 ). Recall
that
λ
E [dNi ] = .
2
Thus we see that the expected contribution of these higher order terms is
h 3
i 1
lim E O(λ− 2 )dNi = lim O(λ− 2 ) = 0.
λ→∞ λ→∞
λ
V [dNi ] = E [dNi ] = .
2
Thus we get that h i
3 1
lim V O(λ− 2 )dNi = lim O(λ− 2 ) = 0.
λ→∞ λ→∞
Therefore, since the variance goes to zero, the contribution of these terms are not stochastic. Thus
the higher order terms deterministically make zero contribution (in more rigorous terms, they make
no contribution to the change with probability 1). Therefore, although this at first glace looked
like an approximation, we were actually justified in only taking the first two terms of the Taylor
series expansion.
Sub-Calculation
To simplify this further, we need to do a small calculation. Define
1
dzi = (dNi + dN−i ) .
λ
36
Notice that
dE [zi ] 1 λ λ
= + =1
dt λ 2 2
which means
E [zi ] = t.
Using Ito’s rules
!
1 2
dzi2 = zi + 2
− zi (dNi + dN−i )
λ
2zi 1
= + 2 (dNi + dN−i )
λ λ
and thus
dE zi2 2E [zi ] 1 λ λ
= + 2 +
dt λ λ 2 2
2t 1
= + 2 λ
λ λ
1
= 2t +
λ
to make
t
E zi2 = t2 + .
λ
This means that
t t
V [zi ] = E zi2 − E [zi ]2 = t2 + − t2 = .
λ λ
Thus look at
Z = lim zi .
λ→∞
37
Solution for Ito’s Rules
Return to
n n
g 2 (x)
X gi (x) X gi (x) 1
dψ(x) = ψ ′ (x)(f (X, t)dt+ ψ ′ (x) √ (dNi − dN−i )+ ψ ′′ (x)gi2 (x) −ψ ′ (x) √ + ψ ′′ (x) i (dNi + dN−i ) .
i=1 λ i=1 λ λ λ
Notice that the last term is simply zi . Thus we take the limit as λ → ∞ to get that
n
X n
X
dψ(x) = ψ ′ (x)(f (X, t)dt + ψ ′ (x)gi (x)dWt + ψ ′′ (x)gi2 (x)dt
i=1 i=1
where hx, yi is the dot product between x and y and ∇2 is the Hessian.
If we let
dt × dt = 0
dWi × dt = 0
dt : i = j
dWi × dWj =
0 : i 6= j
then this simplifies to
n
! n
∂ψ ∂ψ 1 X 2 ∂2ψ ∂ψ X
dψ(x, t) = + f (x, t) + gi (x, t) 2 dt + gi (x, t)dWi
∂t ∂x 2 ∂x ∂x
i=1 i=1
38
which is once again Ito’s Rules. Thus we can think of Ito’s Rules is saying that dt2 is sufficiently
small, dt and dWi are uncorrelated, and dWi2 = dt which means that the differential Wiener process
squared is a deterministic process. In fact, we can formalize this idea as the defining property of
Brownian motion. This is captured in Levy Theorem which will be stated in 6.7.
where Wi (t) is a standard Brownian motion. We have showed that Ito’s Rules could be interpreted
as:
dt × dt = 0
dWi × dt = 0
dt : i = j
dWi × dWj =
0 : i 6= j
and thus if y = ψ(x, t), Ito’s rules can be written as
∂ψ ∂ψ 1 ∂2ψ
dy = dt + dx + (dx)2
∂t ∂x 2 ∂x2
where, if we plug in dx, we get
n
! n
∂ψ ∂ψ 1 X 2 ∂2ψ ∂ψ X
dy = dψ(x, t) = + f (x, t) + gi (x, t) 2 dt + gi (x, t)dWi
∂t ∂x 2 ∂x ∂x
i=1 i=1
Note that we can also generalize Ito’s lemma to the multidimensional X ∈ Rn case:
m
X m
∂ψ ∂ψ 1X
dψ(X) = , f (X) dt + , gi (X) dWi + gi (X)T ∇2 ψ(X)gi (X)dt
∂X ∂X 2
i=1 i=1
There are many other facts that we will state but not prove. These are proven using Ito’s Rules.
They are as follows:
39
1. Product Rule: d(Xt Yt ) = Xt dY + Yt dX + dXdY .
Rt Rt Rt
2. Integration By Parts: 0 Xt dYt = Xt Yt − X0 Y0 − 0 Yt dXt − 0 dXt dYt .
h i
3. E (W (t) − W (s))2 = t − s for t > s.
5. Independent Increments: E [(Wti − Ws1 ) (Wt2 − Ws2 )] = 0 if [t1 , s1 ] does not overlap [t2 , s2 ].
hR i
t
6. E 0 h(t)dWt = E [h(t)dWt ] = 0.
hR i hR i
T T
7. Ito Isometry: E 0 Xt dWt =E 0 Xt2 dt
dx = αxdt + σxdW.
To solve this, we start with our intuition from Newton’s Calculus that this may be an exponential
growth process. Thus we check Ito’s equation on the logarithm ψ(x, t) = ln(x) for this process is,
1 1 2 2 1 1
d (ln x) = 0 + (αx) − σ x 2
dt + (σx) dW
x 2 x x
1 2
d (ln x) = α − σ dt + σdW.
2
Notice that since the Wiener process W (t) ∼ N (0, t), the log of x is distributed normally as
N ( α − 21 σ 2 t, σ 2 t). Thus x(t) is distributed as what is known as the log-normal distribution.
40
4.6 Kolmogorov Forward Equation Derivation
The Kolmogorov Forward Equation, also known to Physicists as the Fokker-Planck Equation, is
important because it describes the time evolution of the probability density function. Whereas the
stochastic differential equation describes how one trajectory of the stochastic processes evolves, the
Kolmogorov Forward Equation describes how, if you were to be running many different simulations
of the trajectory, the percent of trajectories that are around a given value evolves with time.
We will derive this for the arbitrary drift process:
m
X
dx = f (x)dt + gi (x)dWi .
i=1
Take the expectation of both sides. Because expectation is a linear operator, we can move it inside
the derivative operator to get
"m #
d ∂ψ 1 X ∂2ψ
E [ψ(x)] = E f (x) + E g2 (x) .
dt ∂x 2 ∂x2 i
i=1
Notice that E [dWi ] = 0 and thus the differential Wiener terms dropped out.
Recall the definition of expected value is
Z ∞
E[ψ(x)] = ρ(x, t)ψ(x)dx
−∞
where ρ(x, t) is the probability density of equaling x at a time t. Thus we get that the first term as
Z ∞
d ∂ρ
E [ψ(x)] = ψ(x)dx
dt −∞ ∂t
41
In order for the probability distribution to be bounded (which it must be: hit must integrate
i to 1),
∂ψ
ρ must vanish at both infinities. Thus, assuming bounded expectation, E ∂x f (x) < ∞, we get
that
Z ∞
∂ψ ∂(ρf )
E f (x) = − ψ(x)dx.
∂x −∞ ∂x
The next term we manipulate similarly,
Z ∞
∂ψ 2 2 ∂2ψ 2
E g (x) = g (x)ρ(x, t)dx
∂x2 i −∞ ∂x
2 i
∞ Z ∞
2 ∂ψ ∂ gi2 (x)ρ(x, t) ∂ψ
= ρg − dx
∂x −∞ −∞ ∂x ∂x
∞ " #∞ Z
2 ∂ψ ∂ ρg2 1 ∞ ∂ 2 gi2 (x)ρ(x, t)
= ρg − ψ + ψ(x)dx
∂x −∞ ∂x 2 −∞ ∂x2
−∞
Z ∞ 2 2
1 ∂ gi (x)ρ(x, t)
= 2 ψ(x)dx
−∞ ∂x2
where we note that, at the edges, the derivative of ρ converges to zero since ρ converges to 0 and
thus the constant terms vanish. Thus we get that
Z ∞ Z ∞ Z
∂ρ ∂(ρf ) 1 X ∞ ∂(gi2 ρ)
ψ(x)dx = − ψ(x)dx + ψ(x)dx
−∞ ∂t −∞ ∂x 2 −∞i
∂x
Since ψ(x) is arbitrary, let ψ(x) = IA (x), the indicator function for the set A:
(
1, x ∈ A
IA (x) =
0 o.w.
42
which we re-arrange as
m
∂ρ(x, t) ∂ 1 X ∂2 2
= − [f (x)ρ(x, t)] + 2
gi (x)ρ(x, t) ,
∂t ∂x 2 ∂x
i=1
dx = −xdt + dWt ,
where Wt is Brownian motion. The Forward Kolmogorov Equation for this SDE is thus
∂ρ ∂ 1 ∂2
= (xρ) + ρ(x, t)
∂t ∂x 2 ∂x2
Assume that the initial conditions follow the distribution u to give
ρ(x, 0) = u(x)
and the boundary conditions are absorbing at infinity. To solve this PDE, let y = xet and apply
Ito’s lemma
dy = xet dt + et dx = et dW
and notice this follows has the Forward Kolmogorov Equation
∂ρ e2t ∂ 2 ρ
= .
∂t 2 ∂y 2
which is a simple form of the Heat Equation. If ρ(x, 0) = δ(x), the Dirac-δ function, then we know
2t
this solves as a Gaussian with diffusion constant e2 to give us
1 y2
ρ(y, t) = √ e− 2e2t t .
2πe2t t
and thus y ∼ N (0, e2t t). To get the probability density function in terms of x, we would simply do
the pdf transformation as described in 2.9.
dx = axdt + dWt .
43
Notice that
dE [x]
= aE [x]
dt
and thus
E [x] = E [x0 ] eat
which converges iff a < 0. Also notice that
dx2 = 2xdx + dt
= 2x(axdt + dWt ) + dt
= (2ax2 + 1)dt + dWt
and thus
dE x2
= 2aE x2 + 1
dt
which gives
2 2 1
E x = E x0 + e2at
2a
which converges iff a < 0. This isn’t a full proof, but it motivates the idea that if the deterministic
coefficient is less than 0, then, just as in the deterministic case, the system converges and is thus
stable.
44
can be found at the steady state using the formula
and thus
x(t + ∆t) = x(t) + f (x, t)∆t
defines a recursive solution for the value of x at a time t given values of x at previous times. All
that is left for the approximation is some initial condition needs to be started, such as x(0) = y
and thus problem is solved iteratively.
Now we take the stochastic differential equation
45
Once again, we define some small fixed constant ∆t and write
46
We can think of a different type of convergence, known as the weak convergence, as
Notice how different these ideas of convergence are. Weak convergence means that the average
trajectory we computationally simulate does ∆tβ good, where as strong convergence means that
every trajectory does ∆tγ good. Thus if we are looking at properties defined by ensembles of
trajectories, then weak convergence is what we are looking at. However, if we want to know how
good a single trajectory is, we have to look at the strong convergence.
Here comes the kicker. For the Euler-Maruyama method, it has a strong convergence of order
1
2 and a weak convergence of order 1. That’s to say it has a slower convergence than the Euler
method, and the average properties only converge as fast as the Euler method! Thus, in practice,
this method is not very practical given its extremely slow convergence.
Notice that we have added a − 21 g(X, t)gx (X, t)dt term in order to cancel out the 12 g(X, t)gx (X, t) (dWt )2
term in the stochastic Taylor series expansion. Notice too that we only sample one random number,
but the second order term uses that random number squared. This method is known as Milstein’s
method. As you may have guessed, since it now accounts for all order 1 effects, it has order 1 strong
and weak convergence.
47
∆t
X(t + ∆t) = X(t) + f ∆t + g∆Wt + ggx (∆Wt )2 − ∆t
2
1 1 2
+ gfx ∆Ut + f fx + g fxx ∆t2
2 2
1
+ f gx + g2 gxx (∆Wt ∆t − ∆Ut )
2
1 1 2
+ g (ggx )x (∆Wt ) − ∆t ∆Wt
2 3
where √
∆Wt = ∆tηi
and Z t+∆t Z s
∆Ut = dWs ds
t t
can be written as
1
∆Ut = ∆t3 λi
3
where λi ∼ N (0, 1), a standard normal random variable (that is not ηi !).
48
with stages
s
X
(0) (0) (0) (0)
Hi = Un + Aij f tn + cj ∆t, Hj ∆t (5.2)
j=1
s
X
(0) (1) (1) I(1,0)
+ Bij g tn + cj ∆t, Hj ,
∆t
j=1
s
X
(1) (1) (0) (0)
Hi = Un + Aij f tn + cj ∆t, Hj ∆t (5.3)
j=1
s
X √
(1) (1) (1)
+ Bij g tn + cj ∆t, Hj ∆t
j=1
2. αT B (0) e = 1 T 2
10. β (3) B (1) e = −1
2 3
3. αT B (0) e = 2 T 2
11. β (4) B (1) e = 2
T
4. β (1) A(1) e = 1 T
12. β (1) B (1) B (1) e = 0
5. β (2)T A(1) e = 0
T
T 13. β (2) B (1) B (1) e = 0
6. β (3) A(1) e = −1
T
T
7. β (4) A(1) e = 0 14. β (3) B (1) B (1) e = 0
T 2 T
8. β (1) B (1) e = 1 15. β (4) B (1) B (1) e = 1
49
1 (1)T (1) (0) 1 (3)T (1) (0)
16. β A B e + β A B e =0
2 3
These methods are the (Rößler) SRI methods. We will refer to the algorithms by the tuple of
44 coefficients A0 , B0 , β (i) , α . Note that this method can be easily extended to multiple Ito
dimensions in the case of diagonal noise with similar results. We only focus on a single Ito dimension
for simplicity of notation (though our results will extend to higher Ito dimensions as well in the
trivial manner). To satisfy the conditions, Rößler proposed the following scheme known as SRIW1:
T T
αT β (1) β (2)
T T
β (3) β (4)
T T
α̃T β̃ (3) β̃ (4)
3 3 3
4 4 2
0 0 0 0 0
0 0 0 0 0 0 0
1 1 1
4 4 2
1 1 0 -1 0
1 1 1
4 0 0 4 -5 3 2
1 2 4 2 4
3 3 0 0 -1 3 3 0 -1 3 − 13 0
2 − 43 − 32 0 2 5
3 − 23 1
1 1
2 2 0 0 0 0 0 0 0 0 0 0
50
In the case where noise is additive, the methods can be vastly simplified to
Xs s
X
(0) (0) (1) (2) I(1,0) (1)
Un+1 = Un + αi f tn + ci ∆t, Hi ∆t + βi I(1) + βi g(tn + ci ∆t) (5.4)
∆t
i=1 i=1
with stages
s
X s
X I
(0) (0) (0) (0) (0) (1) (1,0)
Hi = Un + Aij f tn + cj ∆t, Hj ∆t + Bij g tn + cj ∆t (5.5)
∆t
j=1 j=1
The coefficients A0 , B0 , β (i) , α must satisfy the conditions for order 1:
T T
1. αT e = 1 2. β (1) e = 1 3. β (2) e = 0
where c(0) = A(0) e with f ∈ C 1,3 (I × Rd , Rd ) and g ∈ C 1 (I, Rd ). These are the (Rößler) SRA
methods. From these conditions he proposed the following Strong Order 1.5 scheme known as
SRA1:
T T
αT β (1) β (2)
3 3 1
4 4 2
1 2
3 3
1 0 -1 1
51
estimate, it was shown that a natural error estimator exists for any high-order SRK method. A
simplified version is simply:
E = |δED + EN | (5.6)
Xs
(0) (0)
≤ δ ∆t f tn + ci ∆t, Hi
i=1
s
X
(3) I(1,0) (4) I(1,1,1) (1) (1)
+ βi + βi g tn + ci ∆t, Hi
∆t ∆t
i=1
where s is the number of stages and δ is a user-chosen balance between determinsitic and noise
error in the error estimate. A similar summation gives the estimate for additive noise equations.
With the error estimate, the overall algorithm is depicted as:
Accept if q≥1
Compute a step of Estimate the error
Use E to
size h using two E using the difference
calculate
different integration between the methods
q (Eq 21)
methods (e.g. Eq 2) (Sections 2 and 3) Reject if q<1
Update h:=q*h Save the information for Use the Brownian Bridge
the change of W over to determine a value
the interval (t+qh,t+h) for W(t+qh) (Eq 22)
52
For the acceptance/rejectance of the step, care must be taken to not bias the Wiener process. If
extreme values of ∆W are always thrown out then the sample properties are no longer valid. Thus
we must always keep any calculated value of ∆W . The procedure has to be enhanced as follows.
First, propose a step with ∆W P and ∆Z P for a timestep h. If these are rejected, we wish to instead
attempt a step of size qh. Thus we need to sample a new value at W (tn + qh) using the known
values of W (tn ) and W (tn +h). To do so, we use the result that if W (0) = 0 and W (h) = L, then by
the properties of the Brownian Bridge we calculate that for q ∈ (0, 1), W (qh) ∼ N (qL, (1 − q)qh).
We then propose to step by qh and take the random numbers ∆W = W (qh) and ∆Z = Z(qh)
found via their appropriate distribution from the Brownian bridge. We then store the modified
versions of ∆W P and ∆Z p . Notice that since we have moved ∆W in the qh timestep, what remains
is ∆W = ∆W P − ∆W and ∆Z = ∆Z P − ∆Z as the change in the Brownian path from qh to h.
We then store the values L = 1 − qh, ∆W , ∆Z as a 3-tuple in a stack to represent that after our
current calculations, over the next interval of size L1 , the Brownian process W will change by L2
and the process Z will change by L3 . Thus when we finally get to tn + qh, we look at these values
to tell us how the Brownian path changes over the next 1 − qh time units. By doing so, we will
effectively keep the properties of the Brownian path while taking arbitrary steps. This leads to the
RSwM1 algorithm. More complex handling of the timestep, but using the same general idea, leads
to the more efficient RSwM3 algorithm. Note that these stepping routines are compatible with any
high order SDE method as long as some error estimator exists.
This can thus be a way to solve for the probability density using computational PDE solvers. If we
have an initial condition, X(0) = x0 , then this corresponds to the initial condition ρ(x, 0) = δ(x−x0 )
where δ is the Dirac-δ function. This can be particularly useful for first-passage time problems,
where we can set the boundary conditions as absorbing: ρ(a, t) = ρ(b, t) = 0, and thus ρ is the
total probability distribution of trajectories that have not been absorbed. Likewise, if we want
a boundary condition where we say “trajectories reflect off the point a”, then we simply use a
∂ρ
condition ∂x |x=a = 0.
However, though this method may be enticing to those who are experienced with computational
PDE solvers, this method is not the holy grail because it cannot simulate trajectories. If you need
53
the trajectory of a single instantiation of the process, for example, “what does the stock price over
time look like?”, you cannot use this method.
1. ∅ ∈ F,
2. A ∈ F ⇒ AC ∈ F,
S
3. A1 , A2 , . . . ∈ F ⇒ ∞n=1 An ∈ F.
Propositions:
1. Ω ∈ F.
T∞
2. A1 , A2 , . . . ∈ F ⇒ n=1 An ∈ F.
To prove proposition 1, notice that since ∅ ∈ F, by property 2 the compliment of the empty set,
Ω, must be in F. Proposition two follows by DeMorgan’s Law. By property 3 the union is in F,
and so the union of the compliments are in F by applying property 2 to each component. Thus
the compliment of the union of the complements is in F by property 2. By DeMorgan’s Law,
the intersection of the complement of the complements, or simply the intersection, must be in F
proving the proposition.
Definition: Given Ω and F, a probability measure P is a function F → [0, 1] such that:
1. P (Ω) = 1
S P∞
2. A1 , A2 , , . . . is a sequence of disjoint subsets in F ⇒ P ( ∞n=1 An ) = i=1 P (An ) (Countable
additivity)
54
Definition: (Ω, F, P ) is a probability space. S T∞
Proposition: A1 ⊂ A2 . . . , ⊂ Ai , . . . ∈ F, then P ( ∞n=1 An ) = limn→∞ P (An ) and P ( n=1 An ) =
limn→∞ P (An ) .
These propositions come directly from the properties of measures. See a measure theory text
for more information if you do not know and you’re Curious George. There are many possibilities
for what kind of sample space, Ω, we may want to consider:
1. Ω can be finite.
P [a, b] = b − a, 0 ≤ a ≤ b ≤ 1.
whereΩ = [0, 1] and F is the set of all closed intervals of Ω. We note here that the Borel σ-algebra,
B[0, 1], which is F, also contains all open intervals as constructed by
∞
[
1 1
(a, b) = a + ,b − .
n n
n=1
Thus the Borel sets on the real line are the closed and opened sets! Thus, all of the sets you would
ever use in practice are in the Borel set σ-algebra.
55
Example: for the infinite coin-toss sample space Ω_∞ (sequences ω = ω_1 ω_2 · · · of H's and T's), we can build a filtration:

• F_0 = {∅, Ω_∞}, P(∅) = 0, P(Ω_∞) = 1

• F_1 = {∅, Ω_∞, A_H, A_T} where

  – A_H = {ω ∈ Ω_∞ : ω_1 = H}

  – A_T = {ω ∈ Ω_∞ : ω_1 = T}

• F_2 = {∅, Ω_∞, A_H, A_T, A_HH, A_TT, A_HT, A_TH, A_HH^C, A_TT^C, A_HT^C, A_TH^C, A_HH ∪ A_TT, . . . , A_HT ∪ A_TH, . . .} where

  – A_HH = {ω ∈ Ω_∞ : ω_1 = H, ω_2 = H}

  – A_HT = {ω ∈ Ω_∞ : ω_1 = H, ω_2 = T}

  – . . .

We can see that F_0 ⊂ F_1 ⊂ F_2 ⊂ · · · and that the cardinality can grow very quickly; in fact |F_n| = 2^(2^n). We can define

F_∞ = σ(⋃_{i=1}^∞ F_i).

Notice that

• Question 1: Is A ∈ F_∞?

  ω ∈ A ⟺ ω ∈ ⋂_{m=1}^∞ ⋃_{N=1}^∞ ⋂_{n=N}^∞ A_{n,m}, so A = ⋂_{m=1}^∞ ⋃_{N=1}^∞ ⋂_{n=N}^∞ A_{n,m}. Since each A_{n,m} ∈ F_∞ and F_∞ is closed under countable unions and intersections, A ∈ F_∞.

• Question 2: What is P(A)? By the Strong Law of Large Numbers, P(A) = 1.
6.2 Random Variables and Expectation
Definition: A random variable is a real valued function X : Ω → R ∪ {∞, −∞} with the property that for every Borel subset B of R, {X ∈ B} = {ω ∈ Ω : X(ω) ∈ B} ∈ F.

Definition: A measure is a nonnegative countably additive set function; that is, a function µ : F → [0, ∞] with µ(∅) = 0 and µ(⋃_{n=1}^∞ A_n) = Σ_{n=1}^∞ µ(A_n) for disjoint A_1, A_2, . . . ∈ F.

[Figure: binary tree of the stock price S_n over successive coin tosses, with branches labeled H and T and node values 8, 4, 2, and 1.]

Thus

• S_1(ω) = 8 if ω_1 = H; 2 otherwise;

• . . .
6.2.2 Example: Uniform Random Variable
A random variable X uniformly distributed on [0, 1] can be simulated based on the example of
infinite independent coin tosses with p = 0.5. To do so, let Y_n(ω) be the indicator that the nth coin toss is a heads, that is Y_n(ω) = 1 if ω_n = H and 0 otherwise. Thus we define the random variable

X = Σ_{n=1}^∞ Y_n(ω) / 2^n.

Notice if we look at the binary expansion of X, that would be the sequence of heads and tails where a 1 in the ith digit means the ith toss was a heads. Thus we see that the range of X is [0, 1] and every binary expansion has equal probability of occurring, meaning that X is uniformly distributed on [0, 1]. Thus we get that the probability of being between a and b is
µX (a, b) = b − a, 0 ≤ a ≤ b ≤ 1.
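As a quick sanity check of this construction, one can build X from simulated coin flips and verify that the empirical distribution is flat. This is only a sketch; the truncation at 53 flips (roughly the resolution of a double) and the sample size are illustrative choices:

# Construct X = sum_n Y_n / 2^n from simulated fair coin flips and check
# that the resulting samples look uniform on [0, 1].
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_flips = 100_000, 53
Y = rng.integers(0, 2, size=(n_samples, n_flips))   # coin flips: 1 = heads
weights = 0.5 ** np.arange(1, n_flips + 1)           # 1 / 2^n
X = Y @ weights                                      # truncated binary expansion

# Empirical check of mu_X(a, b) = b - a on a few intervals.
for a, b in [(0.0, 0.5), (0.25, 0.75), (0.1, 0.2)]:
    frac = np.mean((X >= a) & (X <= b))
    print(f"P({a} <= X <= {b}) ~ {frac:.3f}  (exact {b - a})")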
we know that almost all of the time f(x) = 1 (since the rationals are a countable set while the irrationals are an uncountable set (a larger infinity)). This is however not computable using the Riemann integral. But, using the Lebesgue integral we get

∫_0^1 f dµ = 1 · µ(A)

where µ(A) means the measure (the length) of the set where f(x) = 1. Since there are only countably many holes, µ(A) = µ([0, 1]) = 1. Thus

∫_0^1 f dµ = 1
which matches our intuition. This may sound scary at first, but if you're not comfortable with these integrals you can understand the following theory by just reminding yourself that it's simply another way to integrate that works on weirder sets than the Riemann integral.
4. For any general X, we define ∫_Ω X dP = ∫_Ω X^+ dP − ∫_Ω X^- dP, where X^+ = max{0, X} and X^- = max{−X, 0}. X is integrable if ∫_Ω |X| dP < ∞.
Note that the definition of almost surely (a.s.) will be explained soon.
Define
f (x) = lim fn (x).
n→∞
or equivalently
E(X) = lim E(Xn ).
n→∞
Theorem: Dominated Convergence Theorem. Let {X_n}_{n=1}^∞ be a sequence of random variables converging almost surely (a.s.) to a random variable X. Assume that there exists a random variable Y s.t. |X_n| ≤ Y a.s. for all n, and E(Y) < ∞. Then E(X) = lim_{n→∞} E(X_n).

Theorem: Fatou's Lemma. Let {X_n}_{n=1}^∞ be a sequence of nonnegative random variables converging almost surely (a.s.) to a random variable X. Then E(X) ≤ lim inf_{n→∞} E(X_n).
Definition: Let Ω 6= ∅. Let T be a fixed positive number. Assume that for all t ∈ [0, T ], there
exists a σ-algebra f (t). Assume that if s < t, f (s) ⊂ f (t). We define the collection of σ-algebras
FT = {f (t)}0≤t≤T as the filtration at time T . It is understood intuitively as the complete set of
information about the stochastic process up to time T .
Now we need to bring random variable definitions up to our measure-theoretic ideas.
Definition: Let X be a random variable on (Ω, f, p). The σ-algebra generated by X is the set σ(X) = {X^{-1}(B) : B a Borel subset of R}.
6.3.2 Independence
Definition: Take two events A,B∈ f . The events A and B are independent if P (A ∩ B) =
P (A)P (B).
Definition: Let G, H ⊂ f be two sub-σ-algebras of f . G and H are independent if, for all
A ∈ G and B ∈ H, P (A ∩ B) = P (A)P (B).
Definition: Take the random variables X and Y on the probability space (Ω, f, p). X and Y are independent if σ(X) and σ(Y) are independent.

These definitions are straightforward; they are stated mostly so that the measure-theoretic framework feels familiar and easy.

Definition: Given X and a sub-σ-algebra G, the conditional expectation E[X|G] is the random variable satisfying:

1. E[X|G] is G-measurable.

2. Partial averaging: ∫_A E[X|G] dP = ∫_A X dP for all A ∈ G.
Let us understand what this definition means. We can interpret E[X|G] to be the random variable
that is the “best guess” for the values of X given the information in G. Since the only information
that we have is the information in G, this means that E[X|G] is G-measurable. Our best guess for
the value of X is the expectation of X. Notice that in measure-theoretic probability, the expectation of E[X|G] over the event A is defined as ∫_A E[X|G] dp. Thus the partial averaging property is simply saying that we have adapted the random variable E[X|G] such that its expectation for every event in G is the same as the expectation of X, which means that for anything that has happened, our best guess is simply what X was!
Definition: Take the random variables X and Y . We define E[X|Y ] = E[X|σ(Y )].
This gives us our link back to the traditional definition of conditional expectations using random
variables. Notice that this is how we formally define functions of random variables. Writing the conditional expectation as E[X|Y = y_i] = x_i, we can think of this as a mapping from y_i to x_i. Thus we can measure-theoretically interpret E[X|Y] as a function f(Y).
(a) This can be generalized: For all A ⊂ R, if X ∈ A a.s., then E[X|G] ∈ A a.s.
5. Taking out what is known: If X is G-measurable, then E[XY|G] = X E[Y|G]. Notice that this is because if X is G-measurable, it is known given the information of G and thus can be treated as a constant.
[Figure: two-period binomial tree for the stock price. Starting from S_0, the first toss ω_1 sends the price to S_0 u (H) or S_0 d (T), and the second toss ω_2 sends it to S_0 u^2 (probability p^2), S_0 ud (probability p(1 − p), reached two ways), or S_0 d^2 (probability (1 − p)^2).]
Let’s say we have the information given by the filtration G and we let X be our earnings from
the game after two coin-flips. We explore the properties in the following scenarios.
1. G = F_0. Recall that F_0 = {Ω_∞, ∅} and thus it literally contains no information. Thus we have that X is independent of the information in F_0, so E[X|F_0] = E[X]. We can calculate this using the traditional measures:

E[X] = Σ_i x_i Pr(X = x_i) = S_0 [p^2 u^2 + 2p(1 − p)ud + (1 − p)^2 d^2].
2. G = F1 . Recall that F1 is the information contained after the first event. Thus we know the
result of the first event, ω1 . Thus we can calculate what we would expect X to be given what we
know about the first event. Thus
E[X|F_1] = S_0 [p u^2 + (1 − p) u d]  if ω_1 = H;    E[X|F_1] = S_0 [p u d + (1 − p) d^2]  if ω_1 = T.
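The following sketch computes E[X] and E[X|F_1] by brute-force enumeration of the four outcomes, matching the formulas above. The values of S_0, u, d, and p are illustrative:

# Brute-force the conditional expectations on the two-flip binomial tree.
from itertools import product

S0, u, d, p = 4.0, 2.0, 0.5, 0.5    # illustrative values

def payoff(omega):
    # Stock value after two flips: multiply by u for H and by d for T.
    s = S0
    for w in omega:
        s *= u if w == "H" else d
    return s

def prob(omega):
    # Probability of a particular path of flips.
    q = 1.0
    for w in omega:
        q *= p if w == "H" else 1 - p
    return q

outcomes = list(product("HT", repeat=2))
EX = sum(prob(w) * payoff(w) for w in outcomes)
print("E[X] =", EX)   # equals S0 [p^2 u^2 + 2p(1-p)ud + (1-p)^2 d^2]

for first in "HT":
    paths = [w for w in outcomes if w[0] == first]
    cond = sum(prob(w) * payoff(w) for w in paths) / sum(prob(w) for w in paths)
    print(f"E[X | omega_1 = {first}] =", cond)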
6.4 Martingales
Definition: Take the probability space (Ω, F, p) with the filtration {F_t}_{0≤t≤T}, and let {M_t}_{0≤t≤T} be a stochastic process adapted to {F_t}. M_t is a martingale if E[M_t|F_s] = M_s for all 0 ≤ s ≤ t ≤ T.
If E[Mt |Fs ] ≥ Ms then we call Mt a sub-martingale, while if E[Mt |Fs ] ≤ Ms then we call Mt a
super-martingale.
These definitions can be interpreted as follows. If, knowing the value at time s, our best guess
for Mt is that it is the value Ms , then Mt is a martingale. If we assume that it will grow in
expectation in time, it’s a sub-martingale while if we assume that its value will shrink in time, it is
a super-martingale.
Recalling that Sn is a martingale if we would expect that our value in the future is the same as
now, then
1. If α = 1, then Sn is a martingale.
2. If α ≥ 1, then Sn is a sub-martingale.
3. If α ≤ 1, then Sn is a super-martingale.
since on average Bt − Bs = 0.
6.5 Martingale SDEs
Theorem: Take the SDE
dXt = a(Xt , t)dt + b(Xt , t)dWt .
If a(Xt , t) = 0, then Xt is a martingale.
Proof: If we take the expectation of the SDE, then we see that

dE[X_t]/dt = E[a(X_t, t)]

and thus if a(X_t, t) = 0, the expectation does not change and thus X_t is a martingale.

For example, let Z_t = f(W_t, t) = e^{−θW_t − θ^2 t/2}. Ito's Rules give

dZ_t = (∂f/∂t) dt + (∂f/∂W_t) dW_t + (1/2)(∂^2 f/∂W_t^2)(dW_t)^2
     = −(1/2) θ^2 Z_t dt − θ Z_t dW_t + (1/2) θ^2 Z_t dt
     = −θ Z_t dW_t
since (dWt )2 = dt. Thus since there is no deterministic part, Zt is a martingale. Thus since Z0 = 1,
we get that
E[Zt |Fs ] = Zs ,
and
E[Zt |F0 ] = 1.
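A small Monte Carlo sketch of the fact that E[Z_t] = 1; the values of θ and t are illustrative:

# Monte Carlo check that Z_t = exp(-theta*W_t - theta^2 t / 2) has mean 1.
import numpy as np

rng = np.random.default_rng(2)
theta, t, n = 0.7, 2.0, 1_000_000
W_t = rng.normal(0.0, np.sqrt(t), size=n)          # W_t ~ N(0, t)
Z_t = np.exp(-theta * W_t - 0.5 * theta**2 * t)
print("E[Z_t] ~", Z_t.mean())                       # should be close to 1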
Here we will consider first-passage time problems. These are problems where we start in a set and want to know at what time we hit the boundary. Fix x > 0. Let τ = inf{t ≥ 0 : Xt = x}
which is simply the first time that Xt hits the point x.
6.6.1 Kolmogorov Solution to First-Passage Time
Notice that we can solve for the probability density using the Forward Kolmogorov Equation

∂ρ(x, t)/∂t = −∂/∂x [f(x)ρ(x, t)] + (1/2) Σ_{i=1}^m ∂^2/∂x^2 [g_i^2(x)ρ(x, t)].
Given the problem, we will also have some initial condition ρ(x, 0) = ρ0 (x). If we are looking for
first passage to some point x_0, then we put an absorbing condition there: ρ(x_0, t) = 0. Thus the probability that you have not hit x_0 by the time t is given by the total probability that has not been absorbed by the time t, and thus

Pr(τ > t) = ∫_{−∞}^{∞} ρ(x, t) dx.
This method will always work, though many times the PDE will not have an analytical solution.
However, this can always be solved using computational PDE solvers.
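As one concrete sketch of such a computational solution, the Forward Kolmogorov equation can be discretized with finite differences and an absorbing boundary at the target. The drift, diffusion, domain, and grid choices below are all illustrative, and the scheme is the simplest explicit one rather than anything optimized:

# Finite-difference sketch of drho/dt = -d/dx[mu*rho] + 0.5*sigma^2 d^2rho/dx^2
# with an absorbing boundary rho(x0, t) = 0 at the first-passage target x0.
# Pr(tau > t) is the total probability remaining in the domain.
import numpy as np

mu, sigma = 0.5, 1.0
x_left, x0 = -10.0, 1.0                  # far-away left edge, target at x0
nx = 1101
x = np.linspace(x_left, x0, nx)
dx = x[1] - x[0]
dt = 0.2 * dx**2 / sigma**2               # small step for explicit stability
T = 1.0

# Narrow Gaussian approximating the delta initial condition at x = 0.
rho = np.exp(-(x - 0.0) ** 2 / (2 * 0.01)) / np.sqrt(2 * np.pi * 0.01)

t = 0.0
while t < T:
    drho = np.zeros_like(rho)
    drho[1:-1] = (-mu * (rho[2:] - rho[:-2]) / (2 * dx)
                  + 0.5 * sigma**2 * (rho[2:] - 2 * rho[1:-1] + rho[:-2]) / dx**2)
    rho = rho + dt * drho
    rho[0] = 0.0      # absorbing far-field boundary
    rho[-1] = 0.0     # absorbing boundary at the target x0
    t += dt

survival = np.sum(rho) * dx               # total unabsorbed probability
print("Pr(tau > 1) ~", survival)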
where t ∧ τ = min(t, τ). M(t ∧ τ) is simply the process that we stop once it hits x.
Theorem: If Mt is a martingale, then Mt∧τ is a martingale.
The proof is straight-forward. Since the stopped martingale does not change after τ , then the
expectation definitely will not change after τ . Since it is a martingale before τ , the expectation
does not change before τ . Thus Mt∧τ is a martingale.
6.6.3 Reflection Principle
Another useful property is known as the Reflection Principle. Basically, if we look at a Brownian motion after the time τ, there is just as much of a probability of it going up as there is of it going down. Thus the trajectory that is reflected after a time τ is just as probable as the non-reflected path.

[Figure: a Brownian path and its reflection about the level x after the first hitting time τ. Picture taken from Oksendal, Stochastic Differential Equations: An Introduction with Applications.]

We can formally write the Reflection Principle as

P(τ ≤ t) = P(τ ≤ t, B_t < x) + P(τ ≤ t, B_t ≥ x)
         = 2 P(B_t ≥ x)
         = 2 ∫_x^∞ (1/√(2πt)) e^{−u^2/(2t)} du.
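A quick Monte Carlo sketch comparing the simulated hitting probability against 2P(B_t ≥ x); the level x, horizon t, and discretization are illustrative, and the discrete walk slightly underestimates hitting, so only rough agreement should be expected:

# Monte Carlo check of the Reflection Principle: P(tau <= t) = 2 P(B_t >= x).
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(3)
x, t = 1.0, 1.0
n_paths, n_steps = 100_000, 1_000
dt = t / n_steps

pos = np.zeros(n_paths)
running_max = np.zeros(n_paths)
for _ in range(n_steps):
    pos += rng.normal(0.0, np.sqrt(dt), size=n_paths)
    running_max = np.maximum(running_max, pos)

hit = (running_max >= x).mean()
exact = erfc(x / sqrt(2 * t))      # 2 P(B_t >= x) = erfc(x / sqrt(2t))
print("simulated P(tau <= t):", hit)
print("reflection principle :", exact)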
then we say x satisfies the Markov property and is thus a Markov process. Another way of stating
this is that
E [f (Xt )|Fs ] = E [f (Xt )|σ(Xs )] ,
or
P (xt ∈ A|Fs ) = P (xt ∈ A|xs ).
Intuitively, this property means that the total information for predicting the future value x_t is completely encapsulated by the current state x_s.

dv = (∂v/∂t) dt + (∂v/∂x) dx + (1/2)(∂^2 v/∂x^2)(dx)^2
   = (∂v/∂t) dt + (∂v/∂x)(a(X_t) dt + b(X_t) dW_t) + (1/2) b^2(x(t), t)(∂^2 v/∂x^2) dt
   = [∂v/∂t + a(X_t) ∂v/∂x + (1/2) b^2(x(t), t) ∂^2 v/∂x^2] dt + b(X_t)(∂v/∂x) dW_t.
Since v is a martingale, the deterministic part must be zero. Thus we get the equation
∂v/∂t + a(X_t) ∂v/∂x + (1/2) b^2(x(t), t) ∂^2 v/∂x^2 = 0.

We can solve this backward in time using the terminal condition v(x, T) = h(x) to obtain v(x(t), t).

Define the transition density function as ρ(t_0, t, x, y), the probability density of transitioning from x at time t_0 to y at time t. Notice that the expected terminal payoff h given the state x at time t is

v(x, t) = E^{t,x} [h(x(T))] = ∫_{−∞}^∞ ρ(t, T, x, y) h(y) dy.
Now suppose h(z) = δ(z − y), meaning that we know that the trajectory ends at y. Thus

v(x, t) = ∫_{−∞}^∞ ρ(t, T, x, z) δ(z − y) dz = ρ(t, T, x, y).

Therefore ρ itself satisfies

∂ρ/∂t + a(x(t), t) ∂ρ/∂x + (1/2) b^2(x(t), t) ∂^2 ρ/∂x^2 = 0

with the terminal condition ρ(T, x) = δ(x − y), where y is the place the trajectory ends. Thus this equation, the Kolmogorov Backward Equation, tells us: if we know the trajectory ends at y at a time T, what is the probability that it was at x at a time t < T. Notice that this is not the same as the Kolmogorov Forward Equation. This is because diffusion forward does not equal diffusion backwards. For example, say you place dye in water. It will diffuse to spread out uniformly around the water. However, if we were to play that process in reverse, it would look distinctly different.
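Since v(x, t) = E^{t,x}[h(x(T))], one practical way to evaluate v without solving the backward PDE is Monte Carlo over simulated trajectories. The drift, diffusion, payoff, and parameters in this sketch are all illustrative:

# Monte Carlo evaluation of v(x, t) = E^{t,x}[ h(X_T) ] for
# dX = a(X) dt + b(X) dW, which solves the Kolmogorov Backward Equation
# with terminal condition v(x, T) = h(x).
import numpy as np

def a(x): return -x                     # mean-reverting drift (illustrative)
def b(x): return 1.0                    # constant diffusion (illustrative)
def h(x): return np.maximum(x, 0.0)     # terminal payoff (illustrative)

def v_monte_carlo(x, t, T, n_paths=50_000, dt=1e-3, seed=4):
    rng = np.random.default_rng(seed)
    X = np.full(n_paths, x, dtype=float)
    for _ in range(int((T - t) / dt)):
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        X += a(X) * dt + b(X) * dW      # Euler-Maruyama step
    return h(X).mean()

print("v(x=0.5, t=0) ~", v_monte_carlo(0.5, 0.0, 1.0))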
Ẽ [x] = E [xz]
6.9.2 Simple Change of Measure Example
Take (Ω, F, p) and let X be a standard normal random variable under p. Define Y = X + θ. Notice that under p, Y ∼ N(θ, 1). However, if we define

z(ω) = e^{−θX(ω) − θ^2/2}

and define

p̃(A) = ∫_A z dp,

then in this reweighing of the probability space Y is standard normal. This is because, by definition, the pdf of Y under the measure p̃ evaluated at y = x + θ is

f̃(y) = z(x) · (1/√(2π)) e^{−x^2/2}
     = e^{−θx − θ^2/2} · (1/√(2π)) e^{−x^2/2}
     = (1/√(2π)) e^{−(x+θ)^2/2} = (1/√(2π)) e^{−y^2/2},

so Y ∼ N(0, 1) under p̃.
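A small numeric sketch of this reweighing (the value of θ is illustrative): sampling X ∼ N(0, 1), setting Y = X + θ, and weighting by z shows that the weighted moments of Y match a standard normal.

# Change of measure by reweighing: under p, Y = X + theta ~ N(theta, 1);
# weighting by z = exp(-theta*X - theta^2/2) makes Y standard normal under p~.
import numpy as np

rng = np.random.default_rng(5)
theta, n = 1.5, 1_000_000
X = rng.normal(0.0, 1.0, size=n)
Y = X + theta
z = np.exp(-theta * X - 0.5 * theta**2)

print("E[z]                ~", z.mean())                 # ~ 1 (valid reweighing)
print("mean of Y under p~  ~", np.mean(z * Y))            # ~ 0
print("var  of Y under p~  ~", np.mean(z * Y**2) - np.mean(z * Y) ** 2)   # ~ 1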
Zt = E[ζ|Ft ].
Notice that
1. Z(t) is a martingale.
6.9.4 Girsanov Theorem
Using the change of measure determined by the RNDP, we can always construct a Brownian motion
from an adapted stochastic process. This is known as Girsanov Theorem.
Theorem: Girsanov Theorem. Let Wt be a Brownian motion in (Ω, F, P ) and Ft be a
filtration with 0 ≤ t ≤ T. Let θ(t) be an adapted stochastic process on F_t. Define

Z(t) = e^{−∫_0^t θ(u) dW(u) − (1/2) ∫_0^t θ^2(u) du},

W̃_t = W_t + ∫_0^t θ(u) du,    i.e.    dW̃_t = dW_t + θ(t) dt.

Then:

1. E[Z_t] = 1,

2. W̃_t is a Brownian motion under the measure p̃ defined by p̃(A) = ∫_A Z(T) dp.

To prove the first property, write Z_t = e^{ln Z_t} and apply Ito's Rules to the exponential:

dZ_t = Z_t d(ln Z_t) + (1/2) Z_t (d(ln Z_t))^2
     = Z_t [−θ(t) dW − (1/2) θ^2(t) dt + (1/2) θ^2(t) dt]
     = −Z_t θ(t) dW
and thus since the deterministic changes are zero, Zt is a martingale. Notice that since Z(0) = 1,
E [Zt ] = Z0 = 1. Thus we have proven the first property.
In order to prove the second property, we employ the Levy Theorem from 6.7. By the theorem we have that if W̃_t is a martingale under p̃ and dW̃_t × dW̃_t = dt, then W̃_t is a Brownian Motion under p̃. Notice that

dW̃_t × dW̃_t = (dW_t + θ(t) dt)^2 = dW_t^2 = dt.
Thus in order to show W̃t is a Brownian motion under p̃, we simply need to show that W̃t is a
martingale. Since Zt is an RNDP, we use property 2 of RNDPs to see
Ẽ[W̃_t|F_s] = E[W̃_t Z_t | F_s] / Z_s.

Thus we use Ito's Rules to get

d(W̃_t Z_t) = W̃_t dZ_t + Z_t dW̃_t + dW̃_t dZ_t
           = −W̃_t Z_t θ(t) dW_t + Z_t dW_t + Z_t θ(t) dt − Z_t θ(t) dt
           = Z_t (1 − W̃_t θ(t)) dW_t

and thus, since the deterministic changes are zero, W̃_t Z_t is a martingale under p. Thus we use the definition of a martingale to get

Ẽ[W̃_t|F_s] = W̃_s Z_s / Z_s = W̃_s

and thus W̃_t is a martingale under p̃. Therefore by the Levy Theorem we get that W̃_t is a Brownian motion under p̃, completing the proof.
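A Monte Carlo sketch of Girsanov's theorem in action, using a constant θ and illustrative parameters: under the reweighting by Z_T, the drifted process W̃_T = W_T + θT behaves like a driftless Brownian motion at time T.

# Girsanov check: simulate W_T under p, form W~_T = W_T + theta*T and the
# density Z_T = exp(-theta*W_T - theta^2*T/2); under the reweighted measure,
# W~_T should have mean ~ 0 and variance ~ T.
import numpy as np

rng = np.random.default_rng(6)
theta, T, n = 0.8, 1.0, 1_000_000
W_T = rng.normal(0.0, np.sqrt(T), size=n)
W_tilde_T = W_T + theta * T
Z_T = np.exp(-theta * W_T - 0.5 * theta**2 * T)

mean_tilde = np.mean(Z_T * W_tilde_T)
var_tilde = np.mean(Z_T * W_tilde_T**2) - mean_tilde**2
print("Etilde[W~_T]   ~", mean_tilde, " (expect 0)")
print("Vartilde[W~_T] ~", var_tilde, " (expect", T, ")")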
7 Applications of SDEs
Using the theory we have developed, we will now look into some applications of SDEs.
Consider a European call option with strike k on the stock S(t); let [S(T) − k]^+ be the value of the option at the time T. Let v(t, S(t)) be the value of the European call option at
the time t. What we wish to do is find out how to evaluate v.
7.1.1 Solution Technique: Self-Financing Portfolio
Let X(t) be the value of our portfolio. We assume there are only two assets: this stock and the money market. By finding out the optimal amount of money we should be investing into the stock we can uncover its intrinsic value and thus determine v. Note that if we put our money in the money market, then it accrues interest at a rate r. Thus, letting ∆(t) be the number of shares of stock we hold, we can write the change in the value of our portfolio as

dX(t) = ∆(t) dS(t) + r (X(t) − ∆(t)S(t)) dt = [rX(t) + (µ − r)∆(t)S(t)] dt + σ∆(t)S(t) dW.

Applying Ito's Rules to v,

dv(t, S(t)) = (∂v/∂t) dt + (∂v/∂x) dS(t) + (1/2)(∂^2 v/∂x^2)(dS)^2
            = (∂v/∂t) dt + µS(t)(∂v/∂x) dt + σS(t)(∂v/∂x) dW + (1/2) σ^2 S^2(t)(∂^2 v/∂x^2) dt
            = [∂v/∂t + µS(t) ∂v/∂x + (1/2) σ^2 S^2(t) ∂^2 v/∂x^2] dt + σS(t)(∂v/∂x) dW.
We now assume that there is no arbitrage. This can be rooted in what is known as the “Efficient
Market Hypothesis” which states that a free-market running with complete information and rational
individuals will operate with no “arbitrage” where arbitrage is the “ability to beat the system”.
For example, if one store is selling a certain candy bar for $1 and another store is buying it for
$2, there is an arbitrage here of $1 and you can make money! But, if these people had complete
information and were rational, we assume that they will have worked this out and no opportunity
like this will be available. In stock market terms, this means that the price of a good equates with
the value of the good. This means that we can assume that the value of the option at the time
t, v(X, t), will equate with its market price. Notice that, since the only way we could have made
money with our portfolio is that the value of the invested stock has increased in order to make our
option more valuable, the value of the option at time t is equal to the value of our portfolio. Since
price equates with value we get the condition
v(X, t) = X(t)
and thus
dv = dX.
Thus we solve for v by equating the coefficients between the dX and the dv equations. Notice that
the noise term gives us that
σS(t)(∂v/∂x) = σ∆(t)S(t),   i.e.   ∂v/∂x = ∆(t),

where we interpret the derivative of v as evaluated at the stock price at time t,

∂v/∂x = (∂v/∂x)|_{x=S(t)}.

This is known as delta hedging. Then, matching the dt coefficients, we receive the equation

rX(t) + (µ − r)∆(t)S(t) = ∂v/∂t + µS(t) ∂v/∂x + (1/2) σ^2 S^2 ∂^2 v/∂x^2

where we replace X(t) = v(t, S(t)) and ∆(t) = ∂v/∂x to get

rv + (µ − r)S(t) ∂v/∂x = ∂v/∂t + µS(t) ∂v/∂x + (1/2) σ^2 S^2 ∂^2 v/∂x^2,
rv − rx ∂v/∂x = ∂v/∂t + (1/2) σ^2 x^2 ∂^2 v/∂x^2,

as a PDE for solving for v. Since we have no arbitrage, the value of the option, v, equals the price of the option, g(T, S(T)), at the time T. This gives us the system

∂v/∂t + rx ∂v/∂x + (1/2) σ^2 x^2 ∂^2 v/∂x^2 = rv    (a linear parabolic PDE)
v(T, x) = [x − k]^+

whose solution gives us the evaluation of the option at the time t. This is known as the Black-Scholes-Merton Equation. This equation can be solved using a change of variables, though we will use another method.
and thus by definition v is a martingale. Thus we use Ito’s Rules on e−rt v to get
d(e^{−rt} v(t, S(t))) = e^{−rt} [−rv + ∂v/∂t] dt + e^{−rt} (∂v/∂x) dS + e^{−rt} (1/2)(∂^2 v/∂x^2)(dS)^2
and use the martingale property to equate the deterministic changes with zero to once again uncover
the Black-Scholes equation.
Thus, since we know that it's equivalent to say µ = r and evaluate the equation, we note that

v(t, S(t)) = e^{−r(T−t)} E^{t,S(t)} [[S(T) − k]^+ | F_t].
Recall from the SDE that
dS(t) = µS(t)dt + σS(t)dW
which is Geometric Brownian Motion whose solution (with µ = r) is

S(T) = S(t) e^{(r − σ^2/2)(T−t) + σ(W(T) − W(t))}.
Thus, noting that W (T ) − W (t) = ∆W ∼ N (0, T − t), we use the definition of expected value to
get
v(t, S(t)) = ∫_{−∞}^∞ [S(t) e^{(r−σ^2/2)(T−t)+σu} − k]^+ (1/√(2π(T−t))) e^{−u^2/(2(T−t))} du
           = (1/√(2π(T−t))) ∫_α^∞ (S(t) e^{(r−σ^2/2)(T−t)+σu} − k) e^{−u^2/(2(T−t))} du

where

α = [ln(k/S(t)) − (r − σ^2/2)(T − t)] / σ.

We can solve this to get

v(t, x) = x N(d_+) − k e^{−rτ} N(d_−)

where τ = T − t,

d_±(τ, x) = (1/(σ√τ)) [ln(x/k) + (r ± σ^2/2) τ],

and

N(y) ≜ (1/√(2π)) ∫_{−∞}^y e^{−u^2/2} du = (1/2) erfc(−y/√2).
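For concreteness, here is a small sketch that evaluates this closed-form price and checks it against a risk-neutral Monte Carlo simulation of the geometric Brownian motion; all parameter values are illustrative:

# Black-Scholes call price v(t, x) = x N(d+) - k e^{-r tau} N(d-) versus a
# risk-neutral Monte Carlo estimate e^{-r tau} E[(S_T - k)^+].
import numpy as np
from math import log, sqrt, exp, erf

def N(y):
    return 0.5 * (1.0 + erf(y / sqrt(2.0)))

def black_scholes_call(x, k, r, sigma, tau):
    d_plus = (log(x / k) + (r + 0.5 * sigma**2) * tau) / (sigma * sqrt(tau))
    d_minus = d_plus - sigma * sqrt(tau)
    return x * N(d_plus) - k * exp(-r * tau) * N(d_minus)

x, k, r, sigma, tau = 100.0, 105.0, 0.03, 0.2, 1.0
print("closed form:", black_scholes_call(x, k, r, sigma, tau))

rng = np.random.default_rng(7)
Z = rng.normal(size=2_000_000)
S_T = x * np.exp((r - 0.5 * sigma**2) * tau + sigma * sqrt(tau) * Z)
mc = exp(-r * tau) * np.maximum(S_T - k, 0.0).mean()
print("Monte Carlo:", mc)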
7.1.3 Justification of µ = r via Girsanov Theorem
Take the generalized stock price SDE

dS(t) = µ(t)S(t) dt + σ(t)S(t) dW_t

and let

D(t) = e^{−∫_0^t r(s) ds}.

Thus we use Ito's Rules to get

d(D(t)S(t)) = σ(t)D(t)S(t) [θ(t) dt + dW_t]

where

θ(t) = (µ(t) − r(t)) / σ(t).

Thus we define dW̃_t = dW_t + θ(t) dt to get

d(D(t)S(t)) = σ(t)D(t)S(t) dW̃_t.

Therefore if we define

Z_t = e^{−∫_0^t θ(u) dW(u) − (1/2) ∫_0^t θ^2(u) du}

and

p̃(A) = ∫_A Z_t dp,

then W̃_t is a Brownian motion under p̃. Therefore, since there is no deterministic change, D(t)S(t) is a martingale under p̃. We note that D(t)X(t) is also a martingale under p̃. We call p̃ the Risk-Neutral World. Note that in the Risk-Neutral World we can set the price of the option, discounted by D(t), as the expectation conditioned on the totality of information that we have. Thus for V(S(T), T) as the payoff of a derivative security on the stock S(T) at time T, we get

D(t) v(t, S(t)) = Ẽ [D(T) V(S(T), T) | F_t].

This is an equivalent expression to the conditional expectation from before, saying that we can let
µ = r because this is simply a change of measure into the Risk-Neutral World.
7.2.1 Definitions from Biology
First we will introduce some definitions from biology. A gene locus is simply a location on a
chromosome or a portion of a chromosome. It can represent a gene, a SNP (single nucleotide polymorphism), or simply a location. An allele is one of a number of alternative forms of the locus.
A dominant allele is usually capitalized. For human and other diploid animals, there are typically
two alleles of paternal and maternal origin. A genotype is the genetic makeup of an individual.
In this case it will denote the types of alleles an individual carries. For example, if the
individual has one dominant allele and one recessive allele, its genotype is Aa. Having the same
pair of alleles is called homozygous while having different alleles is called heterozygous. A mutation
is a random genetic change. Here we refer to it as the random spontaneous change of one allele to
another. Fitness is the measure of survivability and ability to reproduce of an individual possessing
a certain genotype (this is wrong for many philosophical / scientific reasons, but we can take this
as a working definition since this is a major topic worth its own book) . Neutral evolution is the
case where all genotypes of interest have the same fitness. This implies that there is no selection
from such genetic variations and all change is inherently random.
4. Mating is completely random and is determined by random sampling with replacement. This means any individual can randomly give rise to 0, 1, . . . many offspring.
From this, the time evolution of the Wright-Fisher model is defined as:
2. At generation 1, each A or a allele from generation 0 may result in one, more or zero copies
of that allele.
7.2.3 Formalization of the Wright-Fisher Model
Let us characterize the Wright-Fisher model in probabilistic terms. Let X_n be the total number of A alleles at generation n. Considering that there are a total of 2N alleles in any given generation, we can define P_n = X_n / (2N) to be the fraction of alleles that are A. Let X_0 be some known initial condition. Because there are 2N independent alleles that could be passed and the probability of generating an A allele at generation n is P_{n−1} (because of sampling with replacement), we can derive the distribution for X_n using indicator functions. Order the alleles in the population. Let X_{n,i} be 1 if the ith allele is an A and 0 if the ith allele is an a. Since we are sampling with replacement, the choice of allele i is independent of the choice of allele j for any i ≠ j. Thus each X_{n,i} ∼ Bern(P_{n−1}), that is, each X_{n,i} is a Bernoulli random variable with probability of heads equal to P_{n−1}. Therefore, since X_n = Σ_i X_{n,i} and the X_{n,i} are independent, X_n is modeled by the result of 2N coin-flips each with a probability P_{n−1} of heads. Thus

X_n ∼ Bin(2N, P_{n−1}),

which makes

Pr(X_n = k) = (2N choose k) P_{n−1}^k (1 − P_{n−1})^{2N−k},    0 ≤ k ≤ 2N.

Thus we can show that

E[X_n|X_{n−1}] = 2N P_{n−1} = X_{n−1},

which implies X_n is a martingale. Notice that we can think of this process as some form of a random walk for X_n with probability of going right P_{n−1}. One can rigorously show that for a 1-dimensional random walk of this sort

lim_{n→∞} P_n = 0 or 1,

which means that the process will fix to one of the endpoints in finite time. Since X_n is a martingale, we note that

lim_{n→∞} E[X_n] = X_0,

and since X_∞ ∈ {0, 2N}, this means

Pr(X_∞ = 2N) = X_0 / (2N) = P_0.
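A quick simulation sketch of the discrete Wright-Fisher model, checking that the fixation probability of A is approximately P_0 = X_0 / (2N); the population size and initial count are illustrative:

# Simulate the neutral Wright-Fisher model until fixation and check that
# Pr(X_inf = 2N) is approximately P_0 = X_0 / (2N).
import numpy as np

rng = np.random.default_rng(8)
N, X0, n_runs = 50, 30, 5_000
two_N = 2 * N

fixed_at_2N = 0
for _ in range(n_runs):
    X = X0
    while 0 < X < two_N:
        # Each of the 2N alleles in the next generation is A with prob X / 2N.
        X = rng.binomial(two_N, X / two_N)
    fixed_at_2N += (X == two_N)

print("simulated Pr(X_inf = 2N):", fixed_at_2N / n_runs)
print("theory P_0 = X_0 / (2N) :", X0 / two_N)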
7.2.4 The Diffusion Generator
If we assume N is large, then it is reasonable to assume that the gene frequencies Pn will change in
a manner that is almost continuous and thus can be approximated by an SDE. Consider an SDE
of the form

dX_t = a(X_t) dt + √(b(X_t)) dW_t

with the initial condition X_0 = x. Applying Ito's rule gives us

df(X_t) = (df/dX) dX_t + (1/2)(d^2 f/dX^2)(dX_t)^2
        = (df/dX) a(x) dt + (df/dX) √(b(x)) dW + (1/2)(d^2 f/dX^2) b(x) dt.

Writing f_X = ∂f/∂X and f_XX = ∂^2 f/∂X^2 and taking expectations, we get that

dE[f(X_t)]/dt = f_X a(x) + (1/2) f_XX b(x).

Thus we define L to be the operator

Lf = dE[f(X_t)]/dt = f_X a(x) + (1/2) f_XX b(x).

L is known as the diffusion generator. If we let f(x) = x then

Lf = a(x) = dE[X_t]/dt

where a(x) defines the infinitesimal mean change. If we define f(y) = (y − x)^2 for a fixed x, then

b(x) = (d/dt) E[(X_t − x)^2]
makes b(x) define the infinitesimal variance. Therefore, we can generally relate discrete stochastic
processes to continuous stochastic processes by defining the SDE with the proper a and b such
that the mean and variance match up. This can formally be shown to be the best continuous
approximation to a discrete Markov process using the Kolmogorov Forward Equation.
Notice too then that
V [Pn ] = Pn−1 (1 − Pn−1 ).
Because

E[X_n|X_{n−1}] = X_{n−1}

we get that

dE[X_n]/dt = 0.

This means that we define the SDE approximation of the Wright-Fisher model to be the one that matches the mean and variances, that is

dX_t = √(X_t(1 − X_t)) dW_t.
and thus
∆Xn = Xn − Xn−1 ≈ Xn−1 (1 − Xn−1 )s.
Therefore we make

dE[X_n]/dt = s X_{n−1}(1 − X_{n−1}).

To make this continuous we note

dE[P_n]/dt = (s/(2N)) P_{n−1}(1 − P_{n−1}) = γ P_{n−1}(1 − P_{n−1})

where γ = s/(2N). Therefore we make this continuous by matching the diffusion operator as
L = γx(1 − x) d/dx + (1/2) x(1 − x) d^2/dx^2

and thus we approximate the Wright-Fisher model with selection as the SDE

dX_t = γ X_t(1 − X_t) dt + √(X_t(1 − X_t)) dW_t
when N is large.
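A minimal Euler-Maruyama sketch of this diffusion approximation; γ, the step size, and the initial frequency are illustrative, and the state is clipped to [0, 1] so the square root stays real near the boundaries:

# Euler-Maruyama simulation of dX = gamma X(1-X) dt + sqrt(X(1-X)) dW,
# the diffusion approximation to Wright-Fisher with selection.
import numpy as np

rng = np.random.default_rng(9)
gamma, x0, T, dt = 1.0, 0.2, 20.0, 1e-3

x = x0
for _ in range(int(T / dt)):
    dW = rng.normal(0.0, np.sqrt(dt))
    x += gamma * x * (1 - x) * dt + np.sqrt(max(x * (1 - x), 0.0)) * dW
    x = min(max(x, 0.0), 1.0)        # keep the frequency in [0, 1]
    if x in (0.0, 1.0):
        break                        # absorbed: allele lost or fixed

print("final allele frequency:", x)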
L = (γx(1 − x) + β_1(1 − x) + β_2 x) d/dx + (1/2) x(1 − x) d^2/dx^2.

Therefore the best SDE approximation to the Wright-Fisher Model with selection and mutation is

dX_t = (γ X_t(1 − X_t) + β_1(1 − X_t) + β_2 X_t) dt + √(X_t(1 − X_t)) dW_t.
E[h(X_t)|F_s] = h(X_s)

h(x) = ∫_0^1 Pr(X_t = y|X_0 = x) h(y) dy

∫_0^1 δ(x − y) h(y) dy = ∫_0^1 Pr(X_t = y|X_0 = x) h(y) dy

0 = ∫_0^1 [Pr(X_t = y|X_0 = x) − δ(x − y)] h(y) dy.
Thus

0 = ∫_0^1 (∂ρ/∂t) h(y) dy

0 = ∫_0^1 [−∂(aρ)/∂x + (1/2) b^2 ∂^2 ρ/∂x^2] h(y) dy
Lh = 0
h(a) = 1
h(b) = 0
where

L = a(x) d/dx + (1/2) b(x) d^2/dx^2.
To solve for this hitting probability, define ψ = dh/dx. Then Lh = 0 gives ψ′/ψ = −2a(x)/b(x), and therefore

ψ(x) = ψ(0) e^{−∫_0^x 2a(y)/b(y) dy}

and

h(x) = ∫_0^x ψ(y) dy,

up to the constants fixed by the boundary conditions. This is expanded in the book by Kimura and Crow.
0 = −∂/∂x [a(x)π(x, t)] + (1/2) ∂^2/∂x^2 [b^2(x)π(x, t)].
For the Wright-Fisher model with selection and mutation, we note that this can be solved so that
7.3 Stochastic Control
The focus of this section will be on stochastic control. We begin by looking at optimal control in
the deterministic case.
ẋ = F (x(t), u(t))
where x(t) is the state variable and u(t) is the control variable. We wish to apply optimal control
over a fixed time period [0, T ] such that
V(x(t), t) = min_u [ ∫_t^T C[x(s), u(s)] ds + D[x(T)] ]
where C is the cost function, D is the terminal cost, and V (x(t), t) is the optimal or minimal cost at
a time t. Thus what we are looking to do is, given some cost function C, we are looking to minimize
the total cost. Let’s say for example that we want to keep x at a value c, but sending control signals
costs some amount α. Thus an example could be that C(x(t), u(t)) = (x(t) − c)2 + αu(t). What
we want to do is solve for the best u(t).
[Diagram: the time interval from 0 to T, with a small step dt taken at time t.]
At point t there is an optimal value V (x(t), t). One way to try to control the system is by
taking small steps and observing how the system evolves. We will do this by stepping backwards.
Divide the cost into two components: the cost over the increment dt plus the optimal cost-to-go from t + dt. Notice that

V(x(t), t) = min_u [ C(x(t), u(t)) dt + V(x(t + dt), t + dt) ].

Essentially the goal is to find the u(t) for the increment dt that minimizes the growth of V. Do a
Taylor expansion on V to get
V(x(t + dt), t + dt) ≈ V(x(t), t) + (∂V/∂t) dt + ⟨∂V/∂x, ẋ(t)⟩ dt.
The third term of the above equation, ⟨∂V/∂x, ẋ(t)⟩ dt, is not a product of scalars in general, but rather the dot product between two vectors. The importance of this expansion is that it is recursive, and this recursion is what leads to the minimal solution. By plugging this in we get

min_u [ C(x(t), u(t)) + ∂V/∂t + ⟨∂V/∂x, F(x(t), u(t))⟩ ] = 0

since V(x(t), t) does not depend on u(t) in the range (t, t + dt). We move ∂V/∂t outside the minimization to get

∂V/∂t + min_u [ C(x, u) + ⟨∂V/∂x, F(x, u)⟩ ] = 0.

The boundary condition is that the cost at the end must equal the terminal cost:
V (x, T ) = D(x).
This defines a backward problem where, counting back from T, the terminal cost acts as the initial condition and we solve backwards using the stepping equation. However, the PDE that needs to be solved to find the optimal control of the system may be hard to solve because the minimization may be nontrivial. This is known as deterministic optimal control, where the best u will be found. These types of equations are known as Hamilton-Jacobi-Bellman (HJB) equations, and are famous if you are studying optimal control.
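A minimal dynamic-programming sketch of solving such an HJB problem backward in time on a grid. The dynamics F, running cost C, terminal cost D, and all discretization choices here are illustrative, not taken from the text:

# Backward dynamic programming for a deterministic HJB problem:
#   V(x, t) = min_u [ C(x, u) dt + V(x + F(x, u) dt, t + dt) ],  V(x, T) = D(x).
# Dynamics F(x, u) = u, running cost C = (x - c)^2 + alpha*u^2,
# terminal cost D = (x - c)^2.
import numpy as np

c, alpha, T, dt = 1.0, 0.1, 1.0, 0.01
xs = np.linspace(-2.0, 3.0, 251)          # state grid
us = np.linspace(-5.0, 5.0, 101)          # candidate controls

V = (xs - c) ** 2                          # terminal cost D(x)
for _ in range(int(T / dt)):               # step backward from T to 0
    V_new = np.empty_like(V)
    for i, x in enumerate(xs):
        x_next = x + us * dt                               # F(x, u) = u
        cost = ((x - c) ** 2 + alpha * us ** 2) * dt \
               + np.interp(x_next, xs, V)                  # cost-to-go
        V_new[i] = cost.min()                              # minimize over u
    V = V_new

print("optimal cost V(x=0, t=0) ~", np.interp(0.0, xs, V))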
What type of control is being used? That is the question that needs to be addressed, because there
are many types of controls:
1. Open Loop Control (Deterministic Control).
Suppose u(t, ω) = u(t). This case has no random events and thus it will be deterministic
control (open looped control).
2. Closed Loop Control (Feedback Control).
Suppose u_t is M_t-adapted, where M_t is the σ-algebra generated by X_s, 0 ≤ s ≤ t. Essentially, for this σ-algebra you have all the information about the trajectory from 0 up to the current time point; the whole history may be used.
3. Markov Control.
U(t, ω) = u_0(t, x_t(ω)). Markov control uses less information than closed loop control: it only uses the current state, nothing beyond that. This is commonly used in programming applications because only the last state needs to be saved, leading to iterative solutions that do not require much working memory or RAM. An example of where this would be used is in robotics: a robot has to decide to walk or stop, and this decision only depends on the current state.
These types of controls explain what type of information is allowed to be used. Regardless, we
solve the problem using dynamic programming but this time we use the stochastic Taylor series
expansion:
V(x(t + dt), t + dt) = V(x(t), t) + (∂V/∂t) dt + (∂V/∂x) · dX_t + (1/2)(dX_t)^T (∂^2 V/∂x^2)(dX_t)

where ∂^2 V/∂x^2 is the Hessian of V. Thus

V(x(t+dt), t+dt) = V(x(t), t) + (∂V/∂t) dt + ⟨∂V/∂x, b(x_t, u_t)⟩ dt + ⟨∂V/∂x, σ(x_t, u_t) dB_t⟩ + (1/2) Σ_{i,j} a_{ij} (∂^2 V/∂x_i ∂x_j) dt

where

a_{ij} = (σσ^T)_{ij}.

After taking the expectation we solve as before to get

∂V/∂t + min_u [ C(x, u) + ⟨∂V/∂x, b(x_t, u_t)⟩ + (1/2) Σ_{ij} a_{ij} ∂^2 V/∂x_i ∂x_j ] = 0

V(x, T) = D(x).
Whether the solution exists, or is unique, are hard questions. In general the goal is a practical solution to the HJB equations. This type of equation does not necessarily have an analytical solution, and caution needs to be exercised as numerical solutions have issues as well. For example, how could one find the u that minimizes this? Some form of the calculus of variations? There is no clear general way to do this.
Here the system is linear, dx_t = (H_t x_t + M_t u_t) dt + σ_t dB_t, with σ_t ∈ R^{n×m} and M_t ∈ R^{n×k}. We minimize over u the expected quadratic cost

V^u(x, 0) = E^{x,0} [ ∫_0^T (x_t^T C_t x_t + u_t^T D_t u_t) dt + x_T^T R x_T ]

where C_t is the state cost at t, D_t is the cost of controlling at t, and R is the terminal cost; the idea is to keep both the state and the control small. We denote

ψ(t, x) = min_u V^u.

Using s as the time stand-in variable, we plug the system into the HJB equation to get

∂ψ/∂s + min_u [ x^T C_s x + u^T D_s u + Σ_{i=1}^n (H_s x + M_s u)_i ∂ψ/∂x_i + (1/2) Σ_{ij} (σ_s σ_s^T)_{ij} ∂^2 ψ/∂x_i ∂x_j ] = 0.

Trying the quadratic ansatz ψ(t, x) = x^T S_t x + a_t, this becomes

x^T Ṡ_t x + ȧ_t + min_u [ x^T C_t x + u^T D_t u + ⟨H_t x + M_t u, 2 S_t x⟩ ] + tr[(σ_t σ_t^T) S_t] = 0.

Carrying out the minimization over u and collecting terms gives a matrix Riccati equation of the form

Ṡ_t + S_t A_t S_t + B_t S_t + C_t = 0

for appropriate matrices A_t and B_t, together with an ODE for a_t.
These two equations give you St and at such that we arrive at the optimal value.
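As a concrete scalar sketch, assume the standard linear-quadratic setup described above with dx = (h x + m u) dt + σ dB and cost ∫(c x^2 + d u^2) dt + r x_T^2. The corresponding scalar Riccati equation is S′ = −c − 2hS + (m^2/d)S^2 with S(T) = r, and it can be integrated backward numerically; all coefficient values here are illustrative assumptions:

# Backward integration of the scalar LQR Riccati equation
#   S'(t) = -c - 2 h S + (m^2 / d) S^2,   S(T) = r,
#   a'(t) = -sigma^2 S,                   a(T) = 0,
# giving the optimal feedback u*(t, x) = -(m S(t) / d) x and the value
# psi(t, x) = S(t) x^2 + a(t).
import numpy as np

h, m, sigma = -0.5, 1.0, 0.3      # dynamics dx = (h x + m u) dt + sigma dB
c, d, r = 1.0, 0.5, 2.0           # state, control, and terminal costs
T, dt = 1.0, 1e-4

S, a = r, 0.0                      # terminal conditions at t = T
for _ in range(int(T / dt)):       # march backward in time from T to 0
    dS = -c - 2 * h * S + (m**2 / d) * S**2
    da = -sigma**2 * S
    S -= dS * dt                   # step from t to t - dt
    a -= da * dt

print("S(0) =", S, "  a(0) =", a)
print("optimal feedback gain at t=0: u = ", -(m * S / d), "* x")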
7.4 Stochastic Filtering
The general problem involves a system of equations
(System)        dX_t = b(t, X_t) dt + σ(t, X_t) dU_t
(Observations)  dZ_t = c(t, X_t) dt + γ(t, X_t) dV_t
Ut : p-dim Brownian motion
Vt : r-dim Brownian motion
where Xt is the state, Zt is the observation, and Ut and Vt are two independent Brownian motions.
Assume F , G, C, D are bounded on bounded intervals, Z0 = 0.
In this problem, we seek to estimate the value of the system X_t based on the observations Z_s up to the present time s ≤ t, that is, conditional on G_t, the σ-algebra generated by {Z_s}_{0≤s≤t}. The best estimate X̂_t in the mean-square sense is the orthogonal projection of X_t onto

K_t := {Y : Ω → R^n : Y ∈ L^2(P) and Y is G_t-measurable},

with L^2(P) being the set of square-integrable functions with respect to the measure P.
Theorem: Let G_t ⊂ F_t be a sub-σ-algebra and let X ∈ L^2(P) be F_t-measurable. Let N = {Y ∈ L^2(P) : Y is G_t-measurable}. It follows that

P_N(X_t) = E[X_t|G_t]
where PN is the orthogonal projection of X onto N .
Proof: To prove that PN (Xt ) = E [Xt |Gt ], we simply need to show it satisfies the two properties
of the conditional expectation. Notice that it is trivial that PN (Xt ) is Gt -measurable since every
X ∈ N is Gt -measurable. Thus we just need to check the Partial Averaging Property. Since PN is
an orthogonal projection onto N , we get that X − PN (X) is orthogonal to N . Now take IA ∈ N
as an arbitrary indicator function for A ∈ G_t. This means that we define I_A as

I_A(ω) = 1 if ω ∈ A;  0 otherwise.
Since I_A ∈ N, X − P_N(X) is orthogonal to I_A. Thus we get from the Hilbert-space inner product that

⟨X − P_N(X), I_A⟩ = 0 = ∫_Ω (X − P_N(X)) I_A dp = ∫_A (X − P_N(X)) dp,

and thus the partial averaging property holds, completing the proof.
Now consider the linear (1-dimensional) filtering problem

(System)        dX_t = F(t) X_t dt + C(t) dU_t
(Observations)  dZ_t = G(t) X_t dt + D(t) dV_t

where X_t is the state, Z_t is the observation, and U_t and V_t are two independent Brownian motions. Assume F, G, C, D are bounded on bounded intervals, Z_0 = 0, and X_0 is normally distributed.
For this problem, we will simply outline the derivation for the Kalman-Bucy Filter and provide
intuition for what the derivation means (for the full proof, see Oksendal). The derivation proceeds
as follows:
Step 1: Show It's A Gaussian Process. Let L be the closure (the set including its limit points) in L^2(p) of the set of random variables that are linear combinations of the form

c_0 + c_1 Z_{s_1} + · · · + c_k Z_{s_k}

with s_j ≤ t and each c_j ∈ R. Let P_L be the projection from L^2(p) onto L. It follows that
X̂t = PL (Xt ).
We can interpret this step as saying that the best estimate for Xt can be written as a linear
combination of past values of Zt . Notice that since the variance term in the SDE is not dependent
on Z and X, the solution will be a Gaussian distribution. Since the sum of Gaussian distributions
is a Gaussian distribution, this implies that X̂ is Gaussian distributed! This gives the grounding for
our connection between estimating Xt from Zt by using Brownian motions and Gaussian processes.
Because this step is so important, we include a proof.
Theorem: Take X and Z_s, s ≤ t, to be random variables in L^2(p) and assume that (X, Z_{s_1}, . . . , Z_{s_n}) ∈ R^{n+1} has a normal distribution for all s_1, . . . , s_n ≤ t with n ≥ 1. It follows that

P_L(X) = E[X|G_t] = P_K(X).
Proof: Define X̌ = P_L(X) and X̃ = X − X̌, so that X̃ ⊥ L. We show that X̌ satisfies the defining properties of the conditional expectation, which gives P_L(X) = E[X|G_t] = P_K(X). We do this in steps:

1. If (y_1, . . . , y_k) ∈ R^k is normally distributed, then c_1 y_1 + . . . + c_k y_k is normally distributed. We leave out the proof that in the limit as k → ∞ such combinations remain normally distributed. Thus, since (X, Z_{s_1}, . . . , Z_{s_n}) is normally distributed, X̌ is normally distributed.

2. Since X̃ ⊥ L and each Z_{s_j} ∈ L, X̃ is orthogonal to each Z_{s_j}. Thus E[X̃ Z_{s_j}] = 0. Since the variables are jointly Gaussian, non-correlation implies independence, so X̃ is independent of Z_{s_1}, . . . , Z_{s_n}. Thus, denoting G_t as the σ-algebra generated by the Z_s, we get that X̃ is independent of G_t.

3. Take I_G to be the indicator function of an arbitrary event G ∈ G_t. Since X̃ = X − X̌, multiplying by the indicator and taking expectations gives

E[(X − X̌) I_G] = E[I_G X̃].

Since X̃ is independent of G_t,

E[(X − X̌) I_G] = E[I_G] E[X̃].

Since the constant function 1 is in L and X̃ ⊥ L, we have E[X̃] = E[X̃ · 1] = 0. Thus

E[(X − X̌) I_G] = 0,

which gives

∫_G X dp = ∫_G X̌ dp

for any G ∈ G_t. Thus the partial averaging property is satisfied, meaning
X̌ = E [X|Gt ]
completing the proof.
Step 2: Estimate Using a Gram-Schmidt-Like Procedure To understand this step, recall
the Gram-Schmidt Procedure. What the Gram-Schmidt procedure does is, given a countable set
of vectors, it finds a basis set of orthogonal vectors that will span the same space. It does this
by iteration. First, it takes the first vector as the first basis vector. Then it recursively does the following: take the next vector v, find the projection of this vector onto the current basis space by using the dot product (call it v_p), and then, knowing that v − v_p is orthogonal to the basis space, add v − v_p as a new basis vector.

Here we do a similar procedure. Since the Z_s used to span L are indexed by time, order them by time. We replace Z_t by the innovation process N_t defined as

N_t = Z_t − ∫_0^t (GX)^∧_s ds

where

(GX)^∧_s = P_{L(X,s)}(G(s)X(s)) = G(s)X̂(s),

or equivalently

dN_t = dZ_t − G(t)X̂_t dt = G(t)(X_t − X̂_t) dt + D(t) dV_t.

Note that (GX)^∧_s is basically the part of the observation already explained by the space spanned up to time s, and thus what we add to the basis is Z_t minus its projection onto that space. Thus this is a type of continuous version of the Gram-Schmidt procedure. We can prove that the following properties hold:
1. Nt has orthogonal increments: E [(Nt1 − Ns1 ) (Nt2 − Ns2 )] = 0 for every non-overlapping
[s1 , t1 ] and [s2 , t2 ]. So each time increment is orthogonal (since time is the basis of this
Gram-Schmidt-Like procedure).
2. L(N, t) = L(Z, t). That is simply that N and Z span the same space, as guaranteed by the
Gram-Schmidt process.
Step 3: Find the Brownian Motion. Define dR_t = dN_t / D(t). Using the non-overlapping independence property, we can show that R_t is actually a Brownian motion. Notice trivially that
L(N, t) = L(R, t) and thus the space spanned by this Brownian motion is sufficient for the estima-
tion of X̂. Therefore we get that
X̂_t = P_{L(R,t)}(X(t)) = E[X_t] + ∫_0^t (∂/∂s) E[X_t R_s] dR_s.
Carrying out this computation yields the Kalman-Bucy filter

dX̂_t = (F(t) − G^2(t)S(t)/D^2(t)) X̂_t dt + (G(t)S(t)/D^2(t)) dZ_t

where

X̂_0 = E[X_0]

and S(t) = E[(X_t − X̂_t)^2] satisfies the Riccati equation

S′(t) = 2F(t)S(t) − (G^2(t)/D^2(t)) S^2(t) + C^2(t)

S(0) = E[(X_0 − E[X_0])^2].
Note that, if we do not know the form of the equation, we can think of F , C, G, and D as unobservables
as well. Thus using a multidimensional version of the Kalman filter, we can iteratively estimate
the “constants” too! Thus, if we discretize this problem, we can estimate the future values by
estimating G, D, F , and C, and then using these to uncover X. Notice that since these constants
are changing, this type of a linear solution can approximate any non-linear interaction by simply
making the time-step small enough. Thus even though this is just the “linear approximation”, the
linear approximation can computationally solve the non-linear problem! For a detailed assessment
of how this is done, refer to Hamilton’s Time Series Analysis.
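To see the discretized idea in practice, here is a minimal scalar Kalman filter sketch. The discrete-time model x_{k+1} = F x_k + noise, z_k = G x_k + noise, and all parameter values are illustrative assumptions, not the notation of the continuous derivation above:

# Minimal scalar discrete-time Kalman filter for
#   x_{k+1} = F x_k + w_k,  w_k ~ N(0, Q)      (hidden state)
#   z_k     = G x_k + v_k,  v_k ~ N(0, R)      (observation)
import numpy as np

rng = np.random.default_rng(10)
F, G, Q, R = 0.95, 1.0, 0.1, 0.5
n = 200

# Simulate a hidden trajectory and noisy observations of it.
x_true = np.zeros(n)
z = np.zeros(n)
for k in range(1, n):
    x_true[k] = F * x_true[k - 1] + rng.normal(0.0, np.sqrt(Q))
    z[k] = G * x_true[k] + rng.normal(0.0, np.sqrt(R))

# Kalman filter: predict, then correct with the innovation z_k - G*x_pred.
x_hat, S = 0.0, 1.0                        # initial estimate and its variance
errors = []
for k in range(1, n):
    x_pred = F * x_hat                     # predict the state
    S_pred = F * S * F + Q                 # predict the error variance
    K = S_pred * G / (G * S_pred * G + R)  # Kalman gain
    x_hat = x_pred + K * (z[k] - G * x_pred)
    S = (1 - K * G) * S_pred
    errors.append(abs(x_hat - x_true[k]))

print("mean |x_hat - x_true|:", np.mean(errors))
print("mean |z/G  - x_true| :", np.mean(np.abs(z[1:] / G - x_true[1:])))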
8 Stochastic Calculus Cheat-Sheet
Basic Probability Definitions
Binomial Distribution ∼ Bin(n, p)
• Distribution function: P(X = k) = (n choose k) p^k (1 − p)^{n−k}

• Cumulative Distribution: P(X ≤ k) = Σ_{i≤k} P(X = i)

• Expectation: E[X] = Σ_{k=1}^n k P(X = k) = np
Useful Properties
Poisson Counter SDEs
dx = f(x, t) dt + Σ_{i=1}^m g_i(x, t) dN_i

N_i(t) ∼ Poisson(λ_i t)

E[N_i(t)] = λ_i t

∫_0^t dN_t = lim_{∆t→0} Σ_{i=0}^{n−1} (N(t_{i+1}) − N(t_i)),    t_i = i∆t

P(k jumps in the interval (t, t + dt)) = ((λ dt)^k / k!) e^{−λ dt}
Ito's Rule (Poisson counters)

dY_t = dψ(x, t) = (∂ψ/∂t) dt + (∂ψ/∂x) f(x) dt + Σ_{i=1}^m [ψ(x + g_i(x), t) − ψ(x, t)] dN_i

W_t ∼ N(0, t)

(dt)^n = dt if n = 1;  0 otherwise

dW_i^n dt^m = 0    (n, m ≥ 1)

dW_i × dW_j = dt if i = j;  0 if i ≠ j

∫_0^t g(X, t) dW_t = lim_{∆t→0} Σ_{i=0}^{n−1} g(X_{t_i}, t_i)(W_{t_{i+1}} − W_{t_i}),    t_i = i∆t

dy = dψ(x, t) = (∂ψ/∂t) dt + (∂ψ/∂x) dx + (1/2)(∂^2 ψ/∂x^2)(dx)^2
Ito's Rule (Wiener processes)

dy = dψ(x, t) = [∂ψ/∂t + f(x, t) ∂ψ/∂x + (1/2) Σ_{i=1}^n g_i^2(x, t) ∂^2 ψ/∂x^2] dt + (∂ψ/∂x) Σ_{i=1}^n g_i(x, t) dW_i

Kolmogorov Backward Equation

∂ρ/∂t + a(x(t), t) ∂ρ/∂x + (1/2) b^2(x(t), t) ∂^2 ρ/∂x^2 = 0
Fluctuation-Dissipation Theorem
Useful Properties
5. Independent Increments: E[(W_{t_1} − W_{s_1})(W_{t_2} − W_{s_2})] = 0 if [s_1, t_1] does not overlap [s_2, t_2].

6. E[∫_0^t h(s) dW_s] = 0.

7. Ito Isometry: E[(∫_0^T X_t dW_t)^2] = E[∫_0^T X_t^2 dt]
Simulation Methods
dx = f(x, t) dt + Σ_{i=1}^n g_i(x, t) dW_i,    η_i, λ_i ∼ N(0, ∆t), with η_i and λ_i independent

∆W_t = √(∆t) η_i,    ∆U_t = (1/√3) ∆t λ_i
Other Properties
Properties of Conditional Expectation
1. E[X|G] exists and is unique except on a set of measure 0.
(a) This can be generalized: For all A ⊂ R, if X ∈ A a.s., then E[X|G] ∈ A a.s.
5. Taking out what is known: If X is G-measurable, then E[XY|G] = X E[Y|G]. Notice that this is because if X is G-measurable, it is known given the information of G and thus can be treated as a constant.
6. Iterated Conditioning: If H ⊂ G ⊂ F, then E [E[X|G]|H] = E[X|H].