0% found this document useful (0 votes)

18 views64 pages

Lecture 11 - 14 Computational Techniques

The document discusses Bayesian statistics and computational techniques, focusing on the Laplace approximation for integrals involving smooth functions. It covers examples, error analysis, and methods such as Monte Carlo, independent Monte Carlo, and importance sampling for approximating posterior expected values. Additionally, it introduces Markov Chain Monte Carlo (MCMC) as a method for sampling from complex posterior distributions.

Uploaded by

Lavy Koilpitchai

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

18 views64 pages

Lecture 11 - 14 Computational Techniques

Uploaded by

Lavy Koilpitchai

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 64

Bayesian Statistics

Computational Techniques

Shaobo Jin
Department of Mathematics

Shaobo Jin (Math) Bayesian Statistics 1 / 64

Computational Techniques Laplace Approximation

Laplace Approximation to Integral

Suppose that we want to approximate the integral
ˆ
h (θ) exp {−ℓ (θ)} dθ,

where θ has the dimension d × 1, p is a known constant, and ℓ (θ) and

h (θ) are smooth functions. If ℓ (θ) is uniquely minimized at θ̂ such that

∂ℓ θ̂ ∂ 2 ℓ θ̂
= 0, > 0.
∂θ ∂θ∂θT
Then, the Laplace approximation to the above integral is
v
u  −1 
2
 ∂ ℓ θ̂
u n o
u
(2π)d/2 u
tdet  ∂θ∂θT   exp −ℓ θ̂ h θ̂ .


Shaobo Jin (Math) Bayesian Statistics 2 / 64

Computational Techniques Laplace Approximation

Laplace Approximation: Example

Example
Suppose that posterior is
Pn
σ2

2 i=1 yi
β | y, σ ∼ N , ,
n+1 n+1
" n #!
1 X 2 ( ni=1 yi )2
P
n
σ | y ∼ InvGamma 2 + , 2 +
2
yi − ,
2 2 n+1
i=1

where n = 20, i=1 yi = 40.4, and = 93.2. Approximate

Pn Pn 2
i=1 yi
E [β | y] by Laplace approximation.

Shaobo Jin (Math) Bayesian Statistics 3 / 64

Computational Techniques Laplace Approximation

Error Analysis

Suppose that we can express ℓ (θ) as pq (θ) such that the integral is
ˆ
h (θ) exp {−pq (θ)} dθ,

where h (θ) and q (θ) do not include p, and h (θ) and q (θ) are smooth
functions.
Let H def
∂ 2 q (θ̂)
= ∂θ∂θT
> 0. The Laplace approximation satises
r n o h
d/2 i
(2π) det H −1 θ̂ /p exp −pq θ̂ h θ̂ + O p−1 .

Shaobo Jin (Math) Bayesian Statistics 4 / 64

Computational Techniques Laplace Approximation

Ratio of Integrals

In practice, we often want to approximate a ratio of integrals such that

ˆ ´
h (θ) f (x | θ) π (θ) dθ
h (θ) π (θ | x) dθ = ´ .
f (x | θ) π (θ) dθ

The naive approach is to approximate both the numerator and

denominator separately by Laplace approximation and take the ratio of
approximations. This yields
ˆ
h (θ) π (θ | x) dθ ≈ h θ̂ ,

which is not recommended.

Shaobo Jin (Math) Bayesian Statistics 5 / 64

Computational Techniques Laplace Approximation

Moment Generation Function

Consider the moment generation function

ˆ
E [exp {th (θ)} | x] = exp {th (θ)} π (θ|x) dθ
´
exp {th (θ)} f (x | θ) π (θ) dθ
= ´ .
f (x | θ) π (θ) dθ

We apply the Laplace approximation both the denominator and

numerator, and take the ratio of approximations. Using the property of
the moment generation function,
d log E [exp {th (θ)} | x]
E [h (θ) | x] = .
dt t=0

Shaobo Jin (Math) Bayesian Statistics 6 / 64

Computational Techniques Laplace Approximation

Fully Exponential Laplace Approximation

Let

ℓ (θ, t) = −th (θ) − log f (x | θ) − log π (θ) .

The fully exponential Laplace approximation is

1∂ ∂ 2 ℓ θ̃ (t) , t
E [h (θ) | x] ≈ h θ̂ − log ,
2 ∂t ∂θ∂θT t=0

where θ̃ (t) maximizes ℓ (θ, t) for a given t, and θ̂ = θ̃ (0).

Under the same assumptions as for the Laplace approximation, the
error rate is O p .
−2

Shaobo Jin (Math) Bayesian Statistics 7 / 64

Computational Techniques Monte Carlo

Expectation Under Posterior

For given data x, we often need to compute the posterior expected
value of a function h (θ, x),
ˆ
µ (x) = h (θ, x) π (θ | x) dθ.

However, it is not always the case that we can nd the closed form
expression of µ (x). Approximations are more often needed.
Suppose that we want to approximate
ˆ
E [h (x)] = h (x) f (x) dx,

where f (x) is the density of random variable/vector X . A natural

approximation is to approximate it by the sample mean
n
1X
h̄ = h (xi ) .
n
i=1

Shaobo Jin (Math) Bayesian Statistics 8 / 64

Computational Techniques Monte Carlo

Approximate Expectation by Sample Mean

Under mild conditions, the sample mean

n
1X
h̄ = h (xi )
n
i=1

has nice properties.

1 Unbiasedness: E h̄ = E [h (x)] for any n.

2 Consistency: h̄ → E [h (x)] in probability, as n → ∞.

3 Strong consistency: h̄ → E [h (x)] almost surely, as n → ∞.

√
4 Asymptotic normality: n h̄ − E [h (x)] → N (0, Var [h (x)]) in

distribution, as n → ∞.
The classic methods (e.g., independent Monte Carlo and importance
sampling) have these properties.

Shaobo Jin (Math) Bayesian Statistics 9 / 64

Computational Techniques Independent Monte Carlo

Sample From Posterior

For given data x, suppose that we want to compute the posterior

expected value
ˆ
µ (x) = h (θ, x) π (θ | x) dθ.

If π (θ | x) is a well-known distribution such that we can easily sample

from it, then we draw n independent samples from π (θ | x) and the
independent Monte Carlo approximation is
n
1X
µ̂IMC = h (θi , x) .
n
i=1

Shaobo Jin (Math) Bayesian Statistics 10 / 64

Computational Techniques Independent Monte Carlo

Independent Monte Carlo: Example

Example
Suppose that posterior is
Pn
σ2

2 i=1 yi
β | y, σ ∼ N , ,
n+1 n+1
" n #!
Pn 2
n 1 ( y )
σ 2 | y ∼ InvGamma 2 + , 2 +
i
X
yi2 − i=1
,
2 2 n+1
i=1

where n = 20, ni=1 yi = 40.4, and ni=1 yi2 = 93.2. We want to

P P
approximate E [β | y] by independent Monte Carlo. In this example, we
know the true value
Pn
i=1 yi
E [β | y] = .
n+1

Shaobo Jin (Math) Bayesian Statistics 11 / 64

Computational Techniques Importance Sampling

Importance Distribution

It is common that it is not straightforward to sample directly from

π (θ | x). Suppose that it is easy for us to sample directly from another
distribution with density g (θ | x) such that g (θ | x) > 0 whenever
h (θ, x) π (θ | x) ̸= 0.
We can rewrite µ (x) as
ˆ
π (θ | x) π (θ | x)
µ (x) = h (θ, x) g (θ | x) dθ = E h (θ, x) |x ,
g (θ | x) g (θ | x)

where the expectation is taken with respect to θ | x ∼ g (θ | x).

We call g (θ | x) an importance distribution or instrumental
distribution.

Shaobo Jin (Math) Bayesian Statistics 12 / 64

Computational Techniques Importance Sampling

Importance Sampling Approximation

The importance sampling approximation is
n
1X π (θi | x)
µ̂IS = h (θi , x) .
n g (θi | x)
i=1

Example
Suppose that posterior is
Pn
σ2

2 i=1 yi
β | y, σ ∼ N , ,
n+1 n+1
" n #!
1 X 2 ( ni=1 yi )2
P
n
σ | y ∼ InvGamma 2 + , 2 +
2
yi − ,
2 2 n+1
i=1

where
2 n = 20, i=1 yi = 40.4, and = 93.2. Approximate
Pn Pn 2
i=1 yi
E σ | y by importance sampling.
Shaobo Jin (Math) Bayesian Statistics 13 / 64
Computational Techniques Normalized Importance Sampling

We can apply the importance sampling trick to both integrals:

 
ˆ  f (x | θ) π (θ) 
h (θ, x) f (x | θ) π (θ) dθ = E h (θ, x)
 
g (θ | x)

 
| {z }
importance weight w(θ,x)

where g (θ | x) > 0 whenever π (θ | x) ̸= 0, stronger than IS.

Shaobo Jin (Math) Bayesian Statistics 14 / 64
Computational Techniques Normalized Importance Sampling

Normalized Importance Sampling

The importance sampling approximations to the numerator and

denominator are
n n
1X 1 X f (x | θi ) π (θi )
w (θi , x) h (θi , x) = h (θi , x) ,
n n g (θi | x)
i=1 i=1
n n
1 X 1 X f (x | θi ) π (θi )
w (θi , x) = .
n n g (θi | x)
i=1 i=1

The ratio is the normalized importance sampling estimator

Pn
w (θ , x) h (θi , x)
µ̂NIS = Pn i
i=1
.
i=1 w (θi , x)

Shaobo Jin (Math) Bayesian Statistics 15 / 64

Computational Techniques Normalized Importance Sampling

Normalized Importance Sampling: Example

We can even ignore the constants in f (y | θ) π (θ) in normalized

importance sampling.
Example
Consider an iid sample of size n from Y | β, σ 2 ∼ 2 . The prior

N β, σ
of σ 2 is InvGamma (2, 2), and β | σ 2 is N 0, σ 2 . Then,

(n+1)β 2 −2β n
n P Pn 2
o
i=1 yi +4+ i=1 yi
exp − 2σ 2
f (y | θ) π (θ) ∝
(σ 2 )(n+1)/2+3

We observe n = 20, ni=1 yi = 40.4, and ni=1 yi2 = 93.2. Approximate
P P
E σ 2 | y by normalized importance sampling.

Shaobo Jin (Math) Bayesian Statistics 16 / 64

Computational Techniques Normalized Importance Sampling

Randomness
In independent Monte Carlo, importance sampling, and normalized
importance sampling, we simulate random numbers from π (θ | x) or
g (θ | x), then µ̂ is a random variable. This means that we can
construct condence interval for µ̂ using the central limit theorem.

Monte Carlo approximation

8
6
Density

4
2
0

1.75 1.80 1.85 1.90 1.95 2.00 2.05 2.10

MCInt
Shaobo Jin (Math) Bayesian Statistics 17 / 64
Computational Techniques MCMC

Markov Chain Monte Carlo

We often want to get a sample from the posterior.

If the posterior follows some well known distribution, we can
generate a sample easily.
If the posterior does not follow any well known distribution, the
Markov Chain Monte Carlo (MCMC) is a very popular choice.
The idea of MCMC relies on the Markov property.
Denition
A Markov chain is a sequence of random variables Xi that satisfy the
Markov property:
P (Xi+1 ∈ A | Xj = xj , 0 ≤ j ≤ i) = P (Xi+1 ∈ A | Xi = xi ) .

Shaobo Jin (Math) Bayesian Statistics 18 / 64

Computational Techniques MCMC

Transition Kernel

The transition kernel describes how the Markov chain moves from Xn−1
to Xn .
If {Xn } is discrete, the transition kernel is a matrix K with
elements P (Xn = y | Xn−1 = x).
If {Xn } is continuous, the Markov property means that
ˆ
P (Xn ∈ A | Xn−1 = x, · · · X0 ) = K (x, y) dy,
y∈A
f (Xn = y | Xn−1 = x, · · · X0 ) = f (Xn = y | Xn−1 = x) = K (x, y) ,

where the transition kernel K (x, y) is the conditional density of Y

given X = x.

Shaobo Jin (Math) Bayesian Statistics 19 / 64

Computational Techniques MCMC

Stationary Distribution

Denition
The distribution p on Ω is a stationary distribution (or invariant
distribution) of the Markov chain with the transition kernel K , if

P (y) = P (x) K (x, y) , discrete case,

x∈X
ˆ
f (y) = f (x) K (x, y) dx, continuous case,
x∈X

where P and f are not generic symbols.

The stationary distribution means that if the initial state

X0 ∼ π (θ | data), then Xn ∼ π (θ | data) for all n ≥ 0, the same
distribution.
Shaobo Jin (Math) Bayesian Statistics 20 / 64
Computational Techniques MCMC

Long-Run Property

Theorem
Let π () be the stationary distribution of the Markov chain. Under some
regularity conditions,

lim sup |P (Xn ∈ A | X0 = x) − π (A)| = 0, almost surely,

n→∞ A

regardless of the initial state X0 = x.

Since the limiting distribution does not depend on the initial state x,
the marginal distribution of Xn is approximately the stationary
distribution, after large enough iterations.

Shaobo Jin (Math) Bayesian Statistics 21 / 64

Computational Techniques MCMC

Choose the Transition Kernel

Our goal is to simulate data from π (θ | x). We need to choose the

transition kernel K such that the stationary distribution is π (θ | x).
Fact
If π (θ | x) and K (θ, θ∗ | x) satises the detailed balance condition, i.e,

K (θ, θ∗ ) π (θ | x) = K (θ∗ , θ) π (θ∗ | x) ,

for any θ, θ∗ ∈ Θ, then π (θ | x) is the stationary distribution of the

Markov chain with the transition kernel K .

Shaobo Jin (Math) Bayesian Statistics 22 / 64

Computational Techniques Metropolis-Hastings

Proposal Distribution
When we simulate random numbers from a Markov chain, we need a
proposal distribution
T (θ, θ∗ ) = f (θ∗ | θ) .

Find a proposal distribution T (θ, θ∗ ) that satises the detailed balance

condition is dicult.
So with probability A (θ, θ∗ ) we let θ(n+1) = θ∗ (accept), and
probability 1 − A (θ, θ∗ ) we let θ(n+1) = θ (reject).
For θ(n+1) ̸= θ, the transition is
K (θ, θ∗ ) = T (θ, θ∗ ) A (θ, θ∗ ) .

Hence, we should seek A such that the detailed balance condition

is fullled.
Shaobo Jin (Math) Bayesian Statistics 23 / 64
Computational Techniques Metropolis-Hastings

Deriving A (θ, θ∗ )
The detailed balance condition is fullled, if we choose the acceptance
probability to be
A (θ, θ∗ ) = λ (θ, θ∗ ) π (θ∗ | x) T (θ∗ , θ) ≤ 1,
A (θ∗ , θ) = λ (θ, θ∗ ) π (θ | x) T (θ, θ∗ ) ≤ 1.

The value λ that maximizes the probability A (·, ·) ≤ 1 is

∗ 1 1
λ (θ, θ ) = min , .
π (θ | x) T (θ , θ) π (θ | x) T (θ, θ∗ )
∗ ∗

Hence,
π (θ∗ | x) T (θ∗ , θ)

∗ ∗ ∗ ∗
A (θ, θ ) = λ (θ, θ ) π (θ | x) T (θ , θ) = min 1, .
π (θ | x) T (θ, θ∗ )

Shaobo Jin (Math) Bayesian Statistics 24 / 64

Computational Techniques Metropolis-Hastings

Metropolis-Hastings Algorithm
The Metropolis-Hastings algorithm allows proposal distributions such
that T (θ, θ∗ ) > 0 if and only if T (θ∗ , θ) > 0.

Algorithm 1: Metropolis-Hastings Algorithm

1 Choose an initial state θ (0) ;
2 for t = 1 in 1 : n do
Sample a candidate θ ∗ from

3 T θ(t) , θ | x ;
π(θ ∗ |x)T (θ ∗ ,θ (t) )
R θ(t) , θ∗ = π θ(t) |x T θ(t) ,θ∗

4 Calculate the ratio ;
( ) ( )
5 Draw U ∼ U [0, 1] ;
6 Update
(
θ∗ , U ≤ R θ(t) , θ∗ ,

(t+1) if
θ =
θ(t) , otherwise.

7 end
Shaobo Jin (Math) Bayesian Statistics 25 / 64
Computational Techniques Metropolis-Hastings

Metropolis-Hastings Algorithm: Example

∗
Since the ratio R θ(t) , θ∗ includes ππ(θ |x)
, we only need to know

(θ(t) |x)
π (· | x) up to a normalizing constant.
Example
Consider an iid sample of size n from Y | β, σ 2 ∼ 2 . The prior

N β, σ
of σ 2 is InvGamma (2, 2), and β | σ 2 is N 0, σ 2 . Then,

Pn Pn
exp − (n + 1) β 2 − 2β i=1 yi + 4 + i=1 yi2 / 2σ 2
f (y | θ) π (θ) ∝ (n+1)/2+3
.
(σ 2 )

We observe n = 20, ni=1 yi = 40.4, and = 93.2. Obtain a

P Pn 2
i=1 yi
sample from the posterior.

Shaobo Jin (Math) Bayesian Statistics 26 / 64

Computational Techniques Metropolis-Hastings

Detailed Balance: Symmetric Proposal

The Metropolis-Hastings algorithm allows asymmetric proposal

distributions.
If the proposal distribution is symmetric, i.e., T (θ, θ∗ ) = T (θ∗ , θ),
then the Metropolis-Hastings algorithm reduces to the Metropolis
algorithm.
Example
θ∗ | θ ∼ N θ, σ 2 is symmetric, since

( )
2
∗ 1 (θ − θ∗ )
T (θ, θ ) = √ exp − .
2σ 2 2σ 2

Shaobo Jin (Math) Bayesian Statistics 27 / 64

Computational Techniques Metropolis-Hastings

Metropolis Algorithm

Algorithm 2: Metropolis Algorithm

1 Choose an initial state θ (0) ;
2 for t = 1 in 1 : n do
Sample a candidate θ ∗ from

3 T θ(t) , θ | x ;
∗
|x)
R θ(t) , θ∗ = ππ(θ

4 Calculate the ratio ;
(θ(t) |x)
5 Draw U ∼ U [0, 1] ;
6 Update
(
θ∗ , U ≤ R θ(t) , θ∗ ,

(t+1) if
θ =
θ(t) , otherwise.

7 end

Shaobo Jin (Math) Bayesian Statistics 28 / 64

Computational Techniques Metropolis-Hastings

Some Examples of Metropolis-Hastings Algorithms

Many dierent MCMC algorithms dier mainly in how the candidate y

is sampled.
In the random-walk Metropolis algorithm, θ∗ = θ(t) + ϵ, where ϵ is
sampled from some distribution, e.g., Uniform [−a, a], Normal, etc.
In independence sampler, θ∗ is sampled from g (·) that does not
depend on θ(t) .
The Langevin Metropolis-Hastings algorithm explores the shape of
the posterior distribution by θ∗ = θ(t) + d(t) + τ ϵ, where
ϵ ∼ N (0, I) and

τ 2 ∂ log π θ(t) | x

(t)
d = .
2 ∂θ

Shaobo Jin (Math) Bayesian Statistics 29 / 64

Computational Techniques Gibbs Sampler

Gibbs Sampler: Conditioning

It can be the case that it is much easier to sample from the conditional
distributions than using Metroplis-Hastings from the joint distribution
of θ ∈ Θ ⊂ Rd .
Suppose that θ = (θ1 , ..., θp ), where θi ∈ Rdi .
Let πi|−i (θi | θ−i , x) be the conditional distribution of θi given θ−i
and x, where θ−i = θ1 · · · θi−1 θi+1 · · · θp .

Algorithm 3: Basic Gibbs Sampler

1 Choose an initial state θ (0) ;
2 for t = 1 in 1 : n do
3 for i = 1 in 1 : p do
(t+1) (t+1) (t+1) (t) (t)
4 Draw θi ∼ πi|−i θi | θ1 , ..., θi−1 , θi+1 , ..., θp , x ;

5 end
6 end

Shaobo Jin (Math) Bayesian Statistics 30 / 64

Computational Techniques Gibbs Sampler

Gibbs Sampler: Example

Example
Suppose that our data X1 , ..., Xn are iid from N µ, λ−1 . The prior

distributions of µ and λ are

µ ∼ N µ0 , λ−1

0 ,
λ ∼ Exp (b0 ) .

Use Gibbs sampler to sample random numbers from the posterior

distribution of µ, λ.

Shaobo Jin (Math) Bayesian Statistics 31 / 64

Computational Techniques Gibbs Sampler

Why Does Gibbs Sampler Work?

In order to show the Gibbs sampler generate random numbers from the
desired stationary distribution, we only need to show
ˆ
∗
π (θ | x) = K (θ, θ∗ ) π (θ | x) dθ,

where π (· | x) is not a generic symbol.

For simplicity, we consider p = 2 and continuous posterior.
The transition kernel K (θ, θ∗ ) is
K ((θ1 , θ2 ) , (θ1∗ , θ2∗ )) = π1|2 (θ1∗ | θ2 , x) π2|1 (θ2∗ | θ1∗ , x) .

This transition kernel satises

ˆ ˆ
K ((θ1 , θ2 ) , (θ1∗ , θ2∗ )) π (θ1 , θ2 | x) dθ1 dθ2 = π (θ1∗ , θ2∗ | x) .

Shaobo Jin (Math) Bayesian Statistics 32 / 64

Computational Techniques Gibbs Sampler

Collapsed Gibbs Sampler

Suppose that θ can be partitioned into three groups of parameters

(θ1 , θ2 , θ3 ).
The Gibbssampler samples from the full
conditional distributions

∼ π θ1 | θ2 , θ3 , x , θ2 , θ3 , x , and
(t+1) (t) (t) (t+1) (t+1) (t)
θ1 ∼ π θ2 | θ1

,x .
(t+1) (t+1) (t+1)
θ3 ∼ π θ3 | θ1 , θ2

In collapsed Gibbs sampler, we can integrate out θ3 analytically and

work with (θ1 , θ2 ) ∼ π (θ1 , θ2 | x).

We sample θ1(t+1) ∼ π θ1 | θ2(t) , x and θ2(t+1) ∼ π θ2 | θ1(t+1) , x
by Gibbs sampler.

We then sample θ3(t+1) ∼ π θ3 | θ1(t+1) , θ2(t+1) , x .

Shaobo Jin (Math) Bayesian Statistics 33 / 64

Computational Techniques Hamiltonian Monte Carlo

Hamiltonian Monte Carlo

The Metropolis algorithm and the Gibbs sampler often move too
slowly through the target distribution when the dimension of the
target distribution is high.
Hamiltonian Monte Carlo (HMC) moves much quicker through the
target distribution.
For each component in the target distribution, HMC adds a
momentum variable and the proposal distribution largely depends
on the momentum variable.
Both the component in the target distribution and the momentum
are updated in the MCMC algorithm.

Shaobo Jin (Math) Bayesian Statistics 34 / 64

Computational Techniques Hamiltonian Monte Carlo

Hamiltonian Dynamics

The idea of HMC originates from the Hamiltonian dynamics in physics.

The state of a system consists of the position θ ∈ Rd and the
momentum ϕ ∈ Rd of same dimension.
The Hamiltonian is a function of θ and ϕ, denoted by H (θ, ϕ).
The position and the momentum can change over time t. The
change is described by the Hamilton's equations:
dθi ∂H dϕi ∂H
= , and =− ,
dt ∂ϕi dt ∂θi
for i = 1, ..., d.

Shaobo Jin (Math) Bayesian Statistics 35 / 64

Computational Techniques Hamiltonian Monte Carlo

Potential and Kinetic Energy

For HMC, the Hamiltonian is usually
H (θ, ϕ) = U (θ) + V (ϕ) ,

where U (θ) = − log π (θ | x) is called the potential energy and V (ϕ) is

called the kinetic energy.
We want to sample from π (θ | x). Hence, ϕ is articial.
We often let ϕ ∼ N (0, M ), independent of θ | x, for a prespecied
covariance matrix M , and V (ϕ) the negative log density of ϕ.
The Hamilton's equations become
dθ dϕ ∂ log π (θ | x)
= M −1 ϕ, and = ,
dt dt ∂θ
arranged as column vectors.

Shaobo Jin (Math) Bayesian Statistics 36 / 64

Computational Techniques Hamiltonian Monte Carlo

Augmentation

Since θ and ϕ are independent, their joint density is

f (θ, ϕ | x) = π (θ | x) p (ϕ | x) = exp {−U (θ) − V (ϕ)}
= exp {−H (θ, ϕ)} .

We have augmented the problem from sampling θ from π (θ | x) to

sampling (θ, ϕ) form exp {−H (θ, ϕ)}.
1 We rst sample ϕ from N (0, M ), independent of current θ .

Since ϕ ∼ N (0, M ), we already sample ϕ from the desired

distribution.
2 We then sample θ, where the new state is proposed by Hamiltonian
dynamics by solving the dierential equations.

Shaobo Jin (Math) Bayesian Statistics 37 / 64

Computational Techniques Hamiltonian Monte Carlo

Solve Dierential Equation

To solve the dierential equations, we consider an approximation

known as the leapfrog method. For some stepsize ϵ > 0, we perform
half-step updates as
ϵ ϵ ∂ϕ (t) ϵ ∂ log π (θ (t) | x)
ϕ t+ = ϕ (t) + = ϕ (t) + ,
2 2 ∂t 2 ∂x
∂θ t + 2ϵ
ϵ
θ (t + ϵ) = θ (t) + ϵ = θ (t) + ϵM −1 ϕ t + ,
∂t 2
ϵ ϵ ∂ϕ (t + ϵ) ϵ ϵ ∂ log π (θ (t + ϵ) | x)
ϕ (t + ϵ) = ϕ t + + =ϕ t+ + .
2 2 ∂t 2 2 ∂θ
Starting from t = 0, we can get a trajectory at times {ϵ, 2ϵ, ..., Lϵ}, and
approximate the values for θ (Lϵ) and ϕ (Lϵ).

Shaobo Jin (Math) Bayesian Statistics 38 / 64

Computational Techniques Hamiltonian Monte Carlo

Leapfrog Method to Sample θ

Suppose that the current state is (θ, ϕ).
1 Update ϕ with a half-step update by

ϵ ∂ log π (θ | x)
ϕ ← ϕ+ .
2 ∂θ
2 For ℓ = 1, ..., L − 1,
1 Update the position: θ ← θ + ϵM −1 ϕ.
2 Update the momentum:

∂ log π (θ | x)
ϕ ← ϕ+ϵ .
∂θ
3 Make one last update on the position: θ ← θ + ϵM −1 ϕ.
4 Make one last half-step update of the momentum
ϵ ∂ log π (θ | x)
ϕ ← ϕ+ .
2 ∂θ
Shaobo Jin (Math) Bayesian Statistics 39 / 64
Computational Techniques Hamiltonian Monte Carlo

Metropolis Step

Suppose that the state after such L updates is (θ∗ , ϕ∗ ). We negate the
momentum and the new proposal state is (θ∗ , −ϕ∗ ).
We determine whether to accept the proposal using the Metropolis
algorithm, where the acceptance probability is
exp {−H (θ∗ , −ϕ∗ )}

∗ ∗
A ((θ, ϕ) , (θ , −ϕ )) = min 1, .
exp {−H (θ, ϕ)}

If the proposed state is accepted, then we accept θ∗ as a new state

for θ, but don't care about ϕ∗ .
No matter we accept or reject the proposal, we will draw a new
momentum in the next iteration, independent of previous
momentum.

Shaobo Jin (Math) Bayesian Statistics 40 / 64

Computational Techniques Hamiltonian Monte Carlo

Properties of HMC
Some crucial properties of the Hamiltonian dynamics for MCMC
updates include
1 deterministic updates. The Hamiltonian dynamics is deterministic.

After running the leapfrog loop L times, we always move the initial
state (θ0 , ϕ0 ) to the same proposal (θ∗ , ϕ∗ ).
2 reversible. The mapping from the state at time t, denoted by

(θ (t) , ϕ (t)), to the state at time t + s, denoted by

(θ (t + s) , ϕ (t + s)), is one-to-one and has an inverse mapping. If
we negate the momentum, we will come back from
(θ (t + s) , −ϕ (t + s)) to (θ (t) , −ϕ (t)).
3 connection between momentum and position. The momentum is

changed based on the position since

dϕi ∂H ∂ log π (θ | x)
=− = .
dt ∂θi ∂θi
Shaobo Jin (Math) Bayesian Statistics 41 / 64
Computational Techniques Hamiltonian Monte Carlo

Tuning Parameters

Tuning of HMC can occur in several places such as

1 the distribution for the momentum,

2 the scaling factor ϵ,

3 the number of leapfrog steps L per iteration.

Some theory suggest that we can tune HMC such that the acceptance
probability is around 65%.

Shaobo Jin (Math) Bayesian Statistics 42 / 64

Computational Techniques Hamiltonian Monte Carlo

No-U-Turn Sampler
The no-U-turn sampler (NUTS) allows us to automatically tune the
number of steps L: we increases L until the simulated dynamics is long
enough such that the proposed position θ∗ starts to move back towards
the initial position θ if we run more steps.
This is measured by the angle between θ∗ − θ and current
momentum ϕ∗ .

A basic NUTS works as follows. Given the initial status,

1 Sample u | θ, ϕ ∼ Uniform [0, exp {−H (θ, ϕ)}].

2 Apply the leapfrog method (with some modication) until a

U-turn occurs.
3 Sample uniformly from the points in

{(θ, ϕ) : exp {−H (θ, ϕ)} ≥ u} that the leapfrog step has visited
and the detailed balance condition is fullled.
Shaobo Jin (Math) Bayesian Statistics 43 / 64
Computational Techniques Hamiltonian Monte Carlo

Adaptively Tune ϵ

A too small ϵ will waste computation by taking needlessly tiny steps,

and a too large will cause high rejection rates.
In HMC, we tune ϵ in the warm-up stage of MCMC such that the
average acceptance probability δ is the user specied value.
In NUTS, there is no Metropolis accept/reject step. But we can
still compute the ratio as if we were using the accept/reject step
and set ϵ such that the pseudo acceptance probability is the user
specied value.
In stan, the default is δ = 0.8.

Shaobo Jin (Math) Bayesian Statistics 44 / 64

Computational Techniques Convergence Diagnostics

Burn-In Period

The stationary distribution is reached after large enough iterations.

If the iterations have not proceeded long enough, the simulated
numbers may be unrepresentative of the target distribution.

To diminish the inuence of the starting values, we can discard the

early simulations, known as the burn-in.
There is no golden standard on how long the burn-in period should
be.
Hereafter, if the Markov chain has length n, we mean that after
the burn-in period, the length is n.

Shaobo Jin (Math) Bayesian Statistics 45 / 64

Computational Techniques Convergence Diagnostics

Mixing
We want the Markov chain to show good mixing.
Bad
2

−1

−2
x

Good
2

−1

−2

0 500 1000 1500 2000

iteration

Shaobo Jin (Math) Bayesian Statistics 46 / 64

Computational Techniques Convergence Diagnostics

Several Markov Chains

5.0

2.5

chain
chain 1
x

0.0
chain 2

−2.5

−5.0
0 500 1000 1500 2000
iteration

One suggestion is to generate several independent Markov chains,

starting from widely separated places.
Shaobo Jin (Math) Bayesian Statistics 47 / 64
Computational Techniques Convergence Diagnostics

Gelman-Rubin R̂ Statistic: Variation

One way to assess convergence is the Gelman-Rubin R̂ statistic.
Suppose that we have simulated m chains
each with n iterations. Say
we have a univariate quantity yij = f θj , where θj(i) is the ith value
(i)

in the j th chain.
The variation within the chains is measured by
m n
" #
1 X 1 X
W = (yij − ȳ·j )2 ,
m n−1
j=1 i=1

where ȳ·j is the average of {yij }ni=1 .

The variation between the chains is measured by
m
n X
B = (ȳ·j − ȳ·· )2 ,
m−1
j=1

where ȳ·· is the average of all ȳ·j .

Shaobo Jin (Math) Bayesian Statistics 48 / 64
Computational Techniques Convergence Diagnostics

Gelman-Rubin R̂ Statistic: Expression

If the Markov chains have reached stationary, then we expect W to be

close to B . The Gelman-Rubin R̂ statistic is then
s
n−1
n W + n1 B
R̂ = ,
W

which declines to 1 as n → ∞.
It is suggested that we keep simulating the Markov chain until
R̂ < 1.1 or even < 1.01.

Shaobo Jin (Math) Bayesian Statistics 49 / 64

Computational Techniques Convergence Diagnostics

Variants of Gelman-Rubin R̂

Several dierent versions of R̂ have been proposed. One suggestion is to

split each chain into two parts, yields 2m chains of length n/2 each.
Then compute the R̂, pretending that we have simulated 2m chains of
length n/2.
This can be useful to detect the case where each chain does not
reach stationary but the chains cover a common distribution, e.g,
two chains exhibit an X -shape.

Shaobo Jin (Math) Bayesian Statistics 50 / 64

Computational Techniques Convergence Diagnostics

Serial Correlation
It is obvious that θ(t+1) and θ(t) are not independent draws. Inference
from autocorrelated draws is generally less precise than from the same
number of independent draws.
However, such serial correlation is not necessarily a problem.
Remember that, at convergence, we reach the stationary
distribution.

Algorithm 4: General MCMC Integral

1 Sample a Markov chain for a given stationary distribution π (θ | x): θ(1) ,
..., θ(R) (after burn-in) ;
2 Approximate µ (x) by

n
µ̂MCMC
1X
= h (θi , x) .
n i=1

Shaobo Jin (Math) Bayesian Statistics 51 / 64

Computational Techniques Convergence Diagnostics

Long-Run Property

Theorem
Under some conditions, for all starting state θ0 ∈ Θ,
1 ergodic theorem: For any initial state,

n
1X a.s.
h (θi , x) → E [h (θ, x) | x] = µ (x) .
n i=1

2 central limit theorem: Let σ 2 = Var

[h (θ, x) | x] and
ρj = corr h θ , x , h θ
(1) (j+1) , x | x . Then,

  
∞
" n
#
√ 1X d
X
n h (θi , x) − µ (x) → N 0, σ 2 1 + 2 ρj  .
n i=1 j=1

Shaobo Jin (Math) Bayesian Statistics 52 / 64

Computational Techniques Convergence Diagnostics

Eective Sample Size

If we have an iid sample of size n, then
n
" #
√ 1X d
→ N 0, σ 2 .

n h (θi , x) − µ (x)
n
i=1

If we have a converged Markov chain of length n,

  
n ∞
" #
√ 1 X d
X
n h (θi , x) − µ (x) → N 0, σ 2 1 + 2 ρj  .
n
i=1 j=1

The variance of µ̂MCMC is larger than the variance of µ̂IMC . We dene

n
ne = P∞
1+2 j=1 ρj

as the eective sample size of this Markov chain sample.

Shaobo Jin (Math) Bayesian Statistics 53 / 64
Computational Techniques Convergence Diagnostics

Thinning
Some prefer thinning the sequence by only keeping every kth draw from
a sequence in order to reduce serial correlation.
But whether or not the Markov chain is thinned, it can be used for
inferences, provided that it has reached convergence.
Suppose that the length of the Markov chain is n. We discard k − 1
out of every k observations and the chain after thinning is n/k.
Under some assumptions,
√ d
n [µ̂ − µ (x)] → N 0, τ 2 ,

p d
n/k [µ̂k − µ (x)] → N 0, τk2 ,

where µ̂ and µ̂k are the estimators without and with thinning,
respectively.
In fact, it has been proved that, for any k > 1, kτk2 > τ 2 , indicating
that discarding k − 1 out of every k observations will increase the
variance.
Shaobo Jin (Math) Bayesian Statistics 54 / 64
Computational Techniques Convergence Diagnostics

Simulation Under Posterior

Using MCMC and other methods, we can simulate n random numbers
from the posterior distribution π (θ | x). Using the simulated θ, we can
1 approximate the posterior mean: n−1 → E [θ | x].
Pn (i)
i=1 θ
2 approximate the posterior probability:

n
1 X (i)
1 θ ∈A → E [1 (θ ∈ A) | x] = P (θ ∈ A | x) .
n
i=1

3 approximate predictive density:

n ˆ
1X
f xnew | x, θ(i) → f (xnew | x, θ) π (θ | x) dθ.
n
i=1

4 approximate mean of predictive distribution:

n−1 ni=1 xnew → E [xnew | x], where xnew is simulated from
P (i) (i)

f xnew | x, θ(i) .
Shaobo Jin (Math) Bayesian Statistics 55 / 64
Computational Techniques Variational Inference

Approximate Posterior
If the posterior distribution family is dicult to handle, it can be useful
to approximate it by another distribution family that is easier to
handle.
The Kullback-Leibler divergence for distributions P and Q with
respective densities p and q are
ˆ
q (θ)
KL (q, p) = q (θ) log dθ ≥ 0.
p (θ)

We choose a model D for the posterior, called the variational

family.
The variational density is
q ∗ (θ | x) = arg min KL (q (θ | x) , π (θ | x)) .
q∈D

Shaobo Jin (Math) Bayesian Statistics 56 / 64

Computational Techniques Variational Inference

Variational Bayesian Inference

The idea of variational inference (VI) is to use q ∗ (θ | x) ∈ D instead of

π (θ | x) and to explore the properties of D.

We need to choose D ourselves.

Trade-o: too simple D poorly approximates π (θ | x) but too
complex D is hard to handle.
One choice is the mean-eld variational family DMF , where
m
Y
q (θ | x) = qj (θj | x) ,
j=1

that is, the components in θ are independent. We call qj (θj | x)

the j th variational factor.

Shaobo Jin (Math) Bayesian Statistics 57 / 64

Computational Techniques Variational Inference

Evidence Lower Bound

The Kullback-Leilber divergence satises

ˆ
p (θ, x)
KL (q (θ | x) , π (θ | x)) = log [m (x)] − q (θ | x) log dθ.
q (θ | x)
| {z }
evidence lower bound ELBO(q)

Since KL (q (θ | x) , π (θ | x)) ≥ 0, the ELBO satises

ELBO (q) ≤ log [m (x)] ,
a lower bound of the log-marginal likelihood of x.
Minimizing the KL divergence is the same as maximization of
ELBO.

Shaobo Jin (Math) Bayesian Statistics 58 / 64

Computational Techniques Variational Inference

Variational Inference in Linear Regression

Example
Suppose that y | β ∼ Nn (Xβ, Σ) and β ∼ Np µ0 , Λ−1 , where Σ is

0
known. The posterior is β | y ∼ N µn , Λn , where
−1

Λn = Λ0 + X T Σ−1 X,
µn = Λ−1 Λ0 µ0 + X T Σ−1 y .

n

Consider the mean-eld variational family

Np µ, Λ−1 : µ ∈ Rp , Λ is diagonal .

DMF =

Find the ELBO and the variational density.

Shaobo Jin (Math) Bayesian Statistics 59 / 64

Computational Techniques Variational Inference

Explicit Expression of DMF

Theorem
Consider the mean eld variational family DMF , where
m
Y
q (θ | x) = qj (θj | x) .
j=1

Let θk be the kth group in θ and

qk∗ (θk | x) = arg min KL (q (θ | x) , π (θ | x)) .

Then,
ˆ
qk (θk | x) ∝ exp q−k (θ−k | x) log π (θk | θ−k , x) dθ−k .

Shaobo Jin (Math) Bayesian Statistics 60 / 64

Computational Techniques Variational Inference

Coordinate Ascent Variational Inference Algorithm

The previous theorem suggests the following stepwise conditioning to
approximate q ∗ (θ | x).

Algorithm 5: Coordinate ascent variational inference (CAVI) Algo-

5 end
6 end
Shaobo Jin (Math) Bayesian Statistics 61 / 64
Computational Techniques Variational Inference

CAVI Algorithm: Example

Example
Suppose that we have an iid sample Xi | µ, σ 2 ∼ N µ, σ 2 , i = 1, ..., n.

The priors are µ | σ 2 ∼ N µ0 , σ 2 /λ0 and σ 2 ∼ InvGamma (a0 , b0 ).

The posterior is
µ | x, σ 2 ∼ N µn , σ 2 /λn , and σ 2 | x ∼ InvGamma (an , bn ) .

where λn = λ0 + n, µn = λ−1 i=1 xi ), an = a0 + 2 , and

Pn n
n (λ0 µ0 +

n
!
1 X 2 2 2
bn = b0 + xi + λ0 µ0 − λn µn .
2
i=1

Shaobo Jin (Math) Bayesian Statistics 62 / 64

Computational Techniques Stan

Stan

Stan is a c++ library for Bayesian inference using HMC to obtain

posterior simulations.
Rstan is the R interface to Stan.
PyStan is the Python interface to Stan.
It is the state-of-the-art library for doing Bayesian statistics.
A Stan model consists of
1 data,

2 parameters,

3 statistical model.

Shaobo Jin (Math) Bayesian Statistics 63 / 64

Computational Techniques Stan

R Package rstanarm

The R package rstanarm emulates the R syntax but uses Stan via the
rstan package to t models in the background. So you skip writing the
Stan syntax.
Various common regression models have been implemented in
rstanarm.
Another benet is that various visualization tools in R can be used.

Shaobo Jin (Math) Bayesian Statistics 64 / 64

Brochure - Titrator - T50
No ratings yet
Brochure - Titrator - T50
2 pages
Machine Learning and Pattern Recognition - Laplace - Approximation
No ratings yet
Machine Learning and Pattern Recognition - Laplace - Approximation
4 pages
Cours MC
No ratings yet
Cours MC
456 pages
Dignaga's Philosophy of Language Dignaga On Anyapoha
No ratings yet
Dignaga's Philosophy of Language Dignaga On Anyapoha
374 pages
MCMC
No ratings yet
MCMC
76 pages
Class19 Approxinf
No ratings yet
Class19 Approxinf
45 pages
Machine Learning Econometrics Bayesian Algorithms
No ratings yet
Machine Learning Econometrics Bayesian Algorithms
33 pages
19-Bayesian 2
No ratings yet
19-Bayesian 2
39 pages
Phy 421 Note
No ratings yet
Phy 421 Note
27 pages
II Sem Syllabus
No ratings yet
II Sem Syllabus
12 pages
Bayesian - Lec - 3
No ratings yet
Bayesian - Lec - 3
24 pages
20 Bayesian2
No ratings yet
20 Bayesian2
50 pages
Bayes 2 V
No ratings yet
Bayes 2 V
32 pages
Chapter 1 B
No ratings yet
Chapter 1 B
35 pages
Air, Atmospheric Pressure and Winds
100% (1)
Air, Atmospheric Pressure and Winds
42 pages
DS 630 - Lec 02 - ST
No ratings yet
DS 630 - Lec 02 - ST
34 pages
FSMLecture 4
No ratings yet
FSMLecture 4
49 pages
Admission Test For 6th Class CBSE Answers
No ratings yet
Admission Test For 6th Class CBSE Answers
6 pages
02 Solution Bayes Example
No ratings yet
02 Solution Bayes Example
2 pages
Optimal Capital Allocation
No ratings yet
Optimal Capital Allocation
37 pages
Bayesian
No ratings yet
Bayesian
26 pages
Intro Bayes Time Series 1
No ratings yet
Intro Bayes Time Series 1
72 pages
BayesianCourse - Session13 (Ajuste Bayesiano)
No ratings yet
BayesianCourse - Session13 (Ajuste Bayesiano)
16 pages
Sewing Symbols in Tailoring
No ratings yet
Sewing Symbols in Tailoring
12 pages
Productfiche Manitou Heftruck Me 425
No ratings yet
Productfiche Manitou Heftruck Me 425
6 pages
What Is Trip Circuit Supervision (TCS) Protection
No ratings yet
What Is Trip Circuit Supervision (TCS) Protection
7 pages
Object Oriented File
No ratings yet
Object Oriented File
62 pages
Bayesian - Lec - 4
No ratings yet
Bayesian - Lec - 4
25 pages
Lecture 5 - 8 Bayesian Estimation
No ratings yet
Lecture 5 - 8 Bayesian Estimation
65 pages
Unit 6
No ratings yet
Unit 6
16 pages
STAT 830 Bayesian Estimation: Richard Lockhart
No ratings yet
STAT 830 Bayesian Estimation: Richard Lockhart
23 pages
Presentation of NARP TRG Team
No ratings yet
Presentation of NARP TRG Team
29 pages
Examples On Magnetic Circuits
No ratings yet
Examples On Magnetic Circuits
9 pages
Fundamentals of Statistics (18.6501x)
No ratings yet
Fundamentals of Statistics (18.6501x)
20 pages
Bayesian Inference
No ratings yet
Bayesian Inference
22 pages
Computation
No ratings yet
Computation
11 pages
Lecture Notes For Probability and Statistics
No ratings yet
Lecture Notes For Probability and Statistics
7 pages
Conceptual Introduction To MCMC
No ratings yet
Conceptual Introduction To MCMC
56 pages
Var PPTS
No ratings yet
Var PPTS
249 pages
Notes4 BayesianLearning
No ratings yet
Notes4 BayesianLearning
8 pages
Machine Learning and Pattern Recognition Sampling Based Approximations
No ratings yet
Machine Learning and Pattern Recognition Sampling Based Approximations
3 pages
جلسه پنجم-1
No ratings yet
جلسه پنجم-1
15 pages
Stat513 l11
No ratings yet
Stat513 l11
17 pages
확통1 LectureNote09 on Bayesian Statistical Inference
No ratings yet
확통1 LectureNote09 on Bayesian Statistical Inference
78 pages
Bayesian Inference Slides 2021
No ratings yet
Bayesian Inference Slides 2021
37 pages
Bayes 2021 Part1
No ratings yet
Bayes 2021 Part1
44 pages
Bayes Intro PT 2
No ratings yet
Bayes Intro PT 2
13 pages
Stat 535 C - Statistical Computing & Monte Carlo Methods: Arnaud Doucet
No ratings yet
Stat 535 C - Statistical Computing & Monte Carlo Methods: Arnaud Doucet
23 pages
MA40189 Notes
No ratings yet
MA40189 Notes
70 pages
Importance Sampling
No ratings yet
Importance Sampling
13 pages
Computing Bayes: Bayesian Computation From 1763 To The 21st Century
No ratings yet
Computing Bayes: Bayesian Computation From 1763 To The 21st Century
47 pages
Jawaban Exam
No ratings yet
Jawaban Exam
26 pages
Zzzz-Essential Bayes
No ratings yet
Zzzz-Essential Bayes
158 pages
Psychological Statistics Assignment
No ratings yet
Psychological Statistics Assignment
4 pages
Estimation and Detection: Lecture 6: The Bayesian Philosophy
No ratings yet
Estimation and Detection: Lecture 6: The Bayesian Philosophy
19 pages
Bayesian Inference
No ratings yet
Bayesian Inference
18 pages
Basic Fuse Terminology: Fuse: Fuse Elements: Rated Current: Fusing Current: Fusing Factor
50% (2)
Basic Fuse Terminology: Fuse: Fuse Elements: Rated Current: Fusing Current: Fusing Factor
3 pages
Implicitly Adaptive Importance Sampling: Topi Paananen Juho Piironen Paul-Christian Bürkner Aki Vehtari
No ratings yet
Implicitly Adaptive Importance Sampling: Topi Paananen Juho Piironen Paul-Christian Bürkner Aki Vehtari
19 pages
Algebra 2 Lesson 5.7 Final
No ratings yet
Algebra 2 Lesson 5.7 Final
4 pages
Lecture 8: Bayesian Estimation of Parameters in State Space Models
No ratings yet
Lecture 8: Bayesian Estimation of Parameters in State Space Models
33 pages
Basic Principles of Air Springs: General Discussion
No ratings yet
Basic Principles of Air Springs: General Discussion
4 pages
Bayesian Statistics: Thomas Bayes
No ratings yet
Bayesian Statistics: Thomas Bayes
22 pages
2 Statistical Definitions: 2.1 Probability Density Function
No ratings yet
2 Statistical Definitions: 2.1 Probability Density Function
9 pages
MCMC Bayes PDF
No ratings yet
MCMC Bayes PDF
27 pages
Bayesian Monte Carlo: Carl Edward Rasmussen and Zoubin Ghahramani
No ratings yet
Bayesian Monte Carlo: Carl Edward Rasmussen and Zoubin Ghahramani
8 pages
Square Wave Generator
100% (2)
Square Wave Generator
6 pages
Stat 535 C - Statistical Computing & Monte Carlo Methods: Arnaud Doucet
No ratings yet
Stat 535 C - Statistical Computing & Monte Carlo Methods: Arnaud Doucet
23 pages
Introduction To Bayesian Methods: Jessi Cisewski Department of Statistics Yale University
No ratings yet
Introduction To Bayesian Methods: Jessi Cisewski Department of Statistics Yale University
53 pages
4G CN PDF
No ratings yet
4G CN PDF
2 pages
Moisture in The Atmosphere
No ratings yet
Moisture in The Atmosphere
43 pages
Bayesian Statistics (Szábo & V.d.vaart)
No ratings yet
Bayesian Statistics (Szábo & V.d.vaart)
146 pages
Bayesian Modelling Tuts-4-9
No ratings yet
Bayesian Modelling Tuts-4-9
6 pages
Acetic Acid As Solvent and Its Chemical Reactions
No ratings yet
Acetic Acid As Solvent and Its Chemical Reactions
21 pages
Unit Conversion Table: Distance Foot (FT) Inch (In) Meter (M) Centimeter (CM) Mile (Mi)
No ratings yet
Unit Conversion Table: Distance Foot (FT) Inch (In) Meter (M) Centimeter (CM) Mile (Mi)
2 pages
Quiz 2 - Statistics Coursera
No ratings yet
Quiz 2 - Statistics Coursera
1 page
Tutorial 3 - Revised Solution
No ratings yet
Tutorial 3 - Revised Solution
11 pages
JR Inter Maths 1A AP EM 01022025
No ratings yet
JR Inter Maths 1A AP EM 01022025
11 pages
Air Master Catalog
100% (2)
Air Master Catalog
191 pages
On The Markov Chain Monte Carlo (MCMC) Method: Rajeeva L Karandikar
No ratings yet
On The Markov Chain Monte Carlo (MCMC) Method: Rajeeva L Karandikar
24 pages
Bayesian Statistics 01
100% (1)
Bayesian Statistics 01
22 pages
Bayes Manuscripts
No ratings yet
Bayes Manuscripts
180 pages
Bayesian Ibrahim
No ratings yet
Bayesian Ibrahim
370 pages
Bayesian Networks: Machine Learning, Lecture (Jaakkola)
No ratings yet
Bayesian Networks: Machine Learning, Lecture (Jaakkola)
8 pages
HCHD 300
No ratings yet
HCHD 300
12 pages
Markov Chain Monte Carlo Methods: Christian P. Robert
No ratings yet
Markov Chain Monte Carlo Methods: Christian P. Robert
456 pages
Lecture 5
No ratings yet
Lecture 5
6 pages
Point Estimation: Definition of Estimators
No ratings yet
Point Estimation: Definition of Estimators
8 pages
Back To Top: Types of Logic
No ratings yet
Back To Top: Types of Logic
6 pages
Question Bank
No ratings yet
Question Bank
7 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Lecture 11 - 14 Computational Techniques

Uploaded by

Lecture 11 - 14 Computational Techniques

Uploaded by

Bayesian Statistics

Shaobo Jin (Math) Bayesian Statistics 1 / 64

Laplace Approximation to Integral

where θ has the dimension d × 1, p is a known constant, and ℓ (θ) and

Shaobo Jin (Math) Bayesian Statistics 2 / 64

Laplace Approximation: Example

where n = 20, i=1 yi = 40.4, and = 93.2. Approximate

Shaobo Jin (Math) Bayesian Statistics 3 / 64

Shaobo Jin (Math) Bayesian Statistics 4 / 64

In practice, we often want to approximate a ratio of integrals such that

The naive approach is to approximate both the numerator and

which is not recommended.

Shaobo Jin (Math) Bayesian Statistics 5 / 64

Moment Generation Function

Consider the moment generation function

We apply the Laplace approximation both the denominator and

Shaobo Jin (Math) Bayesian Statistics 6 / 64

Fully Exponential Laplace Approximation

ℓ (θ, t) = −th (θ) − log f (x | θ) − log π (θ) .

The fully exponential Laplace approximation is

where θ̃ (t) maximizes ℓ (θ, t) for a given t, and θ̂ = θ̃ (0).

Shaobo Jin (Math) Bayesian Statistics 7 / 64

Expectation Under Posterior

where f (x) is the density of random variable/vector X . A natural

Shaobo Jin (Math) Bayesian Statistics 8 / 64

Approximate Expectation by Sample Mean

Under mild conditions, the sample mean

has nice properties.

2 Consistency: h̄ → E [h (x)] in probability, as n → ∞.

3 Strong consistency: h̄ → E [h (x)] almost surely, as n → ∞.

Shaobo Jin (Math) Bayesian Statistics 9 / 64

Sample From Posterior

For given data x, suppose that we want to compute the posterior

If π (θ | x) is a well-known distribution such that we can easily sample

Shaobo Jin (Math) Bayesian Statistics 10 / 64

Independent Monte Carlo: Example

where n = 20, ni=1 yi = 40.4, and ni=1 yi2 = 93.2. We want to

Shaobo Jin (Math) Bayesian Statistics 11 / 64

It is common that it is not straightforward to sample directly from

where the expectation is taken with respect to θ | x ∼ g (θ | x).

Shaobo Jin (Math) Bayesian Statistics 12 / 64

Importance Sampling Approximation

We can apply the importance sampling trick to both integrals:

where g (θ | x) > 0 whenever π (θ | x) ̸= 0, stronger than IS.

Normalized Importance Sampling

The importance sampling approximations to the numerator and

The ratio is the normalized importance sampling estimator

Shaobo Jin (Math) Bayesian Statistics 15 / 64

Normalized Importance Sampling: Example

We can even ignore the constants in f (y | θ) π (θ) in normalized

Shaobo Jin (Math) Bayesian Statistics 16 / 64

Monte Carlo approximation

1.75 1.80 1.85 1.90 1.95 2.00 2.05 2.10

Markov Chain Monte Carlo

We often want to get a sample from the posterior.

Shaobo Jin (Math) Bayesian Statistics 18 / 64

where the transition kernel K (x, y) is the conditional density of Y

Shaobo Jin (Math) Bayesian Statistics 19 / 64

P (y) = P (x) K (x, y) , discrete case,

where P and f are not generic symbols.

The stationary distribution means that if the initial state

lim sup |P (Xn ∈ A | X0 = x) − π (A)| = 0, almost surely,

regardless of the initial state X0 = x.

Shaobo Jin (Math) Bayesian Statistics 21 / 64

Choose the Transition Kernel

Our goal is to simulate data from π (θ | x). We need to choose the

K (θ, θ∗ ) π (θ | x) = K (θ∗ , θ) π (θ∗ | x) ,

for any θ, θ∗ ∈ Θ, then π (θ | x) is the stationary distribution of the

Shaobo Jin (Math) Bayesian Statistics 22 / 64

Find a proposal distribution T (θ, θ∗ ) that satises the detailed balance

Hence, we should seek A such that the detailed balance condition

The value λ that maximizes the probability A (·, ·) ≤ 1 is

Shaobo Jin (Math) Bayesian Statistics 24 / 64

Algorithm 1: Metropolis-Hastings Algorithm

Metropolis-Hastings Algorithm: Example

We observe n = 20, ni=1 yi = 40.4, and = 93.2. Obtain a

Find a proposal distribution T (θ, θ∗ ) that satises the detailed balance

Many dierent MCMC algorithms dier mainly in how the candidate y

This transition kernel satises

Solve Dierential Equation

To solve the dierential equations, we consider an approximation

2 Apply the leapfrog method (with some modication) until a

To diminish the inuence of the starting values, we can discard the

Several dierent versions of R̂ have been proposed. One suggestion is to