Lecture 11 - 14: Computational Techniques
Shaobo Jin
Department of Mathematics
Example
Suppose that the posterior is
\[
\beta \mid y, \sigma^2 \sim N\left( \frac{\sum_{i=1}^n y_i}{n+1},\ \frac{\sigma^2}{n+1} \right),
\qquad
\sigma^2 \mid y \sim \mathrm{InvGamma}\left( 2 + \frac{n}{2},\ 2 + \frac{1}{2}\left[ \sum_{i=1}^n y_i^2 - \frac{\left(\sum_{i=1}^n y_i\right)^2}{n+1} \right] \right).
\]
Error Analysis
Suppose that we can express ℓ(θ) as p q(θ) such that the integral is
\[
\int h(\theta) \exp\{-p\, q(\theta)\}\, d\theta,
\]
where h(θ) and q(θ) do not include p, and h(θ) and q(θ) are smooth functions.
Let \( H \overset{\mathrm{def}}{=} \partial^2 q(\hat{\theta}) / \partial \theta\, \partial \theta^T > 0 \). The Laplace approximation satisfies
\[
\int h(\theta) \exp\{-p\, q(\theta)\}\, d\theta
= \left( \frac{2\pi}{p} \right)^{d/2} \sqrt{\det H^{-1}}\, \exp\{-p\, q(\hat{\theta})\} \left[ h(\hat{\theta}) + O(p^{-1}) \right].
\]
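As a quick numerical check (not from the slides), the sketch below takes d = 1, q(θ) = θ²/2 (so θ̂ = 0 and H = 1) and h(θ) = 1/(1 + θ²), and compares the Laplace approximation with numerical integration in R; both choices of h and q are mine.

h <- function(theta) 1 / (1 + theta^2)   # smooth h, does not involve p
q <- function(theta) theta^2 / 2         # minimized at theta_hat = 0, with H = 1

laplace <- function(p, theta_hat = 0, H = 1, d = 1) {
  (2 * pi / p)^(d / 2) * sqrt(1 / H) * exp(-p * q(theta_hat)) * h(theta_hat)
}
exact <- function(p) {
  integrate(function(th) h(th) * exp(-p * q(th)), -Inf, Inf)$value
}
for (p in c(5, 50, 500)) {
  cat(sprintf("p = %3d: Laplace %.6f  numerical %.6f\n", p, laplace(p), exact(p)))
}
# The relative error shrinks at rate O(1/p), matching the theorem.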
Ratio of Integrals
Let \( \mu(x) = \int h(\theta, x)\, \pi(\theta \mid x)\, d\theta \).
However, it is not always the case that we can find a closed-form expression of µ(x). Approximations are more often needed.
Suppose that we want to approximate
\[
E[h(x)] = \int h(x) f(x)\, dx.
\]
Ideally, an approximation µ̂ should be consistent and have a known limiting distribution, as n → ∞. The classic methods (e.g., independent Monte Carlo and importance sampling) have these properties.
Importance Distribution
Example
Suppose that the posterior is
\[
\beta \mid y, \sigma^2 \sim N\left( \frac{\sum_{i=1}^n y_i}{n+1},\ \frac{\sigma^2}{n+1} \right),
\qquad
\sigma^2 \mid y \sim \mathrm{InvGamma}\left( 2 + \frac{n}{2},\ 2 + \frac{1}{2}\left[ \sum_{i=1}^n y_i^2 - \frac{\left(\sum_{i=1}^n y_i\right)^2}{n+1} \right] \right),
\]
where n = 20, \( \sum_{i=1}^n y_i = 40.4 \), and \( \sum_{i=1}^n y_i^2 = 93.2 \). Approximate \( E[\sigma^2 \mid y] \) by importance sampling.
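A minimal importance-sampling sketch in R for this task; using the InvGamma(2, 2) prior as the importance distribution g is my own choice, not something prescribed by the slides.

set.seed(1)
n <- 20; sy <- 40.4; sy2 <- 93.2
a <- 2 + n / 2                        # posterior InvGamma shape
b <- 2 + (sy2 - sy^2 / (n + 1)) / 2   # posterior InvGamma rate

dinvgamma <- function(x, a, b)        # InvGamma density, computed on the log scale
  exp(a * log(b) - lgamma(a) - (a + 1) * log(x) - b / x)

N  <- 1e5
s2 <- 1 / rgamma(N, shape = 2, rate = 2)         # draws from g = InvGamma(2, 2)
w  <- dinvgamma(s2, a, b) / dinvgamma(s2, 2, 2)  # importance weights pi / g
c(estimate = mean(w * s2), exact = b / (a - 1))  # exact mean of the InvGamma posterior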
Normalized Importance Sampling
Normalizing Constant
Since we often derive the posterior using π (θ | x) ∝ f (x | θ) π (θ) by
ignoring the normalizing constant, we cannot always evaluate π (θ | x).
It is easy to evaluate f (x | θ) π (θ), but not m (x).
We can rewrite µ as
\[
\mu(x) = \int h(\theta, x)\, \pi(\theta \mid x)\, d\theta
= \frac{\int h(\theta, x)\, f(x \mid \theta)\, \pi(\theta)\, d\theta}{\int f(x \mid \theta)\, \pi(\theta)\, d\theta}.
\]
In the example,
\[
f(y \mid \theta)\, \pi(\theta) \propto \frac{ \exp\left\{ -\frac{(n+1)\beta^2 - 2\beta \sum_{i=1}^n y_i + 4 + \sum_{i=1}^n y_i^2}{2\sigma^2} \right\} }{ (\sigma^2)^{(n+1)/2+3} }.
\]
We observe n = 20, \( \sum_{i=1}^n y_i = 40.4 \), and \( \sum_{i=1}^n y_i^2 = 93.2 \). Approximate \( E[\sigma^2 \mid y] \) by normalized importance sampling.
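A sketch of normalized (self-normalized) importance sampling on joint draws of (β, σ²), using only the unnormalized f(y | θ)π(θ) above; the proposal g (a normal for β times an InvGamma(2, 2) for σ²) is my own choice.

set.seed(1)
n <- 20; sy <- 40.4; sy2 <- 93.2
log_target <- function(beta, s2)          # log of the unnormalized posterior above
  -((n + 1) * beta^2 - 2 * beta * sy + 4 + sy2) / (2 * s2) -
    ((n + 1) / 2 + 3) * log(s2)

N    <- 1e5
beta <- rnorm(N, mean = sy / (n + 1), sd = 0.5)
s2   <- 1 / rgamma(N, shape = 2, rate = 2)
log_g <- dnorm(beta, sy / (n + 1), 0.5, log = TRUE) +
  dgamma(1 / s2, shape = 2, rate = 2, log = TRUE) - 2 * log(s2)  # Jacobian of s2 = 1/u

log_w <- log_target(beta, s2) - log_g
w <- exp(log_w - max(log_w))              # stabilize on the log scale before normalizing
sum(w * s2) / sum(w)                      # self-normalized estimate of E[sigma^2 | y]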
Randomness
In independent Monte Carlo, importance sampling, and normalized importance sampling, we simulate random numbers from π(θ | x) or g(θ | x), so µ̂ is a random variable. This means that we can construct a confidence interval for µ̂ using the central limit theorem.
[Figure: simulated sampling variability of the estimates (panel: MCInt).]
MCMC
Transition Kernel
The transition kernel describes how the Markov chain moves from Xn−1
to Xn .
If {Xn } is discrete, the transition kernel is a matrix K with
elements P (Xn = y | Xn−1 = x).
If {Xn} is continuous, the Markov property means that
\[
P(X_n \in A \mid X_{n-1} = x, \ldots, X_0) = \int_{y \in A} K(x, y)\, dy,
\]
\[
f(X_n = y \mid X_{n-1} = x, \ldots, X_0) = f(X_n = y \mid X_{n-1} = x) = K(x, y).
\]
Stationary Distribution
Definition
The distribution p on Ω is a stationary distribution (or invariant distribution) of the Markov chain with the transition kernel K, if
\[
p(y) = \sum_{x \in \mathcal{X}} p(x)\, K(x, y), \quad \text{discrete case},
\]
\[
f(y) = \int_{x \in \mathcal{X}} f(x)\, K(x, y)\, dx, \quad \text{continuous case}.
\]
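A small sketch verifying the discrete-case condition p = pK numerically; the 3-state kernel below is made up for illustration.

K <- matrix(c(0.5, 0.3, 0.2,
              0.2, 0.6, 0.2,
              0.3, 0.3, 0.4), nrow = 3, byrow = TRUE)  # each row sums to 1
e <- eigen(t(K))                        # left eigenvector for eigenvalue 1
p <- Re(e$vectors[, 1]); p <- p / sum(p)
p
p %*% K                                 # equals p, so p is stationary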
Long-Run Property
Theorem
Let π(·) be the stationary distribution of the Markov chain. Under some regularity conditions, the distribution of Xn converges to π(·) for any initial state, as n → ∞.
Proposal Distribution
When we simulate random numbers from a Markov chain, we need a
proposal distribution
\[
T(\theta, \theta^*) = f(\theta^* \mid \theta).
\]
Deriving A (θ, θ∗ )
The detailed balance condition is fulfilled if we choose the acceptance probability to be
\[
A(\theta, \theta^*) = \lambda(\theta, \theta^*)\, \pi(\theta^* \mid x)\, T(\theta^*, \theta) \le 1,
\qquad
A(\theta^*, \theta) = \lambda(\theta, \theta^*)\, \pi(\theta \mid x)\, T(\theta, \theta^*) \le 1.
\]
Hence,
\[
A(\theta, \theta^*) = \lambda(\theta, \theta^*)\, \pi(\theta^* \mid x)\, T(\theta^*, \theta)
= \min\left\{ 1,\ \frac{\pi(\theta^* \mid x)\, T(\theta^*, \theta)}{\pi(\theta \mid x)\, T(\theta, \theta^*)} \right\}.
\]
Metropolis-Hastings Algorithm
The Metropolis-Hastings algorithm allows proposal distributions such
that T (θ, θ∗ ) > 0 if and only if T (θ∗ , θ) > 0.
Since the ratio \( R(\theta^{(t)}, \theta^*) \) includes \( \pi(\theta^* \mid x) / \pi(\theta^{(t)} \mid x) \), we only need to know π(· | x) up to a normalizing constant.
Example
Consider an iid sample of size n from \( Y \mid \beta, \sigma^2 \sim N(\beta, \sigma^2) \). The prior of σ² is InvGamma(2, 2), and β | σ² is N(0, σ²). Then,
\[
f(y \mid \theta)\, \pi(\theta) \propto \frac{ \exp\left\{ -\left[ (n+1)\beta^2 - 2\beta \sum_{i=1}^n y_i + 4 + \sum_{i=1}^n y_i^2 \right] / (2\sigma^2) \right\} }{ (\sigma^2)^{(n+1)/2+3} }.
\]
Consider the random-walk proposal
\[
T(\theta, \theta^*) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(\theta - \theta^*)^2}{2\sigma^2} \right\}.
\]
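A random-walk Metropolis sketch for this example in R, taking θ = (β, σ²); the proposal standard deviation 0.2, the starting values, and the burn-in length are tuning choices of mine. Because the random-walk proposal is symmetric, T cancels in the acceptance ratio.

set.seed(1)
n <- 20; sy <- 40.4; sy2 <- 93.2       # same data as before
log_post <- function(beta, s2) {
  if (s2 <= 0) return(-Inf)            # zero density outside the support
  -((n + 1) * beta^2 - 2 * beta * sy + 4 + sy2) / (2 * s2) -
    ((n + 1) / 2 + 3) * log(s2)
}
n_iter <- 1e4
draws  <- matrix(NA, n_iter, 2, dimnames = list(NULL, c("beta", "sigma2")))
cur    <- c(2, 1)
for (it in 1:n_iter) {
  prop <- cur + rnorm(2, sd = 0.2)     # symmetric random-walk proposal
  logR <- log_post(prop[1], prop[2]) - log_post(cur[1], cur[2])
  if (log(runif(1)) < logR) cur <- prop
  draws[it, ] <- cur
}
colMeans(draws[-(1:2000), ])           # posterior means; E[beta | y] is about 1.92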
Metropolis Algorithm
The Metropolis algorithm is the special case of Metropolis-Hastings with a symmetric proposal, T(θ, θ*) = T(θ*, θ), so the acceptance probability reduces to min{1, π(θ* | x)/π(θ | x)}.
A gradient-informed variant (the Langevin proposal) centers the proposal at θ(t) + d(t) with drift
\[
d^{(t)} = \frac{\tau^2}{2}\, \frac{\partial \log \pi(\theta^{(t)} \mid x)}{\partial \theta}.
\]
Example
Suppose that our data X1, ..., Xn are iid from \( N(\mu, \lambda^{-1}) \).
The Metropolis algorithm and the Gibbs sampler often move too slowly through the target distribution when the dimension of the target distribution is high.
Hamiltonian Monte Carlo (HMC) moves much more quickly through the target distribution.
For each component in the target distribution, HMC adds a
momentum variable and the proposal distribution largely depends
on the momentum variable.
Both the component in the target distribution and the momentum
are updated in the MCMC algorithm.
Hamiltonian Dynamics
Augmentation
The leapfrog integrator updates the position θ and the momentum ϕ as follows.
1 Make a first half-step update of the momentum:
\[
\phi \leftarrow \phi + \frac{\epsilon}{2}\, \frac{\partial \log \pi(\theta \mid x)}{\partial \theta}.
\]
2 For ℓ = 1, ..., L − 1,
  1 Update the position: θ ← θ + εM⁻¹ϕ.
  2 Update the momentum:
\[
\phi \leftarrow \phi + \epsilon\, \frac{\partial \log \pi(\theta \mid x)}{\partial \theta}.
\]
3 Make one last update on the position: θ ← θ + εM⁻¹ϕ.
4 Make one last half-step update of the momentum:
\[
\phi \leftarrow \phi + \frac{\epsilon}{2}\, \frac{\partial \log \pi(\theta \mid x)}{\partial \theta}.
\]
Metropolis Step
Suppose that the state after such L updates is (θ∗ , ϕ∗ ). We negate the
momentum and the new proposal state is (θ∗ , −ϕ∗ ).
We determine whether to accept the proposal using the Metropolis
algorithm, where the acceptance probability is
\[
A((\theta, \phi), (\theta^*, -\phi^*)) = \min\left\{ 1,\ \frac{\exp\{-H(\theta^*, -\phi^*)\}}{\exp\{-H(\theta, \phi)\}} \right\}.
\]
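A compact HMC sketch combining the leapfrog updates above with this Metropolis step, taking M = I; the standard bivariate normal target, step size ε = 0.1, and L = 20 are illustrative choices of mine.

log_pi      <- function(th) -sum(th^2) / 2     # example target: N(0, I)
grad_log_pi <- function(th) -th

hmc_step <- function(th, eps = 0.1, L = 20) {
  phi  <- rnorm(length(th))                    # draw a fresh momentum
  H0   <- -log_pi(th) + sum(phi^2) / 2         # H = -log pi + |phi|^2 / 2
  th_s <- th
  phi  <- phi + eps / 2 * grad_log_pi(th_s)    # first half-step
  for (l in 1:(L - 1)) {
    th_s <- th_s + eps * phi                   # position update (M = I)
    phi  <- phi + eps * grad_log_pi(th_s)      # full momentum update
  }
  th_s <- th_s + eps * phi                     # last position update
  phi  <- phi + eps / 2 * grad_log_pi(th_s)    # last half-step
  H1   <- -log_pi(th_s) + sum(phi^2) / 2       # H(theta*, -phi*) = H(theta*, phi*)
  if (log(runif(1)) < H0 - H1) th_s else th    # Metropolis accept/reject
}

set.seed(1)
th <- c(3, -3); out <- matrix(NA, 2000, 2)
for (s in 1:2000) { th <- hmc_step(th); out[s, ] <- th }
colMeans(out); apply(out, 2, var)              # approximately 0 and 1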
Properties of HMC
Some crucial properties of the Hamiltonian dynamics for MCMC
updates include
1 deterministic updates. The Hamiltonian dynamics is deterministic.
After running the leapfrog loop L times, we always move the initial
state (θ0 , ϕ0 ) to the same proposal (θ∗ , ϕ∗ ).
2 reversible. The mapping from the state at time t, denoted by (θ(t), ϕ(t)), to the proposal (θ*, −ϕ*) is reversible: applying the same leapfrog updates to (θ*, −ϕ*) returns the initial state.
Tuning Parameters
Some theory suggests that we can tune HMC such that the acceptance probability is around 65%.
No-U-Turn Sampler
The no-U-turn sampler (NUTS) allows us to automatically tune the number of steps L: we increase L until the simulated dynamics is long enough that the proposed position θ* would start to move back towards the initial position θ if we ran more steps.
This is measured by the angle between θ* − θ and the current momentum ϕ*: the trajectory is stopped once a U-turn occurs.
The draw is then obtained by sampling uniformly from the points in {(θ, ϕ) : exp{−H(θ, ϕ)} ≥ u} that the leapfrog steps have visited, where u is a slice variable, so that the detailed balance condition is fulfilled.
Adaptively Tune ϵ
Burn-In Period
Mixing
We want the Markov chain to show good mixing.
[Figure: trace plots over 2000 iterations; the "Bad" panel shows chains with poor mixing, the "Good" panel shows two chains (chain 1, chain 2) mixing well.]
Let \( y_{ij} \) denote the i-th draw in the j-th chain. The variation within the chains is measured by
\[
W = \frac{1}{m} \sum_{j=1}^m \left[ \frac{1}{n-1} \sum_{i=1}^n (y_{ij} - \bar{y}_{\cdot j})^2 \right],
\]
and the variation between the chains by \( B = \frac{n}{m-1} \sum_{j=1}^m (\bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot})^2 \). The Gelman-Rubin statistic
\[
\hat{R} = \sqrt{ \frac{ \frac{n-1}{n} W + \frac{1}{n} B }{ W } }
\]
declines to 1 as n → ∞.
It is suggested that we keep simulating the Markov chain until R̂ < 1.1 or even < 1.01.
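A sketch computing R̂ from m chains stored as the columns of a matrix; the AR(1) chains with shifted means are simulated only to have something to diagnose.

rhat <- function(y) {                   # y: n x m matrix of draws
  n <- nrow(y)
  W <- mean(apply(y, 2, var))           # within-chain variance
  B <- n * var(colMeans(y))             # between-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}
set.seed(1)
y <- sapply(1:4, function(j) arima.sim(list(ar = 0.7), n = 1000) + rnorm(1))
rhat(y)                                 # approaches 1 as n grows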
Variants of Gelman-Rubin R̂
Serial Correlation
It is obvious that θ(t+1) and θ(t) are not independent draws. Inference
from autocorrelated draws is generally less precise than from the same
number of independent draws.
However, such serial correlation is not necessarily a problem.
Remember that, at convergence, we reach the stationary
distribution.
We estimate µ(x) by the MCMC average
\[
\hat{\mu}_{\mathrm{MCMC}} = \frac{1}{n} \sum_{i=1}^n h(\theta_i, x).
\]
Long-Run Property
Theorem
Under some conditions, for all starting states θ0 ∈ Θ,
1 ergodic theorem: for any initial state,
\[
\frac{1}{n} \sum_{i=1}^n h(\theta_i, x) \overset{a.s.}{\to} E[h(\theta, x) \mid x] = \mu(x);
\]
2 central limit theorem:
\[
\sqrt{n} \left[ \frac{1}{n} \sum_{i=1}^n h(\theta_i, x) - \mu(x) \right] \overset{d}{\to} N\left( 0,\ \sigma^2 \left[ 1 + 2 \sum_{j=1}^{\infty} \rho_j \right] \right).
\]
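A sketch of how the CLT variance can be estimated from one chain by truncating the empirical autocorrelations; the AR(1) stand-in for h(θᵢ, x) and the lag-50 cutoff are arbitrary choices of mine.

set.seed(1)
hx  <- arima.sim(list(ar = 0.9), n = 1e4)    # stand-in for h(theta_i, x)
rho <- acf(hx, lag.max = 50, plot = FALSE)$acf[-1]
v   <- var(hx) * (1 + 2 * sum(rho))          # estimates sigma^2 (1 + 2 sum rho_j)
c(naive_se = sqrt(var(hx) / length(hx)),     # pretends the draws are independent
  mcmc_se  = sqrt(v / length(hx)))           # accounts for serial correlation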
Thinning
Some prefer thinning the sequence by only keeping every kth draw from
a sequence in order to reduce serial correlation.
But whether or not the Markov chain is thinned, it can be used for
inferences, provided that it has reached convergence.
Suppose that the length of the Markov chain is n. We discard k − 1 out of every k observations, so the length of the chain after thinning is n/k. Under some assumptions,
\[
\sqrt{n}\, [\hat{\mu} - \mu(x)] \overset{d}{\to} N(0, \tau^2),
\qquad
\sqrt{n/k}\, [\hat{\mu}_k - \mu(x)] \overset{d}{\to} N(0, \tau_k^2),
\]
where µ̂ and µ̂ₖ are the estimators without and with thinning, respectively.
In fact, it has been proved that, for any k > 1, \( k\tau_k^2 > \tau^2 \), indicating that discarding k − 1 out of every k observations will increase the variance.
For example, posterior probabilities can be approximated by
\[
\frac{1}{n} \sum_{i=1}^n 1\left( \theta^{(i)} \in A \right) \to E[1(\theta \in A) \mid x] = P(\theta \in A \mid x),
\]
and draws from the posterior predictive distribution can be obtained by simulating from \( f(x_{\mathrm{new}} \mid x, \theta^{(i)}) \).
Variational Inference
Approximate Posterior
If the posterior distribution family is difficult to handle, it can be useful to approximate it by another distribution family that is easier to handle.
The Kullback-Leibler divergence for distributions P and Q with respective densities p and q is
\[
\mathrm{KL}(q, p) = \int q(\theta) \log \frac{q(\theta)}{p(\theta)}\, d\theta \ge 0.
\]
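As a small check of the definition (the two normal densities are my illustrative choices), KL(q, p) can be computed numerically and compared with the closed form for normals.

integrand <- function(th)                 # q(th) * log(q(th) / p(th))
  dnorm(th) * (dnorm(th, log = TRUE) - dnorm(th, 1, 2, log = TRUE))
integrate(integrand, -Inf, Inf)$value     # about 0.443, and nonnegative
log(2) + (1 + 1^2) / (2 * 2^2) - 0.5      # closed form for two normals: same value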
Example
Suppose that \( y \mid \beta \sim N_n(X\beta, \Sigma) \) and \( \beta \sim N_p(\mu_0, \Lambda_0^{-1}) \), where Σ is known. The posterior is \( \beta \mid y \sim N(\mu_n, \Lambda_n^{-1}) \), where
\[
\Lambda_n = \Lambda_0 + X^T \Sigma^{-1} X,
\qquad
\mu_n = \Lambda_n^{-1} \left( \Lambda_0 \mu_0 + X^T \Sigma^{-1} y \right).
\]
Theorem
Consider the mean field variational family \( \mathcal{D}_{\mathrm{MF}} \), where
\[
q(\theta \mid x) = \prod_{j=1}^m q_j(\theta_j \mid x).
\]
Then,
\[
q_k(\theta_k \mid x) \propto \exp\left\{ \int q_{-k}(\theta_{-k} \mid x) \log \pi(\theta_k \mid \theta_{-k}, x)\, d\theta_{-k} \right\}.
\]
Example
Suppose that we have an iid sample \( X_i \mid \mu, \sigma^2 \sim N(\mu, \sigma^2) \), i = 1, ..., n. One of the resulting mean-field updates is
\[
b_n = b_0 + \frac{1}{2} \left( \sum_{i=1}^n x_i^2 + \lambda_0 \mu_0^2 - \lambda_n \mu_n^2 \right).
\]
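The slides show only the bₙ update, so the sketch below fills in the remaining coordinate ascent (CAVI) steps under an assumed standard parametrization of mine: τ = 1/σ², µ | τ ∼ N(µ₀, (λ₀τ)⁻¹), τ ∼ Gamma(a₀, b₀), and the factorization q(µ, τ) = q(µ)q(τ). The updates follow the usual mean-field derivation and may differ in notation from the slides.

set.seed(1)
x <- rnorm(50, mean = 2, sd = 1.5)             # simulated data for illustration
n <- length(x); xbar <- mean(x)
mu0 <- 0; lambda0 <- 1; a0 <- 2; b0 <- 2       # assumed hyperparameters

mu_n <- (lambda0 * mu0 + n * xbar) / (lambda0 + n)  # fixed across iterations
a_n  <- a0 + (n + 1) / 2                            # also fixed
E_tau <- a0 / b0                                    # initialize E_q[tau]
for (it in 1:100) {                                 # iterate the coupled updates
  lambda_n <- (lambda0 + n) * E_tau            # q(mu) = N(mu_n, 1/lambda_n)
  b_n <- b0 + 0.5 * (sum((x - mu_n)^2) + n / lambda_n +
                     lambda0 * ((mu_n - mu0)^2 + 1 / lambda_n))
  E_tau <- a_n / b_n                           # q(tau) = Gamma(a_n, b_n)
}
c(mu_n = mu_n, E_sigma2 = b_n / (a_n - 1))     # variational posterior summaries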
Stan
A Stan program is organized into blocks that declare the
1 data,
2 parameters,
3 statistical model.
R Package rstanarm
The R package rstanarm emulates the R syntax but uses Stan via the rstan package to fit models in the background. So you skip writing the Stan syntax.
Various common regression models have been implemented in rstanarm.
Another benefit is that various visualization tools in R can be used, as in the sketch below.
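A minimal usage sketch of the rstanarm workflow; the mtcars regression is purely illustrative.

library(rstanarm)
fit <- stan_glm(mpg ~ wt, data = mtcars, family = gaussian(),
                chains = 4, iter = 2000, seed = 1)
summary(fit)                          # posterior summaries, including R-hat
posterior_interval(fit, prob = 0.95)  # credible intervals
plot(fit, "trace")                    # trace plots to inspect mixing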