Lecture 33: Metropolis-Hastings Algorithm
Prof. Nicholas Zabaras
Email: nzabaras@gmail.com
URL: https://www.zabaras.com/
November 1, 2020
Asymptotically: $X_n \sim \mathcal{N}\left(0, \dfrac{1}{1-\rho^2}\right)$.
Autoregressive Model: Example
Case: 𝜌 = 0.5, initial state: 𝑋0 ~𝒩(0, 1). Asymptotic variance: 4/3
[Figure: sample path of the chain over 10,000 iterations (left) and histograms of $X_n$ against the asymptotic density (right); the influence of the initial value $X_0$ on the estimated variance fades as the chain progresses.]
More importantly, we can use the realizations of the Markov chain in Monte Carlo estimators, i.e. we can average across the path.
However, note that even if the $X_n$ were exact draws from the target, they are no longer independent!
For independent draws, the estimator $\hat{I} = \frac{1}{N}\sum_{n=1}^{N} f(X_n)$ of $\bar{f} = \mathbb{E}_\pi f$ satisfies
$$\mathrm{Var}\big(\hat{I}\big) = \frac{\mathrm{Var}_\pi f(x)}{N}.$$
For correlated samples, instead,
$$\mathrm{Var}\big(\hat{I}\big) = \frac{1}{N^2}\sum_{n=1}^{N}\sum_{m=1}^{N} \mathbb{E}\big[(f(X_n)-\bar{f})(f(X_m)-\bar{f})\big] = \frac{1}{N^2}\Big[N\,C_{ff}(0) + 2(N-1)\,C_{ff}(1) + \dots + 2\,C_{ff}(N-1)\Big]$$
$$= \frac{C_{ff}(0)}{N}\Big[1 + 2\sum_{j=1}^{N-1}\Big(1 - \frac{j}{N}\Big)\rho_{ff}(j)\Big],$$
where
$$\rho_{ff}(s) = \frac{C_{ff}(s)}{C_{ff}(0)} = \frac{C_{ff}(s)}{\mathrm{var}(f)}$$
is the normalized autocovariance function, and $\tau_f = 1 + 2\sum_{j\ge 1}\rho_{ff}(j)$ is the (integrated) autocovariance time, so that $\mathrm{Var}(\hat{I}) \approx \tau_f\,\mathrm{var}(f)/N$ for large $N$.
For some sufficiently large $M$, $\rho_{ff}(s) \approx 0$ when $s \ge M$.
For $N \gg M$, the samples $X_0$ and $X_N$ are essentially uncorrelated.
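As a practical aside, here is a minimal NumPy sketch (not from the slides; `samples` is assumed to be a 1-D array of scalar chain values) for estimating the normalized autocovariance function and the autocovariance time defined above:

```python
import numpy as np

def normalized_autocovariance(samples, max_lag):
    """Estimate rho_ff(s) = C_ff(s) / C_ff(0) from a scalar chain."""
    x = np.asarray(samples, dtype=float)
    x = x - x.mean()
    n = len(x)
    c0 = np.dot(x, x) / n  # C_ff(0) = var(f)
    return np.array([np.dot(x[:n - s], x[s:]) / n for s in range(max_lag + 1)]) / c0

def autocovariance_time(samples, max_lag=100):
    """tau_f = 1 + 2 * sum_{j >= 1} rho_ff(j), truncated at max_lag (~ the M above)."""
    rho = normalized_autocovariance(samples, max_lag)
    return 1.0 + 2.0 * rho[1:].sum()
```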
Markov Chain Monte Carlo
Objective: Given an arbitrary distribution $\pi(x)$, we want to construct a Markov chain that asymptotically converges to the target independently of the initial state.
We want to use the Markov chain paths in estimators:
$$\hat{I} = \frac{1}{N}\sum_{n=1}^{N} f(X_n) \;\longrightarrow\; I = \int f(x)\,\pi(x)\,dx.$$
The first successful attempt was the Metropolis algorithm, proposed in 1953 by N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller in "Equation of State Calculations by Fast Computing Machines", J. Chem. Phys., 21, pp. 1087-1092. This paper has been cited 42,782 times since then!
Metropolis-Hastings Algorithm
This is another way to sample from a target $\pi(\theta)$ known only up to a normalizing constant.

W. Hastings, Monte Carlo Sampling Methods Using Markov Chains and Their Applications, Biometrika, Vol. 57(1), pp. 97-109 (1970).

M-H is a stochastic algorithm: even if you draw the same $\theta'$ twice, it is accepted only with a certain probability.
Metropolis Algorithm
The original version of the algorithm considers a random walk proposal $\theta' = \theta + Z$, $Z \sim f$ with $f$ symmetric, so that
$$\alpha(\theta,\theta') = \min\left(1,\; \frac{\pi(\theta')\,f(\theta-\theta')}{\pi(\theta)\,f(\theta'-\theta)}\right) = \min\left(1,\; \frac{\pi(\theta')}{\pi(\theta)}\right).$$
Metropolis Algorithm
Let $\pi(\theta)$ be the target and $q(\theta,\theta')$ a symmetric proposal distribution, such that $q(\theta,\theta') = q(\theta',\theta)$.

This works well if the effect of the data is not significant, i.e. the posterior is close to the prior.

Note that the transition kernel $K(\theta,\theta')$ is not the same as the proposal distribution $q(\theta,\theta')$! They coincide only if every proposal is accepted, which is not the case because the proposals are not always accepted.

In addition, the Markov chain needs to be irreducible (one can reach any $A$ such that $\pi(A) > 0$) and aperiodic (it does not visit the state space periodically).
Then
$$\pi(\theta)\,\alpha(\theta,\theta')\,q(\theta,\theta') = \pi(\theta)\min\left(1,\; \frac{\pi(\theta')\,q(\theta',\theta)}{\pi(\theta)\,q(\theta,\theta')}\right)q(\theta,\theta') = \min\big(\pi(\theta)\,q(\theta,\theta'),\; \pi(\theta')\,q(\theta',\theta)\big)$$
$$= \pi(\theta')\min\left(1,\; \frac{\pi(\theta)\,q(\theta,\theta')}{\pi(\theta')\,q(\theta',\theta)}\right)q(\theta',\theta) = \pi(\theta')\,\alpha(\theta',\theta)\,q(\theta',\theta),$$
which gives detailed balance:
$$\pi(\theta)\,K(\theta,\theta') = \pi(\theta')\,K(\theta',\theta).$$
Detailed balance implies that $\pi$ is invariant. Indeed:
$$\int \pi(\theta)\,K(\theta,\theta')\,d\theta = \int \pi(\theta')\,K(\theta',\theta)\,d\theta = \pi(\theta')\int K(\theta',\theta)\,d\theta = \pi(\theta').$$
Suppose that the probability distribution you want to sample from is $\pi(x)$.
1. Initialize $x$.
2. Propose a new $x_{new} \sim q(x, x_{new}) = \mathcal{N}(x_{new}|x, s^2)$. This symmetric $q(x, x_{new})$ leads to a reversible Markov chain, and the classic Metropolis algorithm applies.
3. Draw a random number $u \sim \mathcal{U}[0, 1]$.
4. If $u \le \min\big(1,\; \pi(x_{new})/\pi(x)\big)$, accept the move, i.e. set $x = x_{new}$.
5. Otherwise reject the move (the chain stays at $x$); see the sketch below.
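A minimal Python sketch of these five steps (an illustration, not code from the slides; `log_pi` is an assumed user-supplied log-density known up to a constant, and working in log space avoids numerical underflow):

```python
import numpy as np

def metropolis(log_pi, x0, s, n_steps, rng=None):
    """Random-walk Metropolis with proposal N(x_new | x, s^2)."""
    rng = np.random.default_rng() if rng is None else rng
    x = float(x0)
    chain = np.empty(n_steps)
    n_accept = 0
    for i in range(n_steps):
        x_new = x + s * rng.standard_normal()   # step 2: propose from N(x, s^2)
        u = rng.uniform()                       # step 3: u ~ U[0, 1]
        if np.log(u) <= min(0.0, log_pi(x_new) - log_pi(x)):  # step 4: accept test
            x = x_new
            n_accept += 1
        chain[i] = x                            # step 5: on rejection, repeat x
    return chain, n_accept / n_steps
```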
[Figure: scatter plots of random-walk Metropolis samples on a two-dimensional target for step sizes $s = 0.1$ and $s = 0.05$.]
When the variance of the random-walk increments (if it exists) is very small, the acceptance rate can be expected to be very high, but the moves are tiny.
You would like to scale the random-walk moves so that the chain can move reasonably fast through the regions of positive probability mass under $\pi$.
[Figure: estimated probability density vs. the target, the ergodic average $\hat{I} = \frac{1}{N}\sum_i x_i$ (converging toward the target mean, here 0.75), and the normalized autocovariance function $\rho_{ff}(s) = C_{ff}(s)/C_{ff}(0) = C_{ff}(s)/\mathrm{var}(f)$ for the chain.]

Acceptance ratio: 0.05. The acceptance rate is very low, the autocorrelation very high, and thus the convergence rate very slow.
Random Walk Metropolis-Hastings
Target: 𝜋(𝑥) = 0.25𝒩(−3, 2) + 0.75 𝒩(2, 1)
Random walk proposal: 𝑋𝑛+1 = 𝑋𝑛 + 𝑧𝑛
$$p(z_n) = \mathcal{N}(0, \sigma^2), \qquad q(x_n, x_{n+1}) = \mathcal{N}(x_{n+1}|x_n, \sigma^2)$$
Case: 𝜎 = 0.5, 𝑥0 = 0.0, length of chain = 10000
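A sketch of this example reusing the `metropolis` function above (reading $\mathcal{N}(-3, 2)$ and $\mathcal{N}(2, 1)$ as mean/variance pairs, which is an assumption about the slide's notation):

```python
import numpy as np

def log_pi(x):
    # 0.25 * N(x | -3, var = 2) + 0.75 * N(x | 2, var = 1), up to the shared 1/sqrt(2*pi)
    p1 = 0.25 * np.exp(-0.5 * (x + 3.0) ** 2 / 2.0) / np.sqrt(2.0)
    p2 = 0.75 * np.exp(-0.5 * (x - 2.0) ** 2)
    return np.log(p1 + p2)

chain, acc_rate = metropolis(log_pi, x0=0.0, s=0.5, n_steps=10_000)
print(f"acceptance rate: {acc_rate:.2f}")  # high acceptance, but strongly correlated samples
```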
[Figure: estimated probability density vs. the target (left) and normalized autocovariance function (right) for $\sigma = 0.5$.]
Acceptance ratio: 0.76. The acceptance rate is the highest of the three cases considered, but the autocorrelation is also the highest, and thus the convergence rate is very slow.
Example: target $\pi(\theta) \propto e^{-\theta^2/2}$ with two random-walk proposals
$$q_1(\theta,\theta') \propto e^{-\frac{(\theta-\theta')^2}{2(0.2)^2}}, \qquad q_2(\theta,\theta') \propto e^{-\frac{(\theta-\theta')^2}{2(5)^2}}.$$
[Figure: chain trace over 10,000 iterations and histogram vs. the target density for proposal $q_1$ ($\sigma = 0.2$).]
[Figure: chain trace over 10,000 iterations and histogram vs. the target density for proposal $q_2$ ($\sigma = 5$).]
[Figures: evolution of the empirical density of the chain (density over $x$ plotted against iteration, up to 10,000 iterations) for the two proposals.]
You should not naively adapt $\sigma^2$ on the fly in order to achieve an acceptance ratio in the desired range: the chain is then no longer Markov, and the desired convergence properties might be lost.
Heavy-tailed increments (the tails of the distribution of the random-walk steps) can prevent you from getting trapped in individual modes.
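For instance (an illustrative variation, not from the slides), the Gaussian increment in the `metropolis` sketch above can be replaced by a Student-t one; the increment remains symmetric, so the acceptance ratio $\pi(x_{new})/\pi(x)$ is unchanged:

```python
# replace the proposal line in the random-walk sampler:
x_new = x + s * rng.standard_t(df=2)  # heavy-tailed increment: occasional large jumps
```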
[Figures: chain traces over 10,000 iterations and the corresponding histograms vs. the target density, comparing the resulting exploration of the target.]
This algorithm (a mixture of MH kernels) is different from using the single mixture proposal
$$q(\theta,\theta') = \lambda\, q_1(\theta') + (1-\lambda)\, q_2(\theta,\theta').$$
In the limiting case, you could have 𝐾2 (𝜃, 𝜃′) = 𝛿𝜃 (𝜃′) and the total kernel 𝐾
would still be irreducible and aperiodic if 𝐾1 is irreducible and aperiodic.
The combined kernel leaves $\pi$ invariant:
$$\int \pi(z)\,K(z,\theta')\,dz = \pi(\theta').$$
When a block, say $\theta_2$, is updated with proposal $q_2$ while $\theta_1$ is held fixed, the acceptance ratio only involves the conditional $\pi(\theta_2|\theta_1)$:
$$\frac{\pi\big(\theta_2'\,|\,\theta_1\big)\,q_2\big((\theta_1,\theta_2'),\,\theta_2\big)}{\pi\big(\theta_2\,|\,\theta_1\big)\,q_2\big((\theta_1,\theta_2),\,\theta_2'\big)}.$$
Sample $\theta_1^* \sim q_1\big((\theta_1^{(i-1)}, \theta_2^{(i-1)}),\,\cdot\big)$ and compute
$$\alpha_1\big((\theta_1^{(i-1)},\theta_2^{(i-1)}),\,(\theta_1^*,\theta_2^{(i-1)})\big) = \min\left\{1,\; \frac{\pi\big(\theta_1^*\,|\,\theta_2^{(i-1)}\big)\, q_1\big((\theta_1^*,\theta_2^{(i-1)}),\,\theta_1^{(i-1)}\big)}{\pi\big(\theta_1^{(i-1)}\,|\,\theta_2^{(i-1)}\big)\, q_1\big((\theta_1^{(i-1)},\theta_2^{(i-1)}),\,\theta_1^*\big)}\right\}.$$
With probability $\alpha_1$, set $\theta_1^{(i)} = \theta_1^*$; otherwise set $\theta_1^{(i)} = \theta_1^{(i-1)}$.

Sample $\theta_2^* \sim q_2\big((\theta_1^{(i)}, \theta_2^{(i-1)}),\,\cdot\big)$ and compute
$$\alpha_2\big((\theta_1^{(i)},\theta_2^{(i-1)}),\,(\theta_1^{(i)},\theta_2^*)\big) = \min\left\{1,\; \frac{\pi\big(\theta_2^*\,|\,\theta_1^{(i)}\big)\, q_2\big((\theta_1^{(i)},\theta_2^*),\,\theta_2^{(i-1)}\big)}{\pi\big(\theta_2^{(i-1)}\,|\,\theta_1^{(i)}\big)\, q_2\big((\theta_1^{(i)},\theta_2^{(i-1)}),\,\theta_2^*\big)}\right\}.$$
With probability $\alpha_2$, set $\theta_2^{(i)} = \theta_2^*$; otherwise set $\theta_2^{(i)} = \theta_2^{(i-1)}$.
More generally, for block $J$:
$$\alpha_J\big((\theta_J^{(i-1)},\theta_{-J}^{(i)}),\,(\theta_J^*,\theta_{-J}^{(i)})\big) = \min\left\{1,\; \frac{\pi\big(\theta_J^*\,|\,\theta_{-J}^{(i)}\big)\, q_J\big((\theta_J^*,\theta_{-J}^{(i)}),\,\theta_J^{(i-1)}\big)}{\pi\big(\theta_J^{(i-1)}\,|\,\theta_{-J}^{(i)}\big)\, q_J\big((\theta_J^{(i-1)},\theta_{-J}^{(i)}),\,\theta_J^*\big)}\right\}.$$
With probability $\alpha_J$, set $\theta_J^{(i)} = \theta_J^*$; otherwise set $\theta_J^{(i)} = \theta_J^{(i-1)}$.
While an individual block-update kernel is not irreducible on its own, the composition or mixture of these kernels can be irreducible and aperiodic, because all the components are then updated.
Then the MH ratio for updating $\theta_1$ simplifies:
$$r(\theta,\theta') = \frac{\pi(\theta_1',\theta_2)\,q_1\big((\theta_1',\theta_2),\,\theta_1\big)}{\pi(\theta_1,\theta_2)\,q_1\big((\theta_1,\theta_2),\,\theta_1'\big)} = \frac{\pi\big(\theta_1'\,|\,\theta_2\big)\,q_1\big((\theta_1',\theta_2),\,\theta_1\big)}{\pi\big(\theta_1\,|\,\theta_2\big)\,q_1\big((\theta_1,\theta_2),\,\theta_1'\big)},$$
since $\pi(\theta_1,\theta_2) = \pi(\theta_1|\theta_2)\,\pi(\theta_2)$ and the factor $\pi(\theta_2)$ cancels.
Iteration $i$, $i \ge 1$:
For $k = 1{:}p$:
Sample $\theta_k^{(i)}$ using an MH step with proposal distribution $q_k\big((\theta_{-k}^{(i)}, \theta_k^{(i-1)}),\,\theta_k'\big)$ and target $\pi\big(\theta_k|\theta_{-k}^{(i)}\big)$,
where $\theta_{-k}^{(i)} = \big(\theta_1^{(i)}, \ldots, \theta_{k-1}^{(i)}, \theta_{k+1}^{(i-1)}, \ldots, \theta_p^{(i-1)}\big)$.

One can also mix exact Gibbs steps and MH steps:
Sample $\theta_1^{(i)} \sim \pi\big(\theta_1|\theta_2^{(i-1)}\big)$.
Sample $\theta_2^{(i)}$ using an MH step with proposal distribution $q_2\big((\theta_1^{(i)}, \theta_2^{(i-1)}),\,\theta_2'\big)$ and target $\pi\big(\theta_2|\theta_1^{(i)}\big)$.
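A minimal sketch of one sweep of this component-wise scheme (illustrative, not from the slides; `log_cond[k]` is an assumed list of log full-conditional densities $\log\pi(\theta_k|\theta_{-k})$, and the proposals are symmetric random walks so the $q$-ratio cancels):

```python
import numpy as np

def mh_within_gibbs_sweep(theta, log_cond, scales, rng):
    """For k = 1..p, update theta[k] by an MH step targeting pi(theta_k | theta_{-k})."""
    theta = theta.copy()
    for k in range(len(theta)):
        prop = theta[k] + scales[k] * rng.standard_normal()  # random-walk proposal
        rest = np.delete(theta, k)                           # current theta_{-k}
        log_r = log_cond[k](prop, rest) - log_cond[k](theta[k], rest)
        if np.log(rng.uniform()) <= min(0.0, log_r):         # symmetric q cancels
            theta[k] = prop
    return theta
```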
We can use a Langevin proposal that shifts the proposal mean toward higher density:
$$\theta' = \theta + \frac{\sigma^2}{2}\,\nabla \log \pi(\theta) + \sigma V, \qquad V \sim \mathcal{N}(0,1).$$
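A sketch of one Metropolis-adjusted Langevin (MALA) step built on this proposal (illustrative; `grad_log_pi` is an assumed gradient function, and the proposal is asymmetric, so the $q$-ratio enters the acceptance probability):

```python
import numpy as np

def mala_step(x, log_pi, grad_log_pi, sigma, rng):
    """One Langevin-proposal MH step for a scalar state x."""
    def drift(z):  # proposal mean: z + (sigma^2 / 2) * grad log pi(z)
        return z + 0.5 * sigma ** 2 * grad_log_pi(z)
    x_new = drift(x) + sigma * rng.standard_normal()
    # log q(x, x') = -(x' - drift(x))^2 / (2 sigma^2), up to a constant
    log_q_fwd = -0.5 * (x_new - drift(x)) ** 2 / sigma ** 2
    log_q_bwd = -0.5 * (x - drift(x_new)) ** 2 / sigma ** 2
    log_alpha = log_pi(x_new) + log_q_bwd - log_pi(x) - log_q_fwd
    return x_new if np.log(rng.uniform()) <= min(0.0, log_alpha) else x
```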
An alternative acceptance probability (Barker's rule) is
$$\alpha(\theta,\theta') = \frac{\delta(\theta',\theta)}{\pi(\theta)\,q(\theta,\theta')}.$$
Note that $0 \le \alpha(\theta,\theta') \le 1$ and
$$\delta(\theta,\theta') = \frac{\pi(\theta')\,q(\theta',\theta)\;\pi(\theta)\,q(\theta,\theta')}{\pi(\theta')\,q(\theta',\theta) + \pi(\theta)\,q(\theta,\theta')} = \delta(\theta',\theta).$$
Alternative Acceptance Probabilities
Indeed, one can check that
$$K(\theta,\theta') = \alpha(\theta,\theta')\,q(\theta,\theta') + \left(1 - \int \alpha(\theta,u)\,q(\theta,u)\,du\right)\delta_\theta(\theta')$$
is $\pi$-reversible.
We have:
$$\pi(\theta)\,\alpha(\theta,\theta')\,q(\theta,\theta') = \pi(\theta)\,\frac{\delta(\theta,\theta')}{\pi(\theta)\,q(\theta,\theta')}\,q(\theta,\theta') = \delta(\theta,\theta') = \delta(\theta',\theta)$$
$$= \pi(\theta')\,\frac{\delta(\theta',\theta)}{\pi(\theta')\,q(\theta',\theta)}\,q(\theta',\theta) = \pi(\theta')\,\alpha(\theta',\theta)\,q(\theta',\theta).$$
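A small sketch of this acceptance rule evaluated in log space (illustrative; the four arguments are assumed log-densities supplied by the caller):

```python
import numpy as np

def barker_accept_prob(log_pi_x, log_pi_xp, log_q_fwd, log_q_bwd):
    """alpha = pi(x') q(x', x) / (pi(x') q(x', x) + pi(x) q(x, x'))."""
    a = log_pi_xp + log_q_bwd  # log [pi(theta') q(theta', theta)]
    b = log_pi_x + log_q_fwd   # log [pi(theta)  q(theta, theta')]
    return 1.0 / (1.0 + np.exp(b - a))  # logistic function of the log ratio
```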
Consider that $x$ represents the position of some "real particles", and write
$$\pi(x) = \exp(-V(x)).$$
To construct the proposal, we bring into the picture the dynamics described by the Hamiltonian $H(x,p) = V(x) + \frac{1}{2}p^{T}p$.
1. Sample an initial $p$ from a Gaussian (in $\pi(x,p) \propto \exp(-H(x,p))$, $x$ and $p$ are decoupled and the probability distribution of $p$ is $\mathcal{N}(p|0,1)$).
2. Evolve the equations of motion for a finite amount of time using a finite time step.
Notice that the proposal built this way is reversible if the integration scheme is reversible. To guarantee this, we use the leapfrog integration scheme for the equations of motion, which is time-reversible and volume-preserving and approximately conserves the value of the Hamiltonian (the MH accept/reject step corrects for the remaining discretization error).
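A minimal sketch of the leapfrog scheme and one hybrid (Hamiltonian) proposal built from it (illustrative; `V` and `grad_V` are assumed user functions for the potential $V(x) = -\log\pi(x)$, and `eps`, `L` are the step size and number of leapfrog steps):

```python
import numpy as np

def leapfrog(x, p, grad_V, eps, L):
    """Time-reversible, volume-preserving integration of Hamilton's equations."""
    p = p - 0.5 * eps * grad_V(x)       # initial half step in momentum
    for _ in range(L - 1):
        x = x + eps * p                 # full step in position
        p = p - eps * grad_V(x)         # full step in momentum
    x = x + eps * p
    p = p - 0.5 * eps * grad_V(x)       # final half step in momentum
    return x, p

def hmc_step(x, V, grad_V, eps, L, rng):
    p0 = rng.standard_normal(np.shape(x))            # step 1: p ~ N(0, 1)
    x_new, p_new = leapfrog(x, p0, grad_V, eps, L)   # step 2: simulate the dynamics
    dH = (V(x_new) + 0.5 * np.sum(p_new ** 2)) - (V(x) + 0.5 * np.sum(p0 ** 2))
    return x_new if np.log(rng.uniform()) <= -dH else x  # accept with min(1, e^{-dH})
```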
Hybrid Hamiltonian Metropolis Proposal
[Figure: 50 MH steps of the hybrid (Hamiltonian) proposal on a two-dimensional target.]