
The Metropolis-Hastings Algorithm
Prof. Nicholas Zabaras

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

November 1, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 1


Contents
 MCMC, Autoregressive Model, Autocovariance Function, Metropolis-Hastings Algorithm, Metropolis Algorithm, Independent Metropolis-Hastings, Transition Kernel, Reversibility, Irreducibility, Aperiodicity, Examples
 Mixture of Proposals, Composition of MH Kernels, General Hybrid Algorithm,
Alternative Acceptance Probability
 Hamiltonian (Hybrid) Metropolis Proposal

 Arnaud Doucet, Statistical Computing – Monte Carlo Methods (online course)


 Christian P. Robert and George Casella, Monte Carlo Statistical Methods, Springer, 2nd edition (Chapters 6, 7, 9
& 10) (Video, Lecture Slides)
 C.P. Robert, The Metropolis-Hasting Algorithm (with R programs), https://arxiv.org/pdf/1504.01896.pdf
 Julian Besag, Markov Chain Monte Carlo for Statistical Inference (2000) (working paper)
 C. Andrieu, et al. , An Introduction to MCMC for Machine Learning (2003)
 S. Chib and E. Greenberg, Understanding the Metropolis-Hastings algorithm, The American Statistician, 1995
 Java applets for the Metropolis Hastings algorithm
 L. Held, Conditional Prior Proposals in Dynamic Models, Scand. J. Statist., 1999
 M.K. Pitt & N. Shephard, Likelihood Analysis of Non-Gaussian Measurement Time Series, Biometrika, 1996
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 2
Goals
 The goals for today’s lecture include the following:

 Understand the fundamentals of MCMC

 Learn about the Metropolis-Hastings algorithm and its variants

 Understand the use of mixture of proposals and composition of transition kernels

 Understand how to implement hybrid algorithms

 Acquire basic understanding of the Hamiltonian Metropolis proposal

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 3


Markov Chain Monte Carlo
 The simplest way to generate a sequence of random variables and be able to
say something about asymptotics is using Markov Chains.

 A Markov Chain 𝑋𝑛, 𝑛 = 0, 1, 2, …, is fully defined if we know:

 Initial distribution 𝑝0(𝑥0) = Pr[𝑋0 = 𝑥0] (this will prove of little significance)

 Transition Kernel: 𝐾(𝑥𝑛, 𝑥𝑛+1) = Pr[𝑋𝑛+1 = 𝑥𝑛+1 | 𝑋𝑛 = 𝑥𝑛].

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 4


Autoregressive Model
 We generate a sequence of random variables using Markov Chains.
 A Markov Chain 𝑋𝑛 , 𝑛 = 1, 2, … is fully defined if we know:
 Initial distribution 𝑝0(𝑥0) = Pr[𝑋0 = 𝑥0] (this will prove of little significance)
 Transition Kernel: 𝐾(𝑥𝑛, 𝑥𝑛+1 ) = Pr[𝑋𝑛+1 = 𝑥𝑛+1 | 𝑋𝑛 = 𝑥𝑛]

 An example of a Markov chain is an autoregressive model:


𝑋𝑛 = 𝜌𝑋𝑛−1 + 𝑍𝑛 where 𝑋0, 𝑍𝑛 ∼ 𝒩(0, 1) (i.i.d) with |𝜌| < 1

 Initial distribution: 𝑋0 ∼ 𝒩(0, 1)


$$\mathbb{E}[X_n] = 0, \qquad \mathrm{Var}(X_n) = \rho^{2n} + \frac{1-\rho^{2n}}{1-\rho^{2}}$$

 Transition Kernel: 𝑋𝑛 | 𝑋𝑛−1 ∼ 𝒩(𝜌𝑋𝑛−1, 1), |𝜌| < 1

 Asymptotically: 𝑋𝑛 ∼ 𝒩(0, 1/(1 − 𝜌²)).
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 5
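A minimal sketch (Python/NumPy rather than the course's MatLab scripts) of simulating this AR(1) chain and checking that the sample variance approaches the asymptotic value 1/(1 − ρ²):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, N = 0.5, 10_000

x = np.empty(N)
x[0] = rng.normal()                       # X_0 ~ N(0, 1)
for n in range(1, N):
    x[n] = rho * x[n - 1] + rng.normal()  # X_n = rho * X_{n-1} + Z_n, Z_n ~ N(0, 1)

print("sample variance:    ", x.var())
print("asymptotic variance:", 1.0 / (1.0 - rho**2))   # 4/3 for rho = 0.5
```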
Autoregressive Model: Example
 Case: 𝜌 = 0.5, initial state: 𝑋0 ~𝒩(0, 1). Asymptotic variance: 4/3
[Figure: variance of the Markov chain vs. the number of samples (left); histogram of the distribution of samples compared with the exact pdf (right)]

$$\hat\sigma_n^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \hat X_n\right)^2, \qquad \hat X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 6
Autoregressive Model: Example
 Case: 𝜌 = 0.5, initial state: 𝑋0 = −1000 (MatLab implementation)
Since the initial value 𝑋0 here has a significant influence on the estimated "variance" (𝑋0 − 𝑋̂𝑛 is much larger than the other 𝑋𝑛 − 𝑋̂𝑛), the figure of the variance is not presented.

[Figure: histogram of the distribution of samples (compared with the exact pdf)]

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 7


Markov Chain Monte Carlo
 To define a Markov Chain only requires determining a local rule 𝐾 𝑋𝑛 , 𝑋𝑛+1 .

 If we make a good selection for the transition kernel, it could asymptotically


converge to a target distribution independently of where we started from.

 More importantly, we can use the realizations of the Markov Chain in Monte
Carlo estimators i.e. we can average across the path.

 However note that even if 𝑋𝑛 were exact draws, they are not independent
anymore!

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 8


Markov Chain Monte Carlo
 Ergodic Markov chain:
$$\hat{I} = \frac{1}{N}\sum_{n=1}^{N} f(X_n) \;\longrightarrow\; I = \int f(x)\,\pi(x)\,dx$$

𝑋𝑖 form a Markov Chain which asymptotically converges to 𝜋(𝑥) (we


haven’t discussed yet under which conditions this holds)
 We also care about how fast it converges (particularly when each evaluation
of 𝑓 is expensive)
 In standard Monte Carlo using i.i.d. samples we had:

$$\mathrm{Var}\left(\hat{I}\right) = \frac{\mathrm{Var}_{\pi}\, f(x)}{N}$$

 Let us compute 𝑉𝑎𝑟(𝐼̂) for a Markov chain.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 9


Autocovariance Function
$$\hat{I} = \frac{1}{N}\sum_{i=1}^{N} f(X_i) \;\longrightarrow\; I = \int f(x)\,\pi(x)\,dx$$

$$\mathbb{E}\left[\hat{I}\right] = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\left[f(X_i)\right] = \mathbb{E}[f], \qquad \mathrm{var}\left(\hat{I}\right) = \mathbb{E}\left[\left(\hat{I} - I\right)^{2}\right] = \mathbb{E}\left[\left(\frac{1}{N}\sum_{n=1}^{N} f(X_n) - \mathbb{E}[f]\right)^{2}\right] = \frac{1}{N^{2}}\sum_{n=1}^{N}\sum_{m=1}^{N}\mathbb{E}\Big[\big(f(X_n)-\mathbb{E}[f]\big)\big(f(X_m)-\mathbb{E}[f]\big)\Big]$$

 Let 𝑍𝑖 = 𝑓(𝑋𝑖) − 𝔼[𝑓(𝑋𝑖)] and assume it is weakly stationary:

$$\mathrm{var}(Z_i) = \mathbb{E}\left[Z_i^{2}\right] = \sigma^{2}, \qquad \mathbb{E}\left[Z_i Z_j\right] = \sigma^{2}\,\rho(|j-i|) \quad (\rho:\ \text{normalized auto-covariance function})$$

 Then you can easily show that:

$$\mathrm{var}\left(\hat{I}\right) = \frac{1}{N^{2}}\,\mathbb{E}\left[\sum_{n=1}^{N}\sum_{m=1}^{N} Z_n Z_m\right] = \frac{\sigma^{2}}{N^{2}}\Big[N\rho(0) + 2(N-1)\rho(1) + \dots + 2\rho(N-1)\Big] = \frac{\sigma^{2}}{N}\underbrace{\left[1 + 2\sum_{j=1}^{N-1}\left(1 - \frac{j}{N}\right)\rho(j)\right]}_{\tau_f:\ \text{autocovariance time}}$$

 Autocovariance function: $C_{ff}(s) = \mathrm{cov}\big(f(X_n), f(X_{n+s})\big) = \mathbb{E}\big[f(X_n)\,f(X_{n+s})\big] - \mathbb{E}[f]^{2}$, with $\rho_{ff}(s) = C_{ff}(s)/C_{ff}(0) = C_{ff}(s)/\mathrm{var}(f)$.

 For some 𝑀 sufficiently large, 𝜌𝑓𝑓(𝑠) ≈ 0 when 𝑠 ≥ 𝑀.
 For 𝑁 ≫ 𝑀, the 𝑋0 and 𝑋𝑁 samples are essentially uncorrelated.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 10
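A small sketch (Python/NumPy; the function names are mine) of estimating the normalized autocovariance ρ_ff(s) and the autocovariance time τ_f from a chain, following the formulas above:

```python
import numpy as np

def normalized_autocov(f_vals, max_lag):
    """Estimate rho_ff(s) = C_ff(s) / C_ff(0) for s = 0, ..., max_lag."""
    f = np.asarray(f_vals) - np.mean(f_vals)
    N = len(f)
    c = np.array([np.dot(f[: N - s], f[s:]) / N for s in range(max_lag + 1)])
    return c / c[0]

def autocov_time(f_vals, max_lag):
    """tau_f = 1 + 2 * sum_{j=1}^{max_lag} (1 - j/N) * rho_ff(j), truncated at max_lag."""
    rho = normalized_autocov(f_vals, max_lag)
    N = len(f_vals)
    j = np.arange(1, max_lag + 1)
    return 1.0 + 2.0 * np.sum((1.0 - j / N) * rho[1:])

# Example: the AR(1) chain from the earlier slide, for which rho_ff(s) = 0.5**s and tau_f ~ 3
rng = np.random.default_rng(0)
x = np.empty(50_000)
x[0] = rng.normal()
for n in range(1, len(x)):
    x[n] = 0.5 * x[n - 1] + rng.normal()
print("estimated autocovariance time:", autocov_time(x, max_lag=100))
```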
Markov Chain Monte Carlo
 Objective: Given an arbitrary distribution 𝜋 𝒙 , we want to construct a Markov
Chain that asymptotically converges to the target independently of the initial
state.
 We want to use the Markov Chain paths in estimators

$$\hat{I} = \frac{1}{N}\sum_{n=1}^{N} f(X_n) \;\longrightarrow\; I = \int f(x)\,\pi(x)\,dx$$

 This requires coming up with a way to produce suitable transition kernels


𝐾(𝑋𝑛 , 𝑋𝑛+1 ) for any target 𝜋 𝒙 .

 The first successful attempt was the Metropolis algorithm proposed in 1953 by N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller in "Equation of State Calculations by Fast Computing Machines", J. Chem. Phys., 21, pp. 1087. This paper has been cited 42,782 times since then!
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 11
Metropolis-Hastings Algorithm
 This is another way to sample from 𝜋 𝜃 known up to a normalizing constant.

 The algorithm builds a Markov kernel that has 𝜋 𝜃 as its invariant


distribution.

 The algorithm is the basis of many other MCMC algorithms.

 The algorithm requires a proposal distribution (kernel) 𝑞(𝜃, 𝜃′) to propose a candidate 𝜃′ given 𝜃. The following should hold:

∫ 𝑞(𝜃, 𝜃′) 𝑑𝜃′ = 1 for all 𝜃

 𝜃′ is accepted with probability 𝛼(𝜃, 𝜃′), chosen so that 𝜋(𝜃) is the invariant distribution of the transition kernel.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 12


Metropolis – Hastings Algorithm
 Let 𝜋(𝜃) be the target and 𝑞(𝜃, 𝜃′) any (symmetric or not) proposal distribution.

 Initialization: Select (deterministically or randomly) 𝜃 (0) .


 Iteration 𝑖, 𝑖 ≥ 1:
 Draw a proposal 𝜃 ∗ from 𝑞(𝜃 (𝑖−1) , 𝜃 ∗ )
 Calculate the acceptance ratio:

$$\alpha\left(\theta^{(i-1)}, \theta^{*}\right) = \min\left(1,\ \frac{\pi(\theta^{*})\,q\left(\theta^{*}, \theta^{(i-1)}\right)}{\pi\left(\theta^{(i-1)}\right)\,q\left(\theta^{(i-1)}, \theta^{*}\right)}\right)$$

 With probability 𝛼 𝜃 𝑖−1 , 𝜃 ∗ , set 𝜃 (𝑖) = 𝜃 ∗ ; otherwise 𝜃 (𝑖) = 𝜃 (𝑖−1) .

W. Hastings, Monte Carlo Sampling Methods Using Markov Chains and Their Applications, Biometrika, Vol. 57(1), pp. 97-109 (1970).

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 13
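A minimal sketch of the algorithm on this slide (Python/NumPy; working in log densities is an implementation choice for numerical stability, not part of the slide):

```python
import numpy as np

def metropolis_hastings(log_pi, q_sample, q_logpdf, theta0, n_iter, seed=0):
    """Generic Metropolis-Hastings.
    log_pi(theta): log target density, known up to an additive constant
    q_sample(theta, rng): draws a proposal theta* ~ q(theta, .)
    q_logpdf(theta, theta_star): log q(theta, theta*)
    """
    rng = np.random.default_rng(seed)
    theta, chain = theta0, [theta0]
    for _ in range(n_iter):
        theta_star = q_sample(theta, rng)
        log_ratio = (log_pi(theta_star) + q_logpdf(theta_star, theta)
                     - log_pi(theta) - q_logpdf(theta, theta_star))
        if np.log(rng.uniform()) < min(0.0, log_ratio):   # accept with probability alpha
            theta = theta_star
        chain.append(theta)                               # otherwise keep the current state
    return np.array(chain)
```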


Metropolis-Hastings Algorithm
 To implement the Metropolis scheme we only need to know the target density
𝜋(𝜃) up to a constant!
 This is useful in Bayesian inference where the target distribution is the posterior (whose normalizing factor is not known)
𝑝(𝜽 | 𝒟) ∝ 𝑝(𝒟| 𝜽) 𝑝(𝜽)

 𝑞 𝜃, 𝜃 ′ can be any proposal distribution. E.g. one can use 𝜃 ′ ~𝒩 𝜙 𝜃 , 𝜎 2


where 𝜙 𝜃 is any deterministic function of 𝜃 (e.g. a neural network or the
local max of 𝜋 closest to 𝜃).

 Much more flexibility than in Gibbs sampling.

 M-H is a stochastic algorithm. Even if you draw the same 𝜃 ′ twice, this is
accepted with a certain probability.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 14
Metropolis Algorithm
 The original version of the algorithm considers a random walk proposal

𝜃′ = 𝜃 + 𝑍, 𝑍 ∼ 𝑓

(N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller, "Equation of State Calculations by Fast Computing Machines", J. Chem. Phys., 21, pp. 1087, 1953)

 In this case, 𝑞(𝜃, 𝜃′) = 𝑓(𝜃′ − 𝜃).

 The acceptance probability becomes:

$$\alpha(\theta, \theta') = \min\left(1,\ \frac{\pi(\theta')\,q(\theta', \theta)}{\pi(\theta)\,q(\theta, \theta')}\right) = \min\left(1,\ \frac{\pi(\theta')\,f(\theta - \theta')}{\pi(\theta)\,f(\theta' - \theta)}\right)$$

 For symmetric 𝑓, i.e. 𝑓(𝜃′ − 𝜃) = 𝑓(𝜃 − 𝜃′), e.g. 𝑍 ∼ 𝒩(0, Σ):

$$\alpha(\theta, \theta') = \min\left(1,\ \frac{\pi(\theta')\,f(\theta - \theta')}{\pi(\theta)\,f(\theta' - \theta)}\right) = \min\left(1,\ \frac{\pi(\theta')}{\pi(\theta)}\right)$$
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 15
Metropolis Algorithm
 Let 𝜋(𝜃) be the target and 𝑞(𝜃, 𝜃′) a symmetric proposal distribution such that 𝑞(𝜃, 𝜃′) = 𝑞(𝜃′, 𝜃).

 Initialization: Select (deterministically or randomly) 𝜃 (0) .


 Iteration 𝑖, 𝑖 ≥ 1:
 Draw a proposal 𝜃 ∗ from 𝑞(𝜃 (𝑖−1) , 𝜃 ∗ )
 Calculate the acceptance ratio:

$$\alpha\left(\theta^{(i-1)}, \theta^{*}\right) = \min\left(1,\ \frac{\pi(\theta^{*})}{\pi\left(\theta^{(i-1)}\right)}\right)$$

 With probability 𝛼(𝜃 (𝑖−1), 𝜃 ∗), set 𝜃 (𝑖) = 𝜃 ∗; otherwise 𝜃 (𝑖) = 𝜃 (𝑖−1).
N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller, Equation of State Calculations by Fast Computing Machines, J. Chem. Physics, Vol. 21, pp. 1087 (1953)

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 16


Independent Metropolis-Hastings
 If 𝑞(𝜃, 𝜃′) = 𝑞(𝜃′) (independent proposal) then:

$$\alpha(\theta, \theta') = \min\left(1,\ \frac{\pi(\theta')\,q(\theta', \theta)}{\pi(\theta)\,q(\theta, \theta')}\right) = \min\left(1,\ \frac{\pi(\theta')/q(\theta')}{\pi(\theta)/q(\theta)}\right)$$

 Unnormalized 𝜋 ∗ 𝜃 and 𝑞 ∗ 𝜃 can be used.


 When using independent proposals, you would like to have 𝑞(𝜃) ≅ 𝜋(𝜃).
 Similarly to Rejection sampling or Importance Sampling, you need to ensure
that
$$\frac{\pi^{*}(\theta)}{q^{*}(\theta)} \leq M \quad \text{for all } \theta$$
to obtain good performance.
 Without the above constraint in the selection of 𝑞(𝜃), the algorithm might not
work at all.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 17
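A sketch of the independent-proposal case (Python/NumPy; names are mine): the acceptance probability only involves the ratios π*(θ)/q*(θ), so unnormalized densities suffice.

```python
import numpy as np

def independent_mh(log_pi_star, log_q_star, q_sample, theta0, n_iter, seed=0):
    """Independent MH: q(theta, theta') = q(theta'); acceptance uses pi*(theta)/q*(theta)."""
    rng = np.random.default_rng(seed)
    theta = theta0
    log_w = log_pi_star(theta) - log_q_star(theta)      # log "importance weight" of current state
    chain = [theta]
    for _ in range(n_iter):
        theta_star = q_sample(rng)                      # proposal does not depend on theta
        log_w_star = log_pi_star(theta_star) - log_q_star(theta_star)
        if np.log(rng.uniform()) < min(0.0, log_w_star - log_w):
            theta, log_w = theta_star, log_w_star
        chain.append(theta)
    return np.array(chain)
```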
Independent Metropolis-Hastings
 One might argue that since the proposed state does not depend on the
previous state, the states of the Markov Chain are independent and therefore
the autocorrelation is zero and the achieved convergence rate optimal.

 This is not the case because the proposals are not always accepted!

 In addition, if the proposal focuses on a region of low probability mass, it will


spend most of its time there.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 18


Independent Metropolis-Hastings
 Consider sampling from the posterior

𝑝(𝜽|𝒟) ∝ 𝑝(𝒟|𝜽)𝑝(𝜽)

 Let us use independent Metropolis-Hastings with the prior as the proposal distribution:

$$a(\boldsymbol{\theta}, \boldsymbol{\theta}') = \min\left(1,\ \frac{p(\mathcal{D}|\boldsymbol{\theta}')\,p(\boldsymbol{\theta}')}{p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}\cdot\frac{p(\boldsymbol{\theta})}{p(\boldsymbol{\theta}')}\right) = \min\left(1,\ \frac{p(\mathcal{D}|\boldsymbol{\theta}')}{p(\mathcal{D}|\boldsymbol{\theta})}\right)$$

 This works if the effect of the data is not significant – i.e. the posterior is close
to the prior.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 19


Metropolis Algorithm
 If such a scheme is to converge to the target distribution 𝜋(𝜃), then 𝜋 must be invariant, i.e.

$$\int \pi(\theta)\,K(\theta, \theta')\,d\theta = \pi(\theta')$$

 Note that the transition kernel 𝐾 𝜃, 𝜃 ′ is not the same as the proposal
distribution 𝑞(𝜃, 𝜃 ′ )!

𝐾(𝜃, 𝜃 ′ ) = p(𝜃 ′ | proposal acc.) Pr [proposal accepted]


+ p(𝜃 ′ | proposal rejected) Pr [proposal rejected]

 In addition the Markov chain needs to be irreducible (one can reach any 𝐴 s.t.
𝜋(𝐴) > 0) and aperiodic (not visiting periodically the state-space).

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 20


Invariant Distribution of the Metropolis-Hastings
 The transition kernel associated with the MH algorithm can be written as

$$K(\theta, \theta') = \alpha(\theta, \theta')\,q(\theta, \theta') + \underbrace{\left(1 - \int \alpha(\theta, u)\,q(\theta, u)\,du\right)}_{\text{rejection probability}}\delta_{\theta}(\theta')$$

 This is a loose notation for

$$K(\theta, d\theta') = \alpha(\theta, \theta')\,q(\theta, \theta')\,d\theta' + \left(1 - \int \alpha(\theta, u)\,q(\theta, u)\,du\right)\delta_{\theta}(d\theta')$$

 Clearly we need to satisfy $\int K(\theta, \theta')\,d\theta' = 1$. Indeed:

$$\int K(\theta, \theta')\,d\theta' = \int \alpha(\theta, \theta')\,q(\theta, \theta')\,d\theta' + \left(1 - \int \alpha(\theta, u)\,q(\theta, u)\,du\right)\int \delta_{\theta}(\theta')\,d\theta' = 1$$

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 21


The MH Kernel is Reversible
 By definition of the kernel we have

$$\pi(\theta)\,K(\theta, \theta') = \pi(\theta)\,\alpha(\theta, \theta')\,q(\theta, \theta') + \left(1 - \int \alpha(\theta, u)\,q(\theta, u)\,du\right)\delta_{\theta}(\theta')\,\pi(\theta)$$

 Then

$$\pi(\theta)\,\alpha(\theta, \theta')\,q(\theta, \theta') = \pi(\theta)\min\left(1,\ \frac{\pi(\theta')\,q(\theta', \theta)}{\pi(\theta)\,q(\theta, \theta')}\right)q(\theta, \theta') = \min\big(\pi(\theta)\,q(\theta, \theta'),\ \pi(\theta')\,q(\theta', \theta)\big)$$
$$= \pi(\theta')\min\left(1,\ \frac{\pi(\theta)\,q(\theta, \theta')}{\pi(\theta')\,q(\theta', \theta)}\right)q(\theta', \theta) = \pi(\theta')\,\alpha(\theta', \theta)\,q(\theta', \theta)$$

 We also have, obviously,

$$\left(1 - \int \alpha(\theta, u)\,q(\theta, u)\,du\right)\delta_{\theta}(\theta')\,\pi(\theta) = \left(1 - \int \alpha(\theta', u)\,q(\theta', u)\,du\right)\delta_{\theta'}(\theta)\,\pi(\theta')$$

 It follows that 𝜋(𝜃)𝐾(𝜃, 𝜃′) = 𝜋(𝜃′)𝐾(𝜃′, 𝜃).

 Hence, 𝜋 is the invariant distribution of the transition kernel 𝐾.


Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 22
Detailed Balance vs 𝝅 −invariant
 $K(\theta, \theta') = q(\theta, \theta')\,\alpha(\theta, \theta') + \left(1 - \int \alpha(\theta, \theta')\,q(\theta, \theta')\,d\theta'\right)\delta_{\theta}(\theta')$

 The transition kernel 𝐾 satisfies the detailed balance condition (reversibility)

𝜋(𝜃)𝐾(𝜃, 𝜃′) = 𝜋(𝜃′)𝐾(𝜃′, 𝜃)

 Detailed balance implies that 𝜋 is invariant. Indeed:

$$\int \pi(\theta)\,K(\theta, \theta')\,d\theta = \int \pi(\theta')\,K(\theta', \theta)\,d\theta = \pi(\theta')\int K(\theta', \theta)\,d\theta = \pi(\theta')$$

 Many more kernels are 𝜋 − invariant than 𝜋 − reversible.


 Fortunately, it is easier to construct a transition kernel that is 𝜋 −reversible
than just 𝜋 −invariant.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 23


Aperiodicity
 We already have seen that 𝜋 − invariance is not enough to guarantee that the
chain converges to 𝜋.
 In addition we need: aperiodicity and 𝜋 − irreducibility.
 Aperiodicity: Let 𝑀 be an irreducible Markov chain with transition matrix 𝐾 and
let 𝜽 be a fixed state. Define the set

𝑇 = {𝑘 > 0: 𝐾^𝑘(𝜃, 𝜃) > 0}


These are the steps on which it is possible for a chain which starts in state 𝜃 to
revisit 𝜃. The greatest common divisor (g.c.d.) of the integers in 𝑇 is called the
period of state 𝜃.
 The chain is said to be periodic if the period of any of its states is greater
than one.
 A state with period one is aperiodic, i.e. one does not visit in a periodic way
the state-space.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 24
Irreducibility and Ergodicity
 Irreducibility is a measure of the sensitivity of the Markov Chain to initial
conditions
𝐾(𝜃, 𝜃′) is 𝜋-irreducible if, for any set $A \subset \Omega$ with $\int_A \pi(\theta)\,d\theta > 0$,
Pr(Θ𝑛 ∈ 𝐴 for some finite 𝑛 | Θ0 = 𝜃) > 0, so that the chain can hit any set that has positive probability under 𝜋.

 It is satisfied if ∀ 𝜃 ′ : 𝜋(𝜃 ′ ) > 0 ⇒ 𝑞(𝜃, 𝜃 ′ ) > 0 ∀ θ


 Theorem (Ergodicity from reversibility)

Let 𝜋(𝜃) be a given probability density on Ω. If 𝐾(𝜃, 𝜃′) is 𝜋-irreducible, and if 𝐾 is reversible with respect to 𝜋 and aperiodic, then

$$\int_A \pi^{(n)}(\theta)\,d\theta \to \int_A \pi(\theta)\,d\theta \quad \text{as } n \to \infty$$

for any set $A \subset \Omega$ and starting distribution $\pi^{(0)}$.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 25
Irreducibility and Aperiodicity
 To ensure irreducibility, a sufficient but not necessary condition is that

𝜋(𝜃′) > 0 ⇒ 𝑞(𝜃, 𝜃′) > 0 ∀ θ

 Aperiodicity is automatically ensured as there is always a strictly positive


probability to reject the candidate.

 Theoretically, the MH algorithm converges under very weak assumptions to


the target distribution 𝜋.

 The convergence can be very slow.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 26


Sampling from a Mixture of Gaussians
 The MatLab demo here shows how you can sample from a probability
distribution known up to a normalizing constant using MCMC with random
walk proposals.

 Suppose that the probability distribution you want to sample from is 𝜋(𝑥).
 1. Initialize 𝑥.
 2. Propose a new 𝑥𝑛𝑒𝑤 ~ 𝑞(𝑥, 𝑥𝑛𝑒𝑤 ) = 𝒩(𝑥𝑛𝑒𝑤 |𝑥, 𝑠2). Here 𝑞(𝑥, 𝑥𝑛𝑒𝑤 ) leads
to a reversible Markov Chain and the classic Metropolis algorithm is used.
 3. Draw a random number 𝑢 ~ 𝒰[0, 1].
 4. If 𝑢 <= min(1, 𝜋(𝑥𝑛𝑒𝑤 ) /𝜋(𝑥)), accept the move, i.e. 𝑥 = 𝑥𝑛𝑒𝑤 .
 5. Otherwise reject the move.

 The target distribution is a 50% − 50% mixture of two Gaussians.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 27
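The MatLab demo itself is not reproduced here; the following Python/NumPy sketch carries out the same five steps for an illustrative 50%-50% mixture of two unit-variance Gaussians (the mixture means and the scale s are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

def pi(x):
    """Unnormalized 50%-50% mixture of N(-2, 1) and N(2, 1) (illustrative target)."""
    return 0.5 * np.exp(-0.5 * (x + 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x - 2.0) ** 2)

s, n_iter = 2.0, 50_000                   # random walk scale and chain length
x = 0.0                                   # 1. initialize x
samples = np.empty(n_iter)
for i in range(n_iter):
    x_new = rng.normal(x, s)              # 2. propose x_new ~ N(x, s^2)
    u = rng.uniform()                     # 3. draw u ~ U[0, 1]
    if u <= min(1.0, pi(x_new) / pi(x)):  # 4. accept the move ...
        x = x_new
    samples[i] = x                        # 5. ... otherwise keep the current x
print("sample mean:", samples.mean())     # close to 0 for this symmetric mixture
```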


Sampling from a Mixture of Gaussians
[Figure: chains of MCMC samples for random walk proposals 𝑞(𝑥𝑛𝑒𝑤 | 𝑥𝑜𝑙𝑑) = 𝒩(𝑥𝑜𝑙𝑑, 𝑠²) with 𝑠 = 2.0, 0.5, 0.1, and 0.05]
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 28


Selecting the Proposal in Random Walk
 Consider a random walk move. There is no clear guideline how to select the
proposal distribution.

 When the variance of the random walk increments (if it exists) is very small
then the acceptance rate can be expected to be around 0.5 − 0.7.

 You would like to scale the random walk moves such that it is possible to move reasonably fast in regions of positive probability mass under 𝜋.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 29


Random Walk Metropolis-Hastings
 Target: 𝜋(𝑥) = 0.25 𝒩(−3, 2) + 0.75 𝒩(2, 1)
 Random walk proposal: 𝑋𝑛+1 = 𝑋𝑛 + 𝑧𝑛, with 𝑝(𝑧𝑛) = 𝒩(0, 𝜎²), so that 𝑞(𝑥𝑛, 𝑥𝑛+1) = 𝒩(𝑥𝑛, 𝜎²)
 Case: 𝜎 = 5, 𝑥0 = 0.0, length of chain = 10000 (for a C++ implementation see here)

[Figure: ergodic histogram vs. the target density; ergodic mean $\hat I = \frac{1}{N}\sum_i x_i \to \mathbb{E}[X] = 0.75$; normalized autocovariance function $\rho_{ff}(s) = C_{ff}(s)/C_{ff}(0) = C_{ff}(s)/\mathrm{var}(f)$, where $C_{ff}(s) = \frac{1}{N}\sum_{n=1}^{N} f\!\left(x^{(n)}\right) f\!\left(x^{(n+s)}\right) - \left(\frac{1}{N}\sum_{n=1}^{N} f\!\left(x^{(n)}\right)\right)^{2}$]

 Acceptance ratio: 0.38, the best among the three choices considered.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 30


Random Walk Metropolis-Hastings
 Target: 𝜋(𝑥) = 0.25 𝒩(−3, 2) + 0.75 𝒩(2, 1)
 Random walk proposal: 𝑋𝑛+1 = 𝑋𝑛 + 𝑧𝑛, with 𝑝(𝑧𝑛) = 𝒩(0, 𝜎²), so that 𝑞(𝑥𝑛, 𝑥𝑛+1) = 𝒩(𝑥𝑛, 𝜎²)
 Case: 𝜎 = 50, 𝑥0 = 0.0, length of chain = 10000

[Figure: ergodic histogram vs. the target density; normalized autocovariance function]

 Acceptance ratio: 0.05; the acceptance rate is very low, the auto-correlation very high, and thus the convergence rate very slow.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 31
Random Walk Metropolis-Hastings
 Target: 𝜋(𝑥) = 0.25 𝒩(−3, 2) + 0.75 𝒩(2, 1)
 Random walk proposal: 𝑋𝑛+1 = 𝑋𝑛 + 𝑧𝑛, with 𝑝(𝑧𝑛) = 𝒩(0, 𝜎²), so that 𝑞(𝑥𝑛, 𝑥𝑛+1) = 𝒩(𝑥𝑛, 𝜎²)
 Case: 𝜎 = 0.5, 𝑥0 = 0.0, length of chain = 10000

[Figure: ergodic histogram vs. the target density; normalized autocovariance function]

 Acceptance ratio: 0.76; the acceptance rate is the highest of the three cases considered, but the auto-correlation is also the highest and thus the convergence rate is very slow.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 32


Example
 Consider the case where

$$\pi(\theta) \propto e^{-\frac{\theta^{2}}{2}}$$

 We implement the MH algorithm for

$$q_1(\theta, \theta') \propto e^{-\frac{(\theta - \theta')^{2}}{2(0.2)^{2}}}$$

 We also implement the MH algorithm for

$$q_2(\theta, \theta') \propto e^{-\frac{(\theta' - \theta)^{2}}{2(5)^{2}}}$$

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 33


Example
 MCMC output for 𝑞1: we estimate 𝔼(𝜃) = 0.0126 and 𝑉𝑎𝑟(𝜃) = 0.9371.

[Figure: trace of the chain (left) and histogram of the samples vs. the target density (right) for $\pi(\theta) \propto e^{-\theta^{2}/2}$ and $q_1(\theta, \theta') \propto e^{-(\theta - \theta')^{2}/(2(0.2)^{2})}$]

A MatLab implementation is given here

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 34


Example
 MCMC output for 𝑞2: we estimate 𝔼(𝜃) = 0.0034 and 𝑉𝑎𝑟(𝜃) = 1.0081.

[Figure: trace of the chain (left) and histogram of the samples vs. the target density (right) for $\pi(\theta) \propto e^{-\theta^{2}/2}$ and $q_2(\theta, \theta') \propto e^{-(\theta' - \theta)^{2}/(2(5)^{2})}$]

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 35


Example: Bimodal Distribution
 Exploration of a bimodal distribution using a random walk MH algorithm

[Figure: evolution of the sample histogram (density over 𝑥) with iteration]

A MatLab implementation is given here

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 36


Example: Bimodal Distribution
 Bad exploration of a bimodal distribution using a random walk MH algorithm.
The variance of the random walk increments is too small.

[Figure: evolution of the sample histogram (density over 𝑥) with iteration]

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 37


Random Walk Metropolis-Hastings
 A rule of thumb is to have an average acceptance ratio between 0.2 and 0.4.

 You should not adapt 𝜎 2 on the fly in order to achieve an acceptance ratio in
that range.

 The chain is not Markov anymore and the desired convergence properties
might be lost.

 Heavy-tailed increments (heavy tails of the random walk distribution) can prevent you from getting trapped in modes.

 High-dimensional proposal distributions 𝑞 𝜃, 𝜃 ′ are difficult to select.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 38


Independent MH: Example
 Consider the case where

$$\pi(\theta) \propto e^{-\frac{\theta^{2}}{2}}$$

 We implement the MH algorithm for

$$q_1(\theta) \propto e^{-\frac{\theta^{2}}{2(0.2)^{2}}}$$

so that 𝜋(𝜃)/𝑞1(𝜃) → ∞ as 𝜃 → ∞, and for

$$q_2(\theta) \propto e^{-\frac{\theta^{2}}{2(5)^{2}}}$$

so that 𝜋(𝜃)/𝑞2(𝜃) ≤ 𝑀 for all 𝜃.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 39


Independent MH: Example
 MCMC output for 𝑞1: we estimate 𝔼(𝜃) = 0.0174 and 𝑉𝑎𝑟(𝜃) = 0.1374.

[Figure: trace of the chain (left) and histogram of the samples (right) for $\pi(\theta) \propto e^{-\theta^{2}/2}$ and the independent proposal $q_1(\theta) \propto e^{-\theta^{2}/(2(0.2)^{2})}$]

A MatLab implementation is given here

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 40


Independent MH: Example
 MCMC output for 𝑞2: we estimate 𝔼(𝜃) = 0.0193 and 𝑉𝑎𝑟(𝜃) = 1.0107.

[Figure: trace of the chain (left) and histogram of the samples (right) for $\pi(\theta) \propto e^{-\theta^{2}/2}$ and the independent proposal $q_2(\theta) \propto e^{-\theta^{2}/(2(5)^{2})}$]

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 41


Independent Metropolis-Hastings
 Target: 𝜋(𝑥) = 0.25𝒩(−3, 2) + 0.75 𝒩(2, 1)
 Independent Proposal : 𝑞(𝑥) = 𝒩(0, 𝜎2)
 Independent walk proposals are used to jump from one mode to another
 Cases shown: 𝜎 = 1, 𝜎 = 10

[Figure: ergodic mean for the two values of 𝜎 (true value 0.75); autocorrelation for the two proposals. Acceptance ratio ∼ 0.24 for both proposals.]

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 42


Mixture of Proposals
 In practice, random walk proposals can be used to explore the space locally, whereas independent proposals can be used to make large jumps in the space.

 A good strategy can be to use a proposal distribution of the following mixture form

$$q(\theta, \theta') = \lambda\,q_1(\theta') + (1 - \lambda)\,q_2(\theta, \theta')$$

where 0 < 𝜆 < 1.

 This algorithm is valid (satisfies all needed properties of transition kernels) as


it is a particular case of the MH algorithm.

 Combining random walk (conservative small steps) with independent (large


jumps) proposals takes advantage of the merits of both algorithms.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 43
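A sketch (Python/NumPy; the particular Gaussian components and the value of λ are illustrative assumptions) of an MH step with the mixture proposal q(θ, θ′) = λ q1(θ′) + (1 − λ) q2(θ, θ′). Note that the acceptance ratio uses the full mixture density in both directions:

```python
import numpy as np

def normal_pdf(x, mu, s):
    """Density of N(mu, s^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def mixture_q_pdf(theta, theta_p, lam, s1, s2):
    """q(theta, theta') = lam * q1(theta') + (1 - lam) * q2(theta, theta')."""
    return lam * normal_pdf(theta_p, 0.0, s1) + (1.0 - lam) * normal_pdf(theta_p, theta, s2)

def mixture_proposal_mh(log_pi, theta0, n_iter, lam=0.2, s1=10.0, s2=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta, chain = theta0, [theta0]
    for _ in range(n_iter):
        # draw from the mixture: independent component q1 w.p. lam, random walk q2 otherwise
        if rng.uniform() < lam:
            theta_star = rng.normal(0.0, s1)
        else:
            theta_star = rng.normal(theta, s2)
        # MH acceptance with the full mixture proposal density in both directions
        ratio = (np.exp(log_pi(theta_star) - log_pi(theta))
                 * mixture_q_pdf(theta_star, theta, lam, s1, s2)
                 / mixture_q_pdf(theta, theta_star, lam, s1, s2))
        if rng.uniform() < min(1.0, ratio):
            theta = theta_star
        chain.append(theta)
    return np.array(chain)
```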


Mixture of MH Kernels
 An alternative is to use a transition kernel

$$K(\theta, \theta') = \lambda\,K_1(\theta, \theta') + (1 - \lambda)\,K_2(\theta, \theta')$$

where 𝐾1 (respectively, 𝐾2) is an MH kernel with proposal 𝑞1 (respectively, 𝑞2).

 This algorithm is different from using 𝑞(𝜃, 𝜃′) = 𝜆𝑞1(𝜃′) + (1 − 𝜆)𝑞2(𝜃, 𝜃′).

 It is computationally cheaper and 𝐾(𝜃, 𝜃′) has 𝜋(𝜃) as its invariant distribution:

$$\int \pi(\theta)\,K(\theta, \theta')\,d\theta = \lambda \int \pi(\theta)\,K_1(\theta, \theta')\,d\theta + (1 - \lambda)\int \pi(\theta)\,K_2(\theta, \theta')\,d\theta = \lambda\,\pi(\theta') + (1 - \lambda)\,\pi(\theta') = \pi(\theta')$$

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 44


Mixture of MH Kernels
 A sufficient condition to ensure that 𝐾 is irreducible and aperiodic is to have
either 𝐾1 or 𝐾2 irreducible and aperiodic.

 You do NOT need to have both kernels to be irreducible and aperiodic.

 In the limiting case, you could have 𝐾2 (𝜃, 𝜃′) = 𝛿𝜃 (𝜃′) and the total kernel 𝐾
would still be irreducible and aperiodic if 𝐾1 is irreducible and aperiodic.

 None of the kernels have to be irreducible and aperiodic to ensure that 𝐾 is


irreducible and aperiodic (sufficient but not necessary condition).

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 45


Composition of MH Kernels
 Alternatively, we can apply at each iteration of the algorithm first the kernel 𝐾1 and then the kernel 𝐾2, i.e. at iteration 𝑖 we have

$$Z \sim K_1\left(\theta^{(i-1)}, \cdot\right) \quad \text{and} \quad \theta^{(i)} \sim K_2(Z, \cdot)$$

 The composition of these kernels corresponds to

$$K(\theta, \theta') = \int K_1(\theta, z)\,K_2(z, \theta')\,dz$$

 If 𝐾1 and 𝐾2 are both 𝜋-invariant, then the composition is also 𝜋-invariant.

 The algorithm admits the right invariant distribution as

$$\int \pi(\theta)\,K(\theta, \theta')\,d\theta = \int \left(\int \pi(\theta)\,K_1(\theta, z)\,d\theta\right) K_2(z, \theta')\,dz = \int \pi(z)\,K_2(z, \theta')\,dz = \pi(\theta')$$

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 46
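A sketch of composing two π-invariant MH kernels, applying K1 and then K2 within one iteration (Python/NumPy; mh_step is a hypothetical helper performing a single MH accept/reject with the given proposal):

```python
import numpy as np

def mh_step(log_pi, theta, q_sample, q_logpdf, rng):
    """One MH accept/reject step with proposal q; returns the next state."""
    theta_star = q_sample(theta, rng)
    log_ratio = (log_pi(theta_star) + q_logpdf(theta_star, theta)
                 - log_pi(theta) - q_logpdf(theta, theta_star))
    return theta_star if np.log(rng.uniform()) < min(0.0, log_ratio) else theta

def composed_kernel(log_pi, theta, kernels, rng):
    """K = K2 o K1: apply each pi-invariant MH kernel in turn.
    `kernels` is a list of (q_sample, q_logpdf) pairs, one per kernel."""
    for q_sample, q_logpdf in kernels:
        theta = mh_step(log_pi, theta, q_sample, q_logpdf, rng)
    return theta
```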


Mixture and Composition of MH Kernels
 In practice, the choice of the proposal distribution is crucial.

 In high-dimensional problems, a simple MH algorithm is useless. It will be


necessary to use a combination of MH kernels.

 Using mixture and composition of kernels can be a powerful approach.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 47


Mixture and Composition of MH algorithms
 Consider the target distribution 𝜋 𝜃1 , 𝜃2 .

 We use two MH kernels to sample from this distribution

 𝐾1 updates 𝜃1 and keeps 𝜃2 fixed whereas

 𝐾2 updates 𝜃2 and keeps 𝜃1 fixed.

 We then combine 𝐾1 and 𝐾2 through mixture or composition.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 48


Description of Transition Kernels
 The proposal 𝑞̄1(𝜃, 𝜃′) associated with 𝐾1(𝜃, 𝜃′) is given by

$$\bar{q}_1(\theta, \theta') = \bar{q}_1\big((\theta_1, \theta_2), (\theta_1', \theta_2')\big) = q_1\big((\theta_1, \theta_2), \theta_1'\big)\,\delta_{\theta_2}(\theta_2')$$

 The acceptance probability is given by $\alpha_1(\theta, \theta') = \min\big(1, r_1(\theta, \theta')\big)$, where:

$$r_1(\theta, \theta') = \frac{\pi(\theta')\,\bar{q}_1(\theta', \theta)}{\pi(\theta)\,\bar{q}_1(\theta, \theta')} = \frac{\pi(\theta_1', \theta_2')\,q_1\big((\theta_1', \theta_2'), \theta_1\big)\,\delta_{\theta_2'}(\theta_2)}{\pi(\theta_1, \theta_2)\,q_1\big((\theta_1, \theta_2), \theta_1'\big)\,\delta_{\theta_2}(\theta_2')} = \frac{\pi(\theta_1', \theta_2)\,q_1\big((\theta_1', \theta_2), \theta_1\big)}{\pi(\theta_1, \theta_2)\,q_1\big((\theta_1, \theta_2), \theta_1'\big)} = \frac{\pi(\theta_1' \mid \theta_2)\,q_1\big((\theta_1', \theta_2), \theta_1\big)}{\pi(\theta_1 \mid \theta_2)\,q_1\big((\theta_1, \theta_2), \theta_1'\big)}$$

 This move is also equivalent to an MH step with invariant distribution 𝜋(𝜃1 | 𝜃2).

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 49


Description of Transition Kernels
 The proposal 𝑞̄2(𝜃, 𝜃′) associated with 𝐾2(𝜃, 𝜃′) is given by

$$\bar{q}_2(\theta, \theta') = \bar{q}_2\big((\theta_1, \theta_2), (\theta_1', \theta_2')\big) = q_2\big((\theta_1, \theta_2), \theta_2'\big)\,\delta_{\theta_1}(\theta_1')$$

 The acceptance probability is given by $\alpha_2(\theta, \theta') = \min\big(1, r_2(\theta, \theta')\big)$, where:

$$r_2(\theta, \theta') = \frac{\pi(\theta')\,\bar{q}_2(\theta', \theta)}{\pi(\theta)\,\bar{q}_2(\theta, \theta')} = \frac{\pi(\theta_1', \theta_2')\,q_2\big((\theta_1', \theta_2'), \theta_2\big)\,\delta_{\theta_1'}(\theta_1)}{\pi(\theta_1, \theta_2)\,q_2\big((\theta_1, \theta_2), \theta_2'\big)\,\delta_{\theta_1}(\theta_1')} = \frac{\pi(\theta_1, \theta_2')\,q_2\big((\theta_1, \theta_2'), \theta_2\big)}{\pi(\theta_1, \theta_2)\,q_2\big((\theta_1, \theta_2), \theta_2'\big)} = \frac{\pi(\theta_2' \mid \theta_1)\,q_2\big((\theta_1, \theta_2'), \theta_2\big)}{\pi(\theta_2 \mid \theta_1)\,q_2\big((\theta_1, \theta_2), \theta_2'\big)}$$

 This move is also equivalent to an MH step with invariant distribution 𝜋(𝜃2 | 𝜃1).

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 50


Composition of MH Kernels
 Assume we use a composition of these kernels, then the resulting algorithm
proceeds as follows at iteration 𝑖.

 MH Step to Update Component 1

 Sample $\theta_1^* \sim q_1\big((\theta_1^{(i-1)}, \theta_2^{(i-1)}), \cdot\big)$ and compute

$$\alpha_1\Big(\big(\theta_1^{(i-1)}, \theta_2^{(i-1)}\big), \big(\theta_1^*, \theta_2^{(i-1)}\big)\Big) = \min\left(1,\ \frac{\pi\big(\theta_1^* \mid \theta_2^{(i-1)}\big)\,q_1\big((\theta_1^*, \theta_2^{(i-1)}), \theta_1^{(i-1)}\big)}{\pi\big(\theta_1^{(i-1)} \mid \theta_2^{(i-1)}\big)\,q_1\big((\theta_1^{(i-1)}, \theta_2^{(i-1)}), \theta_1^*\big)}\right)$$

 With probability $\alpha_1\big((\theta_1^{(i-1)}, \theta_2^{(i-1)}), (\theta_1^*, \theta_2^{(i-1)})\big)$, set $\theta_1^{(i)} = \theta_1^*$; otherwise set $\theta_1^{(i)} = \theta_1^{(i-1)}$.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 51


Composition of MH Kernels
 Assume we use a composition of these kernels, then the resulting algorithm
proceeds as follows at iteration 𝑖.

 MH Step to Update Component 2

 Sample $\theta_2^* \sim q_2\big((\theta_1^{(i)}, \theta_2^{(i-1)}), \cdot\big)$ and compute

$$\alpha_2\Big(\big(\theta_1^{(i)}, \theta_2^{(i-1)}\big), \big(\theta_1^{(i)}, \theta_2^*\big)\Big) = \min\left(1,\ \frac{\pi\big(\theta_2^* \mid \theta_1^{(i)}\big)\,q_2\big((\theta_1^{(i)}, \theta_2^*), \theta_2^{(i-1)}\big)}{\pi\big(\theta_2^{(i-1)} \mid \theta_1^{(i)}\big)\,q_2\big((\theta_1^{(i)}, \theta_2^{(i-1)}), \theta_2^*\big)}\right)$$

 With probability $\alpha_2\big((\theta_1^{(i)}, \theta_2^{(i-1)}), (\theta_1^{(i)}, \theta_2^*)\big)$, set $\theta_2^{(i)} = \theta_2^*$; otherwise set $\theta_2^{(i)} = \theta_2^{(i-1)}$.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 52


Mixture of MH Kernels
 Assume we use an even mixture of these kernels, then the resulting algorithm
proceeds as follows at iteration 𝑖.

 Sample the index of the component to update: 𝐽 ∼ 𝒰{1, 2}

 Set $\theta_{-J}^{(i)} = \theta_{-J}^{(i-1)}$

 Sample $\theta_J^* \sim q_J\big((\theta_1^{(i-1)}, \theta_2^{(i-1)}), \cdot\big)$ and compute

$$\alpha_J\Big(\big(\theta_J^{(i-1)}, \theta_{-J}^{(i)}\big), \big(\theta_J^*, \theta_{-J}^{(i)}\big)\Big) = \min\left(1,\ \frac{\pi\big(\theta_J^* \mid \theta_{-J}^{(i)}\big)\,q_J\big((\theta_J^*, \theta_{-J}^{(i)}), \theta_J^{(i-1)}\big)}{\pi\big(\theta_J^{(i-1)} \mid \theta_{-J}^{(i)}\big)\,q_J\big((\theta_J^{(i-1)}, \theta_{-J}^{(i)}), \theta_J^*\big)}\right)$$

 With probability $\alpha_J\big((\theta_J^{(i-1)}, \theta_{-J}^{(i)}), (\theta_J^*, \theta_{-J}^{(i)})\big)$, set $\theta_J^{(i)} = \theta_J^*$; otherwise set $\theta_J^{(i)} = \theta_J^{(i-1)}$.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 53
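A sketch of the composition version of these component-wise updates (Python/NumPy; the symmetric random-walk proposals per component are illustrative). Because each proposal changes only one component, the conditional-density ratios above reduce to ratios of the joint density:

```python
import numpy as np

def componentwise_mh(log_pi, theta0, n_iter, step_sizes, seed=0):
    """Metropolis-within-Gibbs by composition: update each component in turn with a
    random-walk MH step; only the joint log density log_pi is needed."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    chain = [theta.copy()]
    for _ in range(n_iter):
        for k in range(len(theta)):                      # K1, then K2, then ...
            proposal = theta.copy()
            proposal[k] += step_sizes[k] * rng.normal()  # symmetric proposal in component k
            if np.log(rng.uniform()) < min(0.0, log_pi(proposal) - log_pi(theta)):
                theta = proposal
        chain.append(theta.copy())
    return np.array(chain)
```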


Properties
 It is clear that in such cases both 𝐾1 and 𝐾2 are NOT irreducible and
aperiodic.

⇒ Each of them only updates one component!!!!

 However, the composition and mixture of these kernels can be irreducible and
aperiodic because then all the components are updated.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 54


Discussion
 For parameter space 𝜃 = (𝜃1, . . . , 𝜃𝑝),

 we update each parameter 𝜃𝑘 according to an MH step with proposal distribution $q_k(\theta_{1:p}, \theta_k') = q_k\big((\theta_{-k}, \theta_k), \theta_k'\big)$ and

 invariant distribution 𝜋(𝜃𝑘 | 𝜃−𝑘).

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 55


Using Full Conditionals Leads to Gibbs Sampler
 Consider now the case where

$$q_1\big((\theta_1, \theta_2), \theta_1'\big) = \pi(\theta_1' \mid \theta_2)$$

then

$$r_1(\theta, \theta') = \frac{\pi(\theta_1' \mid \theta_2)\,q_1\big((\theta_1', \theta_2), \theta_1\big)}{\pi(\theta_1 \mid \theta_2)\,q_1\big((\theta_1, \theta_2), \theta_1'\big)} = \frac{\pi(\theta_1' \mid \theta_2)\,\pi(\theta_1 \mid \theta_2)}{\pi(\theta_1 \mid \theta_2)\,\pi(\theta_1' \mid \theta_2)} = 1$$

 Similarly, if $q_2\big((\theta_1, \theta_2), \theta_2'\big) = \pi(\theta_2' \mid \theta_1)$, then $r_2(\theta, \theta') = 1$.

 Using as proposal distributions in MH the conditional distributions gives you


the Gibbs sampler!

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 56


General Hybrid Algorithm
 To sample from 𝜋(𝜃),𝜃 = 𝜃1 , 𝜃2 , . . . , 𝜃𝑝 , we can use the following algorithm at
iteration 𝑖.

 Iteration 𝑖, 𝑖 ≥ 1

 For 𝑘 = 1: 𝑝

 Sample $\theta_k^{(i)}$ using an MH step with proposal distribution $q_k\big((\theta_{-k}^{(i)}, \theta_k^{(i-1)}), \theta_k'\big)$ and target $\pi\big(\theta_k \mid \theta_{-k}^{(i)}\big)$,

where $\theta_{-k}^{(i)} = \big(\theta_1^{(i)}, \ldots, \theta_{k-1}^{(i)}, \theta_{k+1}^{(i-1)}, \ldots, \theta_p^{(i-1)}\big)$.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 57


General Hybrid Algorithm
 If we have 𝑞𝑘 𝜃1:𝑝 , 𝜃𝑘′ = 𝜋 𝜃𝑘′ |𝜃−𝑘 then we are back to the Gibbs sampler.

 Update some parameters according to 𝜋 𝜃𝑘′ |𝜃−𝑘 (and the move is


automatically accepted) and the rest according to different proposals. For
example:

 For 𝜋 𝜃1 , 𝜃2 , sample from 𝜋 𝜃1 |𝜃2 and

 Then use an MH step of invariant distribution 𝜋 𝜃2 |𝜃1 .

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 58


General Hybrid Algorithm
 At iteration 𝑖, 𝑖 ≥ 1

 Sample $\theta_1^{(i)} \sim \pi\big(\theta_1 \mid \theta_2^{(i-1)}\big)$

 Sample $\theta_2^{(i)}$ using an MH step with proposal distribution $q_2\big((\theta_1^{(i)}, \theta_2^{(i-1)}), \theta_2'\big)$ and target $\pi\big(\theta_2 \mid \theta_1^{(i)}\big)$

 There is no need to run the MH algorithm for multiple steps to ensure that $\theta_2^{(i)} \sim \pi\big(\theta_2 \mid \theta_1^{(i)}\big)$.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 59


Using Gradient Information to Build 𝒒 𝜽, 𝜽 ′

 We usually want to sample candidates in regions of high probability

 We can use 2
 '    log  ( )   V , V ~ N (0,1)
2

where 𝜎2 is selected such that the acceptance ratio is approximately 0.57.

 The motivation comes from the continuous-time case where


1
dt   log  ( )   dWt
2

admits 𝜋 as an invariant distribution.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 60
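A sketch of one MH step with this gradient-informed (Langevin-type) proposal (Python/NumPy; grad_log_pi is assumed supplied by the user, and the Gaussian proposal density is written up to constants that cancel in the ratio):

```python
import numpy as np

def langevin_mh_step(log_pi, grad_log_pi, theta, sigma, rng):
    """One MH step with proposal theta' = theta + (sigma^2 / 2) * grad log pi(theta) + sigma * V."""
    def drift(t):
        return t + 0.5 * sigma**2 * grad_log_pi(t)

    theta_star = drift(theta) + sigma * rng.standard_normal(np.shape(theta))

    def log_q(t_from, t_to):
        # log q(t_from, t_to) for the Gaussian proposal, up to an additive constant
        return -np.sum((t_to - drift(t_from)) ** 2) / (2.0 * sigma**2)

    log_ratio = (log_pi(theta_star) + log_q(theta_star, theta)
                 - log_pi(theta) - log_q(theta, theta_star))
    return theta_star if np.log(rng.uniform()) < min(0.0, log_ratio) else theta
```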


Alternative Acceptance Probabilities
 The standard MH algorithm uses the acceptance probability:

$$\alpha(\theta, \theta') = \min\left(1,\ \frac{\pi(\theta')\,q(\theta', \theta)}{\pi(\theta)\,q(\theta, \theta')}\right)$$

 One can also use the 𝛼(𝜃, 𝜃′) below with any function 𝛿(𝜃′, 𝜃)

$$\alpha(\theta, \theta') = \frac{\delta(\theta', \theta)}{\pi(\theta)\,q(\theta, \theta')}$$

which is such that 𝛿(𝜃′, 𝜃) = 𝛿(𝜃, 𝜃′) and 0 ≤ 𝛼(𝜃, 𝜃′) ≤ 1.

 For example (Baker, 1965):

$$\alpha(\theta, \theta') = \frac{\pi(\theta')\,q(\theta', \theta)}{\pi(\theta')\,q(\theta', \theta) + \pi(\theta)\,q(\theta, \theta')}$$

Note that 0 ≤ 𝛼(𝜃, 𝜃′) ≤ 1 and

$$\delta(\theta, \theta') = \frac{\pi(\theta')\,q(\theta', \theta)\,\pi(\theta)\,q(\theta, \theta')}{\pi(\theta')\,q(\theta', \theta) + \pi(\theta)\,q(\theta, \theta')} = \delta(\theta', \theta).$$
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 61
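A small sketch comparing the two choices for a given ratio r = π(θ′)q(θ′, θ) / [π(θ)q(θ, θ′)]:

```python
def mh_acceptance(r):
    """Standard Metropolis-Hastings choice: alpha = min(1, r)."""
    return min(1.0, r)

def baker_acceptance(r):
    """Baker (1965): alpha = pi' q' / (pi' q' + pi q) = r / (1 + r)."""
    return r / (1.0 + r)

for r in (0.1, 1.0, 10.0):
    print(r, mh_acceptance(r), baker_acceptance(r))   # MH accepts at least as often
```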
Alternative Acceptance Probabilities
 Indeed, one can check that

$$K(\theta, \theta') = \alpha(\theta, \theta')\,q(\theta, \theta') + \left(1 - \int \alpha(\theta, u)\,q(\theta, u)\,du\right)\delta_{\theta}(\theta')$$

is 𝜋-reversible.

 We have:

$$\pi(\theta)\,\alpha(\theta, \theta')\,q(\theta, \theta') = \pi(\theta)\,q(\theta, \theta')\,\frac{\delta(\theta, \theta')}{\pi(\theta)\,q(\theta, \theta')} = \delta(\theta, \theta') = \delta(\theta', \theta) = \pi(\theta')\,q(\theta', \theta)\,\frac{\delta(\theta', \theta)}{\pi(\theta')\,q(\theta', \theta)} = \pi(\theta')\,\alpha(\theta', \theta)\,q(\theta', \theta)$$

 The MH acceptance is favored as it increases the acceptance probability.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 62


Hybrid Hamiltonian Metropolis Proposal
 Hybrid MC is essentially Metropolis with a special choice of a proposal.

 Assume that you want to sample 𝑥 ~ 𝜋(𝑥), where 𝜋(𝑥) is known up to a


proportionality constant.

 Consider that 𝑥 represents the position of some “real particles”. Then write:

𝜋(𝑥) = exp(−𝑉(𝑥))

where we have defined: 𝑉(𝑥) = −log 𝜋(𝑥).

 This is similar to the Boltzmann distribution at inverse temperature equal to


one (statistical mechanics).

 Think of 𝑉(𝑥) as the potential of the system at 𝑥.


Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 63
Hybrid Hamiltonian Metropolis Proposal
 To complete the picture, introduce the momenta 𝑝 (of the same dimension as
𝑥) and write a probability distribution in the extended space:

𝑥, 𝑝 ~ 𝜋(𝑥, 𝑝) = 𝜋(𝑥) × 𝒩(𝑝|0, 1) ∝ exp(−𝑉(𝑥) − 𝑝2 / 2)

 Write 𝐻(𝑥, 𝑝) = 𝑉(𝑥) + 𝑝2 / 2 for the Hamiltonian of the system.

 To construct the proposal, we bring into the picture “the dynamics described
by the Hamiltonian”.

 If you integrate the equations of motion described by 𝐻(𝑥, 𝑝) starting at any


initial condition for a long time, you will get a sample from 𝜋(𝑥, 𝑝).

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 64


Hybrid Hamiltonian Metropolis Proposal
𝑥, 𝑝 ~ 𝜋(𝑥, 𝑝) = 𝜋(𝑥) × 𝒩(𝑝|0, 1) ∝ exp(−𝑉(𝑥) − 𝑝2 / 2)

 Based on this the hybrid Metropolis proposal is constructed as:

 1. Sample an initial 𝑝 from a Gaussian (in 𝜋(𝑥, 𝑝), 𝑥 and 𝑝 are decoupled and the
probability distribution of 𝑝 is 𝒩(𝑝|0, 1))

 2. Evolve the equations of motion for a finite amount of time using a finite time
step.

 3. Use the 𝑥 at the final step as the proposed move.

 Notice that the proposal built this way is reversible if the integration scheme is reversible. To guarantee this, we use the leapfrog integration scheme for the equations of motion, which is time-reversible and (approximately) preserves the value of the Hamiltonian.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 65
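A sketch of one hybrid (Hamiltonian) Metropolis step with a leapfrog integrator (Python/NumPy; the user supplies the potential V(x) = −log π(x) and its gradient, and the step size and number of leapfrog steps are tuning assumptions):

```python
import numpy as np

def hmc_step(V, grad_V, x, step_size, n_leapfrog, rng):
    """One hybrid Monte Carlo step: sample momenta, integrate the Hamiltonian dynamics
    with the leapfrog scheme, then accept/reject based on H(x, p) = V(x) + p^2 / 2."""
    p = rng.standard_normal(np.shape(x))                 # 1. p ~ N(0, I)
    x_new, p_new = np.copy(x), np.copy(p)
    p_new = p_new - 0.5 * step_size * grad_V(x_new)      # 2. leapfrog: half momentum step
    for _ in range(n_leapfrog):
        x_new = x_new + step_size * p_new                #    full position step
        p_new = p_new - step_size * grad_V(x_new)        #    full momentum step
    p_new = p_new + 0.5 * step_size * grad_V(x_new)      #    undo the extra half step
    H_old = V(x) + 0.5 * np.sum(p ** 2)
    H_new = V(x_new) + 0.5 * np.sum(p_new ** 2)
    if np.log(rng.uniform()) < H_old - H_new:            # 3. Metropolis accept/reject
        return x_new
    return x
```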
Hybrid Hamiltonian Metropolis Proposal
[Figure: samples obtained after 50 MH (hybrid Monte Carlo) steps]

MatLab implementation with animation of the dynamics

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 66
