
The Metropolis-Hastings Algorithm
Prof. Nicholas Zabaras

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

November 1, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 1


Contents
 MCMC, Autoregressive Model, Autocovariance Function, Metropolis-Hastings Algorithm, Metropolis Algorithm, Independent Metropolis-Hastings, Transition Kernel, Reversibility, Irreducibility, Aperiodicity, Examples
 Mixture of Proposals, Composition of MH Kernels, General Hybrid Algorithm,
Alternative Acceptance Probability
 Hamiltonian (Hybrid) Metropolis Proposal

 Arnaud Doucet, Statistical Computing – Monte Carlo Methods (online course)


 Christian P. Robert and George Casella, Monte Carlo Statistical Methods, Springer, 2nd edition (Chapters 6, 7, 9
& 10) (Video, Lecture Slides)
 C.P. Robert, The Metropolis-Hasting Algorithm (with R programs), https://arxiv.org/pdf/1504.01896.pdf
 Julian Besag, Markov Chain Monte Carlo for Statistical Inference (2000) (working paper)
 C. Andrieu, et al. , An Introduction to MCMC for Machine Learning (2003)
 S. Chib and E. Greenberg, Understanding the Metropolis-Hastings algorithm, The American Statistician, 1995
 Java applets for the Metropolis Hastings algorithm
 L. Held, Conditional Prior Proposals in Dynamic Models, Scand. J. Statist., 1999
 M.K. Pitt & N. Shephard, Likelihood Analysis of Non-Gaussian Measurement Time Series, Biometrika, 1996
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 2
Goals
 The goals for today’s lecture include the following:

 Understand the fundamentals of MCMC

 Learn about the Metropolis-Hastings algorithm and its variants

 Understand the use of mixture of proposals and composition of transition kernels

 Understand how to implement hybrid algorithms

 Acquire basic understanding of the Hamiltonian Metropolis proposal

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 3


Markov Chain Monte Carlo
 The simplest way to generate a sequence of random variables and be able to
say something about asymptotics is using Markov Chains.

 A Markov Chain 𝑋𝑛, 𝑛 = 0, 1, 2, …, is fully defined if we know:

 Initial distribution 𝑝0(𝑥0) = Pr[𝑋0 = 𝑥0] (this will prove of little significance)

 Transition Kernel: 𝐾(𝑥𝑛, 𝑥𝑛+1) = Pr[𝑋𝑛+1 = 𝑥𝑛+1 | 𝑋𝑛 = 𝑥𝑛].

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 4


Autoregressive Model
 We generate a sequence of random variables using Markov Chains.
 A Markov Chain 𝑋𝑛 , 𝑛 = 1, 2, … is fully defined if we know:
 Initial distribution 𝑝0(𝑥0) = Pr[𝑋0 = 𝑥0] (this will prove of little significance)
 Transition Kernel: 𝐾(𝑥𝑛, 𝑥𝑛+1 ) = Pr[𝑋𝑛+1 = 𝑥𝑛+1 | 𝑋𝑛 = 𝑥𝑛]

 An example of a Markov chain is an autoregressive model:


𝑋𝑛 = 𝜌𝑋𝑛−1 + 𝑍𝑛 where 𝑋0, 𝑍𝑛 ∼ 𝒩(0, 1) (i.i.d) with |𝜌| < 1

 Initial distribution: 𝑋0 ∼ 𝒩(0, 1)


$$\mathbb{E}[X_n] = 0, \qquad \mathrm{Var}(X_n) = \rho^{2n} + \frac{1-\rho^{2n}}{1-\rho^{2}}$$

 Transition Kernel: 𝑋𝑛 | 𝑋𝑛−1 ∼ 𝒩(𝜌𝑋𝑛−1, 1), |𝜌| < 1

 Asymptotically: 𝑋𝑛 ∼ 𝒩(0, 1/(1 − 𝜌²)).
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 5
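A minimal sketch (Python/NumPy rather than the course's MatLab scripts) of simulating this AR(1) chain and checking that the sample variance approaches the asymptotic value 1/(1 − ρ²):

```python
import numpy as np

rng = np.random.default_rng(0)
rho, N = 0.5, 10_000

x = np.empty(N)
x[0] = rng.normal()                       # X_0 ~ N(0, 1)
for n in range(1, N):
    x[n] = rho * x[n - 1] + rng.normal()  # X_n = rho * X_{n-1} + Z_n, Z_n ~ N(0, 1)

print("sample variance:    ", x.var())
print("asymptotic variance:", 1.0 / (1.0 - rho**2))   # 4/3 for rho = 0.5
```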
Autoregressive Model: Example
 Case: 𝜌 = 0.5, initial state: 𝑋0 ~𝒩(0, 1). Asymptotic variance: 4/3
[Figure: variance of the Markov chain vs. the number of samples (left); histogram of the distribution of samples compared with the exact pdf (right)]

$$\hat\sigma_n^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \hat X_n\right)^2, \qquad \hat X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 6
Autoregressive Model: Example
 Case: 𝜌 = 0.5, initial state: 𝑋0 = −1000 (MatLab implementation)
Since the initial value 𝑋0 here has a significant influence on the estimated "variance" (𝑋0 − 𝑋̂𝑛 is much larger than the other 𝑋𝑛 − 𝑋̂𝑛), the figure of the variance is not presented.

[Figure: histogram of the distribution of samples (compared with the exact pdf)]

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 7


Markov Chain Monte Carlo
 To define a Markov Chain only requires determining a local rule 𝐾 𝑋𝑛 , 𝑋𝑛+1 .

 If we make a good selection for the transition kernel, it could asymptotically


converge to a target distribution independently of where we started from.

 More importantly, we can use the realizations of the Markov Chain in Monte
Carlo estimators i.e. we can average across the path.

 However note that even if 𝑋𝑛 were exact draws, they are not independent
anymore!

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 8


Markov Chain Monte Carlo
 Ergodic Markov chain:
$$\hat{I} = \frac{1}{N}\sum_{n=1}^{N} f(X_n) \;\longrightarrow\; I = \int f(x)\,\pi(x)\,dx$$

𝑋𝑖 form a Markov Chain which asymptotically converges to 𝜋(𝑥) (we


haven’t discussed yet under which conditions this holds)
 We also care about how fast it converges (particularly when each evaluation
of 𝑓 is expensive)
 In standard Monte Carlo using i.i.d. samples we had:

$$\mathrm{Var}\left(\hat{I}\right) = \frac{\mathrm{Var}_{\pi}\, f(x)}{N}$$

 Let us compute 𝑉𝑎𝑟(𝐼̂) for a Markov chain.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 9


Autocovariance Function
$$\hat{I} = \frac{1}{N}\sum_{i=1}^{N} f(X_i) \;\longrightarrow\; I = \int f(x)\,\pi(x)\,dx$$

$$\mathbb{E}\left[\hat{I}\right] = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\left[f(X_i)\right] = \mathbb{E}[f], \qquad \mathrm{var}\left(\hat{I}\right) = \mathbb{E}\left[\left(\hat{I} - I\right)^{2}\right] = \mathbb{E}\left[\left(\frac{1}{N}\sum_{n=1}^{N} f(X_n) - \mathbb{E}[f]\right)^{2}\right] = \frac{1}{N^{2}}\sum_{n=1}^{N}\sum_{m=1}^{N}\mathbb{E}\Big[\big(f(X_n)-\mathbb{E}[f]\big)\big(f(X_m)-\mathbb{E}[f]\big)\Big]$$

 Let 𝑍𝑖 = 𝑓(𝑋𝑖) − 𝔼[𝑓(𝑋𝑖)] and assume it is weakly stationary:

$$\mathrm{var}(Z_i) = \mathbb{E}\left[Z_i^{2}\right] = \sigma^{2}, \qquad \mathbb{E}\left[Z_i Z_j\right] = \sigma^{2}\,\rho(|j-i|) \quad (\rho:\ \text{normalized auto-covariance function})$$

 Then you can easily show that:

$$\mathrm{var}\left(\hat{I}\right) = \frac{1}{N^{2}}\,\mathbb{E}\left[\sum_{n=1}^{N}\sum_{m=1}^{N} Z_n Z_m\right] = \frac{\sigma^{2}}{N^{2}}\Big[N\rho(0) + 2(N-1)\rho(1) + \dots + 2\rho(N-1)\Big] = \frac{\sigma^{2}}{N}\underbrace{\left[1 + 2\sum_{j=1}^{N-1}\left(1 - \frac{j}{N}\right)\rho(j)\right]}_{\tau_f:\ \text{autocovariance time}}$$

 Autocovariance function: $C_{ff}(s) = \mathrm{cov}\big(f(X_n), f(X_{n+s})\big) = \mathbb{E}\big[f(X_n)\,f(X_{n+s})\big] - \mathbb{E}[f]^{2}$, with $\rho_{ff}(s) = C_{ff}(s)/C_{ff}(0) = C_{ff}(s)/\mathrm{var}(f)$.

 For some 𝑀 sufficiently large, 𝜌𝑓𝑓(𝑠) ≈ 0 when 𝑠 ≥ 𝑀.
 For 𝑁 ≫ 𝑀, the 𝑋0 and 𝑋𝑁 samples are essentially uncorrelated.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 10
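A small sketch (Python/NumPy; the function names are mine) of estimating the normalized autocovariance ρ_ff(s) and the autocovariance time τ_f from a chain, following the formulas above:

```python
import numpy as np

def normalized_autocov(f_vals, max_lag):
    """Estimate rho_ff(s) = C_ff(s) / C_ff(0) for s = 0, ..., max_lag."""
    f = np.asarray(f_vals) - np.mean(f_vals)
    N = len(f)
    c = np.array([np.dot(f[: N - s], f[s:]) / N for s in range(max_lag + 1)])
    return c / c[0]

def autocov_time(f_vals, max_lag):
    """tau_f = 1 + 2 * sum_{j=1}^{max_lag} (1 - j/N) * rho_ff(j), truncated at max_lag."""
    rho = normalized_autocov(f_vals, max_lag)
    N = len(f_vals)
    j = np.arange(1, max_lag + 1)
    return 1.0 + 2.0 * np.sum((1.0 - j / N) * rho[1:])

# Example: the AR(1) chain from the earlier slide, for which rho_ff(s) = 0.5**s and tau_f ~ 3
rng = np.random.default_rng(0)
x = np.empty(50_000)
x[0] = rng.normal()
for n in range(1, len(x)):
    x[n] = 0.5 * x[n - 1] + rng.normal()
print("estimated autocovariance time:", autocov_time(x, max_lag=100))
```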
Markov Chain Monte Carlo
 Objective: Given an arbitrary distribution 𝜋 𝒙 , we want to construct a Markov
Chain that asymptotically converges to the target independently of the initial
state.
 We want to use the Markov Chain paths in estimators

$$\hat{I} = \frac{1}{N}\sum_{n=1}^{N} f(X_n) \;\longrightarrow\; I = \int f(x)\,\pi(x)\,dx$$

 This requires coming up with a way to produce suitable transition kernels


𝐾(𝑋𝑛 , 𝑋𝑛+1 ) for any target 𝜋 𝒙 .

 The first successful attempt was the Metropolis algorithm proposed in 1953 by N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller in "Equation of State Calculations by Fast Computing Machines", J. Chem. Phys., 21, pp. 1087. This paper has been cited 42,782 times since then!
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 11
Metropolis-Hastings Algorithm
 This is another way to sample from 𝜋 𝜃 known up to a normalizing constant.

 The algorithm builds a Markov kernel that has 𝜋 𝜃 as its invariant


distribution.

 The algorithm is the basis of many other MCMC algorithms.

 The algorithm requires a proposal distribution (kernel) 𝑞(𝜃, 𝜃′) to propose a candidate 𝜃′ given 𝜃. The following should hold:

∫ 𝑞(𝜃, 𝜃′) 𝑑𝜃′ = 1 for all 𝜃

 𝜃′ is accepted with probability 𝛼(𝜃, 𝜃′), chosen so that 𝜋(𝜃) is the invariant distribution of the transition kernel.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 12


Metropolis – Hastings Algorithm
 Let 𝜋(𝜃) be the target and 𝑞(𝜃, 𝜃′) any (symmetric or not) proposal distribution.

 Initialization: Select (deterministically or randomly) 𝜃 (0) .


 Iteration 𝑖, 𝑖 ≥ 1:
 Draw a proposal 𝜃 ∗ from 𝑞(𝜃 (𝑖−1) , 𝜃 ∗ )
 Calculate the acceptance ratio:

$$\alpha\left(\theta^{(i-1)}, \theta^{*}\right) = \min\left(1,\ \frac{\pi(\theta^{*})\,q\left(\theta^{*}, \theta^{(i-1)}\right)}{\pi\left(\theta^{(i-1)}\right)\,q\left(\theta^{(i-1)}, \theta^{*}\right)}\right)$$

 With probability 𝛼 𝜃 𝑖−1 , 𝜃 ∗ , set 𝜃 (𝑖) = 𝜃 ∗ ; otherwise 𝜃 (𝑖) = 𝜃 (𝑖−1) .

W. Hastings, Monte Carlo Sampling Methods Using Markov Chains and Their Applications, Biometrika, Vol. 57(1), pp. 97-109 (1970).

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 13
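A minimal sketch of the algorithm on this slide (Python/NumPy; working in log densities is an implementation choice for numerical stability, not part of the slide):

```python
import numpy as np

def metropolis_hastings(log_pi, q_sample, q_logpdf, theta0, n_iter, seed=0):
    """Generic Metropolis-Hastings.
    log_pi(theta): log target density, known up to an additive constant
    q_sample(theta, rng): draws a proposal theta* ~ q(theta, .)
    q_logpdf(theta, theta_star): log q(theta, theta*)
    """
    rng = np.random.default_rng(seed)
    theta, chain = theta0, [theta0]
    for _ in range(n_iter):
        theta_star = q_sample(theta, rng)
        log_ratio = (log_pi(theta_star) + q_logpdf(theta_star, theta)
                     - log_pi(theta) - q_logpdf(theta, theta_star))
        if np.log(rng.uniform()) < min(0.0, log_ratio):   # accept with probability alpha
            theta = theta_star
        chain.append(theta)                               # otherwise keep the current state
    return np.array(chain)
```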


Metropolis-Hastings Algorithm
 To implement the Metropolis scheme we only need to know the target density
𝜋(𝜃) up to a constant!
 This is useful in Bayesian inference where the target distribution is the posterior (whose normalizing factor is not known)
𝑝(𝜽 | 𝒟) ∝ 𝑝(𝒟| 𝜽) 𝑝(𝜽)

 𝑞 𝜃, 𝜃 ′ can be any proposal distribution. E.g. one can use 𝜃 ′ ~𝒩 𝜙 𝜃 , 𝜎 2


where 𝜙 𝜃 is any deterministic function of 𝜃 (e.g. a neural network or the
local max of 𝜋 closest to 𝜃).

 Much more flexibility than in Gibbs sampling.

 M-H is a stochastic algorithm. Even if you draw the same 𝜃 ′ twice, this is
accepted with a certain probability.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 14
Metropolis Algorithm
 The original version of the algorithm considers a random walk proposal

𝜃′ = 𝜃 + 𝑍, 𝑍 ∼ 𝑓

(N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller, "Equation of State Calculations by Fast Computing Machines", J. Chem. Phys., 21, pp. 1087, 1953)

 In this case, 𝑞(𝜃, 𝜃′) = 𝑓(𝜃′ − 𝜃).

 The acceptance probability becomes:

$$\alpha(\theta, \theta') = \min\left(1,\ \frac{\pi(\theta')\,q(\theta', \theta)}{\pi(\theta)\,q(\theta, \theta')}\right) = \min\left(1,\ \frac{\pi(\theta')\,f(\theta - \theta')}{\pi(\theta)\,f(\theta' - \theta)}\right)$$

 For symmetric 𝑓, i.e. 𝑓(𝜃′ − 𝜃) = 𝑓(𝜃 − 𝜃′), e.g. 𝑍 ∼ 𝒩(0, Σ):

$$\alpha(\theta, \theta') = \min\left(1,\ \frac{\pi(\theta')\,f(\theta - \theta')}{\pi(\theta)\,f(\theta' - \theta)}\right) = \min\left(1,\ \frac{\pi(\theta')}{\pi(\theta)}\right)$$
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 15
Metropolis Algorithm
 Let 𝜋(𝜃) be the target and 𝑞(𝜃, 𝜃′) a symmetric proposal distribution such that 𝑞(𝜃, 𝜃′) = 𝑞(𝜃′, 𝜃).

 Initialization: Select (deterministically or randomly) 𝜃 (0) .


 Iteration 𝑖, 𝑖 ≥ 1:
 Draw a proposal 𝜃 ∗ from 𝑞(𝜃 (𝑖−1) , 𝜃 ∗ )
 Calculate the acceptance ratio:

$$\alpha\left(\theta^{(i-1)}, \theta^{*}\right) = \min\left(1,\ \frac{\pi(\theta^{*})}{\pi\left(\theta^{(i-1)}\right)}\right)$$

 With probability 𝛼(𝜃 (𝑖−1), 𝜃 ∗), set 𝜃 (𝑖) = 𝜃 ∗; otherwise 𝜃 (𝑖) = 𝜃 (𝑖−1).
N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller, Equation of State Calculations by Fast Computing Machines, J. Chem. Physics, Vol. 21, pp. 1087 (1953)

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 16


Independent Metropolis-Hastings
 If 𝑞(𝜃, 𝜃′) = 𝑞(𝜃′) (independent proposal) then:

$$\alpha(\theta, \theta') = \min\left(1,\ \frac{\pi(\theta')\,q(\theta', \theta)}{\pi(\theta)\,q(\theta, \theta')}\right) = \min\left(1,\ \frac{\pi(\theta')/q(\theta')}{\pi(\theta)/q(\theta)}\right)$$

 Unnormalized 𝜋 ∗ 𝜃 and 𝑞 ∗ 𝜃 can be used.


 When using independent proposals, you would like to have 𝑞(𝜃) ≅ 𝜋(𝜃).
 Similarly to Rejection sampling or Importance Sampling, you need to ensure
that
$$\frac{\pi^{*}(\theta)}{q^{*}(\theta)} \leq M \quad \text{for all } \theta$$
to obtain good performance.
 Without the above constraint in the selection of 𝑞(𝜃), the algorithm might not
work at all.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 17
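A sketch of the independent-proposal case (Python/NumPy; names are mine): the acceptance probability only involves the ratios π*(θ)/q*(θ), so unnormalized densities suffice.

```python
import numpy as np

def independent_mh(log_pi_star, log_q_star, q_sample, theta0, n_iter, seed=0):
    """Independent MH: q(theta, theta') = q(theta'); acceptance uses pi*(theta)/q*(theta)."""
    rng = np.random.default_rng(seed)
    theta = theta0
    log_w = log_pi_star(theta) - log_q_star(theta)      # log "importance weight" of current state
    chain = [theta]
    for _ in range(n_iter):
        theta_star = q_sample(rng)                      # proposal does not depend on theta
        log_w_star = log_pi_star(theta_star) - log_q_star(theta_star)
        if np.log(rng.uniform()) < min(0.0, log_w_star - log_w):
            theta, log_w = theta_star, log_w_star
        chain.append(theta)
    return np.array(chain)
```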
Independent Metropolis-Hastings
 One might argue that since the proposed state does not depend on the
previous state, the states of the Markov Chain are independent and therefore
the autocorrelation is zero and the achieved convergence rate optimal.

 This is not the case because the proposals are not always accepted!

 In addition, if the proposal focuses on a region of low probability mass, it will


spend most of its time there.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 18


Independent Metropolis-Hastings
 Consider sampling from the posterior

𝑝(𝜽|𝒟) ∝ 𝑝(𝒟|𝜽)𝑝(𝜽)

 Let us use independent Metropolis-Hastings with the prior as the proposal distribution:

$$a(\boldsymbol{\theta}, \boldsymbol{\theta}') = \min\left(1,\ \frac{p(\mathcal{D}|\boldsymbol{\theta}')\,p(\boldsymbol{\theta}')}{p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})}\cdot\frac{p(\boldsymbol{\theta})}{p(\boldsymbol{\theta}')}\right) = \min\left(1,\ \frac{p(\mathcal{D}|\boldsymbol{\theta}')}{p(\mathcal{D}|\boldsymbol{\theta})}\right)$$

 This works if the effect of the data is not significant – i.e. the posterior is close
to the prior.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 19


Metropolis Algorithm
 If such a scheme is to converge to the target distribution 𝜋(𝜃), then 𝜋 must be invariant, i.e.

$$\int \pi(\theta)\,K(\theta, \theta')\,d\theta = \pi(\theta')$$

 Note that the transition kernel 𝐾 𝜃, 𝜃 ′ is not the same as the proposal
distribution 𝑞(𝜃, 𝜃 ′ )!

𝐾(𝜃, 𝜃 ′ ) = p(𝜃 ′ | proposal acc.) Pr [proposal accepted]


+ p(𝜃 ′ | proposal rejected) Pr [proposal rejected]

 In addition the Markov chain needs to be irreducible (one can reach any 𝐴 s.t.
𝜋(𝐴) > 0) and aperiodic (not visiting periodically the state-space).

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 20


Invariant Distribution of the Metropolis-Hastings
 The transition kernel associated with the MH algorithm can be written as

$$K(\theta, \theta') = \alpha(\theta, \theta')\,q(\theta, \theta') + \underbrace{\left(1 - \int \alpha(\theta, u)\,q(\theta, u)\,du\right)}_{\text{rejection probability}}\delta_{\theta}(\theta')$$

 This is a loose notation for

$$K(\theta, d\theta') = \alpha(\theta, \theta')\,q(\theta, \theta')\,d\theta' + \left(1 - \int \alpha(\theta, u)\,q(\theta, u)\,du\right)\delta_{\theta}(d\theta')$$

 Clearly we need to satisfy $\int K(\theta, \theta')\,d\theta' = 1$. Indeed:

$$\int K(\theta, \theta')\,d\theta' = \int \alpha(\theta, \theta')\,q(\theta, \theta')\,d\theta' + \left(1 - \int \alpha(\theta, u)\,q(\theta, u)\,du\right)\int \delta_{\theta}(\theta')\,d\theta' = 1$$

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 21


The MH Kernel is Reversible
 By definition of the kernel we have

$$\pi(\theta)\,K(\theta, \theta') = \pi(\theta)\,\alpha(\theta, \theta')\,q(\theta, \theta') + \left(1 - \int \alpha(\theta, u)\,q(\theta, u)\,du\right)\delta_{\theta}(\theta')\,\pi(\theta)$$

 Then

$$\pi(\theta)\,\alpha(\theta, \theta')\,q(\theta, \theta') = \pi(\theta)\min\left(1,\ \frac{\pi(\theta')\,q(\theta', \theta)}{\pi(\theta)\,q(\theta, \theta')}\right)q(\theta, \theta') = \min\big(\pi(\theta)\,q(\theta, \theta'),\ \pi(\theta')\,q(\theta', \theta)\big)$$
$$= \pi(\theta')\min\left(1,\ \frac{\pi(\theta)\,q(\theta, \theta')}{\pi(\theta')\,q(\theta', \theta)}\right)q(\theta', \theta) = \pi(\theta')\,\alpha(\theta', \theta)\,q(\theta', \theta)$$

 We also have, obviously,

$$\left(1 - \int \alpha(\theta, u)\,q(\theta, u)\,du\right)\delta_{\theta}(\theta')\,\pi(\theta) = \left(1 - \int \alpha(\theta', u)\,q(\theta', u)\,du\right)\delta_{\theta'}(\theta)\,\pi(\theta')$$

 It follows that 𝜋(𝜃)𝐾(𝜃, 𝜃′) = 𝜋(𝜃′)𝐾(𝜃′, 𝜃).

 Hence, 𝜋 is the invariant distribution of the transition kernel 𝐾.


Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 22
Detailed Balance vs 𝝅 −invariant
 $K(\theta, \theta') = q(\theta, \theta')\,\alpha(\theta, \theta') + \left(1 - \int \alpha(\theta, \theta')\,q(\theta, \theta')\,d\theta'\right)\delta_{\theta}(\theta')$

 The transition kernel 𝐾 satisfies the detailed balance condition (reversibility)

𝜋(𝜃)𝐾(𝜃, 𝜃′) = 𝜋(𝜃′)𝐾(𝜃′, 𝜃)

 Detailed balance implies that 𝜋 is invariant. Indeed:

$$\int \pi(\theta)\,K(\theta, \theta')\,d\theta = \int \pi(\theta')\,K(\theta', \theta)\,d\theta = \pi(\theta')\int K(\theta', \theta)\,d\theta = \pi(\theta')$$

 Many more kernels are 𝜋 − invariant than 𝜋 − reversible.


 Fortunately, it is easier to construct a transition kernel that is 𝜋 −reversible
than just 𝜋 −invariant.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 23


Aperiodicity
 We already have seen that 𝜋 − invariance is not enough to guarantee that the
chain converges to 𝜋.
 In addition we need: aperiodicity and 𝜋 − irreducibility.
 Aperiodicity: Let 𝑀 be an irreducible Markov chain with transition matrix 𝐾 and
let 𝜽 be a fixed state. Define the set

𝑇 = {𝑘 > 0: 𝐾^𝑘(𝜃, 𝜃) > 0}


These are the steps on which it is possible for a chain which starts in state 𝜃 to
revisit 𝜃. The greatest common divisor (g.c.d.) of the integers in 𝑇 is called the
period of state 𝜃.
 The chain is said to be periodic if the period of any of its states is greater
than one.
 A state with period one is aperiodic, i.e. one does not visit in a periodic way
the state-space.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 24
Irreducibility and Ergodicity
 Irreducibility is a measure of the sensitivity of the Markov Chain to initial
conditions
𝐾(𝜃, 𝜃′) is 𝜋-irreducible if, for any set $A \subset \Omega$ with $\int_A \pi(\theta)\,d\theta > 0$,
Pr(Θ𝑛 ∈ 𝐴 for some finite 𝑛 | Θ0 = 𝜃) > 0, so that the chain can hit any set that has positive probability under 𝜋.

 It is satisfied if ∀ 𝜃 ′ : 𝜋(𝜃 ′ ) > 0 ⇒ 𝑞(𝜃, 𝜃 ′ ) > 0 ∀ θ


 Theorem (Ergodicity from reversibility)

Let 𝜋(𝜃) be a given probability density on Ω. If 𝐾(𝜃, 𝜃′) is 𝜋-irreducible, and if 𝐾 is reversible with respect to 𝜋 and aperiodic, then

$$\int_A \pi^{(n)}(\theta)\,d\theta \to \int_A \pi(\theta)\,d\theta \quad \text{as } n \to \infty$$

for any set $A \subset \Omega$ and starting distribution $\pi^{(0)}$.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 25
Irreducibility and Aperiodicity
 To ensure irreducibility, a sufficient but not necessary condition is that

𝜋(𝜃′) > 0 ⇒ 𝑞(𝜃, 𝜃′) > 0 ∀ θ

 Aperiodicity is automatically ensured as there is always a strictly positive


probability to reject the candidate.

 Theoretically, the MH algorithm converges under very weak assumptions to


the target distribution 𝜋.

 The convergence can be very slow.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 26


Sampling from a Mixture of Gaussians
 The MatLab demo here shows how you can sample from a probability
distribution known up to a normalizing constant using MCMC with random
walk proposals.

 Suppose that the probability distribution you want to sample from is 𝜋(𝑥).
 1. Initialize 𝑥.
 2. Propose a new 𝑥𝑛𝑒𝑤 ~ 𝑞(𝑥, 𝑥𝑛𝑒𝑤 ) = 𝒩(𝑥𝑛𝑒𝑤 |𝑥, 𝑠2). Here 𝑞(𝑥, 𝑥𝑛𝑒𝑤 ) leads
to a reversible Markov Chain and the classic Metropolis algorithm is used.
 3. Draw a random number 𝑢 ~ 𝒰[0, 1].
 4. If 𝑢 <= min(1, 𝜋(𝑥𝑛𝑒𝑤 ) /𝜋(𝑥)), accept the move, i.e. 𝑥 = 𝑥𝑛𝑒𝑤 .
 5. Otherwise reject the move.

 The target distribution is a 50% − 50% mixture of two Gaussians.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 27
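The MatLab demo itself is not reproduced here; the following Python/NumPy sketch carries out the same five steps for an illustrative 50%-50% mixture of two unit-variance Gaussians (the mixture means and the scale s are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

def pi(x):
    """Unnormalized 50%-50% mixture of N(-2, 1) and N(2, 1) (illustrative target)."""
    return 0.5 * np.exp(-0.5 * (x + 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x - 2.0) ** 2)

s, n_iter = 2.0, 50_000                   # random walk scale and chain length
x = 0.0                                   # 1. initialize x
samples = np.empty(n_iter)
for i in range(n_iter):
    x_new = rng.normal(x, s)              # 2. propose x_new ~ N(x, s^2)
    u = rng.uniform()                     # 3. draw u ~ U[0, 1]
    if u <= min(1.0, pi(x_new) / pi(x)):  # 4. accept the move ...
        x = x_new
    samples[i] = x                        # 5. ... otherwise keep the current x
print("sample mean:", samples.mean())     # close to 0 for this symmetric mixture
```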


Sampling from a Mixture of Gaussians
[Figure: chains of MCMC samples for random walk proposals 𝑞(𝑥𝑛𝑒𝑤 | 𝑥𝑜𝑙𝑑) = 𝒩(𝑥𝑜𝑙𝑑, 𝑠²) with 𝑠 = 2.0, 0.5, 0.1, and 0.05]
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 28


Selecting the Proposal in Random Walk
 Consider a random walk move. There is no clear guideline how to select the
proposal distribution.

 When the variance of the random walk increments (if it exists) is very small
then the acceptance rate can be expected to be around 0.5 − 0.7.

 You would like to scale the random walk moves such that it is possible to move reasonably fast in regions of positive probability mass under 𝜋.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 29


Random Walk Metropolis-Hastings
 Target: 𝜋(𝑥) = 0.25 𝒩(−3, 2) + 0.75 𝒩(2, 1)
 Random walk proposal: 𝑋𝑛+1 = 𝑋𝑛 + 𝑧𝑛, with 𝑝(𝑧𝑛) = 𝒩(0, 𝜎²), so that 𝑞(𝑥𝑛, 𝑥𝑛+1) = 𝒩(𝑥𝑛, 𝜎²)
 Case: 𝜎 = 5, 𝑥0 = 0.0, length of chain = 10000 (for a C++ implementation see here)

[Figure: ergodic histogram vs. the target density; ergodic mean $\hat I = \frac{1}{N}\sum_i x_i \to \mathbb{E}[X] = 0.75$; normalized autocovariance function $\rho_{ff}(s) = C_{ff}(s)/C_{ff}(0) = C_{ff}(s)/\mathrm{var}(f)$, where $C_{ff}(s) = \frac{1}{N}\sum_{n=1}^{N} f\!\left(x^{(n)}\right) f\!\left(x^{(n+s)}\right) - \left(\frac{1}{N}\sum_{n=1}^{N} f\!\left(x^{(n)}\right)\right)^{2}$]

 Acceptance ratio: 0.38, the best among the three choices considered.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 30


Random Walk Metropolis-Hastings
 Target: 𝜋(𝑥) = 0.25 𝒩(−3, 2) + 0.75 𝒩(2, 1)
 Random walk proposal: 𝑋𝑛+1 = 𝑋𝑛 + 𝑧𝑛, with 𝑝(𝑧𝑛) = 𝒩(0, 𝜎²), so that 𝑞(𝑥𝑛, 𝑥𝑛+1) = 𝒩(𝑥𝑛, 𝜎²)
 Case: 𝜎 = 50, 𝑥0 = 0.0, length of chain = 10000

[Figure: ergodic histogram vs. the target density; normalized autocovariance function]

 Acceptance ratio: 0.05; the acceptance rate is very low, the auto-correlation very high, and thus the convergence rate very slow.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 31
Random Walk Metropolis-Hastings
 Target: 𝜋(𝑥) = 0.25 𝒩(−3, 2) + 0.75 𝒩(2, 1)
 Random walk proposal: 𝑋𝑛+1 = 𝑋𝑛 + 𝑧𝑛, with 𝑝(𝑧𝑛) = 𝒩(0, 𝜎²), so that 𝑞(𝑥𝑛, 𝑥𝑛+1) = 𝒩(𝑥𝑛, 𝜎²)
 Case: 𝜎 = 0.5, 𝑥0 = 0.0, length of chain = 10000

[Figure: ergodic histogram vs. the target density; normalized autocovariance function]

 Acceptance ratio: 0.76; the acceptance rate is the highest of the three cases considered, but the auto-correlation is also the highest and thus the convergence rate is very slow.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 32


Example
 Consider the case where

$$\pi(\theta) \propto e^{-\frac{\theta^{2}}{2}}$$

 We implement the MH algorithm for

$$q_1(\theta, \theta') \propto e^{-\frac{(\theta - \theta')^{2}}{2(0.2)^{2}}}$$

 We also implement the MH algorithm for

$$q_2(\theta, \theta') \propto e^{-\frac{(\theta' - \theta)^{2}}{2(5)^{2}}}$$

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 33


Example
 MCMC output for 𝑞1: we estimate 𝔼(𝜃) = 0.0126 and 𝑉𝑎𝑟(𝜃) = 0.9371.

[Figure: trace of the chain (left) and histogram of the samples vs. the target density (right) for $\pi(\theta) \propto e^{-\theta^{2}/2}$ and $q_1(\theta, \theta') \propto e^{-(\theta - \theta')^{2}/(2(0.2)^{2})}$]

A MatLab implementation is given here

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 34


Example
 MCMC output for 𝑞2: we estimate 𝔼(𝜃) = 0.0034 and 𝑉𝑎𝑟(𝜃) = 1.0081.

[Figure: trace of the chain (left) and histogram of the samples vs. the target density (right) for $\pi(\theta) \propto e^{-\theta^{2}/2}$ and $q_2(\theta, \theta') \propto e^{-(\theta' - \theta)^{2}/(2(5)^{2})}$]

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 35


Example: Bimodal Distribution
 Exploration of a bimodal distribution using a random walk MH algorithm

[Figure: evolution of the sample histogram (density over 𝑥) with iteration]

A MatLab implementation is given here

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 36


Example: Bimodal Distribution
 Bad exploration of a bimodal distribution using a random walk MH algorithm.
The variance of the random walk increments is too small.

[Figure: evolution of the sample histogram (density over 𝑥) with iteration]

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 37


Random Walk Metropolis-Hastings
 A rule of thumb is to have an average acceptance ratio between 0.2 and 0.4.

 You should not adapt 𝜎 2 on the fly in order to achieve an acceptance ratio in
that range.

 The chain is not Markov anymore and the desired convergence properties
might be lost.

 Heavy-tailed increments (heavy tails of the random walk distribution) can prevent you from getting trapped in modes.

 High-dimensional proposal distributions 𝑞 𝜃, 𝜃 ′ are difficult to select.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 38


Independent MH: Example
 Consider the case where

$$\pi(\theta) \propto e^{-\frac{\theta^{2}}{2}}$$

 We implement the MH algorithm for

$$q_1(\theta) \propto e^{-\frac{\theta^{2}}{2(0.2)^{2}}}$$

so that 𝜋(𝜃)/𝑞1(𝜃) → ∞ as 𝜃 → ∞, and for

$$q_2(\theta) \propto e^{-\frac{\theta^{2}}{2(5)^{2}}}$$

so that 𝜋(𝜃)/𝑞2(𝜃) ≤ 𝑀 for all 𝜃.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 39


Independent MH: Example
 MCMC output for 𝑞1: we estimate 𝔼(𝜃) = 0.0174 and 𝑉𝑎𝑟(𝜃) = 0.1374.

[Figure: trace of the chain (left) and histogram of the samples (right) for $\pi(\theta) \propto e^{-\theta^{2}/2}$ and the independent proposal $q_1(\theta) \propto e^{-\theta^{2}/(2(0.2)^{2})}$]

A MatLab implementation is given here

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 40


Independent MH: Example
 MCMC output for 𝑞2: we estimate 𝔼(𝜃) = 0.0193 and 𝑉𝑎𝑟(𝜃) = 1.0107.

[Figure: trace of the chain (left) and histogram of the samples (right) for $\pi(\theta) \propto e^{-\theta^{2}/2}$ and the independent proposal $q_2(\theta) \propto e^{-\theta^{2}/(2(5)^{2})}$]

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 41


Independent Metropolis-Hastings
 Target: 𝜋(𝑥) = 0.25𝒩(−3, 2) + 0.75 𝒩(2, 1)
 Independent Proposal : 𝑞(𝑥) = 𝒩(0, 𝜎2)
 Independent walk proposals are used to jump from one mode to another
 Cases shown: 𝜎 = 1, 𝜎 = 10

[Figure: ergodic mean for the two values of 𝜎 (true value 0.75); autocorrelation for the two proposals. Acceptance ratio ∼ 0.24 for both proposals.]

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 42


Mixture of Proposals
 In practice, random walk proposals can be used to explore the space locally, whereas independent proposals can be used to make large jumps in the space.

 A good strategy can be to use a proposal distribution of the following mixture form

$$q(\theta, \theta') = \lambda\,q_1(\theta') + (1 - \lambda)\,q_2(\theta, \theta')$$

where 0 < 𝜆 < 1.

 This algorithm is valid (satisfies all needed properties of transition kernels) as


it is a particular case of the MH algorithm.

 Combining random walk (conservative small steps) with independent (large


jumps) proposals takes advantage of the merits of both algorithms.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 43
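A sketch (Python/NumPy; the particular Gaussian components and the value of λ are illustrative assumptions) of an MH step with the mixture proposal q(θ, θ′) = λ q1(θ′) + (1 − λ) q2(θ, θ′). Note that the acceptance ratio uses the full mixture density in both directions:

```python
import numpy as np

def normal_pdf(x, mu, s):
    """Density of N(mu, s^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def mixture_q_pdf(theta, theta_p, lam, s1, s2):
    """q(theta, theta') = lam * q1(theta') + (1 - lam) * q2(theta, theta')."""
    return lam * normal_pdf(theta_p, 0.0, s1) + (1.0 - lam) * normal_pdf(theta_p, theta, s2)

def mixture_proposal_mh(log_pi, theta0, n_iter, lam=0.2, s1=10.0, s2=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta, chain = theta0, [theta0]
    for _ in range(n_iter):
        # draw from the mixture: independent component q1 w.p. lam, random walk q2 otherwise
        if rng.uniform() < lam:
            theta_star = rng.normal(0.0, s1)
        else:
            theta_star = rng.normal(theta, s2)
        # MH acceptance with the full mixture proposal density in both directions
        ratio = (np.exp(log_pi(theta_star) - log_pi(theta))
                 * mixture_q_pdf(theta_star, theta, lam, s1, s2)
                 / mixture_q_pdf(theta, theta_star, lam, s1, s2))
        if rng.uniform() < min(1.0, ratio):
            theta = theta_star
        chain.append(theta)
    return np.array(chain)
```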


Mixture of MH Kernels
 An alternative is to use a transition kernel

$$K(\theta, \theta') = \lambda\,K_1(\theta, \theta') + (1 - \lambda)\,K_2(\theta, \theta')$$

where 𝐾1 (respectively, 𝐾2) is an MH kernel with proposal 𝑞1 (respectively, 𝑞2).

 This algorithm is different from using 𝑞(𝜃, 𝜃′) = 𝜆𝑞1(𝜃′) + (1 − 𝜆)𝑞2(𝜃, 𝜃′).

 It is computationally cheaper and 𝐾(𝜃, 𝜃′) has 𝜋(𝜃) as its invariant distribution:

$$\int \pi(\theta)\,K(\theta, \theta')\,d\theta = \lambda \int \pi(\theta)\,K_1(\theta, \theta')\,d\theta + (1 - \lambda)\int \pi(\theta)\,K_2(\theta, \theta')\,d\theta = \lambda\,\pi(\theta') + (1 - \lambda)\,\pi(\theta') = \pi(\theta')$$

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 44


Mixture of MH Kernels
 A sufficient condition to ensure that 𝐾 is irreducible and aperiodic is to have
either 𝐾1 or 𝐾2 irreducible and aperiodic.

 You do NOT need to have both kernels to be irreducible and aperiodic.

 In the limiting case, you could have 𝐾2 (𝜃, 𝜃′) = 𝛿𝜃 (𝜃′) and the total kernel 𝐾
would still be irreducible and aperiodic if 𝐾1 is irreducible and aperiodic.

 None of the kernels have to be irreducible and aperiodic to ensure that 𝐾 is


irreducible and aperiodic (sufficient but not necessary condition).

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 45


Composition of MH Kernels
 Alternatively, we can apply at each iteration of the algorithm first the kernel 𝐾1 and then the kernel 𝐾2, i.e. at iteration 𝑖 we have

$$Z \sim K_1\left(\theta^{(i-1)}, \cdot\right) \quad \text{and} \quad \theta^{(i)} \sim K_2(Z, \cdot)$$

 The composition of these kernels corresponds to

$$K(\theta, \theta') = \int K_1(\theta, z)\,K_2(z, \theta')\,dz$$

 If 𝐾1 and 𝐾2 are both 𝜋-invariant, then the composition is also 𝜋-invariant.

 The algorithm admits the right invariant distribution as

$$\int \pi(\theta)\,K(\theta, \theta')\,d\theta = \int \left(\int \pi(\theta)\,K_1(\theta, z)\,d\theta\right) K_2(z, \theta')\,dz = \int \pi(z)\,K_2(z, \theta')\,dz = \pi(\theta')$$

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 46
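A sketch of composing two π-invariant MH kernels, applying K1 and then K2 within one iteration (Python/NumPy; mh_step is a hypothetical helper performing a single MH accept/reject with the given proposal):

```python
import numpy as np

def mh_step(log_pi, theta, q_sample, q_logpdf, rng):
    """One MH accept/reject step with proposal q; returns the next state."""
    theta_star = q_sample(theta, rng)
    log_ratio = (log_pi(theta_star) + q_logpdf(theta_star, theta)
                 - log_pi(theta) - q_logpdf(theta, theta_star))
    return theta_star if np.log(rng.uniform()) < min(0.0, log_ratio) else theta

def composed_kernel(log_pi, theta, kernels, rng):
    """K = K2 o K1: apply each pi-invariant MH kernel in turn.
    `kernels` is a list of (q_sample, q_logpdf) pairs, one per kernel."""
    for q_sample, q_logpdf in kernels:
        theta = mh_step(log_pi, theta, q_sample, q_logpdf, rng)
    return theta
```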


Mixture and Composition of MH Kernels
 In practice, the choice of the proposal distribution is crucial.

 In high-dimensional problems, a simple MH algorithm is useless. It will be


necessary to use a combination of MH kernels.

 Using mixture and composition of kernels can be a powerful approach.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 47


Mixture and Composition of MH algorithms
 Consider the target distribution 𝜋 𝜃1 , 𝜃2 .

 We use two MH kernels to sample from this distribution

 𝐾1 updates 𝜃1 and keeps 𝜃2 fixed whereas

 𝐾2 updates 𝜃2 and keeps 𝜃1 fixed.

 We then combine 𝐾1 and 𝐾2 through mixture or composition.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 48


Description of Transition Kernels
 The proposal 𝑞̄1(𝜃, 𝜃′) associated with 𝐾1(𝜃, 𝜃′) is given by

$$\bar{q}_1(\theta, \theta') = \bar{q}_1\big((\theta_1, \theta_2), (\theta_1', \theta_2')\big) = q_1\big((\theta_1, \theta_2), \theta_1'\big)\,\delta_{\theta_2}(\theta_2')$$

 The acceptance probability is given by $\alpha_1(\theta, \theta') = \min\big(1, r_1(\theta, \theta')\big)$, where:

$$r_1(\theta, \theta') = \frac{\pi(\theta')\,\bar{q}_1(\theta', \theta)}{\pi(\theta)\,\bar{q}_1(\theta, \theta')} = \frac{\pi(\theta_1', \theta_2')\,q_1\big((\theta_1', \theta_2'), \theta_1\big)\,\delta_{\theta_2'}(\theta_2)}{\pi(\theta_1, \theta_2)\,q_1\big((\theta_1, \theta_2), \theta_1'\big)\,\delta_{\theta_2}(\theta_2')} = \frac{\pi(\theta_1', \theta_2)\,q_1\big((\theta_1', \theta_2), \theta_1\big)}{\pi(\theta_1, \theta_2)\,q_1\big((\theta_1, \theta_2), \theta_1'\big)} = \frac{\pi(\theta_1' \mid \theta_2)\,q_1\big((\theta_1', \theta_2), \theta_1\big)}{\pi(\theta_1 \mid \theta_2)\,q_1\big((\theta_1, \theta_2), \theta_1'\big)}$$

 This move is also equivalent to an MH step with invariant distribution 𝜋(𝜃1 | 𝜃2).

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 49


Description of Transition Kernels
 The proposal 𝑞̄2(𝜃, 𝜃′) associated with 𝐾2(𝜃, 𝜃′) is given by

$$\bar{q}_2(\theta, \theta') = \bar{q}_2\big((\theta_1, \theta_2), (\theta_1', \theta_2')\big) = q_2\big((\theta_1, \theta_2), \theta_2'\big)\,\delta_{\theta_1}(\theta_1')$$

 The acceptance probability is given by $\alpha_2(\theta, \theta') = \min\big(1, r_2(\theta, \theta')\big)$, where:

$$r_2(\theta, \theta') = \frac{\pi(\theta')\,\bar{q}_2(\theta', \theta)}{\pi(\theta)\,\bar{q}_2(\theta, \theta')} = \frac{\pi(\theta_1', \theta_2')\,q_2\big((\theta_1', \theta_2'), \theta_2\big)\,\delta_{\theta_1'}(\theta_1)}{\pi(\theta_1, \theta_2)\,q_2\big((\theta_1, \theta_2), \theta_2'\big)\,\delta_{\theta_1}(\theta_1')} = \frac{\pi(\theta_1, \theta_2')\,q_2\big((\theta_1, \theta_2'), \theta_2\big)}{\pi(\theta_1, \theta_2)\,q_2\big((\theta_1, \theta_2), \theta_2'\big)} = \frac{\pi(\theta_2' \mid \theta_1)\,q_2\big((\theta_1, \theta_2'), \theta_2\big)}{\pi(\theta_2 \mid \theta_1)\,q_2\big((\theta_1, \theta_2), \theta_2'\big)}$$

 This move is also equivalent to an MH step with invariant distribution 𝜋(𝜃2 | 𝜃1).

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 50


Composition of MH Kernels
 Assume we use a composition of these kernels, then the resulting algorithm
proceeds as follows at iteration 𝑖.

 MH Step to Update Component 1

 Sample $\theta_1^* \sim q_1\big((\theta_1^{(i-1)}, \theta_2^{(i-1)}), \cdot\big)$ and compute

$$\alpha_1\Big(\big(\theta_1^{(i-1)}, \theta_2^{(i-1)}\big), \big(\theta_1^*, \theta_2^{(i-1)}\big)\Big) = \min\left(1,\ \frac{\pi\big(\theta_1^* \mid \theta_2^{(i-1)}\big)\,q_1\big((\theta_1^*, \theta_2^{(i-1)}), \theta_1^{(i-1)}\big)}{\pi\big(\theta_1^{(i-1)} \mid \theta_2^{(i-1)}\big)\,q_1\big((\theta_1^{(i-1)}, \theta_2^{(i-1)}), \theta_1^*\big)}\right)$$

 With probability $\alpha_1\big((\theta_1^{(i-1)}, \theta_2^{(i-1)}), (\theta_1^*, \theta_2^{(i-1)})\big)$, set $\theta_1^{(i)} = \theta_1^*$; otherwise set $\theta_1^{(i)} = \theta_1^{(i-1)}$.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 51


Composition of MH Kernels
 Assume we use a composition of these kernels, then the resulting algorithm
proceeds as follows at iteration 𝑖.

 MH Step to Update Component 2

 Sample $\theta_2^* \sim q_2\big((\theta_1^{(i)}, \theta_2^{(i-1)}), \cdot\big)$ and compute

$$\alpha_2\Big(\big(\theta_1^{(i)}, \theta_2^{(i-1)}\big), \big(\theta_1^{(i)}, \theta_2^*\big)\Big) = \min\left(1,\ \frac{\pi\big(\theta_2^* \mid \theta_1^{(i)}\big)\,q_2\big((\theta_1^{(i)}, \theta_2^*), \theta_2^{(i-1)}\big)}{\pi\big(\theta_2^{(i-1)} \mid \theta_1^{(i)}\big)\,q_2\big((\theta_1^{(i)}, \theta_2^{(i-1)}), \theta_2^*\big)}\right)$$

 With probability $\alpha_2\big((\theta_1^{(i)}, \theta_2^{(i-1)}), (\theta_1^{(i)}, \theta_2^*)\big)$, set $\theta_2^{(i)} = \theta_2^*$; otherwise set $\theta_2^{(i)} = \theta_2^{(i-1)}$.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 52


Mixture of MH Kernels
 Assume we use an even mixture of these kernels, then the resulting algorithm
proceeds as follows at iteration 𝑖.

 Sample the index of the component to update: 𝐽 ∼ 𝒰{1, 2}

 Set $\theta_{-J}^{(i)} = \theta_{-J}^{(i-1)}$

 Sample $\theta_J^* \sim q_J\big((\theta_1^{(i-1)}, \theta_2^{(i-1)}), \cdot\big)$ and compute

$$\alpha_J\Big(\big(\theta_J^{(i-1)}, \theta_{-J}^{(i)}\big), \big(\theta_J^*, \theta_{-J}^{(i)}\big)\Big) = \min\left(1,\ \frac{\pi\big(\theta_J^* \mid \theta_{-J}^{(i)}\big)\,q_J\big((\theta_J^*, \theta_{-J}^{(i)}), \theta_J^{(i-1)}\big)}{\pi\big(\theta_J^{(i-1)} \mid \theta_{-J}^{(i)}\big)\,q_J\big((\theta_J^{(i-1)}, \theta_{-J}^{(i)}), \theta_J^*\big)}\right)$$

 With probability $\alpha_J\big((\theta_J^{(i-1)}, \theta_{-J}^{(i)}), (\theta_J^*, \theta_{-J}^{(i)})\big)$, set $\theta_J^{(i)} = \theta_J^*$; otherwise set $\theta_J^{(i)} = \theta_J^{(i-1)}$.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 53
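A sketch of the composition version of these component-wise updates (Python/NumPy; the symmetric random-walk proposals per component are illustrative). Because each proposal changes only one component, the conditional-density ratios above reduce to ratios of the joint density:

```python
import numpy as np

def componentwise_mh(log_pi, theta0, n_iter, step_sizes, seed=0):
    """Metropolis-within-Gibbs by composition: update each component in turn with a
    random-walk MH step; only the joint log density log_pi is needed."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    chain = [theta.copy()]
    for _ in range(n_iter):
        for k in range(len(theta)):                      # K1, then K2, then ...
            proposal = theta.copy()
            proposal[k] += step_sizes[k] * rng.normal()  # symmetric proposal in component k
            if np.log(rng.uniform()) < min(0.0, log_pi(proposal) - log_pi(theta)):
                theta = proposal
        chain.append(theta.copy())
    return np.array(chain)
```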


Properties
 It is clear that in such cases both 𝐾1 and 𝐾2 are NOT irreducible and
aperiodic.

⇒ Each of them only updates one component!!!!

 However, the composition and mixture of these kernels can be irreducible and
aperiodic because then all the components are updated.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 54


Discussion
 For parameter space 𝜃 = (𝜃1, . . . , 𝜃𝑝),

 we update each parameter 𝜃𝑘 according to an MH step with proposal distribution $q_k(\theta_{1:p}, \theta_k') = q_k\big((\theta_{-k}, \theta_k), \theta_k'\big)$ and

 invariant distribution 𝜋(𝜃𝑘 | 𝜃−𝑘).

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 55


Using Full Conditionals Leads to Gibbs Sampler
 Consider now the case where

$$q_1\big((\theta_1, \theta_2), \theta_1'\big) = \pi(\theta_1' \mid \theta_2)$$

then

$$r_1(\theta, \theta') = \frac{\pi(\theta_1' \mid \theta_2)\,q_1\big((\theta_1', \theta_2), \theta_1\big)}{\pi(\theta_1 \mid \theta_2)\,q_1\big((\theta_1, \theta_2), \theta_1'\big)} = \frac{\pi(\theta_1' \mid \theta_2)\,\pi(\theta_1 \mid \theta_2)}{\pi(\theta_1 \mid \theta_2)\,\pi(\theta_1' \mid \theta_2)} = 1$$

 Similarly, if $q_2\big((\theta_1, \theta_2), \theta_2'\big) = \pi(\theta_2' \mid \theta_1)$, then $r_2(\theta, \theta') = 1$.

 Using as proposal distributions in MH the conditional distributions gives you


the Gibbs sampler!

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 56


General Hybrid Algorithm
 To sample from 𝜋(𝜃),𝜃 = 𝜃1 , 𝜃2 , . . . , 𝜃𝑝 , we can use the following algorithm at
iteration 𝑖.

 Iteration 𝑖, 𝑖 ≥ 1

 For 𝑘 = 1: 𝑝

 Sample $\theta_k^{(i)}$ using an MH step with proposal distribution $q_k\big((\theta_{-k}^{(i)}, \theta_k^{(i-1)}), \theta_k'\big)$ and target $\pi\big(\theta_k \mid \theta_{-k}^{(i)}\big)$,

where $\theta_{-k}^{(i)} = \big(\theta_1^{(i)}, \ldots, \theta_{k-1}^{(i)}, \theta_{k+1}^{(i-1)}, \ldots, \theta_p^{(i-1)}\big)$.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 57


General Hybrid Algorithm
 If we have 𝑞𝑘 𝜃1:𝑝 , 𝜃𝑘′ = 𝜋 𝜃𝑘′ |𝜃−𝑘 then we are back to the Gibbs sampler.

 Update some parameters according to 𝜋 𝜃𝑘′ |𝜃−𝑘 (and the move is


automatically accepted) and the rest according to different proposals. For
example:

 For 𝜋 𝜃1 , 𝜃2 , sample from 𝜋 𝜃1 |𝜃2 and

 Then use an MH step of invariant distribution 𝜋 𝜃2 |𝜃1 .

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 58


General Hybrid Algorithm
 At iteration 𝑖, 𝑖 ≥ 1

 Sample $\theta_1^{(i)} \sim \pi\big(\theta_1 \mid \theta_2^{(i-1)}\big)$

 Sample $\theta_2^{(i)}$ using an MH step with proposal distribution $q_2\big((\theta_1^{(i)}, \theta_2^{(i-1)}), \theta_2'\big)$ and target $\pi\big(\theta_2 \mid \theta_1^{(i)}\big)$

 There is no need to run the MH algorithm for multiple steps to ensure that $\theta_2^{(i)} \sim \pi\big(\theta_2 \mid \theta_1^{(i)}\big)$.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 59


Using Gradient Information to Build 𝒒 𝜽, 𝜽 ′

 We usually want to sample candidates in regions of high probability

 We can use 2
 '    log  ( )   V , V ~ N (0,1)
2

where 𝜎2 is selected such that the acceptance ratio is approximately 0.57.

 The motivation comes from the continuous-time case where


1
dt   log  ( )   dWt
2

admits 𝜋 as an invariant distribution.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 60
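A sketch of one MH step with this gradient-informed (Langevin-type) proposal (Python/NumPy; grad_log_pi is assumed supplied by the user, and the Gaussian proposal density is written up to constants that cancel in the ratio):

```python
import numpy as np

def langevin_mh_step(log_pi, grad_log_pi, theta, sigma, rng):
    """One MH step with proposal theta' = theta + (sigma^2 / 2) * grad log pi(theta) + sigma * V."""
    def drift(t):
        return t + 0.5 * sigma**2 * grad_log_pi(t)

    theta_star = drift(theta) + sigma * rng.standard_normal(np.shape(theta))

    def log_q(t_from, t_to):
        # log q(t_from, t_to) for the Gaussian proposal, up to an additive constant
        return -np.sum((t_to - drift(t_from)) ** 2) / (2.0 * sigma**2)

    log_ratio = (log_pi(theta_star) + log_q(theta_star, theta)
                 - log_pi(theta) - log_q(theta, theta_star))
    return theta_star if np.log(rng.uniform()) < min(0.0, log_ratio) else theta
```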


Alternative Acceptance Probabilities
 The standard MH algorithm uses the acceptance probability:

$$\alpha(\theta, \theta') = \min\left(1,\ \frac{\pi(\theta')\,q(\theta', \theta)}{\pi(\theta)\,q(\theta, \theta')}\right)$$

 One can also use the 𝛼(𝜃, 𝜃′) below with any function 𝛿(𝜃′, 𝜃)

$$\alpha(\theta, \theta') = \frac{\delta(\theta', \theta)}{\pi(\theta)\,q(\theta, \theta')}$$

which is such that 𝛿(𝜃′, 𝜃) = 𝛿(𝜃, 𝜃′) and 0 ≤ 𝛼(𝜃, 𝜃′) ≤ 1.

 For example (Baker, 1965):

$$\alpha(\theta, \theta') = \frac{\pi(\theta')\,q(\theta', \theta)}{\pi(\theta')\,q(\theta', \theta) + \pi(\theta)\,q(\theta, \theta')}$$

Note that 0 ≤ 𝛼(𝜃, 𝜃′) ≤ 1 and

$$\delta(\theta, \theta') = \frac{\pi(\theta')\,q(\theta', \theta)\,\pi(\theta)\,q(\theta, \theta')}{\pi(\theta')\,q(\theta', \theta) + \pi(\theta)\,q(\theta, \theta')} = \delta(\theta', \theta).$$
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 61
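A small sketch comparing the two choices for a given ratio r = π(θ′)q(θ′, θ) / [π(θ)q(θ, θ′)]:

```python
def mh_acceptance(r):
    """Standard Metropolis-Hastings choice: alpha = min(1, r)."""
    return min(1.0, r)

def baker_acceptance(r):
    """Baker (1965): alpha = pi' q' / (pi' q' + pi q) = r / (1 + r)."""
    return r / (1.0 + r)

for r in (0.1, 1.0, 10.0):
    print(r, mh_acceptance(r), baker_acceptance(r))   # MH accepts at least as often
```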
Alternative Acceptance Probabilities
 Indeed, one can check that

$$K(\theta, \theta') = \alpha(\theta, \theta')\,q(\theta, \theta') + \left(1 - \int \alpha(\theta, u)\,q(\theta, u)\,du\right)\delta_{\theta}(\theta')$$

is 𝜋-reversible.

 We have:

$$\pi(\theta)\,\alpha(\theta, \theta')\,q(\theta, \theta') = \pi(\theta)\,q(\theta, \theta')\,\frac{\delta(\theta, \theta')}{\pi(\theta)\,q(\theta, \theta')} = \delta(\theta, \theta') = \delta(\theta', \theta) = \pi(\theta')\,q(\theta', \theta)\,\frac{\delta(\theta', \theta)}{\pi(\theta')\,q(\theta', \theta)} = \pi(\theta')\,\alpha(\theta', \theta)\,q(\theta', \theta)$$

 The MH acceptance is favored as it increases the acceptance probability.

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 62


Hybrid Hamiltonian Metropolis Proposal
 Hybrid MC is essentially Metropolis with a special choice of a proposal.

 Assume that you want to sample 𝑥 ~ 𝜋(𝑥), where 𝜋(𝑥) is known up to a


proportionality constant.

 Consider that 𝑥 represents the position of some “real particles”. Then write:

𝜋(𝑥) = exp(−𝑉(𝑥))

where we have defined: 𝑉(𝑥) = −log 𝜋(𝑥).

 This is similar to the Boltzmann distribution at inverse temperature equal to


one (statistical mechanics).

 Think of 𝑉(𝑥) as the potential of the system at 𝑥.


Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 63
Hybrid Hamiltonian Metropolis Proposal
 To complete the picture, introduce the momenta 𝑝 (of the same dimension as
𝑥) and write a probability distribution in the extended space:

𝑥, 𝑝 ~ 𝜋(𝑥, 𝑝) = 𝜋(𝑥) × 𝒩(𝑝|0, 1) ∝ exp(−𝑉(𝑥) − 𝑝2 / 2)

 Write 𝐻(𝑥, 𝑝) = 𝑉(𝑥) + 𝑝2 / 2 for the Hamiltonian of the system.

 To construct the proposal, we bring into the picture “the dynamics described
by the Hamiltonian”.

 If you integrate the equations of motion described by 𝐻(𝑥, 𝑝) starting at any


initial condition for a long time, you will get a sample from 𝜋(𝑥, 𝑝).

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 64


Hybrid Hamiltonian Metropolis Proposal
𝑥, 𝑝 ~ 𝜋(𝑥, 𝑝) = 𝜋(𝑥) × 𝒩(𝑝|0, 1) ∝ exp(−𝑉(𝑥) − 𝑝2 / 2)

 Based on this the hybrid Metropolis proposal is constructed as:

 1. Sample an initial 𝑝 from a Gaussian (in 𝜋(𝑥, 𝑝), 𝑥 and 𝑝 are decoupled and the
probability distribution of 𝑝 is 𝒩(𝑝|0, 1))

 2. Evolve the equations of motion for a finite amount of time using a finite time
step.

 3. Use the 𝑥 at the final step as the proposed move.

 Notice that the proposal built this way is reversible if the integration scheme is reversible. To guarantee this, we use the leapfrog integration scheme for the equations of motion, which is time-reversible and (approximately) preserves the value of the Hamiltonian.
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 65
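A sketch of one hybrid (Hamiltonian) Metropolis step with a leapfrog integrator (Python/NumPy; the user supplies the potential V(x) = −log π(x) and its gradient, and the step size and number of leapfrog steps are tuning assumptions):

```python
import numpy as np

def hmc_step(V, grad_V, x, step_size, n_leapfrog, rng):
    """One hybrid Monte Carlo step: sample momenta, integrate the Hamiltonian dynamics
    with the leapfrog scheme, then accept/reject based on H(x, p) = V(x) + p^2 / 2."""
    p = rng.standard_normal(np.shape(x))                 # 1. p ~ N(0, I)
    x_new, p_new = np.copy(x), np.copy(p)
    p_new = p_new - 0.5 * step_size * grad_V(x_new)      # 2. leapfrog: half momentum step
    for _ in range(n_leapfrog):
        x_new = x_new + step_size * p_new                #    full position step
        p_new = p_new - step_size * grad_V(x_new)        #    full momentum step
    p_new = p_new + 0.5 * step_size * grad_V(x_new)      #    undo the extra half step
    H_old = V(x) + 0.5 * np.sum(p ** 2)
    H_new = V(x_new) + 0.5 * np.sum(p_new ** 2)
    if np.log(rng.uniform()) < H_old - H_new:            # 3. Metropolis accept/reject
        return x_new
    return x
```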
Hybrid Hamiltonian Metropolis Proposal
[Figure: samples obtained after 50 MH (hybrid Monte Carlo) steps]

MatLab implementation with animation of the dynamics

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 66
