
PDF sampling: Markov Chain Monte Carlo

Sampling a given distribution is one of the most common tasks arising in research. For example, in
generic Monte Carlo numerical integration (i.e., integration using random samples), the integral is
approximated as

$$I = \int_{V_x} g(x)\,\pi(x)\,dx \approx \frac{V_x}{N}\sum_{i=1}^{N} g(x_i)\,\pi(x_i), \qquad (1.1)$$

where $x$ is a vector with the number of components equal to the number of dimensions, $V_x$ is the
integration volume, and $\{x_i\}$ are random samples in $V_x$. We could distribute the samples uniformly
over the integration volume, but the integral converges very slowly in this case. In addition, most of the
samples may fall in regions where $g(x_i)\pi(x_i)$ is very small and does not contribute significantly to
the sum. The number of points required to reach a given accuracy increases exponentially with the number
of dimensions, and the problem thus suffers from the curse of dimensionality.
As in the Gaussian quadrature numerical integration methods, integration would be much more efficient
if the sample points were not distributed uniformly, but instead sampled the distribution π(x). In this
case the integral can be approximated as

$$I \approx \frac{1}{N}\sum_{i=1}^{N} g(x_i), \qquad (1.2)$$

and the number of points needed to reach a given integration accuracy is much smaller and does not grow
exponentially with the number of dimensions.
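The two estimators in equations 1.1 and 1.2 can be compared with a minimal sketch. The choices below are illustrative assumptions, not taken from the text: π(x) is a 1D Gaussian with zero mean and unit variance, g(x) = x² (so the exact integral is 1), and the uniform samples cover the interval [−10, 10].

```python
import math
import random

def gauss_pdf(x):
    """Assumed target pdf pi(x): Gaussian with zero mean and unit variance."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def integrate_uniform(g, pdf, lo, hi, n, rng):
    """Eq. 1.1: uniform samples in [lo, hi], each weighted by pi(x_i)."""
    total = 0.0
    for _ in range(n):
        x = rng.uniform(lo, hi)
        total += g(x) * pdf(x)
    return (hi - lo) * total / n

def integrate_from_pdf(g, draw, n, rng):
    """Eq. 1.2: samples drawn from pi(x) itself, plain average of g."""
    return sum(g(draw(rng)) for _ in range(n)) / n

rng = random.Random(0)
g = lambda x: x * x  # integral of x^2 pi(x) dx equals 1 for a unit Gaussian
est_uniform = integrate_uniform(g, gauss_pdf, -10.0, 10.0, 100_000, rng)
est_pdf = integrate_from_pdf(g, lambda r: r.gauss(0.0, 1.0), 100_000, rng)
print(est_uniform, est_pdf)
```

In one dimension both estimates land close to 1; the advantage of sampling π(x) directly becomes important as the number of dimensions grows.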
Another application is sampling of the posterior pdf, a problem often encountered in statistical analyses
of data. Given the likelihood of observational data $d$ for a model $M(x)$ that depends on a vector of
parameters $x$, $L(d|M(x))$, and the pdf for the parameter values given some prior information $I$ (the
prior pdf), $p(x|I)$, the posterior distribution according to Bayes' theorem is

$$\pi(x|d,I) \propto L(d|M(x))\,p(x|I). \qquad (1.3)$$

Thus, we can reconstruct the posterior distribution of parameter values by randomly sampling the pdf
$\propto L(d|M(x))\,p(x|I)$.
Simple and efficient pdf sampling methods, such as rejection sampling or inverse transform sampling, exist
but require detailed knowledge of the probability distribution function (pdf) or its integral. If the shape
of the pdf is not well known, as is often the case when the pdf depends on many parameters, these
methods are often impractical. In this case, the method of choice is Markov Chain Monte Carlo (MCMC)
sampling. “Monte Carlo” refers to the use of random samples, while “Markov Chain” refers to the fact that
the sampling algorithm depends only on the previous sample (a so-called Markov process): the probability
of a step from $x_i$ to $x_{i+1}$ is $p(x_{i+1}|\{x_j\}_{j=1}^{i}) = p(x_{i+1}|x_i)$, i.e., it depends
only on $x_i$ and $x_{i+1}$.
The key concept of the MCMC method is statistical equilibrium. The method was first developed by
physicists to model thermodynamic properties of particle systems, in which the approach to equilibrium
depends on the interaction between particles. Likewise, in the MCMC method the equilibrium distribution of
points $\{x_j\}$ that sample the target distribution is achieved by appropriately chosen transition
probabilities. To reach equilibrium the transition probability must be symmetric:
$p(x_{i+1}|x_i) = p(x_i|x_{i+1})$. This condition is also called the detailed balance condition.
We can write the transition probability as a product of the transition probability kernel,
$T(x_{i+1}|x_i)$, properly normalized so that it integrates to unity, and the target pdf, π(x):
$p(x_{i+1}|x_i) = T(x_{i+1}|x_i)\,\pi(x_i)$. Then the detailed balance condition reads:

$$T(x_{i+1}|x_i)\,\pi(x_i) = T(x_i|x_{i+1})\,\pi(x_{i+1}), \qquad (1.4)$$


which physically means that the flux of samples $x_i \to x_{i+1}$ is statistically balanced by the reverse
flux $x_{i+1} \to x_i$. Indeed, integrating over $x_i$ gives

$$\pi(x_{i+1})\int T(x_i|x_{i+1})\,dx_i = \pi(x_{i+1}) = \int T(x_{i+1}|x_i)\,\pi(x_i)\,dx_i, \qquad (1.5)$$

i.e., if $x_i$ is drawn from π, then the next sample drawn with a probability satisfying detailed balance
will also be drawn from π.
Different MCMC methods use different choices for the stepping rules and the transition kernel probability
$T(x_{i+1}|x_i)$, but they all must satisfy the detailed balance condition in order to sample the target
pdf faithfully.
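Equations 1.4 and 1.5 can be checked numerically on a small discrete state space. The sketch below is an illustration under stated assumptions: the three target probabilities are arbitrary, the proposal is uniform over the other states, and the acceptance rule is the Metropolis rule introduced in section 1.1.

```python
def metropolis_matrix(pi):
    """Transition matrix T[j][i] = T(j | i) for a Metropolis chain on a
    discrete state space with a uniform (symmetric) proposal."""
    n = len(pi)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j != i:
                # propose j with probability 1/(n-1), accept with min(1, pi_j/pi_i)
                T[j][i] = min(1.0, pi[j] / pi[i]) / (n - 1)
        # rejected proposals leave the chain in state i
        T[i][i] = 1.0 - sum(T[j][i] for j in range(n) if j != i)
    return T

pi = [0.5, 0.3, 0.2]  # arbitrary illustrative target probabilities
T = metropolis_matrix(pi)

# detailed balance (eq. 1.4): T(j|i) pi_i == T(i|j) pi_j for every pair
balanced = all(abs(T[j][i] * pi[i] - T[i][j] * pi[j]) < 1e-12
               for i in range(3) for j in range(3))
# stationarity (eq. 1.5): sum_i T(j|i) pi_i == pi_j
pi_next = [sum(T[j][i] * pi[i] for i in range(3)) for j in range(3)]
print(balanced, pi_next)
```

Applying the transition matrix to the target probabilities returns the same probabilities, which is the discrete analogue of equation 1.5.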

1.1 The Metropolis-Hastings algorithm

Metropolis et al. (1953) developed a Monte Carlo integration method for modelling thermodynamic properties
(e.g., the equation of state) of a system of particles interacting according to a given rule. The method
was then generalized for MCMC applications by Hastings (1970) and has become known as the
Metropolis-Hastings algorithm. In this algorithm the transition probability kernel is chosen as follows:

$$T(x_{i+1}|x_i) = p_{\rm acc}(x_{i+1}, x_i)\,P(x_{i+1}|x_i), \qquad (1.6)$$

where $P(x_{i+1}|x_i)$ is the proposed step distribution, which can, in principle, be any function, such as
a uniform distribution within some interval $x_i \pm dx$ or a Gaussian pdf centered on $x_i$, and the
acceptance probability $p_{\rm acc}$ is

$$p_{\rm acc}(x_{i+1}, x_i) = \min\left[\frac{P(x_i|x_{i+1})\,\pi(x_{i+1})}{P(x_{i+1}|x_i)\,\pi(x_i)},\; 1\right]. \qquad (1.7)$$

It is easy to check that for this choice the detailed balance condition is satisfied.
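The check can be written out explicitly. The lines below are a short sketch, assuming the acceptance probability is capped at unity as in the standard Metropolis-Hastings form, and using the shorthand $x = x_i$, $x' = x_{i+1}$ (this shorthand is introduced here for brevity only):

```latex
\begin{aligned}
r &\equiv \frac{P(x|x')\,\pi(x')}{P(x'|x)\,\pi(x)}, \qquad
p_{\rm acc}(x',x) = \min(1, r), \qquad
p_{\rm acc}(x,x') = \min(1, 1/r). \\
\text{If } r \le 1:\quad
T(x'|x)\,\pi(x) &= r\,P(x'|x)\,\pi(x) = P(x|x')\,\pi(x')
  = p_{\rm acc}(x,x')\,P(x|x')\,\pi(x') = T(x|x')\,\pi(x').
\end{aligned}
```

The case $r > 1$ follows by exchanging the roles of $x$ and $x'$, so the detailed balance condition of equation 1.4 holds for any proposal distribution $P$.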
In the original Metropolis et al. (1953) algorithm the proposal is symmetric,
$P(x_{i+1}|x_i) = P(x_i|x_{i+1})$, and the acceptance probability is just

$$p_{\rm acc}(x_{i+1}, x_i) = \min\left[\frac{\pi(x_{i+1})}{\pi(x_i)},\; 1.0\right], \qquad (1.8)$$

i.e., if $\pi(x_{i+1}) > \pi(x_i)$ the proposed step is always accepted, while if
$\pi(x_{i+1}) < \pi(x_i)$ the step is accepted with probability $\pi(x_{i+1})/\pi(x_i)$. The latter means
that we draw a random number uniformly distributed from 0 to 1 using a random number generator; if the
drawn number is smaller than $\pi(x_{i+1})/\pi(x_i)$ the step is accepted, otherwise it is rejected. When
a step is rejected, the new entry in the MCMC chain of samples is the same value $x_i$. That is, we do not
simply move on to the next step proposal but add a duplicate value of $x_i$ to the chain. This is
important for satisfying detailed balance, as illustrated below. Note that the acceptance probability
depends on the ratio of the posterior values and thus does not depend on the normalization of the
posterior. This is a very useful property because it means that we can sample the posterior even without
knowing its correct normalization. Note also that in practice the fraction of proposed steps that are
accepted depends on the details of the proposed step distribution, with optimal stepping corresponding
to an accepted fraction of ∼ 0.2−0.5.
Thus, we can write the Metropolis MCMC algorithm to produce N samples as a simple pseudo-code:
procedure SimpleMetropolis[In: N, x_0; Out: {x_i}]

1. for i from 0 to N − 1
       draw x_{i+1} using P(x_{i+1}|x_i)
       if π(x_{i+1}) > π(x_i): accept x_{i+1} as the next sample in the chain
       else:
           draw a random number r from a uniform distribution U[0, 1)
           if r < π(x_{i+1})/π(x_i): accept x_{i+1} as the next sample in the chain
           else: take x_i as the next sample in the chain
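The pseudo-code above can be turned into a short runnable sketch. Several choices here are assumptions made for illustration rather than part of the text: the function name `metropolis`, the uniform proposal of half-width `step`, and the use of the logarithm of the pdf (which avoids numerical underflow far from the peak but is otherwise equivalent to using the ratio of pdf values).

```python
import math
import random

def metropolis(log_pdf, x0, n_samples, step=1.0, seed=None):
    """Metropolis sampler for a 1D pdf with a symmetric uniform proposal.
    Returns the chain and the fraction of accepted proposals."""
    rng = random.Random(seed)
    x, logp = x0, log_pdf(x0)
    chain = []
    n_accepted = 0
    for _ in range(n_samples):
        x_new = x + rng.uniform(-step, step)  # symmetric proposal P
        logp_new = log_pdf(x_new)
        # accept with probability min(1, pi(x_new)/pi(x))
        if logp_new >= logp or rng.random() < math.exp(logp_new - logp):
            x, logp = x_new, logp_new
            n_accepted += 1
        chain.append(x)  # after a rejection this appends a duplicate of x
    return chain, n_accepted / n_samples

def log_gauss(x):
    """Log of an unnormalized Gaussian with zero mean and unit variance."""
    return -0.5 * x * x

chain, acc_frac = metropolis(log_gauss, x0=10.0, n_samples=50_000,
                             step=2.5, seed=1)
kept = chain[1000:]  # discard the first samples, which start far from the peak
mean = sum(kept) / len(kept)
var = sum((x - mean) ** 2 for x in kept) / (len(kept) - 1)
print(acc_frac, mean, var)
```

Because only differences of log-pdf values enter the acceptance test, the normalization of the pdf never needs to be known.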

Figure 1.1 shows a simple 1d Gaussian with zero mean and unit variance sampled with the
Metropolis-Hastings algorithm starting at $x_0 = 10$. The left panel shows the correct algorithm, while the
right panel shows what happens if one fails to include duplicate values of $x_i$ in the chain when a
proposed step is rejected, and includes only the new positions. In this case, the detailed balance
condition is violated and the target pdf is not sampled correctly.
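The effect of dropping the duplicates can be reproduced with a small sketch. The setup is an assumption chosen for illustration (a unit Gaussian target, a uniform proposal of half-width 5, a chain started at the peak) and is not the code used for Figure 1.1.

```python
import math
import random

def metropolis_gauss(n_steps, x0, step, keep_duplicates, seed):
    """Sample a unit Gaussian; optionally (incorrectly) skip duplicates."""
    rng = random.Random(seed)
    x = x0
    chain = []
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)
        # log of pi(x_new)/pi(x) for pi(x) proportional to exp(-x^2/2)
        log_ratio = 0.5 * (x * x - x_new * x_new)
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x = x_new
            chain.append(x)
        elif keep_duplicates:
            chain.append(x)  # correct: repeat the current value
    return chain

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

good = metropolis_gauss(200_000, 0.0, 5.0, keep_duplicates=True, seed=2)
bad = metropolis_gauss(200_000, 0.0, 5.0, keep_duplicates=False, seed=2)
print(variance(good), variance(bad))
```

In this sketch the chain without duplicates under-weights the peak, where proposals are rejected most often, so its sample variance comes out larger than the unit variance of the target.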
[Figure 1.1 (two panels): histograms of the chain samples; horizontal axis: x, vertical axis: frequency.]

Figure 1.1: Left panel: the histogram of the MCMC chain $\{x_i\}$ of $10^6$ samples produced using the
Metropolis algorithm (blue bins) compared to the target pdf (red line). Right panel: the distribution of
samples in an incorrect implementation of the Metropolis algorithm in which the sample positions were not
duplicated when a proposed step failed. In this case the detailed balance condition was violated and the
target Gaussian pdf is not sampled correctly.

1.2 The burn-in and thinning

It is always good to start the chain near the peak of the posterior. However, information about the
posterior is often limited, at least for some of the parameters of the problem. The initial guess can thus
be quite far off, in a low-probability region. If steps are chosen reasonably, the chain will recover and,
in fact, the initial samples in the low-probability region are formally correct samples of the target pdf.
Nevertheless, these low-probability values are often extremely improbable for the finite length of the
sample chain that one generates in practice. For example, the Gaussian pdf shown in Figure 1.1 was sampled
with a chain that was started at $x_0 = 10$, i.e., 10σ away from the peak. The probability of such a sample
is $\approx 1.5\times10^{-23}$, so we would need a chain of length $N \sim 10^{23}$ to make such a sample
“normal.” For chains of smaller N this starting value can bias estimates of the mean, rms dispersion, etc.
Thus, in practice a certain number of initial chain samples is discarded to avoid such biases. The initial
range of improbable samples is called the “burn-in” period of the chain. Determining this period is to a
large extent a black art, and is handled in conjunction with deciding on chain convergence (see below). A
simple check of how much one’s estimates of the statistics of interest change after discarding a certain
number of the initial samples will do the trick.
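The quoted probability of a 10σ sample can be reproduced with the complementary error function. This is a minimal sketch; the function name `two_sided_tail` is introduced here for illustration, and the probability is taken to be the two-sided tail P(|x| > 10σ).

```python
import math

def two_sided_tail(n_sigma):
    """P(|x| > n_sigma) for a Gaussian with zero mean and unit variance."""
    return math.erfc(n_sigma / math.sqrt(2.0))

p = two_sided_tail(10.0)
# 1/p is the rough chain length at which one such sample would be expected
print(p, 1.0 / p)
```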
The left plot in Figure 1.2 shows the initial 1000 steps of the chain sampling the Gaussian started from
$x_0 = 10$. Clearly the chain moves to probable values of the pdf after only ≈ 50−100 samples, but the
initial ≈ 50 samples are clearly highly improbable for a chain of length 1000 or even $10^6$ and thus need
to be discarded to avoid biases.
An additional unavoidable feature of the chains is short-range correlations. Although the probability of
the next step depends only on the current location, the current location depended on the previous one, and
so on. This can be clearly seen in the left plot of Figure 1.2, as the initial location choice of
$x_0 = 10$ predetermined high values of x until $N_{\rm sample} \sim 50$. These short-range correlations
mean that samples in the chain are not truly independent. To deal with this, chains are often “thinned” by
selecting only every Nth sample, where N is determined by the correlation length of the chain measured by
the autocorrelation function.
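Thinning by the correlation length can be sketched as follows. Everything specific here is an assumption for illustration: the chain is a synthetic correlated sequence (an AR(1) process standing in for MCMC output), and the correlation length is taken as the first lag at which the autocorrelation function drops below 0.1.

```python
import random

def autocorrelation(chain, max_lag):
    """Normalized autocorrelation function of a 1D chain for lags 0..max_lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    acf = []
    for lag in range(max_lag + 1):
        cov = sum((chain[t] - mean) * (chain[t + lag] - mean)
                  for t in range(n - lag)) / n
        acf.append(cov / var)
    return acf

def correlation_length(acf, threshold=0.1):
    """First lag at which the autocorrelation drops below the threshold
    (the threshold value is an arbitrary illustrative choice)."""
    for lag, rho in enumerate(acf):
        if rho < threshold:
            return lag
    return len(acf)

# synthetic correlated chain standing in for the output of an MCMC run
rng = random.Random(3)
chain, x = [], 0.0
for _ in range(20_000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    chain.append(x)

acf = autocorrelation(chain, max_lag=100)
n_thin = correlation_length(acf)
thinned = chain[::n_thin]  # keep only every n_thin-th sample
print(n_thin, len(thinned))
```

Other definitions of the correlation length (for example the integrated autocorrelation time) are also in use; the threshold rule is only the simplest one to code.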
[Figure 1.2 (two panels): traces of x versus $N_{\rm sample}$; the left panel covers the first 1000
samples, the right panel $10^5$ samples.]

Figure 1.2: Left panel: the initial samples of the chain sampling the Gaussian pdf that was started 10σ
away from the mean. The chain recovers to the probable region after ≈ 100 steps, but the initial ≈ 50
samples are clearly highly improbable for a chain of length 1000 or even $10^6$ and thus need to be
discarded to avoid biases. Right panel: the samples in the chain sampling the Gaussian for $10^5$ samples.
The chain convergence is indicated by the stable distribution of samples around the mean x = 0.

1.3 A practical example: line fit with errors in both variables and intrinsic scatter

As an illustration of how the simple Metropolis algorithm can be used in practice, consider the problem of
the Bayesian fit of a linear relation to a set of measurements, $\{x_i, y_i\}$, in which both x and y have
significant (Gaussian) errors and which may exhibit intrinsic scatter. The posterior distribution derived
using the Bayesian approach for this problem is (see D’Agostini, 2005):

$$\pi(m,c,s|x,y,I) = k \prod_i \frac{1}{\sqrt{s^2+\sigma_{y_i}^2+m^2\sigma_{x_i}^2}} \exp\left[-\frac{(y_i - m x_i - c)^2}{2\,(s^2+\sigma_{y_i}^2+m^2\sigma_{x_i}^2)}\right] p(m,c,s|I), \qquad (1.9)$$

where m, c, s are the slope, normalization, and intrinsic scatter of the relation; k is a normalization
constant (unknown, but this is irrelevant for the Metropolis algorithm); $\sigma_{x_i}$ and
$\sigma_{y_i}$ are the errors of $x_i$ and $y_i$; and $p(m,c,s|I)$ is the prior probability distribution
for the values of the slope, normalization, and scatter.
The posterior can be sampled using a simple Metropolis algorithm and the resulting chain can be used to
calculate the best fit values of m, c, s and their confidence limits. The results of such a fit for the
specific case of comparing masses measured using the X-ray mass indicator by Vikhlinin et al. (2009) and a
recent measurement by Hoekstra et al. (2015) are shown in Figure 1.3. In this case, the chain was run with
$10^6$ samples, of which the first 1000 were discarded as burn-in, and the chain was thinned with only
every 100th sample selected.
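The logarithm of the posterior in equation 1.9 is what a sampler would evaluate at each step. The sketch below assumes a flat prior restricted to s ≥ 0 and drops the constant log k; the function name and argument order are illustrative choices, and the numbers in the usage lines are made up solely to exercise the function.

```python
import math

def log_posterior(m, c, s, x, y, sigma_x, sigma_y):
    """Logarithm of eq. 1.9 up to the constant log k, assuming a flat
    prior p(m, c, s | I) restricted to s >= 0."""
    if s < 0.0:
        return -math.inf
    logp = 0.0
    for xi, yi, sxi, syi in zip(x, y, sigma_x, sigma_y):
        # total variance: intrinsic scatter plus errors in y and in x
        var = s * s + syi * syi + m * m * sxi * sxi
        resid = yi - m * xi - c
        logp += -0.5 * math.log(var) - 0.5 * resid * resid / var
    return logp

# made-up single data point, used only to exercise the function
value = log_posterior(2.0, 1.0, 1.0, [1.0], [4.0], [1.0], [2.0])
print(value)
```

A Metropolis sampler for this problem would propose steps in the three parameters (m, c, s) and use differences of `log_posterior` values in the acceptance test.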

[Figure 1.3 (two panels). Left panel: $M_{500,Y_X}$ ($h_{72}^{-1}\times10^{14}\,M_\odot$) on the vertical
axis versus $M_{500,H15}$ ($h_{72}^{-1}\times10^{14}\,M_\odot$) on the horizontal axis. Right panel:
normalization at pivot mass versus slope.]

Figure 1.3: Left panel: weak lensing masses measured by Hoekstra et al. (2015) for the 10 clusters
overlapping with the sample used for cosmological analysis by Vikhlinin et al. (2009) vs the masses
measured from the X-ray indicator $Y_X$. The green dashed line shows the one-to-one relation between the
masses, while the solid blue line shows the best fit linear relation in the Bayesian fit using the
posterior given by equation 1.9 with a flat prior on m, c, and s, which accounts for errors in both
directions and intrinsic scatter in the y-direction. This fit gives the best fit slope value of
m = 0.57 ± 0.25 (although the slope is consistent with unity at the 95% confidence level) and a relative
normalization between masses of 0.87 ± 0.10, which is also consistent with unity. Right panel: the
distribution of the MCMC samples in the plane of slope and normalization.

1.4 Metropolis algorithm and parameter degeneracies

The Metropolis algorithm works quite well when the posterior is well-localized and its computation is
fast, as in the above line fitting example. The chain converges fast and one can always just generate a
sufficiently large number of samples to ensure convergence. However, in practice, especially for problems
with many parameters, strong and complicated degeneracies often exist among some of them and the chain
convergence may be very slow, so that the number of samples required can be very large.
Figure 1.4 illustrates this by comparing the convergence of the MCMC chain generated using the Metropolis
algorithm to sample a 2D Gaussian posterior with a significant correlation,

$$\pi(x_1,x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}} \exp\left\{-\frac{0.5}{1-r^2}\left[\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2} - \frac{2\,r\,x_1 x_2}{\sigma_1\sigma_2}\right]\right\}, \qquad (1.10)$$

and the result for the same chain length ($N_{\rm sample} = 10^5$) sampling the Rosenbrock “banana” pdf:

$$\pi(x_1,x_2) = \exp\left\{-0.05\left[100\,(x_2 - x_1^2)^2 + (1 - x_1)^2\right]\right\}. \qquad (1.11)$$

Note that this pdf has its peak at $(x_1, x_2) = (1.0, 1.0)$. We can see that in the case of the Gaussian
the traces of $x_1$ and $x_2$ are stable and fluctuate around the region of high posterior. In the case of
the Rosenbrock pdf, the traces show that the chain has not converged, as the values of $x_1$ and $x_2$
fluctuate wildly, indicating that the chain is still exploring the remote regions of this highly
degenerate pdf.[1] This indicates that the chain length must be $N \gg 10^5$ to sample this pdf.
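The two test pdfs of equations 1.10 and 1.11 are easy to code as log-densities. In this sketch the default values σ1 = σ2 = 1 are assumptions (the text specifies only the correlation coefficient used for Figure 1.4), and the function names are illustrative.

```python
import math

def log_gauss2d(x1, x2, sigma1=1.0, sigma2=1.0, r=0.9):
    """Logarithm of the correlated 2D Gaussian pdf of eq. 1.10."""
    q = ((x1 / sigma1) ** 2 + (x2 / sigma2) ** 2
         - 2.0 * r * x1 * x2 / (sigma1 * sigma2)) / (1.0 - r * r)
    norm = 2.0 * math.pi * sigma1 * sigma2 * math.sqrt(1.0 - r * r)
    return -0.5 * q - math.log(norm)

def log_rosenbrock(x1, x2):
    """Logarithm of the (unnormalized) Rosenbrock "banana" pdf of eq. 1.11."""
    return -0.05 * (100.0 * (x2 - x1 * x1) ** 2 + (1.0 - x1) ** 2)

# the ridge x2 = x1^2 is narrow: moving off it by one unit in x2 costs
# far more log-probability than moving four units along it from the peak
on_ridge = log_rosenbrock(-3.0, 9.0)
off_ridge = log_rosenbrock(-3.0, 8.0)
print(on_ridge, off_ridge)
```

The contrast between the two printed values shows why an isotropic step proposal is inefficient here: steps large enough to travel along the ridge almost always land off it and are rejected.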
This highlights two important issues: 1) we need a better algorithm than Metropolis with an isotropic
proposed step distribution to handle sampling of highly degenerate posterior distributions and 2) we
need to have an objective criterion for chain convergence. We will consider these issues in the next two
sections.

[1] The rough visual rule to gauge convergence is that traces of parameters should look like horizontal
“hairy caterpillars” for a well converged chain.

[Figure 1.4 (two panels, each titled “Metropolis”): the upper plots show parameter traces versus sample
number; the lower plots, titled “MCMC samples vs target distribution”, show the samples in the x-y plane
with the 68.27%, 90%, and 99% contours.]

Figure 1.4: In each panel the upper plot shows traces of 2 parameters of the sampled pdf (green and blue
curves), while the bottom plot shows the distribution of the chain samples in the 2D parameter space along
with the 68.27%, 90%, and 99% confidence contours. $10^5$ samples were generated using the Metropolis
algorithm with a uniform isotropic step proposal distribution; the plotted chain was obtained by thinning
the original chain by taking every 10th sample. Left panel: sampling of a 2D Gaussian pdf (eq. 1.10) with
the correlation coefficient r = 0.9. Right panel: the result of sampling the Rosenbrock “banana” pdf
(eq. 1.11). The trace in the left plot indicates that the chain has converged for
$N_{\rm sample} = 10^5$, while the trace in the right plot shows that the chain sampling the Rosenbrock pdf
is far from convergence for the same number of samples and step proposal distribution.

1.5 An affine-invariant MCMC sampling algorithm

Goodman & Weare (2010, hereafter GW10; see http://msp.org/camcos/2010/5-1/p04.xhtml) have developed a
simple MCMC sampling algorithm, which is efficient for distributions that exhibit strong degeneracies
(narrow ridges in the pdf), such as the Rosenbrock density pdf discussed above, or pdfs with multiple
peaks. In this method, one initializes a number of walkers distributed with a multi-variate Gaussian
probability around a starting point.[3] Then the MCMC chain is constructed in a way similar to the
Metropolis algorithm: proposing a step, estimating the probability at the proposed location, and then
accepting the step with an acceptance probability. The difference from Metropolis is in the way the step
is proposed and how the acceptance probability is calculated (see §2 of GW10 for more details and detailed
pseudo-code of the algorithm, but all details needed to code up the algorithm are below):

• In this method a step from location $x_i$ to $x_{i+1}$ is proposed as a stretch move:
$x'_i = x_j + z_r (x_i - x_j)$, where $x_j$ is the current location of another randomly chosen walker (but
not the one we are currently updating) and $z_r$ is a random number drawn from the pdf
$g(z) \propto 1/\sqrt{z}$ for z in the interval [1/a, a] and g(z) = 0 outside this interval, where GW10
suggest a = 2.

• The proposed stretch move is then accepted with probability
$p_{\rm acc} = \min[1.0,\; z_r^{D-1}\,\pi(x'_i)/\pi(x_i)]$, where π is the target pdf the chain is
supposed to sample and D is the number of components of x (i.e., the number of dimensions of the pdf we
are sampling; e.g., D = 2 for the Rosenbrock banana pdf above).

[3] This by itself is not a new or distinct feature, as multiple chains (aka “walkers”) can be used in the
Metropolis-Hastings algorithm too.

[Figure 1.5 (two panels, titled “Goodman & Weare 2010 sampler”). Left panel: traces of $x_1$, $x_2$ versus
$N_{\rm sample}/100$. Right panel: samples in the $x_1$-$x_2$ plane with the 68.27% and 99% contours.]

Figure 1.5: The result of sampling the Rosenbrock “banana” pdf (eq. 1.11) using the GW10 affine-invariant
algorithm run until the Gelman-Rubin convergence indicator $R_{GR} = R - 1$ for both $x_1$ and $x_2$ was
< 0.05. The chain was sampled with 100 walkers that were initialized around $(x_1, x_2) = (0.0, 0.0)$
(note that this is not the peak of the pdf, which is at (1, 1)) with a Gaussian distribution with rms
dispersion of 0.1. Left panel: the traces for $x_1$ and $x_2$ (only every 100th sample is shown). Right
panel: the distribution of the resulting chain samples in the $x_1$-$x_2$ plane. Comparison to the
corresponding distribution obtained by the Metropolis algorithm after $10^5$ samples in Figure 1.4 shows
that the tails of the Rosenbrock pdf were severely undersampled in the latter case, as was also indicated
by the non-converged trace. The trace in the left panel of this figure appears much more relaxed, although
fluctuations are still quite large, which indicates that more stringent convergence may be required for
statistics sensitive to the tails of the distribution. This is also manifested in the fact that the
Gelman-Rubin convergence indicator R was still slowly converging to unity (Figure 1.6) when this chain was
stopped.
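The stretch-move update described above can be sketched in a few lines. This is a minimal serial illustration, not the code used for the figures: the function names, the use of log-pdf values, and the number of steps in the example run are assumptions, while a = 2, the 100 walkers, and the Gaussian initialization with rms 0.1 around (0, 0) follow the text. The variate z is drawn by inverting the cumulative distribution of g(z).

```python
import math
import random

def sample_z(a, rng):
    """Draw z from g(z) proportional to 1/sqrt(z) on [1/a, a] (inverse transform)."""
    return ((a - 1.0) * rng.random() + 1.0) ** 2 / a

def stretch_step(walkers, logp, log_pdf, a, rng):
    """Update every walker once, in place, with the stretch move."""
    n_walkers, ndim = len(walkers), len(walkers[0])
    for i in range(n_walkers):
        # pick another walker j != i
        j = rng.randrange(n_walkers - 1)
        if j >= i:
            j += 1
        z = sample_z(a, rng)
        x_new = [walkers[j][d] + z * (walkers[i][d] - walkers[j][d])
                 for d in range(ndim)]
        logp_new = log_pdf(x_new)
        # accept with probability min(1, z^(D-1) pi(x_new)/pi(x_i))
        log_acc = (ndim - 1) * math.log(z) + logp_new - logp[i]
        if log_acc >= 0.0 or rng.random() < math.exp(log_acc):
            walkers[i], logp[i] = x_new, logp_new

def run_gw10(log_pdf, x_start, n_walkers, n_steps, a=2.0, rms=0.1, seed=None):
    """Run the ensemble and return the list of all walker positions."""
    rng = random.Random(seed)
    walkers = [[xs + rng.gauss(0.0, rms) for xs in x_start]
               for _ in range(n_walkers)]
    logp = [log_pdf(w) for w in walkers]
    chain = []
    for _ in range(n_steps):
        stretch_step(walkers, logp, log_pdf, a, rng)
        chain.extend(list(w) for w in walkers)  # rejected walkers repeat
    return chain

def log_rosenbrock(x):
    """Log of the Rosenbrock "banana" pdf (eq. 1.11)."""
    return -0.05 * (100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2)

chain = run_gw10(log_rosenbrock, x_start=[0.0, 0.0], n_walkers=100,
                 n_steps=200, seed=5)
print(len(chain))
```

The short run above only exercises the code; a converged chain for this pdf needs far more steps, as discussed in the text.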
Figure 1.5 shows the result of sampling the Rosenbrock “banana” pdf with the GW10 sampler, which was
run until the maximum Gelman-Rubin convergence indicator (see next section) among the two parameters
$x_1$ and $x_2$ became smaller than 1.05 (ideal convergence would correspond to unity). This required
$N \approx 1.7\times10^7$ samples split among 100 individual chains (“walkers”) advanced in parallel
during each step of the algorithm. The chain was sampled with 100 walkers that were initialized around
$(x_1, x_2) = (0.0, 0.0)$ (note that this is not the peak of the pdf, which is at (1, 1)) with a Gaussian
distribution with rms dispersion of 0.1. One can see a much better converged trace and much better sampled
tails of the pdf compared to the right panel of Figure 1.4.
An attempt to use the Metropolis algorithm with 100 walkers, with the same initial distribution of walkers
and a proposal step distribution uniform in [−1, 1] around the current location, results in much slower
convergence, as will be discussed below (see Figure 1.6).

1.6 The chain convergence criteria

A number of convergence criteria are considered in the literature. Although specifics vary substantially, the
general idea behind these criteria is testing stationarity of the chain parameter distribution (e.g., Gelman &
Rubin, 1992) or measuring the auto-correlation length of the chain samples (Dunkley et al., 2005; Goodman
& Weare, 2010; Foreman-Mackey et al., 2013) compared to the total length of the chain.
[Figure 1.6 (two panels): $R_{GR}$ on a logarithmic vertical axis versus iteration/1000.]

Figure 1.6: The Gelman-Rubin convergence indicator R (what is actually plotted is R − 1) as a function
of step in the chain (every 1000th step is used). Note that 100 walkers have been advanced during each
step, so the effective number of samples at the end of the shown chain is $1.8\times10^7$. The left panel
shows the run with the Metropolis algorithm, while the right panel shows the run using the GW10 algorithm.
Both runs were initialized with the same distribution of walkers (a Gaussian with rms dispersion 0.1,
centered on (0, 0)). Clearly, the convergence indicated by $R_{GR}$ is much faster for the GW10 sampler
due to its innate insensitivity to the strong degeneracies of the pdf. Even for the GW10 algorithm, the
indicator shows that the chain is still slowly converging, and this number of samples may not be
sufficient for certain statistics sensitive to the tails of the distribution, despite the large number of
samples.

The most commonly used indicator is that of Gelman & Rubin (1992; see also Brooks & Gelman 1998),
and we will focus on it here. The idea behind this indicator is that if we have a number of individually
advanced chains (see Giakoumatos et al., 1999, for a generalization of the indicator to a single chain), we
can compare the variance we get within each chain to the variance we get among different chains. Perfect
convergence would correspond to the “within chain” and “between chain” values of the variance matching
each other. Thus, the Gelman-Rubin indicator is defined as the ratio of the two variances:
$$R = \frac{V}{W} = \frac{N_w+1}{N_w}\,\frac{\sigma_+^2}{W} - \frac{N-1}{N_w N}, \qquad (1.12)$$

where $N_w$ is the number of walkers (independent chains) and $N$ is the total length of each of the
individual chains,

$$\sigma_+^2 = \frac{N-1}{N}\,W + \frac{B}{N}; \qquad (1.13)$$

if we denote the random variable vector by $x$ and denote by $x_{jt}$ the $t$th of the $N$ steps of chain
$j$, the “between-walker” variance is

$$\frac{B}{N} = \frac{1}{N_w-1}\sum_{j=1}^{N_w} (\bar{x}_{j.} - \bar{x}_{..})^2 \qquad (1.14)$$

and the “within chain” variance W is

$$W = \frac{1}{N_w(N-1)}\sum_{j=1}^{N_w}\sum_{t=1}^{N} (x_{jt} - \bar{x}_{j.})^2. \qquad (1.15)$$
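Equations 1.12 to 1.15 translate directly into code for a single parameter. The sketch below is illustrative: the function name and the synthetic test chains (independent Gaussian draws, once with a common mean and once with deliberately offset means) are assumptions, not chains from the text.

```python
import random

def gelman_rubin(chains):
    """Gelman-Rubin R for one parameter (eqs. 1.12-1.15).
    chains: list of Nw lists, each holding N samples of the parameter."""
    n_w, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / n_w
    # between-chain variance B/N (eq. 1.14)
    b_over_n = sum((m - grand_mean) ** 2 for m in means) / (n_w - 1)
    # within-chain variance W (eq. 1.15)
    w = sum(sum((x - m) ** 2 for x in c)
            for c, m in zip(chains, means)) / (n_w * (n - 1))
    sigma2_plus = (n - 1) / n * w + b_over_n  # eq. 1.13
    return (n_w + 1) / n_w * sigma2_plus / w - (n - 1) / (n_w * n)  # eq. 1.12

rng = random.Random(7)
# chains that all sample the same unit Gaussian
converged = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(10)]
# chains stuck around different means, mimicking a lack of convergence
stuck = [[rng.gauss(float(j), 1.0) for _ in range(2000)] for j in range(10)]
r_converged = gelman_rubin(converged)
r_stuck = gelman_rubin(stuck)
print(r_converged, r_stuck)
```

For several parameters the indicator is evaluated for each parameter separately and the largest value is monitored, as done for $x_1$ and $x_2$ in the Rosenbrock example.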

Figure 1.6 shows the convergence indicator $R_{GR} \equiv R - 1.0$ calculated after every 1000th step as a
function of step in the chains run with the Metropolis and GW10 algorithms for the Rosenbrock pdf using
$N_w = 100$ independent chains. The difficulty of this pdf is apparent in the slow convergence of
$R_{GR}$ to zero. However, the convergence of the GW10 method is much faster than that of the Metropolis
algorithm started from the same initial distribution of walkers. Although it is possible to improve the
performance of the Metropolis algorithm for this particular problem by tuning the step proposal
distribution and making it anisotropic along the local degeneracy direction, or by making a variable
transformation so that the local posterior in the new variables is not strongly degenerate (e.g., as
implemented in CosmoMC, see Lewis, 2013), this comparison shows that the GW10 algorithm performs much
better without any such tuning and is thus much more general.

[Figure 1.7, titled “GW10 parallel sampler timings”: wall clock time (sec) versus $N_{\rm proc}$ on
logarithmic axes; legend: Amdahl's law, ideal ∝ 1/Nproc, actual.]

Figure 1.7: The wall-clock execution time of the parallel implementation of the GW10 algorithm sampling
the 2D Rosenbrock pdf for the finite number of samples N = 500 with $N_w = 1600$ for different numbers of
processors. The red points show the actual (non-)scaling. The parallel execution affords no speed-up in
this case, because this simple pdf is too cheap to compute to really benefit from parallelization (the
overhead associated with organizing parallel communications swamps any speed-up from parallelization),
especially for the fast vectorized version of the sampler. To simulate an expensive pdf, a call to a sleep
function was added to the model pdf routine, which makes computation of the posterior more expensive. The
green, magenta and blue points show wall-clock time as a function of the number of processors,
$N_{\rm proc}$, as the pdf calculation is made more expensive. The dashed blue line shows the ideal
parallel speed-up for the most expensive pdf case. The solid lines connecting the points show predictions
of Amdahl’s law for the actual speed-up using actual timings of the serial and parallel portions of the
code in each case. As the total execution time becomes dominated more and more by the parallel part of the
code, the speed-up scaling with $N_{\rm proc}$ approaches ideal scaling.

1.7 Parallelization of the Goodman & Weare (2010) algorithm

In principle, when one uses multiple individual chains, MCMC sampling can be trivially parallelized by
running each chain in parallel (communicating the “within chain” means and variances for the convergence
indicator estimate). However, given that effectively every chain has to converge, we are simply increasing
the overall number of MCMC samples available, not reducing the time to convergence. Cutting the time to
convergence is a challenging task because it requires parallelization of the chain computation itself,
while this computation is sequential by the nature of MCMC.
[Figure 1.8: $r^2\xi(r)$ ($h^{-2}\,{\rm Mpc}^2$) versus $r$ ($h^{-1}$ Mpc); legend: correlation function,
a ΛCDM model, DR11 ξ(r) re-con.]

Figure 1.8: Correlation function for the best fit parameters α, $B^2$, $a_1$, $a_2$, $a_3$ (magenta line)
compared to the DR11 data from Anderson et al. (2014).

This issue is solved to a certain extent in the GW10 algorithm (see Goodman & Weare, 2010, for a detailed
description of the parallelization algorithm). Suppose we split the walkers in the GW10 algorithm into two
equal-size subsets of size $N_w/2$ (where $N_w$ is the total number of walkers, which should be divisible
by 2). The walker update will now consist of a loop over the two subsets and an inner loop over the
walkers in each subset, during which only walkers from the opposite (complementary) subset are used for
the stretch moves of the current subset. That inner loop can now be parallelized among processors.[4] For
parallelization on more processors in the GW10 algorithm it is thus advantageous to have more independent
chains (i.e., larger $N_w$). Increasing $N_w$ does have a significant drawback: each chain will have its
own burn-in period that will need to be discarded. Thus, the number of discarded samples will increase
with increasing $N_w$, which may be a problem when computation of the posterior π is very expensive.
Nevertheless, parallelization does allow one to speed up the time to convergence for difficult-to-sample
posteriors, such as the Rosenbrock pdf.
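The split-ensemble update can be sketched so that the inner loop becomes a single map over independent tasks. This is a hedged illustration rather than the parallel code referred to in the text: the names `update_walker`, `split_step`, and `map_fn` are assumptions, the built-in serial `map` is used by default, and the toy 2D Gaussian target is chosen only to keep the example short.

```python
import math
import random

def update_walker(task):
    """Stretch move for one walker. It depends only on the walker's own
    state and on a walker from the complementary subset, so the calls
    within one subset are independent of each other."""
    x, logp_x, partner, z, u, log_pdf = task
    ndim = len(x)
    x_new = [partner[d] + z * (x[d] - partner[d]) for d in range(ndim)]
    logp_new = log_pdf(x_new)
    log_acc = (ndim - 1) * math.log(z) + logp_new - logp_x
    if log_acc >= 0.0 or u < math.exp(log_acc):
        return x_new, logp_new
    return x, logp_x

def split_step(walkers, logp, log_pdf, a, rng, map_fn=map):
    """One step with the ensemble split into two halves; map_fn could be
    replaced by a parallel map with the same calling convention."""
    half = len(walkers) // 2
    subsets = (range(0, half), range(half, len(walkers)))
    for k in (0, 1):
        current, other = subsets[k], subsets[1 - k]
        tasks = []
        for i in current:
            j = rng.choice(other)  # partner from the complementary subset
            z = ((a - 1.0) * rng.random() + 1.0) ** 2 / a
            tasks.append((walkers[i], logp[i], walkers[j], z,
                          rng.random(), log_pdf))
        # this is the inner loop that can be distributed among processors
        for i, (x, lp) in zip(current, map_fn(update_walker, tasks)):
            walkers[i], logp[i] = x, lp

rng = random.Random(4)
log_pdf = lambda x: -0.5 * (x[0] ** 2 + x[1] ** 2)  # toy 2D Gaussian target
walkers = [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(20)]
logp = [log_pdf(w) for w in walkers]
samples = []
for step in range(3000):
    split_step(walkers, logp, log_pdf, 2.0, rng)
    if step >= 500:  # discard the initial steps as burn-in
        samples.extend(w[0] for w in walkers)
print(len(samples))
```

All random numbers are drawn before the map call, so the result does not depend on the order in which the tasks are evaluated. With a process-based parallel map the target function would additionally need to be picklable, i.e., defined at module level rather than as a lambda.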
An efficient parallel implementation of the Goodman & Weare (2010) algorithm is provided by the
publicly available emcee code[5] by Foreman-Mackey et al. (2013). However, the algorithm is sufficiently
simple, requiring just a few additional lines of code compared to the Metropolis-Hastings algorithm, that
it is instructive to write one’s own code or examine the simple code used to produce Figure 1.5.
Figure 1.7 shows the wall-clock execution time for the code in which the 2D Rosenbrock pdf was sampled for
the finite number of samples N = 500 with $N_w = 1600$ for different numbers of processors as red points.
The parallel execution affords no speed-up in this case, because this simple pdf is too cheap to compute
to really benefit from parallelization (the overhead associated with organizing parallel communications
swamps any speed-up from parallelization), especially for the fast vectorized version of the sampler.
However, for more expensive posteriors parallel execution would reduce the wall-clock time. To simulate an
expensive pdf, a call to a sleep function was added, which makes computation of the posterior more
expensive. The green, magenta and blue points show wall-clock time as a function of the number of
processors, $N_{\rm proc}$, as the pdf calculation is made more expensive. The dashed blue line shows the
ideal parallel speed-up for the most expensive pdf case. One can see that the actual parallelization of
the algorithm is quite good and the actual speed-up deviates from the ideal only for
$N_{\rm proc} > 10$. This deviation is due to the contribution of the unavoidable serial parts of the
computation to the total time. The solid lines connecting the points show predictions of Amdahl’s law for
the actual speed-up using actual timings of the serial and parallel portions of the code in each case. As
the total execution time becomes dominated more and more by the parallel part of the code, the speed-up
scaling with $N_{\rm proc}$ approaches ideal scaling.
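Amdahl's law itself is a one-line formula: only the parallel portion of the run time scales as 1/Nproc. The sketch below uses hypothetical timings chosen for illustration; they are not the measured timings behind Figure 1.7.

```python
def amdahl_time(t_serial, t_parallel, n_proc):
    """Predicted wall-clock time when only t_parallel scales as 1/n_proc."""
    return t_serial + t_parallel / n_proc

def amdahl_speedup(t_serial, t_parallel, n_proc):
    """Speed-up relative to a single processor."""
    return (t_serial + t_parallel) / amdahl_time(t_serial, t_parallel, n_proc)

# hypothetical timings in seconds: a cheap pdf dominated by the serial
# part, and an expensive pdf dominated by the parallel part
for label, t_s, t_p in [("cheap pdf", 5.0, 5.0), ("expensive pdf", 1.0, 99.0)]:
    speedups = [round(amdahl_speedup(t_s, t_p, n), 2) for n in (1, 2, 4, 8, 16)]
    print(label, speedups)
```

The larger the parallel fraction, the longer the speed-up follows the ideal scaling before it saturates at (t_serial + t_parallel)/t_serial.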

[4] For languages supporting vector operations, such as Fortran 90 and Python, this inner loop can also be
completely vectorized, i.e., loop iterations can be avoided. See example MPI code for Rosenbrock banana
and serial vectorized code for the BAO analysis.
[5] Available at https://github.com/dfm/emcee

[Figure 1.9 (three panels): horizontal axes α, $a_1$, $a_1$; the labeled vertical axes include $B^2$ and
$a_3$.]

Figure 1.9: Distribution of sampled values of the parameters from which best fit values and their
statistics were calculated. Ellipses show the 1σ and 2σ confidence intervals. The nuisance parameters
exhibit degeneracies, but there is little correlation between the bias factor $B^2$ and the dilation
parameter α that carries all of the cosmological information in the BAO analysis, as could be expected
from parameters controlling vertical normalization and horizontal shift.

1.8 A practical example: cosmological constraints from the BAO

With an implementation of an MCMC sampler it is relatively straightforward to apply it to interesting
cases. Here we will consider cosmological constraints from the BAO feature using measurements of the
reconstructed 2-point correlation function of the Luminous Red Galaxies (LRGs) from the SDSS DR11 data
by the BOSS collaboration (Anderson et al., 2014, hereafter A14).6
To get their cosmological constraints, Anderson et al. (2014) do not fit the observed LRG correlation
function with a fully self-consistent model ξ(r). Instead, they use a model ξ(r) computed for a fiducial
cosmology (assumed to be close to the actual cosmology) and simply use it as a template to measure the
position of the BAO peak in r. At the same time, the information about the shape of ξ(r) measured in
observations is essentially ignored, in large part because it is not measured very accurately at these large
scales and partly because the main cosmological information is in the peak and the collaboration wanted to
obtain a constraint attributable solely to the BAO feature (N. Padmanabhan, priv. comm.).
Choosing the same fiducial cosmology as in A14 (see their §1): Ωm0 = 0.274, ΩΛ = 1 − Ωm0 = 0.726,
h = 0.7, Ωb h^2 = 0.0224, ns = 0.95, σ8 = 0.8, I computed the model correlation function using eqs. 25 and 26
in A14 with the parameters they specify in the text, using a function7 that returns the
Eisenstein & Hu (1998) approximation for the power spectrum and integrating it to get the correlation
function. The full model correlation function is constructed using eq. 27 in A14, with a free bias parameter B_ξ^2
and an additive polynomial that models away the smooth shape of ξ(s) at the BAO scale via nuisance parameters.
These parametrize the unknown bias of the LRG galaxies, the effects of the redshift-space distortions, and
systematic errors in the binned measured correlation function. The correlation function is shifted left and
right in r via the “dilation” parameter α.
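
The template model described above can be sketched as follows. The functional form B_ξ^2 ξ_template(αs) + a1/s^2 + a2/s + a3 is my reading of eq. 27 in A14 and should be checked against that paper; the function name `xi_fit` and the Gaussian-bump template are hypothetical stand-ins, not the actual fiducial correlation function.

```python
import numpy as np


def xi_fit(s, params, s_template, xi_template):
    """Template model (assumed form): B2 * xi_template(alpha * s) + a1/s^2 + a2/s + a3."""
    alpha, B2, a1, a2, a3 = params
    # Dilation: evaluate the fixed template at rescaled separations alpha * s.
    xi_dilated = np.interp(alpha * s, s_template, xi_template)
    # Amplitude (bias) factor plus the smooth additive nuisance polynomial.
    return B2 * xi_dilated + a1 / s**2 + a2 / s + a3


# Toy template (hypothetical): a Gaussian bump at s = 105 standing in for the BAO peak.
s_template = np.linspace(20.0, 250.0, 2301)
xi_template = np.exp(-0.5 * ((s_template - 105.0) / 8.0) ** 2)

# With alpha = 1.05 and no polynomial terms the peak moves to s = 105 / alpha.
s = np.linspace(50.0, 150.0, 1001)
model = xi_fit(s, (1.05, 1.0, 0.0, 0.0, 0.0), s_template, xi_template)
peak = s[np.argmax(model)]
print(peak)
```

In this toy case the peak lands close to s = 100, which illustrates how α shifts the feature horizontally while B_ξ^2 only rescales it vertically.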
The vectorized GW10 MCMC sampler implemented in python was then used to sample the posterior pdf
of the model parameter α and nuisance parameters B_ξ^2, a1, a2, a3, defined by the pdf:

\[
\pi(\mathbf{p}|\mathbf{d}) \propto \exp\left[-\frac{1}{2}(\mathbf{m}-\mathbf{d})^T\,\mathbf{C}^{-1}\,(\mathbf{m}-\mathbf{d})\right], \tag{1.16}
\]

where d = {ξ(s_i)} is the measured correlation function vector, m = {ξfit(s_i)} is the corresponding vector of
the model correlation function that depends on parameters p = {α, B_ξ^2, a1, a2, a3}, and C is the covariance matrix of
the {ξ(s)} measurements.
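
The logarithm of eq. 1.16 is what a sampler typically evaluates. The sketch below is one possible way to code it, not the author's implementation: the helper name `make_ln_posterior` and the toy data are hypothetical, and the result is defined only up to an additive constant, since eq. 1.16 is written without an explicit prior or normalization term.

```python
import numpy as np


def make_ln_posterior(data, cov):
    """Return a function evaluating ln pi = -0.5 (m - d)^T C^{-1} (m - d) + const."""
    # Invert the covariance matrix once, since the pdf is evaluated many times.
    cov_inv = np.linalg.inv(cov)

    def ln_posterior(model):
        # model has shape (nbins,) for one walker or (nwalkers, nbins) for many.
        r = np.asarray(model) - data
        return -0.5 * np.einsum("...i,ij,...j->...", r, cov_inv, r)

    return ln_posterior


# Toy data (hypothetical): two bins with uncorrelated errors.
data = np.array([1.0, 2.0])
cov = np.array([[4.0, 0.0], [0.0, 1.0]])
ln_post = make_ln_posterior(data, cov)

one_walker = ln_post(np.array([3.0, 2.0]))
many_walkers = ln_post(np.array([[1.0, 2.0], [3.0, 2.0], [1.0, 4.0]]))
print(one_walker, many_walkers)
```

Accepting an array of model vectors lets all walkers be evaluated in one call, which is convenient for a vectorized sampler.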
6 The correlation function and covariance matrix are available on the DR11 website: https://www.sdss3.org/science/boss_publications.php
7 I use cosmological functions implemented in Benedikt Diemer’s colossus python code available at: http://www.benediktdiemer.com/code/

The resulting correlation function for the best fit parameters is shown in Figure 1.8, while Figure
1.9 shows the distribution of sampled points in parameter space. The walkers initially were placed in a
reasonable location (close to the best fit parameters, which could be determined by a rough “fit-by-eye” to
the correlation function). The chains converge after ≈ 500 steps for 500 walkers and give α = 1.003 ± 0.0088,
in good agreement (within 1σ) with the constraints quoted in Table 4 of Anderson et al. (2014) for the
post-recon ξ0 fit for DR11. The constraints for other parameters can be seen in Figure 1.9, but their values
are of no interest for cosmology. The figure shows that the nuisance parameters exhibit degeneracies, which
indicates that in principle similar results could be achieved with a different parametrization with fewer
parameters.
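
A constraint of the form α = mean ± scatter can be extracted from the chain along the following lines. This is a sketch under stated assumptions: the chain is taken to be stored as an array of shape (nsteps, nwalkers, ndim), the summary is taken to be the sample mean and standard deviation after discarding burn-in steps, and the samples below are synthetic Gaussian draws, not the actual BAO chain.

```python
import numpy as np


def summarize_chain(chain, nburn):
    """Mean, standard deviation and covariance of post-burn-in samples.

    chain is assumed to have shape (nsteps, nwalkers, ndim).
    """
    # Drop the burn-in steps and merge all walkers into one list of samples.
    flat = chain[nburn:].reshape(-1, chain.shape[-1])
    mean = flat.mean(axis=0)
    std = flat.std(axis=0, ddof=1)
    # Off-diagonal covariance terms quantify parameter degeneracies.
    cov = np.cov(flat, rowvar=False)
    return mean, std, cov


# Synthetic stand-in chain (hypothetical): 1000 steps, 50 walkers, 2 parameters.
rng = np.random.default_rng(0)
chain = rng.normal(loc=[1.0, 1.2], scale=[0.01, 0.1], size=(1000, 50, 2))

mean, std, cov = summarize_chain(chain, nburn=500)
print(mean, std)
```

The covariance matrix returned here is also what one would use to draw 1σ and 2σ ellipses of the kind shown in Figure 1.9.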
Bibliography

Anderson, L., Aubourg, É., Bailey, S., et al. 2014, The clustering of galaxies in the SDSS-III Baryon
Oscillation Spectroscopic Survey: baryon acoustic oscillations in the Data Releases 10 and 11 Galaxy
samples, MNRAS, 441, 24
Brooks, S. & Gelman, A. 1998, General Methods for Monitoring Convergence of Iterative Simulations,
Journal of Computational and Graphical Statistics, 7, 434, http://www.stat.columbia.edu/~gelman/
research/published/brooksgelman2.pdf
D’Agostini, G. 2005, Fits, and especially linear fits, with errors on both axes, extra variance of the data
points and other complications, physics/0511182

Dunkley, J., Bucher, M., Ferreira, P. G., Moodley, K., & Skordis, C. 2005, Fast and reliable Markov chain
Monte Carlo technique for cosmological parameter estimation, MNRAS, 356, 925
Foreman-Mackey, D., Hogg, D. W., Lang, D., & Goodman, J. 2013, emcee: The MCMC Hammer, PASP,
125, 306
Gelman, A. & Rubin, D. 1992, Inference from iterative simulation using multiple sequences, Statistical
Science, 7, 457, http://www.stat.columbia.edu/~gelman/research/published/itsim.pdf
Giakoumatos, S., Vrontos, I., Dellaportas, P., & Politis, D. N. 1999, An MCMC Convergence Diagnostic using
Subsampling, J. Comput. Graph. Statistics, 8, 431
Goodman, J. & Weare, J. 2010, Ensemble samplers with affine invariance., Communications in Applied
Mathematics and Computational Science, 5, 65
Hastings, W. 1970, Monte Carlo Sampling Methods Using Markov Chains and Their Applications,
Biometrika, 57, 97
Hoekstra, H., Herbonnet, R., Muzzin, A., Babul, A., Mahdavi, A., Viola, M., & Cacciato, M. 2015, The
Canadian Cluster Comparison Project: detailed study of systematics and updated weak lensing masses,
MNRAS submitted (arXiv/1502.01883)
Lewis, A. 2013, Efficient sampling of fast and slow cosmological parameters, Phys. Rev. D, 87, 103529
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. 1953, Equation of State
Calculations by Fast Computing Machines, J. Chem. Phys., 21, 1087

Vikhlinin, A., Burenin, R. A., Ebeling, H., Forman, W. R., Hornstrup, A., Jones, C., Kravtsov, A. V.,
Murray, S. S., Nagai, D., Quintana, H., & Voevodkin, A. 2009, Chandra Cluster Cosmology Project. II.
Samples and X-Ray Data Reduction, ApJ, 692, 1033
