PDF Sampling: Markov Chain Monte Carlo
The need to sample a given distribution is one of the most common tasks arising in research. For example, in
generic Monte Carlo numerical integration (i.e., integration using random samples), the integral is approximated
as
$$I = \int_{V_x} g(x)\,\pi(x)\,dx \approx \frac{V_x}{N}\sum_{i=1}^{N} g(x_i)\,\pi(x_i), \tag{1.1}$$
where x is a vector with the number of components equal to the number of dimensions, Vx is integration
volume, and {xi } are random samples in Vx . We could distribute the samples uniformly over the integration
volume, but the integral converges very slowly in this case. Plus, most of the samples may fall in the regions
where g(xi )π(xi ) is very small and does not contribute significantly to the sum. The number of points
required for the integration with a certain accuracy increases exponentially with the number of dimensions
and the problem thus suffers from the curse of dimensionality.
As in Gaussian quadrature methods of numerical integration, the integration would be much more efficient
if the sample points were distributed not uniformly but according to the distribution π(x) itself. In this case
the integral can be approximated as
$$I \approx \frac{1}{N}\sum_{i=1}^{N} g(x_i), \tag{1.2}$$
and the number of points to reach a given integration accuracy is much smaller and does not grow exponen-
tially with the number of dimensions.
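As a rough illustration of equations 1.1 and 1.2, the sketch below estimates the same integral in both ways. Everything specific here is an assumption made for the example only: π(x) is taken to be a one-dimensional unit Gaussian, g(x) = x² (so the exact answer is 1), the integration volume is [−10, 10], and the function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(x):
    """Zero-mean, unit-variance Gaussian pdf, playing the role of pi(x)."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def g(x):
    """Example integrand factor; the integral of g*pi is the variance, 1."""
    return x**2

def integrate_uniform(n, rng, lo=-10.0, hi=10.0):
    """Eq. 1.1: samples spread uniformly over the volume V_x = hi - lo."""
    x = rng.uniform(lo, hi, size=n)
    return (hi - lo) / n * np.sum(g(x) * gaussian_pdf(x))

def integrate_pi_samples(n, rng):
    """Eq. 1.2: samples drawn from pi(x) itself."""
    x = rng.normal(0.0, 1.0, size=n)
    return np.mean(g(x))

rng = np.random.default_rng(0)
n = 100_000
est_uniform = integrate_uniform(n, rng)
est_pi = integrate_pi_samples(n, rng)
print(est_uniform, est_pi)  # both should be close to 1
```

In one dimension the two estimates are of comparable quality; the advantage of drawing from π(x) becomes decisive only as the number of dimensions grows and uniform samples increasingly miss the region where π(x) is significant.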
Another application is sampling of the posterior pdf, the problem often encountered in statistical analyses
of data. Given the likelihood of observational data d given a model M (x) that depends on a vector of
parameters x, L(d|M (x)) and pdf for the parameter values given some prior information (the prior pdf),
p(x|I), the posterior distribution according to the Bayes theorem is
$$\pi(x|d, I) = \frac{L(d|M(x))\,p(x|I)}{\int L(d|M(x))\,p(x|I)\,dx}. \tag{1.3}$$
Thus, we can reconstruct the posterior distribution of parameter values by randomly sampling the pdf
∝ L(d|M(x)) p(x|I).
Simple and efficient pdf sampling methods, such as rejection sampling or inverse transform sampling, exist
but require detailed knowledge of the probability distribution function (pdf) or its integral. If the shape of
the pdf is not well known, as is often the case when the pdf depends on many parameters, these
methods are often impractical. In this case, the method of choice is Markov Chain Monte Carlo (MCMC)
sampling. “Monte Carlo” refers to the use of random samples, while “Markov Chain” refers to the fact that
the sampling algorithm depends only on the previous sample (the so-called Markov process): the probability of
a step from x_i to x_{i+1} is p(x_{i+1} | {x_j}_{j=1}^{i}) = p(x_{i+1} | x_i), i.e., it depends only on x_i and x_{i+1}.
The key concept of the MCMC method is statistical equilibrium. The method was first developed by
physicists to model thermodynamic properties of particle systems, in which the approach to equilibrium depends
on interactions between particles. Likewise, in the MCMC method the equilibrium distribution of points {x_j}
that sample the target distribution is achieved by appropriately chosen transition probabilities. To reach
equilibrium the transition probability must be symmetric: p(x_{i+1}|x_i) = p(x_i|x_{i+1}). This condition is also
called the detailed balance condition.
We can write the transition probability as a product of the transition probability kernel, T(x_{i+1}|x_i),
properly normalized so that it integrates to unity, and the target pdf, π(x): p(x_{i+1}|x_i) = T(x_{i+1}|x_i) π(x_i).
Then the detailed balance condition reads:
$$T(x_{i+1}|x_i)\,\pi(x_i) = T(x_i|x_{i+1})\,\pi(x_{i+1}), \tag{1.4}$$
which physically means that the flux of samples xi → xi+1 is statistically balanced by the reverse flux
xi+1 → xi . Indeed, integrating over xi gives
$$\pi(x_{i+1})\int T(x_i|x_{i+1})\,dx_i = \pi(x_{i+1}) = \int T(x_{i+1}|x_i)\,\pi(x_i)\,dx_i, \tag{1.5}$$
i.e., if x is drawn from π, then the next sample drawn with probability satisfying the detailed balance will
also be drawn from π.
Different MCMC methods use different choices for the stepping rules and transition kernel probability
T (xi+1 |xi ), but they all must satisfy the detailed balance condition in order to sample the target pdf
faithfully.
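Detailed balance and its consequence can be checked numerically on a small discrete analogue. The sketch below is an illustration under assumptions not taken from the text: the states are points of a grid, the target is a Gaussian evaluated on that grid, the proposal is a step to the left or right neighbour with probability 1/2 each, and the acceptance rule is the Metropolis rule min(1, π_j/π_i). All names are hypothetical.

```python
import numpy as np

def metropolis_transition_matrix(pi):
    """Transition matrix T[i, j] = probability of a step i -> j for a
    nearest-neighbour proposal with Metropolis acceptance min(1, pi_j/pi_i)."""
    k = len(pi)
    T = np.zeros((k, k))
    for i in range(k):
        for j in (i - 1, i + 1):      # propose left or right with prob. 1/2
            if 0 <= j < k:
                T[i, j] = 0.5 * min(1.0, pi[j] / pi[i])
        T[i, i] = 1.0 - T[i].sum()    # rejected proposals keep the chain at i
    return T

# hypothetical discrete target: a Gaussian evaluated on a grid, normalized
grid = np.linspace(-4.0, 4.0, 41)
pi = np.exp(-0.5 * grid**2)
pi /= pi.sum()

T = metropolis_transition_matrix(pi)
flux = pi[:, None] * T                    # flux[i, j] = pi_i * T(j|i)
balanced = np.allclose(flux, flux.T)      # detailed balance: flux is symmetric
stationary = np.allclose(pi @ T, pi)      # pi is preserved by one step
print(balanced, stationary)
```

The diagonal entries matter: the probability of staying in place after a rejected proposal is part of the transition matrix, which is why rejected steps must be recorded in the chain.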
1. for i from 1 to N
       draw x_{i+1} using P(x_{i+1}|x_i)
       if π(x_{i+1}) > π(x_i): accept x_{i+1} as the next sample in the chain
       else:
           draw a random number r from a uniform distribution U[0, 1)
           if r < π(x_{i+1})/π(x_i): accept x_{i+1} as the next sample in the chain
           else: take x_i as the next sample in the chain
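The steps above can be sketched in Python as follows. This is a minimal illustration, not the code used for the figures: the target is the unit Gaussian started at x0 = 10, the proposal is assumed to be uniform within ±1 of the current position, the function names are hypothetical, and the comparison is done with logarithms of the pdf (an implementation choice that avoids underflow far from the peak).

```python
import numpy as np

def log_gaussian(x):
    """Log of the (unnormalized) target pdf: zero mean, unit variance."""
    return -0.5 * x**2

def metropolis(log_pdf, x0, n_steps, step=1.0, rng=None):
    """Sample log_pdf with a symmetric uniform proposal of half-width step."""
    rng = np.random.default_rng() if rng is None else rng
    chain = np.empty(n_steps + 1)
    chain[0] = x0
    x, logp = x0, log_pdf(x0)
    for i in range(n_steps):
        x_new = x + rng.uniform(-step, step)      # proposed step
        logp_new = log_pdf(x_new)
        # accept if pi(x_new) > pi(x), else with probability pi(x_new)/pi(x)
        if logp_new > logp or np.log(rng.uniform()) < logp_new - logp:
            x, logp = x_new, logp_new
        # on rejection x is unchanged and is recorded again (a duplicate)
        chain[i + 1] = x
    return chain

rng = np.random.default_rng(42)
chain = metropolis(log_gaussian, x0=10.0, n_steps=200_000, rng=rng)
samples = chain[1000:]                 # drop the initial transient
print(samples.mean(), samples.std())   # expect values near 0 and 1
```

Only the ratio of pdf values enters the acceptance test, so the normalization of the target never needs to be known.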
Figure 1.1 shows a simple 1d Gaussian with zero mean and unit variance sampled with the Metropolis-
Hastings algorithm starting at x0 = 10. The left panel shows the correct algorithm, while the right panel
shows what happens if one fails to include duplicate values of x_i in the chain when a proposed step is
rejected and only adds new positions to the chain. In this case, the detailed balance condition is violated and
the target pdf is not sampled correctly.
Chapter 1. PDF sampling: Markov Chain Monte Carlo 3
[Figure 1.1: two histogram panels of frequency versus x.]
Figure 1.1: Left panel: the histogram of the MCMC chain {x_i} of 10^6 samples produced using the Metropolis
algorithm (blue bins) compared to the target pdf (red line). Right panel: the distribution of samples in an
incorrect implementation of the Metropolis algorithm in which the sample positions were not duplicated
when a proposed step was rejected. In this case the detailed balance condition was violated and the target
Gaussian pdf is not sampled correctly.
It is always good to start the chain near the peak of the posterior. However, information about the
posterior is often limited, at least for some of the parameters of the problem. The initial guess can thus lie
well within a low-probability region. If steps are chosen reasonably, the chain will recover and, in fact, the
initial samples in the low-probability region are formally correct samples of the target pdf. Nevertheless,
these low-probability values are often extremely improbable for the finite length of the sample chain that one
generates in practice. For example, the Gaussian pdf shown in Figure 1.1 was sampled with a chain that
was started at x0 = 10, i.e., 10σ away from the peak. The probability of such a sample is ≈ 1.5 × 10^−23, and so
we would need a chain of length N ∼ 10^23 to make such a sample “normal.” For chains of smaller N
this starting value can bias estimates of the mean, rms dispersion, etc. Thus, in practice a certain number of
initial chain samples is discarded to avoid such biases. The initial range of improbable samples is called
the “burn-in” period of the chain. Determining this period is to a large extent a black art, and is handled
in conjunction with deciding on chain convergence (see below). A simple check is to see by how much one’s
estimates of the statistics of interest change after discarding a certain number of the initial samples.
The left plot in Figure 1.2 shows the initial 1000 steps of the chain sampling the Gaussian started from
x0 = 10. The chain clearly moves to probable values of the pdf after only ≈ 50−100 samples, but the initial
≈ 50 samples are highly improbable for a chain of length 1000 or even 10^6 and thus need to be
discarded to avoid biases.
An additional unavoidable feature of the chains is short-range correlations. Although the probability of the next
step depends only on the current location, the current location depended on the previous one, and so on.
This can be clearly seen in the left plot of Figure 1.2, as the initial location choice of x0 = 10 predetermined
high values of x until Nsample ∼ 50. These short-range correlations mean that samples in the chain are not
truly independent. To deal with this, chains are often “thinned” by selecting only every Nth sample, where
N is determined by the correlation length of the chain measured by the autocorrelation function.
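One way to choose the thinning interval is sketched below: estimate the normalized autocorrelation function of the chain and thin at the first lag where it falls below a threshold. The threshold of 0.1 and the function names are arbitrary choices for illustration, and the chain used here is a stand-in (an AR(1) process whose autocorrelation at lag k is known to be 0.9^k) rather than an actual MCMC chain.

```python
import numpy as np

def autocorrelation(chain, max_lag):
    """Normalized autocorrelation function of a 1D chain for lags 0..max_lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    var = np.mean(x * x)
    n = len(x)
    return np.array([np.mean(x[: n - k] * x[k:]) / var
                     for k in range(max_lag + 1)])

def thinning_interval(acf, threshold=0.1):
    """First lag at which the autocorrelation drops below threshold."""
    below = np.nonzero(acf < threshold)[0]
    return int(below[0]) if below.size else len(acf)

# stand-in correlated chain: AR(1) process with autocorrelation phi**k
rng = np.random.default_rng(1)
phi, n = 0.9, 200_000
noise = rng.normal(size=n) * np.sqrt(1.0 - phi**2)
chain = np.empty(n)
chain[0] = rng.normal()
for t in range(1, n):
    chain[t] = phi * chain[t - 1] + noise[t]

acf = autocorrelation(chain, max_lag=100)
n_thin = thinning_interval(acf)
thinned = chain[::n_thin]              # keep only every n_thin-th sample
print(n_thin, len(thinned))
```

For this stand-in the expected interval follows from 0.9^k < 0.1, i.e., k of about 22; a real chain would be passed to the same two functions after burn-in is removed.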
4 1.3. A practical example: line fit with errors in both variables and intrinsic scatter
[Figure 1.2: two panels showing the trace of x versus Nsample, for the first 1000 samples (left) and for 100000 samples (right).]
Figure 1.2: Left panel: the initial samples of the chain sampling the Gaussian pdf that was started 10σ away
from the mean. The chain recovers to the probable region after ≈ 100 steps, but the initial ≈ 50 samples are
highly improbable for a chain of length 1000 or even 10^6 and thus need to be discarded to avoid
biases. Right panel: the samples in the chain sampling the Gaussian for 10^5 samples. The chain convergence
is indicated by the stable distribution of samples around the mean x = 0.
1.3 A practical example: line fit with errors in both variables and
intrinsic scatter
As an illustration of how the simple Metropolis algorithm can be used in practice, consider the problem of
the Bayesian fit of a linear relation to a set of measurements, {x_i, y_i}, in which both x and y have significant
(Gaussian) errors and which may exhibit intrinsic scatter. The posterior distribution derived using the
Bayesian approach for this problem is (see D’Agostini, 2005):
$$\pi(m, c, s|x, y, I) = k \prod_i \frac{1}{\sqrt{s^2 + \sigma_{y_i}^2 + m^2\sigma_{x_i}^2}} \exp\left[-\frac{(y_i - m x_i - c)^2}{2(s^2 + \sigma_{y_i}^2 + m^2\sigma_{x_i}^2)}\right] p(m, c, s|I), \tag{1.9}$$
where m, c, s are the slope, normalization, and intrinsic scatter of the relation; k is normalization constant
(unknown, but this is irrelevant for the Metropolis algorithm), σxi and σyi are the errors of xi and yi , and
p(m, c, s|I) is the prior probability distribution for the values of the slope, normalization, and scatter.
The posterior can be sampled using a simple Metropolis algorithm and the resulting chain can be used to
calculate the best fit values of m, c, s and their confidence limits. The results of such a fit for the specific
case of comparing masses measured using the X-ray mass indicator by Vikhlinin et al. (2009) and a recent
measurement by Hoekstra et al. (2015) are shown in Figure 1.3. In this case, the chain was run with 10^6
samples, of which the first 1000 were discarded as burn-in, and the chain was thinned by selecting only every
100th sample.
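For sampling, equation 1.9 is most conveniently coded as the logarithm of the posterior, dropping the constant ln k. The sketch below assumes a flat prior on m, c, and s, restricted to s ≥ 0 (the restriction is an assumption of this example, motivated by the posterior depending on s only through s²). The data are synthetic and purely hypothetical; they are not the cluster masses discussed in the text.

```python
import numpy as np

def log_posterior(params, x, y, sigma_x, sigma_y):
    """Log of eq. 1.9 up to the constant ln k, for a flat prior with s >= 0."""
    m, c, s = params
    if s < 0.0:
        return -np.inf                      # flat prior restricted to s >= 0
    var = s**2 + sigma_y**2 + m**2 * sigma_x**2   # total variance per point
    resid = y - m * x - c
    return -0.5 * np.sum(np.log(var)) - 0.5 * np.sum(resid**2 / var)

# hypothetical synthetic data, for illustration only
rng = np.random.default_rng(3)
m_true, c_true, s_true = 2.0, 1.0, 0.5
n = 50
sigma_x = np.full(n, 0.3)
sigma_y = np.full(n, 0.4)
x_true = rng.uniform(0.0, 10.0, size=n)
y_true = m_true * x_true + c_true + rng.normal(0.0, s_true, size=n)
x = x_true + rng.normal(0.0, sigma_x)
y = y_true + rng.normal(0.0, sigma_y)

lp_true = log_posterior((m_true, c_true, s_true), x, y, sigma_x, sigma_y)
lp_off = log_posterior((1.0, c_true, s_true), x, y, sigma_x, sigma_y)
print(lp_true, lp_off)   # the true parameters should give the larger value
```

A function of this form can be handed to any Metropolis-type sampler that works with the log of the target pdf, with the three-component vector (m, c, s) as the sampled position.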
Figure 1.3: Left panel: weak lensing masses measured by Hoekstra et al. (2015) for the 10 clusters overlapping
with the sample used for cosmological analysis by Vikhlinin et al. (2009) vs the masses measured from the
X-ray indicator YX. The green dashed line shows the one-to-one relation between the masses, while the solid
blue line shows the best fit linear relation in the Bayesian fit using the posterior given by equation 1.9 with a flat
prior on m, c, and s, which accounts for errors in both directions and intrinsic scatter in the y-direction.
This fit gives the best fit slope value of m = 0.57 ± 0.25 (although the slope is consistent with unity at
the 95% confidence level) and relative normalization between masses of 0.87 ± 0.10, which is also consistent with
unity. Right panel: the distribution of the MCMC samples in the plane of slope and normalization.
parameters, strong and complicated degeneracies often exist among some of them and the chain convergence
may be very slow so that the number of samples required can be very large.
Figure 1.4 illustrates this by comparing convergence of the MCMC chain generated using the Metropolis
algorithm to sample a 2D Gaussian posterior with a significant correlation
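$$\pi(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}} \exp\left\{-\frac{0.5}{1 - r^2}\left[\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2} - 2\frac{r x_1 x_2}{\sigma_1\sigma_2}\right]\right\} \tag{1.10}$$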
and the result for the same chain length (Nsample = 10^5) sampling the Rosenbrock “banana” pdf:
$$\pi(x_1, x_2) = \exp\left\{-0.05\left[100(x_2 - x_1^2)^2 + (1 - x_1)^2\right]\right\}. \tag{1.11}$$
Note that this pdf has its peak at (x1, x2) = (1.0, 1.0). We can see that in the case of the Gaussian the traces
of x1 and x2 are stable and fluctuate around the region of high posterior. In the case of the Rosenbrock
pdf, the traces show that the chain has not converged, as values of x1 and x2 fluctuate wildly, indicating that
the chain is still exploring the remote regions of this highly degenerate pdf.¹ This indicates that the chain
length must be N ≫ 10^5 to sample this pdf.
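The two test pdfs are easy to code as log-pdfs, which also makes it easy to see why the Rosenbrock pdf is difficult. The sketch below assumes the standard normalized bivariate Gaussian form for equation 1.10 with σ1 = σ2 = 1 and r = 0.9 as defaults; the function names are hypothetical.

```python
import numpy as np

def log_gaussian_2d(x1, x2, sigma1=1.0, sigma2=1.0, r=0.9):
    """Log of the correlated 2D Gaussian pdf (eq. 1.10)."""
    q = ((x1 / sigma1)**2 + (x2 / sigma2)**2
         - 2.0 * r * x1 * x2 / (sigma1 * sigma2))
    norm = 2.0 * np.pi * sigma1 * sigma2 * np.sqrt(1.0 - r**2)
    return -0.5 * q / (1.0 - r**2) - np.log(norm)

def log_rosenbrock(x1, x2):
    """Log of the Rosenbrock "banana" pdf (eq. 1.11)."""
    return -0.05 * (100.0 * (x2 - x1**2)**2 + (1.0 - x1)**2)

# numerical check that the Gaussian integrates to one on a fine grid
grid = np.linspace(-8.0, 8.0, 401)
x1, x2 = np.meshgrid(grid, grid, indexing="ij")
dx = grid[1] - grid[0]
gauss_norm = np.sum(np.exp(log_gaussian_2d(x1, x2))) * dx * dx

peak = log_rosenbrock(1.0, 1.0)        # log pdf at the peak (1, 1)
ridge = log_rosenbrock(5.0, 25.0)      # a point far along the ridge x2 = x1**2
print(gauss_norm, peak, ridge)
```

Along the ridge x2 = x1² only the (1 − x1)² term contributes, so the pdf at (5, 25) is still exp(−0.8) of its peak value: the high-probability region is a long, thin, curved band that an isotropic proposal explores very slowly.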
This highlights two important issues: 1) we need a better algorithm than the Metropolis algorithm with an
isotropic proposed step distribution to handle sampling of highly degenerate posterior distributions and 2) we
need to have an objective criterion for chain convergence. We will consider these issues in the next two
sections.
[Figure 1.4: two panels, each titled “Metropolis”; the upper plots show parameter traces versus sample number and the lower plots show the chain samples in the x-y plane with 68.27%, 90%, and 99% contours.]
Figure 1.4: In each panel the upper plot shows traces of the 2 parameters of the sampled pdf (green and blue
curves), while the bottom plot shows the distribution of the chain samples in the 2D parameter space along
with the 68.27%, 90%, and 99% confidence contours. 10^5 samples were generated using the Metropolis algorithm
with a uniform isotropic step proposal distribution; the plotted chain was obtained by thinning the original
chain by taking every 10th sample. Left panel: sampling of a 2D Gaussian pdf (eq. 1.10) with the correlation
coefficient r = 0.9. Right panel: the result of sampling the Rosenbrock “banana” pdf (eq. 1.11). The
trace in the left plot indicates that the chain has converged for Nsample = 10^5, while the trace in the right
plot shows that the chain sampling the Rosenbrock pdf is far from convergence for the same number of samples
and step proposal distribution.
density pdf discussed above, or pdfs with multiple peaks. In this method, one initializes a number of
walkers distributed with a multivariate Gaussian probability around a starting point.³ Then the MCMC chain
is constructed in a way similar to the Metropolis algorithm: proposing a step, estimating the probability at
the proposed location, and then accepting the step with an acceptance probability. The difference from
Metropolis is in the way the step is proposed and how the acceptance probability is calculated (see §2 of GW10
for more details and detailed pseudo-code of the algorithm, but all details needed to code up the algorithm
are below):
• In this method a step from location x_i to x_{i+1} is proposed as a stretch move: x'_i = x_j + z_r (x_i − x_j),
where x_j is the current location of another randomly chosen walker (but not the current one we are
updating) and z_r is a random number drawn from the pdf g(z) ∝ 1/√z for z in the interval [1/a, a] and
g(z) = 0 outside this interval, where GW10 suggest a = 2.
• The proposed stretch move is then accepted with probability p_acc = min[1.0, z_r^(D−1) π(x'_i)/π(x_i)], where
π is the target pdf the chain is supposed to sample and D is the number of components of x (i.e., the
number of dimensions of the pdf we are sampling; e.g., D = 2 for the Rosenbrock banana pdf
3 This by itself is not a new or distinct feature, as multiple chains (aka “walkers”) can be used in the Metropolis-Hastings
algorithm too.
[Figure 1.5: two panels titled “Goodman & Weare 2010 sampler”; left: traces of x1 and x2 versus Nsample/100; right: chain samples in the x1-x2 plane with 68.27% and 99% contours.]
Figure 1.5: The result of sampling the Rosenbrock “banana” pdf (eq. 1.11) using the GW10 affine invariant
algorithm run until the Gelman-Rubin convergence indicator R − 1 for both x1 and x2 was < 0.05. The chain was
sampled with 100 walkers that were initialized around (x1, x2) = (0.0, 0.0) (note that this is not the peak of
the pdf, which is at (1, 1)) with a Gaussian distribution of rms dispersion 0.1. Left panel: the traces for
x1 and x2 (only every 100th sample is shown). Right panel: the distribution of the resulting chain samples
in the x1−x2 plane. Comparison to the corresponding distribution obtained by the Metropolis algorithm
after 10^5 samples in Figure 1.4 shows that the tails of the Rosenbrock pdf were severely undersampled in
the latter case, as was also indicated by the non-converged trace. The trace in the left panel of this figure
appears much more relaxed, although fluctuations are still quite large, which indicates that more stringent
convergence may be required for statistics sensitive to the tails of the distribution. This is also manifested
in the fact that the Gelman-Rubin convergence indicator R was still slowly converging to unity (Figure 1.6)
when this chain was stopped.
above).
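The two bullets above can be sketched as a serial update of the walkers, one at a time. This is an illustration under stated assumptions rather than the code behind Figure 1.5: z is drawn by inverting the cumulative distribution of g(z) ∝ 1/√z, the acceptance test is done with logarithms, the function names are hypothetical, and the target used for the quick check is a correlated 2D Gaussian (r = 0.9) instead of the much slower converging Rosenbrock pdf.

```python
import numpy as np

def sample_z(a, rng):
    """Draw z from g(z) ~ 1/sqrt(z) on [1/a, a] by inverting its CDF."""
    u = rng.uniform()
    return (1.0 + (a - 1.0) * u)**2 / a

def stretch_sweep(walkers, logp, log_pdf, a, rng):
    """Update every walker once, in place, with the stretch move."""
    n_walkers, ndim = walkers.shape
    for k in range(n_walkers):
        # pick another walker j != k at random
        j = rng.integers(n_walkers - 1)
        if j >= k:
            j += 1
        z = sample_z(a, rng)
        proposal = walkers[j] + z * (walkers[k] - walkers[j])
        logp_new = log_pdf(proposal)
        # accept with probability min(1, z**(D-1) * pi(x') / pi(x))
        log_ratio = (ndim - 1) * np.log(z) + logp_new - logp[k]
        if np.log(rng.uniform()) < log_ratio:
            walkers[k], logp[k] = proposal, logp_new

def log_gaussian(x, r=0.9):
    """Hypothetical test target: correlated 2D Gaussian with unit variances."""
    return -0.5 * (x[0]**2 + x[1]**2 - 2.0 * r * x[0] * x[1]) / (1.0 - r**2)

rng = np.random.default_rng(7)
n_walkers, n_sweeps, n_burn = 20, 3000, 500
walkers = rng.normal(0.0, 0.1, size=(n_walkers, 2))  # tight ball around (0, 0)
logp = np.array([log_gaussian(w) for w in walkers])
chain = np.empty((n_sweeps, n_walkers, 2))
for t in range(n_sweeps):
    stretch_sweep(walkers, logp, log_gaussian, 2.0, rng)
    chain[t] = walkers                 # walkers that did not move are repeated
samples = chain[n_burn:].reshape(-1, 2)
corr = np.corrcoef(samples.T)[0, 1]
print(samples.mean(axis=0), samples.std(axis=0), corr)
```

No step size is tuned anywhere: the scale and orientation of each proposal come from the current spread of the walkers themselves, which is what makes the method insensitive to linear degeneracies such as the r = 0.9 correlation used here.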
Figure 1.5 shows the result of sampling the Rosenbrock “banana” pdf with the GW10 sampler, which was
run until the maximum Gelman-Rubin convergence indicator (see next section) among the two parameters
x1 and x2 became smaller than 1.05 (ideal convergence would correspond to unity). This required
N ≈ 1.7 × 10^7 samples split among 100 individual chains (“walkers”) advanced in parallel during each step
of the algorithm. The chain was sampled with 100 walkers that were initialized around (x1, x2) = (0.0, 0.0)
(note that this is not the peak of the pdf, which is at (1, 1)) with a Gaussian distribution of rms dispersion
0.1. One can see a much better converged trace and much better sampled tails of the pdf compared to
the right panel of Figure 1.4.
An attempt to use the Metropolis algorithm with 100 walkers, with the same initial distribution of walkers
and with the proposal step distribution uniform in [−1, 1] around the current location, results in much
slower convergence, as will be discussed below (see Figure 1.6).
[Figure 1.6: two panels showing RGR versus iteration/1000 on a logarithmic vertical axis.]
Figure 1.6: The Gelman-Rubin convergence indicator R (what is actually plotted is R − 1) as a function
of step in the chain (every 1000th step is used). Note that 100 walkers are advanced during one step,
so the effective number of samples at the end of the shown chain is 1.8 × 10^7. The left panel shows the run
with the Metropolis algorithm, while the right panel shows the run using the GW10 algorithm. Both runs
were initialized with the same distribution of walkers (Gaussian of rms dispersion 0.1, centered on
(0, 0)). Clearly, the convergence indicated by RGR is much faster in the GW10 sampler due to its innate
insensitivity to the strong degeneracies of the pdf. Even for the GW10 algorithm, the indicator shows that
the chain is still slowly converging and that, despite the large number of samples, it may not be sufficient
for certain statistics sensitive to the tails of the distribution.
can compare the variance we get within each chain to the variance we get among different chains. Perfect
convergence would correspond to the “within chain” and “between chain” values of the variance matching
each other. Thus, the Gelman-Rubin indicator is defined as the ratio of the two variances:
$$R = \frac{V}{W} = \frac{N_w + 1}{N_w}\,\frac{\sigma_+^2}{W} - \frac{N - 1}{N_w N}, \tag{1.12}$$
where Nw is the number of walkers (independent chains) and N is the total length of each of the individual
chains,
$$\sigma_+^2 = \frac{N - 1}{N}\,W + \frac{B}{N}; \tag{1.13}$$
if we denote the random variable vector by x and denote by x_{jt} the tth of the N steps of chain j, the
“between-walker” variance is
$$\frac{B}{N} = \frac{1}{N_w - 1}\sum_{j=1}^{N_w}\left(\bar{x}_{j.} - \bar{x}_{..}\right)^2 \tag{1.14}$$
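The definitions above translate into a few lines of NumPy for a single parameter. One ingredient is an assumption of this sketch: the within-chain variance W is taken to be the standard one of Gelman & Rubin (1992), i.e., the average over chains of the unbiased sample variance of each chain. The function name and the test chains are hypothetical.

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin R for one parameter; chains has shape (N_w, N)."""
    chains = np.asarray(chains, dtype=float)
    n_w, n = chains.shape
    chain_means = chains.mean(axis=1)            # xbar_j.
    # within-chain variance W (standard definition, assumed here)
    w = chains.var(axis=1, ddof=1).mean()
    # variance of the chain means about the overall mean, B/N (eq. 1.14)
    b_over_n = chain_means.var(ddof=1)
    sigma2_plus = (n - 1) / n * w + b_over_n     # eq. 1.13
    # eq. 1.12
    return (n_w + 1) / n_w * sigma2_plus / w - (n - 1) / (n_w * n)

rng = np.random.default_rng(5)
# ten hypothetical chains that all sample the same unit Gaussian
converged = rng.normal(size=(10, 5000))
# the same chains shifted by different offsets: clearly not converged
offsets = 0.5 * np.arange(10)[:, None]
r_converged = gelman_rubin(converged)
r_shifted = gelman_rubin(converged + offsets)
print(r_converged, r_shifted)
```

For a multi-parameter chain the function would be applied to each parameter separately and the largest R used as the stopping criterion, as in the R < 1.05 criterion quoted for Figure 1.5.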
Figure 1.6 shows the convergence indicator RGR ≡ R−1.0 calculated after every 1000th step as a function
of step in the chains run with the Metropolis and GW10 algorithms for the Rosenbrock pdf using Nw = 100
independent chains. The difficulty of this pdf is apparent in the slow convergence of RGR to zero. However,
[Figure 1.7: wall-clock execution time versus the number of processors, Nproc, on logarithmic axes.]
Figure 1.7: The wall-clock execution time of the parallel implementation of the GW10 algorithm sampling
the 2D Rosenbrock pdf for a fixed number of samples N = 500 with Nw = 1600 for
different numbers of processors. The red points show the actual (non-)scaling. The parallel execution affords
no speed-up in this case, because this simple pdf is too cheap to compute to really benefit from parallelization
(the overhead associated with organizing parallel communications swamps any speed-up from parallelization),
especially for the fast vectorized version of the sampler. To simulate an expensive pdf, a call to a sleep function
was added to the model pdf routine, which makes computation of the posterior more expensive. The green,
magenta, and blue points show wall-clock time as a function of the number of processors, Nproc, as the pdf
calculation is made more expensive. The dashed blue line shows the ideal parallel speed-up for the most
expensive pdf case. The solid lines connecting the points show predictions of Amdahl’s law for the actual
speed-up using actual timings of the serial and parallel portions of the code in each case. As the total execution
time becomes dominated more and more by the parallel part of the code, the speed-up scaling with Nproc
approaches ideal scaling.
the convergence of the GW10 method is much faster than that of the Metropolis algorithm started from the
same initial distribution of walkers. Although it is possible to improve the performance of the Metropolis
algorithm for this particular problem by tuning the step proposal distribution and making it anisotropic
along the local degeneracy direction, or by making a variable transformation so that the local posterior in the new
variables is not strongly degenerate (e.g., as implemented in CosmoMC, see Lewis, 2013), this comparison
shows that the GW10 algorithm performs much better without any such tuning and is thus much more general.
[Figure 1.8: correlation function versus r (h^−1 Mpc); legend entries: “a ΛCDM model” and “DR11 ξ(r) re-con”.]
Figure 1.8: Correlation function for the best fit parameters α, B², a1, a2, a3 (magenta line) compared to
the DR11 data from Anderson et al. (2014).
size subsets of size Nw/2 (where Nw is the total number of walkers, which should be divisible by 2). The walker
update now consists of a loop over the two subsets and an inner loop over the walkers in each subset, during
which only walkers from the opposite (complementary) subset are used for the stretch moves of the current
subset. That inner loop can now be parallelized among processors.⁴ For parallelization on more processors
in the GW10 algorithm it is thus advantageous to have more independent chains (i.e., larger Nw). Increasing
Nw does have a significant drawback: each chain will have its own burn-in period that will need to be
discarded. Thus, the number of discarded samples will increase with increasing Nw, which may be a problem
when computation of the posterior π is very expensive. Nevertheless, parallelization does allow one to shorten
the time to convergence for difficult-to-sample posteriors, such as the Rosenbrock pdf.
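The split into two complementary subsets is what allows the inner loop to be executed for all walkers of a subset at once. The sketch below shows the vectorized (not the MPI) variant of this idea in NumPy. It is an illustration under assumptions: the function names are hypothetical, the log-pdf is re-evaluated for the current positions instead of being cached, and the quick check uses a 3D unit Gaussian target rather than the Rosenbrock pdf.

```python
import numpy as np

def update_half(active, other, log_pdf, a, rng):
    """Stretch-move update of all walkers in `active` at once, using
    partners drawn only from the complementary subset `other`."""
    n_active, ndim = active.shape
    # z from g(z) ~ 1/sqrt(z) on [1/a, a], one value per active walker
    z = (1.0 + (a - 1.0) * rng.uniform(size=n_active))**2 / a
    partners = other[rng.integers(len(other), size=n_active)]
    proposal = partners + z[:, None] * (active - partners)
    # a production code would cache log_pdf(active) between updates
    log_ratio = ((ndim - 1) * np.log(z)
                 + log_pdf(proposal) - log_pdf(active))
    accept = np.log(rng.uniform(size=n_active)) < log_ratio
    return np.where(accept[:, None], proposal, active)

def sweep(walkers, log_pdf, a, rng):
    """One full update: first half against second, then second against first."""
    half = len(walkers) // 2
    first, second = walkers[:half], walkers[half:]
    first = update_half(first, second, log_pdf, a, rng)
    second = update_half(second, first, log_pdf, a, rng)
    return np.concatenate([first, second])

def log_gaussian(x):
    """Hypothetical test target: unit Gaussian, evaluated row by row."""
    return -0.5 * np.sum(x**2, axis=1)

rng = np.random.default_rng(11)
n_walkers, ndim, n_sweeps, n_burn = 100, 3, 1000, 200
walkers = rng.normal(0.0, 0.1, size=(n_walkers, ndim))
chain = np.empty((n_sweeps, n_walkers, ndim))
for t in range(n_sweeps):
    walkers = sweep(walkers, log_gaussian, 2.0, rng)
    chain[t] = walkers
samples = chain[n_burn:].reshape(-1, ndim)
print(samples.mean(axis=0), samples.std(axis=0))
```

In a parallel code the call to log_pdf(proposal) is the step that would be distributed among processors, since the proposals within one subset do not depend on each other.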
An efficient parallel implementation of the Goodman & Weare (2010) algorithm is available in the
public emcee code⁵ by Foreman-Mackey et al. (2013). However, the algorithm is sufficiently
simple (just a few additional lines of code compared to the Metropolis-Hastings algorithm) that it is
instructive to write one’s own code or examine the simple code used to produce Figure 1.5.
Figure 1.7 shows, as red points, the wall-clock execution time for the code in which the 2D Rosenbrock pdf was
sampled for a fixed number of samples N = 500 with Nw = 1600 for different numbers of processors. The
parallel execution affords no speed-up in this case, because this simple pdf is too cheap to compute to really
benefit from parallelization (the overhead associated with organizing parallel communications swamps any
speed-up from parallelization), especially for the fast vectorized version of the sampler. However, for more
expensive posteriors the parallel execution would reduce the wall-clock time. To simulate an expensive pdf, a call to
a sleep function was added, which makes computation of the posterior more expensive. The green, magenta,
and blue points show wall-clock time as a function of the number of processors, Nproc, as the pdf calculation
is made more expensive. The dashed blue line shows the ideal parallel speed-up for the most expensive pdf
case. One can see that the actual parallelization of the algorithm is quite good and the actual speed-up deviates
from the ideal only for Nproc > 10. This deviation is due to the contribution of the unavoidable serial parts
of the computations to the total time. The solid lines connecting the points show predictions of Amdahl’s
law for the actual speed-up using actual timings of the serial and parallel portions of the code in each case. As
the total execution time becomes dominated more and more by the parallel part of the code, the speed-up
scaling with Nproc approaches ideal scaling.
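Amdahl’s law itself is a one-line formula, sketched below together with the equivalent wall-clock prediction from separate timings of the serial and parallelizable parts. The timings used in the loop (1 s serial, 19 s parallelizable) are made-up numbers for illustration, not measurements from the runs discussed here, and the function names are hypothetical.

```python
def amdahl_speedup(n_proc, parallel_fraction):
    """Amdahl's law: speed-up on n_proc processors when a fraction
    parallel_fraction of the single-processor run time can be parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_proc)

def wall_clock(n_proc, t_serial, t_parallel):
    """Predicted wall-clock time from the time spent in the serial part and
    in the parallelizable part, both measured on one processor."""
    return t_serial + t_parallel / n_proc

# made-up timings: 1 s serial, 19 s parallelizable (parallel fraction 0.95)
for n_proc in (1, 2, 4, 8, 16, 32):
    print(n_proc, amdahl_speedup(n_proc, 0.95), wall_clock(n_proc, 1.0, 19.0))
```

As the number of processors grows, the speed-up saturates at the inverse of the serial fraction (20 for these numbers), which is why the curves approach ideal scaling only when the parallel part dominates the run time.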
4 For languages supporting vector operations, such as Fortran 90 and python, this inner loop can also be completely vectorized
– i.e., loop iterations can be avoided. See example MPI code for Rosenbrock banana and serial vectorized code for the BAO
analysis.
5 Available at https://github.com/dfm/emcee
[Figure 1.9: three panels showing sampled parameter values: B² versus α, a2 versus a1, and a3 versus a1.]
Figure 1.9: Distribution of sampled values of the parameters from which best fit values and their statistics
were calculated. Ellipses show the 1σ and 2σ confidence intervals. The nuisance parameters exhibit
degeneracies, but there is little correlation between the bias factor B² and the dilation parameter α that carries
all of the cosmological information in the BAO analysis, as could be expected from parameters controlling
vertical normalization and horizontal shift.
publications.php
7 I use cosmological functions implemented in Benedikt Diemer’s colossus python code available at:
http://www.benediktdiemer.com/code/
12 1.8. A practical example: cosmological constraints from the BAO
reasonable location (close to the best fit parameters, which could be determined by a rough “fit-by-eye” to
the correlation function). The chains converge after ≈ 500 steps for 500 walkers and give α = 1.003 ± 0.0088,
in good agreement (within 1σ) with the constraints quoted in Table 4 of Anderson et al. (2014) for the
post-recon ξ0 fit for DR11. The constraints for other parameters can be seen in Figure 1.9, but their values
are of no interest for cosmology. The figure shows that the nuisance parameters exhibit degeneracies, which
indicates that in principle similar results could be achieved with a different parametrization with fewer
parameters.
Bibliography
Anderson, L., Aubourg, É., Bailey, S., et al. 2014, The clustering of galaxies in the SDSS-III Baryon
Oscillation Spectroscopic Survey: baryon acoustic oscillations in the Data Releases 10 and 11 Galaxy
samples, MNRAS, 441, 24
Brooks, S. & Gelman, A. 1998, General Methods for Monitoring Convergence of Iterative Simulations,
Journal of Computational and Graphical Statistics, 7, 434,
http://www.stat.columbia.edu/~gelman/research/published/brooksgelman2.pdf
D’Agostini, G. 2005, Fits, and especially linear fits, with errors on both axes, extra variance of the data
points and other complications, physics/0511182
Dunkley, J., Bucher, M., Ferreira, P. G., Moodley, K., & Skordis, C. 2005, Fast and reliable Markov chain
Monte Carlo technique for cosmological parameter estimation, MNRAS, 356, 925
Foreman-Mackey, D., Hogg, D. W., Lang, D., & Goodman, J. 2013, emcee: The MCMC Hammer, PASP,
125, 306
Gelman, A. & Rubin, D. 1992, Inference from iterative simulation using multiple sequences, Statistical
Science, 7, 457, http://www.stat.columbia.edu/~gelman/research/published/itsim.pdf
Giakoumatos, S., Vrontos, I., Dellaportas, P., & Politis, D. N. 1999, An MCMC Convergence Diagnostic using
Subsampling, J. Comput. Graph. Statistics, 8, 431
Goodman, J. & Weare, J. 2010, Ensemble samplers with affine invariance., Communications in Applied
Mathematics and Computational Science, 5, 65
Hastings, W. 1970, Monte Carlo Sampling Methods Using Markov Chains and Their Applications,
Biometrika, 57, 97
Hoekstra, H., Herbonnet, R., Muzzin, A., Babul, A., Mahdavi, A., Viola, M., & Cacciato, M. 2015, The
Canadian Cluster Comparison Project: detailed study of systematics and updated weak lensing masses,
MNRAS submitted (arXiv/1502.01883)
Lewis, A. 2013, Efficient sampling of fast and slow cosmological parameters, Phys. Rev. D, 87, 103529
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. 1953, Equation of State
Calculations by Fast Computing Machines, J. Chem. Phys., 21, 1087
Vikhlinin, A., Burenin, R. A., Ebeling, H., Forman, W. R., Hornstrup, A., Jones, C., Kravtsov, A. V.,
Murray, S. S., Nagai, D., Quintana, H., & Voevodkin, A. 2009, Chandra Cluster Cosmology Project. II.
Samples and X-Ray Data Reduction, ApJ, 692, 1033