2024-Leonardo RIPOLI Thesis
Leonardo Ripoli
UNIVERSITY OF READING
I confirm that this is my own work and the use of all material from other sources has been
properly and fully acknowledged.
Leonardo Ripoli
Acknowledgments
I would like to express my deepest gratitude to my PhD supervisor, Richard Everitt, for his
patience, invaluable guidance, and support throughout the research journey. His insights and
expertise have been vital in shaping this work. I also wish to thank Prof. Patrizia Gamba
and Prof. Enkeleida Lushi, whose passion for Mathematics and inspiring teaching have had
a profound and lasting impact on my journey. Finally, I wish to thank Andrew Meade for
agreeing to come on board and for his support.
Abstract
The first part of my research concentrates on Sequential Monte Carlo (SMC) methods for phylogenetics. The aim of the research was to deliver methods that can be used to infer the spread of diseases, leveraging a widely used software platform so that researchers can easily access, validate and compare them. We show the results obtained from developing and integrating an adaptive SMC algorithm into Bayesian Evolutionary Analysis by Sampling Trees version 2 (commonly known as BEAST2), a software platform well established among researchers. Our adaptive SMC algorithm embedded in BEAST2 performs comparably to the native Markov Chain Monte Carlo (MCMC) method in terms of accuracy and efficiency. Our work can be seen as a first step, and future tuning is expected. It is foreseen that an integration of the adaptive SMC package into BEAST2 will be carried out by the owners of the platform, allowing researchers to use SMC instead of MCMC, following testing and tuning by the platform's developers.
The focus of the second part is Active Subspaces (AS). With AS we try to identify a smaller subspace informed by the data and to concentrate the algorithmic effort on this more informative part, primarily to address the curse of dimensionality affecting many Monte Carlo (MC) methods. Existing AS algorithms were mostly biased, targeting distributions that are only close, by some measure, to the posterior, and leaving users to do substantial post-validation. We built on the foundations of an existing pseudo-marginal-based Active Subspace algorithm and developed unbiased AS algorithms that, in stationarity, target the correct posterior, using the structure provided by AS within a Gibbs-style MCMC, within Particle Marginal Metropolis-Hastings (PMMH), within Metropolis within Particle Gibbs (MwPG), and within SMC-squared (SMC²). We have run experiments showing that, in specific settings, these algorithms outperform existing methods, and we provide explanations of the optimal running conditions for each algorithm. Our work sheds light on the practical applications of Active Subspaces, expanding the range of AS methods available to researchers and providing clearer guidance on which approaches are most effective in specific scenarios.
Contents
Acknowledgments 2
1 Introduction 19
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2 Thesis Organisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 Technical Background 22
2.1 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 The Monte Carlo method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Estimating the normalization constant through IS . . . . . . . . . . . 26
2.3.2 Effective Sample Size (ESS) . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Markov Chain Monte Carlo (MCMC) . . . . . . . . . . . . . . . . . . . . . . 29
2.4.1 Markov Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.2 Initial distribution of the chain . . . . . . . . . . . . . . . . . . . . . 30
2.4.3 Joint and conditional distributions of the Xi . . . . . . . . . . . . . . 30
2.4.4 Stationary property of the MCMC and invariant distribution . . . . . 31
2.4.5 Convergence of MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4.6 Metropolis-Hastings algorithm . . . . . . . . . . . . . . . . . . . . . . 33
2.4.7 Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4.8 ESS for MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.9 MultiESS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5 Annealed Importance Sampling (AIS) . . . . . . . . . . . . . . . . . . . . . . 35
2.5.1 Annealed Importance Sampling vs Importance Sampling . . . . . . . 35
2.5.2 Annealed Importance Sampling vs MCMC . . . . . . . . . . . . . . . 37
2.5.3 The AIS algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.6 Sequential Monte Carlo (SMC) . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.6.1 Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.6.2 The SMC algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.6.3 Conditional Effective Sample Size (CESS) . . . . . . . . . . . . . . . 44
2.7 Pseudo-marginals in MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.7.1 GIMH pseudomarginal algorithm . . . . . . . . . . . . . . . . . . . . 45
2.8 Particle MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.8.1 Particle Marginal Metropolis-Hastings (PMMH) . . . . . . . . . . . . 46
2.8.2 Metropolis within Particle Gibbs (MwPG) . . . . . . . . . . . . . . . 47
2.9 SMC2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.16.2 Effective Population Size . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.16.3 Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.17 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4 Active Subspaces 86
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2 Active Subspaces: general Mathematical formulation . . . . . . . . . . . . . 87
4.2.1 Structural matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2.2 Estimating the dimension of Active Subspace through the spectral gap 88
4.2.3 Approximation of a function with the Active Subspace . . . . . . . . 88
4.2.4 Active Subspaces directions by principal components . . . . . . . . . 90
4.3 Active Subspaces using Monte Carlo approximations . . . . . . . . . . . . . 90
4.3.1 Monte Carlo approximation of the conditional expectation . . . . . . 90
4.3.2 Monte Carlo approximation of the structural matrix . . . . . . . . . . 90
4.3.3 Monte Carlo approximation of the Active Subspace . . . . . . . . . . 91
4.3.4 Validity of Monte Carlo approximations . . . . . . . . . . . . . . . . 91
4.4 Active Subspaces: literature review on existing methods for MCMC . . . . . . 91
4.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.5 MCMC problems in high dimensions . . . . . . . . . . . . . . . . . . . . . . 92
4.6 Biased Active Subspace MCMC algorithm . . . . . . . . . . . . . . . . . . . 93
4.7 Discussion on open points in current AS methods . . . . . . . . . . . . . . . 95
4.7.1 Open points mitigations . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.8 Bias in Active Subspaces MCMC . . . . . . . . . . . . . . . . . . . . . . . . 97
4.8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.8.2 Exact Active Subspace MCMC algorithm: AS-MH . . . . . . . . . . 98
4.8.3 Biasedness or unbiasedness of AS-MCMC algorithms: Mathematics
behind . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.8.4 Comparison of results: standard MCMC vs biased AS MCMC vs exact
AS MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.8.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.9 Prior AS vs Posterior AS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.9.1 Toy example to demonstrate prior vs posterior Active Subspace differ-
ences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.3.2 Gaussian Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.3.3 Comparison of performances in the Gaussian model . . . . . . . . . . 123
5.3.4 Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.4 Comparison of performances of AS-PMMH with other algorithms in the “Ba-
nana” model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.4.2 Banana model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.4.3 Comparison of performances in the Banana model . . . . . . . . . . . 132
5.5 AS-PMMH-i algorithm: giving more relevance to the Active component . . . 136
5.5.1 Comparison of performances with other algorithms . . . . . . . . . . 138
5.5.2 Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.6 Active Subspaces Gibbs algorithm: AS-Gibbs . . . . . . . . . . . . . . . . . 141
5.6.1 Comparison of performances with other algorithms . . . . . . . . . . 142
5.6.2 Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.7 Active Subspace Metropolis within particle Gibbs algorithms: AS-MwPG and
AS-MwPG-i . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.7.1 Inverted Active Subspace Metropolis within particle Gibbs: AS-MwPG-
i . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.7.2 Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.8 Comparison of performances of AS-based MCMC algorithms in a multi-modal
distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.8.1 Gaussian mixture model . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.8.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.9 Determining the dimension of Active Subspaces with ESS . . . . . . . . . . . 157
5.9.1 Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
5.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.4.2 Theoretical justification . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.4.3 Early results for AS-SMC-a . . . . . . . . . . . . . . . . . . . . . . . 182
6.4.4 Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Bibliography 215
List of Figures
3.1 Visually appealing example of a phylogenetic tree, borrowed from [Hou et al., 2022]. The different species are the plants pictured on the RHS of the figure; the lines that join the different species until reaching a common ancestor (core Chlorophyta on the bottom LHS) represent genetic lineage relationships. The points where the lines join represent the moments in past time when different species started to diverge from the same lineage. . . . 51
3.2 The Figtree software [Rambaut, 2023], useful to visualize and get statistical
information on phylogenetic trees. We can see the tips on the right hand side
numbered t0 to t7 which represent the genetic sequences that are the starting
point of the analysis (the whole tree is inferred starting from these sequences).
The points where two branches merge into one are called coalescence times. . 52
3.3 Example of coalescent event of two genes, named in the figure t1 and t2 , out
of a population of N = 4. The coalescent event, as described in the paragraph,
happens j generations in the past. The diagram has been created using the
package graphviz. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4 Example of a coalescent tree composed of three taxa, named t1, t2 and t3 in the figure, corresponding to the internal tree state in the software BEAST2 ((1 : 0.5665, 2 : 0.5665) : 0.3993, 3 : 0.9658) : 0.0;, as described in the paragraph. We know that coalescent times are to be read as times in the past (see Section 3.3), therefore the timescale has to be read from the bottom (time of the samples) to the top (the time in the past at which the three species had converged into a single ancestor, the MRCA, Most Recent Common Ancestor). The diagram has been created using the package graphviz. . . . 68
3.5 Tracer, a software tool that is part of the BEAST2 package. The software is capable of displaying quantities relating to the MCMC runs, such as estimated distributions and the ESS of the chains. . . . 70
3.6 The Figtree software [Rambaut, 2023], useful to visualize and get information
on the MCMC chain of trees generated by BEAST2. . . . . . . . . . . . . . . 71
3.7 Random coalescent tree with 10 leaves generated using the procedure outlined
in the first part of Section 3.13.1. This has been the generating tree for the
synthetic data of the test described in this section. Visualization via FigTree
[Rambaut, 2023] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.8 Annealing steps in the SMC run for the 10-taxa example studied in this section. 75
3.9 ESS versus annealing SMC step for the 10-taxa example studied in this section.
Resampling is performed whenever the ESS falls below 50% of particles (1000
particles are used for the simulation). . . . . . . . . . . . . . . . . . . . . . . 76
3.10 Frequency distribution using the native MCMC run with BEAST2 for the parameter Gamma shape with 10 taxa. The values inside the 95% HPD interval are shown in blue and those outside the 95% HPD interval are highlighted in gold. Visualization with the software Tracer. . . . 77
3.11 Frequency distribution for the parameter Gamma shape with 10 taxa, using
Annealed Adaptive SMC algorithm that we have embedded into BEAST2. Vi-
sualization with python matplotlib. . . . . . . . . . . . . . . . . . . . . . . . . 78
3.12 Frequency distribution using the native MCMC run with BEAST2 for the parameter Effective Population Size with 10 taxa. The values inside the 95% HPD interval are shown in blue and those outside the 95% HPD interval are highlighted in gold. Visualization with the software Tracer. . . . 79
3.13 Frequency distribution for the parameter Effective Population Size with 10
taxa, using Annealed Adaptive SMC algorithm that we have embedded into
BEAST2. Visualization with python matplotlib. . . . . . . . . . . . . . . . . 80
3.14 Evolution of the estimated standard deviation of the population size parameter
vs the annealing step for the parameter Effective Population Size in the SMC
algorithm: we see how the standard deviation drops significantly through the
annealing journey. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.15 Consensus tree for the MCMC run with visualization of 95% range for co-
alescent times. See comparison with the generating tree (which the MCMC
run tries to reconstruct) in Figure 3.7. Consensus tree has been generated
with TreeAnnotator and the visualization is with FigTree (both softwasre from
BEAST2 package). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.16 Consensus tree for the Annealed Adaptive SMC run with visualization of 95%
range for coalescent times. See comparison with the generating tree (which
the SMC run tries to reconstruct) in Figure 3.7, and also with the tree re-
constructed using MCMC in Figure 3.15: we see that the SMC is able to
reconstruct the generating tree well and with a smaller uncertainty (the 95%
uncertainty ranges in the coalescent times are in general smaller compared to
the MCMC of Figure 3.15). Consensus tree has been generated with TreeAn-
notator and the visualization is with FigTree (both softwasre from BEAST2
package). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.1 Example of spectral gap, i.e. a significant difference between two consecutive
eigenvalues arranged in descending order. In the figure the order of the system
is 4, and the spectral gap is found between eigenvalue 2 and 3 since there is a
difference of 5 orders of magnitude between them (note the y-axis is in log-scale). 88
4.2 Spectral gap for the model of (4.26) with = 0.01: the eigenvalues differ by 4 orders of magnitude and the inactive eigenvalue λ2 is less than 1 . . . 103
4.3 Posterior active and inactive marginals versus prior. Active marginal $a = B_a^T\theta$ on the LHS and inactive marginal $i = B_i^T\theta$ on the RHS (both red-dotted) versus prior components 1 and 2 (continuous line) of (4.26) with = 0.01: we see that the inactive marginal (RHS) is almost identical to the second prior component, indicating that we are in a near-perfect Active Subspace and that the likelihood is only weakly informative on this component. Compare with the LHS chart, where the active marginal is very different from the first component of the prior, indicating that the first component is active and that the differences are due to the effect of the likelihood. . . . 104
4.4 Spectral gap for the model of (4.26) with = 0.2: the eigenvalues differ by 2 orders of magnitude and the smallest eigenvalue is around 200. . . . 105
4.5 Posterior active and inactive marginals versus prior. Active marginal $a = B_a^T\theta$ on the LHS and inactive marginal $i = B_i^T\theta$ on the RHS (both red-dotted) versus prior components 1 and 2 (continuous line) of (4.26) with = 0.2: we see that the inactive marginal (RHS) is very similar to the second prior component, although, compared to the RHS of Figure 4.3 (case = 0.01), some differences are visible: here the fit is not perfect, indicating that the likelihood is slightly more informative in this case. . . . 106
4.6 Charts representing prior (bottom two charts) and likelihood (upper two) of
the posterior (4.35), with parameter values (4.36). Note: “Dimension 1” in
the chart titles is the first component θ1 and “Dimension 2” is θ2 . . . . . . . 109
4.7 Directions of the principal components of the covariance of gradients using
prior samples: we can see that the main direction is horizontal. See difference
with Figure 4.8 where the main direction is vertical . . . . . . . . . . . . . . 110
4.8 Directions of the principal components of the covariance of gradients using
posterior samples: we can see that the main direction is vertical. See differ-
ence with Figure 4.7 where the main direction is horizontal . . . . . . . . . . 111
5.10 Distribution of RMSE of the differences between the true posterior mean and the mean estimated by each of the algorithms over 50 runs. Starting from the LHS: MCMC, AS-MH and AS-PMMH. We can see from the chart that the distributions for AS-MH and AS-PMMH have a lower mean and upper quartile than MCMC, but longer tails. This may indicate very noisy estimates of the likelihood in some of the runs of both algorithms, which cause the distributions to have long tails. The MCMC has comparatively smaller tails. . . . 134
5.11 Example of ’sticky’ trace-plot in AS-PMMH, taken from one of the runs in Fig-
ure 5.10: a noisy estimate of the likelihood causes the outer MCMC to remain
stuck for a long time, and this causes the very long tails in the distribution of
RMSE seen in Figure 5.10 for the AS-PMMH. . . . . . . . . . . . . . . . . . 135
5.12 Distribution of RMSE of the differences between the true posterior mean and the mean estimated by each of the algorithms over 50 runs. We see that the tails of the AS-PMMH-i distribution are smaller than those of AS-PMMH, probably because, keeping the number of tempering steps of the SMC sampler constant between the two (6), the size of the space the SMC has to act upon is much smaller: 4D in the case of AS-PMMH-i vs 21D in the case of AS-PMMH. By contrast, the MCMC part has to act on a much bigger space (21D vs 4D), which probably explains why AS-PMMH-i has a bigger average error. . . . 140
5.13 Distribution of RMSE of the differences between the true posterior mean and
the mean estimated by each of the algorithms over 50 runs. We can see from
the chart that AS-Gibbs has lower mean RMSE. . . . . . . . . . . . . . . . . 143
5.14 Level surface of the system of equation (5.35), the combinations θ1 + θ2 = −5
and θ3 + θ4 = 5 or θ1 + θ2 = 5 and θ3 + θ4 = −5 are the ones that leave the
likelihood (5.25) invariant . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.15 Level surface of the system of equation (5.35): the combination of parameters
θ1 + θ2 = ±5 are the ones that leave the likelihood (5.25) invariant . . . . . . 152
5.16 AS-MwPG-i: reconstruction of the posterior for the system having likelihood
(5.35). We see that AS-MwPG-i correctly reconstructs the bimodal posterior,
both modes (θ1 + θ2 = −5 and θ3 + θ4 = 5) and (θ1 + θ2 = 5 and θ3 + θ4 = −5)
are found (Note: in the figure “Sum components first mean” on the x-axis is
the sum µ1 = θ1 + θ2 , whereas “Sum components second mean” on the y-axis
is the sum µ2 = θ3 + θ4 ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
5.17 AS-MwPG-i: reconstruction of the posterior for the system having likelihood
(5.35). We see in the figure θ1 (named Component 1 in the figure) vs θ2
(named Component 2): we can appreciate that MwPG correctly reconstructs
the bimodal posterior, in fact both combinations θ1 + θ2 = −5 and θ1 + θ2 = 5
are found (Note: in the figure “Component 1” on the x-axis is θ1 , whereas
“Component 2” on the y-axis is θ2 ). . . . . . . . . . . . . . . . . . . . . . . . 154
5.18 Standard MCMC: incorrect reconstruction of the posterior for the system
having likelihood (5.35). We see that MCMC incorrectly reconstructs the bi-
modal posterior: only the mode (θ1 + θ2 = −5 and θ3 + θ4 = 5) is found,
whereas the mode (θ1 + θ2 = 5 and θ3 + θ4 = −5) is missing, see comparison
with Figure 5.16. (Note: in the figure “Sum components first mean” on the
x-axis is the sum µ1 = θ1 + θ2 , whereas “Sum components second mean” on
the y-axis is the sum µ2 = θ3 + θ4 ). . . . . . . . . . . . . . . . . . . . . . . . 154
5.19 Standard MCMC: incorrect reconstruction of the posterior for the system
having likelihood (5.35). We see in the figure θ1 (named Component 1 in the
figure) vs θ2 (named Component 2): we can appreciate that MCMC gets stuck
in one mode and only the combination θ1 + θ2 = −5 is found, while the mode
θ1 + θ2 = 5 is missing, see comparison with Figure 5.17 (Note: in the figure
“Component 1” on the x-axis is θ1 , whereas “Component 2” on the y-axis is θ2 ).155
5.20 AS-MH algorithm 8 of Section 4.8.2: incorrect reconstruction of the posterior
for the system having likelihood (5.35). We see that AS-MH algorithm incor-
rectly reconstructs the bimodal posterior: only the mode (θ1 + θ2 = −5 and
θ3 + θ4 = 5) is found, whereas the mode (θ1 + θ2 = 5 and θ3 + θ4 = −5) is miss-
ing, see comparison with Figure 5.16. (Note: in the figure “Sum components
first mean” on the x-axis is the sum µ1 = θ1 + θ2 , whereas “Sum components
second mean” on the y-axis is the sum µ2 = θ3 + θ4 ). . . . . . . . . . . . . . 155
5.21 AS-MH algorithm 8 of Section 4.8.2: incorrect reconstruction of the poste-
rior for the system having likelihood (5.35). We see in the figure θ1 (named
Component 1 in the figure) vs θ2 (named Component 2): we can appreciate
that AS-MH gets stuck in one mode and only the combination θ1 + θ2 = −5 is
found, while the mode θ1 + θ2 = 5 is missing, see comparison with Figure 5.17
(Note: in the figure “Component 1” on the x-axis is θ1 , whereas “Component
2” on the y-axis is θ2 ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5.22 AS-Gibbs algorithm 13 of Section 5.6: incorrect reconstruction of the poste-
rior for the system having likelihood (5.35). We see that AS-Gibbs algorithm
incorrectly reconstructs the bimodal posterior: only the mode (θ1 +θ2 = −5 and
θ3 + θ4 = 5) is found, whereas the mode (θ1 + θ2 = 5 and θ3 + θ4 = −5) is miss-
ing, see comparison with Figure 5.16. (Note: in the figure “Sum components
first mean” on the x-axis is the sum µ1 = θ1 + θ2 , whereas “Sum components
second mean” on the y-axis is the sum µ2 = θ3 + θ4 ). . . . . . . . . . . . . . 156
5.23 AS-Gibbs algorithm 13 of Section 5.6: incorrect reconstruction of the poste-
rior for the system having likelihood (5.35). We see in the figure θ1 (named
Component 1 in the figure) vs θ2 (named Component 2): we can appreciate
that AS-Gibbs gets stuck in one mode and only the combination θ1 +θ2 = −5 is
found, while the mode θ1 + θ2 = 5 is missing, see comparison with Figure 5.17
(Note: in the figure “Component 1” on the x-axis is θ1 , whereas “Component
2” on the y-axis is θ2 ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.24 Eigenvalues of the 10D Gaussian model: we see that the estimated AS size is 1, considering the spectral gap between eigenvalues 1 and 2. The dimension of the Active Subspace is na = 1. . . . 158
5.25 ESS (%) using the prior as importance proposal for different dimensions of the inactive subspace in the 10D Gaussian model. We see from the figure that the largest dimension ni of the inactive subspace that still gives a high ESS is 9 (the number of full bars); we therefore set ni = 9, and the dimension of the Active Subspace is na = 10 − ni = 1. Compare with Figure 5.24, where the traditional eigenvalue method also gives an active dimension of na = 1. . . . 159
5.26 Eigenvalues of the 25D Gaussian model: we see that the estimated AS size is 1, considering the spectral gap between eigenvalues 1 and 2. The dimension of the Active Subspace is na = 1. . . . 159
5.27 ESS (%) using the prior as importance proposal for different dimensions of the inactive subspace in the 25D Gaussian model. We see from the figure that the largest dimension ni of the inactive subspace that still gives a high ESS is 24 (the number of full bars); we therefore set ni = 24, and the dimension of the Active Subspace is na = 25 − ni = 1. Compare with Figure 5.26, where the traditional eigenvalue method also gives an active dimension of na = 1. . . . 160
5.28 Eigenvalues of the 10D Banana model: we see that the estimated AS size is 4, considering the spectral gap between eigenvalues 4 and 5. The dimension of the Active Subspace is na = 4. . . . 161
5.29 ESS (%) using the prior as importance proposal for different dimensions of the inactive subspace in the 10D Banana model. We see from the figure that the largest dimension ni of the inactive subspace that still gives a high ESS is 6 (the number of full bars); we therefore set ni = 6, and the dimension of the Active Subspace is na = 10 − ni = 4. Compare with Figure 5.28, where the traditional eigenvalue method also gives an active dimension of na = 4. . . . 161
5.30 Eigenvalues of the 25D Banana model: we see that the estimated AS size is 4, considering the spectral gap between eigenvalues 4 and 5. The dimension of the Active Subspace is na = 4. . . . 162
5.31 ESS (%) using the prior as importance proposal for different dimensions of the inactive subspace in the 25D Banana model. We see from the figure that the largest dimension ni of the inactive subspace that still gives a high ESS is 21 (the number of full bars); we therefore set ni = 21, and the dimension of the Active Subspace is na = 25 − ni = 4. Compare with Figure 5.30, where the traditional eigenvalue method also gives an active dimension of na = 4. . . . 162
6.1 Violin plots with the distribution of RMSE of the differences between the true posterior mean and the mean estimated by each of the algorithms over 50 runs in the 25D Gaussian model. We see that the performance of the standard SMC appears to be worse than that of AS-SMC; this is probably because we have a good estimate of the likelihood in AS-SMC (in the Gaussian model the importance sampler seems to behave well on the inactive subspace even in high dimensions), coupled with the fact that in the AS version the SMC operates on a 1D subspace instead of the full 25D space of the non-AS SMC. . . . 169
6.2 Violin plots with the distribution of RMSE of the differences between the true posterior mean and the mean estimated by each of the algorithms over 50 runs in the 25D Banana model. We see that the performances of the standard SMC and AS-SMC appear to be approximately equal in terms of mean and upper and lower quartiles, but AS-SMC shows some tails, which suggests again that the estimate of the likelihood may be poor in some cases and that the algorithm can get stuck in the tail of the distribution. This is probably because using 10 inactive variables is too few for the exploration of the 21D inactive subspace of the Banana model, which causes the noise in the importance sampler estimate of the likelihood. . . . 170
6.3 Violin plots with the distribution of RMSE of the differences between the true posterior mean and the mean estimated by each of the algorithms over 50 runs in the 25D Banana model. SMC² has a lower mean and upper quartile than all the other algorithms, but it shows longer tails: a sign that, although the algorithm currently uses significantly more likelihood evaluations than the others, it may still get stuck in one of the tails, probably indicating that additional tuning of the algorithm is necessary. . . . 175
6.4 Estimate of the mean of component θ1 of the model of Section 4.9.1 across 10
runs of AS-SMC-a. The average across runs is 0.0 with a standard deviation
of the measure of 0.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.5 Estimate of the mean of component θ2 of the model of Section 4.9.1 across 10
runs of AS-SMC-a. The average across runs is 0.0 with a standard deviation
of the measure of 0.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
6.6 Adaptation of the direction of the Active Subspaces in AS-SMC-a algorithm,
measured across 10 different runs: the direction during the adaptation goes
from prior AS of Figure 4.7 at tempering step 0, to posterior AS of Figure 4.8
in the final tempering steps. The same tempering steps have been used across
the 10 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
A.1 Random coalescent tree with 5 leaves generated using the procedure outlined
in the first part of Section 3.13.1. This has been the generating tree for the
synthetic data of the test described in this section. Visualization via FigTree
[Rambaut, 2023] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
A.2 Annealing steps in the SMC run for the 5-taxa example studied in this section. 192
A.3 ESS versus annealing SMC step for the 5-taxa example studied in this section.
Resampling is performed whenever the ESS falls below 50% of particles (1000
particles are used for the simulation). . . . . . . . . . . . . . . . . . . . . . . 193
A.4 Frequency distribution using the native MCMC run with BEAST2 for the pa-
rameter Gamma shape with 5 taxa. Visualization with the software Tracer. . 194
A.5 Frequency distribution for the parameter Gamma shape with 5 taxa, using
Annealed Adaptive SMC algorithm that we have embedded into BEAST2. Vi-
sualization with python matplotlib. . . . . . . . . . . . . . . . . . . . . . . . . 195
A.6 Frequency distribution using the native MCMC run with BEAST2 for the pa-
rameter Effective Population Size with 5 taxa. Visualization with the software
Tracer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
A.7 Frequency distribution for the parameter Effective Population Size with 5 taxa,
using Annealed Adaptive SMC algorithm that we have embedded into BEAST2.
Visualization with python matplotlib. . . . . . . . . . . . . . . . . . . . . . . 197
A.8 Evolution of the estimated standard deviation of particles vs the annealing step
for the parameter Effective Population Size in the SMC algorithm: we see how
the standard deviation drops significantly through the annealing journey. . . . 198
A.9 Consensus tree for the MCMC run with visualization of 95% range for co-
alescent times. See comparison with the generating tree (which the MCMC
run tries to reconstruct) in Figure A.1. Consensus tree has been generated
with TreeAnnotator and the visualization is with FigTree (both software from the BEAST2 package). . . . 199
A.10 Consensus tree for the Annealed Adaptive SMC run with visualization of 95%
range for coalescent times. See comparison with the generating tree (which
the SMC run tries to reconstruct) in Figure A.1, and also with the tree recon-
structed using MCMC in Figure A.9: we see that the SMC is able to reconstruct
the generating tree well and with a smaller uncertainty (the 95% uncertainty
ranges in the coalescent times are in general smaller compared to the MCMC
of Figure A.9). Consensus tree has been generated with TreeAnnotator and the
visualization is with FigTree (both software from the BEAST2 package). . . . 200
A.11 Random coalescent tree with 20 leaves generated using the procedure outlined
in the first part of Section 3.13.1. This has been the generating tree for the
synthetic data of the test described in this section. Visualization via FigTree
[Rambaut, 2023] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
A.12 Annealing steps in the SMC run for the 20-taxa example studied in this section.202
A.13 ESS versus annealing SMC step for the 20-taxa example studied in this section.
Resampling is performed whenever the ESS falls below 50% of particles (1000
particles are used for the simulation). . . . . . . . . . . . . . . . . . . . . . . 202
A.14 Frequency distribution using the native MCMC run with BEAST2 for the pa-
rameter Gamma shape with 20 taxa. Visualization with the software Tracer. 204
A.15 Frequency distribution for the parameter Gamma shape with 20 taxa, using
Annealed Adaptive SMC algorithm that we have embedded into BEAST2. Vi-
sualization with python matplotlib. . . . . . . . . . . . . . . . . . . . . . . . . 205
A.16 Frequency distribution using the native MCMC run with BEAST2 for the pa-
rameter Effective Population Size with 20 taxa. Visualization with the software
Tracer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
A.17 Frequency distribution for the parameter Effective Population Size with 20
taxa, using Annealed Adaptive SMC algorithm that we have embedded into
BEAST2. Visualization with python matplotlib. . . . . . . . . . . . . . . . . 207
A.18 Evolution of the estimated standard deviation of particles vs the annealing step
for the parameter Effective Population Size in the SMC algorithm: we see how
the standard deviation drops significantly through the annealing journey. . . . 207
A.19 Consensus tree for the MCMC run. See comparison with the generating tree
(which the MCMC run tries to reconstruct) in Figure A.11. Consensus tree
has been generated with TreeAnnotator and the visualization is with FigTree
(both software from the BEAST2 package). . . . 208
A.20 Consensus tree for the Annealed Adaptive SMC run. See comparison with the
generating tree (which the SMC run tries to reconstruct) in Figure A.11, and
also with the tree reconstructed using MCMC in Figure A.19: we see that the
SMC is able to reconstruct the generating tree with similar level of accuracy
as the MCMC. Consensus tree has been generated with TreeAnnotator and the
visualization is with FigTree (both software from the BEAST2 package). . . . 209
List of Tables
4.1 Comparison of MSEs for the three MCMC methods discussed in the paragraph,
using =0.01 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.2 Comparison of MSEs for the three MCMC methods discussed in the paragraph,
using =0.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.3 ESS using prior as importance proposal in the system of Figure 4.6, we see
clearly that the first variable θ1 is inactive since the high ESS shows that it is
little informed by the likelihood, compared to the very low ESS of θ2 . . . . 110
5.6 Banana 25D multiESS out of 200000 likelihood evaluations (please see Table
D.2 for the number of corresponding output samples N per algorithm). . . . 139
5.7 Banana 25D multiESS out of 200000 likelihood evaluations (please see Table
D.2 for the number of corresponding output samples N per algorithm). . . . 142
Chapter 1
Introduction
1.1 Motivation
The first part of my research concentrates on Sequential Monte Carlo (SMC) methods for
phylogenetics. The aim of the research was to deliver methods that can be used to infer the spread of diseases, leveraging a widely used software platform so that researchers can easily access, validate and compare them. We have tried to solve a common problem in phylogenetics: understanding the evolutionary relationships between species starting from genetic data. Two big challenges presently stand in the way. Firstly, existing algorithms are often computationally expensive and not adaptable to online inference, as became evident during the recent COVID-19 outbreak, when there was a clear need for methods that could adapt to rapid variations of the virus [Wu et al., 2020]. Secondly, when using Bayesian inference, traditional Monte Carlo methods like Markov Chain Monte Carlo (MCMC) struggle in high-dimensional spaces. In our research we have tried to address these two topics. We
show the results obtained from developing and integrating an adaptive SMC algorithm in
Bayesian Evolutionary Analysis by Sampling Trees version 2 (commonly known as BEAST2)
[Bouckaert et al., 2019], a well-established software platform among researchers. Our adaptive SMC algorithm embedded in BEAST2 performs comparably to the native Markov Chain Monte Carlo (MCMC) method in terms of accuracy and efficiency. Our work can
be seen as a first step, and future tuning is expected. Although we have done our work
independently, our implementation can be said to integrate all the various algorithms of
[Wang et al., 2019] in BEAST2. It is foreseen that an integration of the adaptive SMC
package into BEAST2 will be done by the owners of the platform, allowing researchers to
use SMC instead of MCMC, following testing and tuning by the platform's developers.
Following up on the aim to bring simplification in complex systems, the focus of the second
part of the thesis is Active Subspaces (AS) [Constantine, 2015]. With AS we try to identify
a smaller subspace informed by the data and to concentrate the algorithmic effort on this
more informative part, primarily to address the curse of dimensionality affecting many Monte
Carlo (MC) methods. Existing AS algorithms were mostly biased, targeting distributions that are only close, by some measure, to the posterior, and leaving users to do substantial post-validation [Constantine et al., 2016]. We built on the foundations of an existing pseudo-marginal-based Active Subspace algorithm [Schuster et al., 2017] and developed unbiased AS algorithms that, in stationarity, target the correct posterior, using the structure provided by AS within a Gibbs-style MCMC [Geman and Geman, 1984], and within Particle Marginal Metropolis-Hastings (PMMH) [Andrieu et al., 2010], Metropolis within Particle Gibbs (MwPG) [Andrieu et al., 2010], and SMC-squared (SMC²) [Chopin et al., 2012]. By embedding AS into the theoretical framework of Gibbs, PMMH, MwPG, and SMC², we ensure that the convergence properties of the original methods are preserved. This means that convergence does not need to be re-demonstrated, as it is already guaranteed by the original algorithms. We have run experiments showing that, in specific settings, these algorithms outperform existing methods, and we provide explanations of the optimal running conditions for each algorithm. Our work sheds light on the
practical applications of Active Subspaces, expanding the range of AS methods available to
researchers and providing clearer guidance on which approaches are most effective in specific
scenarios.
non-exactness, use of the prior for approximations. The chapter sets the ground for the
algorithmic improvements introduced in subsequent chapters.
Chapter 2
Technical Background
This chapter introduces the technical background for the methods and algorithms presented
in this thesis. It introduces the main concepts and techniques of Bayesian statistics and
Monte Carlo methods that will be used in subsequent chapters.
2.1 Bayesian Statistics
We consider a parameter of interest $\theta \in \Theta$ (2.1) and observed data $y \in Y$.
Often in the thesis we will take $\Theta = \mathbb{R}^d$ and similarly $Y = \mathbb{R}^m$. In this context we express the Bayes formula as
$$\pi(\theta \mid y) = \frac{l(y \mid \theta)\,p(\theta)}{p(y)} = \frac{l(y \mid \theta)\,p(\theta)}{\int_{\Theta} l(y \mid \theta)\,p(\theta)\,d\theta}, \qquad \theta \in \Theta,\ y \in Y. \qquad (2.3)$$
In equation (2.3), we define:
• posterior: the term π(θ|y) on the left-hand side of the equation;
• likelihood: the term l(y|θ);
• prior: the term p(θ);
• marginal likelihood (or evidence): the normalising constant p(y) in the denominator.
The prior can be considered as the a priori information we have on the distribution of the parameters; the likelihood is the mechanism through which the data y that we observe update the model.
Equation (2.3) is the normalised posterior; we will sometimes refer to the un-normalised version, omitting the marginal likelihood. We may also refer to the posterior as π(θ) instead of π(θ|y), omitting the conditioning on the observations y for brevity; similarly, we may refer to the likelihood l(y|θ) as l(θ).
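For concreteness, the un-normalised version mentioned above is simply the numerator of (2.3), with the marginal likelihood p(y) absorbed into the proportionality constant:
$$\pi(\theta \mid y) \propto l(y \mid \theta)\,p(\theta).$$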
2.2 The Monte Carlo method
In Bayesian inference we are typically interested in computing expectations of the form
$$\mu = E_p(f(\theta)) = \int f(\theta)\,p(\theta)\,d\theta \qquad (2.5)$$
However, there are cases where the integral in expectation (2.5) cannot be computed in closed form, for example in the case of the posterior π introduced in equation (2.3), either because the normalisation constant is unknown or because the integral itself is intractable. Moreover, the calculation may also be computationally infeasible with traditional numerical integration methods. As shown in [Caflisch, 1998], the convergence rate of a reference numerical integration scheme is $O(N^{-k/d})$, where $k$ is the order of the scheme and $d$ the dimension of the space. This slow convergence rate in high dimensions explains the term curse of dimensionality, which refers to the way increasing the dimension of the state space progressively worsens convergence rates.
We will, in this section, explain the foundations of Monte Carlo methods [Metropolis
and Ulam, 1949] [Robert and Casella, 2004]. If we are able to draw N independent samples
θ1 , θ2 ...θN from p(θ), we can consider the approximation
$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} f(\theta_i) \qquad (2.6)$$
$$\frac{1}{N}\sum_{i=1}^{N} f(\theta_i) \approx \int f(\theta)\,p(\theta)\,d\theta \qquad (2.7)$$
The left-hand-side sum of equation (2.7) equals the right-hand-side integral almost surely in the limit $N \to \infty$ by the strong law of large numbers, which states that the average of many independent random variables with common mean and finite variances tends to stabilise around that mean (note that on the left-hand side $p(\theta)$ is implicitly approximated by the random measure, since we draw the samples $\theta_i$ from it):
$$\hat{\mu} \xrightarrow[N \to \infty]{a.s.} \mu \qquad (2.8)$$
We will explore in the following sections of this chapter how to deal with cases when we are
not able to draw the samples θ1 , θ2 ...θN directly from the distribution p(θ), and we will see
how to write an approximation conceptually similar to equation (2.6). The estimator (2.6)
is unbiased, i.e.
E(µ̂) = µ (2.9)
Defining the Monte Carlo error as
$$\epsilon_N = \frac{1}{N}\sum_{i=1}^{N} f(\theta_i) - \int f(\theta)\,p(\theta)\,d\theta, \qquad (2.10)$$
we can study the convergence of the MC method via the Central Limit Theorem (CLT), which states (see for example [Caflisch, 1998]) that, for N large, since we are considering samples from a population with mean $\mu$ and finite variance
$$\sigma_f^2 = E\left[(f(\theta) - \mu)^2\right], \qquad (2.11)$$
the error is approximately distributed as
$$\epsilon_N \approx \sigma_f\, N^{-1/2}\,\nu, \qquad \nu \sim \mathcal{N}(0, 1). \qquad (2.12)$$
Formula (2.12) indicates that the rate of convergence of the standard Monte Carlo method is $O(N^{-1/2})$, with a multiplicative constant equal to $\sigma_f$, independently of the dimension $d$ of the state space; this is the reason for using Monte Carlo methods instead of, for example, numerical integration methods. In fact, the theoretical convergence rate of nearly all MC methods is $O(N^{-1/2})$, as seen in formula (2.12); it is the constant in front of this rate that makes the difference among the various MC methods.
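As a simple numerical illustration of this rate, the following minimal sketch (not taken from the thesis code; the choice of $f(\theta)=\theta^2$ with a standard Gaussian $p$, so that $\mu = 1$, is purely for concreteness) applies the plain Monte Carlo estimator (2.6) for increasing $N$ and shows the error shrinking roughly like $N^{-1/2}$:
```python
# Minimal sketch of the plain Monte Carlo estimator (2.6); illustrative choices only.
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sampler, n):
    """Plain Monte Carlo estimate (2.6): average f over n i.i.d. draws."""
    theta = sampler(n)
    return np.mean(f(theta))

f = lambda theta: theta**2                  # integrand with E_p[f] = 1 under p = N(0, 1)
sampler = lambda n: rng.standard_normal(n)  # i.i.d. draws from p

for n in [10**2, 10**4, 10**6]:
    est = mc_estimate(f, sampler, n)
    print(n, est, abs(est - 1.0))           # the error shrinks roughly like n**-0.5
```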
2.3 Importance Sampling
Suppose we want to calculate the expectation of a function $f$ according to the density $p$ in a domain $D \subseteq \mathbb{R}^d$:
$$\mu = E_p(f(\theta)) = \int_D f(\theta)\,p(\theta)\,d\theta \qquad (2.14)$$
We may want to approximate the integral via a sum as in equation (2.6), but let us assume that we cannot easily draw samples from the distribution $p$. We can use a proposal distribution $g$ that is easier to draw from: it is sufficient that $g(\theta) > 0$ wherever $f(\theta)p(\theta) \neq 0$. The validity of the process is shown by the equivalences below [Owen, 2013]:
$$E_g\left[\frac{f(\theta)\,p(\theta)}{g(\theta)}\right] = \int \frac{f(\theta)\,p(\theta)}{g(\theta)}\,g(\theta)\,d\theta = \int f(\theta)\,p(\theta)\,d\theta = E_p(f(\theta)) = \mu \qquad (2.15)$$
$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} \frac{p(\theta_i)}{g(\theta_i)}\,f(\theta_i) = \frac{1}{N}\sum_{i=1}^{N} w(\theta_i)\,f(\theta_i), \qquad \theta_i \sim g \qquad (2.16)$$
$$w(\theta) = \frac{p(\theta)}{g(\theta)} \qquad (2.17)$$
The weights w compensate for sampling from the proposal function g, instead of the origi-
nal distribution p. Therefore, we can sample θ1 , θ2 , ..., θN independently from the proposal
distribution g, and due to the LLN (as in formula (2.6)) we have
$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} w(\theta_i)\,f(\theta_i) \xrightarrow[N \to \infty]{a.s.} E_p(f(\theta)) \qquad (2.18)$$
Similarly to equation (2.9) (see also [Johansen and Evers, 2007] and [Owen, 2013]), it is easy
to demonstrate that µ̂ of equation (2.16) is an unbiased estimator of the mean, i.e. that
Eg (µ̂) = µ (2.19)
therefore a function g proportional to f p, and ideally [Owen, 2013]:
$$g_{opt} = \frac{|f|\,p}{E_p(|f|)} \qquad (2.22)$$
although the $g_{opt}$ of equation (2.22) is not practically feasible, because it would mean that we could sample directly from p (which by assumption is not the case). It is therefore advisable, for a good proposal choice, that g is proportional to |f|p (for example, that it has spikes where |f|p does) [Owen, 2013]. We can also see from the second expression in (2.21) that small values of g in the denominator magnify any lack of proportionality between g and |f|p in the numerator [Owen, 2013]; therefore we want a proposal with heavier tails than p (or at least as heavy as p) [Johansen and Evers, 2007].
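A minimal sketch of the importance sampling estimator (2.16)-(2.17), not taken from the thesis code: the target p is a standard Gaussian, the proposal g is a Student-t with 3 degrees of freedom (chosen here only because it has heavier tails than p, as recommended above), and the integrand is $f(\theta)=\theta^2$, whose true expectation under p is 1.
```python
# Minimal importance sampling sketch; target, proposal and integrand are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100_000

theta = rng.standard_t(3, size=n)                      # draws from the proposal g (Student-t, 3 dof)
w = stats.norm.pdf(theta) / stats.t.pdf(theta, df=3)   # weights w = p / g, equation (2.17)
mu_hat = np.mean(w * theta**2)                         # estimator (2.16)
print(mu_hat)                                          # close to the true value 1
```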
2.3.1 Estimating the normalization constant through IS
Suppose now that the target density is known only up to a normalising constant,
$$p(\theta) = \frac{\hat{p}(\theta)}{Z}, \qquad Z = \int \hat{p}(\theta)\,d\theta, \qquad (2.23)$$
and redefine the weights accordingly as $w(\theta) = \hat{p}(\theta)/g(\theta)$. We can use IS to estimate the normalising constant by considering that, from equation (2.23), we have
$$E_g(w(\theta)) = \int \frac{\hat{p}(\theta)}{g(\theta)}\,g(\theta)\,d\theta = \int \hat{p}(\theta)\,d\theta = Z \qquad (2.24)$$
Therefore using again formulae of (2.16) and (2.18) applied to equation (2.24), we have that
$$\hat{Z} = \frac{1}{N}\sum_{i=1}^{N} w_i \qquad (2.25)$$
The Ẑ of equation (2.25) is the estimate of the normalising constant Z up to which we know
the distribution p (a similar procedure can be applied if g is known up to a normalising
constant). Using the result of (2.25), we can write a “self normalised” version of the estimate
of equations (2.16) and (2.18), as follows
$$\hat{\mu} = \frac{\sum_{i=1}^{N} w(\theta_i)\,f(\theta_i)}{\sum_{i=1}^{N} w_i} = \sum_{i=1}^{N} W_i\,f(\theta_i) \qquad (2.26)$$
$$W_i = \frac{w_i}{\sum_{j=1}^{N} w_j} \qquad (2.27)$$
Formula (2.26) is the “self-normalised” version of equations (2.16) and (2.18), with the property that the normalised weights $W_i$ of equation (2.27) add up to 1 (as seen from (2.24) and (2.25), this means that we are simulating drawing from normalised distributions). It is not difficult to show (see for example [Johansen and Evers, 2007]) that the self-normalised estimator carries a bias, expressed in (2.28), which involves the covariance
$$\mathrm{Cov}_g\big(w(\theta),\, w(\theta) f(\theta)\big) = E_g\big[\big(w(\theta) - E_g(w(\theta))\big)\big(w(\theta) f(\theta) - E_g(w(\theta) f(\theta))\big)\big] \qquad (2.29)$$
We can therefore see from (2.28) that $\hat{\mu}$ is biased; it has, though, the advantage that it can be calculated knowing the density only up to a constant, since a normalising constant of the density would cancel out in the calculation of $\hat{\mu}$ (as shown in [Johansen and Evers, 2007]). It should be noted, however, that the bias in (2.28) decreases with increasing N.
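A minimal sketch of equations (2.25)-(2.27), with illustrative choices not taken from the thesis: the target is known only up to a constant, $\hat{p}(\theta) = e^{-\theta^2/2}$ (so that $Z = \sqrt{2\pi}$), the proposal g is a Student-t, and we form both the estimate $\hat{Z}$ and the self-normalised estimate of $E_p[\theta^2]$.
```python
# Minimal self-normalised importance sampling sketch; target and proposal are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 100_000

theta = rng.standard_t(3, size=n)           # draws from the proposal g
p_hat = np.exp(-0.5 * theta**2)             # un-normalised target density p_hat
w = p_hat / stats.t.pdf(theta, df=3)        # un-normalised weights w = p_hat / g
Z_hat = np.mean(w)                          # equation (2.25); true value sqrt(2*pi) ~ 2.5066
W = w / np.sum(w)                           # self-normalised weights, equation (2.27)
mu_hat = np.sum(W * theta**2)               # self-normalised estimate (2.26); true value 1
print(Z_hat, mu_hat)
```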
2.3.2 Effective Sample Size (ESS)
Consider the integral (2.31) that we wish to estimate, and call $\bar{I}$ its standard Monte Carlo approximation:
$$\bar{I} = \frac{1}{N}\sum_{i=1}^{N} f(\theta^{(i)}), \qquad \theta^{(i)} \sim p \qquad (2.32)$$
And we call $\tilde{I}$ the self-normalised IS approximation of (2.31) (as in (2.26)):
$$\tilde{I} = \sum_{i=1}^{N} W(\theta^{(i)})\,f(\theta^{(i)}), \qquad \theta^{(i)} \sim g \qquad (2.33)$$
Where, in equation (2.33), W are the self-normalised weights of equation (2.27). A measure
that has been widely used [Owen, 2013] [Elvira et al., 2018] to compare the performance of
IS estimators, is the so-called Effective Sample Size (ESS), which compares the variances
of the traditional Monte Carlo estimate (2.32) and the IS approximation (2.33)
$$ESS = N\,\frac{\mathrm{Var}(\bar{I})}{\mathrm{Var}(\tilde{I})} \qquad (2.34)$$
We can notice that the ESS of (2.34) has some drawbacks. For example, it depends on the integrand function f (as clearly seen from (2.31), (2.32) and (2.33)), and therefore an estimator that is good for an integrand function f1 may not in general be good for another function f2. Moreover, in order to calculate (2.34) we would need to compute integrals that are in general as intractable as the integral (2.31) that we are trying to estimate (see for example [Elvira et al., 2018] for a detailed expression of such integrals). Therefore some simplifications have been adopted [Elvira et al., 2018] [Kong et al., 1994] that significantly reduce the complexity of (2.34), to:
$$ESS \approx \frac{N}{1 + \mathrm{Var}_g(W)} \qquad (2.35)$$
We see from equation (2.35) that, in the ideal case where the weights are constant (and therefore have zero variance, as happens when the proposal matches the target), we have ESS = N, i.e. we are in a situation that is as good as if we were drawing directly from the target distribution. As we see in [Elvira et al., 2018] [Kong, 1992] [Kong et al., 1994], further simplification of (2.35) brings us to
$$ESS \approx \frac{N\,Z^2}{E_g(W^2)} \qquad (2.36)$$
Where Z is the normalising constant expressed in (2.23). It is to be noted that (2.36) involves, as said, approximations, and that these restrict its validity to cases where the approximations hold [Kong et al., 1994] [Elvira et al., 2018] (for example, since there is no dependence on the integrand function f, it is assumed that the proposal g is “reasonably” close to the optimal proposal (2.22)). Using the particle approximation of Z from (2.25) and $E_g(W^2) \approx \frac{1}{N}\sum_{j=1}^{N} (w^{(j)})^2$ brings us to the final approximation of the ESS, in the version widely used in the literature:
$$\widehat{ESS} = \frac{\left(\sum_{j=1}^{N} w^{(j)}\right)^2}{\sum_{j=1}^{N} \left(w^{(j)}\right)^2} = \frac{1}{\sum_{j=1}^{N} \left(W^{(j)}\right)^2} \qquad (2.37)$$
Where, in (2.37), in the first equation the w(j) are unnormalised weights of equation (2.17),
whereas in the second equation the W (j) are the self-normalised weights of equation (2.27).
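As a practical note, (2.37) is immediate to compute from the weights; the following minimal sketch (not taken from the thesis code) implements both equivalent forms:
```python
# Minimal sketch of the ESS approximation (2.37) computed from importance weights.
import numpy as np

def ess(w):
    """ESS of equation (2.37) from un-normalised weights w."""
    w = np.asarray(w, dtype=float)
    return np.sum(w) ** 2 / np.sum(w ** 2)

def ess_normalised(W):
    """Equivalent form of (2.37) from self-normalised weights W (which sum to 1)."""
    W = np.asarray(W, dtype=float)
    return 1.0 / np.sum(W ** 2)

w = np.array([0.1, 0.4, 0.4, 0.1])
print(ess(w), ess_normalised(w / w.sum()))  # both forms give the same value
```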
2.4 Markov Chain Monte Carlo (MCMC)
A Markov chain is a sequence of random variables $X^{(1)}, X^{(2)}, \ldots$ satisfying the Markov property
$$P(X^{(t)} = x^{(t)} \mid X^{(t-1)} = x^{(t-1)}, \ldots, X^{(1)} = x^{(1)}) = P(X^{(t)} = x^{(t)} \mid X^{(t-1)} = x^{(t-1)}) \qquad (2.38)$$
As we can see from (2.38), the value of the chain at a particular time t depends only on the value at time t − 1. In the following parts of this section we will outline the basic concepts that will help us introduce MCMC [Robert and Casella, 2004].
2.4.1 Markov Kernel
The kernel is a conditional probability density; we will speak more extensively about the associated probability measure in the next subsections. In the continuous case we have that
$$P(X_t \in B_i \mid X_{t-1} = x_{t-1}) = \int_{B_i} k(x_{t-1}, x_t)\,dx_t \qquad (2.39)$$
2.4.2 Initial distribution of the chain
The chain Xn is defined for n ∈ N, therefore there is a X0 , starting point of the chain. We
define the initial distribution of X0 as µ
$$P(X_0 \in A) = \int_A \mu(x)\,dx \qquad (2.41)$$
The obvious extension to the discrete case of (2.41) is similar to formula (2.40), we have in
fact
$$P(X_0 \in A) = \sum_{x_{0i} \in A} \mu(x_{0i}) \qquad (2.42)$$
By using the Markovian property (2.38) and the definition of the kernel we have that [Jo-
hansen and Evers, 2007]
$$P(X_0 = x_0, \ldots, X_n = x_n) = \mu(x_0)\prod_{j=1}^{n} k(x_{j-1}, x_j) \qquad (2.45)$$
We also introduce the notation used in the literature [Robert and Casella, 2004] [Johansen and Evers, 2007]: $k^1(x, A) = k(x, A)$, and
$$k^s(x_t, x_{t+s}) = \int_{A^{s-1}} \prod_{j=t+1}^{t+s} k(x_{j-1}, x_j)\,dx_{t+1}\ldots dx_{t+s-1} \qquad (2.46)$$
2.4.4 Stationary property of the MCMC and invariant distribution
We will in this section introduce the concept of stationary/invariant distribution of the
chain, i.e. a distribution π s.t. [Robert and Casella, 2004]:
Xn ∼ π =⇒ Xn+1 ∼ π (2.48)
“A σ-finite measure π is invariant for the kernel k(·, ·) and the chain XN ” [Jiao, 2017] if
$$\pi(A) = \int_{\chi} k(x, A)\,\pi(x)\,dx \qquad (2.49)$$
Equation (2.49) states the condition (2.48) (if the invariant distribution is a probability measure it is also called stationary, due to (2.48)). A theorem states [Robert and Casella, 2004] that if $X_N$ is a recurrent chain then it has an invariant measure [Jiao, 2017] (recurrence is the property that, whatever the initial condition of the chain, we end up in any set A of positive measure an infinite number of times as $N \to \infty$ [Robert and Casella, 2004]). The stationary property of the MCMC can also be related to another property, which states that the direction of time does not matter in the dynamics of the chain, $P_{X_{n+1}}(X_{n+1} \mid X_n = x) = P_{X_n}(X_n \mid X_{n+1} = x)$. This property is called reversibility and is stated as follows
[Johansen and Evers, 2007] [Robert and Casella, 2004]:
$$\pi(x)\,k(x, y) = \pi(y)\,k(y, x) \qquad (2.50)$$
Equation (2.50) is named the detailed balance condition and provides a sufficient condition for π to be a stationary distribution of the chain. It is easy to prove [Robert and Casella, 2004] that if k and π meet criterion (2.50), then π is the invariant density of the chain [Jiao, 2017].
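The proof is a short standard calculation (sketched here for completeness): integrating the detailed balance condition (2.50) over x, and using the fact that the kernel integrates to one in its second argument, gives
$$\int_{\chi} \pi(x)\,k(x, A)\,dx = \int_{A}\int_{\chi} \pi(x)\,k(x, y)\,dx\,dy = \int_{A}\int_{\chi} \pi(y)\,k(y, x)\,dx\,dy = \int_{A}\pi(y)\,dy = \pi(A),$$
which is exactly condition (2.49).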
2.4.5 Convergence of MCMC
Convergence by LLN
Under the following conditions [Robert and Casella, 2004]: if the chain is Harris-recurrent with invariant measure π, then the following convergence theorem can be proved for any integrable function f whose expectation we want to estimate
$$\lim_{N \to \infty} \frac{1}{N}\sum_{i=1}^{N} f(X_i) = \int \pi(x)\,f(x)\,dx, \qquad \forall X_0,\ \text{a.s. by the LLN} \qquad (2.51)$$
We start by defining an additional property of the chain which will be auxiliary in the formulation of the convergence. We call a chain $X_n$ periodic if it cyclically returns to the same sets of states: mathematically, $X_n$ is periodic with period d if there are non-empty disjoint sets $A_0, \ldots, A_{d-1}$ s.t.
$$P(X_{n+1} \in A_{(i+1) \bmod d} \mid X_n = x) = 1, \qquad \forall x \in A_i,\ i = 0, \ldots, d-1 \qquad (2.52)$$
If the conditions of (2.52) are not met, then Xn is aperiodic. We can now state the follow-
ing conditions of convergence [Robert and Casella, 2004]: if the chain is Harris-recurrent,
aperiodic with invariant measure π, then there is convergence of the chain to the stationary
distribution π, whatever the initial distribution µ:
$$\lim_{n \to \infty} \left\| \int k^n(x, \cdot)\,\mu(x)\,dx - \pi \right\|_{TV} = 0 \qquad (2.53)$$
Where k n is the transition kernel applied n times introduced in (2.47), and the total variation
norm is
$$\|\mu_1 - \mu_2\|_{TV} = \sup_A |\mu_1(A) - \mu_2(A)| \qquad (2.54)$$
Formula (2.53) states ergodicity. In MCMC, ergodicity ensures that, as the number of
steps n → ∞, the distribution of the samples generated by the chain converges to the target
distribution, regardless of the starting point: over time, the chain will explore the entire
space in a way that is in accordance to the target distribution.
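A minimal numerical illustration of this convergence (a sketch with an arbitrary 3-state kernel, not an example from the thesis): applying the kernel repeatedly to an initial distribution drives the total variation distance (2.54) to the invariant distribution towards zero.
```python
# Minimal sketch: convergence of a discrete chain to its invariant distribution, as in (2.53).
import numpy as np

K = np.array([[0.5, 0.4, 0.1],    # transition kernel k(x, y) as a row-stochastic matrix
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

mu = np.array([1.0, 0.0, 0.0])    # initial distribution concentrated on state 0

# invariant distribution: left eigenvector of K with eigenvalue 1
evals, evecs = np.linalg.eig(K.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

dist = mu.copy()
for n in range(1, 21):
    dist = dist @ K                        # one application of the kernel
    tv = 0.5 * np.abs(dist - pi).sum()     # total variation distance, as in (2.54)
    if n % 5 == 0:
        print(n, tv)                       # decreases towards 0
```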
2.4.6 Metropolis-Hastings algorithm
We now need a way to take advantage of the properties of the chain outlined in the previ-
ous sections, and build chains that converge to a stationary distribution of our choice: the
Metropolis-Hastings algorithm (MH) [Metropolis et al., 1953, Hastings, 1970] has been
built with this purpose. Suppose we want to approximate drawing samples from a target
distribution π and we want to do so using MCMC. The MH algorithm allows us to approxi-
mate drawing samples from π using an auxiliary distribution g, named proposal distribution.
We build the transition kernel k in three steps in the following way:
1. given the current state $X_n$, we draw a proposed value $X_n^* \sim g(\cdot \mid X_n)$;
2. we compute the acceptance probability $\alpha(X_n^* \mid X_n) = \min\left(1,\ \frac{\pi(X_n^*)\,g(X_n \mid X_n^*)}{\pi(X_n)\,g(X_n^* \mid X_n)}\right)$;
3. we draw from the uniform distribution $u \sim \mathrm{Unif}[0, 1]$; if $u \leq \alpha$ we set $X_{n+1} = X_n^*$, otherwise $X_{n+1} = X_n$.
The kernel k built with the three-step procedure outlined above can be synthesised as follows:
$$k(X_n, X_{n+1}) = \alpha(X_n^* \mid x_n)\,g(X_n^* \mid x_n) + \mathbf{1}_{\{X_n^* = x_n\}}\left(1 - \int \alpha(s \mid x_n)\,g(s \mid x_n)\,ds\right) \qquad (2.55)$$
It can be demonstrated [Johansen and Evers, 2007, Robert and Casella, 2004] that the kernel
(2.55) satisfies the reversibility condition (2.50) and therefore, as explained in Section 2.4.4,
π is the invariant distribution of the chain. If the proposal g is chosen so that the whole space
is covered, the chain is also recurrent, and we are therefore in the conditions for convergence
by LLN of equation (2.51).
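A minimal sketch of the Metropolis-Hastings procedure described above (not the thesis implementation): the target is a standard Gaussian known only up to a constant, and the proposal g is a symmetric Gaussian random walk, so the Hastings ratio reduces to $\pi(X_n^*)/\pi(X_n)$.
```python
# Minimal Metropolis-Hastings sketch with a symmetric random-walk proposal; illustrative target.
import numpy as np

rng = np.random.default_rng(3)

def log_target(x):
    return -0.5 * x**2          # log of an un-normalised N(0, 1) density

def metropolis_hastings(n_iter, x0=0.0, step=1.0):
    chain = np.empty(n_iter)
    x = x0
    for n in range(n_iter):
        x_star = x + step * rng.standard_normal()        # step 1: propose from g
        log_alpha = log_target(x_star) - log_target(x)   # step 2: log acceptance ratio (symmetric g)
        if np.log(rng.uniform()) <= log_alpha:           # step 3: accept or reject
            x = x_star
        chain[n] = x
    return chain

chain = metropolis_hastings(50_000)
print(chain.mean(), chain.var())   # close to 0 and 1
```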
2.4.7 Gibbs Sampling
If we consider a set of parameters in our state space $\theta = \{\theta_1, \theta_2, \ldots, \theta_n\}$, the Gibbs sampler is no different from other Monte Carlo methods in that it approximates sampling from the posterior distribution $p(\theta) = p(\theta_1, \theta_2, \ldots, \theta_n)$. Like the MCMC methods described in Section 2.4, the Gibbs method produces samples that are not independent; the algorithm iteratively samples from a sequence of conditional distributions of the components i of the state space.
The Gibbs Sampling algorithm can be summarized in the following steps:
1. Initialization: Choose an initial state for the parameters θ^{(0)} = (θ_1^{(0)}, θ_2^{(0)}, ..., θ_n^{(0)}).
2. Component-wise updates: At iteration t, sample each component in turn from its full conditional distribution, θ_i^{(t)} ∼ p(θ_i | θ_1^{(t)}, ..., θ_{i-1}^{(t)}, θ_{i+1}^{(t-1)}, ..., θ_n^{(t-1)}), for i = 1, ..., n.
3. Iteration: Repeat step 2 until the stop condition is met.
4. Stop condition: As described for MCMC in Section 2.4, the stop condition of the algorithm can be decided after a large enough number of iterations, for example after a predetermined ESS is reached (see Section 2.4.8 for ESS).
Theoretical Foundation
The Gibbs sampler is a special case of Metropolis-Hastings, where the proposal is chosen to be the full conditional of some subset of variables (keeping the others fixed). Plugging this proposal into the Metropolis-Hastings acceptance probability gives an acceptance probability of one, so every proposed move is accepted, which recovers the Gibbs sampler. It is easy to demonstrate, for example, that under mild conditions on p the Gibbs sampler is Harris recurrent [Tierney, 1994], and therefore the results on total variation norm convergence described in Section 2.4.5 apply.
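As an illustration of the component-wise updates above, here is a minimal sketch of a Gibbs sampler for a bivariate Gaussian target with correlation rho; the target and its conditionals are illustrative choices, not a model used in this thesis.

import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_iter=10_000, rng=None):
    """Gibbs sampler for a zero-mean bivariate normal with unit variances.

    Each full conditional is available in closed form:
    theta1 | theta2 ~ N(rho * theta2, 1 - rho^2), and symmetrically.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.zeros(2)                      # step 1: initialization
    cond_sd = np.sqrt(1.0 - rho**2)
    chain = np.empty((n_iter, 2))
    for t in range(n_iter):
        # step 2: sample each component from its full conditional
        theta[0] = rng.normal(rho * theta[1], cond_sd)
        theta[1] = rng.normal(rho * theta[0], cond_sd)
        chain[t] = theta
    return chain

chain = gibbs_bivariate_normal()
print(np.corrcoef(chain[1000:].T)[0, 1])  # close to rho = 0.8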
a useful, if approximate, measure and can also be used as a stop condition when running the chain, for example terminating the MCMC run when a predetermined ESS has been reached.
2.4.9 MultiESS
In addition to the ESS of Section 2.4.8, in MCMC we have another measure named multiESS [Vats et al., 2019]. In the case of multivariate samples in a chain, multiESS can give an idea of the overall quality of the samples, like the ESS. While the ESS gives a measure of the quality of the samples for a single component, the multiESS takes into consideration the covariance of the various components, and therefore gives a measure for the overall multivariate chain. The multiESS formula is as follows [Vats et al., 2019]
\text{multiESS} = N \left( \frac{|\Lambda|}{|\Sigma|} \right)^{1/d} \qquad (2.57)
where
paper, and although the exact definition of ρ is outside the scope of this chapter (please see Section 2.2 of [Agapiou et al., 2017] for the full definition), we report one of the formulae from the paper:
\text{ESS} \approx \frac{N}{\rho} \qquad (2.58)
Equation (2.58) indicates that if the distance ρ between the proposal and the target increases, we will have to use a larger number of samples N in order to keep the ESS constant. Comparing (2.58) with (2.35), we understand that the factor driving the variance of the importance weights is the distance between the target and the proposal. It has also been shown in [Agapiou et al., 2017] that, in some reference scenarios, the number of importance samples N has to be increased exponentially as the dimension of the state space increases in order to keep the ESS at a predefined level. Therefore the approximation of the target given by the IS method in general deteriorates exponentially with the dimension. It has to be noted that this “curse of dimensionality” for IS does not always occur: there may be cases where the intrinsic dimension of a system is less than the nominal dimension (one example would be over-parametrised systems). We discuss the topic quite extensively in Chapter 4, where we introduce active subspaces [Constantine, 2015]; in these scenarios we expect the number of particles needed to keep a predetermined ESS to be driven mainly by the dimensionality of the subspace, which is relatively constant, and therefore not to be subject to the curse of dimensionality.
Aside from these particular cases, IS is therefore affected negatively (at an exponential rate) by high dimensions. In contrast, it has been shown [Beskos et al., 2011, Beskos et al., 2014] that sequential Monte Carlo methods like AIS and Sequential Monte Carlo (SMC, introduced later in Section 2.6) scale polynomially as the dimension d of the state space grows large. In particular it has been shown [Beskos et al., 2011, Beskos et al., 2014] that it is possible to find an approximation s.t. the ESS remains at a predetermined level at a cost of O(N d^2), where N is the number of particles and d the dimension of the state space. AIS, as we will explain later in this section, uses intermediate distributions: to bridge the gap between proposal and target, it creates proposals that are progressively closer to the target. The O(N d^2) cost of performing AIS (or SMC) comes from [Beskos et al., 2011, 2014]:
• the number of intermediate distributions needed to keep the ESS at a predetermined level grows as O(d);
• for each intermediate distribution, the cost of an MCMC iteration performed in a state space of dimension d is O(d), as we explain in Section 4.5.
Performing the two operations above for each of the N particles brings the total cost to O(N d^2).
Another reason why AIS will, in general, work better than IS is in those cases where the
target distribution exhibits isolated modes, especially if some important modes are found
only rarely.
2.5.2 Annealed Importance Sampling vs MCMC
AIS will also, in general, work better than MCMC in the case of isolated modes: when MCMC is used on complex multi-modal distributions it will move very infrequently among modes, therefore showing high autocorrelation and a tendency to stabilize only after an extended run time [Neal, 1998]. AIS, which gradually approaches the distribution of interest by making use of interpolating distributions, is an approach that avoids this.
We start the algorithm with β(j) = 0, and therefore with f_0, and we arrive in the last step at the target f_n with β(j) = 1. In the particular case where we use the prior as the starting distribution, expressing the posterior f_n as prior times likelihood as in the standard Bayesian set-up (2.3), we have:
f_n(x) = f_0(x)\, l(x) \qquad (2.60)
where, in (2.60), f_0 is the prior and l is the likelihood. In this case, using equation (2.60) in (2.59), we have that
f_j(x) = f_0(x)\, l(x)^{\beta(j)} \qquad (2.61)
If we indicate with π_i the normalized probability distribution associated with each f_i, the algorithm, as described in Section 2 of [Neal, 1998], is as follows

step 0) x_0 ∼ π_0
step 1) MCMC step targeting π_1, resulting in x_1
step 2) MCMC step targeting π_2, resulting in x_2        (2.62)
...
step n) MCMC step targeting π_n, resulting in x_n

Where, in (2.62), the MCMC moves of the intermediate steps are performed, as explained in Section 2.4.1, using Markov kernels k_i(x_{i-1}, x_i) (for example with Metropolis-Hastings, Section 2.4.6). Explaining the algorithm in more detail:
1. starting from x_0, drawn from π_0, we apply a Markov kernel targeting π_1 and we draw x_1, approximately distributed according to π_1
2. similarly to what we did in the previous step, we move towards π_2 and we draw x_2 from the distribution π_2
...
n. in the last step we approximate drawing a sample x_n from the target density π_n
Repeating the procedure, say, N times and taking each time the last sample of the procedure (the x_n), the algorithm (2.62) produces samples x_n^{(i)}, i = 1, 2, ..., N that are approximately drawn from the target distribution π_n, with approximations that we will discuss in the remainder of the section. Like in Importance Sampling, each particle x_n^{(i)} has a weight that accounts for not drawing directly from the target π_n. We will see that the expression of the weight of each particle is as follows (please note that the superscript i indicating the particle has been omitted for brevity in the f_j of the following formula; in addition, we are using f_j instead of π_j, which is possible because it can be shown [Neal, 1998] that the normalising constants cancel out in the ratios):
w^{(i)} = \frac{f_1(x_0)}{f_0(x_0)} \cdot \frac{f_2(x_1)}{f_1(x_1)} \cdots \frac{f_n(x_{n-1})}{f_{n-1}(x_{n-1})} \qquad (2.63)
Before explaining how it is obtained mathematically, we can see from its expression that (2.63) is made up of products of importance weights: each factor \frac{f_{j+1}(x_j)}{f_j(x_j)} is, as seen in (2.17), the ratio of a target over a proposal; in fact each f_j, by construction, is the proposal for f_{j+1}. These intermediate steps allow, compared to plain IS, a smoother transition from the proposal to the target: by tuning β(j) of equation (2.59) it is possible to have proposals that are closer to their targets, allowing for greater efficiency of the intermediate IS steps. From formula (2.63) we can also appreciate a major advantage of weighted particle systems: if we start from w^{(i)} = \frac{f_1(x_0)}{f_0(x_0)}, where the target is f_1, a simple reweighting operation w^{(i)} \cdot \frac{f_2(x_1)}{f_1(x_1)} will shift the target to f_2 [Chopin, 2002]. The validity of (2.63) can be shown [Neal, 1998] using the results we have already obtained in Section 2.3 for IS and in Section 2.4 for MCMC. In
fact, if we consider an extended state space (x_0, ..., x_n) with a joint distribution as in (2.64), where the \tilde{k}_j are backward transition kernels, we have that the marginal for x_n of equation (2.64) is the density we are looking to draw from (the target distribution). The \tilde{k}_j, as said, are the backward transition kernels associated with the MCMC moves of (2.62), and can be calculated by using the detailed balance condition of MCMC of (2.50).
If we take a look at the proposal distribution g of the procedure (2.62), starting from the first
step x0 ∼ p0 and considering all the subsequent applications of Markov kernels k(xj , xj+1 ),
it has the form:
g(x0 , ...xn ) = f0 (x0 )k0 (x0 , x1 )...kn−1 (xn−1 , xn ) (2.67)
Therefore the AIS can be seen as a multi-step importance sampler, and the expression of the weight for the whole importance sampling process, as seen in (2.17), is given by (2.68), which brings the result (2.63) and proves our case. Since, as we saw in the steps of (2.62), at each step x_j ∼ f_j, the function f_j becomes the proposal for the next target f_{j+1}, and all the rules that apply to the choice of proposal in importance sampling hold (please see Section 2.3). The choice of the proposal, as in importance sampling, is critical for the success of the algorithm. We will see in later sections, for example in the application to complex posterior distributions such as the phylogenetic analysis of genetic sequences in Chapter 3, that in non-trivial cases smooth transitions between functions, i.e. small steps in the exponent β of formula (2.59), are needed to obtain an acceptable ESS.
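The following is a minimal sketch of AIS on a toy problem, assuming a standard-normal prior f_0, a Gaussian likelihood l(x), a linear tempering schedule β(j) and a random-walk MH kernel for the intermediate MCMC moves; none of these choices are taken from the experiments in this thesis.

import numpy as np

rng = np.random.default_rng(0)

def log_prior(x):          # f0: standard normal prior (toy choice)
    return -0.5 * x**2

def log_lik(x):            # l(x): Gaussian likelihood centred at 2 (toy choice)
    return -0.5 * (x - 2.0)**2 / 0.5**2

def log_fj(x, beta):       # tempered density f_j(x) = f0(x) * l(x)^beta, in log form
    return log_prior(x) + beta * log_lik(x)

def mh_step(x, beta, scale=0.5):
    """One random-walk Metropolis-Hastings move targeting f_j."""
    x_star = x + scale * rng.standard_normal(x.shape)
    log_alpha = log_fj(x_star, beta) - log_fj(x, beta)
    accept = np.log(rng.uniform(size=x.shape)) <= log_alpha
    return np.where(accept, x_star, x)

def ais(n_particles=1000, n_steps=50):
    betas = np.linspace(0.0, 1.0, n_steps + 1)
    x = rng.standard_normal(n_particles)        # step 0: draw from pi_0
    log_w = np.zeros(n_particles)
    for j in range(1, n_steps + 1):
        # importance-weight update: f_j(x_{j-1}) / f_{j-1}(x_{j-1}), as in (2.63)
        log_w += (betas[j] - betas[j - 1]) * log_lik(x)
        # MCMC move leaving f_j invariant
        x = mh_step(x, betas[j])
    return x, log_w

x, log_w = ais()
w = np.exp(log_w - log_w.max())
print(np.sum(w * x) / np.sum(w))   # weighted posterior mean estimate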
SMC vs AIS
A first difference between SMC and AIS lies in the fact that SMC uses a technique called resampling to account for the tendency of the importance weights to degenerate as algorithmic time increases (see for example Section 14.3.3 of [Robert and Casella, 2004]), meaning that it is possible to end up, after a few iterations, with a significant number of particles having small weights. We will explain in Section 2.6.1 how resampling helps rejuvenate the current set of particles, although at the cost of impoverishing the diversity of the set.
A second difference, which we will see in Section 2.6.2, concerns the kernel: in SMC the kernel can be a generic Markov kernel, whereas in AIS, as described in Section 2.5, it is specifically an MCMC kernel.
It can therefore be said that AIS is a special case of SMC with no resampling and an MCMC kernel.
2.6.1 Resampling
Resampling involves sampling with replacement from the current set of particles according to their weights. Mathematically, the resampling can be described as follows:
• The resampling step generates a new set of particles {x_r^{(i)}}_{i=1}^{N} such that each x_r^{(i)} is a copy of x^{(j)} with probability proportional to w^{(j)}.
• After resampling we rename {x_r^{(i)}}_{i=1}^{N} as {x^{(i)}}_{i=1}^{N} (so this becomes our current set of particles) and all weights are reset: w^{(i)} = 1/N, i = 1, 2, ..., N.
Resampling addresses the problem of degeneracy in the particle population. At the same time, resampling introduces additional variance, and a technique usually employed is to resample only when the ESS drops below a given threshold (for example when the effective sample size drops below 50% of the number of particles N) [Del Moral and Doucet, 2003].
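A minimal sketch of a multinomial resampling step, written as a standalone helper; systematic or stratified schemes are common lower-variance alternatives, but the multinomial version below is the most direct translation of the description above, and the toy particles and weights are illustrative.

import numpy as np

def multinomial_resample(particles, weights, rng=None):
    """Resample particles with replacement, with probability proportional
    to their weights; returns the new set and uniform weights 1/N."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(particles)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalise the weights
    idx = rng.choice(n, size=n, replace=True, p=w)
    return particles[idx], np.full(n, 1.0 / n)

def ess(weights):
    """Effective sample size of a set of normalised weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w**2)

# resample only when the ESS drops below half the number of particles
particles = np.random.default_rng(1).normal(size=100)
weights = np.random.default_rng(2).exponential(size=100)
if ess(weights) < 0.5 * len(particles):
    particles, weights = multinomial_resample(particles, weights)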
2.6.2 The SMC algorithm
As said in the introduction, the Sequential Monte Carlo (SMC) algorithm [Del Moral and Doucet, 2003] has many commonalities with the AIS seen in Section 2.5. We will, similarly to Section 2.5, make use of consecutive “neighbouring” distributions, i.e. distributions that are not too different from one another (we will give a more precise definition below), so that the proposals and the target distributions at each step are sufficiently close. The starting point is, as in common Monte Carlo methods, that we wish to draw samples from a target distribution π_n. We proceed through intermediate targets as in equation (2.59), and at each step the previous target becomes the proposal for the next target. We proceed in steps similar to (2.62): we start by drawing from an initial distribution π_0 that is easy to draw from, which can for example be the prior (in which case the expressions simplify as in equations (2.60) and (2.61)), and we go on constructing the first steps as done in (2.62). We repeat here for simplicity equation (2.59), using the same symbolism, with f_j the non-normalised version of π_j:
f_j(x) = f_n(x)^{\beta(j)} f_0(x)^{1-\beta(j)}, \quad j = 1, 2, ..., N, \quad 0 \le \beta(j) \le 1 \qquad (2.69)
As we know, we start the algorithm with β(j) = 0, and therefore with f_0, and we arrive in the last step at the target f_n with β(j) = 1. In case we use the prior as the distribution f_0, we have a significant simplification in the formula, which becomes (see also (2.60) and (2.61))
f_j(x) = f_0(x)\, l(x)^{\beta(j)}, \quad j = 1, 2, ..., N, \quad 0 \le \beta(j) \le 1 \qquad (2.70)
We present here the SMC version that makes use of resampling of the particles; we will explain further in the section what this implies. The steps of the SMC algorithm follow [Del Moral and Doucet, 2003]; we use capital W for normalised weights and w for un-normalised weights:

step 0) x_0 ∼ π_0        (2.71)
step 1) use a kernel to move from π_0 to π_1, resulting in x_1

1. we use the drawn particles as an importance sampling proposal (see Section 2.3) for π_1 of equation (2.59), and we have a weight update of w_0^{(i)} = \frac{1}{N} \frac{f_1(x_0)}{f_0(x_0)}; this update reflects the weight of the particles after the drawing process. We then normalise the weights to W_0^{(i)}
2. resampling step (technically resampling does not normally need to happen at every step, see Section 2.6.3 for more details): we resample the particles according to their normalised weights, so the bigger the normalised weight the higher the chance the particle is chosen in the resampling process. This resampling step allows us to eliminate particles where the proposal weakly represents the posterior, and replicates particles where there is a strong representation of the posterior; after resampling, all particles will again have weights W_0^{(i)} = \frac{1}{N}
We see, from the last step in the above procedure, that the weight update in the last step is given by (2.72), where, in (2.73), L_0(x_1, x_0) is a backward kernel, built so that f_1(x_1) is the x_1-marginal of the joint distribution f_1(x_1) L_0(x_1, x_0). Equation (2.72) then becomes
w_1^{(i)} = W_0^{(i)} \frac{\int f_1(x_1)\, L_0(x_1, x_0)\, dx_0}{\int f_0(x_0)\, k_1(x_0, x_1)\, dx_0} \qquad (2.74)
Instead of marginalising, we can write the contribution on the RHS of (2.74) using the joint distributions
w_1^{(i)} = W_0^{(i)} \frac{f_1(x_1)\, L_0(x_1, x_0)}{f_0(x_0)\, k_1(x_0, x_1)} \qquad (2.75)
Since we can choose L0 (x1 , x0 ) of (2.75) at will (as long as equation (2.73) holds), we can
choose L s.t.
f1 (x1 )L0 (x1 , x0 ) = f1 (x0 )k(x0 , x1 ) (2.76)
We can notice, for example, that the condition in (2.76) is satisfied if k is an MCMC kernel with invariant distribution π_1, since (2.76) would in that case be the expression of the detailed balance condition of equation (2.50). By substituting (2.76) into equation (2.72) we have
We then normalise the weights of (2.77) to W_1^{(i)}:
W_1^{(i)} = \frac{w_1^{(i)}}{\sum_i w_1^{(i)}} \qquad (2.78)
By repeating steps 2 and 3 of the algorithm outlined above, we resample if required; for example we can perform the conditional resampling step of (2.79), where α indicates the fraction of the N particles used as a threshold to check for degeneracy. If condition (2.79) is true, we perform the resampling as shown in Section 2.6.1, and we end up with
W_1^{(i)} = \frac{1}{N}, \quad i = 1, 2, ..., N \qquad (2.80)
Therefore, at this point, W_1^{(i)} will either be equal to (2.78), if no resampling has taken place (condition (2.79) false), or to (2.80), if resampling has indeed taken place (condition (2.79) true).
In the next step we move in the state space from x_1 to x_2 using a Markov kernel k_2(x_1, x_2) having f_2 of (2.59) as its target, and, with a procedure similar to the one that has brought us from equation (2.72) to (2.77), we obtain (2.81). Generalising (2.81), the un-normalised weight update at each generic step j is
w_j^{(i)} = W_{j-1}^{(i)} \frac{f_j(x_{j-1})}{f_{j-1}(x_{j-1})} \qquad (2.82)
Where, in equation (2.82), W_{j-1}^{(i)} will either be equal to \frac{1}{N}, if the weight update comes after a resampling step, or to the normalised version of w_{j-1}^{(i)} if there has been no resampling (please see equations (2.77), (2.78), (2.79), (2.80) and (2.81), where the process is described in detail with indexes j = 1 and j = 2).
It is to be noted that, if no resampling is employed in the SMC algorithm (equivalently, if (2.79) were always false), the full weight update becomes the product in (2.83), and we see that equation (2.83) is the same as the AIS expression of equation (2.68), which confirms that AIS is a special case of SMC in which no resampling is employed.
2.6.3 Conditional Effective Sample Size (CESS)
We have seen, in Section 2.6, that one of the steps of the SMC algorithm consists in resampling the particles according to their weights. One naive way to apply the algorithm would be to perform the resampling at each iteration, but each resampling adds to the variance of the weights [Zhou et al., 2013]; remembering that the Effective Sample Size (ESS) we mentioned in Section 2.3.2 is a measure of performance of the estimator and depends on the variance of the weights, we understand that an increase in the variance of the weights would give a worse estimator. A better way to perform the resampling step in SMC is to do so adaptively, for example only when the ESS falls below a certain threshold; this reduces the number of times resampling occurs.
As already discussed (see for example the IS Section 2.3), the better the choice of a proposal for a target distribution, the better the performance of an IS estimator. In the SMC algorithm we move from an initial distribution, say the prior, to the posterior, through a series of intermediate distributions (2.70), where at each step of the tempering process the current distribution acts de facto as an IS proposal for the next. In [Zhou et al., 2013] a quantity named Conditional Effective Sample Size (CESS) is introduced, which helps determine automatically the next best tempering exponent of (2.70), so that the ESS, calculated on the new tempering, remains high. We report below the formula of the CESS (for full technical details see for example algorithm 4 in [Zhou et al., 2013])
\text{CESS}(W_{t-1}, w_t) = \frac{N \left( \sum_{j=1}^{N} W_{t-1}^{(j)} w_t^{(j)} \right)^{2}}{\sum_{k=1}^{N} W_{t-1}^{(k)} \left( w_t^{(k)} \right)^{2}}, \qquad (2.84)
The correct choice of CESS helps ensure that the convergence of the SMC algorithm from the initial distribution to the posterior happens while keeping enough diversity among particles, by controlling the weight update through the annealing exponent. The CESS of (2.84) and the ESS become equivalent if resampling is done at every iteration. Otherwise, the ESS contains information on the discrepancy between the current iteration's approximation and the target, whereas the CESS contains information on the quality of the current IS step [Amaya et al., 2022]. It is considered a good choice [Amaya et al., 2021] to set the CESS threshold so that CESS/N is close to 1. In practical terms, fixing a predefined CESS value leads to automatic choices of the next tempering exponent. The “perfect” choice of CESS will in any case depend on the implementation, as choosing a high CESS threshold will normally result in more tempering steps and therefore longer runs of the algorithm; a trade-off will therefore have to be made, depending on the application [Amaya et al., 2021]. In our algorithms we have chosen CESS/N = 0.9.
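A minimal sketch of how a fixed CESS/N target can be used to choose the next tempering exponent adaptively; the bisection search over the exponent increment and the synthetic per-particle log-likelihoods are illustrative choices, not the BEAST2 implementation.

import numpy as np

def cess(W_prev, w_inc):
    """Conditional ESS of (2.84) for normalised weights W_prev and
    incremental weights w_inc."""
    num = len(W_prev) * np.sum(W_prev * w_inc) ** 2
    den = np.sum(W_prev * w_inc**2)
    return num / den

def next_beta(beta, W_prev, log_lik, target_ratio=0.9, tol=1e-6):
    """Find the next tempering exponent by bisection so that CESS / N stays
    close to target_ratio.

    The incremental weight for moving from beta to beta_new is
    w_inc = l(x)^(beta_new - beta), as in (2.70)/(2.82).
    """
    n = len(W_prev)
    lo, hi = beta, 1.0
    if cess(W_prev, np.exp((1.0 - beta) * log_lik)) / n >= target_ratio:
        return 1.0                       # we can jump straight to the posterior
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        w_inc = np.exp((mid - beta) * log_lik)
        if cess(W_prev, w_inc) / n < target_ratio:
            hi = mid                     # step too large: CESS dropped too much
        else:
            lo = mid                     # step acceptable: try a larger one
    return lo

# toy usage: uniform previous weights, synthetic per-particle log-likelihoods
rng = np.random.default_rng(3)
log_lik = rng.normal(loc=-5.0, scale=2.0, size=1000)
W_prev = np.full(1000, 1.0 / 1000)
print(next_beta(0.0, W_prev, log_lik))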
2.7 Pseudo-marginals in MCMC
We have described in Section 2.4 MCMC and the Metropolis-Hastings (MH) algorithm; in particular, we report again in (2.85) the formulation of the MH acceptance ratio, with g in (2.85) the proposal distribution, p as usual the prior, and l the likelihood.
There are cases where the likelihood l of (2.85) is intractable or not convenient to calculate, and it can be useful to introduce into the algorithm an estimate of the likelihood. Assuming that the state space can be partitioned into two sets of variables, we express the likelihood as
l = l(a, i) \qquad (2.86)
Of course, having made a change to the original MCMC, we need a justification, mainly to understand if and how using (2.88) instead of (2.85) as the MH ratio affects the convergence of the MCMC.
It has been shown in [Beaumont, 2003, Andrieu and Roberts, 2009] that as long as the estimate \hat{l} is unbiased, so that E[\hat{l}(a)] = l(a), the conditions of convergence for MCMC covered in Section 2.4 remain valid. It is to be noted that the variance of \hat{l} affects the efficiency of the MCMC algorithm [Beaumont, 2003, Andrieu and Roberts, 2009]. This is clear by looking at equation (2.88) and considering, for example, that a higher variance of \hat{l}(a) will normally imply a higher variance of the ratio α of (2.88). It has also been shown that the variance of the pseudo-marginal chain is greater than or equal to that of the exact marginal chain [Andrieu and Vihola, 2015].
via Importance Sampling (see Section 2.3) the estimate
\hat{l}(a) = \frac{1}{N_i} \sum_{n=1}^{N_i} \frac{l(a, i_n)}{q(i_n)} \qquad (2.89)
Equation (2.89) approximates, in the Monte Carlo sense, the marginal l(a) = \int l(a, i)\, di, and this is the reason behind the name pseudo-marginal for this estimate of the likelihood. We know that Importance Sampling produces unbiased estimates (see Section 2.3), and therefore the likelihood estimate \hat{l}(a) of (2.89) can be used in equation (2.88), keeping the theoretical conditions for MCMC convergence intact.
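A minimal sketch of the importance-sampling likelihood estimate (2.89) for a toy model in which the latent variable i enters a Gaussian observation density; the model, the proposal q and the sample size are illustrative assumptions, not an example from the thesis.

import numpy as np

def likelihood_hat(a, y, n_imp=500, rng=None):
    """Unbiased IS estimate of the marginal likelihood for a toy model:
    i ~ N(0, 1) (prior on the latent variable, also used as the proposal q)
    and y | a, i ~ N(a + i, 1).

    With the proposal equal to the prior on i, the importance ratio
    l(a, i_n) / q(i_n) reduces to the observation density alone.
    """
    rng = np.random.default_rng() if rng is None else rng
    i_n = rng.standard_normal(n_imp)                      # i_n ~ q
    obs_dens = np.exp(-0.5 * (y - a - i_n) ** 2) / np.sqrt(2 * np.pi)
    return obs_dens.mean()                                # average of the ratios

# the estimate is noisy but unbiased: its average converges to the true
# marginal likelihood N(y; a, 2) as the number of replications grows
y_obs, a = 1.3, 0.5
true = np.exp(-0.25 * (y_obs - a) ** 2) / np.sqrt(4 * np.pi)
est = np.mean([likelihood_hat(a, y_obs) for _ in range(200)])
print(true, est)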
Alg. 1 Particle Marginal Metropolis-Hastings (PMMH) [Andrieu et al., 2010]
1: Initialize a^{(0)} and estimate \hat{l}(a^{(0)}) using SMC
2: for k = 1 to K do                                        ▷ start MCMC iteration
3:   Propose a^* from q(a^* | a^{(k-1)})
4:   Run SMC with a^* to estimate \hat{l}(a^*)
5:   Calculate acceptance probability \alpha(a^{(k-1)}, a^*) = \min\left\{1, \frac{p_a(a^*)\, \hat{l}(a^*)\, q(a^{(k-1)} | a^*)}{p_a(a^{(k-1)})\, \hat{l}(a^{(k-1)})\, q(a^* | a^{(k-1)})}\right\}
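Below is a minimal sketch of the PMMH loop, written around a generic unbiased likelihood estimator (for example the importance-sampling estimate sketched above, or an SMC/particle-filter estimate as in Algorithm 1); the Gaussian prior, the random-walk proposal and the toy noisy estimator are illustrative assumptions.

import numpy as np

def pmmh(log_lik_hat, log_prior, a0, n_iter=2000, prop_scale=0.5, rng=None):
    """Particle marginal Metropolis-Hastings with a symmetric random-walk
    proposal, so the q-ratio cancels in the acceptance probability.

    log_lik_hat(a) must return the log of an unbiased (noisy) likelihood
    estimate; the estimate for the current state is stored and reused,
    which is what keeps the algorithm exact despite the noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = float(a0)
    log_l = log_lik_hat(a)
    chain = np.empty(n_iter)
    for k in range(n_iter):
        a_star = a + prop_scale * rng.standard_normal()
        log_l_star = log_lik_hat(a_star)
        log_alpha = (log_prior(a_star) + log_l_star) - (log_prior(a) + log_l)
        if np.log(rng.uniform()) <= min(0.0, log_alpha):
            a, log_l = a_star, log_l_star
        chain[k] = a
    return chain

# toy usage with a deliberately noisy but unbiased likelihood estimate
rng = np.random.default_rng(4)
noisy_log_lik = lambda a: -0.5 * (a - 1.0) ** 2 + np.log(rng.gamma(20.0, 1 / 20.0))
chain = pmmh(noisy_log_lik, lambda a: -0.5 * a**2 / 10.0, a0=0.0)
print(chain[500:].mean())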
The basic idea of PG is to use an outer MCMC performing Gibbs sampling at each iteration t, by first updating i|a and then obtaining an update of the survivor path a_{1:T}^{(t)}. We will dedicate some attention to the conditioning on the path. That some conditioning appears in the SMC part of MwPG was to be expected, as the SMC estimate is part of an outer Gibbs MCMC, and Gibbs uses conditional updates: in the case of MwPG the correct theoretical framework is ensured by keeping the SMC algorithm conditioned on a particular path [Andrieu et al., 2010] (in Algorithm 2 it is the path of particle N of the SMC, sampled on line 16). This special particle path is a guaranteed survivor throughout the SMC algorithm: the path a_{1:T}^{(t-1)} is guaranteed not to be eliminated by resampling until the end of the execution of the SMC algorithm at iteration t (lines 6 to 15 of Algorithm 2), and the new reference path updated at iteration t, a_{1:T}^{(t)}, is only sampled at the end of the SMC execution (line 16 of Algorithm 2).
MwPG algorithm
Some auxiliary notes on Algorithm 2: the algorithm refers to T, which is the final time-step of tempering in the SMC algorithm (see Section 2.6); for simplicity we will assume here that the tempering path is fixed for all iterations, and that the number of particles N in the SMC algorithm is fixed as well. Comments in the algorithm are in italics.
2.9 SMC2
The SMC2 algorithm [Chopin et al., 2012] uses a particle MCMC inside an outer SMC algorithm. In a way we can see SMC2 as an advancement of PMMH, where instead of an outer MCMC we have an SMC. The proofs of convergence of the algorithm can be found in [Chopin et al., 2012]; they use the common technique of augmenting the number of variables and then showing that the algorithm performs an SMC on the augmented space, targeting a distribution that has the original intended target as a marginal. Below is a summary of the algorithm in its basic form.
Alg. 3 SMC2 Algorithm
1: Initialize θ^m from prior p(θ) for m = 1, . . . , N_θ
2: Set initial weights ω_0^m = 1/N_θ for all m
3: Set ESS threshold α
4: Set number of internal SMC particles N_x
5: for t = 1 to T do                                        ▷ Start outer SMC tempering loop
6:   for all θ^m do
7:     Run internal SMC sampler with N_x particles to estimate the incremental likelihood \hat{l}(y_t | y_{1:t-1}, θ^m)
8:   end for
9:   Update weights ω_t^m = ω_{t-1}^m · \hat{l}(y_t | y_{1:t-1}, θ^m)
10:  Normalize weights W_t^m = ω_t^m / \sum_{n=1}^{N_θ} ω_t^n
11:  if ESS < α then                                        ▷ Resampling criterion
12:    Resample θ^m particles based on weights W_t^m
13:    Set W_t^m = 1/N_θ
14:  end if
15:  for all θ^m do                                         ▷ MCMC rejuvenation step
16:    Perform MCMC step on θ^m to obtain θ^{m*}
17:    Set θ^m = θ^{m*}
18:  end for
19: end for                                                 ▷ End outer SMC tempering loop
While in general the number of particles N_x of the internal SMC of Algorithm 3 is fixed throughout all the internal SMC steps, the case for adapting the number of particles N_x within the steps of the algorithm is discussed in [Chopin et al., 2012]. Such a possibility is interesting both on the methodological and on the practical side; we will discuss this extension of the algorithm in more depth when discussing Active Subspaces, in Chapter 4 and the subsequent chapters.
Chapter 3
3.1 Introduction
The recent outbreak of COVID-19 [Wu et al., 2020] has underscored the importance of anal-
ysis in understanding the spread and evolution of infectious diseases, in particular through
the study of mutations and relationships among the various strains of viruses spreading
in different time periods and different geographical locations. Phylogenetic analysis, at its
core, involves the study of the evolutionary relationships among biological species, typically
through the analysis of genetic sequences, for example DNA or RNA, which are coded via
sequences of nucleotides. By examining these genetic sequences, scientists can infer the ge-
nealogy and evolutionary history of organisms, and how the different species have diverged
and evolved over time, and these relationships are often expressed via phylogenetic trees. As
an introductory, visually appealing example, borrowed from [Hou et al., 2022], we show below in Figure 3.1 an instance of a phylogenetic tree:
Figure 3.1: Visually appealing example of a phylogenetic tree, borrowed from [Hou et al., 2022]. The different species are the plants pictured on the RHS of the figure; the lines that combine the different species until reaching a common ancestor (core Chlorophyta on the bottom LHS) represent genetic lineage relationships. The points where the lines combine represent the moments in past time when different species started to diverge from the same lineage.
In Figure 3.1,
• The plants pictured on the RHS of the figure are different species
• The lines that combine the different species until reaching a common ancestor (core Chlorophyta on the bottom LHS) represent genetic lineage relationships
• The points where the lines combine represent the moments in past time when different species started to diverge from the same lineage
Commonly, phylogenetic trees like the one pictured in Figure 3.1 are translated into a format that algorithms can act upon. Several software tools can be used for this representation; for example, in Figure 3.2 we see a screenshot of a software tool named FigTree [Rambaut, 2023], which is used for phylogenetic tree analysis and will be introduced later in the chapter. Although (possibly) less visually appealing than Figure 3.1, Figure 3.2 shows the reconstruction of a phylogenetic tree with 7 sequences (the tips of the tree, named t0 to t7, which can be found on the RHS of the picture), the lines representing the lineages (in Figure 3.2 we can see the estimated times reported on the lineage lines), and the points where two species merge into a single branch of the tree, which correspond to the coalescence times.
Figure 3.2: The Figtree software [Rambaut, 2023], useful to visualize and get statistical information on phylogenetic trees. We can see the tips on the right hand side, numbered t0 to t7, which represent the genetic sequences that are the starting point of the analysis (the whole tree is inferred starting from these sequences). The points where two branches merge into one correspond to the coalescence times.
The task of Bayesian inference in phylogenetics is, at its core, to find the posterior distribution over genealogies g, given genetic sequence data y. This chapter will give an introduction to phylogenetics, to the relevant studies on Bayesian statistics applied to phylogenetics, and to some of the most commonly used software tools. A few traditional studies are considered foundational in the field. The Wright-Fisher model [Wright, 1931, Fisher, 1930] provides a basic but functional framework for describing genetic evolution; we describe the model in Section 3.2.3. The coalescent [Kingman, 1982] is a prior distribution on trees that we will use in the following sections; we describe coalescent theory in Section 3.3. The likelihood used in our analysis is Felsenstein's likelihood [Felsenstein, 1981], a formula which evaluates trees given the input genetic sequences; we describe Felsenstein's likelihood in Section 3.6.
Existing Methods and Software
Probably the most famous statistical software tool available for Bayesian phylogenetic analysis is Bayesian Evolutionary Analysis by Sampling Trees, in short BEAST [Drummond and Rambaut, 2007, Drummond et al., 2012], together with BEAST2 [Bouckaert et al., 2019], which was later born as a development branch. Both software packages have been developed by researchers and offer tools for the solution of many real-world research problems. We give a short introduction to BEAST and BEAST2 in Section 3.8. For our research we have used BEAST2 only, and therefore the main focus will be on this platform. BEAST2 uses MCMC as its main Monte Carlo sampling method [Bouckaert et al., 2019]. We describe the BEAST2 software environment in some detail in Section 3.9.
MCMC, which is the native Monte Carlo method in BEAST2, is commonly used for the exploration of the parameter space in phylogenetics [Bouckaert et al., 2019]. We have implemented annealed importance sampling (AIS) and sequential Monte Carlo (SMC) in the BEAST2 software platform. The implementation in a platform that is widely used for phylogenetics allows for a direct comparison of algorithm performance, both from a statistical and from a computational point of view. Although we have developed our work independently, our implementation in BEAST2 can be said to integrate all the various algorithms of [Wang et al., 2019] in the BEAST2 environment; in particular, the results we report here are for the most advanced of the algorithms mentioned in [Wang et al., 2019], i.e. the Annealed Adaptive SMC. The results obtained from our implementation, discussed in detail in Section 3.12, demonstrate that our Annealed Adaptive SMC algorithm achieves performance comparable to the native MCMC method of BEAST2 in terms of both statistical accuracy and computational efficiency. To our knowledge, this is the first implementation of an SMC algorithm in BEAST2. One of the advantages of the SMC method is that we have been able to achieve similar performance to the MCMC by using far fewer output samples: in the 10 taxa example we discuss in Section 3.14, our SMC implementation used 1000 particles, resulting in as many output samples that can be used, for example, to compute expectations. In the native MCMC of BEAST2, in the set-up chosen to achieve a similar number of likelihood evaluations, 350000 iterations have been used, resulting in as many output samples, with significant additional computational time if we want to compute expectations, compared to the SMC case, especially for information-dense objects such as trees. We expect that, as the number of leaves grows, the difference will become even more pronounced.
It has required a non-trivial amount of work to integrate the algorithms in a complex platform like BEAST2. We see the equivalence in statistical results with the native BEAST2 MCMC as a first step, and future tuning of the algorithm and of the integration within BEAST2 can improve the results.
3.2 Modelling genetic evolution
The main goal of population genetics and of phylogenetics (we explain the differences between
the two in Section 3.2.2) “is to infer the past history of populations and describe the evolu-
tionary forces that have shaped their genetic variations” [Tataru et al., 2017]. The current
section will give a short introduction.
3.2.1 DNA
DNA is packaged into chromosomes. Taking the human species as an example, there are 46 chromosomes situated in the cell nuclei, arranged in 23 pairs, with one chromosome of each pair inherited from each parent. Genes are sections of a chromosome situated at so-called loci; each gene is responsible for a trait, for example hair colour. Different variations of the same gene are called alleles. The expression of different alleles of the same gene will result in different characteristics, for example a different hair colour, say brown or blonde [Alberts et al., 2002].
• selection: mutations that are more advantageous become more likely to be passed on to the following generations.
Population genetics usually works on the time-scale of a single generation [Tataru et al., 2017], and the goal is to understand the evolution of allele frequencies within the same generation, which is done for example using the Wright-Fisher model, introduced in Section 3.2.3. The time-scale of phylogenetics is usually longer and spans multiple generations [Tataru et al., 2017], and the aim is usually to infer the coalescent times of different species [Tataru et al., 2017]; such a task is accomplished for example by the coalescent model, introduced in Section 3.3. Of course the above subdivision between population genetics and phylogenetics is to be taken as a guideline, and the differences between the two can be blurred [Tataru et al., 2017]. In the rest of the work we will, most of the time and for ease of notation, refer to phylogenetics, meaning analysis related either to population genetics or to phylogenetics, and therefore both to single- and multi-generational data or a combination of the two; the relevant time-scale will be clear from the context.
3.2.3 The Wright-Fisher model
The Wright-Fisher model [Wright, 1931, Fisher, 1930] describes changes in allele frequencies. It assumes random sampling (i.e. if we consider two successive generations, parents can be assigned at random from the generation before) and a constant population size [Tataru et al., 2017]. Assuming population size N (which, as we said, is a constant of the model), we describe below a diploid (i.e. with two sets of chromosomes, one coming from each parent) model of individuals and we see the basic laws behind changes in allele frequencies. Let us assume for the example that we have two alleles AA and AB, only subject to random drift (as said, the model is reasonable for short timescales). We provide below two slightly different derivations of the Wright-Fisher model, one where the binomial probability is given by the allele frequency, the other where it is given by the population size.
We use the description and the same notation as [Tataru et al., 2017]. We want to express a time relationship for the frequency of the alleles [Tataru et al., 2017], so let r be the generation indicator and let z(r) be the number of individuals that have, say, the allele AA in generation r. The proportion within the population of size N is therefore
x(r) = \frac{z(r)}{N} \qquad (3.1)
Keeping the population size N constant, we use a binomial for the conditional distribution of z in the following generation [Tataru et al., 2017]
z(r+1) \mid z(r) \sim \text{Bin}\!\left(N, \frac{z(r)}{N}\right) \qquad (3.2)
And, plugging (3.1) into (3.2), we have that the mean and variance of the binomial (3.2) are as follows [Tataru et al., 2017]:
E[x(r+1) \mid x(r)] = x(r) \qquad (3.4)
Var[x(r+1) \mid x(r)] = \frac{1}{N}\, x(r)\,(1 - x(r)) \qquad (3.5)
By iterating the two expressions we have that [Tataru et al., 2017]:
E[x(r+1) \mid x(0)] = x(0) \qquad (3.6)
Var[x(r+1) \mid x(0)] = x(0)\,(1 - x(0)) \left[ 1 - \left(1 - \frac{1}{N}\right)^{r} \right] \qquad (3.7)
And, for big N, we can use the approximation [Tataru et al., 2017]
Var[x(r+1) \mid x(0)] \approx x(0)\,(1 - x(0)) \left(1 - e^{-t}\right) \qquad (3.8)
where t in (3.8) is
t(r, N) = \frac{r}{N} \qquad (3.9)
We can see from (3.9) that in the Wright-Fisher model we can estimate the population size N only if the generation r is known; otherwise we can only estimate the combined t(r, N) of (3.9), which we can name the generation time [Tataru et al., 2017, Drummond et al., 2002].
From equations (3.4) and (3.5) we can see that there are two equilibria:
1. x(r) = 0, when in a generation we reach zero individuals with the specific allele: this causes the expected value for the following generations to be zero as well, with zero variance, so the particular allele is extinct;
2. x(r) = 1 (equivalently z(r) = N), i.e. all the individuals have the allele: in the future generations all individuals will have the allele as well, with zero variance, and we have full spread of the allele.
The above conclusion brings us to say that, under the conditions of the model, if a certain allele has a small frequency it is more likely to disappear after a few generations, as it is more likely to reach the equilibrium x(r + n) = 0 for some n, whereas if it has a frequency close to 1 (nearly all the population has the allele) it is more likely to reach the equilibrium point x(r + n) = 1 for some n, i.e. all the population will end up having the specific allele [Tataru et al., 2017].
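A minimal sketch of forward simulation under the Wright-Fisher model (3.2), tracking the allele frequency x(r) across generations until fixation or loss; the population size and starting frequency are illustrative choices, not values used in the thesis.

import numpy as np

def wright_fisher(n_pop=1000, x0=0.1, max_gen=10_000, rng=None):
    """Simulate allele counts z(r+1) | z(r) ~ Bin(N, z(r)/N) until the allele
    is lost (x = 0) or fixed (x = 1), returning the frequency trajectory."""
    rng = np.random.default_rng() if rng is None else rng
    z = int(round(x0 * n_pop))
    traj = [z / n_pop]
    for _ in range(max_gen):
        z = rng.binomial(n_pop, z / n_pop)      # random drift only
        traj.append(z / n_pop)
        if z == 0 or z == n_pop:                # one of the two equilibria reached
            break
    return np.array(traj)

# alleles starting at low frequency are usually lost, as discussed above
runs = [wright_fisher(rng=np.random.default_rng(s))[-1] for s in range(200)]
print("fraction of runs ending in fixation:", np.mean(np.array(runs) == 1.0))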
We report here also a slightly different mathematical derivation of the Wright-Fisher model, from [Hein et al., 2004], as we will use some of its results in the following sections on coalescent theory. In the same setting as the previous section, we consider the frequency of the allele in generation r + 1 to be binomially distributed,
x(r+1) \sim \text{Bin}(N, p) \qquad (3.10)
In the case of (3.10), differently from equation (3.2), the probability parameter of the binomial is inversely proportional to the population size,
p = \frac{1}{N} \qquad (3.11)
Expressing the binomial probability in full we have
P(x(r+1) = k) = \binom{N}{k} \left(\frac{1}{N}\right)^{k} \left(1 - \frac{1}{N}\right)^{N-k} \qquad (3.12)
For N large, P(x(r+1) = k) of equation (3.12) becomes Poisson distributed [Hein et al., 2004]
P(x(r+1) = k) \approx \frac{e^{-1}}{k!} \qquad (3.13)
Using (3.13), we see that the probability that a particular allele leaves no copies in the next generation is
P(x(r+1) = 0) \approx e^{-1} \approx 0.37 \qquad (3.14)
And therefore, from the result of (3.14), the probability of at least one copy of the allele is approximately
P(x(r+1) \neq 0) \approx 1 - e^{-1} \approx 0.63 \qquad (3.15)
Extending the result of (3.15) to t generations in the future, and considering the independence of the events, as per the hypotheses of the Wright-Fisher model expressed at the beginning of Section 3.2.3, we see that the probability that the lineage still has descendants after t generations is approximately
\left(1 - e^{-1}\right)^{t} \approx 0.63^{t} \qquad (3.16)
And so, under the hypotheses of the model, after a few generations only a few lineages contribute to the current population [Hein et al., 2004]; in fact, taking as an example a population size of N = 10000, after t = 15 generations a number of approximately 10 lineages will have contributed to the current allele population:
10000 \cdot (0.63)^{15} \approx 10 \qquad (3.17)
The remaining 10000 − 10 = 9990 lineages that, in the example given, were present 15 generations ago will not have survived [Hein et al., 2004].
3.3.1 Coalescent of a sample of two different genes
In the same setting as the Wright-Fisher model of Section 3.2.3, i.e. a constant population size of N, discrete generations and full mixing of individuals, we want to infer the distribution of the coalescent time of two genes in a population of size N. Assuming that both genes are sampled at the same time t = 0, we go backwards in time to estimate when they had a common ancestor. Therefore, considering discrete generations, we want to calculate the probability that the two genes had their Most Recent Common Ancestor (MRCA) j generations back; we see in Figure 3.3 a graphical representation of a sample case with population size N = 4 and a coalescent event happening j generations in the past.
Figure 3.3: Example of a coalescent event of two genes, named t_1 and t_2 in the figure, out of a population of N = 4. The coalescent event, as described in the paragraph, happens j generations in the past. The diagram has been created using the package graphviz.
Therefore we have to express the probability that the two genes do not have a common ancestor in the previous j − 1 generations and do have a common ancestor in the jth generation: since sampling in different generations is independent, and given the probability 1/N that they have a common ancestor in any given generation (and therefore 1 − 1/N that they do not), the time to the MRCA is distributed as follows [Hein et al., 2004]:
Pr(T_{MRCA} = j) = \frac{1}{N} \left(1 - \frac{1}{N}\right)^{j-1} \qquad (3.18)
From equation (3.18) we can see that the time to the common ancestor is geometrically distributed with parameter 1/N. Equation (3.18) is derived, under the same assumptions as the Wright-Fisher model, from equation (3.10): the geometric distribution of (3.18) comes from the binomial (3.10), where we focus on the number of “failures” (i.e. the number of generations in which there is no coalescent event) until the first “success” (the coalescent event of the two samples). Using the properties of the geometric distribution in (3.18), we can calculate the expected T_{MRCA}:
E(T_{MRCA}) = \frac{1}{1/N} = N \qquad (3.19)
We see, in equation (3.19), that a bigger population size N means an equally bigger average time to the common ancestor.
Please note in equation (3.20), as explained before, that times in the coalescent model are counted backwards, therefore T_{MRCA} = 1 is one generation back. Since we assume that the population size N is significantly larger than k, the term O(1/N^2) in (3.20) can be neglected, and therefore the probability of no coalescent event in the previous generation becomes:
Pr(T_{MRCA} \neq 1) \approx 1 - \frac{k(k-1)}{2N} \qquad (3.21)
And therefore
Pr(T_{MRCA} = 1) = 1 - Pr(T_{MRCA} \neq 1) \approx \frac{k(k-1)}{2N} = \binom{k}{2} \frac{1}{N} \qquad (3.22)
And, similarly to what we derived in equation (3.18), the probability that two genes out of k different genes in a population of size N have a common ancestor j generations back is given by the probability of no common ancestor for j − 1 generations, i.e. equation (3.21) applied j − 1 times, and then a common ancestor, i.e. equation (3.22), applied once:
Pr(T_{MRCA} = j) = \left(1 - \frac{k(k-1)}{2N}\right)^{j-1} \frac{k(k-1)}{2N} \qquad (3.23)
Firstly we can notice that, by scaling the time by a factor of N [Hein et al., 2004], we have that:
t_j = \frac{j}{N} \qquad (3.24)
Using the time as in (3.24) allows us to express the results independently of the population size N. It can be shown [Hein et al., 2004] that, using the time-scale transformation of equation (3.24) and the assumption that the population size is much bigger than the number of samples, N ≫ k, the geometric distribution converges to an exponential distribution; in fact it is shown in [Kingman, 1982] that as N grows the coalescent process converges to a continuous-time process [Drummond and Bouckaert, 2015]:
Pr(T_{MRCA} = j) = \left(1 - \frac{k(k-1)}{2N}\right)^{j-1} \frac{k(k-1)}{2N} \; \xrightarrow{\; N \gg k \;} \; \lambda\, e^{-\lambda j} \qquad (3.25)
\lambda = \frac{k(k-1)}{2N} \qquad (3.26)
Equation (3.25) gives us the distribution of coalescent times in continuous time. Rewriting (3.25), and using τ to express the time instead of the discrete j, we have that the density expressing the probability that two lineages out of k coalesce at time τ is given by:
Pr(T_{MRCA} = \tau) = \exp\!\left(-\frac{k(k-1)\tau}{2N}\right) = \exp\!\left(-\binom{k}{2}\frac{\tau}{N}\right) \qquad (3.27)
The expected value of (3.27), and therefore the average first coalescent time when we have k different lineages and a population size of N, is, using the properties of the exponential distribution and equation (3.26), equal to 1/λ:
E(\tau \mid k, N) = \frac{2N}{k(k-1)} \qquad (3.28)
So, using equation (3.27), if we want to express the density of all the times required for the k different samples to reach a unique common ancestor, considering that, as per the hypotheses, the generations are not overlapping, there is complete mixing of the population and the population size is constant, we multiply, due to independence, the probabilities of the occurrences [Heled and Drummond, 2008]:
f(\tau_0, \ldots, \tau_k \mid N) \propto \prod_{i=1}^{k-1} \frac{1}{N} \exp\!\left(-\frac{k_i(k_i - 1)\tau_i}{2N}\right) \qquad (3.29)
where, in equation (3.29), the k_i express the number of distinct lineages at each coalescent event. So, for example, if we start the analysis with k = 5 samples (lineages), at the first coalescent event we have k_1 = 5; then, since two of those lineages will have merged into one, at the second coalescent event we have k_2 = 4 lineages, and so on, for a number of coalescent events equal to k − 1 in order to arrive at the common ancestor of all.
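A minimal sketch of simulating the coalescent waiting times of (3.27)-(3.29): with k_i active lineages, the next coalescent time is exponential with rate k_i(k_i − 1)/(2N), and two lineages merge; the sample size and population size below are illustrative choices.

import numpy as np

def simulate_coalescent_times(k=5, n_pop=1000.0, rng=None):
    """Draw the k-1 inter-coalescent waiting times for k sampled lineages.

    While k_i lineages remain, the waiting time to the next coalescent event
    is Exponential with rate k_i * (k_i - 1) / (2 * N), cf. (3.27)-(3.28).
    """
    rng = np.random.default_rng() if rng is None else rng
    times = []
    k_i = k
    while k_i > 1:
        rate = k_i * (k_i - 1) / (2.0 * n_pop)
        times.append(rng.exponential(1.0 / rate))
        k_i -= 1                      # two lineages merge into one
    return np.array(times)

# the expected total tree height is 2N * (1 - 1/k)
heights = [simulate_coalescent_times(rng=np.random.default_rng(s)).sum()
           for s in range(2000)]
print(np.mean(heights), 2 * 1000.0 * (1 - 1 / 5))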
d_H = \frac{h}{l} \qquad (3.30)
where h is the number of sites at which two aligned sequences differ and l is the sequence length. The alphabet of possible nucleotides in a genetic sequence is relatively small, with just four nucleotides {A, G, C, T} (in RNA we have U in place of T). Sites may undergo multiple substitutions, therefore the Hamming distance usually underestimates the actual genetic distance, and more complex mathematical models have been developed to correct the formula. The Jukes-Cantor model provides the following correction to the Hamming distance:
d_{JC} = -\frac{3}{4} \ln\!\left(1 - \frac{4}{3} d_H\right) \qquad (3.31)
where d_H is the Hamming distance of (3.30). Equation (3.31) assumes that the nucleotides have equally likely transitions among them, and that their equilibrium frequencies are all the same. The transitions among the four nucleotides are described as a continuous-time Markov process, using a transition matrix [Drummond and Bouckaert, 2015]
Q = \begin{pmatrix} q_{AA} & q_{AC} & q_{AG} & q_{AT} \\ q_{CA} & q_{CC} & q_{CG} & q_{CT} \\ q_{GA} & q_{GC} & q_{GG} & q_{GT} \\ q_{TA} & q_{TC} & q_{TG} & q_{TT} \end{pmatrix} \qquad (3.32)
While all the off-diagonal elements are non-negative and represent the rates of transition between two nucleotides, the elements on the diagonal are negative and represent the total flow out of each state towards all other nucleotides. If we use variables i and j to indicate two generic nucleotides, the diagonal elements of (3.32) are
q_{ii} = -\sum_{j \neq i} q_{ij} \qquad (3.33)
therefore, the total rate of change per site per unit time is [Drummond and Bouckaert, 2015]
\mu = -\sum_{i} \pi_i q_{ii} \qquad (3.34)
It is possible to obtain the transition probabilities through the following [Drummond and Bouckaert, 2015]
P(t) = \exp(Qt) \qquad (3.35)
where Q is the matrix (3.32). We will see in Section 3.4.1 the expression of the substitution formulae for the Jukes-Cantor model.
By using the Q matrix of (3.36) and the P(t) = exp(Qt) relationship, we end up with the following transition probabilities [Drummond and Bouckaert, 2015]:
p_{ii}(d_{JC}) = \frac{1}{4} + \frac{3}{4} \exp\!\left(-\frac{4}{3} d_{JC}\right) \qquad (3.37)
p_{ij}(d_{JC}) = \frac{1}{4} - \frac{1}{4} \exp\!\left(-\frac{4}{3} d_{JC}\right) \qquad (3.38)
where d_{JC} is the genetic distance of equation (3.31). Both (3.37) and (3.38) tend to \frac{1}{4} as d_{JC} grows.
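A minimal sketch of the Jukes-Cantor quantities above: the corrected distance (3.31) computed from a pair of sequences, and the transition probabilities (3.37)-(3.38); the two short example sequences are fragments used purely for illustration.

import numpy as np

def jc_distance(seq1, seq2):
    """Jukes-Cantor corrected distance (3.31) from the Hamming distance (3.30)."""
    assert len(seq1) == len(seq2)
    d_h = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    return -0.75 * np.log(1.0 - (4.0 / 3.0) * d_h)

def jc_transition_matrix(d_jc):
    """4x4 matrix with p_ii (3.37) on the diagonal and p_ij (3.38) elsewhere,
    rows/columns ordered A, C, G, T."""
    e = np.exp(-(4.0 / 3.0) * d_jc)
    p_same, p_diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return np.full((4, 4), p_diff) + np.eye(4) * (p_same - p_diff)

s1 = "AAGCTTCATAGGAGCAACC"
s2 = "AAGCTTCACCGGCGCAGTC"
d = jc_distance(s1, s2)
P = jc_transition_matrix(d)
print(d, P.sum(axis=1))   # each row of P sums to 1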
f(g \mid \theta) = \prod_{i \in Y} \frac{1}{\theta} \exp\!\left(-\frac{\binom{k_i}{2}}{\theta}\, t_i\right) \qquad (3.39)
where g in (3.39) is the tree, which will be defined below in equation (3.41), and θ is a quantity named the effective population size, related to the population size N via a conversion factor ρ used for time conversions [Drummond et al., 2002]
\theta = N\rho \qquad (3.40)
We performed a similar operation to (3.40), multiplying quantities related to population size and time, in equation (3.24); the concept in (3.40) is similar [Drummond et al., 2002]. To express what (3.39) means, let us say we have N leaf nodes (genetic sequences) with fixed ages t_i, each i ∈ I corresponding to an individual leaf. Define t_Y as the set of coalescent times, and let the edge <i, j> with i > j be the lineage involving nodes i and j; then, if E_g is the edge set [Drummond et al., 2002],
g = (E_g, t_Y) \qquad (3.41)
Equation (3.41) represents a realization of a coalescent process, given the leaf nodes and the times t_I (recall that each i ∈ I represents an individual and its associated time). We define the set of trees as Γ = (E_g, t_Y); as explained in [Drummond et al., 2002], integration over Γ is with respect to dg = dt_{N+1} ... dt_{2N-1}, i.e. over all times except the t_I times of the N tree leaves.
The tree g of (3.41) is characterised by the following distribution (recall equations (3.39) and (3.27)) [Drummond et al., 2002]:
f_G(g \mid \theta) = \frac{1}{\theta^{N-1}} \prod_{i=2}^{2N-1} \exp\!\left(-\frac{\binom{k_i}{2}}{\theta} (t_i - t_{i-1})\right) \qquad (3.42)
Formula (3.42) becomes the coalescent prior that will be used in the expression of the posterior in Section 3.7. For a detailed derivation of (3.42) please refer to [Drummond et al., 2002]; we give here a brief explanation of some key parts: the product in (3.42) comes from the independence of the events forming the coalescences at times t_i, k_i represents the number of branches between t_{i-1} and t_i, and the expression in the exponential comes from combinatorics, considering that the number of ways to form a coalescent event from 2 out of the k_i branches is \binom{k_i}{2} [Drummond et al., 2002].
phylogenetic tree structure (as seen in Section 3.5). The Felsenstein likelihood can be expressed as [Felsenstein, 1981]:
F(D \mid g, Q, \mu) = \sum_{D_Y \in D} \; \prod_{<i,j> \in E_g} \; \prod_{k=1}^{L} \left[ \exp\big(Q \mu (t_i - t_j)\big) \right]_{s_{i,k},\, s_{j,k}} \qquad (3.43)
Where, in (3.43), E_g is the set of edges of the tree, as expressed in (3.41), D is the data and D_Y the data relative to the internal nodes, whereas Q (transition matrix) and µ (mutation rate) are parameters of the substitution model, as seen in Section 3.4 (in particular in equations (3.32) and (3.34) respectively).
Digging a bit more into the terms of (3.43): the inner exponential term represents the probability of transitioning from nucleotide s_{i,k} at node i to nucleotide s_{j,k} at node j, given the substitution model expressed by Q and µ, where t_i − t_j represents the coalescent time between i and j (and therefore the length of the section of the tree between the two nodes i and j). The product over k extends over the length L of the genetic sequence and represents the product of the probabilities of each single site; this is due to independence, in fact one of the assumptions of the Felsenstein likelihood is that transitions at each site happen independently of each other. Moving outwards in formula (3.43), we see a product over <i, j> ∈ E_g: as explained above, i and j are two nodes connected by an edge, so <i, j> represents an edge in the set E_g (as seen in equation (3.41)), and the product is therefore over all edges of the tree. Finally, the outer summation is over all possible realizations D_Y of the internal states, given the DNA sequences at the tips D; it is to be considered as the discrete version of an integral over the state space of all possible realizations of the internal states of the tree, and it ensures that all possible combinations are considered.
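The sum over internal states in (3.43) is usually computed with Felsenstein's pruning (post-order) recursion rather than by explicit enumeration. Below is a minimal sketch for a fixed three-taxon tree under the Jukes-Cantor model of Section 3.4.1, assuming uniform 1/4 root frequencies; the toy sequences are made up, while the branch lengths reuse the values of Figure 3.4 purely for illustration.

import numpy as np

NUC = {"A": 0, "C": 1, "G": 2, "T": 3}

def jc_prob(t):
    """Jukes-Cantor transition probability matrix for branch length t,
    cf. (3.37)-(3.38)."""
    e = np.exp(-(4.0 / 3.0) * t)
    return np.full((4, 4), 0.25 - 0.25 * e) + np.eye(4) * 0.75 * e

def tip_partial(base):
    """Partial likelihood vector at a tip: indicator of the observed base."""
    v = np.zeros(4)
    v[NUC[base]] = 1.0
    return v

def combine(child_partials_and_lengths):
    """Pruning step: multiply, over children, the probability of evolving
    into each child's partial likelihood along its branch."""
    out = np.ones(4)
    for partial, t in child_partials_and_lengths:
        out *= jc_prob(t) @ partial
    return out

def site_likelihood(bases, branch_lengths):
    """Likelihood of one alignment column for the tree ((t1, t2), t3)."""
    b1, b2, b3 = bases
    l1, l2, l3, l_internal = branch_lengths
    internal = combine([(tip_partial(b1), l1), (tip_partial(b2), l2)])
    root = combine([(internal, l_internal), (tip_partial(b3), l3)])
    return float(0.25 * root.sum())          # uniform root frequencies

# toy data: three sequences of length 4, branch lengths as in Figure 3.4
seqs = ["AAGT", "AAGC", "ACGC"]
lengths = (0.5665, 0.5665, 0.9658, 0.3993)
log_lik = sum(np.log(site_likelihood(col, lengths)) for col in zip(*seqs))
print(log_lik)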
• θ is the effective population size, as seen in equation (3.40) of Section 3.5, and can be expressed as the product of the population size N and a parameter ρ that represents the conversion of coalescent times into calendar units, as explained in Section 3.5
• g = {E_g, t} is the tree, with <i, j> ∈ E_g the set of edges and t the coalescent times (see equation (3.41))
• Ω = Ω(Q, µ) includes the parameters of the substitution model and the mutation rate, as seen in equation (3.32) for the expression of Q and in equation (3.34) for µ
And, in equation (3.44), the terms are explained as follows:
• P(θ) is the prior on the population size. For our simulations we have chosen an exponential distribution
P ost(\theta, g, \Omega \mid D) \propto \lambda \exp(-\lambda\theta) \, \frac{1}{\theta^{N-1}} \prod_{i=2}^{2N-1} \exp\!\left(-\frac{\binom{k_i}{2}}{\theta}(t_i - t_{i-1})\right) \cdot \sum_{D_Y \in D} \; \prod_{<i,j> \in E_g} \; \prod_{k=1}^{L} \left[\exp\big(Q\mu(t_i - t_j)\big)\right]_{s_{i,k},\, s_{j,k}} \qquad (3.45)
Equations (3.43) (and consequently (3.44) and (3.45)) assume that the mutation rate µ is constant across all sites, which is often not the case (see for example [Bouckaert and Lockhart, 2015]). As explained in [Bouckaert and Lockhart, 2015] and [Yang, 1994], good results have been obtained by assuming that the mutation rate µ varies across sites according to a gamma distribution Γ(α, 1/α); in this case the Felsenstein likelihood of equation (3.43) can be expressed as
F(D \mid g, \Omega, \alpha) = \prod_{k=1}^{L} \int_{0}^{\infty} \Gamma\!\left(\alpha, \frac{1}{\alpha}\right) \sum_{D_Y \in D} \; \prod_{<i,j> \in E_g} \left[\exp\big(Q r (t_i - t_j)\big)\right]_{s_{i,k},\, s_{j,k}} dr \qquad (3.46)
It is common to approximate the integral of equation (3.46) with a sum over K_Γ categories, in which case equation (3.46) becomes
F(D \mid g, \Omega, \alpha) = \prod_{k=1}^{L} \sum_{c=1}^{K_\Gamma} \sum_{D_Y \in D} \; \prod_{<i,j> \in E_g} \left[\exp\big(Q r_c(\alpha) (t_i - t_j)\big)\right]_{s_{i,k},\, s_{j,k}} \qquad (3.47)
Therefore, in this latter case of a Γ-distributed mutation rate µ, an additional prior is needed in equation (3.44) to account for the shape α of the Γ(α, 1/α) distribution. Ultimately, the full expression of the posterior, formerly in equation (3.44), is updated as follows:
Posterior(\theta, g, \Omega, \alpha \mid D) = P(\theta)\, P_\alpha(\alpha)\, P_c(g \mid N)\, F(D \mid g, \Omega, \alpha) \qquad (3.48)
Where, in (3.48), P_α(α) represents the prior for the parameter α, which determines the shape of the additional Γ distribution.
3.8 Some history of software for phylogenetics: from BEAST to BEAST2
A group of researchers in the field of phylogenetics created a software package named Bayesian Evolutionary Analysis Sampling Trees, or with its acronym BEAST [Drummond and Rambaut, 2007, Drummond et al., 2012], and a community around it. Mixing knowledge in software engineering, phylogenetics and statistics, they have been able to create a software platform that enables researchers to solve real-world phylogenetic problems, for example problems of the form described in Section 3.7, in a Bayesian framework. As the size of the software project grew over the years, some in the community felt the need to create a second branch of work, naming the new software BEAST2 [Bouckaert et al., 2019] to indicate clearly that it was born from BEAST. One of the main developments that BEAST2 brought was the application of software engineering principles, bringing more modularity, better scalability, and better management of packages and feature additions [Bouckaert et al., 2019]. To our knowledge, both BEAST and BEAST2 continue to exist and are updated independently. We have developed our work entirely in BEAST2, and will therefore from now on concentrate only on the BEAST2 platform.
#NEXUS
begin data;
dimensions ntax=3;
format datatype=dna interleave=no gap=-;
matrix
Tarsius_syrichta AAGTTTCATTGGAGCCACCA...
Lemur_catta AAGCTTCATAGGAGCAACC...
Homo_sapiens AAGCTTCACCGGCGCAGTC...
;
end;
The software BEAUTI allows the user to select the Bayesian model settings, such as the priors to be used and the parameters of the substitution model (see Sections 3.3 and subsequent), as well as auxiliary parameters such as the desired number of MCMC iterations to be used for the analysis. The output of BEAUTI is an XML configuration file which is fed into the main software component of the package, named BEAST, which performs the actual Bayesian statistical analysis.
tree STATE_0 = ((1:0.5665,2:0.5665):0.3993,3:0.9658):0.0;
We see in Figure 3.4 below a graphical representation of the tree in listing 3.1, with internal BEAST2 representation tree STATE_0 = ((1:0.5665,2:0.5665):0.3993,3:0.9658):0.0;. Here 0.5665 is the estimated time in the past of the coalescent event between t1 and t2; after an additional time of 0.3993 there is a further coalescent event of t3 with the branch formed by t1 and t2, up to the MRCA, i.e. the estimated Most Recent Common Ancestor of all the sequences.
Figure 3.4: Example of a coalescent tree composed of three taxa, named t1, t2 and t3 in the figure, corresponding to the internal tree state ((1:0.5665,2:0.5665):0.3993,3:0.9658):0.0; in the software BEAST2, as described in the paragraph. We know that coalescent times are to be read into the past (see Section 3.3), therefore the timescale is to be read from the bottom (time of the samples) to the top (the time in the past at which the three species had converged into a single ancestor, named the MRCA, Most Recent Common Ancestor). The diagram has been created using the package graphviz.
The output of BEAST2 MCMC runs is a sequence of trees like the one represented in Figure 3.4, plus the MCMC chains of the other components of the state space (which we discussed when describing the posterior in Section 3.7).
3.10 MCMC example with the standard coalescent in
BEAST2
We show now an example of analysis performed on a sample set of 7 taxa using BEAST2.
As explained in Section 3.5, for the standard coalescent model with N sequences, we need to
estimate the following parameters:
• Population size;
• Tree;
• Substitution model;
• Possibly other parameters (for example the shape of the Gamma distribution for equa-
tions (3.46) or (3.48));
Through the configuration software BEAUTI, introduced in Section 3.9.1, we set the config-
uration of the model, and in particular choose the priors, and for the parameters above we
have chosen the priors as follows
• Substitution model: Jukes Cantor 69, introduced in Section 3.4.1, will be used. As a
reminder, the Jukes Cantor assumes all rates equal for the nucleotides (elements of the
transition matrix (3.36)), and as a consequence assumes equal equilibrium frequencies;
• In addition to the parameters of the substitution model expressed in the previous point, we also want to account for the variability of rates across sites; to model rate variability we will use a gamma site model with 4 categories, i.e. we use 4 groups representing 4 quantiles of the gamma-distributed rate variability (see for example [Bouckaert and Lockhart, 2015]): looking at the discretised equation of the full posterior (3.47), this means using KΓ = 4 (a sketch of this discretisation is given after this list).
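To make the 4-category choice concrete, the sketch below computes one common discretisation of a mean-one gamma rate distribution into KΓ = 4 categories, taking the rate of each category at the midpoint quantile of its bin. This is an illustration only, assuming NumPy and SciPy are available; BEAST2 uses its own Java implementation, whose details may differ.

import numpy as np
from scipy.stats import gamma

def discrete_gamma_rates(alpha, k=4):
    # Shape alpha and scale 1/alpha give a mean-one rate distribution.
    midpoints = (np.arange(k) + 0.5) / k               # midpoint quantile of each of the k bins
    rates = gamma.ppf(midpoints, a=alpha, scale=1.0 / alpha)
    return rates / rates.mean()                        # renormalise so the average rate is 1

print(discrete_gamma_rates(alpha=1.0, k=4))            # 4 relative rates, one per category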
3.10.1 Results
The standard BEAST2 software has been run on the sample set of 7 taxa introduced in this section. A run of BEAST2 produces two main outputs:
• A log file containing the samples of the MCMC chain for the parameters of the state
space, except trees, and in addition other elements such as the values of prior, likelihood
and posterior at each point;
• A trees file, containing the MCMC chain of the coalescent tree at each iteration of the
MCMC. The internal representation of the tree is as described in Section 3.9.2 (in the
listing 3.1), with nested brackets indicating coalescent events, and with the associated
coalescent times.
Figure 3.5: Tracer, a software tool that is part of the BEAST2 package. The software is capable of displaying quantities relative to the MCMC runs, such as estimated distributions and the ESS of the chains.
3.11 Visualization of results for coalescent trees
It is not easy to visualize a complex object like a coalescent tree, and it is also not immediately obvious how to measure and assess convergence of an MCMC chain of coalescent trees in the format shown in Section 3.9.2 (listing 3.1).
3.11.1 TreeAnnotator
Firstly, there is a BEAST2 tool named TreeAnnotator, which performs a Consensus Tree analysis, giving information on the uncertainty related to the estimated coalescent tree.
3.11.2 FigTree
FigTree is a freely downloadable tool [Rambaut, 2023] which can take the MCMC chain of trees as input and visualize the samples. A screenshot of the interface is shown below.
Figure 3.6: The Figtree software [Rambaut, 2023], useful to visualize and get information
on the MCMC chain of trees generated by BEAST2.
Firstly, we recall the form of the posterior and rewrite here, for convenience, some parts of Section 3.7 where the full posterior was explained. The posterior is as follows.
Where, in (3.49)
• P (θ) is the prior on the effective population size θ, which in our simulations we have
chosen exponential λ exp (−λθ) with λ = 0.33
• Pα (α) is the prior on the α parameter of the shape of the Γ distributed mutation rate,
which we have chosen to be Γ (3, 2)
\[
\frac{1}{\theta^{N-1}} \prod_{i=2}^{2N-1} \exp\!\left(-\binom{k_i}{2}\frac{t_i - t_{i-1}}{\theta}\right)
\prod_{k=1}^{L} \int_0^{\infty} \Gamma\!\left(r;\, \alpha, \tfrac{1}{\alpha}\right)
\sum_{D \in \mathcal{D}_g} \prod_{\langle i,j\rangle \in E} \left[\exp\!\left(Q\, r\, (t_i - t_j)\right)\right]_{s_{i,k}\, s_{j,k}} dr \tag{3.50}
\]
We will therefore present the results of the simulations with respect to the ability of the
algorithms to reconstruct correctly the parameters of the state space:
• population size
• coalescent tree
1. Tree Generation: We employed the R library 'ape' to simulate random coalescent trees with a specified number of leaves; the output of this step was a randomly generated coalescent tree with the desired number of leaves.
2. Sequence Evolution: We used the 'seq-gen' software [Rambaut and Grassly, 1997] to simulate DNA sequences evolving along the generated trees, according to the evolution model that we had chosen in advance (we worked, therefore, backwards: knowing which priors and models we would use in BEAST2, we generated data accordingly). The settings used in 'seq-gen' specified a GTR substitution model with equal mutation rates and equal equilibrium frequencies.
The GTR model mentioned above [Tavaré, 1986] is a more general substitution model than the Jukes-Cantor described earlier in Section 3.4.1, but with the settings above (equal mutation rates and equal equilibrium frequencies) the GTR reduces to Jukes-Cantor; the sequences have therefore been generated according to the Jukes-Cantor substitution model.
In accordance with the settings used for the synthetic data, we generated a BEAST2 XML file with a consistent substitution model and priors. In particular, by looking at the posterior expression that we outlined in equation (3.49) in Section 3.12, we can spot in the code snippet below, taken from the BEAST2 XML configuration file, some of the parameter configurations.
<substModel id="JC69.s:SimTree" spec="…">
    <frequencies>0.25 0.25 0.25 0.25</frequencies>
</substModel>
<siteModel id="SiteModel.s:SimTree" spec="…"
    gammaCategoryCount="4"
    shape="@gammaShape.s:SimTree">
    <substModel>JC69.s:SimTree</substModel>
</siteModel>
3.14 Data with 10 Taxa
Following the procedure outlined in Section 3.13.1, we have generated a synthetic model with
10 taxa. Additional results for 5 and 20 taxa cases are reported in Appendix A.
Generator tree
The first step, as described in Section 3.13.1, has been to generate a coalescent tree with 10 leaves; the randomly generated tree is shown below in Figure 3.7.
Figure 3.7: Random coalescent tree with 10 leaves generated using the procedure outlined in
the first part of Section 3.13.1. This has been the generating tree for the synthetic data of the
test described in this section. Visualization via FigTree [Rambaut, 2023]
Using the tree generated in the previous step, synthetic sequences have been generated with the 'seq-gen' program, as explained in Section 3.13.1.
3.15 Annealed Adaptive SMC vs MCMC in BEAST2,
problem set up with 10 Taxa
We have run BEAST2 both with the traditional MCMC algorithm and with our Annealed Adaptive SMC embedded in BEAST2, and we report here the comparison. A fair comparison in terms of likelihood evaluations has been kept between the two methods. For the comparison of results we have used a set-up and metrics similar to [Wang et al., 2019]: the number of MCMC iterations is comparable with the number of likelihood evaluations of the SMC algorithm, given by the number of particles times the number of intermediate tempering steps of the annealing procedure, times the number of MCMC moves per annealing step. Considering the comparison fair, we report below the results for the various parameters of the state space.
Figure 3.8: Annealing steps in the SMC run for the 10-taxa example studied in this section.
Therefore the total number of likelihood evaluations for the algorithm has been 1000 × 55 × 5 = 275000. The adaptive annealing steps have been determined using CESS with a threshold of 90%, and resampling of particles is performed when the ESS falls below 50% of the number of particles; we can see below, in Figure 3.9, the ESS chart for the SMC run (a short sketch of the ESS and CESS quantities is given after Figure 3.9).
Figure 3.9: ESS versus annealing SMC step for the 10-taxa example studied in this section.
Resampling is performed whenever the ESS falls below 50% of particles (1000 particles are
used for the simulation).
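For reference, the sketch below shows how the ESS and CESS quantities mentioned above can be computed from particle weights. It is a minimal illustration in Python, assuming NumPy; our actual implementation inside BEAST2 is written in Java, and the thresholds (90% for CESS, 50% for ESS) are those quoted in the text.

import numpy as np

def ess(normalised_weights):
    w = np.asarray(normalised_weights)
    return 1.0 / np.sum(w ** 2)                       # resample when this drops below 0.5 * N

def cess(previous_weights, incremental_log_weights):
    # Conditional ESS: measures how much the incremental weights produced by a
    # candidate tempering step would degrade the current particle population.
    w = np.asarray(previous_weights)
    v = np.exp(incremental_log_weights - np.max(incremental_log_weights))
    return len(w) * np.sum(w * v) ** 2 / np.sum(w * v ** 2)   # accept the step while above 0.9 * N

weights = np.full(1000, 1.0 / 1000)                   # 1000 equally weighted particles
incremental = 0.1 * np.random.standard_normal(1000)   # toy incremental log-weights
print(ess(weights), cess(weights, incremental))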
The results are reported below for the following parameters of the state space:
• Gamma shape;
• Effective Population Size;
• Coalescent Tree.
3.16.1 Gamma shape
The distributions obtained for the Gamma shape parameter are quite scattered, as we can appreciate from Figure 3.10 for MCMC and Figure 3.11 for SMC. This is due to the fact that the probability of MCMC moves on the gamma shape parameter has been kept at the default value given by the configuration software BEAUti (see Section 3.9.1), so moves on this parameter are less likely to happen than moves on the effective population size and on trees (as an example, an MCMC move on the gamma shape is 30 times less likely than a move on the Effective Population Size parameter); the low ESS and the scattered distributions are a result of this. A better tuning of the frequency of moves should improve the results.
The mean of the MCMC run is close to the true value of 1; the full statistics are given in the following table. The acronym HPD in the table stands for Highest Posterior Density Interval, which represents the narrowest interval within which the parameter falls with 95% probability given the data.
Statistic Value
Mean 1.1159
Standard Deviation 0.1208
Value Range [0.9516, 1.6006]
95% HPD Interval [0.9516, 1.3179]
Effective Sample Size (ESS) 54
Figure 3.10: Frequency distribution from the native MCMC run with BEAST2 for the parameter Gamma shape with 10 taxa. The values inside the 95% HPD interval are shown in blue and those outside the 95% HPD interval are highlighted in gold. Visualization with the software Tracer.
Annealed Adaptive SMC results for Gamma shape
We can see in the table below the statistics for the SMC run. Although the mean is very close to the true value of 1, Figure 3.11 shows that these results should be taken with a pinch of salt, since the particle diversity is poor; future tuning of the frequency of the moves on this parameter should improve the particle diversity.
Statistic Value
Mean 0.9871
Standard Deviation 0.066
Value Range [0.8282, 1.3063]
95% HPD Interval [0.9743, 1.2840]
Figure 3.11: Frequency distribution for the parameter Gamma shape with 10 taxa, using the Annealed Adaptive SMC algorithm that we have embedded into BEAST2. Visualization with python matplotlib.
3.16.2 Effective Population Size
Keeping the constraint on the move frequencies discussed above in mind, we report in this section the results for the Effective Population Size parameter. The mean of the MCMC run is 1.73; the full statistics are given in the following table.
Statistic Value
Mean 1.73
Standard Deviation 0.74
Value Range [0.488, 8.205]
95% HPD Interval [0.721, 3.078]
Effective Sample Size (ESS) 1027
Figure 3.12: Frequency distribution from the native MCMC run with BEAST2 for the parameter Effective Population Size with 10 taxa. The values inside the 95% HPD interval are shown in blue and those outside the 95% HPD interval are highlighted in gold. Visualization with the software Tracer.
The statistics for the SMC run are in general better than those of the MCMC run; we can see, for example, a lower variance. From Figure 3.13 we can also see that the SMC distribution has the same peak as the corresponding MCMC one of Figure 3.12, but in the MCMC case the larger variance and right-skewness cause a slightly higher value of the mean:
Statistic Value
Mean 1.683
Standard Deviation 0.556
Value Range [0.61, 4.29]
95% HPD Interval [0.81, 2.96]
Figure 3.13: Frequency distribution for the parameter Effective Population Size with 10 taxa, using the Annealed Adaptive SMC algorithm that we have embedded into BEAST2. Visualization with python matplotlib.
We can also appreciate the SMC algorithm at work by looking at Figure 3.14 below, showing the evolution of the estimated standard deviation of the Effective Population Size parameter versus the annealing step: the standard deviation drops significantly through the annealing journey.
Figure 3.14: Evolution of the estimated standard deviation of the Effective Population Size parameter versus the annealing step in the SMC algorithm: we see how the standard deviation drops significantly through the annealing journey.
3.16.3 Tree
For the tree analysis we use a methodology similar to [Wang et al., 2019] and we compare trees using the majority-rule consensus. We therefore have a consensus tree, which is a “summary” tree, for the MCMC run and one for the SMC run, and we compare them to the generating tree shown in Section 3.14 to assess how each of the two algorithms has performed. In addition to visualizing the “summary” trees for the two runs, we also give
a basic topological metric of performance, the Robinson-Foulds (RF) “symmetric difference”
metric [Robinson and Foulds, 1981], which will identify possible topology mismatch with the
reference tree. The consensus tree has been generated using TreeAnnotator and then the
visualization using FigTree. For the SMC algorithm, the particles have been resampled in
order to be able to compare SMC tree samples without the need to consider the particle
weights when building the consensus, for ease of calculation.
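As an aside, the RF “symmetric difference” check between a consensus tree and the generating tree can be reproduced with standard phylogenetics libraries. Below is a minimal sketch using the Python library dendropy; the file names are hypothetical placeholders, and in our analysis the consensus tree was produced by TreeAnnotator and inspected with FigTree.

import dendropy
from dendropy.calculate import treecompare

# Both trees must share the same taxon namespace for the comparison to be valid.
tns = dendropy.TaxonNamespace()
reference = dendropy.Tree.get(path="generating_tree.nex", schema="nexus", taxon_namespace=tns)
consensus = dendropy.Tree.get(path="consensus_tree.nex", schema="nexus", taxon_namespace=tns)

rf = treecompare.symmetric_difference(reference, consensus)
print("RF symmetric difference:", rf)   # 0 means the topologies match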
The RF metric for the MCMC run is 0, meaning a match, from a topological point of view, with the reference tree of Section 3.14. We can see in Figure 3.15 below the consensus tree, with visualization of the 95% confidence range on the coalescent times (compare with the generator tree of Figure 3.7).
Figure 3.15: Consensus tree for the MCMC run with visualization of 95% range for coalescent
times. See comparison with the generating tree (which the MCMC run tries to reconstruct)
in Figure 3.7. The consensus tree has been generated with TreeAnnotator and the visualization is with FigTree (both software from the BEAST2 package).
The RF metric for the SMC run is also 0, again meaning a topological match with the reference tree of Section 3.14. We can see in Figure 3.16 below the consensus tree, with visualization of the 95% confidence range on the coalescent times (compare with the generator tree of Figure 3.7, and with the MCMC-generated consensus tree of Figure 3.15).
Figure 3.16: Consensus tree for the Annealed Adaptive SMC run with visualization of 95%
range for coalescent times. See comparison with the generating tree (which the SMC run
tries to reconstruct) in Figure 3.7, and also with the tree reconstructed using MCMC in
Figure 3.15: we see that the SMC is able to reconstruct the generating tree well and with
a smaller uncertainty (the 95% uncertainty ranges in the coalescent times are in general
smaller compared to the MCMC of Figure 3.15). The consensus tree has been generated with TreeAnnotator and the visualization is with FigTree (both software from the BEAST2 package).
By comparing Figure 3.16 with Figure 3.15 we can see that the SMC algorithm has been
able to reconstruct the generating tree with similar performances to the MCMC algorithm.
We may suppose that as the complexity of the posterior increases (the current examples have
been produced synthetically), the Annealed Adaptive SMC will outperform the MCMC for
the reasons outlined in sections 2.5 and 2.6, namely that Annealed SMC navigates better
than MCMC in complex distributions. On the other hand MCMC is often simpler and more
efficient for exploring posteriors that are not highly multimodal or complex.
3.17 Conclusion
In this section we have shown how we successfully integrated a Sequential Monte Carlo algorithm into the BEAST2 platform. Although we have developed our work independently from [Wang et al., 2019], our implementation in BEAST2 can be said to implement the various algorithms of [Wang et al., 2019] in the BEAST2 environment; in particular, the results we have reported in Section 3.12 are for the most advanced of the algorithms mentioned in [Wang et al., 2019], i.e. the Annealed Adaptive SMC. Our implementation is integrated within the BEAST2 platform, complementing the Markov Chain Monte Carlo (MCMC) method used natively in phylogenetic analyses by BEAST2.
To our knowledge, this is the first implementation in BEAST2 of an SMC algorithm. The
results obtained from our implementation, discussed in detail in Section 3.12, demonstrate
that, for the particular cases analysed, the annealed adaptive SMC algorithm achieves per-
formance comparable to the native MCMC method of BEAST2 in terms of both statistical
accuracy and computational efficiency. One of the advantages of the SMC method is that we have been able to achieve similar performance to the MCMC by using far fewer output samples. Consider the 10 taxa example discussed in Section 3.14: as explained in Section 3.15.1, our SMC implementation has used 1000 particles, resulting in as many output samples, which can be used, for example, to compute expectations. In the native MCMC of BEAST2, in the set-up designed to achieve a similar number of likelihood evaluations to the SMC, 350000 iterations have been used, resulting in as many output samples. Even if we were to discard, say, the first 20% as burn-in, in order to calculate an expectation we would still have to use an impressive 280000 MCMC samples versus the 1000 of the SMC, with significant additional computational time. In cases with many leaves, things are likely to get worse quickly, especially for information-dense objects such as trees: for example, in the case of 20 taxa of Appendix A.4, the software TreeAnnotator, described in Section 3.11.1, needs to be run in a low-memory configuration when processing the tree output samples from MCMC (480000 samples, as explained in Appendix A.5.2) in order not to hang, whereas it runs smoothly on the 1000 SMC samples. We expect that, with a growing number of leaves, the difference will become even more remarkable.
It has required a non-trivial amount of work to integrate the algorithms into a complex
platform like BEAST2, as we had to dig deep into the software details of the platform.
This involved understanding and modifying core components to ensure compatibility and
efficiency. We see the equivalence in results with the native BEAST2 MCMC as a first step,
and future tuning of the algorithm and of the integration within BEAST2 can improve the
results.
Our analysis has so far been limited to synthetic data. Future research should focus on more complex scenarios, particularly those involving multimodal distributions. Such distributions are likely to present a more useful test for the Annealed Adaptive SMC algorithm. We may suppose that in these more complex scenarios the Annealed Adaptive SMC will outperform the MCMC, for the reasons outlined in Sections 2.5 and 2.6, namely that Annealed SMC navigates complex distributions better than MCMC.
Moreover, we see a significant scope for advancement in refining the tuning of parameters
within the BEAST2 platform to optimize the performance and the integration of the Annealed
Adaptive SMC algorithm.
In summary, we see our work as a step forward in the application of Monte Carlo methods to phylogenetic analysis. By embedding the Annealed Adaptive SMC algorithm into the BEAST2 platform, we have opened up the possibility of using additional algorithms in a widely used statistical software platform.
Chapter 4
Active Subspaces
4.1 Introduction
As we have explained in previous chapters, Monte Carlo (MC) methods have been widely
used in the exploration of complex and intractable distributions. However, their effectiveness
diminishes in some scenarios, for example in high-dimensional settings with a phenomenon
known as the curse of dimensionality (we have discussed it in Section 2.2). The dimension-
ality problem is particularly visible for MC methods such as Importance Sampling (IS) (see
Section 2.3 and 2.5.1), Markov Chain Monte Carlo (MCMC) (sections 2.4 and 2.5.2) and
Sequential Monte Carlo (SMC) (Section 2.6): all these methods are impacted in general in
a worse-than-linear way [Agapiou et al., 2017, Beskos et al., 2011, 2014]. Active Subspaces
(AS) have been introduced primarily as a method to mitigate the high-dimensional challenges
in generic mathematical systems [Constantine, 2015], with some later applications to Monte
Carlo algorithms (see for example [Constantine et al., 2016, Schuster et al., 2017, Parente,
2020]).
AS identify an intrinsic dimension of the system that is smaller than the nominal dimension of the state space; such a lower-dimensional subspace is present, for example, in over-parametrised systems [Agapiou et al., 2017]. Giving a more precise definition, we call Active Subspace a data-informed subspace identified by the directions of greatest change of the negative log-likelihood; the remaining part of the state space is called inactive, is orthogonal to the Active Subspace and, in the ideal case, is informed only by the prior. The exploitation of Active Subspaces can significantly improve the efficiency of MC algorithms [Constantine et al., 2016]. However, there are challenges that come with the application of AS, and there are limitations on the cases in which it can be used.
In this chapter we will give a mathematical background of the general theory behind AS in Section 4.2, and then we will examine its estimation with Monte Carlo methods in Section 4.3.
In Section 4.4 we will review selected literature and give an example of how existing AS methods have been applied to MCMC with the aim of improving efficiency. The main part of the chapter is the description of, and the results obtained by using, two algorithms. The first algorithm is from [Constantine et al., 2016], and is historically the first AS-based MCMC algorithm. The algorithm approximately integrates out the inactive variables, estimating the marginal likelihood of the active variables. The algorithm is biased and does not target the intended posterior, but rather a distribution close to it in Hellinger distance; we call this algorithm AS-MH-Bias.
In Section 4.7 we discuss some unresolved questions that we spotted in current AS, with
a list of open points and some indications on possible mitigations.
In Section 4.8 we discuss one of the open points, i.e. the bias in AS-based MC algorithms, and we introduce the second AS-based MCMC algorithm [Schuster et al., 2017], which we name AS-MH. The AS-MH, like AS-MH-Bias, produces an estimate of the marginal likelihood in which the inactive variables are approximately integrated out. But the AS-MH, unlike AS-MH-Bias, is unbiased and in stationarity samples from the desired posterior. We will outline some of the advantages and disadvantages of using either algorithm compared to standard MCMC, and show some results on a toy example in Section 4.8.4.
We conclude the chapter with Section 4.9 where we discuss another open point, namely
that in some cases samples from the prior are used in order to build the Active Subspace
structure, rather than posterior points [Constantine et al., 2016]. We discuss the issue in the
section and propose a possible solution.
\[
f : \mathbb{R}^m \to \mathbb{R} \tag{4.1}
\]
the active subspace can be found through eigenvalue analysis. Formally, we consider the following matrix [Constantine, 2015]
\[
C = \int \nabla f(\theta)\, \nabla f(\theta)^T \rho(\theta)\, d\theta \tag{4.2}
\]
4.2.2 Estimating the dimension of Active Subspace through the
spectral gap
Assuming the matrix C of (4.2) is m × m, after sorting the eigenvalues λi in decreasing order, so that λi ≥ λi+1 for i = 0, ..., m − 1, we look for a spectral gap, i.e. a significant difference between two consecutive eigenvalues λi, λi+1. See for example Figure 4.1, which represents a system with m = 4: the spectral gap is found between eigenvalues 2 and 3, where there is a difference of 5 orders of magnitude between them (note the y-axis is in log-scale).
Figure 4.1: Example of spectral gap, i.e. a significant difference between two consecutive
eigenvalues arranged in descending order. In the figure the order of the system is 4, and the
spectral gap is found between eigenvalue 2 and 3 since there is a difference of 5 orders of
magnitude between them (note the y-axis is in log-scale).
Say that the spectral gap is found at n < m (n would be 2 in the example of Figure 4.1): this gives an indication that the dimension of the active subspace is n. Each eigenvalue of (4.2) is, in fact, “the average squared directional derivative of f along the corresponding eigenvector” [Constantine et al., 2016], and therefore a null eigenvalue identifies directions along which f does not change (a numerical sketch of this eigenvalue analysis is given below).
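The sketch below illustrates, under stated assumptions, the Monte Carlo estimation of the matrix of (4.2) and the search for a spectral gap: the gradient function grad_f and the sampler sample_rho are assumed to be supplied by the user for the model at hand, and the toy quadratic model used at the end is purely illustrative.

import numpy as np

def estimate_active_subspace(grad_f, sample_rho, n_samples=1000):
    grads = np.array([grad_f(theta) for theta in sample_rho(n_samples)])   # shape (N, m)
    C_hat = grads.T @ grads / n_samples                                    # MC estimate of (4.2)
    eigvals, eigvecs = np.linalg.eigh(C_hat)
    order = np.argsort(eigvals)[::-1]                                      # decreasing order
    return eigvals[order], eigvecs[:, order]

# Toy usage: f(theta) = 0.5 * theta^T A theta, so grad f(theta) = A theta.
A = np.diag([10.0, 1.0, 1e-4, 1e-5])
eigvals, eigvecs = estimate_active_subspace(
    grad_f=lambda th: A @ th,
    sample_rho=lambda n: np.random.standard_normal((n, 4)),
)
print(eigvals)   # a large drop between consecutive values marks the spectral gap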
where W is the orthonormal basis composed of the eigenvectors, and Ba and Bi are the active and inactive matrices respectively, whose columns are the subsets of eigenvectors corresponding to the eigenvalues λi ≥ λn and λi < λn respectively. With (4.3) we identify a partition of the space, so that
θ = Ba a + Bi i (4.4)
where a is the active variable and i the inactive variable. The variables a and i are defined once the matrices Ba and Bi are built with (4.3); since by construction Ba and Bi are orthonormal, we have that
a = BaT θ (4.5)
and
i = BiT θ (4.6)
It can be shown [Constantine, 2015] that the function f of (4.1) can be approximated by its conditional average given the active variables; we name this conditional average g:
\[
f(\theta) \approx g(B_a^T \theta) \tag{4.7}
\]
where
\[
g(a) = \int f(B_a a + B_i i)\, \rho_i(i \mid a)\, di \tag{4.8}
\]
with ρi(i|a) the conditional density of i given a. As for the quality of the approximation, we
have that the difference of f (θ) of (4.1) and g(a) of (4.8) is bounded in Hellinger distance by
a quantity that depends on the size of the “inactive” eigenvalues [Constantine et al., 2016],
in the following way
\[
\sqrt{\int \left(f(\theta) - g(B_a^T \theta)\right)^2 \rho(\theta)\, d\theta} \;\le\; C \sqrt{\lambda_{n+1} + \dots + \lambda_m} \tag{4.9}
\]
where in (4.9) we have used that, since Ba and Bi are orthonormal, a = BaT θ (and similarly i = BiT θ). The Hellinger distance is used in (4.9) as it provides an upper bound on the error in the posterior mean and covariance [Constantine et al., 2016]. Equation (4.9) states that, via some constant C, the size of the inactive eigenvalues bounds the Hellinger distance between the Active Subspace approximation g(a) and the exact function f(θ). It is therefore important that the eigenvalues after the spectral gap are as small as possible: in the ideal case where λj = 0 for j > n, g(a) is an exact representation of the function f(θ); in the other cases we have a biased approximation, controlled by the size of the λj, j > n, via (4.9).
4.2.4 Active Subspaces directions by principal components
Finding an Active Subspace, as we have seen in the previous sections, corresponds to finding the eigenvalues of the structural matrix of (4.2). As we can appreciate by looking at equation (4.2) itself, this corresponds to performing principal components analysis (PCA) on the uncentered covariance matrix of the gradient of the function (see also, for example, [Cui et al., 2019]), for points taken from a distribution ρ. In fact, the formula for the covariance used in unscaled PCA for generic data Z is
\[
\Sigma = \int Z\, Z^{T} \rho(\theta)\, d\theta \tag{4.10}
\]
and we can appreciate that equation (4.10) is the same as (4.2) when we substitute the generic Z with the gradient data ∇f(θ). Therefore we will, in the remainder, often make use of principal components as a tool to find the Active Subspace directions, as an alternative to the traditional eigenvalue method described previously in Sections 4.2 and 4.2.3: the two methods are equivalent (a short numerical check of this equivalence is given below).
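Below is a small numerical check of the stated equivalence, under illustrative assumptions: it reuses the toy quadratic gradient of the previous sketch and compares the eigenvalues of the uncentered gradient covariance with the squared singular values of the (unscaled, uncentered) PCA of the gradient samples.

import numpy as np

A = np.diag([10.0, 1.0, 1e-4, 1e-5])
thetas = np.random.standard_normal((2000, 4))
grads = thetas @ A.T                                   # each row is grad f(theta) = A theta

# Route 1: eigendecomposition of the uncentered covariance of the gradients (the MC version of (4.2)).
eigvals = np.sort(np.linalg.eigvalsh(grads.T @ grads / len(grads)))[::-1]

# Route 2: "unscaled" PCA via the SVD of the gradient matrix itself.
_, svals, _ = np.linalg.svd(grads / np.sqrt(len(grads)), full_matrices=False)

print(eigvals)       # eigenvalues, in decreasing order
print(svals ** 2)    # squared singular values: the same numbers, up to floating-point error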
\[
\hat g(a) = \frac{1}{M}\sum_{j=1}^{M} f(B_a a + B_i i_j), \qquad i_j \sim \rho_i(i \mid a) \tag{4.11}
\]
The matrix C of (4.2) can likewise be approximated by Monte Carlo as
\[
\hat C = \sum_{j=1}^{N} \nabla f(\theta_j)\, \nabla f(\theta_j)^T \tag{4.12}
\]
where θj for j = 1, . . . , N are drawn independently from the density ρ. Accordingly, matrix approximations B̂a and B̂i will be created from the eigenvectors of (4.12), similarly to what has been shown in Section 4.2.3.
In Section 4.5, considering that AS was first introduced in the MCMC setting [Constantine et al., 2016] in order to decrease the effect of high dimensions, we will quantify the problems of standard MCMC in high dimensions and see that the complexity of the algorithm scales as d², with d the dimension of the state space.
In Section 4.6 we introduce historically the first AS algorithm to be used in MCMC
[Constantine et al., 2016]. The algorithm is biased and targets a distribution that is not
the intended posterior, but is only close in Hellinger distance. The closeness to the intended
target distribution depends on the quality of the AS approximation.
In Section 4.8.2 we introduce a second AS-based MCMC algorithm, from [Schuster et al.,
2017], that is unbiased and in addition it targets the intended posterior. It can therefore be
considered a theoretical advancement from the biased algorithm mentioned earlier.
It is shown in [Gelman et al., 1997, Roberts and Rosenthal, 2001] that (in reference settings having d independent components), with an optimal proposal covariance proportional to 1/d, it is possible to achieve a constant acceptance rate of 0.234. With this optimal proposal, as d grows, a number of iterations linear in d is required to keep constant the mean squared error of the expectation of a function h
\[
MSE = E\!\left[\left(\frac{1}{n}\sum_{j=1}^{n} h(\theta_j) - E(h)\right)^{2}\right] \tag{4.15}
\]
where in equation (4.15) $\frac{1}{n}\sum_{j=1}^{n} h(\theta_j)$ is the approximation of the expectation of h using the MCMC samples θj, whereas E(h) is the true expectation value. In addition, the operation of proposing and evaluating a new point θ∗ = (θ1∗, θ2∗, . . . , θd∗) (lines 4 and 5 of Algorithm 4) with RWMH has a cost of d, given by the d evaluations of the marginals $\pi(\theta^*) = \prod_j \pi_j(\theta_j^*)$ (with π the posterior and πj its marginals), due to the independence of the components. Therefore the complexity of RWMH can be estimated as O(d²), which comes from the O(d) iterations needed to keep the error constant, times the O(d) operations per iteration to evaluate each newly proposed point.
\[
f(\theta) = -\log l(\theta) \tag{4.16}
\]
This section therefore explains the particularization of AS to MCMC. We have seen how the use of gradients allows Active Subspaces to determine the directions of maximum change of the negative log-likelihood, as explained in Section 4.2.3, specifically by using equation (4.2). The integration in (4.2) is performed with respect to a distribution that can be, for example, the prior or the posterior (the use of prior or posterior for the integration will be discussed in Section 4.9). The eigenvectors of equation (4.2) give the directions of the data-informed subspace, calculated using the gradient of the log-likelihood. We call Active Subspace the subspace generated by the eigenvectors associated with the largest eigenvalues; these represent the directions of maximum variation of the likelihood. It therefore makes sense, in order to maximise efficiency, to look for an enhancement of the algorithms by concentrating the algorithmic effort on the Active Subspace. The subspace of vectors orthogonal to the Active Subspace is informed mainly by the prior, and is called the inactive subspace.
We report below the algorithms from [Constantine et al., 2016] which perform the MCMC on the Active Subspace. There is a main MCMC loop in Algorithm 6 which acts on the approximate active marginal created from the conditional expectation of equation (4.11): see line 6 of Algorithm 6, where the MH ratio uses the estimate of the negative log-likelihood ĝ(a), so that exp(−ĝ(a)) is used in place of the likelihood. Algorithm 5 is the routine that calculates the approximate conditional marginal ĝ(a) of (4.11).
Alg. 5 Estimate of the conditional average ĝ(a)
1: function Ĝ(a)
2: Simulate {i_n}_{n=1}^{N_i} ∼ ρ_i(· | a).
3: Compute ĝ(a) = (1/N_i) Σ_{n=1}^{N_i} f(B_a a + B_i i_n).
4: return ĝ(a).
5: end function
Alg. 6 AS-MH-Bias
1: Compute the AS and, using the procedure outlined in Section 4.2.3, estimate the matrices B_a and B_i.
2: Initialize the algorithm by choosing an initial value a^1 and use Algorithm 5 to estimate ĝ(a^1).
3: for k = 2 to T do
4:   a^∗ ∼ q_a(· | a^{k−1}).
5:   Use Algorithm 5 to compute ĝ(a^∗).
6:   Set a^k = a^∗ with probability
     1 ∧ [p(a^∗) exp(−ĝ(a^∗)) q_a(a^{k−1} | a^∗)] / [p(a^{k−1}) exp(−ĝ(a^{k−1})) q_a(a^∗ | a^{k−1})]
7:   Else let a^k = a^{k−1}.
8: end for
9: To map back to the original space, for each k = 1, . . . , T draw {i_j^k}_{j=1}^{N_i} ∼ q_i(· | a^k) and set θ_j^k = B_a a^k + B_i i_j^k.
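For concreteness, the following is a minimal Python sketch of the structure of Algorithms 5 and 6, under stated assumptions: B_a and B_i are taken as already estimated, and the negative log-likelihood f, the active prior log-density log_p_a and a sampler for the conditional density ρ_i(i | a) are user-supplied; the random-walk proposal and its scale are illustrative choices only, not those of [Constantine et al., 2016].

import numpy as np

def g_hat(a, f, B_a, B_i, sample_i_given_a, N_i=50):
    # Algorithm 5: Monte Carlo estimate of the conditional average of f given a.
    i_samples = sample_i_given_a(a, N_i)                        # array of shape (N_i, dim_i)
    return np.mean([f(B_a @ a + B_i @ i) for i in i_samples])

def as_mh_bias(f, log_p_a, B_a, B_i, sample_i_given_a, a0, T=5000, step=0.5):
    # Algorithm 6: MH on the active variables, using exp(-g_hat) in place of the likelihood.
    a = np.asarray(a0, dtype=float)
    g_curr = g_hat(a, f, B_a, B_i, sample_i_given_a)
    chain = [a]
    for _ in range(T - 1):
        a_star = a + step * np.random.standard_normal(a.shape)   # symmetric random-walk proposal
        g_star = g_hat(a_star, f, B_a, B_i, sample_i_given_a)
        log_ratio = (log_p_a(a_star) - g_star) - (log_p_a(a) - g_curr)
        if np.log(np.random.rand()) < log_ratio:
            a, g_curr = a_star, g_star
        chain.append(a)
    return np.array(chain)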
4.7 Discussion on open points in current AS methods
Now that we have introduced the concept of AS and the first AS-based MCMC algorithm
AS-MH-Bias (algorithm 6), we report here a list of some Open Points (OP), which identify
some unresolved questions in current literature.
OP1: Bias
The AS-MH-Bias algorithm is biased and in stationarity samples from a distribution that is only close in Hellinger distance to the intended posterior [Constantine et al., 2016]. This problem is acknowledged in [Constantine et al., 2016] itself, where the authors have to resort to empirical methods to understand whether the samples collected through AS-MH-Bias are close enough to the intended posterior.
OP2: Noisy marginal likelihood estimates
We have seen that AS-based MCMC algorithms may use estimates of the marginal likelihood, obtained by approximately marginalising out the inactive (or, in some cases, active) variables; one such example of an approximate marginal likelihood is Algorithm 5. In some cases the estimates that we obtain may be very noisy, to the point that the benefit of using AS becomes questionable.
OP3: Exploration of the active marginal by MCMC
The current AS Monte Carlo applications mainly employ marginalisation of the inactive variables and perform MCMC on the active part [Constantine et al., 2016, Schuster et al., 2017]: if the Active Subspace marginal is complex (for instance high-dimensional or multimodal), the MCMC may not perform well.
OP4: Prior-defined AS structure
Some decisions on the structure of the AS need to be taken preliminarily, using information from the Bayesian prior, and the resulting structure is maintained throughout the MC algorithm; ideally, however, the structure should be defined on the posterior, and the posterior-defined structure may be quite different from the prior-defined one.
OP5: Dimensionality
Although AS aims to decrease the impact of the curse of dimensionality, some AS algorithms may still suffer from it and overall risk producing a worse outcome than their non-AS counterparts; this happens, for example, when methods like Importance Sampling are used to create the marginal likelihood estimates.
OP6: Linearity
Existing literature focuses mainly on partitioning the state space into active and inactive parts through a linear transformation, but real-world cases are likely to require non-linear transformations.
Open point OP3, on the weaknesses of MCMC, i.e. the cases in which the Active Subspace is complex to explore and the MCMC may encounter problems in navigating the active marginal, will also be addressed in Chapter 6: there we will introduce algorithms whose “backbone” is an SMC sampler rather than MCMC.
Additionally, in Chapters 5 and 6 we will address OP5, on the possible dimensionality problems faced by some AS algorithms, for example by AS-MH which, by using Importance Sampling (IS) on the inactive subspace, may suffer in high dimensions [Agapiou et al., 2017]: we will introduce algorithms where the marginal estimate is produced not by IS but, for example, by an SMC sampler.
In Section 4.9 we discuss OP4 and we present a toy model which highlights the problems of using prior samples, instead of posterior samples, to build the matrices Ba and Bi. We then give a possible solution to the issue in Section 6.4, with AS-SMC-a, which does not need prior samples to build the AS structure, but rather builds the structure adaptively at each step of an SMC algorithm.
We have not analysed OP6, which deals with the linearity of the transformation (4.4), in the present thesis. However, considering that real-world scenarios are likely to require more complex transformations, future work should be foreseen in that direction.
[Constantine et al., 2016] consider a 2D toy example (which we also use in an application in Section 4.8.4) and then a more realistic 100D problem. While the 2D example is created ad hoc with a perfect scenario, having inactive variables that by construction do not influence the likelihood, allowing for a direct and smooth application of Active Subspace theory, the 100D example highlights the challenges: the application of the theory becomes less straightforward in this case. Although a spectral gap is found, the inactive eigenvalues are not small in size; the authors have to acknowledge this and resort to empirical methods to assess convergence, stressing the need for hands-on validation where the theoretical guarantees become weak [Constantine et al., 2016].
where in equation (4.19) pa is the marginal prior for active variables, and pi is the conditional
prior of inactive given the active. Substituting equation (4.19) in (4.18)
\[
l_a(a) = p_a(a) \int p_i(i \mid a)\, l(a, i)\, di \tag{4.20}
\]
The final step is to approximate (4.20) with the Monte Carlo integration performed with
Importance Sampling via pseudo-marginal [Beaumont, 2003, Andrieu and Roberts, 2009]:
\[
\hat l_a(a) = p_a(a)\, \frac{1}{N_i} \sum_{n=1}^{N_i} \frac{p_i(i_n \mid a)\, l(a, i_n)}{q_i(i_n)} \tag{4.21}
\]
where, in (4.21), Ni is the number of samples of inactive variables chosen per each active
variable, and qi is an Importance proposal. Compare the estimate (4.20) with (4.8) of the
non-exact and biased version, and also the Monte Carlo approximations (4.21) with (4.11)
to appreciate the conceptual differences.
By looking at Importance Sampling in (4.21), and remembering from Section 2.3 that the
ideal importance proposal is proportional to the posterior, if the likelihood remains constant
on the inactive subspace (as in the case of perfect Active Subspace), it is clear that an ideal
proposal of the inactive variables is the conditional prior of inactive given the active
\[
w_n = \frac{\tilde w_n}{\sum_{p=1}^{N_i} \tilde w_p}
\]
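The sketch below illustrates, in Python and under stated assumptions, the importance-sampling estimate of (4.21) together with the draw of the index u used by AS-MH to return a single inactive particle. The densities and samplers (log_p_i_given_a, sample_q_i, log_q_i) and the likelihood lik are user-supplied for the model at hand; the p_a(a) prior factor appearing in (4.21) is deliberately left to the caller.

import numpy as np

def likelihood_estimate(a, lik, B_a, B_i, log_p_i_given_a, sample_q_i, log_q_i, N_i=100):
    i_samples = sample_q_i(N_i)                                         # (N_i, dim_i) draws from q_i
    log_w = np.array([log_p_i_given_a(i, a) + np.log(lik(B_a @ a + B_i @ i)) - log_q_i(i)
                      for i in i_samples])
    log_w_max = np.max(log_w)
    w_tilde = np.exp(log_w - log_w_max)                                 # stabilised unnormalised weights
    l_hat = np.exp(log_w_max) * np.mean(w_tilde)                        # estimate of the integral in (4.21)
    weights = w_tilde / np.sum(w_tilde)                                 # normalised weights w_n
    u = np.random.choice(N_i, p=weights)                                # index of the returned inactive point
    return l_hat, i_samples[u], i_samples, weights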
Alg. 8 AS-MH
1: Compute the AS and, using the procedure outlined in Section 4.2.3, estimate the matrices B_a and B_i.
2: Initialize the algorithm by choosing an initial value a^1 and use Algorithm 7 to estimate l̂_a(a^1).
3: for k = 2 to N_a do
4:   a^∗ ∼ q_a(· | a^{k−1}).
5:   Use Algorithm 7 to get l̂_a(a^∗), i^∗_u and {i^∗_j}_{j=1}^{N_i}.
6:   Set a^k = a^∗ and θ^k = B_a a^∗ + B_i i^∗_u with probability
     1 ∧ [p_a(a^∗) l̂_a(a^∗) q_a(a^{k−1} | a^∗)] / [p_a(a^{k−1}) l̂_a(a^{k−1}) q_a(a^∗ | a^{k−1})]
7:   Else let a^k = a^{k−1} and θ^k = θ^{k−1}.
8: end for
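Continuing the sketch above, a minimal illustration of the outer loop of Algorithm 8 could look as follows. It reuses the hypothetical likelihood_estimate function from the previous sketch; the proposal, its scale, and the explicit handling of the log p_a(a) prior factor in the acceptance ratio are illustrative choices rather than the exact bookkeeping of the listing.

import numpy as np

def as_mh(lik, log_p_a, B_a, B_i, log_p_i_given_a, sample_q_i, log_q_i, a0, N_a=5000, step=0.5):
    a = np.asarray(a0, dtype=float)
    l_hat, i_u, _, _ = likelihood_estimate(a, lik, B_a, B_i, log_p_i_given_a, sample_q_i, log_q_i)
    theta = B_a @ a + B_i @ i_u
    chain = [theta]
    for _ in range(N_a - 1):
        a_star = a + step * np.random.standard_normal(a.shape)          # symmetric random-walk proposal
        l_star, i_star_u, _, _ = likelihood_estimate(a_star, lik, B_a, B_i,
                                                     log_p_i_given_a, sample_q_i, log_q_i)
        log_ratio = (log_p_a(a_star) + np.log(l_star)) - (log_p_a(a) + np.log(l_hat))
        if np.log(np.random.rand()) < log_ratio:                        # pseudo-marginal MH acceptance
            a, l_hat, theta = a_star, l_star, B_a @ a_star + B_i @ i_star_u
        chain.append(theta)
    return np.array(chain)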
We see that Algorithm 8 generates one i point for each a point; the index u, which is drawn on line 6 of the auxiliary pseudo-marginal Algorithm 7, ensures that one inactive particle is drawn according to the importance weights. There is, however, a way to reuse all the inactive points, following [Andrieu et al., 2010]. In fact, the output of Algorithm 8 may be used to estimate an integral
\[
E_\pi[h(\theta)] = \int_\theta h(\theta)\, \pi(\theta)\, d\theta
\]
with respect to the posterior distribution π, for some function h in two possible ways [Andrieu
et al., 2010].
1. Using one i-point for each a-point (the i-point which is drawn, for each active
particle, using the u index, based on weights on line 6 of the auxiliary pseudo-marginal
Algorithm 7):
\[
\hat E_\pi[h(\theta)]_1 = \frac{1}{N_a} \sum_{k=1}^{N_a} h\!\left(B_a a^k + B_i i^{u,k}\right) \tag{4.23}
\]
2. Using all the N_i i-points for each a-point, weighted by their normalised importance weights:
\[
\hat E_\pi[h(\theta)]_2 = \frac{1}{N_a} \sum_{k=1}^{N_a} \sum_{n=1}^{N_i} w^{n,k}\, h\!\left(B_a a^k + B_i i^{n,k}\right) \tag{4.24}
\]
In the case of perfect Active Subspaces the AS-MH Algorithm 8 simplifies. Recall that, in the case of perfect Active Subspaces, the inactive variables do not affect the likelihood at all. In this case, if we use the prior as the proposal for the inactive variables, the likelihood estimate of equation (4.21) simplifies to
\[
\hat l_a(a) = p_a(a)\, \frac{1}{N_i} \sum_{n=1}^{N_i} \frac{p_i(i_n \mid a)\, l(a, i_n)}{q_i(i_n)}
 = p_a(a)\, l(a)\, \frac{1}{N_i} \sum_{n=1}^{N_i} \frac{p_i(i_n)}{p_i(i_n)} = p_a(a)\, l(a) \tag{4.25}
\]
where, in equation (4.25), we have used that pi(i|a) = pi(i), since in the case of perfect Active Subspaces the active and inactive components are independent, and we have also used that, since the likelihood is not affected by the inactive variables, l(a, i) = l(a). So, in the case of perfectly inactive variables, the integral defining the active marginal can be calculated exactly, and Algorithm 8 simplifies to Algorithm 21, reported in Appendix C, which becomes an MCMC targeting the exact marginal posterior πa = pa la.
We now compare three algorithms:
• The AS-MH-Bias Algorithm 6;
• The AS-MH Algorithm 8;
• The standard MCMC Algorithm 4.
We will do the comparisons using a simple but illustrative toy example from [Constantine et al., 2016]. For generality on Bayesian inverse models applied in the case of Active Subspaces, please refer to Appendix B. We use here a two-parameter quadratic model, θ = [θ1, θ2]T, for which the negative log-likelihood f of equation (4.16) is
\[
f(\theta) = \frac{1}{2}\, \theta^T A\, \theta, \qquad
A = Q \begin{bmatrix} 1 & 0 \\ 0 & \epsilon \end{bmatrix} Q^T, \qquad
Q = \begin{bmatrix} \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \\ -\frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \end{bmatrix} \tag{4.26}
\]
The noise parameter in the likelihood is σ = 0.1, and a single data point d = 0.9 is assumed [Constantine et al., 2016] (for the meaning of d and σ see equation (B.1) and subsequent). The parameter ε of (4.26) can be tuned to determine the characteristics of the system: in [Constantine et al., 2016] the value ε = 0.01 is used. We can see from the figures below, for example Figures 4.2 and 4.4, the effect of choosing progressively higher values of ε: in Figure 4.2, with ε = 0.01, we see a spectral gap (difference between the two eigenvalues) of 4 orders of magnitude; in Figure 4.4, with ε = 0.2, we see a spectral gap of 2 orders of magnitude, and we can also notice that the smaller eigenvalue is bigger than 100. It is worth mentioning again that the upper-bound formulae (4.9) and similar indicate that the spectral gap is an important quantity, but that the size of the inactive eigenvalues also matters: the Active Subspace will be closer to ideal the closer the inactive eigenvalues are to zero, whereas for relatively big inactive eigenvalues the existence of an Active Subspace is more uncertain, as the upper bound in Hellinger distance becomes too loose. We can notice, in fact, that in the case ε = 0.01, where the inactive eigenvalue has order of magnitude roughly 0, the AS approximation of Figure 4.3 is noticeably better than in the case ε = 0.2 (see Figure 4.5), where the inactive eigenvalue has order of magnitude 2. A small sketch reproducing the construction of (4.26) is given below.
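The sketch below constructs the matrix A of (4.26) for the two values of ε and inspects the eigenvalues of the Monte Carlo gradient covariance. It is an illustration only: the points are drawn from a standard normal rather than from the prior and data set-up of [Constantine et al., 2016], so the absolute scale of the eigenvalues differs from Figures 4.2 and 4.4, while the qualitative gap between the two eigenvalues is still visible.

import numpy as np

def build_A(eps):
    Q = np.array([[np.sqrt(2), np.sqrt(2)],
                  [-np.sqrt(2), np.sqrt(2)]]) / 2.0
    return Q @ np.diag([1.0, eps]) @ Q.T

for eps in (0.01, 0.2):
    A = build_A(eps)
    thetas = np.random.standard_normal((5000, 2))     # illustrative integration points
    grads = thetas @ A.T                              # grad f(theta) = A theta for (4.26)
    eigvals = np.sort(np.linalg.eigvalsh(grads.T @ grads / len(grads)))[::-1]
    print("eps =", eps, "-> eigenvalues of C-hat:", eigvals)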
Case ε = 0.01
The model of equation (4.26) with ε = 0.01 is ideal for Active Subspaces, and even the traditional biased and non-exact formulation of [Constantine et al., 2016] gives near-exact results: firstly, there is a spectral gap of nearly 4 orders of magnitude, as we see from Figure 4.2, and in addition the Hellinger bound of equation (4.9) is dominated by a number that is relatively small, close to 0 (the size of the smallest eigenvalue).
Figure 4.2: Spectral gap for the model of (4.26) with ε = 0.01: the eigenvalues differ by 4 orders of magnitude and the inactive eigenvalue λ2 is less than 1.
Table 4.1: Comparison of MSEs for the three MCMC methods discussed in this section, using ε = 0.01.
We see in Table 4.1 the results obtained by averaging 100 runs each of standard MCMC, AS-MH-Bias and AS-MH on the model (4.26) and calculating the mean squared error (MSE). We see that MCMC and AS-MH have comparable errors, and AS-MH has the advantage of performing MCMC on a space of half the dimension. We report in Figure 4.3 below the posterior marginals of the active component a = BaT θ and of the inactive component i = BiT θ (both red-dotted) versus the prior components θ1 and θ2 (continuous line in both cases).
Figure 4.3: Posterior active and inactive marginals versus the prior. Active marginal a = BaT θ on the LHS and inactive marginal i = BiT θ on the RHS (both red-dotted) versus the prior components θ1 and θ2 (continuous line) of (4.26) with ε = 0.01. We see that the inactive marginal (RHS) is almost identical to the second prior component, indicating that we are in a near-perfect Active Subspace and the likelihood is very little informative on this component. Compare with the LHS chart, where the active marginal is very different from the first component of the prior, indicating that the first component is active and the differences are due to the effect of the likelihood.
Case ε = 0.2
The second case of equation (4.26), with ε = 0.2, is less ideal for the traditional biased and non-exact Active Subspace algorithm than the one discussed above: firstly, the spectral gap is of nearly 2 orders of magnitude (compared to 4 in the ε = 0.01 case), as we see from Figure 4.4; but the real weak point is that the Hellinger bound of equation (4.9) is too loose, as it is dominated by a number that is around 200 (the size of the smallest eigenvalue). See for example the difference between the right-hand-side chart of Figure 4.5 (where we can spot differences between the red-dotted projected posterior and the black solid prior) and the corresponding chart in Figure 4.3 (where the red-dotted and black solid shapes are nearly identical).
Figure 4.4: Spectral gap for the model of (4.26) with ε = 0.2: the eigenvalues differ by 2 orders of magnitude and the smallest eigenvalue is around 200.
Table 4.2: Comparison of MSEs for the three MCMC methods discussed in this section, using ε = 0.2.
We see in Table 4.2 the results obtained by averaging 100 runs each of standard MCMC, AS-MH-Bias and AS-MH on the model (4.26) and calculating the mean squared error (MSE). As the AS approximation becomes worse with ε = 0.2, we see that the AS-MH performance worsens compared to standard MCMC (see Table 4.1 for comparison). We report in Figure 4.5 below the posterior marginals of the active component a = BaT θ and of the inactive component i = BiT θ (both red-dotted) versus the prior components θ1 and θ2 (continuous line in both cases): we see that, compared to the case ε = 0.01 of Figure 4.3, the likelihood is slightly more informative in the RHS part of the graph and the fit is worse.
Figure 4.5: Posterior active and inactive marginals versus the prior. Active marginal a = BaT θ on the LHS and inactive marginal i = BiT θ on the RHS (both red-dotted) versus the prior components θ1 and θ2 (continuous line) of (4.26) with ε = 0.2. We see that the inactive marginal (RHS) is very similar to the second prior component although, compared to Figure 4.3 RHS (case ε = 0.01), we can see some differences: in the current case the fit is not perfect, indicating that the likelihood is, although only slightly, more informative here.
4.8.5 Conclusion
The toy example shown in Section 4.8.4 is simple but illustrative. We have seen how both cases, ε = 0.01 and ε = 0.2, are fairly good candidates for using AS, since both have a significant spectral gap, which is one of the pre-requirements [Constantine, 2015, Constantine et al., 2016]. But a significant problem, among others, is clear in the example: while in the case ε = 0.01 we can see from Figure 4.2 that the inactive eigenvalue is fairly close to 0, making it a close-to-ideal case, in the second example, ε = 0.2, we see from Figure 4.4 that, although there are 2 orders of magnitude between the two eigenvalues, which can be considered a fairly good spectral gap, the inactive eigenvalue is bigger than 100: not being 0, or even close to 0, forces the users of the biased Algorithm 6 (second column of Table 4.2) to use additional empirical methods to check the quality of the approximation. This refers in particular to OP1 in the list of Section 4.7.
[Constantine et al., 2016] come up with a solution: when it is not possible or convenient to draw from the true posterior π in (4.2) or its MC version (4.12), approximations of the matrices C and Ĉ can be built by drawing samples from the prior distribution instead of the full posterior π. If we use the familiar notation
π(θ) ∝ p(θ)l(θ) (4.27)
where p is the prior and l is the likelihood, we can then build a new variant of equation
(4.12), as follows
\[
\hat C_{pri} = \sum_{i=1}^{N} \nabla f(\theta_i)\, \nabla f(\theta_i)^T, \qquad \theta_i \sim p \tag{4.28}
\]
The matrix Ĉpri is built by approximating integration against the prior p instead of the full posterior π. Drawing from the prior is assumed to be easy and yields an approximation which is again bounded in Hellinger distance, as quantified in [Constantine et al., 2016] (see equations (3.9) and subsequent therein). But there are cases where the Active Subspace generated by the prior, which we call the prior Active Subspace, may be different from the one generated by the posterior, the posterior Active Subspace; in such cases the approximation obtained using the prior will be poor. We show one example of significant differences between the prior Active Subspace and the posterior Active Subspace in the following Section 4.9.1.
The gradient of the log-likelihood of equation (4.29) is:
\begin{align}
\nabla \log l(\theta) &= \nabla \sum_{j=1}^{d} \left[ -\frac{\theta_j^2}{\sigma_j^2} - \log\!\left(1 + \left(\frac{\theta_j}{\gamma_j}\right)^{2}\right) \right] \tag{4.30}\\
&= \nabla \log \prod_{j=1}^{d} \frac{\exp\!\left(-\theta_j^2/\sigma_j^2\right)}{1 + \left(\theta_j/\gamma_j\right)^{2}} \tag{4.31}\\
&= \nabla \sum_{j=1}^{d} \left[ -\frac{\theta_j^2}{\sigma_j^2} - \log\!\left(1 + \left(\frac{\theta_j}{\gamma_j}\right)^{2}\right) \right] \tag{4.32}\\
&= \begin{bmatrix} -2\theta_1\left(\dfrac{1}{\sigma_1^2} + \dfrac{1}{\theta_1^2 + \gamma_1^2}\right) \\ \vdots \\ -2\theta_d\left(\dfrac{1}{\sigma_d^2} + \dfrac{1}{\theta_d^2 + \gamma_d^2}\right) \end{bmatrix} \tag{4.33}
\end{align}
We see from the expression of the likelihood in equation (4.29) that it is composed of two terms: a Gaussian-like term, given by the exponential in the numerator, and a Cauchy-like term in the denominator.
If we fix the dimension d = 2 so that θ ∈ R2 and we choose a prior p of the form
" # " #!
0 5000 0
p(θ) = N , (4.34)
0 0 5000
We then set, as an example, the following values for the parameters in (4.29):
\[
\sigma_1 = 10, \qquad \gamma_1 = 10^{12}, \qquad \sigma_2 = 50, \qquad \gamma_2 = 0.1 \tag{4.36}
\]
The behaviour of the likelihood will be different in regions where θ is close to 0, where
the Cauchy term will dominate, whereas far from the origin the Gaussian term will be
predominant.
Figure 4.6: Charts representing prior (bottom two charts) and likelihood (upper two) of the
posterior (4.35), with parameter values (4.36). Note: “Dimension 1” in the chart titles is
the first component θ1 and “Dimension 2” is θ2 .
The prior (4.34), as we can appreciate from the bottom two charts of Figure 4.6, is very wide relative to the likelihood; therefore the behaviour of the posterior (4.35) will be constrained by the likelihood, which is much more narrowly concentrated around 0, as we can appreciate from the top two charts of Figure 4.6.
Firstly, we can ask ourselves, just by looking at the charts in Figure 4.6, which direction, if any, will be the predominant one in the Active Subspace. Recalling from Section 4.8.2 that when variables are inactive the likelihood is little or not at all informative, and the prior is then a good Importance Sampling proposal for the Monte Carlo integration on the LHS of (4.21), we see from Figure 4.6 that the first dimension (top and bottom charts, left-hand side) will clearly be inactive, whereas the second dimension (top and bottom charts, right-hand side) will be the active variable, as we can appreciate from the very small variance of the likelihood along the second dimension (top chart, right-hand side).
To support further the choice of θ2 as the direction of the Active Subspace, we can also use the ESS: by drawing 1000 importance points from the prior and measuring the ESS in each of the two dimensions, we obtain the following results.
ESS θ1: 140        ESS θ2: 5
Table 4.3: ESS using the prior as importance proposal in the system of Figure 4.6. The first variable θ1 is clearly inactive, since its high ESS shows that it is little informed by the likelihood, in contrast with the very low ESS of θ2.
We see from Table 4.3 that, as we would expect from the visual inspection of Figure 4.6, the first state-space dimension θ1 is little informed by the likelihood, and therefore, by definition, inactive: this is shown by the relatively high ESS, which indicates that the prior is a good Importance Sampling proposal for the posterior along that dimension. Compare this with the very low ESS on θ2: along the second dimension the likelihood is very informative, and the direction is therefore, by definition, active (a short sketch of this ESS check is given below).
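The per-dimension ESS check of Table 4.3 can be sketched as follows, under stated assumptions: points are drawn from the wide Gaussian prior (4.34), each dimension is weighted by its own factor of the likelihood (4.29) with the parameter values (4.36), and the ESS of the resulting importance weights is computed. The exact numbers vary with the random seed, but should be of the same order as those reported in Table 4.3.

import numpy as np

rng = np.random.default_rng(0)
sigma = np.array([10.0, 50.0])
gamma_ = np.array([1e12, 0.1])
prior_sd = np.sqrt(5000.0)

theta = rng.normal(scale=prior_sd, size=(1000, 2))       # 1000 importance points from the prior

for j in range(2):
    log_lik_j = -theta[:, j] ** 2 / sigma[j] ** 2 - np.log1p((theta[:, j] / gamma_[j]) ** 2)
    w = np.exp(log_lik_j - log_lik_j.max())
    w /= w.sum()
    print("ESS along theta_%d: %.0f" % (j + 1, 1.0 / np.sum(w ** 2)))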
We can give a graphical representation of the directions of the Active Subspace in Figures
4.7 and 4.8, where we make use of the principal components to determine the Active Subspace
directions, as described in Section 4.2.4. We see in Figure 4.7 the directions of the Active
Subspace generated using the prior (4.34), and we see that the predominant direction of the
gradient will be horizontal and, wrongly, the first dimension θ1 would be chosen as active
variable.
Figure 4.7: Directions of the principal components of the covariance of gradients using
prior samples: we can see that the main direction is horizontal. See difference with Figure
4.8 where the main direction is vertical
If instead we plot the directions of the principal components estimated using posterior points (an MCMC run with 100000 iterations has been used to sample the points), we obtain Figure 4.8, where the main direction is instead vertical; in this case, correctly, the second dimension of the state space θ2 is chosen as the active component.
Figure 4.8: Directions of the principal components of the covariance of gradients using
posterior samples: we can see that the main direction is vertical. See difference with Figure
4.7 where the main direction is horizontal
Comparing, therefore, Figure 4.7 for the prior with Figure 4.8 for the posterior, we understand that in this model estimating the Active Subspace using a prior AS will pose a challenge: the true predominant direction of the Active Subspace in the posterior is perpendicular to the one that would be wrongly estimated using the prior.
4.10 Conclusion
Active Subspaces are a way to address the high-dimensionality challenges of MCMC, by identifying a sub-dimension of the state space which is smaller than the nominal dimension and by concentrating the main effort of the algorithms on this Active Subspace. However, there are limitations that can show up in real-world situations, notably the one represented by the Hellinger bound of equation (4.9): when the bound is too loose, empirical validations may become necessary to check the quality of the AS approximation. We have presented the algorithm AS-MH-Bias in Section 4.6, historically the first AS-based algorithm to be applied to MCMC. We have then shown, in Section 4.8.2, AS-MH, an unbiased version which uses pseudo-marginals and targets the intended posterior, and we have compared the performances of the two on a toy model in Section 4.8.4. In addition, we have seen in Section 4.9 that the prior distribution is commonly used to generate samples for the Monte Carlo integration of (4.2) (or its approximation (4.12)) [Constantine et al., 2016], but that there are cases where the Active Subspace generated by the prior, which we have called the prior Active Subspace, may be different from the one generated by the posterior, the posterior Active Subspace; in such cases the approximation obtained using the prior will be poor. We will address this problem in Section 6.4, where a method to build progressive approximations of the Active Subspace matrices will be presented. There are alternative methods, such as using a Laplace approximation, which could potentially provide an efficient way to approximate the posterior Active Subspace instead of drawing from the prior in MCMC settings. Although we do not investigate these methods in this thesis, they may offer a valid alternative for future work [Tierney and Kadane, 1986, Rue et al., 2009].
Chapter 5
5.1 Introduction
A lot of the existing scientific literature on Active Subspaces (AS) builds on the concepts and methods that can be found in [Constantine, 2015] and, for Monte Carlo (specifically MCMC), in [Constantine et al., 2016] (we have discussed the first AS algorithm to be applied to MCMC, the AS-MH-Bias algorithm of [Constantine et al., 2016], in Section 4.6). Both papers can be said to be foundational, precisely because they paved the way for many subsequent studies (to mention just two among many, [Cui et al., 2019] and [Parente, 2020]).
In our work we have chosen a different path, opting instead to build on the foundations laid by the exact AS-MH algorithm of [Schuster et al., 2017], which uses an unbiased estimate of the likelihood (we recap the algorithm in Section 4.8.2). This decision, as already mentioned in Section 4.8, starts from the consideration that, while both papers provide full theoretical support for AS, in real-world cases their practical use seems to be difficult. Some weak points of the theory are evident in the very paper [Constantine et al., 2016] that introduced it: after a 2D toy example is discussed and AS is successfully applied, with a more realistic 100D model the authors have to use empirical methods to assess convergence to the true posterior.
For this reason, from this section forward we build upon the AS-MH, trying to address some of the open points outlined in Section 4.7.
In Section 5.2 we introduce our first proposed novel AS algorithm, named AS-PMMH, which is based on the integration of AS into particle MCMC [Andrieu et al., 2010]. AS-PMMH can be considered an equivalent of AS-MH in which the inactive variables are marginalised out using an SMC sampler rather than Importance Sampling (IS). The reason for introducing AS-PMMH is that, when the inactive subspace is high-dimensional, the performance of IS worsens significantly, since the number of points required to keep the ESS at a predetermined level scales exponentially with the dimension d of the space [Agapiou et al., 2017], whereas for SMC the cost is polynomial [Beskos et al., 2011, 2014]. We compare the performances on a Gaussian model in Section 5.3; the comparison with AS-MH and standard MCMC seems to show worse performance of AS-PMMH compared to the other methods, in terms of the distribution of the RMSE between the true posterior mean and the mean estimated by each method. This is possibly due to the fact that, having kept the number of likelihood evaluations equal between the algorithms for a fair comparison, the 100000 likelihood evaluations that correspond to as many iterations of standard MCMC result in a mere 1666 final samples for AS-PMMH. The small number of samples highlights one of the challenges of AS-PMMH: the biggest share of the computational effort is spent on the marginalisation of the inactive subspace, i.e. on the part of the space we are least interested in. We then introduce the more complex Banana model, with some non-linearities, in Section 5.4; here the behaviour of AS-PMMH shows long tails in the distribution of the RMSE, indicating that the algorithm may sometimes get stuck in one of the tails. This is a behaviour that can occur when estimates of the likelihood are used in lieu of the actual likelihood, due to noise in the estimate [Andrieu and Vihola, 2015].
The consideration that AS-PMMH uses the computationally more intensive SMC sampler on
the part of the state space we are least interested in, the inactive variable i, which may lead
to a sub-optimal estimate of the active part a, led us to devise an inverted algorithm,
AS-PMMH-i, in Section 5.5, where we switch roles and use the internal SMC sampler on the
active component while the outer MCMC acts on the inactive part. Comparison with standard
MCMC and AS-MH, however, still shows long tails in the Banana model, indicating that the
noise in the estimate of the likelihood still causes sticky behaviour, i.e. the algorithm getting
stuck.
We therefore moved to a version that uses no estimate of the likelihood: the AS-Gibbs of
Section 5.6 integrates Gibbs sampling into AS. We show that, in the case of perfect or
near-perfect Active Subspaces, when active and inactive variables are independent, AS-Gibbs
proves a winning strategy, outperforming all the other algorithms.
Continuing the strategy employed with AS-Gibbs for perfect or near-perfect Active Subspaces,
we have integrated MwPG [Chopin, 2002] into AS and devised AS-MwPG and AS-MwPG-i,
where the sampling of particles in the inactive or active subspace, respectively, is done via an
internal SMC sampler. We show in Section 5.8 that, when using a “challenging” proposal in a
bimodal posterior, AS-MwPG-i is the only algorithm capable of correctly reconstructing both
modes, while the other algorithms remain stuck in one of the modes.
We finally introduce, in Section 5.9, a novel alternative to the traditional “eigen-based”
method of [Constantine, 2015, Constantine et al., 2016] for determining the size of the Active
Subspace. The new approach determines the dimension of the inactive subspace as the largest
dimension that yields an ESS that does not drop below some threshold, using the prior as
importance proposal; on the models considered it gives results identical to the traditional
method.
Alg. 9 SMC on inactive variables for a given a and t.
1: Simulate N_i points, {i_0^n}_{n=1}^{N_i} ~ p_i(· | a), and set each weight w_0^n = 1/N_i;
2: for s = 1 : t do
3:   for n = 1 : N_i do    ▷ reweight
4:     if s = 1 then
         w̃_s^n = w_{s-1}^n · l_{1:s}(B_a a + B_i i^n);
5:     else
         w̃_s^n = w_{s-1}^n · l_{1:s}(B_a a + B_i i_{s-1}^n) / l_{1:s-1}(B_a a + B_i i_{s-1}^n);
6:     end if
7:   end for
8:   {w_s^n}_{n=1}^{N_i} ← normalise {w̃_s^n}_{n=1}^{N_i};
9:   If s = t, go to line 32;
10:  for n = 1 : N_i do Simulate the index v_{s-1}^n ~ M(w_s^1, ..., w_s^{N_i}) of the ancestor of particle n;
11:  end for
12:  if some degeneracy condition is met then    ▷ resample
13:    for n = 1 : N_i do Set i_s^n = i_{s-1}^{v_{s-1}^n};
14:    end for
15:    w_s^n = 1/N_i for n = 1 : N_i;
16:  else
17:    for n = 1 : N_i do
18:      Set i_s^n = i_{s-1}^n;
19:    end for
20:  end if
21:  for n = 1 : N_i do    ▷ move
22:    Simulate i_s^{n*} ~ q_{t,i}(· | i_s^n, a);
23:    Set i_s^n = i_s^{n*} with probability
24:      1 ∧ [ l_{1:s}(B_a a + B_i i_s^{n*}) p_i(i_s^{n*} | a) q_{t,i}(i_s^n | a) ] / [ l_{1:s}(B_a a + B_i i_s^n) p_i(i_s^n | a) q_{t,i}(i_s^{n*} | a) ],
Alg. 10 Active subspace particle marginal Metropolis-Hastings
1: Initialise a^0;
2: Run Algorithm 9 for a = a^0 and t = T, obtaining l̄_{T,a}(a^0) and weights, denoted w_T^{1,0}, ..., w_T^{N_i,0};
3: u^0 ~ M(w_T^{1,0}, ..., w_T^{N_i,0});
4: Let l̄_a^0 = l̄_{T,a}(a^0);
5: for m = 1 : N_a do
6:   a^{*m} ~ q_a(· | a^{m-1});
7:   Run Algorithm 9 for a = a^{*m} and t = T, obtaining l̄_{T,a}(a^{*m}) and weights, denoted w_T^{*1,m}, ..., w_T^{*N_i,m};
8:   u^{*m} ~ M(w_T^{*1,m}, ..., w_T^{*N_i,m});
9:   Set (a^m, {i^{n,m}, w^{n,m}}_{n=1}^{N_i}, u^m, l̄_a^m) = (a^{*m}, {i^{*n,m}, w^{*n,m}}_{n=1}^{N_i}, u^{*m}, l̄_{T,a}(a^{*m})) with probability
       α_a^m = 1 ∧ [ p_a(a^{*m}) l̄_{T,a}(a^{*m}) q_a(a^{m-1} | a^{*m}) ] / [ p_a(a^{m-1}) l̄_a^{m-1} q_a(a^{*m} | a^{m-1}) ];
10:  Else let (a^m, {i^{n,m}, w^{n,m}}_{n=1}^{N_i}, u^m, l̄_a^m) = (a^{m-1}, {i^{n,m-1}, w^{n,m-1}}_{n=1}^{N_i}, u^{m-1}, l̄_a^{m-1});
11: end for
Likelihood
In the following, the vector of parameters is referred to as θ = (θ1, θ2, . . . , θn), and the sum of
the parameters represents the estimated mean of the Gaussians that compose the likelihood:

\hat{\mu} = \sum_{j=1}^{n} \theta_j    (5.1)
Here, µ̂ is from equation (5.1), P represents the number of independent observations, and
n is the size of the state space. The normal distribution, for each independent observation,
will model the probability of observing the data yi given the array of parameters θ.
Prior
We will use a prior given by a Gaussian centered at 0 and with variance 5000 in all directions:
p(\theta) = \prod_{i=1}^{n} N(0, \sigma^2)    (5.3)
Gradient of log-likelihood
Writing out the normal densities in the likelihood of equation (5.2) explicitly, we have

l(y \mid \theta) = \prod_{i=1}^{P} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}\left(y_i - \sum_{j=1}^{n}\theta_j\right)^2}    (5.4)

\log l(y \mid \theta) = \sum_{i=1}^{P} \log\left( \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}\left(y_i - \sum_{j=1}^{n}\theta_j\right)^2} \right) = \sum_{i=1}^{P} \left[ -\frac{1}{2}\log(2\pi) - \frac{1}{2}\left(y_i - \sum_{j=1}^{n}\theta_j\right)^2 \right]    (5.5)
Differentiating the negative log-likelihood with respect to a generic parameter θk, k = 1, ..., n
(by looking at (5.5), all the partial derivatives have the same expression):

\frac{\partial}{\partial \theta_k}\left(-\log l(y \mid \theta)\right) = \sum_{i=1}^{P} \left( \sum_{j=1}^{n} \theta_j - y_i \right), \quad k = 1, \ldots, n    (5.6)
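As a concrete illustration, here is a minimal Python sketch of the log-likelihood (5.5) and of the gradient (5.6) for this model (our own code, written for illustration only; the function names and the synthetic data are assumptions of the example):

import numpy as np

def log_likelihood(theta, y):
    """Log-likelihood (5.5): each y_i ~ N(sum_j theta_j, 1), independently."""
    mu_hat = np.sum(theta)                       # estimated mean, equation (5.1)
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - mu_hat) ** 2)

def grad_neg_log_likelihood(theta, y):
    """Gradient (5.6): the same value sum_i (mu_hat - y_i) in every component."""
    mu_hat = np.sum(theta)
    g = np.sum(mu_hat - y)
    return np.full(theta.shape, g)

# Usage example with synthetic data (true mean 3.0, n = 10 parameters).
rng = np.random.default_rng(0)
y = rng.normal(3.0, 1.0, size=100)
theta = rng.normal(0.0, 1.0, size=10)
print(log_likelihood(theta, y), grad_neg_log_likelihood(theta, y)[:3])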
We take a look at level surfaces for this model in Section 5.3.2, Figures 5.1 and 5.2.
Overparametrization
The overparameterization of the system is evident in the fact that we only need one parameter
to infer the mean of the Gaussian in equation (5.2); therefore our model, for n > 1, includes
more parameters than strictly necessary. For instance, in the case n = 2, the level surface of
the likelihood can be represented as the line

\theta_1 + \theta_2 = c    (5.7)

for some constant c, which means that any value of (θ1, θ2) satisfying (5.7) leaves the
likelihood invariant. Similarly, for n = 3, the level surface is defined by the plane

\theta_1 + \theta_2 + \theta_3 = c    (5.8)
The overparameterization can be further extended by adding more components to the system.
This increases the dimensionality of the state space, yet the actual dimension required to
represent the system remains one: from the discussion in the previous sections, when analysing
the system we expect to find an Active Subspace of dimension 1 and an Inactive Subspace of
dimension n − 1 (increasing the number of parameters will therefore increase the dimension of
the Inactive Subspace). We further expect the data to show that the direction of the Active
Subspace is perpendicular to the line represented by (5.7) in the case n = 2, or to the plane
represented by (5.8) in the case n = 3, since along those sets the likelihood remains constant.
Visual representation
We show what the level surface of the likelihood looks like, projected on the same 3D chart as
the prior, for the particular case θ1 + θ2 + θ3 = 0; we give two snapshots of the 3D graph,
produced using the python library plotly [Plotly, 2015]. Figures 5.1 and 5.2 show a
cross-section of the prior together with the level surface of the log-likelihood.
Figure 5.1: Graphical representation of the model described in Section 5.3.2: a 2D slice of the
Gaussian prior on the horizontal plane θ1 = 0 (black indicates low-probability zones, while
progressively warmer colors towards the center indicate zones of higher probability, as shown
by the colorbar), together with the level surface of the likelihood in the particular case
θ1 + θ2 + θ3 = 0 (green plane). Created using the python library plotly [Plotly, 2015].
Figure 5.2: Different viewpoint of Figure 5.1, created using python library plotly [Plotly,
2015]
By performing the eigenvalue analysis described in Section 4.2.2, using the gradient of the
log-likelihood (5.6), we obtain, for a 10D system, the eigenvalues shown in Figure 5.3.
Figure 5.3: Eigenvalues of the 10D Gaussian model. The estimated AS size is 1, considering
the spectral gap between eigenvalues 1 and 2. The dimension of the Active Subspace is na = 1.
For a 25D system the corresponding chart is shown in Figure 5.4.
Figure 5.4: Eigenvalues of the 25D Gaussian model. The estimated AS size is 1, considering
the spectral gap between eigenvalues 1 and 2. The dimension of the Active Subspace is na = 1.
We see that for both 10D and 25D the size of the Active Subspace is 1, according to the
procedure that we described in Section 4.2.2.
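For reference, the following is a minimal Python sketch of how such an eigenvalue analysis can be carried out: a Monte Carlo estimate of the matrix of expected outer products of the log-likelihood gradients, followed by an eigen-decomposition. This is our own illustration of the procedure of Section 4.2.2, not code from an existing package, and the choice of sampling distribution for the gradient evaluations (e.g. the prior) is an assumption of the example.

import numpy as np

def estimate_active_subspace(grad_log_lik, sample_theta, n_samples, n_active):
    """Estimate C = E[g g^T], with g the gradient of the log-likelihood, by Monte
    Carlo, then eigen-decompose it: the leading eigenvectors span the estimated
    Active Subspace (B_a), the remaining ones the inactive subspace (B_i)."""
    d = len(sample_theta())
    C = np.zeros((d, d))
    for _ in range(n_samples):
        g = grad_log_lik(sample_theta())
        C += np.outer(g, g)
    C /= n_samples
    eigvals, eigvecs = np.linalg.eigh(C)          # returned in ascending order
    order = np.argsort(eigvals)[::-1]             # reorder descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    B_a, B_i = eigvecs[:, :n_active], eigvecs[:, n_active:]
    return eigvals, B_a, B_i

Plotting the returned eigenvalues and looking for a spectral gap then suggests the Active Subspace dimension, as in Figures 5.3 and 5.4.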
For the tests, we calculated both the estimated posterior mean µπ and the estimated posterior
covariance Σπ of the 10D and 25D Gaussian models by running a test SMC with N = 50M
particles each, so that, denoting by wj the normalised weights and by xj the particles,

\mu_\pi = \sum_{j=1}^{N} w_j x_j    (5.9)

\Sigma_\pi = \sum_{j=1}^{N} w_j (x_j - \mu_\pi)(x_j - \mu_\pi)^T    (5.10)
We then used both µπ of equation (5.9) and Σπ of equation (5.10) as the “true” posterior
values for reference. We used µπ to calculate the RMSE between the mean estimated by each
of the algorithm runs and the true mean µπ (see for example Figure 5.5). We then used Σπ to
build the optimal covariance of the proposal in standard MCMC,

\frac{2.38^2}{d} \Sigma_\pi    (5.11)

with d the dimension of the state space [Roberts and Rosenthal, 2001]. Recalling that
θ = B_a a + B_i i, and following the same optimal-scaling argument [Roberts and Rosenthal,
2001] on the active subspace, we used

\frac{2.38^2}{d_a} B_a^T \Sigma_\pi B_a    (5.12)

as the optimal covariance for the proposal of MCMC moves on the active marginal, whereas
we used

\frac{2.38^2}{d_i} B_i^T \Sigma_\pi B_i    (5.13)

as the optimal covariance for the proposal of MCMC moves on the inactive marginal.
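As a concrete illustration of how these reference moments and proposal covariances can be computed, the following Python sketch (our own, with assumed variable names; not code taken from any package) implements equations (5.9)–(5.13):

import numpy as np

def weighted_moments(x, w):
    """Posterior mean (5.9) and covariance (5.10) from weighted particles:
    x has shape (N, d); w are normalised weights of shape (N,)."""
    mu = w @ x
    centred = x - mu
    Sigma = (centred * w[:, None]).T @ centred
    return mu, Sigma

def optimal_proposal_cov(Sigma, B=None):
    """Optimal-scaling proposal covariance: (2.38^2 / d) * Sigma on the full
    space (5.11), or (2.38^2 / d_a) * B^T Sigma B on a marginal, as in
    (5.12) and (5.13) when B = B_a or B = B_i respectively."""
    if B is not None:
        Sigma = B.T @ Sigma @ B
    d = Sigma.shape[0]
    return (2.38 ** 2 / d) * Sigma

With Σπ and these proposal covariances in place, we compare standard MCMC with the following algorithms: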
• AS-MH Algorithm 8;
• AS-PMMH Algorithm 10, which will perform an outer MCMC and several inner SMC
samplers: for each step the algorithm will perform an SMC sampler on the 24D inactive
subspace to obtain an unbiased estimate of the likelihood to be used in the outer active
MCMC as estimate of the marginal likelihood.
Likelihood evaluations
When comparing the performances of the algorithms, we have tried to keep the number of
likelihood evaluations constant in each run, to ensure a fair comparison. Due to the structure
of the algorithms, the same number of likelihood evaluations may result in a different number
of output samples. Taking a reference figure of 100000 likelihood evaluations, this results in:
• AS-PMMH: if we use 10 inner inactive variables and 6 tempering steps of the inner SMC
sampler, then we have 1666 outer MCMC steps and therefore as many output samples
(1666 × 10 × 6 = 99960, the closest we can get to 100000); we summarise the figures in
Table 5.1.
Table 5.1: Comparison of number of output samples when performing 100000 likelihood
evaluations in different MCMC methods. ∗ For AS-MH the figure in the table indicates the
number of samples when one inactive sample is used per active variable, like in formula
(4.23). If instead all inactive particles are used, like in formula (4.24), the relative figure
must be multiplied by the number of inactive variables used, 10 in this case.
In Table 5.1, for AS-MH the figure indicates the number of samples when one inactive sample
is used per active variable, as in formula (4.23). If instead all inactive particles are used, as in
formula (4.24), the figure must be multiplied by 10 (the number of inactive variables used) to
count all the samples.
From the numbers above, we understand that AS-PMMH carries a problem, since 100000
likelihood evaluations result in a mere 1666 output samples. Structurally, in AS-PMMH a lot
of computational effort is spent on the calculation of the inactive marginal; the algorithm is
therefore inefficient, since most of the computational effort goes to the part of the space, the
inactive one, in which we are least interested.
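The accounting behind these figures is simple; a minimal sketch (with our own variable names, for illustration only) is:

# Likelihood-evaluation budget accounting for AS-PMMH (illustration only).
budget = 100_000        # total likelihood evaluations per run
n_inner = 10            # inner inactive particles in the SMC sampler
n_temper = 6            # tempering steps of the SMC sampler

outer_steps = budget // (n_inner * n_temper)    # 1666 outer MCMC steps / output samples
evals_used = outer_steps * n_inner * n_temper   # 99960 likelihood evaluations
print(outer_steps, evals_used)                  # -> 1666 99960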
For ease of reference in the rest of the sections, we also report the table for 200000
likelihood evaluations (it is the above Table 5.1 with numbers ×2 )
Table 5.2: Comparison of number of output samples when performing 200000 likelihood
evaluations in different MCMC methods, this is the equivalent of Table 5.1, adapted for
200000. ∗ For AS-MH the figure in the table indicates number of samples when one inactive
sample is used per active variable, like in formula (4.23). If instead all inactive particles are
used, like in formula (4.24), the relative figure must be multiplied by the number of inactive
variables used, 10 in this case.
We have reported additional data for other algorithms in Appendix D, tables D.1 and
D.2.
The first comparison uses the multiESS [Vats et al., 2019] that we described in Section 2.4.8;
we use its R software implementation. We have run the algorithms in one test run with
200000 likelihood evaluations for each algorithm, using no burn-in (the choice of not having
burn-in, here and in the rest of the experiments, was made considering that we start already
in the posterior set), and we used the optimal covariances of equation (5.11) in the MCMC
proposal and of (5.12) for AS-MH and AS-PMMH in the active marginal MCMC proposal. As
explained, 200000 likelihood evaluations give a different number of output samples in each of
the three algorithms; see Table 5.2 for reference.
The acceptance rate of the MCMC parts has been around 25% for all algorithms. Results
are in Tables 5.3 and 5.4.
Table 5.3: Gaussian 10D multiESS out of 200000 likelihood evaluations (please see Table 5.2
for the number of corresponding output samples N per algorithm).
Table 5.4: Gaussian 25D multiESS out of 200000 likelihood evaluations (please see Table 5.2
for the number of corresponding output samples N per algorithm).
We see in Table 5.3 for the 10D case and Table 5.4 for the 25D case that the performance of
AS-MH seems to remain fairly constant, accounting for some random variability, between the
10D and 25D runs. This is possibly because the Gaussian system is a near-perfect Active
Subspace, so the AS-MH pseudo-marginal becomes close to equation (4.25), and performance
may remain constant since the dimension of the Active Subspace stays equal to 1 from 10D to
25D. We also have to remember that, by construction, AS-MH and AS-PMMH output samples
are likely to show less correlation than standard MCMC, which in turn leads to higher
multiESS. The reason for the lower correlation is that, even when small steps are taken in the
active marginal MCMC, points in the state space may contain very different inactive parts. In
fact we see significantly higher multiESS per sample in Tables 5.3 and 5.4.
We also see from Table 5.4 that the overhead of running AS-PMMH with its internal SMC
samplers does not seem to pay off compared to AS-MH in terms of multiESS; this may be
because the Gaussian system is fairly simple to explore, and AS-MH may be, at least up to the
25D cases explored here, the better choice.
Comparison of estimate of expectation
To compare the estimates of the mean coming from the three algorithms, we performed 50
runs of each algorithm, with 100000 likelihood evaluations per run, on both the 10D and 25D
models, and we measured the Root Mean Squared Error (RMSE) between the true mean and
the mean estimated by each of the three algorithms. As “true” mean we used µπ of equation
(5.9), and we used the optimal covariances of equation (5.11) in MCMC and of (5.12) for
AS-MH and AS-PMMH in the active marginal MCMC proposal.
We report the violin plots showing the mean and the upper and lower quartiles of the
distribution of the RMSE.
Figure 5.5: Distribution of RMSE of the differences between the true posterior mean and
the mean estimated by each of the algorithms over 50 runs. We see that the standard MCMC
in both 10D and 25D has a lower error. Second best performer is AS-MH and third best is
AS-PMMH. The algorithms are in order from the LHS: MCMC, AS-MH and AS-PMMH
(10D first then 25D)
5.3.4 Review
In the analysis of performance using the Gaussian model we see that there is a trade-off: using
the AS-MH algorithm instead of standard MCMC brings the advantage of a better multiESS,
as we see in Tables 5.3 and 5.4, indicating that the samples have lower autocorrelation and
better mixing, but it seems to do so at the cost of accuracy (see Figure 5.5): the RMSE of the
expectation is in fact higher than for standard MCMC. AS-MH does not seem to suffer the
curse of dimensionality when moving from the 10D to the 25D system, possibly because we
are in a near-perfect Active Subspace, in which case the AS-MH outer MCMC targets a space
of relatively constant dimension 1.
The AS-PMMH algorithm has worse multiESS (see Tables 5.3 and 5.4), and worse RMSE
than AS-MH in Figure 5.5: this is possibly due to the lower number of output samples which
makes the estimate poorer (see Table 5.1 for reference, AS-PMMH has 1666 output samples
per every 100000 likelihood evaluations).
l(\theta) = \prod_{i=1}^{P} N(y_i \mid \hat{\mu}, 1)    (5.14)

where in (5.14) the term µ̂ represents the estimate of the mean of the Gaussians using the
parameters of the state-space model θ = (θ1, θ2, . . . , θn):

\hat{\mu} = \theta_1 + \theta_2 + \ldots + \theta_n + c + b\left(\theta_n^2 + \theta_{n-1}^2 + \ldots + \theta_{n-H+1}^2\right)    (5.15)
Setting both b = 0 and c = 0 in (5.15) recovers the Gaussian model of Section 5.3.2; also, for
the model to make sense, we need H ≤ n. The log-likelihood of (5.14) is

\log l(\theta) = \log\left(\prod_{i=1}^{P} N(y_i \mid \hat{\mu}, 1)\right) = \sum_{i=1}^{P} \log\left(N(y_i \mid \hat{\mu}, 1)\right) = \sum_{i=1}^{P} \log\left(\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(y_i - \hat{\mu})^2}{2}\right)\right) = \sum_{i=1}^{P} \left[-\frac{1}{2}\log(2\pi) - \frac{(y_i - \hat{\mu})^2}{2}\right]    (5.16)

Differentiating the negative log-likelihood with respect to a component θj not affected by the
quadratic part gives

\frac{\partial}{\partial \theta_j}\left(-\log l(\theta)\right) = \sum_{i=1}^{P} (\hat{\mu} - y_i)    (5.17)

whereas for the components affected by the quadratic part

\frac{\partial}{\partial \theta_j}\left(-\log l(\theta)\right) = \sum_{i=1}^{P} (\hat{\mu} - y_i)\,(1 + 2 b \theta_j)    (5.18)
The components of the state space that will be affected by the quadratic part will be
(θn−H+1 , θn−H+2 , ..., θn ).
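A minimal Python sketch of the estimated mean (5.15) and of the gradient of the negative log-likelihood, equations (5.17) and (5.18), is given below (our own illustration; the function names are assumptions of the example):

import numpy as np

def mu_hat_banana(theta, b, c, H):
    """Estimated mean (5.15): sum of all parameters, plus the constant c, plus
    the quadratic term b * (theta_{n-H+1}^2 + ... + theta_n^2)."""
    return np.sum(theta) + c + b * np.sum(theta[-H:] ** 2)

def grad_neg_log_lik_banana(theta, y, b, c, H):
    """Gradient of -log l: (5.17) for the linear components, (5.18) for the
    last H components, which are affected by the quadratic part."""
    mu = mu_hat_banana(theta, b, c, H)
    base = np.sum(mu - y)                      # common factor sum_i (mu_hat - y_i)
    g = np.full(theta.shape, base, dtype=float)
    g[-H:] *= 1.0 + 2.0 * b * theta[-H:]       # extra factor (1 + 2 b theta_j)
    return g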
Prior
We will use the same prior of the previous model of Section 5.3.2: a Gaussian centered at 0
and with variance 5000 in all directions:
p(\theta) = \prod_{i=1}^{n} N(0, \sigma^2)    (5.19)
Visual representation
We can get an idea of what the level surface of the log-likelihood looks like in the banana
family of models. For the particular case θ1 + θ2 + θ3 + b (θ2² + θ3²) = 0, using the value
b = 0.001 so that the curvature is mild but still visible, we give a couple of snapshots of the
3D graph of the level surface projected on the same chart as the prior. Figures 5.6 and 5.7
show a cross-section of the prior and the level surface of the log-likelihood; the curvature of
the level surface (green oblique surface) can be appreciated by comparison with Figures 5.1
and 5.2.
Figure 5.6: Graphical representation of the model described in Section 5.4.2: a 2D slice of the
Gaussian prior on the plane θ1 = 0 (black indicates low-probability zones, while progressively
warmer colors towards the center indicate zones of higher probability, as shown by the
colorbar), together with the level surface of the likelihood in the particular case
θ1 + θ2 + θ3 + b (θ2² + θ3²) = 0 (green surface), with b = 0.001, so that the curvature is mild
but still visible. The curvature can be appreciated by comparing with Figures 5.1 and 5.2,
where it was absent. Created using the python library plotly [Plotly, 2015].
Figure 5.7: Different viewpoint of Figure 5.6, created using python library plotly [Plotly,
2015]
By performing the eigenvalue analysis described in Section 4.2.2, using the gradient of the
negative log-likelihood of equations (5.17) and (5.18), we obtain, for a 10D system, the
eigenvalues shown in Figure 5.8.
Figure 5.8: Eigenvalues of the 10D Banana model. The estimated AS size is 4, considering the
spectral gap between eigenvalues 4 and 5. The dimension of the Active Subspace is na = 4.
Figure 5.9: Eigenvalues of the 25D Banana model. The estimated AS size is 4, considering the
spectral gap between eigenvalues 4 and 5. The dimension of the Active Subspace is na = 4.
For the tests, we calculated both the estimated posterior mean µπ and the estimated posterior
covariance Σπ of the 10D and 25D Banana models by running a test SMC with N = 50M
particles each, calculating µπ as in equation (5.9) and Σπ as in equation (5.10). We then used
both µπ and Σπ as the “true” posterior values for reference. We used µπ to calculate the
RMSE between the mean estimated by each of the algorithm runs and the true mean µπ (see
for example Figure 5.5).
We then used Σπ to build the optimal covariance of the proposal in standard MCMC using
equation (5.11), with (5.12) as the optimal covariance for the proposal of MCMC moves on
the active marginal and (5.13) as the optimal covariance for the proposal of MCMC moves on
the inactive marginal. In equations (5.12) and (5.13), da and di are the dimensions of the
active and inactive subspaces respectively.
We will, in this section, compare the performances of standard MCMC, AS-MH and AS-
PMMH. For the rest of the chapter we will be using the 25D Banana model. We have
used the following parameters in the model of equation (5.15): b = 0.001 (mild curvature),
c = 0 (absence of the constant term), H = 3 (the last 3 components will be affected by the
quadratic part of (5.15)).
Similarly to what we have done in Section 5.3.3, we will compare:
• AS-MH Algorithm 8;
• AS-PMMH Algorithm 10, which will perform an outer MCMC and several inner SMC
samplers: for each step the algorithm will perform an SMC sampler on the 21D inactive
subspace to obtain an unbiased estimate of the likelihood to be used in the outer active
MCMC as an estimate of the marginal likelihood.
The first comparison uses the multiESS [Vats et al., 2019] that we described in Section 2.4.8;
we use its R software implementation. We have run the algorithms in one test run with
200000 likelihood evaluations for each algorithm, using no burn-in, and we used the optimal
covariances of equation (5.11) in the MCMC proposal and of (5.12) for AS-MH and AS-PMMH
in the active marginal MCMC proposal. As explained, 200000 likelihood evaluations give a
different number of output samples in each of the three algorithms; see Table 5.2 for reference.
All algorithms showed an acceptance rate of around 10% in their main MCMC. Results are in
Table 5.5.
Algorithm   MultiESS/N (%)   MultiESS
MCMC             0.1            200
AS-MH            5.7           1140
AS-PMMH         15.5            516
Table 5.5: Banana 25D multiESS out of 200000 likelihood evaluations (please see Table 5.2
for the number of corresponding output samples N per algorithm).
We see in Table 5.5 that, as the posterior becomes more complex, the multiESS drops:
compare with Table 5.4 for the simpler Gaussian posterior, where all percentages were higher.
We remind the reader that, by construction, AS-MH and AS-PMMH output samples are likely
to show less correlation than standard MCMC, which in turn leads to higher multiESS. The
reason for the lower correlation is that, even when small steps are taken in the active marginal
MCMC, points in the state space may contain very different inactive parts. We see, in fact,
significantly higher multiESS per sample for the two methods in Table 5.5, with AS-MH
having the highest absolute multiESS. We also remind the reader that the number of output
samples N produced by the 200000 likelihood evaluations is different for each algorithm, with
the figures reported in Table 5.2: for AS-PMMH, the higher relative multiESS compared to
AS-MH results in a lower absolute multiESS, due to its low number of output samples.
To compare the estimates of the mean coming from the three algorithms, we performed 50
runs of each algorithm, with 100000 likelihood evaluations per run, on the 25D Banana model,
and we measured the Root Mean Squared Error (RMSE) between the true mean and the
mean estimated by each of the three algorithms. As “true” mean we used µπ of equation
(5.9), and we used the optimal covariances of equation (5.11) in MCMC and of (5.12) for
AS-MH and AS-PMMH in the active marginal MCMC proposal.
We report the violin plots showing the mean and the upper and lower quartiles of the
distribution of the RMSE.
Figure 5.10: Distribution of the RMSE between the true posterior mean and the mean
estimated by each of the algorithms over 50 runs. Starting from the LHS: MCMC, AS-MH and
AS-PMMH. We can see from the chart that the distributions for AS-MH and AS-PMMH have
a lower mean and upper quartile than MCMC, but longer tails. This may indicate very noisy
estimates of the likelihood in some of the runs of both algorithms, which causes the
distributions to have long tails. The MCMC has comparatively smaller tails.
We see in Figure 5.10 that, although AS-MH and AS-PMMH both have a lower mean and
upper quartile than the standard MCMC, their distributions are characterised by long, albeit
fairly thin, tails. It is not uncommon for MCMC algorithms that use estimates of the
likelihood to show sticky behaviour (i.e. the MCMC getting stuck), like that shown by AS-MH
and AS-PMMH in Figure 5.10. This is due to the chain getting stuck, possibly because of the
noisiness of the likelihood estimates used in AS-MH and AS-PMMH, and also because of the
very small number of effective samples of the two algorithms.
An example taken from one of the runs of the AS-PMMH shows the stickiness in action:
Figure 5.11 shows the trace-plot of one of the components during an AS-PMMH run; we see
how a noisy estimate of the likelihood causes the outer MCMC to become stuck. One
potential fix for the observed sticky behavior could be to increase the number of internal
SMC tempering steps or to increase the number of outer MCMC iterations. This approach
may help reduce the noise in likelihood estimates, allowing the MCMC chain to explore the
posterior more effectively. However, this improvement would come at the cost of a significant
increase in the number of likelihood evaluations.
Figure 5.11: Example of ’sticky’ trace-plot in AS-PMMH, taken from one of the runs in
Figure 5.10: a noisy estimate of the likelihood causes the outer MCMC to remain stuck for
a long time, and this causes the very long tails in the distribution of RMSE seen in Figure
5.10 for the AS-PMMH.
Review
Although on the theoretical level AS-PMMH may be seen as an advancement over the AS-MH
algorithm, AS-PMMH faces challenges that are common to AS-MH. We saw previously, in
Section 5.3.2, how in a simpler model none of the algorithms showed noisy likelihood
behaviour: see for example Figure 5.5, where the distributions in the simpler Gaussian model
show relatively short tails, and compare with Figure 5.10 for the more complex Banana model.
On the positive side, AS-PMMH shows the highest multiESS per sample, as seen in Table 5.5.
When testing the accuracy of the algorithms, we ran the three algorithms using 100000
likelihood evaluations per run; for AS-PMMH this meant 1666 outer MCMC steps, 10 inner
inactive variables and 6 tempering steps of the SMC sampler (1666 × 10 × 6 = 99960, the
closest we can get to 100000). One way to decrease the chance of a noisy likelihood estimate
coming from the SMC sampler would be, for example, to increase the number of tempering
steps, which is likely to reduce the variance of the estimate of the likelihood; but adding
tempering steps comes at the cost of increasing the number of likelihood evaluations: if we
keep the 1666 outer MCMC iterations and the 10 inactive variables, each extra tempering step
adds 1666 × 10 = 16660 likelihood evaluations.
Another consideration is that we have relatively few MCMC iterations (1666 in this example)
to explore the Active Subspace, while reserving the bigger computational effort of the SMC
sampler for the inactive part. We have therefore devised a revised version of the AS-PMMH
algorithm that we name AS-PMMH-i, which stands for inverted AS-PMMH, introduced in
Section 5.5: an algorithm in which the roles are switched, using the MCMC for the inactive
part and the SMC sampler for the active part.
In AS-PMMH we obtain an estimate of the likelihood by marginalising out the inactive
variables i through an SMC sampler, and we use the estimate in an outer MCMC on the
active variable a. One of the problems of this approach is that it uses the computationally
more intensive SMC sampler on the part of the state space we are least interested in, the
inactive variable i. This may lead to a sub-optimal estimate of a. We have therefore devised
an inverted algorithm, which we name AS-PMMH-i, that at the theoretical level has exactly
the same framework as AS-PMMH; the difference is that in AS-PMMH-i we switch roles and
use the internal SMC sampler on the active component, while the outer MCMC acts on the
inactive part. The aim is clear: to spend the most computationally intensive part on the
component of the state space we are most interested in, i.e. the active one. The algorithms are
reported below: Algorithm 12 is the outer MCMC on the inactive variables, whereas
Algorithm 11 is the SMC sampler on the active variables (they are conceptually the same as
the algorithms introduced earlier in this chapter, only with the roles inverted).
Alg. 11 SMC on active variables for a given i and t.
1: Simulate N_a points, {a_0^n}_{n=1}^{N_a} ~ p_a(· | i), and set each weight w_0^n = 1/N_a;
2: for s = 1 : t do
3:   for n = 1 : N_a do    ▷ reweight
4:     if s = 1 then
5:       w̃_s^n = w_{s-1}^n · l_{1:s}(B_i i + B_a a^n);
6:     else
7:       w̃_s^n = w_{s-1}^n · l_{1:s}(B_i i + B_a a_{s-1}^n) / l_{1:s-1}(B_i i + B_a a_{s-1}^n);
8:     end if
9:   end for
10:  {w_s^n}_{n=1}^{N_a} ← normalise {w̃_s^n}_{n=1}^{N_a};
11:  If s = t, go to line 30;
12:  for n = 1 : N_a do
13:    Simulate the index v_{s-1}^n ~ M(w_s^1, ..., w_s^{N_a}) of the ancestor of particle n;
14:  end for
15:  if some degeneracy condition is met then    ▷ resample
16:    for n = 1 : N_a do
17:      Set a_s^n = a_{s-1}^{v_{s-1}^n};
18:    end for
19:    w_s^n = 1/N_a for n = 1 : N_a;
20:  else
21:    for n = 1 : N_a do
22:      Set a_s^n = a_{s-1}^n;
23:    end for
24:  end if
25:  for n = 1 : N_a do    ▷ move
26:    Simulate a_s^{n*} ~ q_{t,a}(· | a_s^n, i);
27:    Set a_s^n = a_s^{n*} with probability
         1 ∧ [ l_{1:s}(B_i i + B_a a_s^{n*}) p_a(a_s^{n*} | i) q_{t,a}(a_s^n | i) ] / [ l_{1:s}(B_i i + B_a a_s^n) p_a(a_s^n | i) q_{t,a}(a_s^{n*} | i) ],
Alg. 12 Inactive subspace particle marginal Metropolis-Hastings: AS-PMMH-i
1: Initialise i^0;
2: Run Algorithm 11 for i = i^0 and t = T, obtaining l̄_{T,i}(i^0) and weights, denoted w_T^{1,0}, ..., w_T^{N_a,0};
3: u^0 ~ M(w_T^{1,0}, ..., w_T^{N_a,0});
4: Let l̄_i^0 = l̄_{T,i}(i^0);
5: for m = 1 : N_i do
6:   i^{*m} ~ q_i(· | i^{m-1});
7:   Run Algorithm 11 for i = i^{*m} and t = T, obtaining l̄_{T,i}(i^{*m}) and weights, denoted w_T^{*1,m}, ..., w_T^{*N_a,m};
8:   u^{*m} ~ M(w_T^{*1,m}, ..., w_T^{*N_a,m});
9:   Set (i^m, {a^{n,m}, w^{n,m}}_{n=1}^{N_a}, u^m, l̄_i^m) = (i^{*m}, {a^{*n,m}, w^{*n,m}}_{n=1}^{N_a}, u^{*m}, l̄_{T,i}(i^{*m})) with probability
       α_i^m = 1 ∧ [ p_i(i^{*m}) l̄_{T,i}(i^{*m}) q_i(i^{m-1} | i^{*m}) ] / [ p_i(i^{m-1}) l̄_i^{m-1} q_i(i^{*m} | i^{m-1}) ];
10:  Else let (i^m, {a^{n,m}, w^{n,m}}_{n=1}^{N_a}, u^m, l̄_i^m) = (i^{m-1}, {a^{n,m-1}, w^{n,m-1}}_{n=1}^{N_a}, u^{m-1}, l̄_i^{m-1});
11: end for
• AS-MH Algorithm 8;
• AS-PMMH Algorithm 10, which will perform an outer MCMC and several inner SMC
samplers: for each step the algorithm will perform an SMC sampler on the 21D inactive
subspace to obtain an unbiased estimate of the likelihood to be used in the outer active
MCMC as estimate of the marginal likelihood.
• AS-PMMH-i Algorithm 12, which is similar to the AS-PMMH above, only with inverted
roles: for each step the algorithm will perform an SMC sampler on the 4D active
subspace to obtain an unbiased estimate of the likelihood to be used in the outer
inactive MCMC as estimate of the marginal likelihood.
Comparison using MultiESS
The first comparison uses the multiESS [Vats et al., 2019] that we described in Section 2.4.8;
we use its R software implementation. We have run the algorithms in one test run with
200000 likelihood evaluations for each algorithm, using no burn-in, and we used the optimal
covariances of equation (5.11) in the MCMC proposal, of (5.12) for AS-MH and AS-PMMH in
the active marginal MCMC proposal, and of (5.13) for AS-PMMH-i in the inactive marginal
MCMC proposal. As explained, 200000 likelihood evaluations give a different number of
output samples in each of the algorithms; see Table D.2 for reference. All algorithms showed
an acceptance rate of around 10% in their main MCMC, except AS-PMMH-i, which showed a
very high acceptance rate of around 60% in the outer inactive MCMC. Results are in
Table 5.6.
Table 5.6: Banana 25D multiESS out of 200000 likelihood evaluations (please see Table D.2
for the number of corresponding output samples N per algorithm).
We remind the reader again that, by construction, AS-MH and AS-PMMH output samples are
likely to show less correlation than standard MCMC, which in turn leads to higher multiESS:
even when small steps are taken in the active marginal MCMC, points in the state space may
contain very different inactive parts. The same applies to AS-PMMH-i, with the active and
inactive roles inverted. The relatively high multiESS figures for AS-PMMH-i in Table 5.6 are
therefore likely to be due to its much higher acceptance rate of around 60% in the test run:
the more samples are accepted, the higher the multiESS, as the samples will show little
correlation.
We perform the RMSE test on the 25D Banana model of Section 5.4.2. In this analysis we
concentrate on the estimates of the individual components, as our primary interest lies in
understanding the ’stickiness’ behaviour shown by some of the components. Focusing on
component-wise RMSE allows us to better highlight how each algorithm performs across
different parts of the state space; future work could incorporate additional distributional
metrics for a more comprehensive analysis. We ran the algorithms with 100000 likelihood
evaluations, which means, for both AS-PMMH and AS-PMMH-i, 1666 outer MCMC steps, 10
inner particles and 6 tempering steps of the SMC sampler (1666 × 10 × 6 = 99960, the closest
we can get to 100000), as explained in Table 5.1 (the AS-PMMH-i figures are the same as for
AS-PMMH). The resulting distributions of the RMSE between the true posterior mean and
the mean estimated by each of the algorithms over 50 runs can be seen in Figure 5.12.
Figure 5.12: Distribution of the RMSE between the true posterior mean and the mean
estimated by each of the algorithms over 50 runs. The tails of the AS-PMMH-i distribution
are smaller than those of AS-PMMH, probably because, keeping the number of tempering
steps of the SMC sampler equal between the two (6), the space the SMC has to act upon is
much smaller: 4D for AS-PMMH-i vs 21D for AS-PMMH. By contrast, the MCMC part has
to act on a much bigger space (21D vs 4D), which probably explains why AS-PMMH-i has a
bigger average error.
5.5.2 Review
We see in Figure 5.12 that, by switching roles and concentrating the effort on the smaller
subspace, the tails of the RMSE distribution for AS-PMMH-i become smaller compared to
AS-PMMH. One possible reason is that the estimate of the likelihood is probably less noisy
when the SMC runs on the active 4D space rather than on the inactive 21D space. We also
see, however, that the average RMSE is higher than for AS-PMMH; one possible reason is
that the outer MCMC runs on a bigger space, 21D for AS-PMMH-i compared to 4D for
AS-PMMH.
We believe that using a marginal estimate brings a trade-off: increasing the number of
likelihood evaluations should generally improve the quality of the approximation, but it comes
at the cost of increasing, sometimes considerably, the algorithmic cost. As said in Section
5.4.3, in the example of Section 5.5.1, if we keep the 1666 outer MCMC iterations and the 10
active variables, each additional tempering step adds 1666 × 10 = 16660 likelihood evaluations.
The noisiness of marginal estimates, especially when we want to keep down the number of
likelihood evaluations (and therefore the algorithmic cost), is what triggered our research into
alternative approaches, for example AS-Gibbs in Section 5.6 and AS-MwPG in Section 5.7.
Alg. 13 AS-Gibbs
1: Initialise θ^(0) = B_a a^(0) + B_i i^(0)
2: for t = 1 to T do
3:   i^* ~ p_i(· | a^(t-1))    ▷ propose inactive
4:   Set i^(t) = i^* with probability
       1 ∧ [ p_i(i^* | a^(t-1)) l(a^(t-1), i^*) p_i(i | a^(t-1)) ] / [ p_i(i | a^(t-1)) l(a^(t-1), i) p_i(i^* | a^(t-1)) ] = 1 ∧ l(a^(t-1), i^*) / l(a^(t-1), i)
5:   Else let i^(t) = i^(t-1)
6:   a^* ~ q_a(· | a^(t-1))    ▷ propose active
7:   Set a^(t) = a^* with probability
       1 ∧ [ p_a(a^*) p_i(i^(t) | a^*) l(a^*, i^(t)) q_a(a^(t-1) | a^*) ] / [ p_a(a^(t-1)) p_i(i^(t) | a^(t-1)) l(a^(t-1), i^(t)) q_a(a^* | a^(t-1)) ]
8:   Else let a^(t) = a^(t-1)
9: end for
The core idea behind using the Gibbs method in AS is that, if we have perfectly inactive
variables, then we will accept all moves on the inactive variables in the MH ratio of line 4 of
Algorithm 13. In that case the acceptance probability for the inactive part in the MH ratio
would in fact be

1 ∧ \frac{p_i(i^* \mid a^{(t-1)})\, l(a^{(t-1)}, i^*)\, p_i(i \mid a^{(t-1)})}{p_i(i \mid a^{(t-1)})\, l(a^{(t-1)}, i)\, p_i(i^* \mid a^{(t-1)})} = 1 ∧ \frac{l(a^{(t-1)}, i^*)}{l(a^{(t-1)}, i)} = 1,    (5.21)
where the final step in (5.21) follows from the assumption that the likelihood function
l(a^(t−1), i) remains invariant with respect to changes in i. Specifically, we assume that
l(a^(t−1), i) = l(a^(t−1), i^*), which allows us to simplify the expression:

\frac{l(a^{(t-1)}, i^*)}{l(a^{(t-1)}, i)} = 1.

This simplification comes from the likelihood remaining constant over the inactive variable
i, enabling the cancellation of terms.
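A compact Python sketch of Algorithm 13 follows (our own illustration; the likelihood, priors and active proposal are passed in as functions, all names are assumptions of the example, and the likelihood is understood to be evaluated at θ = B_a a + B_i i inside log_lik):

import numpy as np

def as_gibbs(a0, i0, T, log_lik, log_prior_a, log_prior_i_given_a,
             sample_prior_i_given_a, sample_qa, log_qa, rng):
    """AS-Gibbs (Algorithm 13): alternate an independence move on the inactive
    part (proposal = conditional prior, so the ratio reduces to a likelihood
    ratio) with a Metropolis-Hastings move on the active part."""
    a, i = np.array(a0, float), np.array(i0, float)
    chain = []
    for _ in range(T):
        # Inactive update (lines 3-5): accept with prob 1 ^ l(a, i*) / l(a, i).
        i_star = sample_prior_i_given_a(a, rng)
        if np.log(rng.uniform()) < log_lik(a, i_star) - log_lik(a, i):
            i = i_star
        # Active update (lines 6-8): standard MH with proposal q_a(. | a).
        a_star = sample_qa(a, rng)
        log_alpha = (log_prior_a(a_star) + log_prior_i_given_a(i, a_star)
                     + log_lik(a_star, i) + log_qa(a, a_star)
                     - log_prior_a(a) - log_prior_i_given_a(i, a)
                     - log_lik(a, i) - log_qa(a_star, a))
        if np.log(rng.uniform()) < log_alpha:
            a = a_star
        chain.append((a.copy(), i.copy()))
    return chain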
The first comparison uses the multiESS [Vats et al., 2019] that we described in Section 2.4.8;
we use its R software implementation. We have run the algorithms in one test run with
200000 likelihood evaluations for each algorithm, using no burn-in, and we used the optimal
covariances of equation (5.11) in the MCMC proposal, of (5.12) for AS-MH, AS-PMMH and
AS-Gibbs in the active marginal MCMC proposal, and of (5.13) for AS-PMMH-i in the
inactive marginal MCMC proposal. As explained, 200000 likelihood evaluations give a
different number of output samples in each of the algorithms; see Table D.2 for reference. All
algorithms showed an acceptance rate of around 10% in their main MCMC, except
AS-PMMH-i, which showed a very high acceptance rate of around 60% in the outer inactive
MCMC. Results are in Table 5.7.
Table 5.7: Banana 25D multiESS out of 200000 likelihood evaluations (please see Table D.2
for the number of corresponding output samples N per algorithm).
We explained earlier, for example in the multiESS discussion of Section 5.5.1, why, by
construction, the samples in AS-MH, AS-PMMH and AS-PMMH-i are likely to be less
correlated and therefore to show a higher MultiESS/N; please refer to that section for the
explanation.
In AS-Gibbs too, by construction, chain samples are likely to show little correlation, and
therefore a high MultiESS/N, as we can appreciate in Table 5.7, especially as we approach
perfect Active Subspaces. One of the reasons is that, looking at line 4 of AS-Gibbs, in the case
of a near-perfect Active Subspace the acceptance probability becomes as in equation (5.21), so
the inactive move is almost always accepted; consecutive θ samples of the chain will therefore
in general have different inactive parts, even if small steps are taken in the active MH of line
7, and will tend to show little correlation and a high multiESS.
We can appreciate in Table 5.7 that AS-Gibbs is the clear winner, showing a much bigger
multiESS: in terms of multiESS, AS-Gibbs is a much more efficient algorithm than the others
listed, since with the same number of likelihood evaluations it yields a much higher effective
sample size.
To compare the estimates of the mean coming from the algorithms, we performed 50 runs of
each algorithm, with 100000 likelihood evaluations per run, and we measured the Root Mean
Squared Error (RMSE) between the true mean and the mean estimated by each of the
algorithms. As a reminder, 100000 likelihood evaluations result in a different number of
output samples for each algorithm; please refer to Table D.1. We report the violin plots
showing the mean and the upper and lower quartiles of the distribution of the RMSE.
Figure 5.13: Distribution of RMSE of the differences between the true posterior mean and
the mean estimated by each of the algorithms over 50 runs. We can see from the chart that
AS-Gibbs has lower mean RMSE.
5.6.2 Review
We have introduced a novel way of performing AS-MCMC on Active Subspaces that uses
Gibbs sampling, which we named AS-Gibbs. We have shown on the Banana model that it
performs better than the other algorithms discussed in this chapter: standard MCMC, the
AS-MH Algorithm 8, the AS-PMMH Algorithm 10 and the AS-PMMH-i Algorithm 12. It has
a much higher multiESS (see Table 5.7), and the distribution of the RMSE between the true
posterior mean and the mean estimated by each of the algorithms shows a lower mean and a
lower upper quartile than all the others (see Figure 5.13).
We expect the Gibbs Algorithm 13 to perform better than the standard MCMC Algorithm 4
in the case of a perfect Active Subspace (i.e. the eigenvalues of equation (4.9) all equal to
zero). In fact, considering that the inactive proposal is always accepted (see equation (5.21)),
Algorithm 13 is de facto an MCMC acting on a marginal of the posterior on a sub-dimension
of the space, da < d, with da the dimension of the active component and d the dimension of
the full space.
We also expect AS-Gibbs to outperform, in general, algorithms that use an estimate of the
marginal likelihood, like AS-MH, AS-PMMH and AS-PMMH-i, in cases of perfect Active
Subspaces: MCMC chains that use estimates of marginal likelihoods will in general be noisier,
have higher variance and lower acceptance rates than those that use exact marginals [Andrieu
and Vihola, 2015]. In summary, we expect Algorithm 13 to perform better because it uses an
exact marginal, whereas Algorithm 8 uses an estimate of the marginal likelihood [Andrieu and
Vihola, 2015].
If applied to the earlier examples, such as the Gaussian model discussed in Section 5.3.2,
AS-Gibbs is expected to perform similarly to traditional MCMC. The reason is that, in a 25D
Gaussian model, the reduction brought by AS-Gibbs is de facto an MCMC performed on a
24D space instead of the original 25D space. However, this reduction comes with the
additional computational overhead of calculating the Active Subspace and the structural
matrices, described in Section 4.2. The advantage of using AS-Gibbs compared to, for
example, traditional MCMC becomes more apparent as the dimension of the Active Subspace
increases; in the Banana model analysed in this section, for instance, the Active Subspace has
dimension 4.
algorithm AS-MwPG. We explain the rationale behind it: in cases where the active and
inactive parts are independent (and so the right setting for AS-Gibbs), but the inactive
subspace is complex to explore, it could be difficult to find a good proposal for the inactive
part in AS-Gibbs (the inactive proposal step of Algorithm 13): the inactive subspace could, for
example, be multimodal, in which case using an SMC sampler can perform better, since the
SMC transitions smoothly from a starting distribution to the posterior.
With this in mind, we have devised AS-MwPG. We discussed the MwPG in Section 2.8.2. The
fundamental idea takes root from what we did in AS-Gibbs, i.e. exploiting the cases where
there is independence between the active and inactive components; in addition, AS-MwPG
eases the problems in cases where the inactive part is challenging to explore. AS-MwPG draws
from the inactive part using an inner SMC sampler embedded in an outer MCMC performed
on the active component. The inactive SMC is conditioned on a path of the inactive variables
that “survives” all resampling steps, as we explained in Section 2.8.2, and that is the part that
plays the role of the conditioning in this extended Gibbs algorithm. The AS-MwPG algorithm
is performed by using Algorithm 15 (outer active MCMC) and Algorithm 14 (inner inactive
conditioned SMC). As a note, in Algorithms 15 and 14 below, l(θ) is the likelihood, ls(θ) is as
per equation (5.22), where ηt is the tempering exponent (see Section 2.6 for details on the
tempering), and

l_{1:t}(\theta) = \prod_{s=1}^{t} l_s(\theta) = l(\theta)^{\eta_t}    (5.23)
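For concreteness, a minimal sketch of how the tempered likelihood (5.23) enters the calculations (our own illustration, with assumed names): since l_{1:t}(θ) = l(θ)^{η_t}, the incremental weight between tempering steps s−1 and s used in the reweighting of the SMC samplers is l(θ)^{η_s − η_{s−1}}.

def log_tempered_lik(log_lik_theta, eta_t):
    """log l_{1:t}(theta) = eta_t * log l(theta), equation (5.23)."""
    return eta_t * log_lik_theta

def log_incremental_weight(log_lik_theta, eta_s, eta_s_prev):
    """log of l_{1:s}(theta) / l_{1:s-1}(theta) = (eta_s - eta_{s-1}) * log l(theta)."""
    return (eta_s - eta_s_prev) * log_lik_theta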
Alg. 14 Conditional SMC on inactive variables for a given a, i_{0:T}^1, w̃_{0:T}^1 and t.
1: Simulate N_i − 1 points, {i_0^n}_{n=2}^{N_i} ~ p_i(· | a), and set each weight w_0^n = 1/N_i;
2: for s = 1 : t do
3:   for n = 2 : N_i do    ▷ reweight
4:     if s = 1 then
         w̃_s^n = w_{s-1}^n · l_{1:s}(B_a a + B_i i^n);
5:     else
         w̃_s^n = w_{s-1}^n · l_{1:s}(B_a a + B_i i_{s-1}^n) / l_{1:s-1}(B_a a + B_i i_{s-1}^n);
6:     end if
7:   end for
8:   {w_s^n}_{n=1}^{N_i} ← normalise {w̃_s^n}_{n=1}^{N_i};
9:   If s = t, terminate the algorithm;
10:  for n = 2 : N_i do Simulate the index v_{s-1}^n ~ M(w_s^1, ..., w_s^{N_i}) of the ancestor of particle n;
11:  end for
12:  if some degeneracy condition is met then    ▷ resample
13:    for n = 2 : N_i do Set i_s^n = i_{s-1}^{v_{s-1}^n};
14:    end for; w_s^n = 1/N_i for n = 1 : N_i;
15:  else
16:    for n = 2 : N_i do Set i_s^n = i_{s-1}^n;
17:    end for
18:  end if
19:  for n = 2 : N_i do    ▷ move
20:    Simulate i_s^{n*} ~ q_{t,i}(· | i_s^n, a);
21:    Set i_s^n = i_s^{n*} with probability
22:      1 ∧ [ l_{1:s}(B_a a + B_i i_s^{n*}) p_i(i_s^{n*} | a) q_{t,i}(i_s^n | a) ] / [ l_{1:s}(B_a a + B_i i_s^n) p_i(i_s^n | a) q_{t,i}(i_s^{n*} | a) ],
Alg. 15 Active Subspace Metropolis within particle Gibbs
1: Initialise a^0;
2: Initialise i_t^{1,0} for t = 0 : T;
3: for m = 1 : N_i do
4:   a^{*m} ~ q_a(· | a^{m-1});
5:   Set a^m = a^{*m} with probability
Alg. 16 Conditional SMC on active variables for a given i, a_{0:T}^1, w̃_{0:T}^1 and t.
1: Simulate N_a − 1 points, {a_0^n}_{n=2}^{N_a} ~ p_a(· | i), and set each weight w_0^n = 1/N_a;
2: for s = 1 : t do
3:   for n = 2 : N_a do    ▷ reweight
4:     if s = 1 then
         w̃_s^n = w_{s-1}^n · l_{1:s}(B_i i + B_a a^n);
5:     else
         w̃_s^n = w_{s-1}^n · l_{1:s}(B_i i + B_a a_{s-1}^n) / l_{1:s-1}(B_i i + B_a a_{s-1}^n);
6:     end if
7:   end for
8:   {w_s^n}_{n=1}^{N_a} ← normalise {w̃_s^n}_{n=1}^{N_a};
9:   If s = t, terminate the algorithm;
10:  for n = 2 : N_a do Simulate the index v_{s-1}^n ~ M(w_s^1, ..., w_s^{N_a}) of the ancestor of particle n;
11:  end for
12:  if some degeneracy condition is met then    ▷ resample
13:    for n = 2 : N_a do Set a_s^n = a_{s-1}^{v_{s-1}^n};
14:    end for; w_s^n = 1/N_a for n = 1 : N_a;
15:  else
16:    for n = 2 : N_a do Set a_s^n = a_{s-1}^n;
17:    end for
18:  end if
19:  for n = 2 : N_a do    ▷ move
20:    Simulate a_s^{n*} ~ q_{t,a}(· | a_s^n, i);
21:    Set a_s^n = a_s^{n*} with probability
22:      1 ∧ [ l_{1:s}(B_i i + B_a a_s^{n*}) p_a(a_s^{n*} | i) q_{t,a}(a_s^n | i) ] / [ l_{1:s}(B_i i + B_a a_s^n) p_a(a_s^n | i) q_{t,a}(a_s^{n*} | i) ],
Alg. 17 Inactive subspace Metropolis within particle Gibbs
1: Initialise i^0;
2: Initialise a_t^{1,0} for t = 0 : T;
3: for m = 1 : N_i do
4:   i^{*m} ~ q_i(· | i^{m-1});
5:   Set i^m = i^{*m} with probability
5.7.2 Review
We have introduced two versions of a novel method based on applying Metropolis within
Particle Gibbs (MwPG) [Andrieu et al., 2010] to Active Subspaces; we named the algorithms
AS-MwPG (when the internal SMC sampler is used to sample the inactive variables) and
AS-MwPG-i (when the internal SMC sampler is used to sample the active variables). Some of
the conditions under which we expect the novel algorithms to perform at their best are similar
to those of AS-Gibbs, introduced in Section 5.6 (i.e. conditional distributions that are easy to
draw from, or the case of a perfect AS).
We expect the use of AS-MwPG-i to be more prominent than that of AS-MwPG, as it
dedicates most of the computational power to the Active Subspace. We expect AS-MwPG-i to
perform better than AS-Gibbs in cases where the Active Subspace is complex to explore (for
example multimodal). We will see AS-MwPG-i in action on one such example in Section 5.8.
5.8.1 Gaussian mixture model
Gaussian mixture models are a class of multimodal distributions often used, for example, in
approximating complex distributions [Reynolds, 2009]. We use here a mixture of two
Gaussians, in a 4D space. For a single observation y, the likelihood of the model has the form:

l(\theta) = \frac{1}{2} N(y \mid \theta_1 + \theta_2, 1) + \frac{1}{2} N(y \mid \theta_3 + \theta_4, 1)    (5.24)

We see in (5.24) that, with the state-space parameter θ = (θ1, θ2, θ3, θ4), the sum of the first
two elements θ1 + θ2 is used to estimate the mean of the first Gaussian, and the sum θ3 + θ4 is
used to estimate the mean of the second Gaussian. The system is clearly over-parametrised,
since we would strictly need only two parameters in the state space, one for the estimate of
the mean of each Gaussian. When we have multiple observations, say P, in a vector
y = (y1, y2, ..., yP), assuming independence of the observations, the likelihood (5.24) becomes:

l(\theta) = \prod_{i=1}^{P} \left[ \frac{1}{2} N(y_i \mid \theta_1 + \theta_2, 1) + \frac{1}{2} N(y_i \mid \theta_3 + \theta_4, 1) \right]    (5.25)
Prior
We will use a prior given by a Gaussian centered at 0 and with variance 25 in all directions:
p(\theta) = \prod_{i=1}^{n} N(0, \sigma^2)    (5.26)
Gradient of log-likelihood
\log l(\theta) = \sum_{i=1}^{P} \log\left( \frac{1}{2} N(y_i \mid \theta_1 + \theta_2, 1) + \frac{1}{2} N(y_i \mid \theta_3 + \theta_4, 1) \right)    (5.27)

\frac{\partial \log l(\theta)}{\partial \theta_1} = \sum_{i=1}^{P} \frac{1}{\frac{1}{2} N(y_i \mid \theta_1 + \theta_2, 1) + \frac{1}{2} N(y_i \mid \theta_3 + \theta_4, 1)} \cdot \frac{1}{2} \frac{\partial}{\partial \theta_1} N(y_i \mid \theta_1 + \theta_2, 1)    (5.28)

We have

\frac{\partial}{\partial \theta_1} N(y_i \mid \theta_1 + \theta_2, 1) = \frac{\partial}{\partial \theta_1} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(y_i - (\theta_1 + \theta_2))^2}{2}\right) = (y_i - (\theta_1 + \theta_2))\, \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(y_i - (\theta_1 + \theta_2))^2}{2}\right) = (y_i - (\theta_1 + \theta_2))\, N(y_i \mid \theta_1 + \theta_2, 1)    (5.29)

Considering that we are interested in the gradient of the negative log-likelihood, combining
(5.28) and (5.29) we have

\frac{\partial (-\log l(\theta))}{\partial \theta_1} = \sum_{i=1}^{P} \frac{\frac{1}{2} N(y_i \mid \theta_1 + \theta_2, 1)}{\frac{1}{2} N(y_i \mid \theta_1 + \theta_2, 1) + \frac{1}{2} N(y_i \mid \theta_3 + \theta_4, 1)} \,(\theta_1 + \theta_2 - y_i)    (5.30)

Generalising equation (5.30), it is easy to check that, with the likelihood given in (5.25), the
components of the gradient of the negative log-likelihood are

\frac{\partial (-\log l(\theta))}{\partial \theta_k} = \sum_{i=1}^{P} \frac{\frac{1}{2} N(y_i \mid \theta_1 + \theta_2, 1)}{\frac{1}{2} N(y_i \mid \theta_1 + \theta_2, 1) + \frac{1}{2} N(y_i \mid \theta_3 + \theta_4, 1)} \,(\theta_1 + \theta_2 - y_i), \quad k = 1, 2    (5.31)

\frac{\partial (-\log l(\theta))}{\partial \theta_k} = \sum_{i=1}^{P} \frac{\frac{1}{2} N(y_i \mid \theta_3 + \theta_4, 1)}{\frac{1}{2} N(y_i \mid \theta_1 + \theta_2, 1) + \frac{1}{2} N(y_i \mid \theta_3 + \theta_4, 1)} \,(\theta_3 + \theta_4 - y_i), \quad k = 3, 4    (5.32)
Considering that the yi are noisy observations of the mean of either of the two Gaussians, we
understand from equations (5.31) and (5.32) that the level surfaces of the log-likelihood are
those where either

θ1 + θ2 = µ1 and θ3 + θ4 = µ2    (5.33)

or

θ1 + θ2 = µ2 and θ3 + θ4 = µ1    (5.34)

where µ1 and µ2 are the means of the two Gaussians respectively. We show this visually in the
next subsection.
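A minimal Python sketch of the mixture log-likelihood (5.25) and of the gradient components (5.31)–(5.32) is given below (our own illustration; the names are assumptions of the example):

import numpy as np

def norm_pdf(y, mu):
    """Density of N(mu, 1) evaluated at y."""
    return np.exp(-0.5 * (y - mu) ** 2) / np.sqrt(2.0 * np.pi)

def log_lik_mixture(theta, y):
    """Log of the mixture likelihood (5.25): each y_i is an equal-weight mixture
    of N(theta_1 + theta_2, 1) and N(theta_3 + theta_4, 1)."""
    m1, m2 = theta[0] + theta[1], theta[2] + theta[3]
    return np.sum(np.log(0.5 * norm_pdf(y, m1) + 0.5 * norm_pdf(y, m2)))

def grad_neg_log_lik_mixture(theta, y):
    """Components of the gradient of -log l, equations (5.31) and (5.32)."""
    m1, m2 = theta[0] + theta[1], theta[2] + theta[3]
    p1, p2 = 0.5 * norm_pdf(y, m1), 0.5 * norm_pdf(y, m2)
    resp1 = p1 / (p1 + p2)                 # responsibility of the first component
    resp2 = p2 / (p1 + p2)
    g12 = np.sum(resp1 * (m1 - y))         # shared by theta_1 and theta_2, (5.31)
    g34 = np.sum(resp2 * (m2 - y))         # shared by theta_3 and theta_4, (5.32)
    return np.array([g12, g12, g34, g34])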
Visual representation
In a realisation of the Gaussian mixture model obtained by generating synthetic data from the
underlying model

\frac{1}{2} N(-5, 1) + \frac{1}{2} N(5, 1)    (5.35)

we illustrate what we said at the beginning of Section 5.8.1: from equations (5.33) and (5.34)
we expect the following combinations to leave the likelihood invariant:
• θ1 + θ2 = −5 and θ3 + θ4 = 5
• θ1 + θ2 = 5 and θ3 + θ4 = −5
After generating noisy synthetic data from (5.35), in order to visualise the posterior we ran an
SMC with 1M particles with a likelihood of the form of equation (5.25); Figure 5.14 shows the
output combinations that leave the likelihood invariant, which, as expected, are θ1 + θ2 = −5
and θ3 + θ4 = 5, or vice versa.
Figure 5.14: Level surface of the system of equation (5.35), the combinations θ1 + θ2 = −5
and θ3 + θ4 = 5 or θ1 + θ2 = 5 and θ3 + θ4 = −5 are the ones that leave the likelihood (5.25)
invariant
In the following scatter-plot, as a confirmation of the above, we see that for the pair of
parameters θ1 and θ2 the expected combinations are the ones giving θ1 + θ2 = ±5 (similar
results hold for the other pair, θ3 and θ4).
Figure 5.15: Level surface of the system of equation (5.35): the combination of parameters
θ1 + θ2 = ±5 are the ones that leave the likelihood (5.25) invariant
5.8.2 Results
We have used for the likelihood a realisation of the Gaussian mixture model obtained by
generating synthetic data from the same underlying model of equation (5.35); we know
therefore that the following combinations leave it invariant (see also Figure 5.14):
• θ1 + θ2 = −5 and θ3 + θ4 = 5
• θ1 + θ2 = 5 and θ3 + θ4 = −5
To illustrate that the SMC is able to explore both modes of the target without a carefully
chosen proposal, we have used for all algorithms a covariance matrix that is considerably
smaller than the optimal covariance [Roberts and Rosenthal, 2001], for the proposals both
of the standard MCMC and of the MCMC part in AS-Gibbs, AS-MH, AS-MwPG-i. This
setting will allow us to see if any method, even with a covariance of the proposal that is
technically very challenging, is capable of reconstructing the bi-modal posterior correctly
without remaining stuck in one of the modes.
We have run the standard MCMC Algorithm 4, the AS-MH Algorithm 8, the AS-Gibbs
Algorithm 13 and AS-MwPG-i (Algorithm 17); for each method we used 440000 likelihood
evaluations, for a fair comparison. AS-MwPG-i is the only algorithm that has been able to
correctly reconstruct the posterior, as we can appreciate from Figures 5.16 and 5.17. This
highlights the strength of the AS-MwPG-i algorithm: even with a proposal covariance that is
technically “challenging”, it is the only algorithm capable of reconstructing the bimodal
posterior correctly.
Results AS-MwPG-i
Figure 5.16: AS-MwPG-i: reconstruction of the posterior for the system having likelihood
(5.35). We see that AS-MwPG-i correctly reconstructs the bimodal posterior, both modes
(θ1 + θ2 = −5 and θ3 + θ4 = 5) and (θ1 + θ2 = 5 and θ3 + θ4 = −5) are found (Note: in the
figure “Sum components first mean” on the x-axis is the sum µ1 = θ1 + θ2 , whereas “Sum
components second mean” on the y-axis is the sum µ2 = θ3 + θ4 ).
Figure 5.17: AS-MwPG-i: reconstruction of the posterior for the system having likelihood
(5.35). We see in the figure θ1 (named Component 1 in the figure) vs θ2 (named Component
2): we can appreciate that MwPG correctly reconstructs the bimodal posterior, in fact both
combinations θ1 + θ2 = −5 and θ1 + θ2 = 5 are found (Note: in the figure “Component 1” on
the x-axis is θ1 , whereas “Component 2” on the y-axis is θ2 ).
Figure 5.18: Standard MCMC: incorrect reconstruction of the posterior for the system
having likelihood (5.35). We see that MCMC incorrectly reconstructs the bimodal posterior:
only the mode (θ1 + θ2 = −5 and θ3 + θ4 = 5) is found, whereas the mode (θ1 + θ2 = 5
and θ3 + θ4 = −5) is missing, see comparison with Figure 5.16. (Note: in the figure “Sum
components first mean” on the x-axis is the sum µ1 = θ1 + θ2 , whereas “Sum components
second mean” on the y-axis is the sum µ2 = θ3 + θ4 ).
Figure 5.19: Standard MCMC: incorrect reconstruction of the posterior for the system
having likelihood (5.35). We see in the figure θ1 (named Component 1 in the figure) vs θ2
(named Component 2): we can appreciate that MCMC gets stuck in one mode and only the
combination θ1 +θ2 = −5 is found, while the mode θ1 +θ2 = 5 is missing, see comparison with
Figure 5.17 (Note: in the figure “Component 1” on the x-axis is θ1 , whereas “Component 2”
on the y-axis is θ2 ).
Results AS-MH
Figure 5.20: AS-MH algorithm 8 of Section 4.8.2: incorrect reconstruction of the posterior
for the system having likelihood (5.35). We see that AS-MH algorithm incorrectly reconstructs
the bimodal posterior: only the mode (θ1 + θ2 = −5 and θ3 + θ4 = 5) is found, whereas the
mode (θ1 + θ2 = 5 and θ3 + θ4 = −5) is missing, see comparison with Figure 5.16. (Note: in
the figure “Sum components first mean” on the x-axis is the sum µ1 = θ1 + θ2 , whereas “Sum
components second mean” on the y-axis is the sum µ2 = θ3 + θ4 ).
Figure 5.21: AS-MH algorithm 8 of Section 4.8.2: incorrect reconstruction of the posterior
for the system having likelihood (5.35). We see in the figure θ1 (named Component 1 in the
figure) vs θ2 (named Component 2): we can appreciate that AS-MH gets stuck in one mode
and only the combination θ1 + θ2 = −5 is found, while the mode θ1 + θ2 = 5 is missing; see the
comparison with Figure 5.17 (Note: in the figure “Component 1” on the x-axis is θ1, whereas
“Component 2” on the y-axis is θ2).
Results AS-Gibbs
Figure 5.22: AS-Gibbs algorithm 13 of Section 5.6: incorrect reconstruction of the pos-
terior for the system having likelihood (5.35). We see that AS-Gibbs algorithm incorrectly
reconstructs the bimodal posterior: only the mode (θ1 + θ2 = −5 and θ3 + θ4 = 5) is found,
whereas the mode (θ1 + θ2 = 5 and θ3 + θ4 = −5) is missing, see comparison with Figure 5.16.
(Note: in the figure “Sum components first mean” on the x-axis is the sum µ1 = θ1 + θ2 ,
whereas “Sum components second mean” on the y-axis is the sum µ2 = θ3 + θ4 ).
Figure 5.23: AS-Gibbs Algorithm 13 of Section 5.6: incorrect reconstruction of the posterior for the system with likelihood (5.35). The figure shows θ1 (“Component 1”) against θ2 (“Component 2”): AS-Gibbs gets stuck in one mode, so only the combination θ1 + θ2 = −5 is found while the mode θ1 + θ2 = 5 is missing; compare with Figure 5.17. (Note: “Component 1” on the x-axis is θ1, whereas “Component 2” on the y-axis is θ2.)
The simplifications in equation (5.36) come from the fact that, for inactive variables, the likelihood does not depend on $i$, i.e. $l(a, i) = l(a)$, and that the active and inactive variables are independent, i.e. $p_i(i \mid a) = p_i(i)$.
The novel proposed approach determines the dimension of the inactive subspace as the largest dimension that, using the prior as importance proposal, yields an ESS that does not drop below some threshold; the dimension of the Active Subspace is then fixed as the complement to n, where n is the dimension of the full space. In the sections below we compare the novel ESS method with the traditional eigenvalue method explained in Section 4.2.2, and we see that they give the same results for the Gaussian model of Section 5.3.2 and for the Banana model of Section 5.4.2.
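To make the procedure concrete, the following minimal sketch (in Python) selects the inactive dimension as described above. The helper names `log_likelihood` and `prior_sample`, the matrix `B` of directions ordered as in Section 4.2.2, and the fixed full-space reference point `theta_ref` are all assumptions for illustration; this is a sketch of the idea, not the code used for the experiments.

import numpy as np

def ess_percent(log_w):
    # ESS of self-normalised importance weights, as a percentage of the sample size.
    w = np.exp(log_w - np.max(log_w))
    w /= w.sum()
    return 100.0 / (len(w) * np.sum(w ** 2))

def select_inactive_dimension(log_likelihood, prior_sample, B, theta_ref,
                              n_samples=1000, threshold=80.0):
    """Return (n_i, n_a): the largest inactive dimension whose prior-proposal ESS
    stays above `threshold` (%), and the complementary active dimension."""
    n = B.shape[0]
    best_ni = 0
    for ni in range(1, n):
        Ba, Bi = B[:, : n - ni], B[:, n - ni:]          # candidate active / inactive bases
        a_ref = Ba.T @ theta_ref                        # active coordinates of a reference point
        log_w = np.empty(n_samples)
        for k in range(n_samples):
            i_k = Bi.T @ prior_sample()                 # inactive coordinates of a prior draw
            log_w[k] = log_likelihood(Ba @ a_ref + Bi @ i_k)  # weight = likelihood (prior cancels)
        if ess_percent(log_w) >= threshold:
            best_ni = ni                                # ESS still high: accept this n_i
        else:
            break                                       # ESS dropped below threshold: stop
    return best_ni, n - best_ni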
Gaussian model
As a recap, we show in Figure 5.24 a copy of the spectral gap for the Gaussian 10D model, obtained using the traditional method, which identifies the size of the Active Subspace as na = 1.
Figure 5.24: Eigenvalues of the 10D Gaussian model: the estimated AS size is 1, considering the spectral gap between eigenvalues 1 and 2. The dimension of the Active Subspace is na = 1.
Figure 5.25 gives a visual representation of the novel ESS method. Recalling that the method consists in finding the largest dimension that yields an ESS that does not drop below some threshold, we see in the figure that the dimension of the inactive subspace in the Gaussian 10D model is ni = 9 (the ESS stays close to 100% up to ni = 9, the number of full bars), and therefore the Active Subspace dimension is na = 10 − 9 = 1. Compare with Figure 5.3, where the traditional eigenvalue method also gives an active dimension of na = 1, so the two methods agree.
Figure 5.25: ESS (%) using the prior as importance proposal for different dimensions of the inactive subspace in the Gaussian 10D model. The largest dimension ni of the inactive subspace that keeps the ESS high is 9 (the number of full bars); we therefore set ni = 9, and the dimension of the Active Subspace is na = 10 − ni = 1. Compare with Figure 5.24, where the traditional eigenvalue method also gives an active dimension of na = 1.
With similar reasoning, we see that in the 25D Gaussian model both the traditional spectral gap in Figure 5.26 and the novel ESS method in Figure 5.27 give the same Active Subspace dimension, na = 1.
Figure 5.26: Eigenvalues of the 25D Gaussian model: the estimated AS size is 1, considering the spectral gap between eigenvalues 1 and 2. The dimension of the Active Subspace is na = 1.
Figure 5.27: ESS (%) using the prior as importance proposal for different dimensions of the inactive subspace in the Gaussian 25D model. The largest dimension ni of the inactive subspace that keeps the ESS high is 24 (the number of full bars); we therefore set ni = 24, and the dimension of the Active Subspace is na = 25 − ni = 1. Compare with Figure 5.26, where the traditional eigenvalue method also gives an active dimension of na = 1.
Banana model
Repeating the same exercise for the Banana model, we see that the two methods again agree: in the Banana 10D model the dimension of the Active Subspace is na = 4 both with the traditional spectral gap method of Figure 5.28 and with the novel ESS method of Figure 5.29.
Figure 5.28: Eigenvalues of the 10D Banana model: the estimated AS size is 4, considering the spectral gap between eigenvalues 4 and 5. The dimension of the Active Subspace is na = 4.
Figure 5.29: ESS (%) using the prior as importance proposal for different dimensions of the inactive subspace in the Banana 10D model. The largest dimension ni of the inactive subspace that keeps the ESS high is 6 (the number of full bars); we therefore set ni = 6, and the dimension of the Active Subspace is na = 10 − ni = 4. Compare with Figure 5.28, where the traditional eigenvalue method also gives an active dimension of na = 4.
For a 25D system the charts are in Figure 5.30 and 5.31, both identically indicating a
dimension of the Active Subspace of na = 4.
Figure 5.30: Eigenvalues of the 25D Banana model: the estimated AS size is 4, considering the spectral gap between eigenvalues 4 and 5. The dimension of the Active Subspace is na = 4.
Figure 5.31: ESS (%) using the prior as importance proposal for different dimensions of the inactive subspace in the Banana 25D model. The largest dimension ni of the inactive subspace that keeps the ESS high is 21 (the number of full bars); we therefore set ni = 21, and the dimension of the Active Subspace is na = 25 − ni = 4. Compare with Figure 5.30, where the traditional eigenvalue method also gives an active dimension of na = 4.
5.9.1 Review
We have introduced an alternative method to determine the dimension of the Active Subspace using the ESS (instead of the traditional spectral gap method explained in Section 4.2.2): we find the largest dimension of the inactive subspace that yields an ESS that does not drop below some threshold. We have shown that, for the Gaussian model of Section 5.3.2 and for the Banana model of Section 5.4.2, it gives identical results to the spectral gap method in determining the dimension of the Active Subspace. We consider the ESS a relevant criterion because, by definition, the inactive subspace should not affect the likelihood and should therefore be informed only by the prior. There is indeed a relationship between the Effective Sample Size (ESS)
method and the eigenvalues method. As Constantine notes in his work [Constantine et al.,
2016], the eigenvalues represent ’the average squared directional derivative of the negative log-
likelihood along the corresponding eigenvector.’ Consequently, a small eigenvalue indicates
minimal variation in that direction, suggesting that the likelihood provides little additional
information along that dimension. Similarly, if the ESS method suggests that the prior
is an effective proposal for the posterior, it implies that the likelihood is not particularly
informative. Together, both a low ESS and small eigenvalues suggest that certain directions
contribute minimally to the posterior, highlighting inactive areas.
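For comparison, the eigenvalue criterion quoted above can be sketched as follows (Python). The helpers `grad_neg_log_lik` and `prior_sample` are assumed names, and sampling from the prior mirrors the Ĉpri approximation used elsewhere in this work; this is an illustration of the spectral-gap rule, not the exact experimental code.

import numpy as np

def spectral_gap_dimension(grad_neg_log_lik, prior_sample, n_samples=1000):
    """Monte Carlo estimate of C = E[grad f(x) grad f(x)^T] and its spectral gap."""
    grads = np.array([grad_neg_log_lik(prior_sample()) for _ in range(n_samples)])
    C_hat = grads.T @ grads / n_samples                 # average of gradient outer products
    eigvals, eigvecs = np.linalg.eigh(C_hat)
    order = np.argsort(eigvals)[::-1]                   # decreasing eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = eigvals[:-1] / np.maximum(eigvals[1:], 1e-300)
    n_a = int(np.argmax(ratios)) + 1                    # largest spectral gap -> AS dimension
    return n_a, eigvals, eigvecs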
5.10 Conclusion
In this chapter we have explored several approaches to using Active Subspaces (AS) within MCMC. We started by highlighting the main problem we are trying to solve, the curse of dimensionality affecting MCMC, and we then explored a number of algorithms that aim to improve performance. We first described the proposed novel Algorithm 10, named AS-PMMH after the application of PMMH [Andrieu et al., 2010] to AS. It can be seen as a theoretical advancement over the AS-MH Algorithm 8, since it uses an SMC sampler instead of Importance Sampling to estimate the likelihood; we therefore expect AS-PMMH to perform better when the inactive subspace is complex to explore. We have seen that AS-PMMH carries an additional computational cost that can make it not always the best choice, and that it shares with AS-MH the possibility that a noisy likelihood estimate may cause “sticky” behaviour (see Section 5.4.3).
We then introduced Algorithm 12, named AS-PMMH-i, where the additional “i” in the acronym stands for inverted: compared to AS-PMMH, we switch the roles of the active and inactive parts, applying the outer MCMC to the inactive space and the inner SMC sampler to the active part, with the aim of spending the largest computational effort of the SMC on the part we consider most valuable, the active part. We concluded that, at least on the models used to test the algorithm, estimating marginals with SMC involves a trade-off: a larger computational cost is needed to obtain a more accurate estimate of the likelihood, so the potential advantages come at the price of additional algorithmic complexity, which can scale considerably depending on the setting (see for example the conclusions in Section 5.4).
We therefore devised a proposed novel application of the Gibbs sampler [Geman and Geman, 1984] to AS, named AS-Gibbs, which has the advantage of not requiring marginal estimates. In cases of near-perfect AS we expect AS-Gibbs to outperform algorithms that rely on estimated marginals, because in AS-Gibbs the marginal is exact [Andrieu and Vihola, 2015]. More generally, when the active and inactive parts are independent, we expect AS-Gibbs to outperform standard MCMC, as explained in Section 5.6.2.
We then proposed a natural extension of AS-Gibbs by introducing the AS-MwPG Algorithm 15 and the AS-MwPG-i Algorithm 17, obtained by applying MwPG [Andrieu et al., 2010] to AS. We consider these algorithms extensions of AS-Gibbs since the conditions in which we expect to use them are similar to those of AS-Gibbs, namely when the active and inactive parts are independent. Both algorithms use an SMC sampler to draw particles, from the inactive subspace (AS-MwPG) or from the active subspace (AS-MwPG-i), and are expected to outperform AS-Gibbs when the inactive or active subspace, respectively, is complex, for example multimodal. We have shown in Section 5.8 a toy example of a multimodal distribution where AS-MwPG-i was the only algorithm capable of correctly reconstructing both modes, whereas the other algorithms got stuck in one of the modes.
We also introduced in Section 5.9 a novel alternative method to determine the dimension of the Active Subspace using the ESS. The idea is to measure how good the prior is as an importance proposal for the inactive subspace, and to pick as inactive dimension the largest dimension of the inactive subspace that yields an ESS that does not drop below some threshold (the dimension of the Active Subspace is then fixed as the complement to n, where n is the dimension of the full space). We have seen that the novel ESS method gives identical results to the traditional spectral gap method of Section 4.2.2 on the Gaussian and Banana models.
We proceed in the next Chapter 6 with the exploration of AS methods in SMC.
Chapter 6
6.1 Introduction
In Chapters 4 and 5 we explored AS applications to MCMC, mainly in order to reduce the problems caused by the curse of dimensionality. While MCMC remained the central component, we added elements from other MC methods such as Importance Sampling, PMMH, MwPG and Gibbs sampling. We now shift the focus to algorithms whose core is SMC, with additional components from other algorithms.
We start this chapter by introducing AS-SMC, an algorithm conceived as the SMC variant of the AS-MH Algorithm 8: while AS-MH uses Importance Sampling to marginalise out the inactive variables and an outer MCMC to sample from the active subspace, AS-SMC still uses IS on the inactive part but employs an SMC sampler to generate samples from the active subspace. AS-SMC addresses the Open Point OP3: we expect AS-SMC to perform better than AS-MH in cases where the active subspace is complex to explore using MCMC, for example for multimodal distributions, where it is difficult to find a good proposal for the Active Subspace. SMC overcomes this limitation by using tempering and intermediate distributions, which ensure that at every step good proposals are available for the next step.
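To make the role of the intermediate distributions concrete, one common construction of such a tempering path is the geometric one (a sketch of the general idea; the specific path used by our algorithms is defined through the tempered likelihood $l_{1:t}$ introduced below):
\[
\pi_t(\theta) \;\propto\; p(\theta)\, l(\theta)^{\gamma_t}, \qquad 0 = \gamma_0 < \gamma_1 < \dots < \gamma_T = 1,
\]
so that $\pi_0$ is the prior, $\pi_T$ is the posterior, and consecutive targets are close to each other, which keeps the importance weights at each step well behaved.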
6.2 AS-SMC
We introduce the AS-SMC Algorithm 18. AS-SMC can be considered the SMC counterpart of the AS-MH Algorithm 8, with the difference that AS-SMC samples from the active marginal using SMC, whereas AS-MH uses MCMC. An AS-MH move is in fact embedded in the SMC sampler in the rejuvenation step (lines 21-32); in particular, the pseudo-marginal is calculated in lines 24-27. We show the algorithm below and give a formal justification in Section 6.2.1.
Alg. 18 AS-SMC
1: Simulate $N_a$ points $\{\theta_0^m\}_{m=1}^{N_a} \sim p$ and set each weight $\omega_0^m = 1/N_a$; ▷ Draw $N_a$ particles from the prior
2: for $m = 1 : N_a$ do ▷ For each active particle, draw $N_i$ inactive particles
3:   Set $a_0^m = B_a^T \theta_0^m$, $i_0^{1,m} = B_i^T \theta_0^m$ and $u_0^m = 1$;
4:   for $n = 2 : N_i$ do $i_0^{n,m} \sim q_{0,i}(\cdot \mid a_0^m) := p_i(\cdot \mid a_0^m)$;
5:   end for
6:   Set $w_0^{n,m} = 1/N_i$ for $n = 1 : N_i$;
7: end for
8: for $t = 1 : T$ do
9:   for $m = 1 : N_a$ do ▷ reweight
10:    for $n = 1 : N_i$ do
11:     $\tilde{w}_t^{n,m}(a_{t-1}^m, i_{t-1}^{n,m}) = \dfrac{p_i(i_{t-1}^{n,m} \mid a_{t-1}^m)\, l_{1:t}(B_a a_{t-1}^m + B_i i_{t-1}^{n,m})}{q_{t,i}(i_{t-1}^{n,m} \mid a_{t-1}^m)}$;
12:    end for
13:    $\tilde{\omega}_t^m = \omega_{t-1}^m\, \dfrac{\prod_{j=1}^{N_i} q_{t,i}(i_{t-1}^{j,m} \mid a_{t-1}^m)\; \sum_{n=1}^{N_i} \tilde{w}_t^{n,m}(a_{t-1}^m, i_{t-1}^{n,m})}{\prod_{j=1}^{N_i} q_{t-1,i}(i_{t-1}^{j,m} \mid a_{t-1}^m)\; \sum_{n=1}^{N_i} \tilde{w}_{t-1}^{n,m}(a_{t-1}^m, i_{t-1}^{n,m})}$;
14-20: end for; normalise the outer and inner weights; if some degeneracy condition is met then resample and move:
21-23: for $m = 1 : N_a$ do: simulate $(a_t^m, i_t^{1:N_i,m})$ from the mixture distribution
\[ \sum_{j=1}^{N_a} \omega_t^j\, K_{t,a}\big(\cdot \mid a_{t-1}^j, i_{t-1}^{1:N_i,j}\big), \]
where $K_{t,a}$ is an AS-MH move, i.e. $j^* \sim \mathcal{M}\big(\{\omega_t^j\}_{j=1}^{N_a}\big)$ and $a_t^{*m} \sim q_a(\cdot \mid a_t^{j^*})$;
24:    for $n = 1 : N_i$ do ▷ Calculate pseudo-marginal
25:     $i_t^{*n,m} \sim q_{t,i}(\cdot \mid a_t^{*m})$;
26:     $\tilde{w}_t^{n,m}(a_t^{*m}, i_t^{*n,m}) = \dfrac{p_i(i_t^{*n,m} \mid a_t^{*m})\, l_{1:t}(B_a a_t^{*m} + B_i i_t^{*n,m})}{q_{t,i}(i_t^{*n,m} \mid a_t^{*m})}$;
27-28: end for; normalise the proposed inner weights;
29:    Set $\big(a_t^m, \{i_t^{n,m}, \tilde{w}_t^{n,m}\}_{n=1}^{N_i}, u_t^m\big) = \big(a_t^{*m}, \{i_t^{*n,m}, \tilde{w}_t^{*n,m}\}_{n=1}^{N_i}, u_t^{*m}\big)$ with probability
\[ \alpha_{t,a}^m = 1 \wedge \frac{p_a(a_t^{*m})\, \sum_{n=1}^{N_i} \tilde{w}_t^{n,m}(a_t^{*m}, i_t^{*n,m})\; q_{t,a}(a_t^{j^*} \mid a_t^{*m})}{p_a(a_t^{j^*})\, \sum_{n=1}^{N_i} \tilde{w}_t^{n,j^*}(a_{t-1}^{j^*}, i_{t-1}^{n,j^*})\; q_{t,a}(a_t^{*m} \mid a_t^{j^*})}; \]
30:    else let $\big(a_t^m, \{i_t^{n,m}, \tilde{w}_t^{n,m}\}_{n=1}^{N_i}, u_t^m\big) = \big(a_t^{j^*}, \{i_t^{n,j^*}, \tilde{w}_t^{n,j^*}\}_{n=1}^{N_i}, u_t^{j^*}\big)$;
31:   end for
      set $\omega_t^m = 1/N_a$ for $m = 1 : N_a$;
32:  end if
33: end for
6.2.1 Formal justification
Our aim is to use an SMC sampler to simulate from the target distribution π(a, i) of equation
(4.17). We borrow the idea from [Chopin et al., 2012] and we follow the same derivation.
The target distribution $\pi_t$ used at the $t$th iteration is chosen to be
\[
\pi_t\!\left(a, \{i^n\}_{n=1}^{N_i}\right) = \frac{1}{Z_t}\, p_a(a) \prod_{j=1}^{N_i} q_{t,i}\!\left(i^j \mid a\right) \left( \frac{1}{N_i} \sum_{n=1}^{N_i} \frac{p_i(i^n \mid a)\, l_{1:t}(B_a a + B_i i^n)}{q_{t,i}(i^n \mid a)} \right) \qquad (6.1)
\]
where $Z_t$ is a normalising constant. Following the derivation in [Chopin et al., 2012], we may rearrange this target as follows:
\[
\begin{aligned}
\pi_t\!\left(a, \{i^n\}_{n=1}^{N_i}\right) &= \frac{1}{Z_t}\, p_a(a) \prod_{j=1}^{N_i} q_{t,i}\!\left(i^j \mid a\right) \left( \frac{1}{N_i} \sum_{n=1}^{N_i} \frac{p_i(i^n \mid a)\, l_{1:t}(B_a a + B_i i^n)}{q_{t,i}(i^n \mid a)} \right) \\
&= \frac{1}{N_i} \sum_{n=1}^{N_i} \frac{p_a(a)\, p_i(i^n \mid a)\, l_{1:t}(B_a a + B_i i^n)}{Z_t} \prod_{\substack{j=1 \\ j \neq n}}^{N_i} q_{t,i}\!\left(i^j \mid a\right) \\
&= \frac{\pi_{t,a}(a)}{N_i} \sum_{n=1}^{N_i} \pi_{t,i}(i^n \mid a) \prod_{\substack{j=1 \\ j \neq n}}^{N_i} q_{t,i}\!\left(i^j \mid a\right) \qquad (6.2)
\end{aligned}
\]
Equation (6.2) shows that the target at iteration $t$, $\pi_t(a, i) = \pi_{t,a}(a)\, \pi_{t,i}(i \mid a)$, is included in its marginals.
The weight update of the SMC sampler, on line 11 of the algorithm, is obtained from
equation (6.1) and by considering that
\[
\tilde{\omega}_t^m = \omega_{t-1}^m\, \frac{\pi_t\!\left(a_{t-1}^m, \{i_{t-1}^{n,m}\}_{n=1}^{N_i}\right)}{\pi_{t-1}\!\left(a_{t-1}^m, \{i_{t-1}^{n,m}\}_{n=1}^{N_i}\right)} \qquad (6.3)
\]
\[
= \omega_{t-1}^m\, \frac{p_a(a_{t-1}^m)\, \prod_{j=1}^{N_i} q_{t,i}(i_{t-1}^{j,m} \mid a_{t-1}^m)\; \frac{1}{N_i}\sum_{n=1}^{N_i} \tilde{w}_t^{n,m}(a_{t-1}^m, i_{t-1}^{n,m})}{p_a(a_{t-1}^m)\, \prod_{j=1}^{N_i} q_{t-1,i}(i_{t-1}^{j,m} \mid a_{t-1}^m)\; \frac{1}{N_i}\sum_{n=1}^{N_i} \tilde{w}_{t-1}^{n,m}(a_{t-1}^m, i_{t-1}^{n,m})} \qquad (6.4)
\]
\[
= \omega_{t-1}^m\, \frac{\prod_{j=1}^{N_i} q_{t,i}(i_{t-1}^{j,m} \mid a_{t-1}^m)\; \sum_{n=1}^{N_i} \tilde{w}_t^{n,m}(a_{t-1}^m, i_{t-1}^{n,m})}{\prod_{j=1}^{N_i} q_{t-1,i}(i_{t-1}^{j,m} \mid a_{t-1}^m)\; \sum_{n=1}^{N_i} \tilde{w}_{t-1}^{n,m}(a_{t-1}^m, i_{t-1}^{n,m})}, \qquad (6.5)
\]
where, in (6.3), $\omega^m$ represents the outer weight of particle $m$, whereas $w^{n,m}$ is the inner weight of inactive particle $n$ belonging to particle $m$; the tilde denotes un-normalised quantities.
Following also [Chopin et al., 2012], in similar fashion to what was done in equations (4.23) and (4.23) for AS-MH, for estimating the expectation of a function $h$ with respect to $\pi_t$ we may use
\[
\sum_{m=1}^{N_a} \omega_t^m\, h\!\left(B_a a_t^m + B_i i_t^{u_t^m,\,m}\right) \qquad (6.6)
\]
or
\[
\sum_{m=1}^{N_a} \sum_{n=1}^{N_i} \omega_t^m\, w_t^{n,m}\, h\!\left(B_a a_t^m + B_i i_t^{n,m}\right). \qquad (6.7)
\]
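As an illustration of how these quantities can be computed in practice, the following minimal sketch (Python) evaluates the inner importance-sampling weights of line 11 of Algorithm 18 and the estimator in (6.7). The names `log_prior_i`, `log_q_i`, `log_lik_tempered` and the particle containers are assumptions for illustration, not the thesis implementation; the inner weights passed to the estimator are assumed already normalised.

import numpy as np

def inner_log_weights(a, i_particles, Ba, Bi, t, log_prior_i, log_q_i, log_lik_tempered):
    """log w~_t^n = log p_i(i^n|a) + log l_{1:t}(Ba a + Bi i^n) - log q_{t,i}(i^n|a)."""
    log_w = np.empty(len(i_particles))
    for n, i_n in enumerate(i_particles):
        x = Ba @ a + Bi @ i_n                                # map back to the full space
        log_w[n] = log_prior_i(i_n, a) + log_lik_tempered(x, t) - log_q_i(i_n, a)
    return log_w

def estimate_expectation(h, actives, inactives, outer_w, inner_w, Ba, Bi):
    """Estimator (6.7): sum_m sum_n omega_t^m w_t^{n,m} h(Ba a_t^m + Bi i_t^{n,m})."""
    total = 0.0
    for m, a in enumerate(actives):
        for n, i_n in enumerate(inactives[m]):
            total += outer_w[m] * inner_w[m][n] * h(Ba @ a + Bi @ i_n)
    return total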
We start with comparisons on the 25D Gaussian model introduced in Section 5.3.2, which, as a reminder, has an Active Subspace of dimension 1. We have performed 50 runs each of the standard SMC algorithm and of the AS-SMC Algorithm 18. In order to have the same number of likelihood evaluations for a fair comparison, we have used 10000 particles for the standard SMC and Na = 1000 particles in AS-SMC with Ni = 10 inactive variables, so that Na × Ni = 10000, and we have also used the same tempering path for both algorithms. Figure 6.1 shows the distribution of the RMSE between the true mean and the mean estimated by each algorithm across the runs:
Figure 6.1: Violin plots of the distribution of the RMSE between the true posterior mean and the mean estimated by each algorithm over 50 runs on the 25D Gaussian model. The performance of the standard SMC appears to be worse than that of AS-SMC. This is probably because we have a good estimate of the likelihood in AS-SMC (in the Gaussian model the Importance Sampler seems to behave well on the inactive subspace even in high dimensions), coupled with the fact that in the AS version the SMC operates on a 1D subspace instead of the full 25D space used by the non-AS SMC.
We see in Figure 6.1 that the performance of the standard SMC appears to be worse than that of AS-SMC. This may be expected if we remember that, in the Gaussian model, the pseudo-marginal estimate of the likelihood behaves well even in fairly high dimensions of the inactive subspace (note the small tails of the AS-SMC distribution in Figure 6.1), coupled with the fact that the SMC sampler in AS-SMC acts on a 1D space, whereas the SMC sampler in the standard SMC algorithm acts on the full 25D space.
We proceed with comparisons on the 25D Banana model introduced in Section 5.4.2, which, as a reminder, has an Active Subspace of dimension 4; we have also seen in Chapter 5 that the Banana model poses some challenges to the algorithms (there were cases of long tails in the RMSE distributions, for example in Figure 5.13, indicating a poor estimate of the likelihood). We have performed 50 runs each of the standard SMC algorithm and of the AS-SMC Algorithm 18, under the same conditions as in the previous paragraph. Figure 6.2 shows the distribution of the RMSE between the true mean and the mean estimated by each algorithm across the runs:
Figure 6.2: Violin plots of the distribution of the RMSE between the true posterior mean and the mean estimated by each algorithm over 50 runs on the 25D Banana model. The performance of the standard SMC and of AS-SMC appears approximately equal in terms of mean and upper and lower quartiles, but AS-SMC shows some tails, which again suggest that the estimate of the likelihood may be poor in some cases and that the algorithm can get stuck in the tail of the distribution. This is probably because 10 inactive variables are too few to explore the 21D inactive subspace of the Banana model, which makes the importance sampling estimate of the likelihood noisy.
We see in Figure 6.2 that the performance of the standard SMC and of AS-SMC is approximately equal in terms of mean and quartiles, but AS-SMC shows some tails, again suggesting that the likelihood estimate may be poor in some cases and that the algorithm can get stuck in the tail of the distribution; this is probably because 10 inactive variables are too few to explore the 21D inactive subspace of the Banana model, making the importance sampling estimate of the likelihood noisy.
6.2.3 Review
We have introduced the AS-SMC algorithm, the SMC equivalent of AS-MH. We have seen that it outperforms the standard SMC in cases where Importance Sampling is a good tool for exploring the inactive subspace (see for example Figure 6.1); but for a more complex inactive subspace, AS-SMC can run into problems, particularly if the estimate of the likelihood is noisy: see for example Figure 6.2, where using 10 particles to explore the inactive subspace probably creates long tails in the RMSE distribution because of the noisy likelihood estimate. This potentially indicates the need for a tool that is more refined than Importance Sampling when the inactive subspace is challenging to explore; with this aim in mind, we introduce the AS-SMC2 algorithm in Section 6.3.
6.3 AS-SMC2
We introduced SMC2 [Chopin et al., 2012] in Section 2.9. As a recap of the algorithm: at each step of the tempering of an outer SMC, a particle MCMC is run to obtain an estimate of the likelihood (hence the name SMC2). In our case the outer SMC operates on the Active Subspace and the inner one on the inactive subspace; it can be thought of as the SMC version of the AS-PMMH introduced in Section 5.2. The purpose of introducing SMC2 is to obtain an algorithm that can deal more effectively than AS-SMC with a complex inactive subspace: we have seen in Section 6.2 that the performance of the internal Importance Sampler degrades as we go from a relatively simple inactive subspace, as in the Gaussian model (see Figure 6.1), to the more complex Banana model (see the results in Figure 6.2). We present Algorithm 19 below and give the formal justification in Section 6.3.1. We then show some results in Section 6.3.2.
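Before giving the algorithm, the following generic sketch (Python) illustrates the key ingredient: an inner SMC sampler run on the inactive subspace that returns both particles and an estimate of the marginal likelihood, obtained as the product of average incremental weights across the inner tempering steps. The helper names `log_incremental` and `propagate` are assumptions, and this is a generic illustration of the idea, not the exact estimator defined in Algorithm 19.

import numpy as np

def inner_smc_likelihood(a, init_particles, temperatures, log_incremental, propagate, rng):
    """Run an SMC on the inactive subspace for a fixed active point `a`; return the final
    particles and the log of the product of average incremental weights."""
    particles = list(init_particles)
    n = len(particles)
    log_like_hat = 0.0
    for s in range(1, len(temperatures)):
        log_w = np.array([log_incremental(a, i, temperatures[s - 1], temperatures[s])
                          for i in particles])
        c = log_w.max()
        log_like_hat += c + np.log(np.mean(np.exp(log_w - c)))   # log of the average weight
        w = np.exp(log_w - c)
        w /= w.sum()
        idx = rng.choice(n, size=n, p=w)                          # multinomial resampling
        particles = [propagate(a, particles[k], temperatures[s]) for k in idx]
    return particles, log_like_hat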
Alg. 19 AS-SMC2
1: Simulate $N_a$ points $\{\theta_0^m\}_{m=1}^{N_a} \sim p$ and set each weight $\omega_0^m = 1/N_a$;
2: for $m = 1 : N_a$ do
3:   Set $a_0^m = B_a^T \theta_0^m$, $i_0^{1,m} = B_i^T \theta_0^m$ and $u_0^m = 1$;
4:   for $n = 2 : N_i$ do
5:     $i_0^{n,m} \sim q_{0,i}(\cdot \mid a_0^m) := p_i(\cdot \mid a_0^m)$;
6:   end for
7:   Set $w_0^{n,m} = 1/N_i$ for $n = 1 : N_i$;
8: end for
9: for $t = 1 : T$ do
10:  for $m = 1 : N_a$ do ▷ reweight
11:   if $t = 1$ then
12:    Simulate $(i_t^{1:N_i,m}, v_{t-1}^{1:N_i,m})$ using lines 3-30 (ignoring line 11) of Algorithm 9, taking $s$ in these lines equal to $t$, then compute
\[ \widehat{l_{t,a}}(a_{t-1}^m) = \sum_{n=1}^{N_i} \tilde{w}_t^{n,m}; \qquad \tilde{\omega}_t^m = \omega_{t-1}^m\, \widehat{l_{t,a}}(a_{t-1}^m); \]
13:   else
14:    Simulate $(i_t^{1:N_i,m}, v_{t-1}^{1:N_i,m})$ using lines 3-30 (ignoring line 11) of Algorithm 9, taking $s$ in these lines equal to $t$, then compute
\[ \frac{\widehat{l_{t,a}}(a_{t-1}^m)}{l_{t-1,a}(a_{t-1}^m)} = \sum_{n=1}^{N_i} \tilde{w}_t^{n,m}; \qquad \tilde{\omega}_t^m = \omega_{t-1}^m\, \frac{\widehat{l_{t,a}}(a_{t-1}^m)}{l_{t-1,a}(a_{t-1}^m)}; \]
15:   end if
16:  end for
17:  $\{\omega_t^m\}_{m=1}^{N_a} \leftarrow$ normalise $\{\tilde{\omega}_t^m\}_{m=1}^{N_a}$;
followed by a resample-and-move step in which new particles are simulated from the mixture distribution
\[ \sum_{j=1}^{N_a} \omega_t^j\, K_{t,a}\big(\cdot \mid a_{t-1}^j, i_t^{1:N_i,j}, v_{t-1}^{1:N_i,j}\big), \]
6.3.1 Formal justification
Our aim is to target the distribution
\[
\pi_t(a, i) = \frac{p_a(a)\, p_i(i \mid a)\, l_{1:t}(B_a a + B_i i)}{Z_t},
\]
with marginal distribution $\pi_{t,a}(a)$, from which we simulate $N_a$ $a$-points and $N_i$ $i$-points for every $a$. At $t = 1$ each particle is assigned the weight
\[
\hat{l}_{1,a}(a) = \frac{1}{N_i} \sum_{n=1}^{N_i} l_1\!\left(B_a a + B_i i_0^n\right). \qquad (6.8)
\]
We introduce notation for the distribution of the $i$-variables generated at iteration 0 used to estimate $l_{1,a}$: $\psi_0\!\left(\{i_0^n\}_{n=1}^{N_i} \mid a\right) = \prod_{n=1}^{N_i} p_i(i_0^n \mid a)$. The target distribution at $t = 1$ is
\[
\pi_1\!\left(a, \{i_0^n\}_{n=1}^{N_i}\right) = p_a(a)\, \psi_0\!\left(\{i_0^n\}_{n=1}^{N_i} \mid a\right) \frac{\hat{l}_{1,a}(a)}{Z_1}. \qquad (6.9)
\]
This results in the weight update in equation (6.8). We can rewrite the target as
\[
\begin{aligned}
\pi_1\!\left(a, \{i_0^n\}_{n=1}^{N_i}\right) &= \frac{p_a(a)}{Z_1} \prod_{n=1}^{N_i} p_i(i_0^n \mid a) \left( \frac{1}{N_i} \sum_{n=1}^{N_i} l_1(B_a a + B_i i_0^n) \right) \\
&= \frac{1}{N_i} \sum_{n=1}^{N_i} \frac{p_a(a)\, p_i(i_0^n \mid a)\, l_1(B_a a + B_i i_0^n)}{Z_1} \prod_{\substack{j=1 \\ j \neq n}}^{N_i} p_i(i_0^j \mid a) \\
&= \frac{\pi_{1,a}(a)}{N_i} \sum_{n=1}^{N_i} \pi_{1,i}(i_0^n \mid a) \prod_{\substack{j=1 \\ j \neq n}}^{N_i} p_i(i_0^j \mid a), \qquad (6.10)
\end{aligned}
\]
which includes in its marginals $\pi_1(a, i_0^n) = \pi_{1,a}(a)\, \pi_{1,i}(i_0^n \mid a)$.
For $t \geq 2$, similarly to equation (6.9), our target distribution is again defined to be the prior, multiplied by the likelihood estimate, multiplied by the distribution of the variables used in the likelihood estimator, and divided by the normalising constant (which again follows from the unbiasedness of the likelihood estimator).
Let ψt−1 be the distribution of all of the random variables generated by the internal SMC
up to time t.
\[
\pi_t\!\left(a, \{i_{0:t-1}^n, v_{0:t-1}^n\}_{n=1}^{N_i}\right) = p_a(a)\, \psi_{t-1}\!\left(\{i_{0:t-1}^n, v_{0:t-1}^n\}_{n=1}^{N_i} \mid a\right) \frac{\hat{l}_{t,a}(a)}{Z_t}. \qquad (6.11)
\]
Similarly to the $t = 1$ case, we may rearrange equation (6.11) to see that $\pi_t(a, i)$ is included in its marginals:
\[
\pi_t\!\left(a, \{i_{0:t-1}^n, v_{0:t-1}^n\}_{n=1}^{N_i}\right) = \frac{\pi_{t,a}(a)}{N_i} \sum_{n=1}^{N_i} \frac{\pi_{t,i}\!\left(i_{1:t-1}^n \mid a\right)}{N_i^{t-1}} \times \prod_{\substack{j=1 \\ j \neq h_t^n(0)}}^{N_i} p_i(i_0^j \mid a) \prod_{s=2}^{t} \prod_{\substack{j=1 \\ j \neq h_t^n(s-1)}}^{N_i} w_{s-1}^{v_{s-1}^j}\, K_{s-1,i}\!\left(i_{s-1}^j \,\middle|\, i_{s-2}^{v_{s-2}^j}, a\right),
\]
where $i_{1:t-1}^n$ and $h_t^n$ are deterministic functions of $\{i_{0:t-1}^n\}_{n=1}^{N_i}$ and $\{v_{0:t-1}^n\}_{n=1}^{N_i}$: $h_t^n = (h_t^n(0), \ldots, h_t^n(t-1))$ denotes the index history of $v_{t-1}^n$, i.e. $h_t^n(t-1) = n$ and $h_t^n(s) = v_s^{h_t^n(s+1)}$, recursively for $s = t-2, \ldots, 0$, and $i_{1:t-1}^n = \left(i_{1:t-1}^n(0), \ldots, i_{1:t-1}^n(t-1)\right)$ denotes the state trajectory of particle $i_{t-1}^n$, i.e. $i_{1:t-1}^n(s) = i_s^{h_t^n(s)}$, for $s = 0, \ldots, t-1$.
For the remainder of the proof we follow [Chopin et al., 2012], with the one difference in notation from that paper that here the index of the $i$ variable is one fewer: i.e. equation (6.11) uses $\psi_{t-1}\!\left(\{i_{0:t-1}^n, v_{0:t-1}^n\}_{n=1}^{N_i} \mid a\right)$, whereas the equivalent in [Chopin et al., 2012] would be $\psi_t\!\left(\{i_{1:t}^n, v_{1:t-1}^n\}_{n=1}^{N_i} \mid a\right)$. The reason is that here the weight update in the internal SMC only involves the values of the particles from the previous iteration.
10 inactive variables, so that it has the same number of likelihood evaluations (the same tempering sequence has been used). For SMC2 we have used a significantly higher number of likelihood evaluations, approximately 300000; this has been necessary because of the structure of the algorithm, with its two SMC parts. SMC2 will require future tuning to reduce the significant additional complexity without compromising its positive aspects. Figure 6.3 shows the distribution of the RMSE between the true mean and the mean estimated by each algorithm across the runs:
Figure 6.3: Violin plots of the distribution of the RMSE between the true posterior mean and the mean estimated by each algorithm over 50 runs on the 25D Banana model. SMC2 has a lower mean and upper quartile than all the other algorithms, but it shows longer tails: a sign that, although the algorithm currently uses significantly more likelihood evaluations than the others, it may still get stuck in one of the tails, probably indicating that additional tuning of the algorithm is necessary.
We see in Figure 6.3 that SMC2 has a lower mean and upper quartile than all the other algorithms, but it shows longer tails. This is a sign that, although the algorithm currently uses significantly more likelihood evaluations than the others, it still suffers from the same problem as AS-PMMH, i.e. we are spending a considerable amount of computational effort on the inactive part, which is the least interesting to us. The algorithm may still get stuck in one of the tails, and future tuning of the algorithm is necessary.
6.3.3 Review
We have compared the performance of SMC2 to the standard SMC and AS-SMC algorithms. While the same number of likelihood evaluations was used for SMC and AS-SMC, each of the SMC2 runs required approximately 300000 additional likelihood evaluations; future tuning of the algorithm is expected to bring this additional cost down while keeping the benefits. We have seen that there is a trade-off in the current setting: the additional complexity allows for a better mean and upper quartile in the estimate of the expectation (Figure 6.3), but it produces significantly longer tails in the RMSE distribution than the other two methods, probably indicating that the additional algorithmic complexity is still not enough and that future tuning is needed. Overall, with the models explored so far, the use of SMC2 does not seem justified: the additional overhead is significant and does not seem to bring comparable benefits, and, as in the AS-PMMH case, we are still spending a considerable amount of computational effort on the inactive part, which is the least interesting to us.
SMC algorithm, is that, in general, the dimension of the Active Subspace may change at every tempering step, and so we had to find a way to ensure the theoretical consistency of the algorithm (mainly, that the algorithm still approximates the intended posterior). We found a solution in [Chopin et al., 2012]: we borrowed an idea used in that paper for the automatic calibration of the number of particles in the particle filter, where one particle is drawn from the current posterior approximation and then used to generate conditioned samples in the following approximation step. We explain this move mathematically, and how we have used it, in the next section.
The meaning of the symbols in (6.12) is: $\theta_a$ is active particle $m$, at iteration $t$, for the active subspace determined by $B_{a_{t-1}}$. For the inactive variables we use the template
\[
\theta_{t,i_{t-1}}^{m,n} \qquad (6.13)
\]
The meaning of the symbols in (6.13) is: $\theta_i$ is the inactive sub-particle number $n$ for the active particle $m$, at iteration $t$, for the inactive subspace determined by $B_{i_{t-1}}$. For ease of reference, we rewrite Algorithm 18 using the notation of (6.12) and (6.13); the rewritten version is given below as Algorithm 20.
Alg. 20 AS-SMC (same algorithm as 18, with different notation)
1: Simulate $N_a$ points $\{\theta_0^m\}_{m=1}^{N_a} \sim p$ and set each weight $\omega_0^m = 1/N_a$;
2: for $m = 1 : N_a$ do
3:   Set $\theta_{0,a}^m = B_a^T \theta_0^m$, $\theta_{0,i}^{1,m} = B_i^T \theta_0^m$ and $u_0^m = 1$;
4:   for $n = 2 : N_i$ do
5:     $\theta_{0,i}^{n,m} \sim q_{0,i}(\cdot \mid \theta_{0,a}^m) := p_i(\cdot \mid \theta_{0,a}^m)$;
6:   end for
7:   Set $w_0^{n,m} = 1/N_i$ for $n = 1 : N_i$;
8: end for
9: for $t = 1 : T$ do
10:  for $m = 1 : N_\theta$ do ▷ reweight
11:   for $n = 1 : N_i$ do
12:     $\tilde{w}_t^{n,m}(\theta_{t-1,a}^m, \theta_{t-1,i}^{n,m}) = \dfrac{p_i(\theta_{t-1,i}^{n,m} \mid \theta_{t-1,a}^m)\, l_{1:t}(B_a \theta_{t-1,a}^m + B_i \theta_{t-1,i}^{n,m})}{q_{t,i}(\theta_{t-1,i}^{n,m} \mid \theta_{t-1,a}^m)}$;
13:   end for
14:    $\tilde{\omega}_t^m = \omega_{t-1}^m\, \dfrac{\prod_{j=1}^{N_i} q_{t,i}(\theta_{t-1,i}^{j,m} \mid \theta_{t-1,a}^m)\; \sum_{n=1}^{N_i} \tilde{w}_t^{n,m}(\theta_{t-1,a}^m, \theta_{t-1,i}^{n,m})}{\prod_{j=1}^{N_i} q_{t-1,i}(\theta_{t-1,i}^{j,m} \mid \theta_{t-1,a}^m)\; \sum_{n=1}^{N_i} \tilde{w}_{t-1}^{n,m}(\theta_{t-1,a}^m, \theta_{t-1,i}^{n,m})}$;
15-16: end for; normalise the outer weights $\{\tilde{\omega}_t^m\}_{m=1}^{N_a}$;
17:  for $m = 1 : N_a$ do
18:    $\{w_t^{n,m}\}_{n=1}^{N_i} \leftarrow$ normalise $\{\tilde{w}_t^{n,m}\}_{n=1}^{N_i}$;
19:  end for
20:  if some degeneracy condition is met then resample and move
21:   for $m = 1 : N_\theta$ do
22:    Simulate $(\theta_{t,a}^m, \theta_{t,i}^{1:N_i,m})$ from the mixture distribution
\[ \sum_{j=1}^{N_a} \omega_t^j\, K_{t,\theta}\big(\cdot \mid \theta_{t-1,a}^j, \theta_{t-1,i}^{1:N_i,j}\big); \]
23-28: perform the AS-MH move $K_{t,\theta}$ as in Algorithm 18, normalising the proposed inner weights as
\[ w_t^{*n,m} = \frac{\tilde{w}_t^{*n,m}}{\sum_{p=1}^{N_i} \tilde{w}_t^{*p,m}}; \]
29:    Set $\big(\theta_{t,a}^m, \{\theta_{t,i}^{n,m}, \tilde{w}_t^{n,m}\}_{n=1}^{N_i}, u_t^m\big) = \big(\theta_{t,a}^{*m}, \{\theta_{t,i}^{*n,m}, \tilde{w}_t^{*n,m}\}_{n=1}^{N_i}, u_t^{*m}\big)$ with probability
\[ \alpha_{t,a}^m = 1 \wedge \frac{p_a(\theta_{t,a}^{*m})\, \sum_{n=1}^{N_i} \tilde{w}_t^{*n,m}(\theta_{t,a}^{*m}, \theta_{t,i}^{*n,m})\; q_{t,a}(\theta_{t,a}^{j^*} \mid \theta_{t,a}^{*m})}{p_a(\theta_{t,a}^{j^*})\, \sum_{n=1}^{N_i} \tilde{w}_t^{n,j^*}(\theta_{t-1,a}^{j^*}, \theta_{t-1,i}^{n,j^*})\; q_{t,a}(\theta_{t,a}^{*m} \mid \theta_{t,a}^{j^*})}; \]
30:    else let $\big(\theta_{t,a}^m, \{\theta_{t,i}^{n,m}, \tilde{w}_t^{n,m}\}_{n=1}^{N_i}, u_t^m\big) = \big(\theta_{t,a}^{j^*}, \{\theta_{t,i}^{n,j^*}, \tilde{w}_t^{n,j^*}\}_{n=1}^{N_i}, u_t^{j^*}\big)$;
31:   end for
32:   $\omega_t^m = 1/N_a$ for $m = 1 : N_a$;
33:  end if
34: end for
As said, we borrowed the idea from [Chopin et al., 2012]. At each iteration (we use $t$ as the iteration variable) we start from the adaptive version of equation (4.12), which is equation (6.14), where we can see the tempered likelihood $\log l_{1:t}$ featuring:
\[
\hat{C}_t = \sum_{m=1}^{N_a} \sum_{n=1}^{N_i} \omega_{t-1}^m\, w_{t-1}^{n,m}\, \nabla \log l_{1:t}\!\left(B_{a_{t-1}} \theta_{t-1,a_{t-1}}^m + B_{i_{t-1}} \theta_{t-1,i_{t-1}}^{n,m}\right) \nabla \log l_{1:t}\!\left(B_{a_{t-1}} \theta_{t-1,a_{t-1}}^m + B_{i_{t-1}} \theta_{t-1,i_{t-1}}^{n,m}\right)^T \qquad (6.14)
\]
where we have used formula (6.7) to calculate the expectation. By using the approximation $\hat{C}_t$ instead of the $\hat{C}$ of (4.12), and following the procedure described in Sections 4.2 and 4.3, we find the matrices current to iteration $t$, $B_{a_t}$ and $B_{i_t}$, which also give the dimensions of the current active and inactive subspaces, $d_{t,a}$ and $d_{t,i}$, with $d_{t,a} + d_{t,i} = d$, where $d$ is the dimension of the state space.
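To make the adaptation step concrete, a minimal sketch (Python) of computing $\hat{C}_t$ as in (6.14) and extracting the new bases is given below. The names `grad_log_lik_tempered`, the particle containers and `spectral_gap_rule` are assumptions for illustration, not the thesis code.

import numpy as np

def adapt_active_subspace(actives, inactives, outer_w, inner_w, Ba_prev, Bi_prev,
                          grad_log_lik_tempered, t, spectral_gap_rule):
    """Weighted gradient outer-product estimate C_t and the bases of the new AS."""
    d = Ba_prev.shape[0]
    C_t = np.zeros((d, d))
    for m, a in enumerate(actives):
        for n, i_n in enumerate(inactives[m]):
            x = Ba_prev @ a + Bi_prev @ i_n            # full-space point of sub-particle (m, n)
            g = grad_log_lik_tempered(x, t)
            C_t += outer_w[m] * inner_w[m][n] * np.outer(g, g)
    eigvals, eigvecs = np.linalg.eigh(C_t)
    order = np.argsort(eigvals)[::-1]                  # sort eigenvalues in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    d_a = spectral_gap_rule(eigvals)                   # e.g. position of the largest gap
    return eigvecs[:, :d_a], eigvecs[:, d_a:], d_a     # (B_{a_t}, B_{i_t}, d_{t,a})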
The target at time $t$ is given in equation (6.15) [Chopin et al., 2012]; we can recognise the terms of the pseudo-marginal part and the proposal for the inactive variables:
\[
\begin{aligned}
\pi_{t,a_t,i_t}\!\left(\theta_{a_t}, \{\theta_{i_t}^n\}_{n=1}^{N_i}\right) &= \frac{1}{Z_t}\, p_a(\theta_{a_t}) \prod_{j=1}^{N_i} q_{t,i_t}\!\left(\theta_{i_t}^j \mid \theta_{a_t}\right) \left( \frac{1}{N_i} \sum_{n=1}^{N_i} \frac{p_{i_t}(\theta_{i_t}^n \mid \theta_{a_t})\, l_{1:t}(B_{a_t}\theta_{a_t} + B_{i_t}\theta_{i_t}^n)}{q_{t,i_t}(\theta_{i_t}^n \mid \theta_{a_t})} \right) \\
&= \frac{\pi_{t,a_t}(\theta_{a_t})}{N_i} \sum_{n=1}^{N_i} \pi_{t,i_t}\!\left(\theta_{i_t}^n \mid \theta_{a_t}\right) \prod_{\substack{j=1 \\ j \neq n}}^{N_i} q_{t,i_t}\!\left(\theta_{i_t}^j \mid \theta_{a_t}\right) \qquad (6.15)
\end{aligned}
\]
j6=n
at iteration t, where the first subscript in πt,at ,it denotes that this is the target for iteration
t, and the second and third denote that the spaces of active and inactive variables are those
determined by the adaptive procedure to be used at iteration t. Since the active subspace
has changed between iterations, we additionally need to define the target from the previous
iteration for the current active subspace. This is given by
\[
\begin{aligned}
\pi_{t-1,a_t,i_t}\!\left(\theta_{a_t}, \{\theta_{i_t}^n\}_{n=1}^{N_i}\right) &= \frac{1}{Z_{t-1}}\, p_a(\theta_{a_t}) \prod_{j=1}^{N_i} q_{t-1,i_t}\!\left(\theta_{i_t}^j \mid \theta_{a_t}\right) \left( \frac{1}{N_i} \sum_{n=1}^{N_i} \frac{p_{i_t}(\theta_{i_t}^n \mid \theta_{a_t})\, l_{1:t-1}(B_{a_t}\theta_{a_t} + B_{i_t}\theta_{i_t}^n)}{q_{t-1,i_t}(\theta_{i_t}^n \mid \theta_{a_t})} \right) \\
&= \frac{\pi_{t-1,a_t}(\theta_{a_t})}{N_i} \sum_{n=1}^{N_i} \pi_{t-1,i_t}\!\left(\theta_{i_t}^n \mid \theta_{a_t}\right) \prod_{\substack{j=1 \\ j \neq n}}^{N_i} q_{t-1,i_t}\!\left(\theta_{i_t}^j \mid \theta_{a_t}\right). \qquad (6.16)
\end{aligned}
\]
This target is not the same as πt−1,at−1 ,it−1 : in particular qt−1,it and qt−1,it−1 can be chosen
independently of each other. This additional change in target between iterations t − 1 and t
requires an additional IS step, which is detailed at the end of this section.
The weight update at iteration $t$ is then similar to that in Section 6.2:
\[
\begin{aligned}
\tilde{\omega}_t^m &= \omega_{t-1}^m\, \frac{\pi_{t,a_t,i_t}\!\left(\theta_{t-1,a_t}^m, \{\theta_{t-1,i_t}^{n,m}\}_{n=1}^{N_i}\right)}{\pi_{t-1,a_t,i_t}\!\left(\theta_{t-1,a_t}^m, \{\theta_{t-1,i_t}^{n,m}\}_{n=1}^{N_i}\right)} \\
&= \omega_{t-1}^m\, \frac{\prod_{j=1}^{N_i} q_{t,i_t}\!\left(\theta_{i_t}^j \mid \theta_{a_t}\right)\; \sum_{n=1}^{N_i} \tilde{w}_t^{n,m}\!\left(\theta_{t-1,a_t}^m, \theta_{t-1,i_t}^{n,m}\right)}{\prod_{j=1}^{N_i} q_{t-1,i_t}\!\left(\theta_{i_t}^j \mid \theta_{a_t}\right)\; \sum_{n=1}^{N_i} \tilde{w}_{t-1}^{n,m}\!\left(\theta_{t-1,a_t}^m, \theta_{t-1,i_t}^{n,m}\right)},
\end{aligned}
\]
where
\[
\tilde{w}_t^{n,m}\!\left(\theta_{t-1,a_t}^m, \theta_{t-1,i_t}^{n,m}\right) = \frac{p_{i_t}\!\left(\theta_{t-1,i_t}^{n,m} \mid \theta_{t-1,a_t}^m\right)\, l_{1:t}\!\left(B_{a_t}\theta_{t-1,a_t}^m + B_{i_t}\theta_{t-1,i_t}^{n,m}\right)}{q_{t,i_t}\!\left(\theta_{t-1,i_t}^{n,m} \mid \theta_{t-1,a_t}^m\right)}.
\]
At iteration $t$, for each particle we use an MCMC move with invariant distribution
\[
\pi_{t,a_t,i_t}\!\left(\theta_{a_t}, \{\theta_{i_t}^n\}_{n=1}^{N_i}\right). \qquad (6.17)
\]
[Chopin et al., 2012] notes that it is simple to extend a set of weighted particles from $\pi_{t,a_t,i_t}$ so that they are from $\pi_{t,a_t,i_t}^*$: for each particle we simulate from the conditional distribution of $u_t$, which is given by $u_t \mid \theta_{t,a_t}, \{\theta_{t,i_t}^n\}_{n=1}^{N_i} \sim \mathcal{M}\!\left(w_t^{1,m}, \ldots, w_t^{N_i,m}\right)$, where $w_t^{n,m}$ is the normalised version of $\tilde{w}_t^{n,m}(\theta_{t,a_t}^m, \theta_{t,i_t}^{n,m})$. At the beginning of iteration $t$, our method performs this simulation of $u_{t-1}$ for each particle, then makes use of the following transformation of the extended state:
\[
\begin{aligned}
\theta_{t-1,a_t} &= G_{t-1 \to t,a}\!\left(u_{t-1}, \theta_{t-1,a_{t-1}}, \{\theta_{t-1,i_{t-1}}^n\}_{n=1}^{N_i}\right) = B_{a_t}^T\!\left(B_{a_{t-1}} \theta_{t-1,a_{t-1}} + B_{i_{t-1}} \theta_{t-1,i_{t-1}}^{u_{t-1}}\right) \\
\theta_{t-1,i_t}^{u_{t-1}} &= G_{t-1 \to t,i}\!\left(u_{t-1}, \theta_{t-1,a_{t-1}}, \{\theta_{t-1,i_{t-1}}^n\}_{n=1}^{N_i}\right) = B_{i_t}^T\!\left(B_{a_{t-1}} \theta_{t-1,a_{t-1}} + B_{i_{t-1}} \theta_{t-1,i_{t-1}}^{u_{t-1}}\right) \\
u_{t-1} &= u_{t-1}
\end{aligned}
\]
The conditional IS move makes use of this transformation, along with a proposal from [Chopin et al., 2012]. Our desired target distribution for the new point is $\pi_{t-1,a_t,i_t}^*\!\left(u_{t-1}, \theta_{a_t}, \{\theta_{i_t}^n\}_{n=1}^{N_i}\right)$, the extension of the target in equation (6.16), just as equation (6.18) is an extension of (6.15). Our proposal uses the current particle with values $\left(u_{t-1}, \theta_{t-1,a_{t-1}}, \{\theta_{t-1,i_{t-1}}^n\}_{n=1}^{N_i}\right)$ passed through the transformation above to give the point $\left(u_{t-1}, \theta_{t-1,a_t}, \theta_{t-1,i_t}^{u_{t-1}}\right)$. We additionally require the variables $\{\theta_{t-1,i_t}^n\}_{n=1, n \neq u_{t-1}}^{N_i}$ and will propose them from the conditional distribution $\{\theta_{t-1,i_t}^n\}_{n=1, n \neq u_{t-1}}^{N_i} \mid u_{t-1}, \theta_{t-1,a_t}, \theta_{t-1,i_t}^{u_{t-1}}$. Using equation (6.18), this conditional distribution is given by
\[
\frac{\pi_{t-1,a_t,i_t}^*\!\left(u, \theta_{a_t}, \{\theta_{i_t}^n\}_{n=1}^{N_i}\right)}{\frac{\pi_{t-1,a_t}(\theta_{a_t})}{N_i}\, \pi_{t-1,i_t}\!\left(\theta_{i_t}^u \mid \theta_{a_t}\right)} = \prod_{\substack{j=1 \\ j \neq u}}^{N_i} q_{t-1,i_t}\!\left(\theta_{i_t}^j \mid \theta_{a_t}\right).
\]
For the IS to be valid, we need to artificially extend the target distribution using a backwards kernel $L$ (as in [Del Moral and Doucet, 2003]) to form a joint distribution over all of the variables involved in the proposal, such that the desired target $\pi_{t,a_t,i_t}^*$ is a marginal distribution. The IS target is then
\[
\pi_{t-1,a_t,i_t}^*\!\left(u, \theta_{a_t}, \{\theta_{i_t}^n\}_{n=1}^{N_i}\right)\, L\!\left(\{\theta_{i_{t-1}}^n\}_{n=1,\,n \neq u}^{N_i} \,\middle|\, u, \theta_{a_t}, \{\theta_{i_t}^n\}_{n=1}^{N_i}\right),
\]
with proposal
\[
\pi_{t-1,a_{t-1},i_{t-1}}^*\!\left(u, \theta_{a_{t-1}}, \{\theta_{i_{t-1}}^n\}_{n=1}^{N_i}\right) \prod_{\substack{j=1 \\ j \neq u}}^{N_i} q_{t-1,i_t}\!\left(\theta_{i_t}^j \mid \theta_{a_t}\right).
\]
• this additional step is run after determining the active subspace for the next iteration, for each of the Na particles;
• we sample one of the inactive particles, using the weights of the particles in the inactive space;
• we project the active variable and the sampled inactive variable into the new active and inactive subspaces (a minimal sketch of this projection is given after this list);
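A minimal sketch of the projection in the last bullet (Python, NumPy; the array names are assumptions for illustration) is:

import numpy as np

def reproject_particle(theta_a, theta_i_selected, B_a_prev, B_i_prev, B_a_new, B_i_new):
    """Map (active point, sampled inactive point) from the old AS bases to the new ones."""
    x = B_a_prev @ theta_a + B_i_prev @ theta_i_selected   # reconstruct the full-space point
    theta_a_new = B_a_new.T @ x                            # active coordinates in the new basis
    theta_i_new = B_i_new.T @ x                            # inactive coordinates in the new basis
    return theta_a_new, theta_i_new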
Figure 6.4: Estimate of the mean of component θ1 of the model of Section 4.9.1 across 10
runs of AS-SMC-a. The average across runs is 0.0 with a standard deviation of the measure
of 0.2.
Figure 6.5: Estimate of the mean of component θ2 of the model of Section 4.9.1 across 10
runs of AS-SMC-a. The average across runs is 0.0 with a standard deviation of the measure
of 0.1.
We then visualize in Figure 6.6 below how, during the adaptation in the tempering steps of the AS-SMC-a algorithm, the direction of the Active Subspace changes from the prior AS direction seen in Figure 4.7 to being 100% aligned with the posterior AS direction of Figure 4.8 in the final tempering steps. The same tempering steps have been used across the 10 runs.
Figure 6.6: Adaptation of the direction of the Active Subspaces in AS-SMC-a algorithm,
measured across 10 different runs: the direction during the adaptation goes from prior AS of
Figure 4.7 at tempering step 0, to posterior AS of Figure 4.8 in the final tempering steps.
The same tempering steps have been used across the 10 runs.
6.4.4 Review
We have introduced AS-SMC-a which brings adaptation of the Active Subspaces structure
to the AS-SMC algorithm that we introduced in Section 6.2. We have seen that while in
traditional AS methods the direction of the Active Subspace for the model of Section 4.9.1
would remain wrongly fixed along the direction of the prior AS of Figure 4.7 for the whole du-
ration, the AS-SMC-a allows for the direction to adapt in the tempering steps, until it aligns
correctly with the posterior AS of Figure 4.8. We expect AS-SMC-a to bring improvements in cases where the prior Active Subspace directions are very different from the posterior ones. It is to be noted that, as is the case for other algorithms,
there is a trade-off, since the adaptation will bring additional computations, for example
the calculation of the structure of the AS at each tempering step. Future experiments are
needed to better understand the performances of the algorithm and how it stands compared
to traditional methods and other AS-based algorithms that we have introduced throughout.
One final note to underline the difference between our method and Spike-and-Slab [Mitchell and Beauchamp, 1988, George and McCulloch, 1997, Ishwaran and Rao, 2005], and to explain why we did not use such a method with Active Subspaces. Spike-and-Slab is designed primarily for selecting individual variables by assigning them either a probability of exclusion (spike) or of inclusion (slab) [Mitchell and Beauchamp, 1988, George and McCulloch, 1997, Ishwaran and Rao, 2005], which could potentially be very useful in an active-inactive setting, considering that we may want to eliminate inactive variables. However, applying Spike-and-Slab directly to the models we have used for our AS-SMC-a method may not always be feasible. For instance, both in the Gaussian model of Section 5.3.2 and in the Banana model of Section 5.4.2, AS-SMC-a operates on linear combinations of variables rather than on each variable independently; see for example equations (5.1) and (5.15). Spike-and-Slab, in contrast, operates on each variable separately from the others. This difference makes Spike-and-Slab generally unsuitable for the models we have considered.
6.5 Conclusion
We started the chapter with the exploration of the AS-SMC algorithm in Section 6.2, which can be considered the SMC counterpart of the AS-MH of Section 4.8.2. We have seen that it performs better than standard SMC in cases where the Importance Sampler behaves well even in high dimensions of the inactive space, for example in the Gaussian model (see Figure 6.1), whereas when the inactive subspace becomes more challenging, for example in the Banana model, the advantage becomes less clear and long tails appear in the distribution of the RMSE (see Figure 6.2). We then introduced AS-SMC2 in Section 6.3, which can be seen as the SMC counterpart of the AS-PMMH of Section 5.2. We have seen that for AS-SMC2 as well, when applied to the Banana model, the extra computational cost does not seem to bring additional benefits, and that more tuning may be needed to better understand the conditions for optimal performance of the algorithm. AS-SMC2 has a drawback similar to AS-PMMH: it spends a considerable amount of computational effort on the inactive part, which is the least interesting to us. Overall, while the performance on the Gaussian model is encouraging, suggesting that in the case of perfect Active Subspaces AS-SMC is a clear winner, the results on the Banana model seem to show that in more complex scenarios, when we diverge from perfect Active Subspaces, the case for using either AS-SMC or AS-SMC2 is less clear, as the additional complexity brings some advantages in the results but also disadvantages, for example longer tails in the distribution of the RMSE. We finally introduced in Section 6.4 a version of AS-SMC that adapts the structure of the AS at each tempering step, which we named AS-SMC-a. We have shown how the adaptation allowed the Active Subspace direction of the model of Section 4.9.1 to be correctly identified, while traditional AS methods would have kept the wrong AS direction throughout the algorithm. The adaptation may produce benefits in cases where the prior and posterior AS are significantly different; however, the extra computations carry a trade-off, and the case for using AS-SMC-a will have to be explored further.
Chapter 7
This thesis contains two major threads: the integration of a Sequential Monte Carlo (SMC) algorithm into the BEAST2 platform for phylogenetic analysis, and the development of novel Active Subspace (AS) methods for Monte Carlo algorithms. Both threads share the aim of improving the accuracy and efficiency of existing algorithms in their fields, and of reducing complexity.
Particle Marginal Metropolis Hastings (PMMH) Andrieu et al. [2010]. Our AS-PMMH algorithm showed significant trade-offs between accuracy and computational cost, and sensitivity to noisy likelihood estimates. A second version of the algorithm, AS-PMMH-i, obtained by swapping the roles of the active and inactive subspaces, also demonstrated trade-offs in computational complexity. We introduced AS-Gibbs, which embeds Gibbs sampling [Geman and Geman, 1984] into AS, avoiding marginal likelihood estimation. We have seen that AS-Gibbs proves particularly effective in cases of near-perfect Active Subspaces or of independence between the active and inactive parts. Extending this idea, we embedded Metropolis within Particle Gibbs (MwPG) algorithms [Andrieu et al., 2010] into AS, proposing AS-MwPG and AS-MwPG-i, which use SMC samplers for the inactive and active subspace respectively. AS-MwPG-i in particular showed superior performance on multimodal distributions, as demonstrated by a toy example where AS-MwPG-i was the only method capable of correctly reconstructing both modes. For determining the dimension of the Active Subspace, we proposed a novel ESS-based method. This method can be seen as an alternative to the spectral gap method traditionally used in AS, and it has been shown to give identical results in the examples where we have tested it.
7.4 Potential future research directions
While the two research threads of phylogenetics and Active Subspaces address different challenges, they share a common underlying theme: enhancing the computational efficiency and reducing the complexity of Bayesian methods. Both approaches take on high-dimensional state spaces and aim to address the curse of dimensionality. A potential direction of future research lies in combining these threads. Incorporating Active Subspaces into phylogenetic analysis could further improve the efficiency of SMC methods in this field. By identifying Active Subspaces in high-dimensional phylogenetic models, AS could focus the computational effort on the most informative parts, potentially enabling SMC to handle even larger and more complex datasets.
Appendix A
Following the procedure and the structure of Section 3.14, where we presented the results for 10 taxa, we report in this section the results of the comparison of the Annealed Adaptive SMC vs the BEAST2 MCMC for 5 and 20 taxa. Please refer to Section 3.12 for full details of the implementation.
Generator tree
The first step has been, as described in Section 3.13.1, to generate a coalescent tree with 5 leaves; the randomly generated tree is shown below in Figure A.1.
Figure A.1: Random coalescent tree with 5 leaves generated using the procedure outlined in
the first part of Section 3.13.1. This has been the generating tree for the synthetic data of the
test described in this section. Visualization via FigTree [Rambaut, 2023]
Using the tree generated in the previous step, synthetic sequences have been generated with the 'seq-gen' program, as explained in Section 3.13.1.
A.2.1 SMC set up
The SMC has been set up with 1000 particles and 5 MCMC moves per annealing step. The number of annealing steps, adaptively determined by the CESS (see Section 2.6.3 for details on the CESS), has been 33, as can be seen in Figure A.2.
Figure A.2: Annealing steps in the SMC run for the 5-taxa example studied in this section.
Therefore the total number of likelihood evaluations for the algorithm has been 1000 × 33 × 5 = 165000. The adaptive annealing steps have been determined using the CESS with a threshold of 90%, and resampling of particles is done when the ESS falls below 50% of the particles; the ESS chart for the SMC run is shown in Figure A.3 below.
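For reference, the following sketch (Python) shows a standard adaptive-tempering recipe based on the CESS, in the spirit of the set-up described above; it is an illustration of the idea under the usual CESS definition, not the code integrated into BEAST2.

import numpy as np

def cess(normalised_weights, log_incremental):
    """Conditional ESS for incremental weights exp(log_incremental)."""
    c = log_incremental.max()
    v = np.exp(log_incremental - c)
    num = np.sum(normalised_weights * v) ** 2
    den = np.sum(normalised_weights * v ** 2)
    return len(normalised_weights) * num / den

def next_temperature(log_lik_particles, normalised_weights, gamma_prev, target=0.9, tol=1e-6):
    """Find gamma in (gamma_prev, 1] such that CESS is about target * N, by bisection."""
    n = len(log_lik_particles)
    lo, hi = gamma_prev, 1.0
    if cess(normalised_weights, (hi - gamma_prev) * log_lik_particles) >= target * n:
        return 1.0                                 # can jump straight to the posterior
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cess(normalised_weights, (mid - gamma_prev) * log_lik_particles) >= target * n:
            lo = mid                               # CESS still high: can increase gamma
        else:
            hi = mid
    return lo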
Figure A.3: ESS versus annealing SMC step for the 5-taxa example studied in this section.
Resampling is performed whenever the ESS falls below 50% of particles (1000 particles are
used for the simulation).
• Gamma shape
• Coalescent Tree
scattered, as we can see from Figure A.4 for MCMC and Figure A.5 for SMC. This is because the probability of MCMC moves on the gamma shape parameter has been kept at the default value given by the configuration software BEAUti (see Section 3.9.1), and such moves are less likely to happen than moves on the effective population size and on the trees (as an example, an MCMC move on the gamma shape is 30 times less likely than a move on the Effective Population Size parameter); the low ESS and the scattered distributions are a result of this.
The mean of the MCMC run is close to the true value of 1; the full statistics are in the following table. As in the 10-taxa case for the gamma shape (see the MCMC part of Section 3.16.1), the ESS is very low and the distribution, as we can see in Figure A.4 below, is rather scattered:
Statistic Value
Mean 0.9
Standard Deviation 0.0656
Value Range [0.7356, 1.0667]
95% HPD Interval [0.7729, 1.0024]
Effective Sample Size (ESS) 99
Figure A.4: Frequency distribution using the native MCMC run with BEAST2 for the
parameter Gamma shape with 5 taxa. Visualization with the software Tracer.
Annealed Adaptive SMC results for Gamma shape
The statistics for the SMC run are similar to those of the MCMC run:
Statistic Value
Mean 1.17
Standard Deviation 0.132
Value Range [0.697, 1.256]
95% HPD Interval [0.798, 1.25]
Figure A.5: Frequency distribution for the parameter Gamma shape with 5 taxa, using the Annealed Adaptive SMC algorithm that we have embedded into BEAST2. Visualization with python matplotlib.
MCMC results for Effective Population Size
Statistic Value
Mean 2.7633
Standard Deviation 1.4387
Value Range [0.4307, 13.3307]
95% HPD Interval [0.6802, 5.4841]
Effective Sample Size (ESS) 519
Figure A.6: Frequency distribution using the native MCMC run with BEAST2 for the
parameter Effective Population Size with 5 taxa. Visualization with the software Tracer.
The statistics for the SMC run are in general better than those of the MCMC run; we can see a lower variance, for example. We can also see from Figure A.7 that it has the same peak as the corresponding MCMC Figure A.6, but in the MCMC case the larger variance and right-skewness cause a slightly higher value of the mean:
Statistic Value
Mean 2.223
Standard Deviation 0.759
Value Range [0.602, 6.272]
95% HPD Interval [1.063, 3.903]
The distribution of the Effective Population Size is shown in Figure A.7.
Figure A.7: Frequency distribution for the parameter Effective Population Size with 5 taxa,
using Annealed Adaptive SMC algorithm that we have embedded into BEAST2. Visualization
with python matplotlib.
We can also see the SMC algorithm at work in Figure A.8 below, which shows the evolution of the estimated standard deviation of the particles versus the annealing step for the parameter Effective Population Size: the standard deviation drops significantly through the annealing journey.
Figure A.8: Evolution of the estimated standard deviation of particles vs the annealing step
for the parameter Effective Population Size in the SMC algorithm: we see how the standard
deviation drops significantly through the annealing journey.
A.3.3 Tree
For the tree analysis we use a methodology similar to [Wang et al., 2019] and compare trees using the majority-rule consensus. We therefore have a consensus tree, which is a “summary” tree, for the MCMC run and one for the SMC run, and we compare them to the generating tree shown in Section A.1 to assess how each of the two algorithms has performed. In addition to visualizing the “summary” trees for the two runs, we also give a basic topological metric of performance, the Robinson-Foulds (RF) “symmetric difference” metric [Robinson and Foulds, 1981], which identifies possible topology mismatches with the reference tree. The consensus tree has been generated using TreeAnnotator and then visualized using FigTree. For the SMC algorithm, the particles have been resampled in order to be able to compare SMC tree samples without the need to consider the particle weights when building the consensus, for ease of calculation.
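As an illustration, the RF symmetric difference can be computed, for example, with the DendroPy library as in the following sketch; the file names are placeholders, and this is not necessarily the tooling used to produce the results reported here.

import dendropy
from dendropy.calculate import treecompare

taxa = dendropy.TaxonNamespace()  # both trees must share the same taxon namespace
reference = dendropy.Tree.get(path="generating_tree.nex", schema="nexus", taxon_namespace=taxa)
consensus = dendropy.Tree.get(path="consensus_tree.nex", schema="nexus", taxon_namespace=taxa)

# Unweighted Robinson-Foulds (symmetric difference) distance; 0 means the topologies match.
rf = treecompare.symmetric_difference(reference, consensus)
print("RF symmetric difference:", rf)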
The RF metric result for the MCMC run is 0, meaning a topological match with the reference tree of Section A.1. Figure A.9 below shows the consensus tree, with visualization of the 95% confidence range of the coalescent times (compare with the generator tree of Figure A.1).
Figure A.9: Consensus tree for the MCMC run with visualization of the 95% range for the coalescent times. See the comparison with the generating tree (which the MCMC run tries to reconstruct) in Figure A.1. The consensus tree has been generated with TreeAnnotator and the visualization is with FigTree (both software from the BEAST2 package).
The RF metric result for the SMC run is also 0, meaning a topological match with the reference tree of Section A.1. Figure A.10 below shows the consensus tree, with visualization of the 95% confidence range of the coalescent times (compare with the generator tree of Figure A.1 and with the MCMC-generated consensus tree of Figure A.9).
Figure A.10: Consensus tree for the Annealed Adaptive SMC run with visualization of the 95% range for the coalescent times. See the comparison with the generating tree (which the SMC run tries to reconstruct) in Figure A.1, and also with the tree reconstructed using MCMC in Figure A.9: the SMC is able to reconstruct the generating tree well and with a smaller uncertainty (the 95% uncertainty ranges of the coalescent times are in general smaller than in the MCMC of Figure A.9). The consensus tree has been generated with TreeAnnotator and the visualization is with FigTree (both software from the BEAST2 package).
By comparing Figure A.10 with Figure A.9 we can see that the SMC algorithm has been able to reconstruct the generating tree with smaller uncertainty than the MCMC algorithm: the 95% uncertainty ranges of the coalescent times are in general smaller for SMC than for MCMC.
Generator tree
The first step has been, as described in Section 3.13.1, to generate a coalescent tree with 20 leaves; the randomly generated tree is shown below in Figure A.11.
Figure A.11: Random coalescent tree with 20 leaves generated using the procedure outlined
in the first part of Section 3.13.1. This has been the generating tree for the synthetic data of
the test described in this section. Visualization via FigTree [Rambaut, 2023]
Using the tree generated in the previous step, synthetic sequences have been generated with the 'seq-gen' program, as explained in Section 3.13.1.
Figure A.12: Annealing steps in the SMC run for the 20-taxa example studied in this section.
Therefore the total number of likelihood evaluations for the algorithm has been 1000 × 96 × 5 = 480000. The adaptive annealing steps have been determined using the CESS with a threshold of 90%, and resampling of particles is done when the ESS falls below 50% of the particles; the ESS chart for the SMC run is shown in Figure A.13 below.
Figure A.13: ESS versus annealing SMC step for the 20-taxa example studied in this section.
Resampling is performed whenever the ESS falls below 50% of particles (1000 particles are
used for the simulation).
A.5.2 MCMC set up
Considering that the total number of likelihood evaluations from the Annealed Adaptive SMC was 480000, we have used a comparison similar to [Wang et al., 2019] and therefore a number of MCMC iterations more than 20% greater than for the SMC run: in our MCMC simulation we have used 600000 iterations.
• Gamma shape
• Coalescent Tree
The mean of the MCMC run is close to the true value of 1; the full statistics are in the following table. As in the 5- and 10-taxa cases for the gamma shape (see the MCMC parts of Sections A.3.1 and 3.16.1), the ESS is very low and the distribution, as we can see in Figure A.14 below, is rather scattered:
Statistic Value
Mean 0.92
Standard Deviation 0.066
Value Range [0.776, 1.107]
95% HPD Interval [0.7829, 1.0403]
Effective Sample Size (ESS) 142
Figure A.14: Frequency distribution using the native MCMC run with BEAST2 for the
parameter Gamma shape with 20 taxa. Visualization with the software Tracer.
The statistics for the SMC run are similar to those of the MCMC run:
Statistic Value
Mean 1.04
Standard Deviation 0.038
Value Range [0.709, 1.096]
95% HPD Interval [0.95, 1.08]
Figure A.15: Frequency distribution for the parameter Gamma shape with 20 taxa, using
Annealed Adaptive SMC algorithm that we have embedded into BEAST2. Visualization with
python matplotlib.
The mean of the MCMC run is 1.73; the full statistics are in the following table.
Statistic Value
Mean 0.93
Standard Deviation 0.244
Value Range [0.344, 2.383]
95% HPD Interval [0.52, 1.395]
Effective Sample Size (ESS) 2109
Figure A.16: Frequency distribution using the native MCMC run with BEAST2 for the
parameter Effective Population Size with 20 taxa. Visualization with the software Tracer.
The statistics for the SMC run are comparable to those of the MCMC run. We can notice a peak in the distribution of Figure A.17; it may be due to a combination of not enough MCMC moves on the parameter and some degree of particle degeneracy:
Statistic Value
Mean 1.075
Standard Deviation 0.21
Value Range [0.44, 1.91]
95% HPD Interval [0.62, 1.52]
Figure A.17: Frequency distribution for the parameter Effective Population Size with 20 taxa,
using Annealed Adaptive SMC algorithm that we have embedded into BEAST2. Visualization
with python matplotlib.
We can also see the SMC algorithm at work in Figure A.18 below, which shows the evolution of the estimated standard deviation of the particles versus the annealing step for the parameter Effective Population Size: the standard deviation drops significantly through the annealing journey.
Figure A.18: Evolution of the estimated standard deviation of particles vs the annealing step
for the parameter Effective Population Size in the SMC algorithm: we see how the standard
deviation drops significantly through the annealing journey.
A.6.3 Tree
For the tree analysis please refer to the 5-taxa Section A.3.3, where full technical details are given. For the SMC algorithm, the particles have been resampled in order to be able to compare SMC tree samples without the need to consider the particle weights when building the consensus, for ease of calculation (unlike what has been done for the gamma shape and the Effective Population Size, where particles have been weighted to calculate the statistics). In addition, differently from the 5- and 10-taxa cases of Sections A.3.3 and 3.16.3 respectively, we have omitted the 95% ranges in Figures A.19 (MCMC) and A.20 (SMC) for ease of visualization, as they would have been difficult to display with many nodes and branches; similar conclusions to the 5- and 10-taxa cases can nevertheless be drawn for the 20-taxa case as well.
The RF metric result for the MCMC run is 0, meaning a topological match with the reference tree of Section A.4. Figure A.19 below shows the consensus tree (compare with the generator tree of Figure A.11).
Figure A.19: Consensus tree for the MCMC run. See the comparison with the generating tree
(which the MCMC run tries to reconstruct) in Figure A.11. The consensus tree has been gener-
ated with TreeAnnotator and the visualization is with FigTree (both software from the BEAST2
package).
Annealed Adaptive SMC results for Tree
The RF metric result for the run is 0, meaning a match from a topological point of view
with the reference tree of Section A.4. In Figure A.20 below we can see the consensus tree
(compare it with the generating tree of Figure A.11, and with the MCMC-generated consensus
tree of Figure A.19).
Figure A.20: Consensus tree for the Annealed Adaptive SMC run. See the comparison with
the generating tree (which the SMC run tries to reconstruct) in Figure A.11, and also with
the tree reconstructed using MCMC in Figure A.19: we see that the SMC is able to recon-
struct the generating tree with a similar level of accuracy to the MCMC. The consensus tree has
been generated with TreeAnnotator and the visualization is with FigTree (both software from the
BEAST2 package).
By comparing Figure A.20 with Figure A.19 we can see that the SMC algorithm has
been able to reconstruct the generating tree with a similar level of accuracy to the MCMC
algorithm.
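As an aside, a Robinson-Foulds comparison of the kind reported above could be reproduced, for example, with the dendropy library; the sketch below is our own illustration under that assumption (it is not the pipeline used in the thesis), the file names are hypothetical, and both trees must be loaded into the same taxon namespace for the comparison to be meaningful:

import dendropy
from dendropy.calculate import treecompare

# Load both trees into a shared taxon namespace (hypothetical file names).
taxa = dendropy.TaxonNamespace()
consensus = dendropy.Tree.get(path="smc_consensus.nex", schema="nexus",
                              taxon_namespace=taxa)
generating = dendropy.Tree.get(path="generating_tree.nex", schema="nexus",
                               taxon_namespace=taxa)

# Unweighted Robinson-Foulds (symmetric difference) distance:
# 0 means the two topologies match exactly.
rf = treecompare.symmetric_difference(consensus, generating)
print("RF distance:", rf)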
Appendix B
We use several examples of Bayesian inverse problems in the Active Subspace chapters
(Chapter 4 and subsequent); here we give the basic concepts. We consider a classical Bayesian
inverse problem reported in [Constantine et al., 2016], and we follow the same notation as the
paper: we have a model with additive noise,

$$d = m(x) + e \tag{B.1}$$

It is assumed for simplicity that the Gaussian noise e has uncorrelated components and
therefore a diagonal covariance matrix.
Given N independent measurements of the form (B.1),

$$d_i = m(x) + e_i \tag{B.2}$$

the likelihood of the system is (see for example [Najm, 2018] for a clear derivation)

$$\rho_{\mathrm{lik}}(d, x) = \prod_{i=1}^{N} p(d_i \mid x) \tag{B.3}$$
We show below the one-dimensional case of the density p(d_i | x), where we recognize the
familiar Gaussian form

$$p(d_i \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(d_i - m(x))^2}{2\sigma^2}\right) \tag{B.4}$$
The product in (B.3) is a result of the independence of the measurements; that it reduces to
a product of Gaussians of the form (B.4) is a consequence of the Gaussian noise assumption
and of the components of the noise being uncorrelated.
In a similar fashion, and for a more general multi-dimensional parameter space, the likelihood
can be expressed in the compact form (neglecting multiplicative constants for brevity)

$$\rho_{\mathrm{lik}}(d, x) \propto \exp\left(-\frac{\|d - m(x)\|^2}{2\sigma^2}\right) \tag{B.5}$$
We choose as the function f of (4.8) the negative log-likelihood, as in equation (4.16); as is
clear from its formulation in equation (B.6), it is a measure of the data misfit. Applying the
negative logarithm to (B.5), we have

$$f(x) = \frac{\|d - m(x)\|^2}{2\sigma^2} \tag{B.6}$$
We know that the first necessary pre-processing step is to use the gradient of f to find the
directions along which f varies the most (see equation (4.2) and subsequent). The gradient
of f in (B.6) is easily calculated:

$$\nabla f(x) = \frac{1}{\sigma^2}\,\nabla m(x)^{T}\bigl(m(x) - d\bigr) \tag{B.7}$$
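As a concrete illustration of (B.6) and (B.7), the sketch below is our own toy example (not taken from [Constantine et al., 2016]) using an assumed linear forward model m(x) = Ax, for which the Jacobian ∇m(x) is simply A; all names and dimensions here are illustrative:

import numpy as np

rng = np.random.default_rng(1)

# Toy linear forward model m(x) = A x with additive Gaussian noise (assumed example).
n_data, n_params, sigma = 50, 10, 0.1
A = rng.normal(size=(n_data, n_params))
x_true = rng.normal(size=n_params)
d = A @ x_true + sigma * rng.normal(size=n_data)

def misfit(x):
    # f(x) = ||d - m(x)||^2 / (2 sigma^2), as in (B.6).
    r = d - A @ x
    return 0.5 * np.dot(r, r) / sigma**2

def grad_misfit(x):
    # grad f(x) = (1/sigma^2) * grad m(x)^T (m(x) - d), as in (B.7); here grad m(x) = A.
    return (A.T @ (A @ x - d)) / sigma**2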
Now that we have the gradient from equation (B.7), in order to build the matrix C of
(4.2) we have to integrate ∇f(x)∇f(x)T against the posterior density of our problem. Since
we must assume that integrating against the posterior, or even drawing i.i.d. samples directly
from the posterior and using Ĉ of (4.12), is not convenient or tractable (otherwise we would
not need MCMC to approximate it in the first place), we choose the approximation that we
called Ĉpri in (4.28), obtained from (4.12) when we sample from the prior distribution.
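The following sketch, again our own illustration, shows how the prior-based approximation Ĉpri of (4.28) can be formed by Monte Carlo and its dominant eigenvectors extracted to define the active directions; it assumes a standard normal prior and reuses the hypothetical grad_misfit of the previous sketch:

import numpy as np

def estimate_C_prior(grad_f, prior_sampler, dim, n_samples=1000, seed=2):
    # Monte Carlo estimate of C = E[grad f(x) grad f(x)^T], with the expectation
    # taken under the prior: the approximation called C_hat_pri in (4.28).
    rng = np.random.default_rng(seed)
    C = np.zeros((dim, dim))
    for _ in range(n_samples):
        x = prior_sampler(rng)
        g = grad_f(x)
        C += np.outer(g, g)
    return C / n_samples

# Assuming a standard normal prior over the parameters:
# C_hat = estimate_C_prior(grad_misfit, lambda rng: rng.normal(size=10), dim=10)
# eigvals, eigvecs = np.linalg.eigh(C_hat)          # eigenvalues in ascending order
# W1 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]    # e.g. a 2-dimensional active basis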
Appendix C
This appendix relates to the simplified version of Algorithm 8 of Section 4.8.2. In the case of
perfectly inactive variables, equation (4.25) becomes an exact marginal rather than a pseudo-marginal,
and Algorithm 8 simplifies to Algorithm 21, which becomes an MCMC targeting the marginal
posterior πa = pa la. See the difference between line 5 of Algorithm 21, where the exact marginal
is used directly, and line 5 of Algorithm 8, where an estimate was used through the calculation
of a pseudo-marginal.
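To make the simplification concrete, the following is a minimal sketch, written by us for illustration and not Algorithm 21 as implemented in the thesis code: a random walk Metropolis-Hastings chain targeting an exact marginal posterior πa ∝ pa(a) la(a) on the active variables, where log_prior_a and log_lik_a are assumed to be available in closed form:

import numpy as np

def mh_exact_marginal(log_prior_a, log_lik_a, a0, n_iters=10000, step=0.1, seed=3):
    # Random walk MH targeting pi_a(a) proportional to p_a(a) * l_a(a).
    # Exact-marginal counterpart of the pseudo-marginal update: the (log)
    # marginal likelihood is evaluated directly rather than estimated.
    rng = np.random.default_rng(seed)
    a = np.atleast_1d(np.asarray(a0, dtype=float))
    log_post = log_prior_a(a) + log_lik_a(a)
    samples = []
    for _ in range(n_iters):
        prop = a + step * rng.normal(size=a.shape)
        log_post_prop = log_prior_a(prop) + log_lik_a(prop)
        if np.log(rng.uniform()) < log_post_prop - log_post:
            a, log_post = prop, log_post_prop
        samples.append(a.copy())
    return np.asarray(samples)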
Appendix D
When comparing the performance of the Active Subspaces algorithms presented in Chapter
5, we have tried to keep the number of likelihood evaluations constant in each run, to ensure
a fair comparison. Due to the structure of the algorithms, the same number of likelihood
evaluations may result in a different number of output samples. Taking a reference figure of
100000 likelihood evaluations, this results in the following (a small illustrative calculation is
sketched after the list):
• AS-PMMH: if we use 10 inner inactive variables and 6 tempering steps of the inner
SMC sampler, then we have 1666 outer MCMC steps and therefore as many output
samples (1666 × 10 × 6 = 99960, the largest number of evaluations achievable without exceeding 100000);
• AS-PMMH-i: the same as AS-PMMH, but with the roles of the active and inactive variables inverted;
• AS-Gibbs: since each iteration first operates on the inactive and then on the active variables, 100000
likelihood evaluations are completed in 50000 iterations and therefore produce 50000 output
samples.
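The small sketch below is our own illustration; the function simply encodes the counting rules listed above, and the parameter names are of our choosing:

def output_samples(budget=100_000, n_inactive=10, n_temper=6):
    # Number of output samples per method for a fixed likelihood-evaluation budget,
    # following the counting rules described above.
    return {
        "MCMC": budget,                                  # one evaluation per sample
        "AS-MH": budget // n_inactive,                   # one inactive sample per active variable
        "AS-PMMH": budget // (n_inactive * n_temper),    # inner SMC costs n_inactive * n_temper
        "AS-PMMH-i": budget // (n_inactive * n_temper),  # roles of active/inactive swapped
        "AS-Gibbs": budget // 2,                         # inactive then active update per iteration
    }

# output_samples() reproduces the figures of Table D.1; doubling the budget gives,
# up to integer rounding, the figures reported for 200000 evaluations.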
Method Number of output samples
MCMC 100000
AS-MH∗ 10000
AS-PMMH 1666
AS-PMMH-i 1666
AS-Gibbs 50000
Table D.1: Comparison of the number of output samples when performing 100000 likelihood
evaluations in different MCMC methods. ∗ For AS-MH the figure indicates the number of samples
when one inactive sample is used per active variable, as in formula (4.23). If instead all inactive
particles are used, as in formula (4.24), the figure must be multiplied by the number of inactive
variables used (10 in this case).
In Table 5.1, for AS-MH the figure in the table indicates the number of samples when one
inactive sample is used per active variable, as in formula (4.23); if instead all inactive particles
are used, as in formula (4.24), the figure must be multiplied by 10 (the number of inactive
variables used) to account for all the samples.
For ease of reference in some of the sections, we also report the table for 200000 likelihood
evaluations (it is the above Table D.1 with the numbers multiplied by 2):
Method Number of output samples
MCMC 200000
AS-MH∗ 20000
AS-PMMH 3332
AS-PMMH-i 3332
AS-Gibbs 100000
Table D.2: Comparison of the number of output samples when performing 200000 likelihood
evaluations in different MCMC methods; this is the equivalent of Table 5.1, adapted for 200000
evaluations. ∗ For AS-MH the figure indicates the number of samples when one inactive sample
is used per active variable, as in formula (4.23). If instead all inactive particles are used, as in
formula (4.24), the figure must be multiplied by the number of inactive variables used (10 in
this case).
Bibliography
Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, and Peter
Walter. Molecular Biology of the Cell. Garland Science, 4 edition, 2002.
M Amaya, N Linde, and E Laloy. Adaptive sequential Monte Carlo for posterior inference
and model selection among complex geological priors. Geophysical Journal International,
226(2):1220–1238, 04 2021. ISSN 0956-540X. doi: 10.1093/gji/ggab170. URL https:
//doi.org/10.1093/gji/ggab170.
Macarena Amaya, Niklas Linde, and Eric Laloy. Hydrogeological multiple-point statistics inversion by adaptive sequential Monte Carlo. Advances in Water Resources, 166:104252, 2022. ISSN 0309-1708. doi: https://doi.org/10.1016/j.advwatres.2022.104252. URL https://www.sciencedirect.com/science/article/pii/S0309170822001221.
C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2), 2009. doi: 10.1214/07-AOS574.
Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010. doi: 10.1111/j.1467-9868.2009.00736.x.
A. Beskos, D. Crisan, and A. Jasra. On the stability of sequential Monte Carlo methods in high dimensions. Ann. Appl. Probab., 24(4):1396–1445, 2014. doi: 10.1214/13-AAP951.
Alexandros Beskos, Dan Crisan, Ajay Jasra, and Nick Whiteley. Error bounds and normalizing constants for sequential Monte Carlo in high dimensions. 2011.
R. Bouckaert and P. Lockhart. Capturing heterotachy through multi-gamma site models.
bioRxiv, 2015. doi: 10.1101/018101.
R. Caflisch. Monte Carlo and quasi-Monte Carlo methods. Acta Numerica, 7:1–49, 1998. doi: 10.1017/S0962492900002804.
N. Chopin. A sequential particle filter method for static models. Biometrika, 89(3):539–551,
2002.
Nicolas Chopin, Pierre E. Jacob, and Omiros Papaspiliopoulos. SMC²: an efficient algorithm for sequential analysis of state-space models, 2012.
P. G. Constantine, C. Kent, and T. Bui-Thanh. Accelerating Markov chain Monte Carlo with active subspaces. SIAM Journal on Scientific Computing, 2016. doi: 10.1137/15m1042127.
P. Del Moral and A. Doucet. Sequential Monte Carlo samplers. 2003. Condensed Matter (cond-mat) CUED-F-INFENG-443, arXiv:cond-mat/0212648.
Alexei J. Drummond and Remco R. Bouckaert. Bayesian Evolutionary Analysis with BEAST.
Cambridge University Press, Cambridge, 2015.
Alexei J Drummond and Andrew Rambaut. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology, 7(1):214, 2007.
Alexei J Drummond, Marc A Suchard, Dong Xie, and Andrew Rambaut. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Molecular Biology and Evolution, 29(8):1969–1973, 2012.
V. Elvira, L. Martino, and C. P. Robert. Rethinking the effective sample size. 2018. arXiv:
1809.04129.
Andrew Gelman, Walter R. Gilks, and Gareth O. Roberts. Weak convergence and optimal scaling of random walk Metropolis algorithms. Annals of Applied Probability, 7(1):110–120, 1997.
Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B.
Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 3 edition, 2013. doi: 10.1201/
b16018.
Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.
Edward I. George and Robert E. McCulloch. Approaches for Bayesian variable selection. Statistica Sinica, 7(2):339–373, 1997.
W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.
J. Hein, M. Schierup, and C. Wiuf. Gene genealogies, variation and evolution: a primer in
coalescent theory. 2004.
Joseph Heled and Alexei J. Drummond. Bayesian inference of population size history from
multiple loci. BMC Evolutionary Biology, 8:289, Oct 2008. ISSN 1471-2148. doi: 10.1186/
1471-2148-8-289. URL https://pubmed.ncbi.nlm.nih.gov/18947398.
Zheng Hou, Xiaoya Ma, Xuan Shi, Xi Li, Lingxiao Yang, Shuhai Xiao, Olivier De Clerck, Frederik Leliaert, and Bojian Zhong. Phylotranscriptomic insights into a Mesoproterozoic–Neoproterozoic origin and early radiation of green seaweeds (Ulvophyceae). Nature Communications, 13(1):1610, 2022. ISSN 2041-1723. doi: 10.1038/s41467-022-29282-9. URL https://doi.org/10.1038/s41467-022-29282-9.
Hemant Ishwaran and J Sunil Rao. Spike and slab variable selection: Frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773, 2005.
Adam M. Johansen and L. Evers. Monte Carlo methods lecture notes. 2007. University of Bristol.
Thomas H Jukes and Charles R Cantor. Evolution of protein molecules. Mammalian Protein Metabolism, 1969.
J. Kingman. The coalescent. Stochastic Processes and their Applications, 13(3):235–248, 1982.
A. Kong, J. S. Liu, and W. H. Wong. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89:278–288, 1994.
N. Metropolis and S. Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(247), 1949.
Toby J. Mitchell and John J. Beauchamp. Bayesian variable selection in linear regression.
Journal of the American Statistical Association, 83(404):1023–1032, 1988.
H. N. Najm. Statistical inverse problems and Bayesian inference. 2018. Sandia National Laboratories, Livermore, CA, USA. UQ Lecture Series at the American University of Beirut, Beirut, Lebanon, April 23-27, 2018. URL: https://www.osti.gov/servlets/purl/1508912.
A. Owen. Monte Carlo theory, methods and examples. 2013. Available at: http://statweb.stanford.edu/owen/mc/.
Plotly Technologies Inc. Collaborative data science, 2015. URL https://plot.ly.
Andrew Rambaut and Nicholas C Grassly. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Computer Applications in the Biosciences: CABIOS, 13(3):235–238, 1997.
Douglas Reynolds. Gaussian Mixture Models, pages 659–663. Springer US, Boston, MA, 2009. ISBN 978-0-387-73003-5. doi: 10.1007/978-0-387-73003-5_196. URL https://doi.org/10.1007/978-0-387-73003-5_196.
Håvard Rue, Sara Martino, and Nicolas Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009.
Simon Tavaré. Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences, 17:57–86, 1986.
Luke Tierney. Markov chains for exploring posterior distributions. Annals of Statistics, 22
(4):1701–1728, 1994.
Luke Tierney and Joseph B Kadane. Accurate approximations for posterior moments and
marginal densities. Journal of the American Statistical Association, 81(393), 1986.
Dootika Vats, James M Flegal, and Galin L Jones. Multivariate output analysis for Markov
chain Monte Carlo. Biometrika, 106(2):321–337, 04 2019. ISSN 0006-3444. doi: 10.1093/
biomet/asz002. URL https://doi.org/10.1093/biomet/asz002.
Liangliang Wang, Shijia Wang, and Alexandre Bouchard-Côté. An Annealed Sequential Monte Carlo Method for Bayesian Phylogenetics. Systematic Biology, 69(1):155–183, 06 2019. ISSN 1063-5157. doi: 10.1093/sysbio/syz028. URL https://doi.org/10.1093/sysbio/syz028.
Z. Yang. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol., 39:306–314, 1994.
Yan Zhou, Adam M Johansen, and John A D Aston. Towards automatic model comparison: An adaptive sequential Monte Carlo approach. 2013. doi: 10.48550/ARXIV.1303.3123.